MiroMind-M1: Advancing Open-Source Mathematical Reasoning via Context-Aware Multi-Stage Reinforcement Learning

 

The MiroMind-M1 series marks significant progress in mathematical reasoning. Built on Qwen-2.5, it offers a fully open-source solution covering datasets, models, training code, and evaluation scripts. The project follows a two-stage training protocol: supervised fine-tuning (SFT) on 719K math problems, followed by reinforcement learning with verifiable rewards (RLVR) on 62K challenging problems. MiroMind-M1 stands out for its data transparency and quality: rigorous cleaning and deduplication keep the training data uncontaminated and effective. Its novel CAMPO algorithm tackles the training instability and token inefficiency common in reinforcement learning, letting the model reason more concisely and efficiently while maintaining high accuracy. The release of MiroMind-M1 not only sets a new bar for open-source models but also lays a solid foundation for reproducible AI research.

✨ **Open-source mathematical reasoning**: The MiroMind-M1 series, built on Qwen-2.5, provides a fully open-source mathematical reasoning solution covering datasets, models, training code, and evaluation scripts, aiming to strengthen AI's multi-step logical reasoning, especially on mathematical problems, and to promote transparency and reproducibility in research.

🚀 **Two-stage training strategy**: The model combines supervised fine-tuning (SFT) with reinforcement learning with verifiable rewards (RLVR). The SFT stage uses 719K verified mathematical problems to strengthen step-by-step reasoning, while the RLVR stage uses 62K hard problems and reward signals from an external verifier to further improve accuracy and efficiency.

📊 **Data transparency and quality assurance**: The MiroMind-M1 project places great weight on the quality and transparency of its training data. The SFT corpus pairs verified solutions with detailed multi-step reasoning, applies strict deduplication and contamination detection via N-gram overlap filtering to keep the data clean, and prioritizes samples with long reasoning traces to boost model performance.

💡 **CAMPO algorithm innovation**: To address training instability and poor token efficiency in reinforcement learning, MiroMind-M1 introduces CAMPO (Context-Aware Multi-Stage Policy Optimization). Through multi-stage training and a dynamic repetition penalty, the algorithm stabilizes the RL process and yields shorter, more relevant solutions, accelerating inference and reducing cost without sacrificing accuracy.

🏆 **Leading benchmark performance**: MiroMind-M1's open models achieve performance on par with, and in some cases better than, proprietary models on multiple mathematical reasoning benchmarks. On AIME24, AIME25, and MATH500, both MiroMind-RL-7B and MiroMind-RL-32B perform strongly, and their solutions are notably more token-efficient and concise, demonstrating real strength in advanced mathematical reasoning.

Large language models (LLMs) have recently demonstrated remarkable progress in multi-step reasoning, establishing mathematical problem-solving as a rigorous benchmark for assessing advanced capabilities. While proprietary models like GPT-4o and Claude Sonnet 4 lead in performance, their closed-source nature impedes transparency and reproducibility. Addressing these gaps, MiroMind AI released the MiroMind-M1 series, a fully open-source pipeline spanning datasets, models, training code, and evaluation scripts, that sets new standards for openness and state-of-the-art mathematical reasoning within the Qwen-2.5 model ecosystem.

Architectural Foundation and Motivation

MiroMind-M1 is built on the robust Qwen-2.5 backbone, with enhancements geared explicitly for mathematical reasoning. The team adopts a two-stage training protocol:

1. **Supervised Fine-Tuning (SFT):** The model is fine-tuned on 719K carefully curated and verified mathematical problems, equipping it with strong step-by-step reasoning abilities.
2. **Reinforcement Learning with Verifiable Rewards (RLVR):** Next, the model undergoes RL on 62K challenging and rigorously verifiable math problems, leveraging reward signals from a robust external verifier.

This approach is motivated both by the need for strong mathematical logic and by the lessons learned from leading RLMs: imitating chain-of-thought exemplars improves general reasoning, while reinforcement learning, guided by precise rewards, further refines accuracy and efficiency.
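
To make the RLVR reward signal concrete, here is a minimal sketch of a verifiable reward: extract the final boxed answer, canonicalize it, and compare it against the reference. The extraction and normalization rules below are illustrative assumptions, not the project's actual verifier.

```python
import re

def extract_final_answer(completion: str) -> str | None:
    """Pull the last \\boxed{...} expression out of a model completion."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", completion)
    return matches[-1] if matches else None

def normalize(ans: str) -> str:
    """Crude canonicalization: lowercase, drop spaces and a trailing period (assumed rules)."""
    return ans.strip().lower().replace(" ", "").rstrip(".")

def verifiable_reward(completion: str, reference: str) -> float:
    """Binary RLVR-style reward: 1.0 iff the final answer matches the reference."""
    pred = extract_final_answer(completion)
    if pred is None:
        return 0.0
    return 1.0 if normalize(pred) == normalize(reference) else 0.0

print(verifiable_reward(r"... so the answer is \boxed{42}.", "42"))  # 1.0
```

Because the reward comes from an exact, externally checkable comparison rather than a learned reward model, it is hard to game, which is the core appeal of RLVR for math.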

Data Transparency and Quality

A hallmark of the MiroMind-M1 project is the full openness and cleanliness of its training data: the SFT corpus pairs verified solutions with detailed multi-step reasoning, applies strict deduplication and N-gram overlap filtering to detect benchmark contamination, and favors samples with long reasoning trajectories.

The resulting dataset provides 719K verified training traces—significantly advancing open reproducible research over prior efforts.
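
As a concrete picture of the N-gram overlap filtering used for deduplication and contamination checks, a minimal sketch follows; the n-gram window and the any-overlap policy are illustrative assumptions, not the project's exact settings.

```python
def ngrams(text: str, n: int = 13) -> set:
    """Word-level n-grams of a lowercased, whitespace-split text."""
    toks = text.lower().split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def is_contaminated(sample: str, benchmark: list, n: int = 13) -> bool:
    """Flag a training sample that shares any n-gram with a benchmark item."""
    grams = ngrams(sample, n)
    return any(grams & ngrams(item, n) for item in benchmark)

# Illustrative usage: drop flagged samples before SFT
train = ["Find x such that 2x + 3 = 11. Step 1: subtract 3 ..."]
bench = ["Find x such that 2x + 3 = 11."]
clean = [s for s in train if not is_contaminated(s, bench, n=5)]
print(len(clean))  # 0: the sample overlaps the benchmark item, so it is removed
```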

Supervised Fine-Tuning: Empirical Excellence

For SFT, MiroMind-SFT-7B is initialized from Qwen2.5-Math-7B and trained with a large context window (max 32,768 tokens) and a no-packing strategy to avoid cross-sample attention contamination. Its performance on key math benchmarks outpaces peer open models:

| Model | AIME24 | AIME25 | MATH500 |
|---|---|---|---|
| DeepSeek-R1-Distill | 55.5 | 40.4 | 92.8 |
| MiMo-7B-SFT | 58.7 | 44.3 | 93.0 |
| MiroMind-SFT-7B | 60.4 | 45.0 | 94.6 |

These results validate the efficacy of the data curation and training design: richer, deeper samples and no-packing lead to consistently superior performance.
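
To illustrate the no-packing strategy mentioned above, here is a minimal collator sketch: each sample occupies its own padded row instead of being concatenated with neighbors, so attention can never leak across sample boundaries. The 32,768-token cap mirrors the report; the collator itself and the pad id are illustrative assumptions.

```python
MAX_LEN = 32_768  # context window used for MiroMind-SFT-7B
PAD_ID = 0        # hypothetical pad token id

def collate_no_packing(batch: list) -> tuple:
    """One sample per row, padded to the batch max: no cross-sample attention.

    A packed collator would instead concatenate samples into one long row,
    letting tokens of one problem attend to another unless masks are patched.
    """
    width = min(max(len(seq) for seq in batch), MAX_LEN)
    input_ids, attention_mask = [], []
    for seq in batch:
        seq = seq[:width]
        pad = width - len(seq)
        input_ids.append(seq + [PAD_ID] * pad)
        attention_mask.append([1] * len(seq) + [0] * pad)
    return input_ids, attention_mask

ids, mask = collate_no_packing([[5, 6, 7], [8, 9]])
print(ids)   # [[5, 6, 7], [8, 9, 0]]
print(mask)  # [[1, 1, 1], [1, 1, 0]]
```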

CAMPO: Context-Aware Multi-Stage Policy Optimization

A key innovation in MiroMind-M1's RLVR phase is the CAMPO algorithm. CAMPO addresses two critical RL challenges, training instability and token inefficiency, by combining context-aware multi-stage training with a dynamic repetition penalty.

CAMPO not only stabilizes RL dynamics but also results in models that solve problems with fewer, more relevant tokens—accelerating inference and reducing costs without sacrificing accuracy.
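
The paper's exact objective is not reproduced here, but a minimal sketch of the two ingredients named above, staged context limits plus a repetition penalty folded into the scalar reward, can make the idea concrete; the stage schedule, n-gram measure, and penalty weight are all illustrative assumptions.

```python
# Illustrative CAMPO-style ingredients: staged context limits and a
# dynamic repetition penalty subtracted from the verifier reward.
STAGES = [(0, 8_192), (1, 16_384), (2, 32_768)]  # hypothetical (stage, max_tokens)

def repetition_rate(tokens: list, n: int = 4) -> float:
    """Fraction of n-grams that are duplicates; 0.0 means no repetition."""
    grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not grams:
        return 0.0
    return 1.0 - len(set(grams)) / len(grams)

def campo_style_reward(correct: bool, tokens: list,
                       max_tokens: int, penalty_weight: float = 0.5) -> float:
    """Verifier reward minus a repetition penalty (a sketch, not the paper's formula)."""
    base = 1.0 if correct else 0.0
    if len(tokens) > max_tokens:  # over-length generations earn no credit
        base = 0.0
    return base - penalty_weight * repetition_rate(tokens)

# Stage 1 example: a correct but repetitive output is down-weighted
print(campo_style_reward(True, [1, 2, 3, 1, 2, 3, 1, 2, 3], max_tokens=16_384))  # 0.75
```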

Benchmark Performance: State-of-the-Art Efficiency

MiroMind’s open models achieve highly competitive or state-of-the-art results for open Qwen-2.5-based math models (7B/32B parameters):

| Model | AIME24 | AIME25 | MATH500 |
|---|---|---|---|
| DeepSeek-R1-7B | 55.5 | 39.2 | |
| MiMo-7B-RL | 68.2 | 55.4 | 95.8 |
| Skywork-OR1-7B | 72.2 | 54.6 | |
| MiroMind-RL-7B | 73.4 | 57.8 | 96.7 |
| Skywork-OR1-32B | 77.1 | 68.2 | 97.5 |
| MiroMind-RL-32B | 77.5 | 65.6 | 96.4 |

Notably, MiroMind-M1-RL models not only match or exceed peer accuracy, but do so with greater token efficiency—the 32B model produces shorter, more concise solutions without loss of correctness, thanks to CAMPO’s training.

Full Stack and Reproducibility

Every component of the MiroMind-M1 stack is openly released: the curated SFT and RLVR datasets, model weights, training code, and evaluation scripts.

Researchers can replicate, audit, and extend MiroMind-M1 from raw data to trained models, advancing reproducibility and accelerating new open research.

Conclusion

MiroMind-M1 demonstrates that with careful data curation, innovative RL algorithms (CAMPO), and radical transparency, open-source language models can rival proprietary systems in advanced mathematical reasoning. This project sets a new bar for reproducibility and collaborative advancement in reasoning LLMs, providing both a high-quality resource and a robust platform for future innovation.


Check out the Paper, GitHub Page and Model on Hugging Face. All credit for this research goes to the researchers of this project.

