MiroMind-M1: Advancing Open-Source Mathematical Reasoning via Context-Aware Multi-Stage Reinforcement Learning

 

The MiroMind-M1 series marks significant progress in mathematical reasoning. Built on Qwen-2.5, it offers a fully open-source solution covering datasets, models, training code, and evaluation scripts. The project follows a two-stage training protocol: supervised fine-tuning (SFT) on 719K math problems, followed by reinforcement learning with verifiable rewards (RLVR) on 62K challenging problems. MiroMind-M1 stands out for its data transparency and quality: rigorous cleaning and deduplication keep the training data uncontaminated and effective. Its novel CAMPO algorithm tackles the training instability and token inefficiency common in reinforcement learning, letting the model reason more concisely and efficiently while maintaining high accuracy. The release of MiroMind-M1 not only sets a new bar for open-source models but also lays a solid foundation for reproducible AI research.

✨ **Open-source mathematical reasoning**: The MiroMind-M1 series, built on Qwen-2.5, provides a fully open-source mathematical reasoning solution covering datasets, models, training code, and evaluation scripts, aiming to strengthen AI's multi-step logical reasoning, especially on mathematical problems, and to promote transparency and reproducibility in research.

🚀 **Two-stage training strategy**: The model combines supervised fine-tuning (SFT) with reinforcement learning with verifiable rewards (RLVR). The SFT stage uses 719K verified mathematical problems to strengthen step-by-step reasoning, while the RLVR stage uses 62K hard problems and reward signals from an external verifier to further improve accuracy and efficiency.

📊 **Data transparency and quality assurance**: The MiroMind-M1 project places great weight on the quality and transparency of its training data. The SFT corpus pairs verified solutions with detailed multi-step reasoning, applies strict deduplication and contamination detection via N-gram overlap filtering to keep the data clean, and prioritizes samples with long reasoning traces to boost model performance.

💡 **CAMPO algorithm innovation**: To address training instability and poor token efficiency in reinforcement learning, MiroMind-M1 introduces CAMPO (Context-Aware Multi-Stage Policy Optimization). Through multi-stage training and a dynamic repetition penalty, the algorithm stabilizes the RL process and yields shorter, more relevant solutions, accelerating inference and reducing cost without sacrificing accuracy.

🏆 **Leading benchmark performance**: MiroMind-M1's open models achieve performance on par with, and in some cases better than, proprietary models on multiple mathematical reasoning benchmarks. On AIME24, AIME25, and MATH500, both MiroMind-RL-7B and MiroMind-RL-32B perform strongly, and their solutions are notably more token-efficient and concise, demonstrating real strength in advanced mathematical reasoning.

Large language models (LLMs) have recently demonstrated remarkable progress in multi-step reasoning, establishing mathematical problem-solving as a rigorous benchmark for assessing advanced capabilities. While proprietary models like GPT-4o and Claude Sonnet 4 lead in performance, their closed-source nature impedes transparency and reproducibility. Addressing these gaps, MiroMind AI released the MiroMind-M1 series, a fully open-source pipeline spanning datasets, models, training code, and evaluation scripts, that sets new standards for openness and state-of-the-art mathematical reasoning within the Qwen-2.5 model ecosystem.

Architectural Foundation and Motivation

MiroMind-M1 is built on the robust Qwen-2.5 backbone, with enhancements geared explicitly for mathematical reasoning. The team adopts a two-stage training protocol:

1. **Supervised Fine-Tuning (SFT):** The model is fine-tuned on 719K carefully curated and verified mathematical problems, equipping it with strong step-by-step reasoning abilities.
2. **Reinforcement Learning with Verifiable Rewards (RLVR):** Next, the model undergoes RL on 62K challenging and rigorously verifiable math problems, leveraging reward signals from a robust external verifier.

This approach is motivated both by the need for strong mathematical logic and by the lessons learned from leading RLMs: imitating chain-of-thought exemplars improves general reasoning, while reinforcement learning, guided by precise rewards, further refines accuracy and efficiency.
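
To make the RLVR reward signal concrete, here is a minimal sketch of a verifiable reward: extract the final boxed answer, canonicalize it, and compare it against the reference. The extraction and normalization rules below are illustrative assumptions, not the project's actual verifier.

```python
import re

def extract_final_answer(completion: str) -> str | None:
    """Pull the last \\boxed{...} expression out of a model completion."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", completion)
    return matches[-1] if matches else None

def normalize(ans: str) -> str:
    """Crude canonicalization: lowercase, drop spaces and a trailing period (assumed rules)."""
    return ans.strip().lower().replace(" ", "").rstrip(".")

def verifiable_reward(completion: str, reference: str) -> float:
    """Binary RLVR-style reward: 1.0 iff the final answer matches the reference."""
    pred = extract_final_answer(completion)
    if pred is None:
        return 0.0
    return 1.0 if normalize(pred) == normalize(reference) else 0.0

print(verifiable_reward(r"... so the answer is \boxed{42}.", "42"))  # 1.0
```

Because the reward comes from an exact, externally checkable comparison rather than a learned reward model, it is hard to game, which is the core appeal of RLVR for math.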

Data Transparency and Quality

A hallmark of the MiroMind-M1 project is the full openness and cleanliness of its training data: the SFT corpus pairs verified solutions with detailed multi-step reasoning, applies strict deduplication and N-gram overlap filtering to detect benchmark contamination, and favors samples with long reasoning trajectories.

The resulting dataset provides 719K verified training traces—significantly advancing open reproducible research over prior efforts.
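
As a concrete picture of the N-gram overlap filtering used for deduplication and contamination checks, a minimal sketch follows; the n-gram window and the any-overlap policy are illustrative assumptions, not the project's exact settings.

```python
def ngrams(text: str, n: int = 13) -> set:
    """Word-level n-grams of a lowercased, whitespace-split text."""
    toks = text.lower().split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def is_contaminated(sample: str, benchmark: list, n: int = 13) -> bool:
    """Flag a training sample that shares any n-gram with a benchmark item."""
    grams = ngrams(sample, n)
    return any(grams & ngrams(item, n) for item in benchmark)

# Illustrative usage: drop flagged samples before SFT
train = ["Find x such that 2x + 3 = 11. Step 1: subtract 3 ..."]
bench = ["Find x such that 2x + 3 = 11."]
clean = [s for s in train if not is_contaminated(s, bench, n=5)]
print(len(clean))  # 0: the sample overlaps the benchmark item, so it is removed
```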

Supervised Fine-Tuning: Empirical Excellence

For SFT, MiroMind-SFT-7B is initialized from Qwen2.5-Math-7B and trained with a large context window (max 32,768 tokens) and a no-packing strategy to avoid cross-sample attention contamination. Its performance on key math benchmarks outpaces peer open models:

| Model | AIME24 | AIME25 | MATH500 |
|---|---|---|---|
| DeepSeek-R1-Distill | 55.5 | 40.4 | 92.8 |
| MiMo-7B-SFT | 58.7 | 44.3 | 93.0 |
| MiroMind-SFT-7B | 60.4 | 45.0 | 94.6 |

These results validate the efficacy of the data curation and training design: richer, deeper samples and no-packing lead to consistently superior performance.
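
To illustrate the no-packing strategy mentioned above, here is a minimal collator sketch: each sample occupies its own padded row instead of being concatenated with neighbors, so attention can never leak across sample boundaries. The 32,768-token cap mirrors the report; the collator itself and the pad id are illustrative assumptions.

```python
MAX_LEN = 32_768  # context window used for MiroMind-SFT-7B
PAD_ID = 0        # hypothetical pad token id

def collate_no_packing(batch: list) -> tuple:
    """One sample per row, padded to the batch max: no cross-sample attention.

    A packed collator would instead concatenate samples into one long row,
    letting tokens of one problem attend to another unless masks are patched.
    """
    width = min(max(len(seq) for seq in batch), MAX_LEN)
    input_ids, attention_mask = [], []
    for seq in batch:
        seq = seq[:width]
        pad = width - len(seq)
        input_ids.append(seq + [PAD_ID] * pad)
        attention_mask.append([1] * len(seq) + [0] * pad)
    return input_ids, attention_mask

ids, mask = collate_no_packing([[5, 6, 7], [8, 9]])
print(ids)   # [[5, 6, 7], [8, 9, 0]]
print(mask)  # [[1, 1, 1], [1, 1, 0]]
```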

CAMPO: Context-Aware Multi-Stage Policy Optimization

A key innovation in MiroMind-M1's RLVR phase is the CAMPO algorithm. CAMPO addresses two critical RL challenges, training instability and token inefficiency, by combining context-aware multi-stage training with a dynamic repetition penalty.

CAMPO not only stabilizes RL dynamics but also results in models that solve problems with fewer, more relevant tokens—accelerating inference and reducing costs without sacrificing accuracy.
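
The paper's exact objective is not reproduced here, but a minimal sketch of the two ingredients named above, staged context limits plus a repetition penalty folded into the scalar reward, can make the idea concrete; the stage schedule, n-gram measure, and penalty weight are all illustrative assumptions.

```python
# Illustrative CAMPO-style ingredients: staged context limits and a
# dynamic repetition penalty subtracted from the verifier reward.
STAGES = [(0, 8_192), (1, 16_384), (2, 32_768)]  # hypothetical (stage, max_tokens)

def repetition_rate(tokens: list, n: int = 4) -> float:
    """Fraction of n-grams that are duplicates; 0.0 means no repetition."""
    grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not grams:
        return 0.0
    return 1.0 - len(set(grams)) / len(grams)

def campo_style_reward(correct: bool, tokens: list,
                       max_tokens: int, penalty_weight: float = 0.5) -> float:
    """Verifier reward minus a repetition penalty (a sketch, not the paper's formula)."""
    base = 1.0 if correct else 0.0
    if len(tokens) > max_tokens:  # over-length generations earn no credit
        base = 0.0
    return base - penalty_weight * repetition_rate(tokens)

# Stage 1 example: a correct but repetitive output is down-weighted
print(campo_style_reward(True, [1, 2, 3, 1, 2, 3, 1, 2, 3], max_tokens=16_384))  # 0.75
```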

Benchmark Performance: State-of-the-Art Efficiency

MiroMind’s open models achieve highly competitive or state-of-the-art results for open Qwen-2.5-based math models (7B/32B parameters):

| Model | AIME24 | AIME25 | MATH500 |
|---|---|---|---|
| DeepSeek-R1-7B | 55.5 | 39.2 | |
| MiMo-7B-RL | 68.2 | 55.4 | 95.8 |
| Skywork-OR1-7B | 72.2 | 54.6 | |
| MiroMind-RL-7B | 73.4 | 57.8 | 96.7 |
| Skywork-OR1-32B | 77.1 | 68.2 | 97.5 |
| MiroMind-RL-32B | 77.5 | 65.6 | 96.4 |

Notably, MiroMind-M1-RL models not only match or exceed peer accuracy, but do so with greater token efficiency—the 32B model produces shorter, more concise solutions without loss of correctness, thanks to CAMPO’s training.

Full Stack and Reproducibility

Every component of the MiroMind-M1 stack is openly released: the curated SFT and RLVR datasets, model weights, training code, and evaluation scripts.

Researchers can replicate, audit, and extend MiroMind-M1 from raw data to trained models, advancing reproducibility and accelerating new open research.

Conclusion

MiroMind-M1 demonstrates that with careful data curation, innovative RL algorithms (CAMPO), and radical transparency, open-source language models can rival proprietary systems in advanced mathematical reasoning. This project sets a new bar for reproducibility and collaborative advancement in reasoning LLMs, providing both a high-quality resource and a robust platform for future innovation.


Check out the Paper, GitHub Page and Model on Hugging Face. All credit for this research goes to the researchers of this project.

