MarkTechPost@AI · February 25
Open-Reasoner-Zero: An Open-source Implementation of Large-Scale Reasoning-Oriented Reinforcement Learning Training

Open-Reasoner-Zero (ORZ) is an open-source project from researchers at StepFun and Tsinghua University for large-scale, reasoning-oriented reinforcement learning (RL) training of language models. The project explores a new scaling phenomenon by running large-scale RL training directly on base models. ORZ uses Qwen2.5-{7B, 32B} as base models with an optimized PPO algorithm, focusing on STEM, math, and diverse reasoning tasks. Experiments show significant performance gains across multiple metrics, especially in training efficiency, even surpassing DeepSeek's R1-Zero model on some benchmarks.

🚀 Open-Reasoner-Zero (ORZ) is an open-source project for large-scale, reasoning-oriented RL training of language models, making advanced RL training techniques accessible to the broader research community.

🧮 ORZ strengthens a range of reasoning skills, including arithmetic, logic, coding, and common-sense reasoning, and addresses key challenges in training stability, response-length optimization, and benchmark performance through a comprehensive training strategy.

📊 Experimental results show that Open-Reasoner-Zero performs strongly across multiple evaluation metrics, particularly in the 32B configuration. Compared with DeepSeek-R1-Zero-Qwen2.5-32B, it achieves superior results on the GPQA DIAMOND benchmark with only 1/30 of the training steps, demonstrating remarkable training efficiency.

💡 The research shows that vanilla PPO combined with GAE and rule-based reward functions achieves results competitive with more complex systems, without KL regularization, suggesting that complex architectural modifications may not be necessary for strong reasoning capabilities.

Large-scale reinforcement learning (RL) training of language models on reasoning tasks has become a promising technique for mastering complex problem-solving skills. Methods such as OpenAI's o1 and DeepSeek's R1-Zero have demonstrated a remarkable training-time scaling phenomenon: both models' benchmark performance and response length increase consistently and steadily, with no sign of saturation, as training computation scales up. Inspired by these advances, the researchers behind this paper explore this new scaling phenomenon by conducting large-scale RL training directly on base models, an approach they refer to as Reasoner-Zero training.

Researchers from StepFun and Tsinghua University have proposed Open-Reasoner-Zero (ORZ), an open-source implementation of large-scale reasoning-oriented RL training for language models. It represents a significant step toward making advanced RL training techniques accessible to the broader research community. ORZ enhances diverse reasoning skills under verifiable rewards, including arithmetic, logic, coding, and common-sense reasoning tasks. It addresses critical challenges in training stability, response length optimization, and benchmark performance improvements through a comprehensive training strategy. Unlike previous approaches that provided limited implementation details, ORZ offers detailed insights into its methodology and best practices.
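
To make the idea of verifiable, rule-based rewards concrete, below is a minimal sketch in Python of the kind of binary reward such training relies on. The \boxed{...} answer convention, the regex, and the normalization are illustrative assumptions for this sketch, not the project's actual verifier.

```python
# Minimal sketch of a rule-based, verifiable reward (assumption: the model is asked
# to place its final answer in \boxed{...}, a common convention for math tasks).
import re


def extract_final_answer(response: str) -> str | None:
    """Pull the content of the last \\boxed{...} span, if any."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", response)
    return matches[-1].strip() if matches else None


def rule_based_reward(response: str, reference_answer: str) -> float:
    """Return 1.0 if the extracted answer matches the reference, else 0.0."""
    predicted = extract_final_answer(response)
    if predicted is None:
        return 0.0
    # Simple normalization; a real verifier would handle equivalent math forms.
    normalize = lambda s: s.replace(" ", "").lower()
    return 1.0 if normalize(predicted) == normalize(reference_answer) else 0.0


if __name__ == "__main__":
    sample = "Reasoning steps... Therefore the answer is \\boxed{42}."
    print(rule_based_reward(sample, "42"))  # 1.0
```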

The ORZ framework uses Qwen2.5-{7B, 32B} as base models and implements direct large-scale RL training without preliminary fine-tuning steps. The system leverages a scaled-up version of the standard PPO algorithm, optimized specifically for reasoning-oriented tasks. The training dataset consists of carefully curated question-answer pairs focusing on STEM, math, and diverse reasoning tasks. The architecture incorporates a specialized prompt template designed to enhance inference computation capabilities. The implementation is built on OpenRLHF, featuring significant improvements including a flexible trainer, GPU collocation generation, and advanced offload-backload support mechanisms for efficient large-scale training.
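
As a concrete reference for the PPO ingredients mentioned above, the sketch below implements plain Generalized Advantage Estimation (GAE) for a single trajectory. The function name, tensor shapes, and default hyperparameters are assumptions for illustration and are not taken from ORZ's OpenRLHF-based trainer.

```python
# Minimal GAE sketch: computes per-token advantages and value targets for one
# trajectory. Shapes and defaults are illustrative assumptions.
import torch


def compute_gae(rewards: torch.Tensor,
                values: torch.Tensor,
                gamma: float = 1.0,
                lam: float = 1.0) -> tuple[torch.Tensor, torch.Tensor]:
    """rewards: (T,) per-token rewards (often zero except at the final token).
    values:  (T + 1,) value estimates, including a bootstrap value for the last state.
    """
    T = rewards.shape[0]
    advantages = torch.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        # TD residual at step t, then exponentially weighted accumulation.
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    returns = advantages + values[:-1]  # targets for the value function
    return advantages, returns


if __name__ == "__main__":
    r = torch.tensor([0.0, 0.0, 1.0])        # sparse terminal reward
    v = torch.tensor([0.2, 0.4, 0.7, 0.0])   # value estimates with bootstrap
    adv, ret = compute_gae(r, v)
    print(adv, ret)
```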

The training results demonstrate significant performance improvements across multiple metrics for both the 7B and 32B variants of Open-Reasoner-Zero. Training curves reveal consistent gains in reward metrics and response lengths, with a notable "step moment" phenomenon indicating sudden improvements in reasoning capabilities. In the response-length scale-up comparison with DeepSeek-R1-Zero, the Open-Reasoner-Zero-32B model reaches response lengths comparable to DeepSeek-R1-Zero (671B MoE) with only 1/5.8 of the training steps. This efficiency validates the effectiveness of the minimalist approach to large-scale RL training.

The main experimental results show that Open-Reasoner-Zero performs exceptionally well across multiple evaluation metrics, particularly in the 32B configuration. It achieves superior results compared to DeepSeek-R1-Zero-Qwen2.5-32B on the GPQA DIAMOND benchmark while requiring only 1/30 of the training steps, showcasing remarkable training efficiency. Moreover, the 7B variant exhibits interesting learning dynamics, with steady accuracy improvements and dramatic response length growth patterns. A distinctive “step moment” phenomenon has been observed during evaluation, characterized by sudden increases in both reward and response length, particularly evident in GPQA DIAMOND and AIME2024 benchmarks.

In this paper, the researchers introduce Open-Reasoner-Zero, a significant milestone in democratizing large-scale reasoning-oriented RL training for language models. The research shows that a simplified approach using vanilla PPO with GAE and rule-based reward functions can achieve competitive results compared to more complex systems. The successful implementation without KL regularization indicates that complex architectural modifications may not be necessary for achieving strong reasoning capabilities. By open-sourcing the complete training pipeline and sharing detailed insights, this work establishes a foundation for future research in scaling language model reasoning abilities, suggesting that this is just the beginning of a new scaling trend in AI development.
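
For readers who want to see what "vanilla PPO without KL regularization" amounts to, here is a minimal sketch of the clipped PPO policy loss with no KL-penalty term. The variable names and clipping coefficient are illustrative assumptions rather than the released training code.

```python
# Minimal sketch of the clipped PPO policy loss without a KL penalty, matching the
# simplified recipe described above. Names and clip_eps are illustrative assumptions.
import torch


def ppo_policy_loss(logprobs: torch.Tensor,
                    old_logprobs: torch.Tensor,
                    advantages: torch.Tensor,
                    clip_eps: float = 0.2) -> torch.Tensor:
    """Clipped PPO surrogate over a batch of token-level log-probabilities."""
    ratio = torch.exp(logprobs - old_logprobs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # No KL-regularization term is added here.
    return -torch.mean(torch.min(unclipped, clipped))


if __name__ == "__main__":
    lp = torch.tensor([-1.0, -0.5, -2.0])
    old_lp = torch.tensor([-1.2, -0.6, -1.8])
    adv = torch.tensor([0.5, -0.3, 1.0])
    print(ppo_policy_loss(lp, old_lp, adv))
```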


Check out the Paper and GitHub Page. All credit for this research goes to the researchers of this project.



