MarkTechPost@AI March 20, 13:11
ByteDance Research Releases DAPO: A Fully Open-Sourced LLM Reinforcement Learning System at Scale

ByteDance, together with Tsinghua University and the University of Hong Kong, has released DAPO, a fully open-sourced reinforcement learning system for enhancing the reasoning abilities of large language models. DAPO aims to close the reproducibility gap by opening up all algorithmic details, training procedures, and datasets. Built on the verl framework, it includes the training code and a mathematical-reasoning dataset called DAPO-Math-17K. DAPO addresses entropy collapse with Clip-Higher, improves training efficiency with Dynamic Sampling, refines the loss computation with a Token-level Policy Gradient Loss, and uses Overlong Reward Shaping to guide the model toward concise, efficient reasoning. Experiments show that DAPO performs strongly on the AIME 2024 benchmark, surpassing prior methods.

🚀 DAPO tackles entropy collapse with the "Clip-Higher" technique, which manages the clipping ratio in policy updates to encourage greater diversity in model outputs.

🎯 "Dynamic Sampling" improves training efficiency by dynamically filtering samples, ensuring a more consistent gradient signal.

💡 The "Token-level Policy Gradient Loss" offers a refined loss computation that emphasizes token-level rather than sample-level adjustments, better accommodating reasoning sequences of varying lengths.

⏱️ "Overlong Reward Shaping" applies a controlled penalty to excessively long responses, guiding the model toward concise and efficient reasoning.

📊 On the AIME 2024 benchmark, DAPO reaches 50 points with the Qwen2.5-32B base model, surpassing the 47 points of DeepSeek-R1-Zero-Qwen-32B while using half as many training steps.

Reinforcement learning (RL) has become central to advancing Large Language Models (LLMs), empowering them with improved reasoning capabilities necessary for complex tasks. However, the research community faces considerable challenges in reproducing state-of-the-art RL techniques due to incomplete disclosure of key training details by major industry players. This opacity has limited the progress of broader scientific efforts and collaborative research.

Researchers from ByteDance, Tsinghua University, and the University of Hong Kong recently introduced DAPO (Decoupled Clip and Dynamic sAmpling Policy Optimization), an open-source large-scale reinforcement learning system designed to enhance the reasoning abilities of Large Language Models. The DAPO system seeks to bridge the gap in reproducibility by openly sharing all algorithmic details, training procedures, and datasets. Built upon the verl framework, DAPO includes the training code and a thoroughly prepared dataset called DAPO-Math-17K, specifically designed for mathematical reasoning tasks.

DAPO’s technical foundation includes four core innovations aimed at resolving key challenges in reinforcement learning. The first, “Clip-Higher,” addresses the issue of entropy collapse, a situation where models prematurely settle into limited exploration patterns. By carefully managing the clipping ratio in policy updates, this technique encourages greater diversity in model outputs. “Dynamic Sampling” counters inefficiencies in training by dynamically filtering samples based on their usefulness, thus ensuring a more consistent gradient signal. The “Token-level Policy Gradient Loss” offers a refined loss calculation method, emphasizing token-level rather than sample-level adjustments to better accommodate varying lengths of reasoning sequences. Lastly, “Overlong Reward Shaping” introduces a controlled penalty for excessively long responses, gently guiding models toward concise and efficient reasoning.
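
To make the four techniques more concrete, below is a minimal PyTorch-style sketch reconstructed from the description above. It is an illustrative approximation, not the authors' released verl implementation; the function names, the clipping bounds `eps_low`/`eps_high`, and the length thresholds are assumptions made for the sake of the example.

```python
import torch

def token_level_dapo_loss(logprobs, old_logprobs, advantages, mask,
                          eps_low=0.2, eps_high=0.28):
    """Clipped policy-gradient objective with an asymmetric ("Clip-Higher")
    upper bound and token-level averaging across the whole batch."""
    ratio = torch.exp(logprobs - old_logprobs)                    # [B, T]
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps_low, 1.0 + eps_high) * advantages
    per_token = -torch.minimum(unclipped, clipped) * mask
    # Token-level loss: average over every valid token in the batch rather
    # than per sample first, so long reasoning traces are not down-weighted.
    return per_token.sum() / mask.sum().clamp(min=1)

def keep_prompt(per_sample_rewards):
    """Dynamic Sampling: drop prompts whose sampled responses are all correct
    or all incorrect, since they yield (near-)zero advantage and no gradient."""
    acc = per_sample_rewards.float().mean().item()
    return 0.0 < acc < 1.0

def overlong_penalty(length, max_len=20480, soft_buffer=4096):
    """Overlong Reward Shaping: no penalty below (max_len - soft_buffer),
    then a penalty growing linearly to -1 at max_len."""
    start = max_len - soft_buffer
    if length <= start:
        return 0.0
    return -min(1.0, (length - start) / soft_buffer)
```

In such a setup, `keep_prompt` would be applied when assembling a batch of sampled responses, and `overlong_penalty` would be added to the rule-based reward before advantages are computed; the exact wiring in the released system may differ.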

In practical experimentation, DAPO has demonstrated significant improvements. Evaluations on the American Invitational Mathematics Examination (AIME) 2024 benchmark show that DAPO-trained models achieved a score of 50 points using the Qwen2.5-32B base model, improving on previous methods such as DeepSeek-R1-Zero-Qwen-32B, which achieved 47 points. Notably, DAPO attained this improvement with approximately half the training steps, underscoring the efficiency of the proposed methods. A systematic analysis revealed incremental enhancements from each introduced technique, moving from a baseline of 30 points (using GRPO alone) up to 50 points with the full DAPO methodology.

Beyond quantitative results, DAPO’s training dynamics provided insights into the model’s evolving reasoning patterns. Initially, the models showed little reflective behavior, often proceeding linearly through tasks without reconsideration of previous steps. However, with ongoing training, the models progressively exhibited more reflective behaviors, demonstrating a form of iterative self-review. This shift highlights the capability of reinforcement learning not only to enhance existing reasoning pathways but also to cultivate entirely new cognitive strategies over time.

In conclusion, the open-sourcing of DAPO represents a meaningful contribution to the reinforcement learning community, removing barriers previously created by inaccessible methodologies. By clearly documenting and providing comprehensive access to the system’s techniques, dataset, and code, this collaborative initiative invites further research and innovation. The combined efforts of ByteDance, Tsinghua University, and the University of Hong Kong showcase the potential of transparent and cooperative research to advance the collective understanding and practical capabilities of large-scale reinforcement learning systems.


Check out the Paper and Project Page. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 80k+ ML SubReddit.

The post ByteDance Research Releases DAPO: A Fully Open-Sourced LLM Reinforcement Learning System at Scale appeared first on MarkTechPost.


Related tags

DAPO, Reinforcement Learning, LLM, Open Source, Mathematical Reasoning