MarkTechPost@AI · July 3, 09:05
Shanghai Jiao Tong Researchers Propose OctoThinker for Reinforcement Learning-Scalable LLM Development

Researchers at Shanghai Jiao Tong University propose a new approach called OctoThinker that aims to improve the performance of large language models (LLMs) under reinforcement learning (RL). The study examines why different base models, in particular the Llama and Qwen families, behave so differently during RL. Using a two-stage mid-training strategy, the researchers transform Llama models into base models far better suited for RL. OctoThinker achieves strong results across multiple mathematical benchmarks, offering a new path toward building RL-ready foundation models.

💡 The study finds that base models differ significantly in RL performance, with models such as Llama and Qwen exhibiting distinct behavior patterns during RL training.

📈 The researchers propose a two-stage mid-training strategy called "Stable-then-Decay", which markedly improves Llama's compatibility with RL and yields the OctoThinker models.

🚀 OctoThinker models perform strongly on multiple mathematical benchmarks; the OctoThinker-Long variant shows strong RL capability and, at the 3B scale, matches the performance of Qwen2.5-3B.

🔬 The study shows that the mid-training strategy is critical to RL scalability, and that high-quality mathematical corpora and QA-style data, especially data with long CoT reasoning, further enhance RL results.

🎯 Future research directions include building higher-quality mathematical corpora, developing RL-friendly base models, and exploring new branches of the OctoThinker family, such as tool-integrated reasoning.

Introduction: Reinforcement Learning Progress through Chain-of-Thought Prompting

    LLMs have made impressive progress on complex reasoning tasks through chain-of-thought (CoT) prompting combined with large-scale reinforcement learning (RL). Models like DeepSeek-R1-Zero have demonstrated strong reasoning capabilities by applying RL directly to base models. Similarly, methods such as SimpleRL and Open-Reasoner-Zero show improvements in smaller models like the Qwen series. However, achieving success across different base model families remains a challenge. In particular, applying R1-Zero-style training to base models such as the Llama series proves difficult, raising a fundamental question about the underlying factors that lead different base models to behave inconsistently during reinforcement learning.

    Limitations of RL Scaling on Llama Models

      Large-scale RL has driven advances in models such as OpenAI’s o1 and o3 and DeepSeek’s R1 on competition-level mathematics problems, motivating the exploration of RL on smaller models with fewer than 100B parameters. However, these successes are largely limited to the Qwen model family, and replicating the results on families such as Llama has proven difficult. The lack of transparency in pre-training pipelines makes it hard to understand how pre-training influences RL scaling. This has prompted unconventional studies, which found that one-shot prompting improves reasoning in Qwen but offers little benefit in Llama. Efforts to curate high-quality mathematical pre-training corpora through projects like OpenWebMath, MathPile, InfiMM-Web-Math, and FineMath have made progress but remain limited in scale to under 100B tokens.

      Exploring Mid-Training with Stable-then-Decay Strategy

        Researchers from Shanghai Jiao Tong University investigate how mid-training strategies shape RL dynamics, focusing on Qwen and Llama. The study presents several insights: First, high-quality mathematical corpora such as MegaMath-Web-Pro boost both base model and RL outcomes. Second, using QA-style data, especially data with long CoT reasoning, further enhances RL results. Third, long CoT introduces verbosity and instability in RL training. Lastly, scaling up mid-training yields stronger downstream RL performance. The researchers introduce a two-stage mid-training strategy called Stable-then-Decay, where base models are first trained on 200B tokens, followed by 20B tokens across three CoT-focused branches, resulting in OctoThinker models that show strong RL compatibility.
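
As a rough illustration of how such a two-stage schedule could look in practice, the sketch below holds the learning rate constant over the 200B-token stable stage and then anneals it over the 20B-token decay stage. The token budgets come from the text; the peak learning rate, the floor, and the cosine decay shape are assumptions for illustration, not the authors' actual settings.

```python
# A minimal sketch of a Stable-then-Decay learning-rate schedule.
# The 200B-token stable stage and 20B-token decay stage follow the token
# budgets described above; the peak learning rate, the floor, and the cosine
# decay shape are illustrative assumptions, not the authors' settings.

import math

STABLE_TOKENS = 200e9   # stage 1: train on 200B tokens at a constant LR
DECAY_TOKENS = 20e9     # stage 2: 20B tokens on a CoT-focused data branch
PEAK_LR = 3e-4          # assumed peak learning rate
MIN_LR = 3e-5           # assumed learning-rate floor at the end of decay


def stable_then_decay_lr(tokens_seen: float) -> float:
    """Return the learning rate after `tokens_seen` training tokens."""
    if tokens_seen <= STABLE_TOKENS:
        return PEAK_LR  # stable stage: hold the learning rate constant
    # Decay stage: cosine-anneal from PEAK_LR down to MIN_LR over 20B tokens.
    progress = min((tokens_seen - STABLE_TOKENS) / DECAY_TOKENS, 1.0)
    return MIN_LR + 0.5 * (PEAK_LR - MIN_LR) * (1 + math.cos(math.pi * progress))


if __name__ == "__main__":
    for t in (50e9, 200e9, 210e9, 220e9):
        print(f"{t / 1e9:.0f}B tokens -> lr = {stable_then_decay_lr(t):.2e}")
```

Under this scheme, each of the three CoT-focused branches would reuse the same stable-stage checkpoint and differ only in the data mixture fed during the 20B-token decay stage.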

        RL Configuration and Benchmark Evaluation

          Researchers use the MATH8K dataset for RL training prompts. The configuration includes a global training batch size of 128, 16 rollout responses per query, and a PPO mini-batch size of 64, with experiments conducted on Llama-3.2-3B-Base and Qwen2.5-3B-Base models. For evaluation, few-shot prompting is used for base language models, and zero-shot for RL-tuned models across indicator tasks, including GSM8K, MATH500, OlympiadBench, and AMC23. During RL training, Qwen models exhibit increasing response lengths that remain reasonable throughout, whereas Llama displays abnormal behavior, with average response lengths escalating to 4,096 tokens. Evaluation further reveals that RL-tuned Qwen2.5-3B achieves improvements across benchmarks, while Llama-3.2-3B shows only marginal gains.
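
For reference, the hyperparameters reported above can be collected into a single configuration object. The numeric values below are those stated in the text; the field names and the dataclass itself are illustrative and do not reflect any particular RL trainer's configuration schema.

```python
# The RL setup described above, collected into one configuration object.
# Numeric values are those reported in the text; the field names and the
# dataclass itself are illustrative, not any specific trainer's schema.

from dataclasses import dataclass


@dataclass(frozen=True)
class RLConfig:
    prompt_dataset: str = "MATH8K"        # source of RL training prompts
    base_models: tuple = ("Llama-3.2-3B-Base", "Qwen2.5-3B-Base")
    global_batch_size: int = 128          # prompts per RL training step
    rollouts_per_query: int = 16          # sampled responses per prompt
    ppo_mini_batch_size: int = 64         # PPO update mini-batch size
    eval_benchmarks: tuple = ("GSM8K", "MATH500", "OlympiadBench", "AMC23")
    base_model_eval: str = "few-shot"     # prompting protocol for base models
    rl_model_eval: str = "zero-shot"      # prompting protocol for RL-tuned models


if __name__ == "__main__":
    print(RLConfig())
```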

          OctoThinker Outperforms Llama in RL Compatibility

            Each OctoThinker branch demonstrates a 10%–20% improvement over the original Llama base model and consistent gains over the stable-stage model across all sizes when evaluated on 13 mathematical benchmarks. The OctoThinker-Zero families reveal diverse thinking behaviors during RL scaling, with strong performance from the OctoThinker-Long variant. When comparing three 3B-scale base models during RL training, OctoThinker-Long-3B outperforms the original Llama-3.2-3B model and reaches performance parity with Qwen2.5-3B, a model known for strong reasoning capabilities and extensive pre-training. The hybrid and short branches show slightly lower performance, especially on challenging benchmarks.

            Conclusion and Future Work: Toward RL-Ready Foundation Models

              This paper investigates why base models such as Llama and Qwen exhibit divergent behaviors during RL for reasoning, showing that mid-training plays a major role in RL scalability. The two-stage mid-training strategy transforms Llama into a foundation model better suited for RL, resulting in the OctoThinker models. Future research directions include building higher-quality mathematical corpora, developing RL-friendly base models, and exploring new branches of the OctoThinker family, such as tool-integrated reasoning.


              Check out the Paper, Hugging Face Page and GitHub Page. All credit for this research goes to the researchers of this project.

