MarkTechPost@AI February 1
The Allen Institute for AI (AI2) Releases Tülu 3 405B: Scaling Open-Weight Post-Training with Reinforcement Learning from Verifiable Rewards (RLVR) to Surpass DeepSeek V3 and GPT-4o in Key Benchmarks

The Allen Institute for AI (AI2) has released Tülu 3 405B, an open-weight 405-billion-parameter model that uses Reinforcement Learning from Verifiable Rewards (RLVR) to surpass DeepSeek V3 and GPT-4o on key benchmarks. The model follows a new post-training recipe optimized across several stages, including data curation, supervised fine-tuning, Direct Preference Optimization, and RLVR, and it performs especially well on mathematical problem-solving. The success of Tülu 3 405B shows that open-weight models can compete with leading proprietary models and lays the groundwork for research on even larger open models.

🚀 Tülu 3 405B is the first open-weight model to successfully apply a fully open post-training recipe at the 405-billion-parameter scale; it builds on the Llama 3.1 base model and improves performance through multiple enhancements.

🎯 The model introduces Reinforcement Learning from Verifiable Rewards (RLVR), which bases rewards on verifiable outcomes rather than subjective feedback, significantly improving performance on specialized tasks such as mathematics.

📊 Tülu 3 405B outperforms DeepSeek V3 and GPT-4o on several safety and reasoning benchmarks, and it also surpasses earlier open-weight models such as Llama 3.1 405B Instruct.

🔬 The research shows that training large models on specialized datasets such as MATH works better than using general datasets such as GSM8k, especially for mathematical problem-solving.

⚙️ Tülu 3's post-training pipeline has four stages: data curation and synthesis, supervised fine-tuning (SFT), Direct Preference Optimization (DPO), and Reinforcement Learning from Verifiable Rewards (RLVR); together these stages yield strong performance across a wide range of tasks.

Post-training techniques, such as instruction tuning and reinforcement learning from human feedback, have become essential for refining language models. However, open-source approaches often fall behind proprietary models due to a lack of transparency in training data, methodologies, and optimization techniques. Despite the availability of foundational models, the absence of robust, publicly available post-training recipes creates a performance gap between open and closed models, limiting advancements in open AI research.

Previous open-source efforts, including Tülu 2 and Zephyr-β, have attempted to improve post-training methods but remain constrained by simpler and more cost-effective pipelines. In contrast, proprietary models like GPT-4o and Claude 3.5-Haiku benefit from larger datasets, refined optimization techniques, and extensive human feedback, and they consistently outperform open-weight models. Research on preference tuning and reinforcement learning has progressed, but existing open approaches lack the scalability and rigor of closed-source methodologies.

In collaboration with the University of Washington, the Allen Institute for AI (AI2) research team introduced Tülu 3 last year, a breakthrough in open-weight post-training. Tülu 3 builds on the Llama 3.1 base model and incorporates multiple enhancements designed to scale effectively while maintaining superior performance. 

The team has developed its latest release, Tülu 3 405B, the first open-weight model to successfully apply a fully open post-training recipe at a 405-billion-parameter scale. The model introduces a novel reinforcement learning approach known as Reinforcement Learning with Verifiable Rewards (RLVR), which significantly improves model performance in specialized tasks by ensuring that rewards are based on verifiable outcomes rather than subjective feedback. The research team deployed Tülu 3 405B using vLLM with 16-way tensor parallelism, optimizing computational efficiency across 256 GPUs running in parallel.
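
For readers curious what that inference setup looks like in practice, below is a minimal vLLM sketch that loads a single 16-GPU tensor-parallel replica. The Hugging Face repository ID is an assumption based on AI2's naming, and the article's 256-GPU deployment would run multiple such replicas in parallel.

```python
# Minimal sketch: serving a Tulu-3-style checkpoint with vLLM using
# 16-way tensor parallelism (one 16-GPU replica).
from vllm import LLM, SamplingParams

llm = LLM(
    model="allenai/Llama-3.1-Tulu-3-405B",  # assumed Hugging Face repo ID
    tensor_parallel_size=16,                # shard weights across 16 GPUs
)

params = SamplingParams(temperature=0.7, max_tokens=512)
outputs = llm.generate(["Solve: what is 17 * 23?"], params)
print(outputs[0].outputs[0].text)
```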

The Tülu 3 post-training recipe follows a four-stage approach that begins with data curation and synthesis, ensuring that core skills such as reasoning, mathematics, coding, and safety are well represented. The next stage involves supervised fine-tuning (SFT), where the model is trained using carefully selected prompts and their completions. Direct Preference Optimization (DPO) is applied in the third stage, leveraging off-policy and on-policy preference data to refine responses. Finally, RLVR is introduced to enhance specialized skills, particularly in verifiable tasks such as mathematical problem-solving. One of the key differentiators of Tülu 3’s approach is its ability to scale effectively. The team found that using MATH data exclusively, rather than combining GSM8k and IFEval, yielded better results for larger models.
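
As a rough illustration of the third stage, the snippet below sketches the standard DPO objective over per-sequence log-probabilities. It is a generic PyTorch formulation that assumes the log-probabilities have already been computed, not the team's actual training code.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Standard DPO loss over per-sequence log-probabilities.

    Each argument is a tensor of summed token log-probs for the chosen
    (preferred) or rejected completion under the policy or the frozen
    reference model; beta controls how far the policy may drift.
    """
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Maximize the margin between chosen and rejected implicit rewards.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Toy usage with made-up log-probabilities for two preference pairs.
loss = dpo_loss(torch.tensor([-12.3, -8.1]), torch.tensor([-15.0, -9.4]),
                torch.tensor([-13.0, -8.5]), torch.tensor([-14.2, -9.0]))
print(loss.item())
```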

Tülu 3 405B demonstrated competitive or superior performance compared to DeepSeek V3 and GPT-4o, outperforming prior open-weight models such as Llama 3.1 405B Instruct and Nous Hermes 3 405B. The results showed a consistent edge in safety benchmarks, where many open-weight models have struggled. The RLVR framework particularly contributed to a significant increase in MATH performance at the 405B scale, with improvements in instruction-following tasks. 
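
The mechanism behind that MATH gain can be illustrated with a toy verifiable reward: a completion earns a reward only if its final answer checks out against a known ground truth, with no learned reward model or subjective judgment involved. The helper below is purely illustrative (the answer-extraction pattern and function name are assumptions), not AI2's implementation.

```python
import re

def verifiable_math_reward(completion: str, ground_truth: str) -> float:
    """Illustrative RLVR-style reward: 1.0 only if the final answer
    extracted from the completion matches the ground truth, else 0.0."""
    # Assume the model is prompted to finish with "Answer: <value>".
    match = re.search(r"Answer:\s*(-?[\d.,/]+)", completion)
    if match is None:
        return 0.0
    predicted = match.group(1).strip().rstrip(".").replace(",", "")
    return 1.0 if predicted == ground_truth.strip() else 0.0

# Toy usage: only the verifiably correct completion is rewarded.
print(verifiable_math_reward("... so x = 12. Answer: 12", "12"))  # 1.0
print(verifiable_math_reward("... Answer: 13", "12"))             # 0.0
```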

The model’s training process involved extensive computational resources, including 32 nodes and 256 GPUs. During RLVR training, inference took approximately 550 seconds per iteration, weight transfer required 25 seconds, and training took around 1,500 seconds per iteration. After this rigorous training process, the final model demonstrated robust generalization capabilities across multiple benchmarks.

Key takeaways from the Tülu 3 research and its latest release:

- Tülu 3 was released in multiple parameter configurations, including 8B, 70B, and 405B, each fine-tuned using supervised learning, preference optimization, and RLVR techniques.
- Training Tülu 3 405B required 256 GPUs running in parallel, with RLVR iterations taking roughly 550 seconds for inference and 1,500 seconds for training.
- The model surpassed DeepSeek V3 and GPT-4o in various safety and reasoning benchmarks while outperforming previous open-weight models such as Llama 3.1 405B Instruct.
- The research demonstrated that larger models perform better when trained on specialized datasets like MATH than on general datasets like GSM8k.
- RLVR, a novel reinforcement learning approach, rewards model completions only when results are verifiable, improving performance in mathematics and structured reasoning.
- While Tülu 3 405B sets a new standard, further research is needed to explore larger value models and alternate RL algorithms, such as GRPO, for optimizing reward structures.

In conclusion, the evolution of post-training techniques has underscored the persistent performance gap between open and proprietary models due to differences in training methodologies, data transparency, and optimization approaches. While previous open-weight models made progress, they remained behind leading proprietary models. The introduction of Tülu 3 405B marks a milestone in scaling fully open post-training techniques to large-scale models, demonstrating competitive or superior performance to state-of-the-art models such as DeepSeek V3 and GPT-4o. Notably, the Reinforcement Learning with Verifiable Rewards (RLVR) framework showed greater effectiveness at the 405B scale, particularly in mathematical problem-solving, suggesting that larger models benefit more from specialized data. Despite technical challenges in compute requirements and hyperparameter tuning, the success of Tülu 3 405B highlights the viability of open post-training recipes for achieving cutting-edge model performance.


Check out the Model on Hugging Face. All credit for this research goes to the researchers of this project.

