MarkTechPost@AI · July 7, 06:15
New AI Method From Meta and NYU Boosts LLM Alignment Using Semi-Online Reinforcement Learning

This article looks at aligning large language models (LLMs) with reinforcement learning (RL) to improve how well they serve human users. The focus is a semi-online training setup, a balanced strategy between offline and online methods designed to optimize training efficiency and model adaptability. By adjusting how often the model's generation and training components are synchronized, the research team achieved performance gains on both verifiable and non-verifiable tasks. Experiments show that this flexible synchronization scheme performs well on tasks such as mathematical reasoning and instruction following, offering a new direction for LLM alignment.

💡**Why alignment matters:** To better serve human needs, large language models (LLMs) typically go through an alignment phase in which reinforcement learning (RL) optimizes them to make decisions based on human feedback or task correctness, bringing their behavior closer to user expectations.

🤔**The training-strategy challenge:** Training methods split into offline and online approaches. Offline methods rely on static data and cannot adapt during training, while online methods demand more computational resources. The fact that models must perform well on both mathematical (verifiable) and open-ended (non-verifiable) tasks makes the choice even harder.

⚙️**Advantages of the semi-online approach:** Research from Meta and NYU introduces a semi-online training setup that balances training efficiency and model adaptability by adjusting how frequently the model's generation and training components are synchronized. The approach finds a middle ground between offline and online strategies, reducing training time while preserving the model's flexibility.

📈**Experimental results and applications:** The team fine-tuned the Llama-3.1-8B-Instruct model on two task types, instruction following and math problem solving. Semi-online DPO outperformed offline DPO on benchmarks such as Math500 and NuminaMath. Combining verifiable and non-verifiable reward types also improved performance on open-ended tasks such as AlpacaEval 2.0 and Arena-Hard, indicating that the method generalizes well.

Optimizing LLMs for Human Alignment Using Reinforcement Learning

Large language models often require a further alignment phase to optimize them for human use. In this phase, reinforcement learning plays a central role by enabling models to make decisions based on human feedback or task-based correctness. This fine-tuning allows the models to align more closely with user expectations, making them better suited to instruction-based applications and precise mathematical tasks.

Challenges in Choosing Offline vs. Online Reinforcement Learning Strategies

A major difficulty arises when choosing the most effective way to conduct this fine-tuning. Training methods fall into two extremes—offline approaches that depend on static, pre-generated data and fully online approaches that continuously update with each new interaction. Each method has distinct challenges. Offline models can’t adapt during training, which limits performance, while online models often demand more computational resources. Moreover, ensuring that models perform well across both mathematical (verifiable) and open-ended (non-verifiable) tasks adds further complexity to this choice.

Overview of Alignment Algorithms: DPO and GRPO

Historically, tools like Direct Preference Optimization (DPO) and Group Relative Policy Optimization (GRPO) have been employed for model alignment. DPO operates offline and is designed to work with preference-based data pairs. It is valued for its simplicity and data efficiency but lacks the adaptability of online methods. GRPO is based on the PPO algorithm and handles online fine-tuning by comparing groups of outputs to compute relative advantages. While GRPO adapts in real time and suits dynamic reward systems, its on-policy nature increases computational load and makes experimentation more demanding.
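To make the contrast concrete, the sketch below shows the core computation behind each algorithm: the DPO loss on a batch of (chosen, rejected) preference pairs, and the group-relative advantage that GRPO feeds into its PPO-style update. This is a minimal PyTorch illustration under assumed tensor names and default hyperparameters, not the paper's implementation.

```python
# Minimal sketch of the DPO and GRPO core computations, assuming per-sequence
# log-probabilities have already been gathered; names and defaults are illustrative.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """DPO loss over a batch of (chosen, rejected) preference pairs."""
    # Log-ratios of the policy against the frozen reference model.
    chosen_logratios = policy_chosen_logps - ref_chosen_logps
    rejected_logratios = policy_rejected_logps - ref_rejected_logps
    # Push the policy to prefer the chosen response over the rejected one.
    return -F.logsigmoid(beta * (chosen_logratios - rejected_logratios)).mean()

def grpo_advantages(group_rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Group-relative advantages: each sampled response is scored relative to
    the mean and standard deviation of rewards within its group of completions."""
    return (group_rewards - group_rewards.mean()) / (group_rewards.std() + eps)
```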

A Balanced Alternative for LLM Alignment

Research introduced by Meta and NYU explored a method to overcome these limitations through a semi-online training setup. This technique modulates how frequently the model’s generation and training components are synchronized, rather than updating at every training step, as in fully online methods, or not at all, as in offline setups. The semi-online method strikes a middle ground by adjusting the synchronization rate. Researchers designed this approach to reduce training time and maintain high model adaptability. The modular setup also allowed them to apply either DPO or GRPO with task-specific reward models in a flexible manner.
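A rough sketch of what such a synchronization schedule could look like is given below: a generation worker produces rollouts with possibly stale weights, and the trainer pushes its updated weights to that worker only every `sync_interval` steps, so an interval of 1 recovers fully online training while a very large interval approaches the offline limit. The `trainer`, `generator`, and `prompts` objects and their methods are hypothetical placeholders, not the authors' code.

```python
# A minimal sketch of a semi-online schedule under assumed interfaces.
import copy

def semi_online_training(trainer, generator, prompts, sync_interval: int, num_steps: int):
    """sync_interval = 1 behaves like fully online training; a very large value
    approaches the offline limit where the generator is never refreshed."""
    for step in range(num_steps):
        # The generation worker samples rollouts with possibly stale weights.
        batch = prompts.sample_batch()
        rollouts = generator.generate(batch)

        # Score rollouts with a task-specific reward model or verifier,
        # then take a DPO or GRPO update step on the trainer's weights.
        rewards = trainer.score(rollouts)
        trainer.update(rollouts, rewards)

        # Only every `sync_interval` steps are the trainer's weights
        # copied back into the generation worker.
        if (step + 1) % sync_interval == 0:
            generator.load_state_dict(copy.deepcopy(trainer.state_dict()))
```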

Instruction Following and Mathematical Reasoning

The methodology involved fine-tuning the Llama-3.1-8B-Instruct model on two types of tasks: open-ended instruction following and math problem-solving. For non-verifiable tasks, user prompts were sampled from the WildChat-1M dataset and responses were scored with the Athene-RM-8B reward model, which assigns a scalar score to each response. For verifiable tasks, the team used the NuminaMath dataset together with the Math-Verify toolkit, which checks whether generated answers match the expected outputs. Experiments ran on 32 NVIDIA H200 GPUs for training and 8 GPUs for inference, with different setups comparing offline, semi-online, and online synchronization intervals.
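The reward routing described above could be sketched roughly as follows: verifiable math prompts are checked against a gold answer, while open-ended prompts are scored by a scalar reward model. The Math-Verify `parse`/`verify` calls and the `reward_model.score(...)` interface are assumptions about the respective tools, and the dispatch logic is illustrative rather than the authors' implementation.

```python
# Hypothetical reward routing for verifiable vs. non-verifiable prompts.
from math_verify import parse, verify  # assumed: Hugging Face Math-Verify package

def verifiable_reward(model_answer: str, gold_answer: str) -> float:
    """Binary reward: 1.0 if the generated answer matches the reference answer."""
    return 1.0 if verify(parse(gold_answer), parse(model_answer)) else 0.0

def non_verifiable_reward(prompt: str, response: str, reward_model) -> float:
    """Scalar reward from a preference reward model such as Athene-RM-8B
    (the .score() interface is an assumption)."""
    return float(reward_model.score(prompt, response))

def mixed_reward(sample: dict, reward_model) -> float:
    """Dispatch on whether the sample carries a verifiable gold answer."""
    if sample.get("gold_answer") is not None:
        return verifiable_reward(sample["response"], sample["gold_answer"])
    return non_verifiable_reward(sample["prompt"], sample["response"], reward_model)
```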

Performance Gains Across Both Verifiable and Non-Verifiable Tasks

Clear performance differences emerged. On Math500, offline DPO reached 53.7% accuracy, whereas semi-online DPO with a synchronization interval of s = 100 achieved 58.9%. Online DPO and GRPO showed similar results at 58.7% and 58.1%, respectively. The same trend appeared on the NuminaMath benchmark, where offline DPO achieved 36.4% and semi-online variants raised this to 39.4% (s = 10). The gains were not limited to math tasks: when non-verifiable tasks were evaluated with the AlpacaEval 2.0 and Arena-Hard benchmarks, models trained with mixed reward types performed consistently better. Combining verifiable and non-verifiable rewards in a single training setup produced stronger average scores, indicating that the method generalizes effectively.

A Flexible, Scalable Approach for Reinforcement Learning in LLMs

This study demonstrates that fine-tuning large language models does not require strict adherence to either offline or online setups. By introducing a flexible synchronization scheme, the research team from Meta and NYU effectively increased training efficiency while maintaining or improving performance. The results show that carefully balancing reward types and training synchronization frequency leads to models that perform well across task types without incurring high computational costs.


Check out the Paper. All credit for this research goes to the researchers of this project.

