MarkTechPost@AI August 1, 2024
How Important is the Reference Model in Direct Preference Optimization (DPO)? An Empirical Study on Optimal KL-Divergence Constraints and Necessity

An overview of Direct Preference Optimization (DPO) as a language model training method, covering its principles, advantages, and related research.

🎯 Direct Preference Optimization is an advanced training method for fine-tuning large language models. It trains models to distinguish the quality of candidate outputs so that they align with human preferences, strengthening their ability to generate desired responses, and it learns from preference feedback.

🔍 The study focuses on the limitations of over-relying on a reference model or policy during DPO. Understanding how, and how strongly, to use the reference is necessary to maximize the efficiency and output quality of DPO-trained models, and the work explores the balance between keeping a strong reference policy and giving the model enough flexibility.

📚 Current preference-learning methods include supervised fine-tuning, reinforcement learning, and reward-based training techniques. DPO uses a KL-divergence constraint to manage deviation from the reference model, balancing adherence to the reference against optimization for performance and improving the model's alignment with human preferences.

🔬 Researchers from Yale University and other institutions conduct a comprehensive analysis of DPO's dependence on the reference policy, exploring the optimal strength of the KL-divergence constraint, evaluating whether a reference policy is necessary in instruction fine-tuning, and experimentally analyzing how different constraint strengths affect model accuracy and stability.

🎉 The study finds that a smaller KL-divergence constraint generally improves performance until it becomes too small and performance degrades, and that a stronger reference model provides additional benefit only when it is compatible with the fine-tuned model, underscoring the importance of choosing an appropriate reference policy for optimal results.

Direct Preference Optimization (DPO) is an advanced training method for fine-tuning large language models (LLMs). Unlike traditional supervised fine-tuning, which depends on a single gold reference, DPO trains models to differentiate between the quality of various candidate outputs. This technique is crucial for aligning LLMs with human preferences, enhancing their ability to generate desired responses effectively. Derived from the reinforcement-learning-from-human-feedback (RLHF) objective, DPO lets models learn directly from preference feedback without training a separate reward model, making it a valuable approach in language model training.
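For context, the standard DPO objective (Rafailov et al., 2023) makes the reference model's role explicit: the policy π_θ is trained to prefer the chosen response y_w over the rejected response y_l through log-probability ratios against the frozen reference π_ref, with β setting how tightly the implicit KL-divergence constraint binds. In standard notation (not specific to this study):

```latex
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta;\pi_{\mathrm{ref}})
  = -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}
    \left[\log \sigma\!\left(
      \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
      - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
    \right)\right]
```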

The primary issue addressed in this study involves the limitations imposed by relying heavily on reference models or policies during the DPO process. While essential for maintaining stability and direction in training, these references can restrict the potential improvements in LLM performance. Understanding these references’ optimal use and strength is vital for maximizing the efficiency and output quality of DPO-trained models. The research explores the balance between maintaining a strong reference policy and allowing enough flexibility for the model to improve beyond the initial constraints.

Current methods in preference learning include supervised fine-tuning (SFT), reinforcement learning (RL) approaches, and reward-based training techniques. SFT relies on a single gold reference, while RL and reward-based methods such as contrastive learning train models to rank and prefer better outputs based on feedback. DPO, specifically, incorporates a KL-divergence constraint to manage deviations from a reference model. This constraint ensures the model does not stray too far from the reference, balancing adherence to the reference against optimizing for better performance. These methods improve the model's alignment with human preferences, making it more effective at generating accurate and preferred outputs.
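As a concrete illustration of how that constraint enters training, here is a minimal sketch of a pairwise DPO loss in PyTorch; the function name and the dummy log-probabilities are illustrative rather than taken from the paper's code, and a real implementation would compute sequence log-probabilities from the policy and a frozen reference model:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.01):
    """Pairwise DPO loss from sequence log-probabilities.

    beta is the KL-constraint strength discussed above: smaller values
    let the policy drift further from the reference model.
    """
    # Log-probability ratios of the policy against the frozen reference
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps

    # Logistic loss on the beta-scaled margin between preferred
    # and dispreferred responses
    logits = beta * (chosen_logratio - rejected_logratio)
    return -F.logsigmoid(logits).mean()

# Dummy sequence log-probabilities for a batch of two preference pairs
policy_w = torch.tensor([-12.3, -9.8])
policy_l = torch.tensor([-14.1, -11.0])
ref_w = torch.tensor([-13.0, -10.2])
ref_l = torch.tensor([-13.5, -10.9])
print(dpo_loss(policy_w, policy_l, ref_w, ref_l, beta=0.01))
```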

Researchers from Yale University, Shanghai Jiao Tong University, and the Allen Institute for AI introduced a comprehensive analysis of DPO’s dependency on reference policies. They explored the optimal strength of the KL-divergence constraint and evaluated the necessity of reference policies in instruction fine-tuning. The study involved varying the constraint strength to determine the best balance that maximizes DPO performance without over-relying on the reference model. The research aimed to provide insights into the confounding role of reference policies and offer guidance on best practices for future studies.

The proposed method involves a detailed investigation into different strengths of the KL-divergence constraint used in DPO. The researchers conducted experiments using open-source pre-trained LLMs, Tulu 2 and Mistral, on the AlpacaEval benchmark. They analyzed sequence-level and token-level performance to understand how varying constraint strengths affect model accuracy and stability. The experiments revealed that a smaller KL-divergence constraint generally improved performance until it became too small, leading to degradation. Furthermore, they examined the necessity of reference policies by comparing DPO with alternative learning objectives, demonstrating DPO’s superiority when used with an appropriate reference model.
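The sweep itself can be pictured as a simple grid search over β. The sketch below is purely illustrative: the training and evaluation functions are stubs standing in for a full DPO run and an AlpacaEval-style evaluation, and the grid values are assumptions rather than the paper's exact settings.

```python
def train_dpo_model(beta: float):
    """Stub: fine-tune a model with DPO at the given beta and return it."""
    return {"beta": beta}  # placeholder for a trained checkpoint

def score_on_benchmark(model) -> float:
    """Stub: return a benchmark win rate for the trained model."""
    return 0.0  # placeholder score

# Illustrative grid of constraint strengths, from tighter (0.1) to looser (0.001)
betas = [0.1, 0.05, 0.02, 0.01, 0.005, 0.001]
scores = {b: score_on_benchmark(train_dpo_model(b)) for b in betas}
best_beta = max(scores, key=scores.get)
print(f"Best beta in this sweep: {best_beta}")
```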

The study found significant results regarding the impact of the KL-divergence constraint on DPO performance. A smaller constraint typically led to better performance, with the optimal value of β being around 0.01 to 0.02. For example, the model fine-tuned from Mistral-7b achieved an AlpacaEval2 score of 16.25 with a β of 0.01, compared to the original score of 7.57 without DPO. The analysis showed that reducing the constraint strength improved performance until it became too small, at which point the model’s performance degraded. Furthermore, stronger reference models, like Mistral-v0.2 and Llama-3-70b, provided additional benefits, but only when compatible with the fine-tuned model. The study highlighted the importance of selecting an appropriate reference policy to achieve optimal results.

The research underscores the nuanced role of reference policies in DPO. By carefully calibrating the constraint strength and selecting compatible reference models, researchers can significantly enhance the performance of LLMs. The findings emphasize the need for future research to explore the relationship between reference policies and DPO training performance. Moreover, the study calls for more theoretical and empirical guidelines to better understand the compatibility between the trained and reference models. Overall, this research provides valuable insights and practical recommendations for improving DPO and advancing the field of language model fine-tuning.


Check out the Paper and GitHub. All credit for this research goes to the researchers of this project.

