MarkTechPost@AI July 29, 2024
HyPO: A Hybrid Reinforcement Learning Algorithm that Uses Offline Data for Contrastive-based Preference Optimization and Online Unlabeled Data for KL Regularization

HyPO is a hybrid reinforcement learning algorithm that combines offline and online data to improve the alignment of large language models with human preferences. It uses offline data for initial preference optimization and online unlabeled data for KL regularization, ensuring that the model stays close to the reference policy and generalizes better beyond the training data.

🎉 HyPO uses offline data for the DPO objective and online samples to control the reverse KL divergence. The algorithm iteratively updates the model parameters by optimizing the DPO loss together with a KL regularization term derived from online samples. By incorporating the strengths of online RL methods without their computational complexity, this hybrid approach effectively addresses the shortcomings of purely offline methods, such as overfitting and insufficient dataset coverage.

📈 HyPO was evaluated on several benchmarks, including the TL;DR summarization task and general chat benchmarks such as AlpacaEval 2.0 and MT-Bench. The results were impressive: HyPO achieved a 46.44% win rate on the TL;DR task with the Pythia 1.4B model, versus 42.17% for DPO. With the Pythia 2.8B model, HyPO reached a 50.50% win rate, significantly outperforming DPO's 44.39%. In addition, HyPO showed better control over reverse KL divergence, with values of 0.37 and 2.51 for the Pythia 1.4B and 2.8B models, respectively, compared to 0.16 and 2.43 for DPO.

📊 HyPO also showed notable improvements on general chat benchmarks. For example, in the MT-Bench evaluation, HyPO-fine-tuned models scored 8.43 and 8.09 on the first- and second-turn averages, surpassing the DPO-fine-tuned models' 8.31 and 7.89. Likewise, on AlpacaEval 2.0, HyPO achieved win rates of 30.7% and 32.2% for the first and second turns, versus 28.4% and 30.9% for DPO.

💡 Empirical results show that HyPO mitigates the overfitting commonly observed in offline contrastive methods. For example, when trained on the TL;DR dataset, HyPO maintained a mean validation KL score significantly lower than DPO's, indicating better alignment with the reference policy and reduced overfitting. This ability to use online data for regularization helps HyPO achieve more robust performance across a variety of tasks.

✅ In summary, the introduction of Hybrid Preference Optimization (HyPO), which effectively combines offline and online data, addresses the limitations of existing methods and strengthens the alignment of large language models with human preferences. The performance improvements demonstrated in the empirical evaluations highlight HyPO's potential for delivering more accurate and reliable AI systems.

A critical aspect of AI research involves fine-tuning large language models (LLMs) to align their outputs with human preferences. This fine-tuning ensures that AI systems generate responses that are useful, relevant, and aligned with user expectations. The current paradigm in AI emphasizes learning from human preference data to refine these models, sidestepping the complexity of manually specifying reward functions for various tasks. The two predominant techniques in this area are online reinforcement learning (RL) and offline contrastive methods, each offering unique advantages and challenges.

A central challenge in fine-tuning LLMs to reflect human preferences is the limited coverage of static datasets. These datasets may not adequately represent the diverse and dynamic range of human preferences encountered in real-world applications. The issue of dataset coverage becomes particularly pronounced when models are trained exclusively on pre-collected data, potentially leading to suboptimal performance. This problem underscores the need for methods that effectively leverage both static datasets and real-time data to improve model alignment with human preferences.

Existing techniques for preference fine-tuning in LLMs include online RL methods, such as Proximal Policy Optimization (PPO), and offline contrastive methods, like Direct Preference Optimization (DPO). Online RL methods involve a two-stage procedure where a reward model is trained on a fixed offline preference dataset, followed by RL training using on-policy data. This approach benefits from real-time feedback but is computationally intensive. In contrast, offline contrastive methods optimize policies based solely on pre-collected data, avoiding the need for real-time sampling but potentially suffering from overfitting and limited generalization capabilities.
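As context for the contrast drawn above, the following is a minimal PyTorch-style sketch of the DPO loss that offline contrastive methods optimize on a fixed preference dataset. It is an illustration, not code from the paper; the function and parameter names are assumptions, and it presumes the summed per-sequence log-probabilities of the chosen and rejected responses have already been computed under both the trained policy and a frozen reference policy.

```python
import torch.nn.functional as F

def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO: -log sigmoid(beta * (policy margin - reference margin)), averaged over the batch."""
    policy_margin = policy_logp_chosen - policy_logp_rejected
    ref_margin = ref_logp_chosen - ref_logp_rejected
    return -F.logsigmoid(beta * (policy_margin - ref_margin)).mean()
```

Because this objective is computed entirely from pre-collected preference pairs, no sampling from the current policy is needed during training, which is exactly where the coverage and overfitting concerns arise.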

Researchers from Carnegie Mellon University, Aurora Innovation, and Cornell University introduced a novel method called Hybrid Preference Optimization (HyPO). This hybrid approach combines the strengths of online and offline techniques, aiming to improve model performance while maintaining computational efficiency. HyPO uses offline data for initial preference optimization and online unlabeled data for Kullback-Leibler (KL) regularization, which keeps the model close to a reference policy and helps it generalize beyond the training data.

HyPO utilizes a sophisticated algorithmic framework that leverages offline data for the DPO objective and online samples to control the reverse KL divergence. The algorithm iteratively updates the model’s parameters by optimizing the DPO loss while incorporating a KL regularization term derived from online samples. This hybrid approach effectively addresses the deficiencies of purely offline methods, such as overfitting and insufficient dataset coverage, by incorporating the strengths of online RL methods without their computational complexity.
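A rough sketch of one such hybrid update is given below. It assumes hypothetical helpers such as `policy.generate` and `policy.sequence_logprob`, and an `offline_batch` with `prompts`, `chosen`, and `rejected` fields; it is meant only to illustrate the combination of a DPO term on offline pairs with a reverse-KL term estimated from online samples, not to reproduce the authors' implementation.

```python
import torch
import torch.nn.functional as F

def hypo_step(policy, ref_policy, offline_batch, online_prompts, optimizer,
              beta=0.1, lam=0.05):
    """One hybrid update: DPO loss on offline pairs + reverse-KL penalty from online samples."""
    # (1) Offline part: standard DPO loss on labeled preference pairs.
    logp_w = policy.sequence_logprob(offline_batch.prompts, offline_batch.chosen)
    logp_l = policy.sequence_logprob(offline_batch.prompts, offline_batch.rejected)
    with torch.no_grad():  # the reference policy is frozen
        ref_w = ref_policy.sequence_logprob(offline_batch.prompts, offline_batch.chosen)
        ref_l = ref_policy.sequence_logprob(offline_batch.prompts, offline_batch.rejected)
    dpo_loss = -F.logsigmoid(beta * ((logp_w - logp_l) - (ref_w - ref_l))).mean()

    # (2) Online part: sample unlabeled responses y ~ pi_theta(.|x) and use the
    # single-sample estimate of reverse KL(pi_theta || pi_ref) =
    # E_y[log pi_theta(y|x) - log pi_ref(y|x)]. No preference labels are needed.
    samples = policy.generate(online_prompts)
    logp_online = policy.sequence_logprob(online_prompts, samples)
    with torch.no_grad():
        ref_online = ref_policy.sequence_logprob(online_prompts, samples)
    kl_reg = (logp_online - ref_online).mean()
    # Note: samples are treated as constants; gradients flow only through
    # log pi_theta(y|x), a common simplification for KL penalties.

    loss = dpo_loss + lam * kl_reg
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Here `lam` trades off fitting the offline preference data against staying close to the reference policy; both `beta` and `lam` are illustrative values, not the paper's settings.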

The performance of HyPO was evaluated on several benchmarks, including the TL;DR summarization task and general chat benchmarks like AlpacaEval 2.0 and MT-Bench. The results were impressive, with HyPO achieving a win rate of 46.44% on the TL;DR task using the Pythia 1.4B model, compared to 42.17% for the DPO method. For the Pythia 2.8B model, HyPO achieved a win rate of 50.50%, significantly outperforming DPO’s 44.39%. Additionally, HyPO demonstrated superior control over reverse KL divergence, with values of 0.37 and 2.51 for the Pythia 1.4B and 2.8B models, respectively, compared to 0.16 and 2.43 for DPO.

In general chat benchmarks, HyPO also showed notable improvements. For instance, in the MT-Bench evaluation, HyPO-fine-tuned models achieved first- and second-turn average scores of 8.43 and 8.09, surpassing the DPO-fine-tuned models' scores of 8.31 and 7.89. Similarly, on AlpacaEval 2.0, HyPO achieved win rates of 30.7% and 32.2% for the first and second turns, compared to DPO's 28.4% and 30.9%.

The empirical results highlight HyPO’s ability to mitigate overfitting issues commonly observed in offline contrastive methods. For example, when trained on the TL;DR dataset, HyPO maintained a mean validation KL score significantly lower than that of DPO, indicating better alignment with the reference policy and reduced overfitting. This ability to leverage online data for regularization helps HyPO achieve more robust performance across various tasks.
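A simple way to track the kind of drift described here is to periodically estimate the mean reverse KL to the reference policy on held-out prompts. The sketch below is an assumed monitoring routine, not the paper's evaluation code, and reuses the same hypothetical `generate` and `sequence_logprob` helpers as the training sketch above.

```python
import torch

@torch.no_grad()
def mean_validation_kl(policy, ref_policy, val_prompts):
    """Single-sample estimate of KL(pi_theta || pi_ref) averaged over validation prompts."""
    samples = policy.generate(val_prompts)
    logp = policy.sequence_logprob(val_prompts, samples)
    ref_logp = ref_policy.sequence_logprob(val_prompts, samples)
    return (logp - ref_logp).mean().item()
```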

In conclusion, the introduction of Hybrid Preference Optimization (HyPO), which effectively combines offline and online data, addresses the limitations of existing methods and enhances the alignment of large language models with human preferences. The performance improvements demonstrated in empirical evaluations underscore the potential of HyPO to deliver more accurate and reliable AI systems.


Check out the Paper. All credit for this research goes to the researchers of this project.



Related tags

HyPO, Reinforcement Learning, Preference Optimization, Large Language Models, Artificial Intelligence