MarkTechPost@AI · February 1
Curiosity-Driven Reinforcement Learning from Human Feedback CD-RLHF: An AI Framework that Mitigates the Diversity Alignment Trade-off In Language Models

This article covers CD-RLHF, a new framework proposed by Baidu to address the loss of output diversity that occurs when large language models (LLMs) are fine-tuned with Reinforcement Learning from Human Feedback (RLHF). CD-RLHF introduces curiosity as an intrinsic reward that works alongside the traditional extrinsic reward, estimating curiosity from the prediction error of a forward-dynamics model over state representations. Evaluated on TL;DR summarization and UltraFeedback instruction following, CD-RLHF significantly improves output diversity while maintaining high alignment quality, outperforming standard RLHF. Even so, further research is needed to fully close the gap between diversity and alignment.

🤔 Standard RLHF training improves alignment but tends to sacrifice output diversity, which is a problem for creative, open-ended tasks such as story generation and data synthesis.

💡 The CD-RLHF framework introduces a curiosity-driven intrinsic reward that is combined with the extrinsic reward from the reward model during the RLHF training stage, encouraging the model to explore different outputs and thereby increasing diversity.

📊 Experiments show that CD-RLHF significantly improves output diversity on both the TL;DR summarization task and the UltraFeedback instruction-following task while matching or exceeding the alignment quality of baseline methods. For example, on TL;DR, output diversity improves by 16.66% for Gemma-2B and 6.22% for Gemma-7B.

🏆 In external validation with GPT-4, CD-RLHF achieves win rates of up to 58% against the PPO baseline on TL;DR and an average of 62% on UltraFeedback, demonstrating its strong performance on both diversity and alignment.

Large Language Models (LLMs) have become increasingly reliant on Reinforcement Learning from Human Feedback (RLHF) for fine-tuning across various applications, including code generation, mathematical reasoning, and dialogue assistance. However, a significant challenge has emerged in the form of reduced output diversity when using RLHF. Research has identified a critical trade-off between alignment quality and output diversity in RLHF-trained models. When these models align highly with desired objectives, they show limited output variability. This limitation poses concerns for creative open-ended tasks such as story generation, data synthesis, and red-teaming, where diverse outputs are essential for effective performance.

Existing approaches to LLM alignment have focused on enhancing instruction following, safety, and reliability through RLHF, but these improvements often come at the cost of output diversity. Various methods have been developed to address this challenge, including the use of f-divergence with DPO/PPO algorithms, which attempt to balance diversity and alignment. Other approaches integrate evaluation metrics like SelfBLEU and Sentence-BERT into RL fine-tuning to boost diversity, particularly for red-teaming tasks. Moreover, some researchers have explored curiosity-driven reinforcement learning methods, ranging from count-based approaches to prediction error-based techniques. Despite these efforts, the fundamental trade-off between alignment quality and output diversity remains a significant challenge.
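For readers unfamiliar with the diversity metrics mentioned above, the sketch below shows one common way Self-BLEU is computed using NLTK. It is only an illustration of the metric's idea; the helper name, smoothing choice, and tokenization are ours, not the exact configuration used in the cited work.

```python
# Minimal sketch of a Self-BLEU diversity score (hypothetical helper, not the
# cited work's exact setup). Each generation is scored against all other
# generations as references, so a LOWER average indicates MORE diverse outputs.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def self_bleu(generations, max_ngram=4):
    smooth = SmoothingFunction().method1
    weights = tuple(1.0 / max_ngram for _ in range(max_ngram))
    scores = []
    for i, hypothesis in enumerate(generations):
        references = [g.split() for j, g in enumerate(generations) if j != i]
        scores.append(sentence_bleu(references, hypothesis.split(),
                                    weights=weights,
                                    smoothing_function=smooth))
    return sum(scores) / len(scores)

samples = ["the cat sat on the mat",
           "a dog ran across the park",
           "the cat sat on the rug"]
print(f"Self-BLEU: {self_bleu(samples):.3f}")  # lower means more diverse
```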

Researchers from Baidu have proposed a novel framework called Curiosity-driven Reinforcement Learning from Human Feedback (CD-RLHF) to address the diversity-alignment trade-off in language models. This approach incorporates curiosity as an intrinsic reward mechanism during the RLHF training stage, working alongside traditional extrinsic rewards from the reward model. CD-RLHF uses forward dynamics to compute prediction errors of state representations, which helps estimate curiosity levels. A key feature of this approach is that frequently visited states gradually become less interesting to the model. This dual reward system aims to maintain high alignment quality while promoting diverse outputs through varied token choices at each decision point.
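To make the idea concrete, here is a minimal PyTorch sketch of a curiosity-style intrinsic reward built from a forward-dynamics prediction error and mixed with an extrinsic reward. The module names, the use of the policy's hidden states as "states", and the weighting coefficient `beta` are illustrative assumptions, not the authors' exact implementation.

```python
# Sketch of a curiosity bonus from forward-dynamics prediction error, combined
# with the reward model's extrinsic reward (illustrative names and weighting).
import torch
import torch.nn as nn

class ForwardDynamics(nn.Module):
    """Predicts the next state representation from (current state, chosen token)."""
    def __init__(self, hidden_dim, vocab_size, emb_dim=128):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, emb_dim)
        self.net = nn.Sequential(
            nn.Linear(hidden_dim + emb_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, self.token_emb(action)], dim=-1))

def curiosity_reward(fdm, state, action, next_state):
    # Intrinsic reward = prediction error of the forward-dynamics model.
    # Frequently visited states become easy to predict, so their bonus decays.
    pred = fdm(state, action)
    return ((pred - next_state.detach()) ** 2).mean(dim=-1)

def total_reward(extrinsic, intrinsic, beta=0.1):
    # Combine the reward model's score with the curiosity bonus.
    return extrinsic + beta * intrinsic

# Example with dummy tensors (batch of 4 token-level decision points):
fdm = ForwardDynamics(hidden_dim=16, vocab_size=1000)
state = torch.randn(4, 16)              # current hidden states
action = torch.randint(0, 1000, (4,))   # tokens sampled by the policy
next_state = torch.randn(4, 16)         # hidden states after emitting the token
reward = total_reward(torch.ones(4),
                      curiosity_reward(fdm, state, action, next_state))
```

The key property this captures is that the intrinsic bonus shrinks for states the dynamics model has learned to predict, so exploration pressure concentrates on novel token choices rather than on well-trodden ones.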

The implementation and evaluation of CD-RLHF encompass multiple components and datasets. The framework was tested on two primary datasets: TL;DR for text summarization, containing 93k human-annotated preference pairs, and UltraFeedback for instruction following, with 61.1k training pairs. It was implemented on several base models, including Gemma-2B, Gemma-7B, Llama-3.2-1B, and Llama-3.2-3B, all trained within the DeepSpeed-Chat framework. The training data was distributed across the SFT, RM, and PPO stages in a 20/40/40 ratio. For comparison, two baselines were implemented: vanilla RLHF and Sent-Rewards, the latter of which uses SelfBLEU and Sentence-BERT scores as additional rewards during training.
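As a rough illustration of the 20/40/40 stage split described above, the following sketch partitions a generic list of examples into SFT, RM, and PPO subsets; the helper name is hypothetical and the actual DeepSpeed-Chat data handling differs.

```python
# Hypothetical helper showing a 20/40/40 split across SFT, RM, and PPO stages.
import random

def split_sft_rm_ppo(examples, seed=0):
    shuffled = examples[:]
    random.Random(seed).shuffle(shuffled)
    n = len(shuffled)
    n_sft, n_rm = int(0.2 * n), int(0.4 * n)
    return (shuffled[:n_sft],                  # 20% for supervised fine-tuning
            shuffled[n_sft:n_sft + n_rm],      # 40% for reward-model training
            shuffled[n_sft + n_rm:])           # remaining 40% for PPO / RLHF
```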

The experimental results demonstrate CD-RLHF’s superior performance across multiple evaluation metrics and models. On the TL;DR summarization task, CD-RLHF achieves significant improvements in output diversity, with gains of 16.66% and 6.22% on Gemma-2B and Gemma-7B, respectively, over the RLHF baseline. On the UltraFeedback instruction-following task, the method shows even stronger results, with diversity improvements ranging from 7.35% to 14.29% across models while maintaining strong alignment quality. External validation through GPT-4 evaluation showed CD-RLHF achieving win rates of up to 58% against the PPO baseline on TL;DR and an average of 62% on UltraFeedback.

In conclusion, the researchers introduced CD-RLHF, a significant step toward resolving the diversity-alignment trade-off in language model training. The framework combines curiosity-driven exploration with traditional extrinsic rewards to enhance output diversity while maintaining alignment quality, as shown through extensive testing on TL;DR summarization and UltraFeedback instruction following. Despite these achievements, several challenges remain, including the need to balance the scales of the different rewards and the persistent gap in output diversity between SFT-trained and RLHF-trained models. While CD-RLHF mitigates the trade-off between diversity and alignment, further research is needed to fully bridge this gap and achieve optimal performance on both axes.


Check out the Paper and GitHub Page. All credit for this research goes to the researchers of this project.
