MarkTechPost@AI, February 4
This AI Paper from Meta Introduces Diverse Preference Optimization (DivPO): A Novel Optimization Method for Enhancing Diversity in Large Language Models

This article covers Diverse Preference Optimization (DivPO), a method proposed by Meta and collaborating institutions to address the drop in response diversity that large language models (LLMs) exhibit after preference training. Conventional preference optimization methods favor high-reward responses, which makes model outputs highly repetitive and limits their use in creative tasks. DivPO instead optimizes over response pairs that are both high quality and diverse, so the model produces outputs that satisfy human preferences while remaining varied. Experiments show that DivPO substantially improves diversity in persona generation and open-ended creative writing without sacrificing quality, offering a more effective option for applications that require a broad range of outputs.

🌟 Large language models excel at simulating human language but fall short in response diversity, especially in creative tasks such as synthetic data generation and storytelling.

🎯 Conventional preference optimization methods such as RLHF and DPO tend to concentrate probability mass on a small set of high-reward responses, leading to repetitive outputs and reduced adaptability in creative applications.

💡 DivPO is a new optimization technique that selects response pairs that are both high quality and diverse, ensuring the model's outputs align with human preferences while remaining varied, which improves performance in creative and data-driven applications.

🔬 Experiments show that DivPO improves persona attribute diversity by 45.6% and story diversity by 74.6% while preserving output quality, demonstrating its effectiveness at increasing model diversity.

Large language models (LLMs) have advanced the field of artificial intelligence and are used in many applications. Although they can simulate human language almost perfectly, they often fall short in response diversity. This limitation is particularly problematic in tasks requiring creativity, such as synthetic data generation and storytelling, where diverse outputs are essential for maintaining relevance and engagement.

One of the major challenges in language model optimization is the reduction in response diversity due to preference training techniques. Post-training methods like reinforcement learning from human feedback (RLHF) and direct preference optimization (DPO) tend to concentrate probability mass on a limited number of high-reward responses. This results in models generating repetitive outputs for various prompts, restricting their adaptability in creative applications. The decline in diversity hinders the potential of language models to function effectively in fields that require broad-ranging outputs.

Previous methods for preference optimization primarily emphasize aligning models with high-quality human preferences. Supervised fine-tuning and RLHF techniques, while effective at improving model alignment, inadvertently lead to response homogenization. Direct Preference Optimization (DPO) selects highly rewarded responses while discarding low-quality ones, reinforcing the tendency for models to produce predictable outputs. Attempts to counteract this issue, such as adjusting sampling temperatures or applying KL divergence regularization, have failed to significantly enhance diversity without compromising output quality.
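The collapse described above is easiest to see in the DPO objective itself, which rewards the policy for making the chosen response as much more likely than the rejected one as possible. Below is a minimal, hedged sketch of that standard loss in PyTorch; the variable names and the default beta are illustrative, not taken from the paper.

```python
# Minimal sketch of the standard DPO loss (not DivPO itself), included to show
# why preference training concentrates probability mass on chosen responses.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Standard DPO: raise the policy's log-probability of the chosen response
    relative to a frozen reference model, and lower the rejected one."""
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    # Minimizing -log(sigmoid(.)) drives the chosen response to be far more
    # likely than the rejected one; when "chosen" is always the single
    # highest-reward sample, the policy collapses onto a few outputs.
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()

# Toy per-sequence log-probabilities for a batch of two preference pairs.
loss = dpo_loss(torch.tensor([-12.0, -15.0]), torch.tensor([-14.0, -13.0]),
                torch.tensor([-13.0, -14.5]), torch.tensor([-13.5, -13.2]))
```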

Researchers from Meta, New York University, and ETH Zurich have introduced Diverse Preference Optimization (DivPO), a novel technique designed to enhance response diversity while maintaining high quality. Unlike traditional optimization methods, which prioritize the single highest-rewarded response, DivPO selects preference pairs based on both quality and diversity. This ensures that the model generates outputs that are not only human-aligned but also varied, making them more effective in creative and data-driven applications.

DivPO operates by sampling multiple responses for a given prompt and scoring them using a reward model. Instead of selecting the single highest-rewarded response, the most diverse, high-quality response is chosen as the preferred output. Simultaneously, the least varied response that does not meet the quality threshold is selected as the rejected output. This contrastive optimization strategy allows DivPO to learn a broader distribution of responses while ensuring that each output retains a high-quality standard. The approach incorporates various diversity criteria, including model probability, word frequency, and an LLM-based diversity judgment, to assess each response’s distinctiveness systematically.
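To make the selection rule concrete, here is a hedged sketch of how a DivPO-style pair could be formed from a sampled pool. The reward_fn, the quality_threshold, and the word-frequency diversity criterion are assumptions for illustration (the paper also considers model probability and an LLM-based diversity judge); this is not the authors' exact implementation.

```python
# Hedged sketch of DivPO-style pair selection as described above: sample a
# pool of responses, score them with a reward model, take the most diverse
# response that clears a quality threshold as "chosen" and the least diverse
# response below the threshold as "rejected".
from typing import Callable, Dict, List, Optional, Tuple

def word_frequency_diversity(response: str, corpus_counts: Dict[str, int]) -> float:
    """Responses built from words that appeared rarely in previous generations
    score higher (one possible diversity criterion among several)."""
    words = response.lower().split()
    if not words:
        return float("-inf")
    avg_freq = sum(corpus_counts.get(w, 0) for w in words) / len(words)
    return -avg_freq  # rarer wording => larger diversity score

def select_divpo_pair(responses: List[str],
                      reward_fn: Callable[[str], float],
                      corpus_counts: Dict[str, int],
                      quality_threshold: float) -> Optional[Tuple[str, str]]:
    """Return (chosen, rejected) for contrastive training, or None if the pool
    lacks either a high-quality or a below-threshold response."""
    scored = [(r, reward_fn(r), word_frequency_diversity(r, corpus_counts))
              for r in responses]
    high_quality = [s for s in scored if s[1] >= quality_threshold]
    low_quality = [s for s in scored if s[1] < quality_threshold]
    if not high_quality or not low_quality:
        return None
    chosen = max(high_quality, key=lambda s: s[2])[0]   # most diverse, good enough
    rejected = min(low_quality, key=lambda s: s[2])[0]  # least diverse, below the bar
    return chosen, rejected
```

Pairs chosen this way can then be trained with a standard contrastive preference objective such as the DPO loss sketched earlier, so the diversity pressure comes from how pairs are selected rather than from modifying the loss itself.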

Extensive experiments were conducted to validate the effectiveness of DivPO, focusing on structured persona generation and open-ended creative writing tasks. The results demonstrated that DivPO significantly increased diversity without sacrificing quality. Compared to standard preference optimization methods, DivPO led to a 45.6% increase in persona attribute diversity and a 74.6% rise in story diversity. The experiments also showed that DivPO prevents models from generating a small subset of responses disproportionately, ensuring a more even distribution of generated attributes. A key observation was that models trained using DivPO consistently outperformed baseline models in diversity evaluations while maintaining high quality, as assessed by the ArmoRM reward model.

Further analysis of persona generation revealed that traditional fine-tuned models, such as Llama-3.1-8B-Instruct, failed to produce varied persona attributes, often repeating a limited set of names. DivPO rectified this issue by expanding the generated attribute range, leading to a more balanced and representative output distribution. The structured persona generation task demonstrated that online DivPO with word frequency criteria improved diversity by 30.07% compared to the baseline model while maintaining a comparable level of response quality. Similarly, the keyword-based creative writing task showed a substantial improvement, with DivPO achieving a 13.6% increase in diversity and a 39.6% increase in quality relative to the standard preference optimization models.

These findings confirm that preference optimization methods inherently reduce diversity, which poses a challenge for language models intended for open-ended tasks. DivPO mitigates this issue by incorporating diversity-aware selection criteria, enabling language models to maintain high-quality responses without collapsing onto a narrow set of outputs. By balancing diversity with alignment, DivPO enhances the adaptability and utility of LLMs across multiple domains, keeping them useful for creative, analytical, and synthetic data generation applications. The introduction of DivPO marks a meaningful advance in preference optimization, offering a practical solution to the long-standing problem of response collapse in language models.


Check out the Paper. All credit for this research goes to the researchers of this project.



