MarkTechPost@AI · March 5
Few-Shot Preference Optimization (FSPO): A Novel Machine Learning Framework Designed to Model Diverse Sub-Populations in Preference Datasets to Elicit Personalization in Language Models for Open-Ended Question Answering

FSPO (Few-Shot Preference Optimization) is a new machine learning framework designed to personalize language models for open-ended question answering by modeling the diverse sub-populations present in preference datasets. Unlike traditional reward modeling, FSPO uses a meta-learning approach that adapts quickly to individual users from only a few labeled preferences. To address data scarcity, the framework generates more than one million synthetic personalized preferences, constructed for diversity and consistency so that they transfer to real-world use. Evaluated across three domains and 1,500 synthetic users, FSPO achieves an 87% AlpacaEval win rate and a 72% win rate with real users.

💡 FSPO frames personalization as a meta-learning problem: preferences are tied to user-specific identifiers and each user is modeled as a task instance, addressing the way conventional RLHF fine-tuning averages away individual differences.

📚 FSPO constructs few-shot prompts and leverages pre-trained LLMs to achieve effective personalization. In addition, each user representation is framed as an (N)-bit preference encoding, enabling structured generalization.

✅ FSPO addresses data scarcity by generating more than one million structured synthetic preferences and uses a black-box meta-learning approach to adapt quickly to new users from minimal data.

🎯 FSPO was evaluated in three domains: reviews, educational explanations, and roleplay-based question answering. The results show that FSPO consistently outperforms baseline models across tasks, achieving an 82.6% win rate with real users.

Personalizing LLMs is essential for applications such as virtual assistants and content recommendation, ensuring responses align with individual user preferences. Unlike traditional approaches that optimize models against aggregated user feedback, personalization aims to capture the diversity of individual perspectives shaped by culture, experience, and values. Current optimization methods, such as reinforcement learning from human feedback (RLHF), rely on a single reward model, potentially overlooking minority viewpoints and introducing bias. A more effective approach is to learn a distribution over reward functions rather than a single one, enabling LLMs to generate responses tailored to different user groups. This shift improves user satisfaction and fosters inclusivity by acknowledging diverse perspectives. However, implementing it effectively for open-ended question answering in real-world applications remains challenging.

Research on preference learning has explored multiple strategies for personalization. Some methods, like distributional alignment, aim to match model outputs to broad statistical properties but lack direct adaptation to individual users. Others attempt to model reward distributions explicitly, yet they face challenges in sample efficiency and real-world evaluations. Many existing approaches, such as GPO and human-correction-based methods, work well in structured tasks but have not been thoroughly tested for open-ended personalization. Supervised fine-tuning, reinforcement learning techniques like PPO, and alternative methods like DPO and IPO have been explored for refining LLM outputs based on user preferences. FSPO, a black-box meta-learning approach, adapts to new user preferences with minimal examples, leveraging techniques from prior studies in language modeling, reinforcement learning, and meta-learning.
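Since DPO is referenced here as one of the preference-optimization techniques this line of work builds on, a compact sketch of the standard DPO loss may help ground the discussion. It assumes the summed log-probabilities of the chosen and rejected responses under the trainable policy and a frozen reference model have already been computed; it is a generic illustration, not the paper's training code.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,   # log pi_theta(y_w | x), shape (batch,)
             policy_rejected_logps: torch.Tensor, # log pi_theta(y_l | x)
             ref_chosen_logps: torch.Tensor,      # log pi_ref(y_w | x)
             ref_rejected_logps: torch.Tensor,    # log pi_ref(y_l | x)
             beta: float = 0.1) -> torch.Tensor:
    # DPO pushes the policy's log-ratio for the chosen response above the
    # rejected one's, measured relative to the frozen reference model.
    margin = (policy_chosen_logps - ref_chosen_logps) - (policy_rejected_logps - ref_rejected_logps)
    return -F.logsigmoid(beta * margin).mean()
```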

Researchers from Stanford University, Google DeepMind, and OpenAI propose Few-Shot Preference Optimization (FSPO), a framework that personalizes language models by adapting to user preferences with minimal labeled examples. Instead of relying on aggregated human feedback, FSPO reframes reward modeling as a meta-learning problem, enabling models to construct personalized reward functions. The approach generates over a million structured synthetic preferences to address data scarcity. Evaluated across three domains—reviews, educational adaptation, and roleplay—FSPO achieves an 87% win rate in synthetic user personalization and 72% with real users, enhancing LLMs’ ability to align with diverse user needs in open-ended interactions.
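To make the data-generation step concrete, here is a minimal sketch of how persona-conditioned synthetic preferences could be produced with an off-the-shelf LLM. The `llm` callable, the prompt templates, and the two-candidate setup are illustrative assumptions; the paper's actual pipeline is more elaborate.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class PreferenceExample:
    user_id: str     # identifier tying the preference to a synthetic persona
    question: str    # open-ended prompt shown to the model
    chosen: str      # response the persona prefers
    rejected: str    # response the persona rejects

def build_synthetic_preferences(
    llm: Callable[[str], str],   # assumed text-in/text-out wrapper around any chat model
    personas: List[str],         # natural-language persona descriptions
    questions: List[str],
) -> List[PreferenceExample]:
    data = []
    for i, persona in enumerate(personas):
        for question in questions:
            # Sample two candidate answers with deliberately different styles.
            answer_a = llm(f"Answer concisely:\n{question}")
            answer_b = llm(f"Answer in depth, with examples:\n{question}")
            # Ask the persona-conditioned model which answer it prefers, keeping
            # the label consistent with the persona description.
            verdict = llm(
                f"You are this user: {persona}\n"
                f"Question: {question}\n"
                f"Answer A: {answer_a}\nAnswer B: {answer_b}\n"
                "Which answer do you prefer? Reply with exactly 'A' or 'B'."
            )
            prefers_a = verdict.strip().upper().startswith("A")
            chosen, rejected = (answer_a, answer_b) if prefers_a else (answer_b, answer_a)
            data.append(PreferenceExample(f"user_{i}", question, chosen, rejected))
    return data
```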

The FSPO framework treats personalization as a meta-learning problem. Traditional fine-tuning with RLHF aggregates user preferences across a population, often marginalizing individual differences. FSPO addresses this by associating preferences with user-specific identifiers and modeling each user as a task instance. Using a black-box meta-learning approach, it quickly adapts to new users with minimal data. FSPO constructs few-shot prompts to leverage pre-trained LLMs for effective personalization. Additionally, user representation is framed as an (N)-bit preference encoding, allowing structured generalization. FSPO is evaluated across three domains: reviews, educational explanations, and roleplay-based question answering.
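The following sketch illustrates the two ideas above in code: serializing a user's few labeled preferences into a few-shot prompt for a pre-trained LLM, and summarizing a user as an N-bit preference vector. The prompt template, probe-question setup, and helper names are assumptions for illustration, not the exact format used in the paper.

```python
from typing import Dict, List, Tuple

Preference = Tuple[str, str, str]  # (question, chosen_answer, rejected_answer)

def fewshot_prompt(user_prefs: List[Preference], new_question: str) -> str:
    """Serialize a user's few labeled preferences into the model's context."""
    lines = ["Examples of answers this user preferred:"]
    for q, chosen, rejected in user_prefs:
        lines += [f"Q: {q}", f"Preferred: {chosen}", f"Not preferred: {rejected}", ""]
    lines += ["Answer the next question in the style this user prefers.",
              f"Q: {new_question}", "A:"]
    return "\n".join(lines)

def preference_bits(user_choices: Dict[str, str],
                    probe_questions: List[str]) -> List[int]:
    """Encode a user as N bits: bit i is 1 if the user picked option 'B' for the
    i-th probe question, 0 otherwise (missing answers default to 0)."""
    return [1 if user_choices.get(q) == "B" else 0 for q in probe_questions]
```

In this view, the few-shot prompt is what the black-box meta-learner conditions on, so adapting to a new user only requires collecting a handful of labeled comparisons and re-rendering this context.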

FSPO is evaluated against four baselines: (1) a generic instruct model, (2) few-shot prompting, (3) few-shot fine-tuning (Pref-FT), and (4) prompting with an oracle user description. FSPO consistently outperforms these baselines across various tasks. Synthetic win rates, assessed via a modified AlpacaEval, show FSPO excelling in ELIX, Review, and Roleplay tasks, achieving an 82.6% win rate on real users. A human study with 25 participants confirms FSPO’s effectiveness, with a 72% win rate over both the base and SFT models. FSPO demonstrates strong personalization, narrowing the gap to oracle performance through chain-of-thought reasoning.
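As a rough illustration of how such win rates are computed, the sketch below tallies pairwise comparisons decided by a judge, in the spirit of an AlpacaEval-style evaluation. The `judge` callable (for example, an LLM prompted with the user description and both responses) is a hypothetical stand-in, not the exact protocol used in the study.

```python
from typing import Callable, List, Tuple

def win_rate(
    pairs: List[Tuple[str, str, str]],       # (user_description, personalized_response, baseline_response)
    judge: Callable[[str, str, str], bool],  # returns True if the personalized response is preferred
) -> float:
    # Fraction of comparisons in which the personalized response wins.
    wins = sum(judge(user, ours, base) for user, ours, base in pairs)
    return wins / len(pairs) if pairs else 0.0
```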

In conclusion, FSPO is a framework for personalizing language models in open-ended question answering by modeling diverse human preferences through meta-learning. Unlike traditional reward modeling, FSPO rapidly adapts to individual users using a few labeled preferences. Over 1M synthetic personalized preferences are generated to address data scarcity, ensuring diversity and consistency for effective real-world transfer. Evaluated across three domains and 1,500 synthetic users, FSPO achieves an 87% AlpacaEval win rate and a 72% win rate with real users. This approach enhances personalization in virtual assistants and content curation applications, contributing to more inclusive and user-centric language models.


Check out the Paper. All credit for this research goes to the researchers of this project.
