MarkTechPost@AI · March 8, 15:05
CMU Researchers Introduce PAPRIKA: A Fine-Tuning Approach that Enables Language Models to Develop General Decision-Making Capabilities Not Confined to a Particular Environment

Researchers at Carnegie Mellon University have proposed a new method called PAPRIKA, designed to give language models stronger general decision-making capabilities. Rather than relying on traditional training data, PAPRIKA trains on synthetic interaction data generated across a variety of tasks, including guessing games, puzzles, and customer-service simulations. In this way, the model learns to adjust its behavior based on environmental feedback without additional gradient updates. PAPRIKA uses a two-stage fine-tuning process combined with a curriculum learning strategy, which improves data efficiency and helps the model generalize its decision-making strategies. Experiments show that PAPRIKA significantly improves the model's decision-making success rate across different tasks.

🧩 PAPRIKA is a method for improving the general decision-making ability of language models. It is not tied to a specific environment; instead, it trains models to interact across a variety of synthetic tasks so they can adjust their behavior based on environmental feedback.

📊 PAPRIKA uses a two-stage fine-tuning process: the first stage exposes the LLM to a large set of synthetic trajectories generated with min-p sampling, ensuring the training data is both diverse and coherent; the second stage combines supervised fine-tuning (SFT) with direct preference optimization (DPO), comparing trajectory pairs so the model gradually favors those that lead more directly to task success.

🎯 PAPRIKA also integrates a curriculum learning strategy that dynamically selects tasks with high potential learning value, prioritizing those that yield richer learning signals. This improves data efficiency and helps the model better generalize its decision-making strategies.

📈 Experiments show that PAPRIKA significantly improved the average success rate on a bandit best-arm selection task; after training on trajectories from ten different task groups, overall model performance improved by roughly 47%.

🔍 Further experiments show that the decision-making strategies learned through PAPRIKA generalize to previously unseen tasks: even when one group of tasks is held out during training, the model remains competitive on the held-out group.

In today’s rapidly evolving AI landscape, one persistent challenge is equipping language models with robust decision-making abilities that extend beyond single-turn interactions. Traditional large language models (LLMs) excel at generating coherent responses but often struggle with multi-step problem solving or interacting with dynamic environments. This shortfall largely stems from the nature of the training data, which rarely reflects the structured, interactive experiences that real-world scenarios demand. Moreover, directly deploying models to gather real-world interaction data can be both costly and risky. Hence, there is a clear need for methodologies that teach LLMs to explore, gather relevant information, and make thoughtful, sequential decisions in a safe and controlled manner.

In response to these challenges, researchers from Carnegie Mellon University have developed an approach known as PAPRIKA. This method is designed to endow language models with general decision-making capabilities that are not limited to any single environment. Rather than relying on traditional training data, PAPRIKA leverages synthetic interaction data generated across a diverse set of tasks. These tasks range from classic guessing games like twenty questions to puzzles such as Mastermind and even scenarios simulating customer service interactions. By training on these varied trajectories, the model learns to adjust its behavior based on contextual feedback from its environment—without the need for additional gradient updates. This approach encourages the model to adopt a more flexible, in-context learning strategy that can be applied to a range of new tasks.

Technical Details and Benefits

PAPRIKA’s methodology is built on a two-stage fine-tuning process. The first stage involves exposing the LLM to a large set of synthetic trajectories generated using a method called Min‑p sampling, which ensures that the training data is both diverse and coherent. This step allows the model to experience a wide spectrum of interaction strategies, including both successful and less effective decision-making behaviors. The second stage refines the model using a blend of supervised fine-tuning (SFT) and a direct preference optimization (DPO) objective. In this setup, pairs of trajectories are compared, with the model gradually learning to favor those that lead more directly to task success.
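Min-p sampling keeps only tokens whose probability is at least a fixed fraction of the top token's probability, which preserves diversity at higher temperatures without admitting incoherent low-probability tokens. A minimal sketch of the idea (the threshold fraction `p_base` is a tunable parameter, not a value taken from the paper):

```python
import math
import random

def min_p_sample(logits, p_base=0.1, temperature=1.0):
    """Min-p sampling: keep tokens whose probability is at least
    p_base times the top token's probability, renormalize, and sample."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]  # stable softmax
    z = sum(exps)
    probs = [e / z for e in exps]
    cutoff = p_base * max(probs)
    kept = [(i, p) for i, p in enumerate(probs) if p >= cutoff]
    total = sum(p for _, p in kept)
    r = random.random() * total
    acc = 0.0
    for i, p in kept:
        acc += p
        if acc >= r:
            return i
    return kept[-1][0]
```

The cutoff adapts to the model's confidence: when the distribution is flat, many tokens survive and trajectories stay diverse; when it is peaked, only the near-best tokens remain, keeping the text coherent.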

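The preference step can be sketched with the standard DPO objective, which scores a chosen/rejected trajectory pair by the difference of their log-probability ratios against a reference model; here the "chosen" trajectory would be the one that led more directly to task success. A minimal sketch of the loss, not the authors' exact training code:

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one trajectory pair:
    -log sigmoid(beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))),
    where w is the preferred (successful) trajectory and l the other."""
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

At a margin of zero the loss is log 2; it falls as the model assigns relatively more probability to the successful trajectory than the reference model does.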
Recognizing that not all tasks are equally challenging, PAPRIKA also integrates a curriculum learning strategy. This component dynamically selects tasks based on their potential to offer meaningful learning experiences. By prioritizing tasks that yield richer learning signals, the approach enhances data efficiency and helps the model better generalize its decision-making strategies. The combination of these methods results in a refined model that is adept at sequential decision making across various contexts.
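As an illustration of this idea (not the paper's exact scoring rule, which is an assumption here), one could weight each task group by the variance of its current success rate, p·(1−p), so that tasks the model sometimes solves and sometimes fails — the ones producing the most informative preference pairs — are sampled most often:

```python
import random

def sample_task(success_rates):
    """Curriculum sketch: sample a task group with probability
    proportional to p * (1 - p), the variance of its success rate.
    Tasks that are always solved or never solved score zero."""
    scores = [p * (1 - p) for p in success_rates]
    total = sum(scores) or 1.0
    weights = [s / total for s in scores]
    r = random.random()
    acc = 0.0
    for i, w in enumerate(weights):
        acc += w
        if acc > r:
            return i
    return len(weights) - 1
```

Under this weighting, a task the model solves 50% of the time is sampled far more often than one it solves 99% of the time, concentrating the training budget where the learning signal is richest.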

Results and Insights

The practical benefits of the PAPRIKA method are evident in its empirical results. In one illustrative example, the approach was applied to a bandit best arm selection task—a scenario that requires careful allocation of a limited sampling budget to identify the most promising option. Here, PAPRIKA increased the average success rate notably, demonstrating a marked improvement in strategic decision-making. More broadly, when the model was trained on trajectories from a set of ten diverse task groups, its overall performance improved by approximately 47% compared to the baseline model, achieved with roughly 22,500 training trajectories.
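For concreteness, the best-arm selection setting can be simulated with a simple non-adaptive baseline that splits the budget evenly across arms; a trained agent is expected to beat this by reallocating samples toward promising arms as rewards come in. This sketch assumes Bernoulli reward arms (an illustrative assumption, not the paper's exact environment):

```python
import random

def best_arm_uniform(means, budget, seed=0):
    """Uniform-allocation baseline for best-arm identification:
    spend the sampling budget round-robin across Bernoulli arms,
    then recommend the arm with the highest empirical mean."""
    rng = random.Random(seed)
    n = len(means)
    wins = [0] * n
    pulls = [0] * n
    for t in range(budget):
        arm = t % n  # non-adaptive round-robin allocation
        wins[arm] += rng.random() < means[arm]
        pulls[arm] += 1
    return max(range(n), key=lambda i: wins[i] / max(pulls[i], 1))
```

The interesting behavior PAPRIKA targets is precisely what this baseline lacks: deciding, mid-episode, which arms deserve more of the remaining budget.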

Further experiments using a leave-one-out evaluation demonstrated that the decision-making strategies learned through PAPRIKA could generalize to previously unseen tasks. For example, when the model was trained on all but one group of tasks, it still performed competitively on the omitted group. This finding suggests that the strategies developed through this fine-tuning method are not narrowly tailored to specific tasks but can be transferred across different decision-making scenarios. Moreover, a study involving curriculum learning showed that selectively sampling training tasks according to their difficulty could yield additional improvements, reinforcing the value of a tailored, data-driven approach to task selection.
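The leave-one-out protocol itself is straightforward to express (the group names below are placeholders, not the paper's actual task groups):

```python
def leave_one_out_splits(task_groups):
    """Yield (train_groups, held_out_group) pairs: train on all
    task groups but one, then evaluate on the held-out group."""
    for i, held_out in enumerate(task_groups):
        train = task_groups[:i] + task_groups[i + 1:]
        yield train, held_out
```

Each split tests whether strategies learned on the remaining groups transfer to tasks the model never saw during fine-tuning.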

Conclusion

In summary, PAPRIKA represents a thoughtful and measured approach to bridging the gap between static language understanding and dynamic, sequential decision making. By harnessing synthetic interaction data and employing a carefully designed two-stage fine-tuning process augmented with curriculum learning, CMU researchers have demonstrated that LLMs can be refined into more adaptable decision makers. This method, rather than resorting to task-specific tuning, prepares models to engage in new challenges with minimal additional training.

The capability to interact with external environments, collect pertinent information, and adjust decisions based on feedback is essential for any system designed to operate autonomously. While there remain challenges—such as ensuring a solid starting model and managing the computational costs of synthetic data generation—PAPRIKA offers a promising avenue toward developing more versatile AI systems. Ultimately, as our models continue to advance, approaches like PAPRIKA will be important for creating tools that are not only proficient in language understanding but also capable of navigating complex, real-world decision-making tasks with subtlety and care.


Check out the Paper, GitHub Page and Model on Hugging Face. All credit for this research goes to the researchers of this project.

