MarkTechPost@AI · March 8, 15:05
CMU Researchers Introduce PAPRIKA: A Fine-Tuning Approach that Enables Language Models to Develop General Decision-Making Capabilities Not Confined to a Particular Environment

Researchers at Carnegie Mellon University have proposed a new method called PAPRIKA, designed to give language models stronger general decision-making capabilities. Rather than relying on traditional training data, PAPRIKA trains on synthetic interaction data generated across a variety of tasks, including guessing games, puzzles, and customer-service simulations. In this way, the model learns to adjust its behavior based on environmental feedback without additional gradient updates. PAPRIKA uses a two-stage fine-tuning process combined with a curriculum learning strategy, which improves data efficiency and helps the model generalize its decision-making strategies. Experiments show that PAPRIKA significantly improves the model's decision-making success rate across different tasks.

🧩 PAPRIKA is a method for improving the general decision-making ability of language models. It is not tied to a specific environment; instead, it trains models to interact across a variety of synthetic tasks so they can adjust their behavior based on environmental feedback.

📊 PAPRIKA uses a two-stage fine-tuning process: the first stage exposes the LLM to a large set of synthetic trajectories generated with min-p sampling, ensuring the training data is both diverse and coherent; the second stage combines supervised fine-tuning (SFT) with direct preference optimization (DPO), comparing trajectory pairs so the model gradually favors those that lead more directly to task success.

🎯 PAPRIKA also integrates a curriculum learning strategy that dynamically selects tasks with high potential learning value, prioritizing those that yield richer learning signals. This improves data efficiency and helps the model better generalize its decision-making strategies.

📈 Experiments show that PAPRIKA significantly improved the average success rate on a bandit best-arm selection task; after training on trajectories from ten different task groups, overall model performance improved by roughly 47%.

🔍 Further experiments show that the decision-making strategies learned through PAPRIKA generalize to previously unseen tasks: even when one group of tasks is held out during training, the model remains competitive on the held-out group.

In today’s rapidly evolving AI landscape, one persistent challenge is equipping language models with robust decision-making abilities that extend beyond single-turn interactions. Traditional large language models (LLMs) excel at generating coherent responses but often struggle with multi-step problem solving or interacting with dynamic environments. This shortfall largely stems from the nature of the training data, which rarely reflects the structured, interactive experiences that real-world scenarios demand. Moreover, directly deploying models to gather real-world interaction data can be both costly and risky. Hence, there is a clear need for methodologies that teach LLMs to explore, gather relevant information, and make thoughtful, sequential decisions in a safe and controlled manner.

In response to these challenges, researchers from Carnegie Mellon University have developed an approach known as PAPRIKA. This method is designed to endow language models with general decision-making capabilities that are not limited to any single environment. Rather than relying on traditional training data, PAPRIKA leverages synthetic interaction data generated across a diverse set of tasks. These tasks range from classic guessing games like twenty questions to puzzles such as Mastermind and even scenarios simulating customer service interactions. By training on these varied trajectories, the model learns to adjust its behavior based on contextual feedback from its environment—without the need for additional gradient updates. This approach encourages the model to adopt a more flexible, in-context learning strategy that can be applied to a range of new tasks.

Technical Details and Benefits

PAPRIKA’s methodology is built on a two-stage fine-tuning process. The first stage involves exposing the LLM to a large set of synthetic trajectories generated using a method called Min‑p sampling, which ensures that the training data is both diverse and coherent. This step allows the model to experience a wide spectrum of interaction strategies, including both successful and less effective decision-making behaviors. The second stage refines the model using a blend of supervised fine-tuning (SFT) and a direct preference optimization (DPO) objective. In this setup, pairs of trajectories are compared, with the model gradually learning to favor those that lead more directly to task success.
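Min-p sampling keeps only tokens whose probability is at least a fixed fraction of the top token's probability, which preserves diversity at higher temperatures without admitting incoherent low-probability tokens. A minimal sketch of the idea (the threshold fraction `p_base` is a tunable parameter, not a value taken from the paper):

```python
import math
import random

def min_p_sample(logits, p_base=0.1, temperature=1.0):
    """Min-p sampling: keep tokens whose probability is at least
    p_base times the top token's probability, renormalize, and sample."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]  # stable softmax
    z = sum(exps)
    probs = [e / z for e in exps]
    cutoff = p_base * max(probs)
    kept = [(i, p) for i, p in enumerate(probs) if p >= cutoff]
    total = sum(p for _, p in kept)
    r = random.random() * total
    acc = 0.0
    for i, p in kept:
        acc += p
        if acc >= r:
            return i
    return kept[-1][0]
```

The cutoff adapts to the model's confidence: when the distribution is flat, many tokens survive and trajectories stay diverse; when it is peaked, only the near-best tokens remain, keeping the text coherent.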

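The preference step can be sketched with the standard DPO objective, which scores a chosen/rejected trajectory pair by the difference of their log-probability ratios against a reference model; here the "chosen" trajectory would be the one that led more directly to task success. A minimal sketch of the loss, not the authors' exact training code:

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one trajectory pair:
    -log sigmoid(beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))),
    where w is the preferred (successful) trajectory and l the other."""
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

At a margin of zero the loss is log 2; it falls as the model assigns relatively more probability to the successful trajectory than the reference model does.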
Recognizing that not all tasks are equally challenging, PAPRIKA also integrates a curriculum learning strategy. This component dynamically selects tasks based on their potential to offer meaningful learning experiences. By prioritizing tasks that yield richer learning signals, the approach enhances data efficiency and helps the model better generalize its decision-making strategies. The combination of these methods results in a refined model that is adept at sequential decision making across various contexts.
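As an illustration of this idea (not the paper's exact scoring rule, which is an assumption here), one could weight each task group by the variance of its current success rate, p·(1−p), so that tasks the model sometimes solves and sometimes fails — the ones producing the most informative preference pairs — are sampled most often:

```python
import random

def sample_task(success_rates):
    """Curriculum sketch: sample a task group with probability
    proportional to p * (1 - p), the variance of its success rate.
    Tasks that are always solved or never solved score zero."""
    scores = [p * (1 - p) for p in success_rates]
    total = sum(scores) or 1.0
    weights = [s / total for s in scores]
    r = random.random()
    acc = 0.0
    for i, w in enumerate(weights):
        acc += w
        if acc > r:
            return i
    return len(weights) - 1
```

Under this weighting, a task the model solves 50% of the time is sampled far more often than one it solves 99% of the time, concentrating the training budget where the learning signal is richest.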

Results and Insights

The practical benefits of the PAPRIKA method are evident in its empirical results. In one illustrative example, the approach was applied to a bandit best arm selection task—a scenario that requires careful allocation of a limited sampling budget to identify the most promising option. Here, PAPRIKA increased the average success rate notably, demonstrating a marked improvement in strategic decision-making. More broadly, when the model was trained on trajectories from a set of ten diverse task groups, its overall performance improved by approximately 47% compared to the baseline model, achieved with roughly 22,500 training trajectories.
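For concreteness, the best-arm selection setting can be simulated with a simple non-adaptive baseline that splits the budget evenly across arms; a trained agent is expected to beat this by reallocating samples toward promising arms as rewards come in. This sketch assumes Bernoulli reward arms (an illustrative assumption, not the paper's exact environment):

```python
import random

def best_arm_uniform(means, budget, seed=0):
    """Uniform-allocation baseline for best-arm identification:
    spend the sampling budget round-robin across Bernoulli arms,
    then recommend the arm with the highest empirical mean."""
    rng = random.Random(seed)
    n = len(means)
    wins = [0] * n
    pulls = [0] * n
    for t in range(budget):
        arm = t % n  # non-adaptive round-robin allocation
        wins[arm] += rng.random() < means[arm]
        pulls[arm] += 1
    return max(range(n), key=lambda i: wins[i] / max(pulls[i], 1))
```

The interesting behavior PAPRIKA targets is precisely what this baseline lacks: deciding, mid-episode, which arms deserve more of the remaining budget.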

Further experiments using a leave-one-out evaluation demonstrated that the decision-making strategies learned through PAPRIKA could generalize to previously unseen tasks. For example, when the model was trained on all but one group of tasks, it still performed competitively on the omitted group. This finding suggests that the strategies developed through this fine-tuning method are not narrowly tailored to specific tasks but can be transferred across different decision-making scenarios. Moreover, a study involving curriculum learning showed that selectively sampling training tasks according to their difficulty could yield additional improvements, reinforcing the value of a tailored, data-driven approach to task selection.
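The leave-one-out protocol itself is straightforward to express (the group names below are placeholders, not the paper's actual task groups):

```python
def leave_one_out_splits(task_groups):
    """Yield (train_groups, held_out_group) pairs: train on all
    task groups but one, then evaluate on the held-out group."""
    for i, held_out in enumerate(task_groups):
        train = task_groups[:i] + task_groups[i + 1:]
        yield train, held_out
```

Each split tests whether strategies learned on the remaining groups transfer to tasks the model never saw during fine-tuning.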

Conclusion

In summary, PAPRIKA represents a thoughtful and measured approach to bridging the gap between static language understanding and dynamic, sequential decision making. By harnessing synthetic interaction data and employing a carefully designed two-stage fine-tuning process augmented with curriculum learning, CMU researchers have demonstrated that LLMs can be refined into more adaptable decision makers. This method, rather than resorting to task-specific tuning, prepares models to engage in new challenges with minimal additional training.

The capability to interact with external environments, collect pertinent information, and adjust decisions based on feedback is essential for any system designed to operate autonomously. While there remain challenges—such as ensuring a solid starting model and managing the computational costs of synthetic data generation—PAPRIKA offers a promising avenue toward developing more versatile AI systems. Ultimately, as our models continue to advance, approaches like PAPRIKA will be important for creating tools that are not only proficient in language understanding but also capable of navigating complex, real-world decision-making tasks with subtlety and care.


Check out the Paper, GitHub Page and Model on Hugging Face. All credit for this research goes to the researchers of this project.

