MarkTechPost@AI | December 2, 2024
From Wordle to Robotics: Q-SFT Unleashes LLMs’ Potential in Sequential Decision-Making

This article introduces a new algorithm called Q-SFT, which combines reinforcement learning (RL) with large language models (LLMs) to improve LLM performance on tasks that require sequential decision-making, such as robotic control or interactive natural language tasks. Traditional offline RL methods underperform in multi-turn sequential applications; Q-SFT avoids the associated information loss by combining the supervised fine-tuning objective with Q-learning, and it surpasses other methods across multiple benchmarks, including board and word games, web interaction tasks, vision-language models, and robotic manipulation. This research opens up new possibilities for applying LLMs to sequential decision-making tasks.

🤔 **Offline reinforcement learning (offline RL) has limitations in multi-turn sequential tasks:** Because the LLM training objective (predicting likelihoods) differs from RL's Q-learning objective (predicting action values), offline RL loses information when it repurposes the LLM's underlying representations, which hurts performance.

💡 **Q-SFT learns Q-values through supervised fine-tuning:** The algorithm combines the traditional supervised fine-tuning objective with Q-learning, learning probabilities that conservatively estimate the value function. This avoids unstable regression targets while preserving the maximum-likelihood structure from pre-training.

🚀 **Q-SFT performs strongly across diverse tasks:** On benchmarks spanning board and word games, web interaction tasks, vision-language models, and robotic manipulation, Q-SFT outperforms other reinforcement learning and supervised learning methods.

📊 **Q-SFT compared with traditional methods:** Q-SFT combines the strengths of Q-learning and supervised fine-tuning, surpassing prompting-based and supervised fine-tuned LLMs as well as RL methods based on Implicit Language Q-Learning on multiple benchmarks.

🤖 **Q-SFT excels at robotic manipulation:** On robotic manipulation tasks, Q-SFT matches state-of-the-art methods, demonstrating its potential for real-world applications.

Integrating reinforcement learning (RL) with large language models (LLMs) boosts their performance on specialized tasks that require sequential decision-making, such as robotic control or interactive natural language tasks. Offline RL is one such technique in the spotlight today: it learns from static datasets without additional environment interaction. However, despite its utility in single-turn scenarios, offline RL loses ground in multi-turn sequential applications. In practice, policy gradient methods are usually applied to LLMs and VLMs instead, since they sidestep the complexity of value-based RL while achieving similar accuracy. This is puzzling: a technique that guides small models so well should not fail on LLMs, which have far more data to learn from, can adapt dynamically, and therefore have more room for improvement.

Research suggests that the answer to this riddle lies in the building blocks. Offline RL underperforms on LLMs because of a mismatch between the two training objectives: language models are trained to predict token likelihoods, whereas Q-learning in RL aims to predict action values. To reuse the model's underlying representations, offline RL repurposes the learned likelihoods for value regression during fine-tuning, and this manipulation discards information about language, vision, and sequence structure that the LLM has already acquired. Having outlined the inefficiencies of offline RL and its untapped potential with LLMs, we now turn to recent research that proposes a way to mitigate this problem.
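
To make the mismatch concrete, here are the two objectives side by side (a schematic comparison in generic notation, not formulas taken from the paper): pre-training maximizes the log-likelihood of the next token, while standard offline Q-learning fits a freshly initialized scalar head to a Bellman backup target via regression.

```latex
% Pre-training: maximize the likelihood of the next token a given context s
\mathcal{L}_{\mathrm{LM}}(\theta) = -\,\mathbb{E}_{(s,a)\sim\mathcal{D}}\big[\log p_\theta(a \mid s)\big]

% Standard offline Q-learning: regress a scalar value head onto a Bellman backup target
\mathcal{L}_{\mathrm{Q}}(\theta) = \mathbb{E}_{(s,a,r,s')\sim\mathcal{D}}\Big[\big(Q_\theta(s,a) - r - \gamma \max_{a'} \bar{Q}(s',a')\big)^{2}\Big]
```

The first objective is exactly what the pre-trained logits already encode; the second forces a new regression head onto the same representations, which is where the information loss arises.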

Researchers from UC Berkeley present, in their paper "Q-SFT: Q-Learning for Language Models via Supervised Fine-Tuning," a new algorithm that unlocks the potential of RL without diminishing the abilities of the language model. The authors add weights to the traditional supervised fine-tuning objective so that the learned probabilities conservatively estimate the value function rather than the behavior policy. Concretely, they transform the maximum-likelihood loss into a weighted cross-entropy loss whose weights are obtained from the Bellman recurrence. This modification lets them sidestep unstable regression objectives while preserving the maximum-likelihood structure of pre-training, and the resulting method competes head-to-head with state-of-the-art approaches.
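
Schematically, the modified objective is a weighted cross-entropy in which the weight on each observed action token plays the role of a Bellman-style backup target (a sketch of the idea only; the paper's exact weighting and normalization may differ):

```latex
% Schematic Q-SFT objective: weighted cross-entropy over the original token vocabulary,
% with the weight on each observed action acting as a Bellman-style target.
\mathcal{L}_{\mathrm{Q\text{-}SFT}}(\theta) = -\,\mathbb{E}_{(s,a,r,s')\sim\mathcal{D}}\big[\, w(s,a)\,\log p_\theta(a \mid s) \,\big],
\qquad w(s,a) \approx r + \gamma \max_{a'} \bar{p}_{\bar{\theta}}(a' \mid s')
```

Because the loss remains a (weighted) cross-entropy over the same token vocabulary, it can be applied directly to the pre-trained language-model head rather than to a newly initialized value head.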

Q -SFT follows a unique method. Instead of going the conventional way of training value functions by fitting Q-values to their Bellman backup target via a regression loss, authors instead fine-tune directly on the probabilities learned from pre-training with proposed loss to ensure that Q values aren’t left behind. Q-SFT provides a way to learn Q values for multi-turn RL problems via supervised learning without reinitializing weights or new heads to represent Q values. Furthermore, the Maximum Likelihood Function put up by authors could be directly initialized from the logits of a pre-trained LLM or VLM.Q-SFT is superior to other supervised learning-based RL algorithms such as filtered behavior cloning or return-conditioned supervised learning.
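
A minimal PyTorch-style sketch of one such training step, under the assumptions above: the weight is a Bellman-style target computed from a frozen copy of the model, and `lm`, `target_lm`, and the batch field names are hypothetical, not the paper's code.

```python
import torch
import torch.nn.functional as F

GAMMA = 0.99  # discount factor (assumed value)

def q_sft_loss(lm, target_lm, batch):
    """One weighted cross-entropy step: train token probabilities to act as
    conservative Q-value estimates, reusing the pre-trained LM head directly."""
    # Log-probabilities of the action tokens actually taken in the dataset.
    logits = lm(batch["obs_tokens"]).logits[:, -1, :]            # (batch, vocab)
    logp = F.log_softmax(logits, dim=-1)
    logp_taken = logp.gather(1, batch["action_tokens"].unsqueeze(1)).squeeze(1)

    # Bellman-style weight from a frozen target copy: r + gamma * max_a' p(a'|s').
    with torch.no_grad():
        next_logits = target_lm(batch["next_obs_tokens"]).logits[:, -1, :]
        next_p = F.softmax(next_logits, dim=-1)
        w = batch["reward"] + GAMMA * (1.0 - batch["done"]) * next_p.max(dim=-1).values

    # Weighted cross-entropy: higher-value actions receive a larger likelihood weight.
    return -(w * logp_taken).mean()
```

At deployment time, actions can then be chosen greedily from the fine-tuned token probabilities, which (up to the weighting above) track conservative Q-value estimates.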

Q-SFT combines aspects of both Q-learning and supervised fine-tuning, so the authors tested it against state-of-the-art methods from each family to see whether the combined approach could match the best of both. To assess Q-SFT on offline, multi-step sequential tasks, they assembled several benchmarks in which a language model must make sequential decisions. The first set consisted of games from the LMRL benchmark: Chess, Wordle, and Twenty Questions. In all three games, Q-SFT outperformed both the LLM baselines (prompting and SFT) and the RL baseline (Implicit Language Q-Learning). In the next set of tasks, the LLM had to act as an agent and complete interactive, tool-using web tasks, such as purchasing products on WebShop; here too, Q-SFT achieved the highest relative score. To test effectiveness with vision-language models, the authors evaluated the model on ALFWorld, a complex text-based environment with image observations in which the model performs a variety of multi-step tasks. On ALFWorld, Q-SFT came out ahead on 4 of 6 tasks and matched the other methods on the remaining 2. The final task was robotic manipulation, where Q-SFT performed on par with the state of the art.

Conclusion: Q-SFT improves on conventional offline Q-learning by learning Q-values as probabilities, using an objective that closely mirrors supervised fine-tuning. Applied to large language models, Q-SFT outperformed strong supervised and value-based RL baselines, and it matched the state of the art on vision and robotics tasks when integrated with VLMs and robotics transformers.


Check out the Paper. All credit for this research goes to the researchers of this project.
