MarkTechPost@AI · 2 days ago, 10:25
LLMs Struggle to Act on What They Know: Google DeepMind Researchers Use Reinforcement Learning Fine-Tuning to Bridge the Knowing-Doing Gap

Large language models (LLMs) face challenges in decision-making, often exhibiting a "knowing-doing gap" and "greediness". DeepMind researchers used Reinforcement Learning Fine-Tuning (RLFT), with self-generated chains of thought as training signals, to significantly improve LLM decision-making. RLFT ties the model's reasoning to environmental feedback, prompting it to act on its knowledge and narrowing the gap between thinking and acting. Experiments show that RLFT yields clear improvements across environments such as multi-armed bandits and Tic-tac-toe, offering a new path toward more reliable LLM decision-making agents.

🧠 Decision-making flaws in current LLMs: despite strong reasoning abilities, LLMs often falter when it comes to acting, mainly through the "knowing-doing gap" (knowing the right strategy but failing to execute it) and "greediness" (prematurely locking onto high-reward options).

💡 The core RLFT method: the researchers use RLFT with self-generated Chain-of-Thought (CoT) rationales as training signals; by evaluating the reward of the action taken after a given reasoning step, the model learns to prefer decisions that are both logically sound and yield high returns in practice.

📈 RLFT in practice: the method is based on token-level fine-tuning. At each step the model receives the input instruction and the action-reward history and generates a sequence containing its reasoning and the chosen action. Outputs are evaluated using the environment reward and a check on the action format, with a penalty applied when no valid action is produced. The process includes Monte Carlo baseline estimates and generalized advantage estimation, allowing the model to learn from diverse decision sequences.

📊 Experimental results: RLFT markedly improves decision-making in environments such as multi-armed bandits and Tic-tac-toe. In Tic-tac-toe, for example, the win rate of a 2B-parameter model rose from 15% to 75%, and its average return against an optimal Monte Carlo Tree Search agent improved from -0.95 to 0.0. RLFT also effectively reduced frequency bias.

Language models trained on vast internet-scale datasets have become prominent language understanding and generation tools. Their potential extends beyond language tasks to functioning as decision-making agents in interactive environments. When applied to environments requiring action choices, these models are expected to leverage their internal knowledge and reasoning to act effectively. Their ability to consider context, weigh options, and choose actions opens new possibilities for their integration into agentic systems that interact with dynamic environments.

Despite this promise, these models exhibit critical limitations in decision-making. While capable of forming accurate chains of reasoning, they often fail to act upon them. This issue is identified as the knowing-doing gap, where models recognize correct strategies but do not implement them in practice. Another significant concern is greediness, where models repeatedly select high-reward options prematurely, ignoring alternative strategies that could lead to better outcomes. Moreover, smaller models display frequency bias, favoring commonly seen actions regardless of reward, which impairs exploration and hinders learning from diverse scenarios.

To address these challenges, researchers have experimented with various strategies. Traditional reinforcement learning methods, including bandit algorithms like the Upper-Confidence Bound (UCB), aim to manage exploration-exploitation trade-offs. In contrast, in-context learning and behavior cloning imitate expert trajectories but often reinforce the same decision biases. While some exploration strategies have improved performance marginally, these approaches lack a mechanism to convert internal reasoning into optimal action reliably, especially in complex or stochastic environments.
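For context, the UCB rule mentioned above can be written in a few lines. The sketch below is a generic UCB1 implementation for a multi-armed bandit, not code from the paper; the arm count, exploration constant, and the toy reward simulator are illustrative assumptions.

```python
import math
import random

def ucb1_bandit(pull_arm, n_arms=10, n_steps=1000, c=2.0):
    """Generic UCB1: try each arm once, then pick the arm with the highest
    mean reward plus an exploration bonus that shrinks as the arm is
    sampled more often."""
    counts = [0] * n_arms      # times each arm was pulled
    values = [0.0] * n_arms    # running mean reward per arm
    for t in range(1, n_steps + 1):
        if t <= n_arms:        # initialization: pull every arm once
            arm = t - 1
        else:
            arm = max(
                range(n_arms),
                key=lambda a: values[a] + math.sqrt(c * math.log(t) / counts[a]),
            )
        reward = pull_arm(arm)
        counts[arm] += 1
        values[arm] += (reward - values[arm]) / counts[arm]  # incremental mean

    return values, counts

# Toy usage with a synthetic 10-armed bandit (illustrative only).
true_means = [random.random() for _ in range(10)]
values, counts = ucb1_bandit(lambda a: random.gauss(true_means[a], 0.1))
```

Unlike such hand-coded exploration rules, the RLFT approach below operates directly on the model's own reasoning and action outputs.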

Researchers from Google DeepMind and the LIT AI Lab at JKU Linz focused on refining language model behavior through Reinforcement Learning Fine-Tuning (RLFT). Their approach employs self-generated Chain-of-Thought (CoT) rationales as training signals. By evaluating the rewards of actions following specific reasoning steps, the model learns to favor decisions that sound logical and yield high returns in practice. This reinforcement links model reasoning to environmental feedback, promoting improved decision alignment and reducing gaps between thought and behavior.
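As a rough illustration of how a single rollout in this setup might be structured, the sketch below builds a prompt from the instruction and the action-reward history, asks the model for a CoT rationale plus an action, and records the environment reward for later fine-tuning. The `generate` and `env_step` callables, the prompt wording, and the action format are placeholders, not the paper's actual interfaces.

```python
from dataclasses import dataclass

@dataclass
class Rollout:
    prompt: str
    completion: str      # CoT rationale followed by the chosen action
    action: int | None   # parsed action, or None if the format was invalid
    reward: float

def build_prompt(instruction: str, history: list[tuple[int, float]]) -> str:
    # The model conditions on the task instruction plus the recent
    # action-reward history, mirroring the in-context setup described above.
    lines = [instruction]
    for action, reward in history:
        lines.append(f"action: {action}, reward: {reward}")
    lines.append("Think step by step, then finish with 'Action: <id>'.")
    return "\n".join(lines)

def parse_action(completion: str) -> int | None:
    # Extract the chosen action; a missing or malformed action is penalized
    # during fine-tuning (see the reward-shaping sketch below).
    for line in reversed(completion.splitlines()):
        if line.strip().lower().startswith("action:"):
            try:
                return int(line.split(":", 1)[1])
            except ValueError:
                return None
    return None

def collect_rollout(generate, env_step, instruction, history) -> Rollout:
    # `generate` (prompt -> text) and `env_step` (action -> reward) stand in
    # for an LLM call and an environment step, respectively.
    prompt = build_prompt(instruction, history)
    completion = generate(prompt)
    action = parse_action(completion)
    reward = env_step(action) if action is not None else 0.0
    return Rollout(prompt, completion, action, reward)
```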

The methodology centers on token-based fine-tuning using environment interactions. At each step, the model receives an input instruction and a recent action-reward history, and it generates a sequence containing the rationale and the selected action. These outputs are evaluated based on environmental rewards and whether the action conforms to the desired format. A penalty is applied when the model fails to generate a valid action. Over time, reward shaping encourages consistent output formatting while preserving exploration. The process includes Monte Carlo baseline estimates and generalized advantage estimation for variable-length tasks like Tic-tac-toe, allowing the model to learn from diverse decision sequences.
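A minimal sketch of the reward shaping and advantage computation described above could look like the following; the penalty value, the batch-mean baseline, and the discount settings are illustrative assumptions rather than the paper's exact hyperparameters.

```python
import numpy as np

FORMAT_PENALTY = -1.0  # assumed penalty when no valid action is produced

def shaped_reward(env_reward: float, action_is_valid: bool) -> float:
    # Environment reward plus a penalty that discourages malformed outputs
    # while still leaving room for exploration.
    return env_reward if action_is_valid else env_reward + FORMAT_PENALTY

def monte_carlo_advantages(returns: list[float]) -> np.ndarray:
    """Short tasks (e.g. bandit steps): advantage = return minus a
    Monte Carlo baseline, here simply the batch mean."""
    returns = np.asarray(returns, dtype=np.float64)
    return returns - returns.mean()

def gae(rewards, values, gamma=0.99, lam=0.95) -> np.ndarray:
    """Generalized Advantage Estimation for variable-length episodes
    such as Tic-tac-toe, given per-step rewards and value estimates."""
    rewards = np.asarray(rewards, dtype=np.float64)
    values = np.asarray(values, dtype=np.float64)
    adv = np.zeros_like(rewards)
    next_value, running = 0.0, 0.0   # terminal state has zero value
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_value - values[t]
        running = delta + gamma * lam * running
        adv[t] = running
        next_value = values[t]
    return adv
```

In a setup like this, the resulting advantages would scale a policy-gradient loss on the log-probabilities of the generated rationale-and-action tokens, which is what links the reasoning text to the environment's feedback.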

Performance results show that RLFT considerably improves the model’s decision-making abilities. In a button-based multi-armed bandit setting with 10 arms, the action coverage for a 2B parameter model increased from 40% to over 52% after 30,000 gradient updates. In environments with 20 choices, coverage remained suboptimal but showed meaningful improvement. The frequency bias in the 2B model decreased from 70% to 35% in early repetitions after RLFT. Moreover, in Tic-tac-toe, the 2B model’s win rate against a random opponent rose from 15% to 75%, and against an optimal Monte Carlo Tree Search agent the model learned to consistently draw, with its average return improving from -0.95 to 0.0. Furthermore, larger models like the 27B variant generated correct rationales 87% of the time, yet chose the optimal action only 21% of the time without RLFT. This gap was significantly reduced after fine-tuning.
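The coverage and repetition numbers reported above can be thought of as simple statistics over the actions an agent chooses. The sketch below computes two such statistics from a log of chosen arms; the exact metric definitions in the paper may differ, so treat this as an illustrative proxy only.

```python
def action_coverage(chosen_arms: list[int], n_arms: int) -> float:
    """Fraction of available arms the agent tried at least once."""
    return len(set(chosen_arms)) / n_arms

def repeat_rate(chosen_arms: list[int]) -> float:
    """Share of steps that simply repeat the previous action: a rough
    proxy for the greediness / frequency bias discussed above."""
    repeats = sum(a == b for a, b in zip(chosen_arms, chosen_arms[1:]))
    return repeats / max(len(chosen_arms) - 1, 1)

# Example: an agent stuck on arm 3 for most of a 10-armed bandit episode.
log = [3, 3, 3, 1, 3, 3, 7, 3, 3, 3]
print(action_coverage(log, n_arms=10))  # 0.3
print(repeat_rate(log))                 # ~0.56
```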

The research shows that refining large language models through reinforcement on their reasoning processes enhances their ability to act according to their knowledge. This connection between thought and action is vital in creating reliable decision-making agents. The proposed method offers a practical path forward for developing more capable and autonomous LLM-based agents by directly addressing common decision errors and reinforcing successful behaviors.


Check out the paper for details. All credit for this research goes to the researchers of this project.
