少点错误 · December 8, 2024
RL, but don't do anything I wouldn't do

This post examines the problem of controlling agent behavior in reinforcement learning. When the agent's reward differs from the designers' true utility, KL regularization to a trusted policy may no longer be a reliable constraint. The paper demonstrates this both theoretically and empirically, and proposes an alternative principle, though that principle is computationally difficult.

🎯 Controlling agent behavior in reinforcement learning is problematic: the KL-regularization constraint is unreliable

📚 Demonstrated in theory and in practice; RL fine-tuning of a language model yields supporting evidence

💡 Proposes the "Don't do anything I mightn't do" principle, which remains computationally difficult

Published on December 7, 2024 10:54 PM GMT

by Michael K. Cohen, Marcus Hutter, Yoshua Bengio, Stuart Russell

Abstract:

In reinforcement learning, if the agent's reward differs from the designers' true utility, even only rarely, the state distribution resulting from the agent's policy can be very bad, in theory and in practice. When RL policies would devolve into undesired behavior, a common countermeasure is KL regularization to a trusted policy ("Don't do anything I wouldn't do"). All current cutting-edge language models are RL agents that are KL-regularized to a "base policy" that is purely predictive. Unfortunately, we demonstrate that when this base policy is a Bayesian predictive model of a trusted policy, the KL constraint is no longer reliable for controlling the behavior of an advanced RL agent. We demonstrate this theoretically using algorithmic information theory, and while systems today are too weak to exhibit this theorized failure precisely, we RL-finetune a language model and find evidence that our formal results are plausibly relevant in practice. We also propose a theoretical alternative that avoids this problem by replacing the "Don't do anything I wouldn't do" principle with "Don't do anything I mightn't do".
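To make the setup concrete, here is a minimal sketch (my own illustration, not code from the paper) of the KL-shaped reward that KL-regularized RL fine-tuning of a language model typically optimizes: the task reward minus β times the log-ratio between the learned policy and the frozen base policy, whose expectation is the KL penalty the abstract refers to. The function name `kl_shaped_reward` and the numbers are illustrative.

```python
# Illustrative sketch of the KL-regularized fine-tuning objective
# ("Don't do anything I wouldn't do"): the agent's effective reward is
# the task reward minus beta * (log pi(a|s) - log pi_base(a|s)),
# whose expectation under pi is beta * KL(pi || pi_base).

def kl_shaped_reward(task_reward, logprob_policy, logprob_base, beta=0.1):
    """Effective reward optimized by a KL-regularized RL agent."""
    kl_penalty = logprob_policy - logprob_base  # per-action log-ratio
    return task_reward - beta * kl_penalty

# A completion the base policy finds unlikely is penalized, but a large
# enough task reward can still make it worth taking.
print(kl_shaped_reward(task_reward=2.0,
                       logprob_policy=-1.0,
                       logprob_base=-6.0))  # 2.0 - 0.1 * 5.0 = 1.5
```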

 

The "Don't do anything I wouldn't do" principle fails because Bayesian models allow unlikely actions in uncertain settings, which RL agents will exploit. KL regularization keeps policies near the base model but doesn’t guarantee alignment with the trusted policy, especially as data and capability grows.

The paper instead proposes the "Don't do anything I mightn't do" principle, based on Cohen et al.'s (2022a) active imitation model, in which the imitator explicitly asks for help whenever it is uncertain what the demonstrator would do. Unlike Bayesian prediction, this active-imitation approach keeps the policy away from actions it cannot justify as trusted behavior, with formally bounded error. Unfortunately, it remains computationally intractable so far and requires approximations.
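A toy sketch of that alternative principle (again my own cartoon, not Cohen et al.'s (2022a) algorithm): the agent acts only if every hypothesis about the demonstrator that still has non-negligible posterior weight would plausibly take the proposed action; otherwise it asks for help. The posterior weights, the thresholds `POSTERIOR_CUTOFF` and `ACTION_CUTOFF`, and the `act_or_ask` helper are hypothetical.

```python
import numpy as np

# Hypothetical posterior over three models of the demonstrator's policy
# in the current state (rows: hypotheses, columns: action probabilities).
posterior = np.array([0.55, 0.35, 0.10])
hypothesis_policies = np.array([
    [0.6, 0.4, 0.0],   # hypothesis A: never takes action 2
    [0.7, 0.3, 0.0],   # hypothesis B: never takes action 2
    [0.5, 0.3, 0.2],   # hypothesis C: sometimes takes action 2
])

POSTERIOR_CUTOFF = 0.05   # hypotheses below this weight are ignored
ACTION_CUTOFF = 1e-3      # an action counts as "possible" above this prob.

def act_or_ask(proposed_action):
    """Act only if every plausible hypothesis about the demonstrator
    assigns the action non-negligible probability; otherwise defer."""
    plausible = hypothesis_policies[posterior > POSTERIOR_CUTOFF]
    if np.all(plausible[:, proposed_action] > ACTION_CUTOFF):
        return f"take action {proposed_action}"
    return "ask the demonstrator for help"

print(act_or_ask(1))  # all plausible hypotheses allow it -> act
print(act_or_ask(2))  # hypotheses A and B rule it out -> ask for help
```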


