How load-bearing is KL divergence from a known-good base model in modern RL?

This post examines how optimizing an objective function in reinforcement learning (RL) can produce "unexpected" solutions, and reviews the history of policy optimization objective functions. It notes that early game-playing models were prone to extreme, Outcome-Pump-like behavior, while modern LLMs do noticeably better on this front. Walking through REINFORCE, Actor-Critic, Natural Policy Gradient, TRPO, PPO, DPO, and related algorithms, it identifies a trend in RL: when updating the policy, distinguish good actions from bad ones relative to a baseline, and limit how much the policy can change in a single step, which lowers the propensity for unexpected behavior. It closes with questions about where RL models are headed and the associated safety risks.

💥 Early game-playing models could, like the hypothetical Outcome Pump, pursue a goal through extreme means the human designer never intended, such as blowing up an entire building in order to get a mother out of a fire.

📈 The trend in policy optimization objective functions is to distinguish good actions from bad ones, evaluate them against a baseline, and limit how much the policy changes in a single update, which lowers the model's propensity for unexpected behavior.

🎯 Modern RL algorithms such as PPO and DPO use mechanisms like KL divergence penalties and clipped objectives to prevent excessively large policy changes, improving stability and reducing the risk of unexpected behavior.

💡 The post stresses that while current trends in RL research lower the probability of unexpected model behavior, they do not eliminate the risk entirely, so RL safety still deserves attention.

Published on May 22, 2025 12:08 PM GMT

Motivation

One major risk from powerful optimizers is that they can find "unexpected" solutions to the objective function, which score very well on it but are not what the human designer intended. The canonical example is the Outcome Pump:

Suppose your aged mother is trapped in a burning building, and it so happens that you're in a wheelchair; you can't rush in yourself. You could cry, "Get my mother out of that building!" but there would be no one to hear.

Luckily you have, in your pocket, an Outcome Pump. This handy device squeezes the flow of time, pouring probability into some outcomes, draining it from others.

[...]

So you desperately yank the Outcome Pump from your pocket - your mother is still trapped in the burning building, remember? - and try to describe your goal: get your mother out of the building!

[...]

BOOM! With a thundering roar, the gas main under the building explodes. As the structure comes apart, in what seems like slow motion, you glimpse your mother's shattered body being hurled high into the air, traveling fast, rapidly increasing its distance from the former center of the building.

It seems to me that this kind of failure mode was very common in the game-playing models of the 2010s, and pretty uncommon in the LLM era. The obvious next question is whether this is a problem that is mitigated with scale, mitigated by training to imitate humans, or mitigated by the specific objective functions that are used in modern RL.

A Janky History of Policy Optimization Objective Functions

As such, I've been looking at historical policy optimization papers, specifically at the objective functions they introduced and how each differed from the previous state of the art. In order, I examined:

- REINFORCE (Williams 1992)
  - The performance gradient is approximated as the log-probability gradient of the chosen action times the observed reward.
  - Optionally, a static or state-dependent baseline can be subtracted from the reward to reduce variance.
  - Important because it gave a continuous policy gradient objective function that could be used with any differentiable policy. (A minimal code sketch of this objective appears after this list.)
- Actor-Critic (introduced in 1983, improved convergence proofs in 1998)
  - Like REINFORCE, but instead of using raw rewards (or raw rewards minus a static baseline), it uses an "advantage function", which measures how much higher the expected discounted return (the sum of time-discounted rewards over all future timesteps) is than the expected discounted return of a baseline policy. The advantage function is estimated using a learned value function, which is trained to predict the expected discounted return of the current policy. (See the advantage-estimate sketch after this list.)
- Natural Policy Gradient (2001)
  - Introduces the idea of using the Fisher information matrix to define a "natural" gradient that is more efficient than the standard gradient. Instead of taking steps which have a consistent effect on the policy parameters, take steps in the direction of the natural gradient, which has a consistent effect on the policy distribution. This fixes the issue that small changes in parameters can have arbitrarily large impacts on the policy distribution.
  - I think this is the first paper where I saw something trying to limit the size of the policy update in terms of the change in the probability distribution over actions, which became a theme in later papers.
- Trust Region Policy Optimization (2015)
  - Policy updates are constrained to a "trust region", defined as the set of policies that are close to the current policy in terms of KL divergence. If I'm reading correctly, the KL divergence restriction is implemented by solving a constrained optimization problem, rather than as a term in an objective function.
- Proximal Policy Optimization (2017)
  - Instead of solving a constrained optimization problem, PPO uses a clipped surrogate objective that prevents large policy updates by clipping the probability ratio between the new and old policies. The main win over TRPO seems to be ease of implementation. (A sketch of the clipped objective appears after this list.)
- Direct Preference Optimization (2023)
  - Takes a preference ordering between two responses (call the winner y+ and the loser y-) and shifts probability mass from y- to y+ in the policy, with the size of the update scaled by how surprised the model was by the outcome. Apparently this works as an (implicit) KL divergence penalty, i.e. it minimizes the size of the shift in the overall probability distribution of the policy - basically the objective function "wants" to shift probability mass from y- to y+ without touching anything else. The penalty is relative to a fixed reference policy (in practice, the supervised fine-tuned model training starts from), not just the immediate pre-update policy. I don't really understand how this works, but the authors say it's true and they have a lot of math in their paper, so it's probably true. (A sketch of the DPO loss appears after this list.)
  - This approach replaces the requirement for an explicit reward signal with only needing a preference signal.
- Group-Relative Policy Optimization (2024)
  - Takes an entire batch of candidate responses to the same prompt, scores them all, and computes each one's advantage relative to the group mean, then does a PPO-style update (i.e. the same clipped surrogate objective) to shift probability mass from the losers to the winners. Because it inherits the PPO-style update, it inherits PPO's stability guarantees that prevent large policy updates in a single step. (A sketch of the group-relative advantage appears after this list.)
  - Separately, this paper also adds a KL divergence penalty relative to a reference model.
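
To make the REINFORCE objective concrete, here is a minimal PyTorch sketch of the score-function estimator with an optional baseline. This is my own illustration rather than code from the paper, and the tensor names are assumptions.

```python
import torch

def reinforce_loss(log_probs, rewards, baseline=None):
    """REINFORCE surrogate loss (to be minimized with gradient descent).

    log_probs: log pi(a_t | s_t) of the actions actually taken, shape [T]
    rewards:   observed rewards/returns for those actions, shape [T]
    baseline:  optional static or state-dependent baseline, subtracted from
               the reward purely to reduce variance (it does not bias the gradient)
    """
    weights = rewards if baseline is None else rewards - baseline
    # Differentiating this gives E[ grad log pi(a|s) * (R - b) ],
    # the policy gradient estimator from Williams (1992).
    return -(log_probs * weights.detach()).mean()
```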
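
Similarly, a minimal sketch of the actor-critic idea: replace the raw reward with a one-step advantage estimate computed from a learned critic. The names and the omitted critic-training loss are my assumptions; real implementations also handle episode termination and often use multi-step or GAE estimates.

```python
import torch

def td_advantage(rewards, values, next_values, gamma=0.99):
    """One-step temporal-difference advantage estimate.

    A(s_t, a_t) ~= r_t + gamma * V(s_{t+1}) - V(s_t):
    how much better this step turned out than the critic's baseline
    expectation for the state.
    """
    return rewards + gamma * next_values - values

def actor_loss(log_probs, advantages):
    """Same form as the REINFORCE loss, but weighted by the estimated advantage."""
    return -(log_probs * advantages.detach()).mean()
```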
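
And here is a sketch of the PPO clipped surrogate objective, which is where "don't move too far from the pre-update policy in a single step" shows up directly in the loss (eps = 0.2 is a commonly used value, not something this post depends on):

```python
import torch

def ppo_clipped_loss(new_log_probs, old_log_probs, advantages, eps=0.2):
    """PPO clipped surrogate objective (to be minimized).

    ratio = pi_new(a|s) / pi_old(a|s). Clipping the ratio to [1 - eps, 1 + eps]
    removes any incentive to push the policy further than that from pi_old in a
    single update, which bounds how far one step can move the action distribution.
    """
    ratio = torch.exp(new_log_probs - old_log_probs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    # Pessimistic (elementwise minimum) bound, negated for gradient descent.
    return -torch.min(unclipped, clipped).mean()
```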
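
For DPO, the whole update is a single supervised-looking loss over the preference pair. This is a sketch of the published loss as I read it, with the log-probabilities assumed to be summed over each full response.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Direct Preference Optimization loss for one (y+, y-) pair.

    Each argument is the total log-probability of a full response under either
    the policy being trained or the frozen reference model. beta sets the
    strength of the implicit KL penalty toward the reference policy.
    """
    chosen_margin = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_margin = beta * (policy_rejected_logp - ref_rejected_logp)
    # Loss is low when the policy prefers y+ over y- by more than the reference
    # model does; the reference terms anchor the update so the policy is not
    # rewarded for drifting far from the reference overall.
    return -F.logsigmoid(chosen_margin - rejected_margin).mean()
```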
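
Finally, the group-relative part of GRPO is just the advantage computation; the resulting advantages feed into the same clipped surrogate as in the PPO sketch above, alongside an explicit KL penalty to a reference model. Again, a sketch under my reading, not the paper's code.

```python
import torch

def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages for a group of responses to the same prompt.

    rewards: shape [G], one scalar reward per sampled response.
    Each response is scored against the rest of the group: above-average
    responses get positive advantage, below-average ones get negative,
    with no learned value function needed.
    """
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: four sampled responses to one prompt, scored by a reward model.
advantages = group_relative_advantages(torch.tensor([1.0, 0.2, 0.7, 0.1]))
```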

(Over)Fitting on the Trend

So overall, my uninformed takeaways are that, for RL:

- You want to figure out which actions are good and which are bad, and then shift probability mass from the bad ones to the good ones.
- But compute that badness/goodness relative to the baseline you would expect in that situation (where "baseline" could be a static baseline, a learned model of expected value, or the average outcome of a group of trials).
- And also make sure that the policy doesn't change too much in a single update, because the gradient only tells you about your local neighborhood in the loss landscape.
- And the size of the policy change should be measured in terms of the shift in the probability distribution over actions, rather than the shift in the parameters of the policy (see the toy example after this list).
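
As a toy illustration of that last point (my own example, not from any of the papers above): the same parameter-space step can move the action distribution by wildly different amounts, which is why the later methods measure update size as the KL divergence between the old and new action distributions rather than as a distance in parameter space.

```python
import torch

def bernoulli_kl(p, q):
    """KL divergence between Bernoulli(p) and Bernoulli(q)."""
    return p * torch.log(p / q) + (1 - p) * torch.log((1 - p) / (1 - q))

# A one-parameter policy over two actions: P(action = 1) = sigmoid(theta).
# An identical step of +0.5 in parameter space barely moves the action
# distribution when theta = 10, but moves it substantially when theta = 0.
for theta in (0.0, 10.0):
    old_p = torch.sigmoid(torch.tensor(theta))
    new_p = torch.sigmoid(torch.tensor(theta + 0.5))
    print(f"theta = {theta:4.1f}: KL(old || new) = {bernoulli_kl(old_p, new_p).item():.6f}")
```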

If that's the way the trend is heading, that seems to me like it lowers the propensity of models to fall into failure modes of the form "the model does something extremely unexpected which leads to high anticipated reward". Concretely, if the model is trained that "gently remove grandma from the burning building through the front door" is a good action, and "leave grandma in the burning building" is a bad action, then the model will generally not do something like "explode the building to rapidly remove grandma from the burning building", even if that action is expected to lead to a high reward, unless the model has successfully executed similar strategies in the past.

It still seems that "minimal change in probability distribution over actions" is not exactly what we want - we really want policy updates to minimize the change in the probability distribution over outcomes - but it looks to me like the "natural" trend of capabilities-driven RL research is already heading in this direction.

It is also important to note that "lowers the propensity" is not the same thing as "sets the propensity to zero", but if we are getting improvements to an important safety property "for free" out of existing capabilities research, that does seem like an important thing to be aware of.

The Actual Question

- Should we expect the trend of RL models being less like outcome pumps and more like agents which execute unsurprising actions to hold?
- If this trend breaks, is that concerning from a safety perspective?
- What would some early warning signs be that this trend is breaking?

