Published on May 22, 2025 12:08 PM GMT
Motivation
One major risk from powerful optimizers is that they can find "unexpected" solutions to the objective function - solutions which score very well on the objective but are not what the human designer intended. The canonical example is Eliezer Yudkowsky's Outcome Pump:
Suppose your aged mother is trapped in a burning building, and it so happens that you're in a wheelchair; you can't rush in yourself. You could cry, "Get my mother out of that building!" but there would be no one to hear.
Luckily you have, in your pocket, an Outcome Pump. This handy device squeezes the flow of time, pouring probability into some outcomes, draining it from others.
[...]
So you desperately yank the Outcome Pump from your pocket - your mother is still trapped in the burning building, remember? - and try to describe your goal: get your mother out of the building!
[...]
BOOM! With a thundering roar, the gas main under the building explodes. As the structure comes apart, in what seems like slow motion, you glimpse your mother's shattered body being hurled high into the air, traveling fast, rapidly increasing its distance from the former center of the building.
It seems to me that this kind of failure mode was very common in the game-playing models of the 2010s, and is pretty uncommon in the LLM era. The obvious next question is whether this is a problem that is mitigated by scale, by training to imitate humans, or by the specific objective functions used in modern RL.
A Janky History of Policy Optimization Objective Functions
As such, I've been looking at historical policy optimization papers, and specifically at the objective functions they introduced and how each differed from the previous state of the art. In order, I examined:
- REINFORCE (Williams 1992)
  - The performance gradient is approximated as the log-probability gradient of the chosen action, multiplied by the observed reward.
  - Optionally, a static or state-dependent baseline can be subtracted from the reward to reduce variance.
  - Important because it gave a continuous policy gradient objective function that could be used with any differentiable policy. (This and several of the later objectives are sketched in code after this list.)
- Actor-Critic (introduced in 1983, improved convergence proofs in 1998)
  - Like REINFORCE, but instead of using raw rewards (or raw rewards minus a static baseline), it uses an "advantage function", which measures how much higher the expected discounted return (the sum of time-discounted rewards over all future timesteps) of the chosen action is than the expected discounted return of simply following the current policy from that state.
  - The advantage function is estimated using a learned value function, which is trained to predict the expected discounted return of the current policy.
- Natural Policy Gradient (2001)
  - Introduces the idea of using the Fisher information matrix to define a "natural" gradient that is more efficient than the standard gradient. Instead of taking steps which have a consistent effect on the policy parameters, take steps in the direction of the natural gradient, which have a consistent effect on the policy distribution. This fixes the issue that small changes in parameters can have arbitrarily large impacts on the policy distribution.
  - I think this is the first paper where I saw something trying to limit the size of the policy update in terms of the change in the probability distribution over actions, which became a theme in later papers.
- Trust Region Policy Optimization (2015)
  - Policy updates are constrained to a "trust region", defined as the set of policies that are close to the current policy in terms of KL divergence. If I'm reading correctly, the KL divergence restriction is implemented by solving a constrained optimization problem, rather than as a term in the objective function.
- Proximal Policy Optimization (2017)
  - Instead of solving a constrained optimization problem, PPO uses a clipped surrogate objective that prevents large policy updates by clipping the probability ratio between the new and old policies. The main win over TRPO seems to be ease of implementation.
- Direct Preference Optimization (2023)
  - Takes a preference ordering between two responses (call the winner y+ and the loser y-) and shifts probability mass from y- to y+ in the policy, with the size of the update scaled by how surprised the model was by the outcome. Apparently this works as an implicit KL divergence penalty (i.e. it minimizes the size of the shift in the policy's overall probability distribution - basically the objective function "wants" to shift probability mass from y- to y+ without touching anything else). The penalty is relative to a fixed reference policy (the model you start from), rather than to the policy immediately before each update. I don't really understand how this works, but the authors say it's true and they have a lot of math in their paper, so it's probably true.
  - This approach replaces the requirement for an explicit reward signal with only needing a preference signal.
- Group Relative Policy Optimization (2024)
  - Takes an entire group of candidate responses to the same prompt, scores them all, and computes each one's advantage relative to the group mean, then does a PPO-style update (i.e. the same clipped surrogate objective) to shift probability mass from the losers to the winners. Because it inherits the PPO-style update, it inherits the stability guarantees that prevent large policy updates in a single step in PPO.
  - Separately, this paper also adds a KL divergence penalty relative to a reference model.
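To make the REINFORCE and actor-critic entries concrete, here is a minimal sketch of the surrogate losses whose gradients those methods follow. It assumes you already have per-action log-probabilities from some differentiable policy; the function names and toy numbers are mine, not from the original papers.

```python
import numpy as np

def reinforce_loss(logp, rewards, baseline=0.0):
    """Surrogate loss whose gradient is the REINFORCE estimator:
    -E[(R - b) * grad log pi(a|s)]. Subtracting a baseline b (static or
    state-dependent) reduces variance without biasing the gradient."""
    return float(-np.mean((np.asarray(rewards) - baseline) * np.asarray(logp)))

def actor_critic_loss(logp, returns, value_estimates):
    """Actor-critic variant: replace the raw reward with an advantage,
    i.e. the observed discounted return minus a learned value function's
    estimate of the expected return from that state."""
    advantages = np.asarray(returns) - np.asarray(value_estimates)
    return float(-np.mean(np.asarray(logp) * advantages))

# Toy example: three sampled actions, their log-probs, and their outcomes.
logp = np.log([0.5, 0.3, 0.2])
returns = np.array([1.0, 0.0, 2.0])
values = np.array([0.8, 0.9, 1.1])  # the critic's predictions for each state
print(reinforce_loss(logp, returns, baseline=returns.mean()))
print(actor_critic_loss(logp, returns, values))
```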
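Likewise, a rough sketch of PPO's clipped surrogate objective, and of GRPO's group-relative advantages feeding into it. Again the names and numbers are my own, and I'm omitting things a real implementation would have (entropy bonus, per-token bookkeeping, the KL term against the reference model):

```python
import numpy as np

def ppo_clipped_loss(logp_new, logp_old, advantages, eps=0.2):
    """PPO: clip the probability ratio pi_new / pi_old to [1-eps, 1+eps] so
    that a single update cannot move the policy far from the old policy."""
    ratio = np.exp(np.asarray(logp_new) - np.asarray(logp_old))
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantages
    return float(-np.mean(np.minimum(unclipped, clipped)))

def group_relative_advantages(group_rewards):
    """GRPO: score a whole group of responses to the same prompt and compute
    each one's advantage relative to the group, replacing the learned value
    function with simple normalization against the group mean."""
    r = np.asarray(group_rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

# Toy example: four sampled responses to one prompt, scored by a reward model.
rewards = [1.0, 0.2, 0.7, 0.1]
adv = group_relative_advantages(rewards)
logp_old = np.log([0.25, 0.25, 0.25, 0.25])
logp_new = np.log([0.30, 0.22, 0.28, 0.20])
print(ppo_clipped_loss(logp_new, logp_old, adv))
```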
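And a sketch of the DPO loss, which needs only the log-probabilities of the winning and losing responses under the current policy and under a frozen reference policy, with no explicit reward model. The parameter beta (my choice of default here) controls the strength of the implicit KL penalty:

```python
import numpy as np

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO: increase the probability of the preferred response y+ relative to
    the dispreferred response y-, weighted by how wrong the implicit reward
    model currently is, anchored to a fixed reference policy."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log(sigmoid(margin)), written stably as log(1 + exp(-margin)).
    return float(np.mean(np.logaddexp(0.0, -margin)))

# Toy example: total sequence log-probs of the winner and loser under the
# current policy and under the frozen reference model.
print(dpo_loss(logp_w=np.array([-12.0]), logp_l=np.array([-11.0]),
               ref_logp_w=np.array([-12.5]), ref_logp_l=np.array([-11.5])))
```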
(Over)Fitting on the Trend
So overall, my uninformed takeaways for RL are:
- You want to figure out which actions are good and which are bad, and then shift probability mass from the bad ones to the good ones.
- But compute that badness/goodness relative to the baseline you would expect in that situation (where "baseline" could be a static baseline, a learned model of expected value, or the average outcome of a group of trials).
- And also make sure that the policy doesn't change too much in a single update, because the gradient only tells you about your local neighborhood in the loss landscape.
- And the size of the policy change should be measured in terms of the shift in the probability distribution over actions, rather than the shift in the parameters of the policy (see the sketch below).
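As an illustration of that last point (my own toy example, not taken from any of the papers above): with a two-action softmax policy, the same parameter-space step can move the action distribution a lot or barely at all depending on where you start, which is exactly the problem the natural gradient / trust region / clipping machinery addresses.

```python
import numpy as np

def policy(theta):
    """A two-action softmax policy with a single logit parameter theta."""
    p = 1.0 / (1.0 + np.exp(-theta))
    return np.array([p, 1.0 - p])

def kl(p, q):
    """KL divergence KL(p || q) between two discrete distributions."""
    return float(np.sum(p * np.log(p / q)))

step = 0.5  # identical parameter-space step in both cases
for theta in (0.0, 5.0):
    old, new = policy(theta), policy(theta + step)
    print(f"theta={theta}: KL(old || new) = {kl(old, new):.4f}")
# The step from theta=0 shifts the distribution roughly 40x more (in KL)
# than the same-sized step from theta=5.
```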
If that's the way the trend is heading, it seems to me to lower the propensity of models to fall into failure modes of the form "the model does something extremely unexpected which leads to high anticipated reward". Concretely, if the model is trained that "gently remove grandma from the burning building through the front door" is a good action and "leave grandma in the burning building" is a bad action, then it will generally not do something like "explode the building to rapidly remove grandma from the burning building", even if that action is expected to lead to high reward, unless the model has successfully executed similar strategies in the past.
It does seem that "minimal change in the probability distribution over actions" is still not exactly what we want - what we really want is for policy updates to minimize the change in the probability distribution over outcomes - but it looks to me like the "natural" trend of capabilities-driven RL research is already heading in this direction.
It is also important to note that "lowers the propensity" is not the same thing as "sets the propensity to zero", but if we are getting improvements to an important safety property "for free" out of existing capabilities research, that seems worth being aware of.
The Actual Question
- Should we expect the trend of RL models being less like outcome pumps and more like agents which execute unsurprising actions to hold?
- If this trend breaks, is that concerning from a safety perspective?
- What would some early warning signs be that this trend is breaking?