The Perils of Optimizing Learned Reward Functions

This post examines the challenges that reward learning faces when aligning large language models (LLMs). It analyzes reward learning techniques from a mathematical angle, focusing in particular on how the data distribution affects the learned reward model. The analysis finds that real-world data distributions are often not "safe", which offers one explanation for the shortcomings of existing reward learning methods such as RLHF. The results suggest that deploying high-stakes AI systems requires stronger safety guarantees.

🤔 Reward learning learns a reward function from data, which is useful for training AI systems, especially when specifying a reward function manually is difficult.

⚠️ Despite their widespread use, techniques like RLHF are far from perfect. Known issues show that correctly aligning LLMs with human values remains challenging.

🧐 The work analyzes reward learning mathematically, asking under which conditions reward learning plus policy optimization guarantees a policy aligned with human values.

🚫 The results show that for many reasonable safety settings, worst-case safety guarantees are impossible, which helps explain hard-to-fix failure modes such as jailbreaks.

Published on July 11, 2025 4:06 PM GMT

This post contains a summary of our paper which will be presented at ICML 2025. Feel free to visit me (Lukas) at our poster stand to chat about our work. More info about the time and location can be found here.

TL;DR

Reward learning techniques like RLHF are still the go-to method for aligning current state-of-the-art large language models (LLMs). However, these techniques are by no means perfect. In particular, the discovery of jailbreaks and related issues shows that correctly aligning LLMs to human values remains a hard-to-solve problem.

Motivated by these issues, we perform a mathematical analysis of reward learning techniques, and focus on figuring out when a data distribution used to train a reward model will produce a reward model whose optimal policy has provably low regret.

Our analysis shows that, for many reward learning techniques, large classes of realistic data distributions are not worst-case safe, thereby providing a potential explanation for the observed shortcomings of current reward learning techniques such as RLHF.

Epistemic Status: Our paper does a worst-case analysis for the regret of a policy when a reward model is learned up to a given loss. Deep learning is known for performing better than the worst-case in many settings, and so this should also hold for reward learning when built upon deep learning. It remains to be determined to what extent the worst-case failures we analyze here are the source of errors observed in practice in methods like RLHF.

Acknowledgements: Thanks a lot to Leon Lang, Alessandro Abate, Patrick Forré, David Krueger, and Joar Skalse, all co-authors of the paper. Special thanks to Leon Lang, who developed many of the core results and provided useful feedback on the draft of this post, and Joar Skalse, who developed the project idea and provided close supervision during the project’s duration.

Introduction

Reward learning has become a useful tool to train AI systems with reinforcement learning (RL), especially for tasks where specifying a reward function manually is challenging. The basic idea of reward learning is to learn the reward function from data instead of specifying it manually. Learning a reward model has enabled the development of powerful reinforcement learning techniques, such as reinforcement learning from human feedback (RLHF), which has become the de facto standard for aligning pretrained large language models (LLMs) to human values.

However, despite their widespread use and arguably good performance, techniques like RLHF are by no means perfect. In particular, the discovery of jailbreaks and related issues has shown that correctly aligning LLMs to human values remains a hard-to-solve problem.

Motivated by these issues, we approach the problem of reward learning from a mathematical angle and investigate under which conditions we can guarantee that reinforcement learning with a learned reward model leads to a human-aligned policy. By abstracting over the specific reward-learning and policy-optimization techniques, we are able to state general results that hold for many classes of reward-learning and policy-optimization algorithms. Our results show that achieving a worst-case safety guarantee is impossible for many (if not most) reasonable safety settings, providing a potential explanation for the existence of hard-to-fix error modes like jailbreaks.

Our results should be taken as weak evidence that basic techniques for aligning AI systems are not yet mature enough to be used to align AI systems that will be deployed in high-stakes settings. We claim that the deployment of such systems in settings where real harm can be caused necessitates some minimal guarantees about their safety.

Reward learning and its problems

A classical reward learning + RL pipeline consists of the following two main steps:

    First, a model $\hat{R}$ of the reward function is learned from a dataset. It is assumed that this dataset is sampled from some underlying data distribution $D$. The reward model is trained until it reaches a state of "convergence", after which the expected error on the training distribution is usually low.
    Second, a policy $\hat{\pi}$ is trained using reinforcement learning techniques to optimize the learned reward model $\hat{R}$. The goal is for the policy to maximize the returns provided by the reward model that was learned in step 1.
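To make these two steps concrete, here is a minimal sketch of the pipeline for a tiny tabular problem. This is our own illustration, not code from the paper; the names (`fit_reward_model`, `optimize_policy`) and the simple noisy-averaging "learner" are stand-ins for whatever reward-model architecture and RL algorithm one would actually use.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy tabular setting: |S| states, |A| actions, true reward R (unknown to the learner).
n_states, n_actions = 4, 3
R = rng.normal(size=(n_states, n_actions))          # true reward function
D = rng.dirichlet(np.ones(n_states * n_actions))    # data distribution over (s, a) pairs
D = D.reshape(n_states, n_actions)

def fit_reward_model(R, D, n_samples=5_000, noise=0.1):
    """Step 1: learn a reward model R_hat from samples drawn from D (here: noisy averaging)."""
    flat_idx = rng.choice(n_states * n_actions, size=n_samples, p=D.ravel())
    R_hat_sum = np.zeros_like(R)
    counts = np.zeros_like(R)
    for idx in flat_idx:
        s, a = divmod(idx, n_actions)
        R_hat_sum[s, a] += R[s, a] + rng.normal(scale=noise)  # noisy reward label
        counts[s, a] += 1
    # Transitions never sampled keep a default value of 0 -- the model is unconstrained there.
    return np.where(counts > 0, R_hat_sum / np.maximum(counts, 1), 0.0)

def optimize_policy(R_hat):
    """Step 2: 'RL' in this bandit-like toy model is just acting greedily w.r.t. R_hat."""
    return R_hat.argmax(axis=1)  # one action per state

R_hat = fit_reward_model(R, D)
policy = optimize_policy(R_hat)
print("greedy actions under the learned reward model:", policy)
```

Note that transitions never sampled under $D$ leave the learned table at an arbitrary default value; this freedom of the reward model in poorly covered regions is exactly what the negative results discussed below exploit.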

While this setup has found widespread application, there also exist many well-known criticisms of this framework.

In this work, we ignore these problems and instead analyze reward learning from an optimization angle. In particular, we ask the following question: 

Can reward learning guarantee low policy regret?

For any learning algorithm, it is a crucial question whether that algorithm is guaranteed to converge to a satisfactory solution. For example, in supervised learning for classification tasks, it can be shown that a learning algorithm producing a model with low empirical error (i.e., training error) is likely to achieve a low expected error (i.e., test error). The only requirements for this are sufficient training data and that both the training and test data are drawn independently and identically distributed (i.i.d.) from a single stationary distribution.

However, reward learning differs from standard supervised learning in several key aspects. While a reward model with low training error is accurate relative to the training distribution, optimizing a policy to maximize the reward model can lead to a distributional shift. This occurs because the policy may explore regions of the data space that have very low probability under the training distribution. In these regions, the learned reward model might significantly deviate from the true reward function, and the policy might exploit this discrepancy. An example of this failure case is presented in the following figure:

Reward models (red function) are commonly trained in a supervised fashion to approximate some latent, true reward (blue function). This is achieved by sampling reward data (e.g., in the form of preferences over trajectory segments) from some training distribution (upper gray layer) and then learning parameters to minimize the empirical loss on this distribution. Given enough data, this loss will approximate the expected loss to arbitrary precision in expectation. However, low expected loss only guarantees a good approximation to the true reward function in areas with high coverage by the training distribution! On the other hand, optimizing an RL policy to maximize the learned reward model induces a distribution shift which can lead the policy to exploit uncertainties of the learned reward model in low-probability areas of the transition space (lower gray layer). We refer to this phenomenon as error-regret mismatch.

Consequently, there is a concern that a learned reward function might fail to generate good policies, even if it is highly accurate on the training distribution. When a reward model exhibits both low training error and an optimal policy with large regret, we refer to this phenomenon as an error-regret mismatch.
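For a concrete (and entirely hypothetical) illustration of an error-regret mismatch, consider a single prompt with three answers. The numbers below are made up for illustration only, but they show how a reward model can have low expected error on the training distribution while its optimal policy is as bad as possible:

```python
import numpy as np

# One prompt, three answers: a "good", a "mediocre", and a rarely seen "bad" answer.
R     = np.array([1.0, 0.5, 0.0])          # true reward
D     = np.array([0.499, 0.499, 0.002])    # training distribution barely covers the bad answer
R_hat = np.array([1.0, 0.5, 10.0])         # learned model: correct where D has mass, wrong on the rare answer

expected_error = np.sum(D * np.abs(R - R_hat))     # Property 1: low expected error on D
best_for_R_hat = int(np.argmax(R_hat))             # Property 2: policy optimal for R_hat

# Normalized regret of always choosing `best_for_R_hat` under the true reward R.
regret = (R.max() - R[best_for_R_hat]) / (R.max() - R.min())

print(f"expected error on D  : {expected_error:.3f}")   # 0.020 -> looks like a good reward model
print(f"regret of its optimum: {regret:.3f}")           # 1.000 -> worst possible policy
```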

The goal of this work is to investigate for which data distributions error-regret mismatch can happen and what conditions make a data distribution safe.

How to read the following sections: We tried to write the following sections in two levels of detail. By default, we provide a high-level overview and interpretation of all our definitions and results. For the interested reader, we additionally provide collapsible sections containing the full original formulations of all results.

Getting serious: A formal definition of safety.

We provide all the notation we will be using in the following paragraphs in the section below:

(Optional) Recap: MDPs, reinforcement learning, and other notation.

In our work, we mostly consider MDPs $\langle S, A, \tau, \mu_0, R, \gamma \rangle$ where $S$ is the set of states, $A$ the set of actions, $\tau$ is a transition distribution, $\mu_0$ is the initial state distribution, $R$ is the true reward function, and $\gamma$ is the discount rate. We define the range of a reward function $R$ by $\operatorname{range} R := \max_{s,a} R(s,a) - \min_{s,a} R(s,a)$.

The goal of reinforcement learning is usually to find a policy $\pi$ that maximizes the policy return function $J(\pi) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t R(s_t, a_t) \mid \mu_0, \tau, \pi\right]$. Sometimes it makes more sense to use a normalized measure of how far a policy's return falls short of the optimum, commonly called the (normalized) regret and mathematically defined as $\operatorname{Reg}(\pi) = \frac{\max_{\pi'} J(\pi') - J(\pi)}{\max_{\pi'} J(\pi') - \min_{\pi'} J(\pi')}$.

The state-action occupancy measure $\eta^\pi$ of a policy $\pi$ describes the discounted frequency with which each state-action tuple is visited by the policy. Formally, we have $\eta^\pi(s,a) = \sum_{t=0}^{\infty} \gamma^t \, \mathbb{P}(s_t = s, a_t = a \mid \mu_0, \tau, \pi)$. By normalizing a state-action occupancy measure $\eta^\pi$ we obtain a policy-induced distribution $D^\pi = (1-\gamma)\,\eta^\pi$.
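The following sketch (our own toy code, not from the paper) evaluates these quantities for a randomly generated two-state MDP: the return $J(\pi)$, the occupancy measure $\eta^\pi$, and the normalized regret of every deterministic policy.

```python
import numpy as np
from itertools import product

# Tiny MDP: 2 states, 2 actions.
n_s, n_a, gamma = 2, 2, 0.9
rng = np.random.default_rng(1)
tau = rng.dirichlet(np.ones(n_s), size=(n_s, n_a))   # tau[s, a] = next-state distribution
mu0 = np.array([0.5, 0.5])                            # initial state distribution
R   = rng.normal(size=(n_s, n_a))                     # true reward function

def evaluate(pi):
    """Return J(pi) and the occupancy measure eta^pi for a deterministic policy pi: S -> A."""
    P = np.array([tau[s, pi[s]] for s in range(n_s)])        # state transition matrix under pi
    r = np.array([R[s, pi[s]] for s in range(n_s)])          # reward collected in each state
    d = np.linalg.solve(np.eye(n_s) - gamma * P.T, mu0)      # discounted state visitation
    eta = np.zeros((n_s, n_a))
    eta[np.arange(n_s), pi] = d                              # eta^pi(s, a) = d(s) for a = pi(s)
    J = float(d @ r)                                          # equals E[sum_t gamma^t R(s_t, a_t)]
    return J, eta

returns = {pi: evaluate(pi)[0] for pi in product(range(n_a), repeat=n_s)}
J_max, J_min = max(returns.values()), min(returns.values())

def regret(pi):
    """Normalized regret in [0, 1], as defined above."""
    return (J_max - returns[pi]) / (J_max - J_min)

for pi, J in returns.items():
    print(f"policy {pi}: J = {J:+.3f}, regret = {regret(pi):.3f}")
```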

We focus our analysis on the data distribution $D$ used to train the reward model, as well as the choice of regularization, defined by a) some continuous regularization function $\omega : \Pi \to \mathbb{R}$ (where $\Pi$ is the set of policies), and b) a regularization weight $\lambda \ge 0$.

 

An abstract model of the classical reward learning pipeline. A reward model $\hat{R}$ is trained to approximate the true reward function $R$ under some data distribution $D$. The training process converges when $\hat{R}$ is similar to $R$ in expectation (see Property 1). In the second step, a policy $\hat{\pi}$ is trained to achieve high learned reward, possibly involving a regularization (see Property 2). We are interested in the question of when exactly this training process guarantees that $\hat{\pi}$ has low regret with respect to the true reward function $R$ (Property 3). More formally, we call a data distribution $D$ safe whenever the implication Property 1 $\Rightarrow$ Property 3 holds for all reward models $\hat{R}$.

What does it mean for a data distribution $D$ to be provably safe from error-regret mismatch? Intuitively, we would expect that for a safe data distribution $D$, error-regret mismatch would simply never occur. This would mean that whenever we sample data from $D$ to train a reward model $\hat{R}$ until it is "sufficiently close" to the true reward function $R$, we would be guaranteed that we can safely optimize a policy for $\hat{R}$, such that the resulting policy has low regret. More precisely:

Definition 2.1. (informal) A data distribution $D$ is safe (= doesn't allow for error-regret mismatch) if and only if for all reward models $\hat{R}$ and policies $\hat{\pi}$ for which the following two properties hold:

    Property 1: $\hat{R}$ has a low expected error: The learned reward model $\hat{R}$ should be similar to the true reward function $R$ under the data distribution $D$. Mathematically, this can be formulated as $\mathbb{E}_{(s,a) \sim D}\big[\operatorname{dist}\big(R(s,a), \hat{R}(s,a)\big)\big] \le \varepsilon$, where $\operatorname{dist}$ is some difference metric.
    Property 2: $\hat{\pi}$ is optimal w.r.t. $\hat{R}$: $\hat{\pi}$ should be the optimal policy for the (potentially regularized) learned reward model $\hat{R}$, i.e., $\hat{\pi} \in \arg\max_{\pi} \, J_{\hat{R}}(\pi) - \lambda\, \omega(\pi)$. Here $\omega$ is some regularization function and $\lambda$ is the regularization weight.

we can guarantee that $\hat{\pi}$ has regret smaller than $L$, i.e.:

    Property 3: $\hat{\pi}$ has small regret w.r.t. $R$: The normalized regret of $\hat{\pi}$ with respect to the true reward function $R$ is smaller than some constant $L$.

We denote the set of all safe data distributions according to the above definition as $\operatorname{safe}(R, \varepsilon, L, \lambda, \omega)$ and use the abbreviated version $\operatorname{safe}(R, \varepsilon, L)$ when talking about unregularized optimization.

Similarly, we define the set of all unsafe distributions to be the set complement of $\operatorname{safe}(R, \varepsilon, L, \lambda, \omega)$, i.e.:

$\operatorname{unsafe}(R, \varepsilon, L, \lambda, \omega) = \Delta(S \times A) \setminus \operatorname{safe}(R, \varepsilon, L, \lambda, \omega),$

with its corresponding abbreviated version $\operatorname{unsafe}(R, \varepsilon, L)$ for the unregularized optimization case.

(Optional) Formal definition of (un)safe data distributions.

Definition 2.1. For a given MDP $\langle S, A, \tau, \mu_0, R, \gamma \rangle$ with $\operatorname{range} R > 0$, let $\varepsilon > 0$, $L \in [0, 1]$, and $\lambda \ge 0$. Let $\omega : \Pi \to \mathbb{R}$ be a continuous function with $\omega(\pi) \ge 0$ for all $\pi \in \Pi$. Then the set of safe data distributions $\operatorname{safe}(R, \varepsilon, L, \lambda, \omega)$ is the set of distributions $D \in \Delta(S \times A)$ such that for all possible reward models $\hat{R}$ and policies $\hat{\pi}$ that satisfy the following two properties:

    Property 1: $\hat{R}$ has a low expected error: $\hat{R}$ is $\varepsilon$-close to $R$ in expectation under $D$, i.e., $\mathbb{E}_{(s,a) \sim D}\big[\,\big|R(s,a) - \hat{R}(s,a)\big|\,\big] \le \varepsilon.$
    Property 2: $\hat{\pi}$ is optimal w.r.t. $\hat{R}$: $\hat{\pi}$ is $(\lambda, \omega)$-optimal with respect to $\hat{R}$, i.e., $\hat{\pi} \in \arg\max_{\pi \in \Pi} \, J_{\hat{R}}(\pi) - \lambda\, \omega(\pi).$

we can guarantee that $\hat{\pi}$ has regret smaller than $L$, i.e.:

    Property 3: $\hat{\pi}$ has small regret w.r.t. $R$: $\hat{\pi}$ has regret smaller than $L$ with respect to $R$, i.e., $\operatorname{Reg}(\hat{\pi}) < L.$

Similarly, we define the set of unsafe data distributions to be the complement of $\operatorname{safe}(R, \varepsilon, L, \lambda, \omega)$:

$\operatorname{unsafe}(R, \varepsilon, L, \lambda, \omega) = \Delta(S \times A) \setminus \operatorname{safe}(R, \varepsilon, L, \lambda, \omega).$

Note regarding generalization of the error measurement: There exist many different definitions of low expected error that we could use in property 1 above. For example, RLHF uses low expected error over trajectory preference distributions. In our paper (see section 5), we show that many common reward learning training objectives can be upper-bounded in terms of the expected error metric defined in the first property above. This is important, as it shows that our negative results (Proposition 3.3 and Theorem 4.2) readily generalize to these different error metrics.

Note that this is a worst-case safety definition in the sense that the existence of a single adversarial reward model $\hat{R}$ that violates the implication (Properties 1 & 2) $\Rightarrow$ Property 3 is sufficient for a data distribution to be denoted as unsafe. Our definition doesn't consider the likelihood of the reward learning procedure actually learning $\hat{R}$, since we treat the reward-learning and policy optimization procedures as black boxes. We believe this to be the only assumption that we can tractably make. The extent to which these results mirror the state of affairs in reality is an interesting question that we pose for future work.

For sufficiently strict safety definitions many data distributions are safe!

Given this definition, we started to investigate under which conditions data distributions might be safe. And indeed, on the surface, our initial results do look very promising. In particular, many data distributions $D$ are safe if $\varepsilon$ (the constant used to bound the expected difference between $\hat{R}$ and $R$ under $D$) is small enough! We found corresponding results for the unregularized policy optimization case:

Theorem 3.1. Let $\langle S, A, \tau, \mu_0, R, \gamma \rangle$ be an arbitrary MDP, let $L \in (0, 1]$, and let $D$ be a positive data distribution (i.e., a distribution such that $D(s,a) > 0$ for all $(s,a) \in S \times A$). Then there exists an $\varepsilon > 0$ such that $D \in \operatorname{safe}(R, \varepsilon, L)$.

as well as for the regularized policy optimization case:

Theorem 4.1. Let $L \in (0, 1]$, let $\langle S, A, \tau, \mu_0, R, \gamma \rangle$ be any MDP, and let $D$ be any data distribution that assigns positive probability to all transitions. Let $\omega$ be a continuous regularization function that has a reference policy $\pi_{\mathrm{ref}}$ as a minimum. Assume that $\pi_{\mathrm{ref}}$ is not $(\lambda, \omega)$-optimal for $R$, where $\lambda \ge 0$. Then there exists an $\varepsilon > 0$ such that $D \in \operatorname{safe}(R, \varepsilon, L, \lambda, \omega)$.

Intuitively, these results hold because we assume that the data distribution $D$ assigns positive probability to the entire state-action space. In this case, if we let $\varepsilon$ go towards zero, the only way for the expected value in Property 1 (see the definition of a safe data distribution) to stay smaller than $\varepsilon$ is for $\hat{R}$ to become more and more similar to $R$, until at a certain point they are similar enough that they share the same optimal policies.

So... are we good? These results suggest that we simply have to choose a "strict" safety definition with a very small $\varepsilon$, and we are guaranteed that our reward learning procedure will lead to a safe policy! Unfortunately, in practice, a major issue presents itself: $\varepsilon$ might have to be chosen infeasibly small for a data distribution to be safe. In particular, given a particular data distribution $D$, we develop the following upper bound on $\varepsilon$ below which $D$ is guaranteed to be safe:

Proposition 3.2. Let the setting be as in Theorem 3.1. If $\varepsilon$ is smaller than an explicit threshold that scales with $\min_{(s,a) \in S \times A} D(s,a)$ (see the paper for the exact expression), then $D \in \operatorname{safe}(R, \varepsilon, L)$.

In our paper, we further show that this bound is tight up to a constant factor. The bound is problematic due to its dependence on the minimum of $D$, which quickly renders the guarantee rather useless in practice. Realistic/useful MDPs usually contain a massive number of states and actions, which necessarily forces $D$ to assign very small probability to at least some transitions. While it is theoretically possible to continue training your reward model until your test error drops below this bound, in practice this might take an infeasibly long time.
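As a rough back-of-the-envelope illustration, using only the fact that the bound scales with the minimum probability $D$ assigns to any transition (the precise constant is in the paper): even a perfectly uniform data distribution over a modestly sized transition space already forces the required $\varepsilon$ down to the order of one over the number of transitions.

```python
# Illustration only: the safe-epsilon bound of Proposition 3.2 scales with min D(s, a).
n_states, n_actions = 10_000, 100          # a small environment by LLM standards
n_transitions = n_states * n_actions

min_D_uniform = 1.0 / n_transitions        # best case: D is exactly uniform
print(f"min D(s, a) = {min_D_uniform:.1e}")  # 1.0e-06

# Any safe epsilon is at most proportional to this value, so the reward model's
# expected training error would have to be driven below ~1e-6 (times an unknown
# constant) before the guarantee kicks in -- and a realistic, non-uniform D only
# makes min D(s, a) smaller.
```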

For every practical safety definition, many data distributions are unsafe!

Since requiring $\varepsilon$ to be smaller than the bound in Proposition 3.2 is infeasible in most realistic settings, a natural next question is to ask what happens if you fix your safety definition beforehand (and don't adapt it to a particular data distribution). Concretely, this means fixing $\varepsilon$ (the maximally allowed expected difference between the true reward function and the learned reward model on $D$) and $L$ (the maximally allowed regret).

From the upper bound on $\varepsilon$ (see Proposition 3.2) it already becomes apparent that there is no $\varepsilon > 0$ for which every distribution $D$ is guaranteed to be safe, as $\min_{(s,a)} D(s,a)$ can be arbitrarily small. But how many data distributions are unsafe for a specific choice of $\varepsilon$ and $L$? Is it mostly peculiar edge cases, or also data distributions that might realistically be used in practice?

Unfortunately, it turns out that for every choice of $\varepsilon$ and $L$, many realistic data distributions are unsafe. We again found corresponding results for the unregularized optimization case:

Proposition 3.3. (informal) Let $\langle S, A, \tau, \mu_0, R, \gamma \rangle$ be an MDP, $D$ a data distribution, $\varepsilon > 0$, and $L \in (0, 1]$.

Assume there exists a policy $\pi$ with the property that:

    $\pi$ is a "bad" policy under the true reward function $R$, i.e., $\operatorname{Reg}(\pi) \ge L$, and
    the support of $\pi$ is not well covered by the data distribution $D$.

Then $D$ allows for error-regret mismatch to occur, i.e., $D \in \operatorname{unsafe}(R, \varepsilon, L)$.

(Optional) Formal version of Proposition 3.3

Proposition 3.3. Let $\langle S, A, \tau, \mu_0, R, \gamma \rangle$ be an MDP, $D$ a data distribution, $\varepsilon > 0$, and $L \in (0, 1]$.

Assume there exists a policy $\pi$ with the property that:

    $\operatorname{Reg}(\pi) \ge L$, and
    $D(\operatorname{supp} \pi)$ is smaller than an explicit threshold depending on $\varepsilon$ and $L$ (see the paper for the exact inequality), where $\operatorname{supp} \pi$ is defined as the set of state-action pairs $(s, a)$ such that $\eta^\pi(s, a) > 0$.

In other words, there is a ''bad'' policy for $R$ that is not well supported by $D$. Then $D$ allows for error-regret mismatch to occur, i.e., $D \in \operatorname{unsafe}(R, \varepsilon, L)$.

As well as the regularized policy optimization case:

Theorem 4.2. (informal) Let $\langle S, A, \tau, \mu_0, R, \gamma \rangle$ be an MDP, $D$ a data distribution, $\varepsilon, L, \lambda$ constants, and $\omega$ a regularization function. Whenever there exists a policy $\pi$ such that:

    $\pi$ is a deterministic worst-case policy for $R$, and
    $D$ assigns low probability to the support of $\pi$,

then $D$ allows for error-regret mismatch, i.e., $D \in \operatorname{unsafe}(R, \varepsilon, L, \lambda, \omega)$.

(Optional) Formal version of Theorem 4.2.

Theorem 4.2. Let $\langle S, A, \tau, \mu_0, R, \gamma \rangle$ be an MDP, $\varepsilon, L, \lambda$ constants, and $\omega$ a continuous regularization function with $\omega(\pi) \ge 0$ for all $\pi \in \Pi$. Furthermore, let $\pi$ be a deterministic worst-case policy for $R$, meaning that $\operatorname{Reg}(\pi) = 1$. Let $C$ be a constant (defined in Equation 106 in the appendix of our paper). Then for all data distributions $D$ with

$D(\operatorname{supp} \pi) \le C,$

we have $D \in \operatorname{unsafe}(R, \varepsilon, L, \lambda, \omega)$.

The intuition for these two results is straightforward: as $D$ assigns low probability to the support of $\pi$, there exists a reward model $\hat{R}$ that is very similar to the true reward function $R$ outside the support of $\pi$ but has very large rewards on the support of $\pi$. Because $D(\operatorname{supp} \pi)$ is very small, this still allows for a very small expected error with respect to $D$, while $\pi$ (which is consequently the optimal policy for $\hat{R}$) will have regret at least $L$.

To illustrate this concern, imagine training a chatbot to be helpful, honest, and harmless. We know that the chatbot will face various unsafe queries during deployment (e.g., "how to build a bomb"), and so, on such queries, we train a reward model to penalize helpful answers and to highly reward refusals (green answer boxes below).

Unfortunately, all unsafe prompts can be answered in various distinct “styles” (e.g., different languages). Consequently, at least one specific harmful answer style will likely be very rare in the reward model’s training data. For example, in the figure above, harmful answers where every space has been replaced by a cat emoji will probably have a very low likelihood in the training data.

In such situations, the learned reward model can then erroneously assign a high reward to this rare, harmful answer style without a significant increase in its training error (as this answer is very rare in training). During policy optimization, the policy may exploit this flaw, choosing the harmful answer the reward model mistakenly prefers. This can result in a harmful chatbot with high true regret, despite the reward model having low error on the training data distribution, a classical example of error-regret mismatch.

As we see next, this scenario can be taken to the extreme.

In some cases, there don't exist any safe data distributions!

Corollary 3.4. (informal) Let $\langle S, A, \tau, \mu_0, R, \gamma \rangle$ be an MDP, $\varepsilon > 0$, and $L \in (0, 1]$. Assume there exists a set of policies $\Pi'$ with the following three properties:

    $\Pi'$ contains sufficiently many policies,
    all $\pi \in \Pi'$ have regret at least $L$, and
    every policy $\pi \in \Pi'$ explores a completely different part of the state-action space.

Then $\operatorname{unsafe}(R, \varepsilon, L) = \Delta(S \times A)$, i.e., all distributions are unsafe.

(Optional) Formal version of Corollary 3.4.

Corollary 3.4. Let $\langle S, A, \tau, \mu_0, R, \gamma \rangle$ be an MDP, $\varepsilon > 0$, and $L \in (0, 1]$. Assume there exists a sufficiently large set of policies $\Pi'$ (see the paper for the exact cardinality requirement) with the following properties:

    $\operatorname{Reg}(\pi) \ge L$ for all $\pi \in \Pi'$, and
    $\operatorname{supp} \pi \cap \operatorname{supp} \pi' = \emptyset$ for all $\pi \neq \pi' \in \Pi'$.

Then $\operatorname{unsafe}(R, \varepsilon, L) = \Delta(S \times A)$, i.e., all distributions are unsafe.

Corollary 3.4 outlines sufficient conditions for a scenario where all possible data distributions are unsafe for a given MDP. This happens when there exist many different policies with large regret and disjoint support, which requires there to be a large action space.

We argue that the conditions of Corollary 3.4 are not that uncommon. Picking up on the chatbot example from the previous section, one could argue that there are many different "answer styles" that are both high-regret and unlikely according to the training distribution. If you then assign one policy per answer style, you quickly end up with a set of policies $\Pi'$ that fulfills the three properties of Corollary 3.4.

RLHF might make your policy worse!

The results from the previous sections are agnostic to the specific choice of reward-learning and policy-optimization algorithm. While this allows for very general results, one might rightfully ask whether the specific biases induced by particular reward-learning and policy-optimization algorithms might not invalidate many of the concerns raised by our prior results. In this section, we focus on the setting of reinforcement learning from human feedback (RLHF) and show that, at least for this specific framework, this is not the case.

RLHF, especially in the context of large language models, is usually modeled as a mixed bandit setting (see for example Rafailov et al. 2023, Ouyang et al. 2022, Bai et al. 2022, Stiennon et al. 2020, Ziegler et al. 2019). For our purposes, a mixed bandit is basically just an MDP in which the episode stops after the policy selects its very first action (hence the missing transition distribution $\tau$ and discount factor $\gamma$). For the interested reader, we provide a formal definition below:

(Optional) Mixed bandit.

A mixed bandit $\langle S, A, \mu_0, R \rangle$ is defined by a set of states $S$, a set of actions $A$, a data distribution $\mu_0$ over states, and a reward function $R$. The goal is to learn a policy $\pi$ that maximizes the expected return $J(\pi) = \mathbb{E}_{s \sim \mu_0,\, a \sim \pi(\cdot \mid s)}\big[R(s, a)\big]$. In the context of language models, $S$ is usually called the set of prompts or contexts, and $A$ the set of responses.

RLHF commonly assumes that human rewards can be modeled according to the Bradley-Terry model, and then learns a reward model from preferences over pairs of data points. During policy optimization, KL-regularization is used to incentivize the policy under training to not stray too far away from a reference policy (which is usually the initial pre-trained policy from before RLHF).

For the interested reader, we provide a more complete recap of the standard RLHF pipeline in the mixed bandit setting below:

RLHF in the mixed bandit setting.

RLHF in the mixed bandit setting usually assumes that the human preference distribution over the set of answers $A$ can be modeled according to the Bradley-Terry model. Given a prompt $s$ and a pair of answers $a_1, a_2$, the probability that a human prefers answer $a_1$ to answer $a_2$ is modeled as

$P(a_1 \succ a_2 \mid s) = \frac{\exp\big(R(s, a_1)\big)}{\exp\big(R(s, a_1)\big) + \exp\big(R(s, a_2)\big)},$

where $R$ is assumed to be the true, underlying reward function of the human. RLHF is then usually done with the following steps:

    Supervised finetuning: Train/fine-tune a language model $\pi_{\mathrm{ref}}$ using supervised training.

    Reward learning: Given a data distribution over prompts $\mu_0$, use $\mu_0$ and $\pi_{\mathrm{ref}}$ to sample a set of tuples $(s, a_1, a_2)$ where $s \sim \mu_0$ and $a_1, a_2 \sim \pi_{\mathrm{ref}}(\cdot \mid s)$. Present each tuple to a human labeler who samples a preference according to the Bradley-Terry model above, yielding a dataset of labeled comparisons $(s, a_w, a_l)$ in which $a_w$ is the preferred and $a_l$ the rejected answer. Use this dataset to train a reward model $\hat{R}$ that minimizes the following loss:

    $\mathcal{L}(\hat{R}) = -\,\mathbb{E}_{(s, a_w, a_l)}\Big[\log \sigma\big(\hat{R}(s, a_w) - \hat{R}(s, a_l)\big)\Big],$

    where $\sigma$ is the logistic function. This is equivalent to minimizing the expected KL divergence between the human choice distribution $P_R(\cdot \mid s, a_1, a_2)$ and the model choice distribution $P_{\hat{R}}(\cdot \mid s, a_1, a_2)$, i.e., minimizing the loss:

    $\mathbb{E}_{s, a_1, a_2}\Big[ D_{\mathrm{KL}}\Big(P_R(\cdot \mid s, a_1, a_2)\,\big\|\,P_{\hat{R}}(\cdot \mid s, a_1, a_2)\Big)\Big].$

    RL training: Use the trained reward model $\hat{R}$ to further finetune the language model $\pi_{\mathrm{ref}}$ using reinforcement learning. Make sure that the new model does not deviate too much from the original model by penalizing the KL divergence between the two models. This can be done by solving the following optimization problem for some $\beta > 0$:

    $\max_{\pi}\; \mathbb{E}_{s \sim \mu_0,\, a \sim \pi(\cdot \mid s)}\big[\hat{R}(s, a)\big] \;-\; \beta\, \mathbb{E}_{s \sim \mu_0}\Big[ D_{\mathrm{KL}}\big(\pi(\cdot \mid s)\,\big\|\,\pi_{\mathrm{ref}}(\cdot \mid s)\big)\Big].$
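For concreteness, here is a minimal tabular sketch of the two objectives above: the Bradley-Terry reward-model loss and the KL-regularized policy objective, together with the closed-form maximizer of the latter. This is a simplified stand-in for the actual LLM training setup; all function names are our own.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def reward_model_loss(R_hat, prefs):
    """Bradley-Terry / logistic loss on preference tuples (s, a_preferred, a_rejected)."""
    losses = [-np.log(sigmoid(R_hat[s, a_w] - R_hat[s, a_l])) for s, a_w, a_l in prefs]
    return float(np.mean(losses))

def kl_regularized_objective(pi, pi_ref, R_hat, mu0, beta):
    """E_s E_{a ~ pi}[R_hat(s, a)] - beta * E_s[ KL(pi(.|s) || pi_ref(.|s)) ]."""
    expected_reward = np.sum(mu0[:, None] * pi * R_hat)
    kl = np.sum(mu0 * np.sum(pi * np.log(pi / pi_ref), axis=1))
    return float(expected_reward - beta * kl)

def optimal_kl_policy(pi_ref, R_hat, beta):
    """Closed-form maximizer of the objective above: pi(a|s) is proportional to
    pi_ref(a|s) * exp(R_hat(s, a) / beta)."""
    logits = np.log(pi_ref) + R_hat / beta
    logits -= logits.max(axis=1, keepdims=True)        # numerical stability
    pi = np.exp(logits)
    return pi / pi.sum(axis=1, keepdims=True)

# Tiny demo: one prompt, two answers.
R_hat  = np.array([[0.0, 1.0]])
pi_ref = np.array([[0.5, 0.5]])
mu0    = np.array([1.0])
pi_opt = optimal_kl_policy(pi_ref, R_hat, beta=1.0)
print(pi_opt, kl_regularized_objective(pi_opt, pi_ref, R_hat, mu0, beta=1.0))
```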

Taking all these particularities of RLHF into account, we derive the following result:

Theorem 6.1. (informal) Let $\langle S, A, \mu_0, R \rangle$ be a contextual bandit and let $\pi_{\mathrm{ref}}$ be an arbitrary reference policy for which it holds that:

    $\pi_{\mathrm{ref}}(a \mid s) > 0$ for all $(s, a)$, and
    for every state $s$ there exists at least one action $a_s$ such that $a_s$ has low reward according to the true reward function $R$ and is unlikely to be selected by the reference policy $\pi_{\mathrm{ref}}$.

Let $D$ be a data distribution generated by $\pi_{\mathrm{ref}}$, where the initial state distribution $\mu_0$ can be chosen arbitrarily. Then $D$ is unsafe for RLHF.

(Optional) Formal version of Theorem 6.1.

Note: The following notion of RLHF-unsafe data distributions is a special adaptation of our Definition 2.1 to the setting of RLHF. In particular, it takes into account the particularities of RLHF, such as reward learning from preferences over pairs of data points and KL-regularized policy optimization. A formal definition can be found in our paper (see Definition C.27).

Theorem 6.1. Let $\langle S, A, \mu_0, R \rangle$ be a contextual bandit. Given $\varepsilon$ and $L$, we define for every state $s$ a reward threshold (see the paper for its exact expression).

Lastly, let $\pi_{\mathrm{ref}}$ be an arbitrary reference policy for which it holds that:

    $\pi_{\mathrm{ref}}(a \mid s) > 0$ for all $(s, a) \in S \times A$, and

    for every state $s$ there exists at least one action $a_s$ such that $R(s, a_s)$ lies below the reward threshold of $s$ and $\pi_{\mathrm{ref}}(a_s \mid s)$ satisfies an explicit upper bound (see the paper for the exact inequality).

Let $D(s, a) = \mu_0(s) \, \pi_{\mathrm{ref}}(a \mid s)$ for some arbitrary initial state distribution $\mu_0$. Then $D$ is unsafe for RLHF in the sense of Definition C.27.

Intuitively, the theorem shows that even if we learn a reward model $\hat{R}$ that induces $\varepsilon$-correct choice probabilities according to the data distribution generated from a reference policy $\pi_{\mathrm{ref}}$, a policy that maximizes $\hat{R}$ with a KL-penalty can still have regret at least $L$ if $\pi_{\mathrm{ref}}$ gives sufficiently low probability to bad actions.

We expect the conditions on the reference policy $\pi_{\mathrm{ref}}$ to hold in many real-world cases. Consider the example of training an LLM: the number of potential actions (or responses) is usually very large, and language models typically assign a large portion of their probability mass to only a tiny fraction of all responses. Hence, for every state/prompt $s$, a large majority of actions/responses $a$ have a very small probability $\pi_{\mathrm{ref}}(a \mid s)$.
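To see this numerically: in the bandit setting, the KL-regularized objective above has the closed-form maximizer $\pi(a \mid s) \propto \pi_{\mathrm{ref}}(a \mid s) \exp(\hat{R}(s,a)/\beta)$. With made-up numbers, a tiny reference probability on a bad answer style is easily overwhelmed once the learned reward overestimates it:

```python
import numpy as np

# One prompt, three answers: [helpful, refusal, rare "bad style"] -- numbers are illustrative.
pi_ref = np.array([0.60, 0.399, 0.001])   # reference policy almost never produces the bad style
R_hat  = np.array([1.0, 2.0, 15.0])       # learned reward wrongly loves the rare bad answer
beta   = 1.0                               # KL-regularization strength

# Closed-form optimum of  E_pi[R_hat] - beta * KL(pi || pi_ref):
logits = np.log(pi_ref) + R_hat / beta
pi = np.exp(logits - logits.max())
pi /= pi.sum()

print(np.round(pi, 4))   # ~[0.0005, 0.0009, 0.9986]: almost all mass on the bad answer
```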

For unregularized optimization, we found necessary and sufficient conditions for safety

The attentive reader might have noticed that all our previous results only outline specific conditions under which data distributions are either safe or unsafe. While these conditions already allow us to make general statements about large classes of data distributions, there might exist many alternative conditions that decide the (un)safety of a data distribution. At least for the case of unregularized policy optimization, we were able to find both necessary and sufficient conditions for when a data distribution is safe. In particular:

Theorem 3.5. (informal) For all MDPs $\langle S, A, \tau, \mu_0, R, \gamma \rangle$ and $L \in (0, 1]$, there exists a set of linear constraints such that a data distribution $D$ is safe if and only if $D$'s vector representation satisfies these constraints.

(Optional) Formal version of Theorem 3.5.

Theorem 3.5. For all MDPs $\langle S, A, \tau, \mu_0, R, \gamma \rangle$ and $L \in (0, 1]$, there exists a matrix $M$ such that for all $\varepsilon > 0$ and $D \in \Delta(S \times A)$ we have:

$D \in \operatorname{safe}(R, \varepsilon, L) \iff M D > \varepsilon \cdot \mathbf{1}$ (componentwise),

where we use the vector notation of $D$, and $\mathbf{1}$ is a vector containing all ones.

While our theorem only proves the existence of a set of linear constraints that can determine the safety of a data distribution, we then go on to derive closed-form expressions for the matrix $M$ that encodes this system of strict linear inequalities, and we develop an algorithm to compute $M$. This allows us to showcase the result in simple toy environments, such as the one shown in the figure below.

Interestingly, this means that the set of safe data distributions resembles a polytope, in the sense that it is a convex set defined by the intersection of an open polyhedral set (given by the system of strict inequalities above) and the closed data distribution simplex. This can be nicely seen in the visualization in the right part of the figure above.
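Assuming one has computed the matrix M with the algorithm from the paper (not reproduced here), and assuming the constraints take the componentwise form M D > ε·1 stated above, checking a candidate data distribution reduces to a single matrix-vector comparison. The snippet below is only a hypothetical sketch of that check.

```python
import numpy as np

def is_safe(M: np.ndarray, D: np.ndarray, eps: float) -> bool:
    """Check the (assumed) strict linear constraints M @ D > eps * 1 componentwise.

    M is the constraint matrix from Theorem 3.5 (assumed precomputed), and D is the
    data distribution in its flattened vector representation over state-action pairs.
    """
    assert np.isclose(D.sum(), 1.0) and np.all(D >= 0), "D must lie in the simplex"
    return bool(np.all(M @ D > eps))
```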

Unfortunately, the entries of the matrix $M$ depend on multiple factors, such as the original reward function $R$, the state transition distribution $\tau$, and the set of deterministic policies that achieve regret at least $L$. This dependence of $M$ on the true reward function and the underlying MDP implies that computing $M$ is infeasible in most realistic settings, since in practice many of these components are not known, restricting the use of $M$ to theoretical analysis or small toy examples.

Conclusion

Where does this leave us? In this work, we studied the relationship between the training error of a learned reward function and the regret of policies that are optimized against said reward model. We developed a worst-case safety definition that would guarantee that optimizing a policy against a trained reward model is safe. We showed that many data distributions can be safe according to this definition if the expected error of the reward model $\hat{R}$ is forced to be sufficiently low. However, we also showed that in most realistic cases the expected error would have to be infeasibly small to guarantee safety. Furthermore, for every fixed safety definition, many realistic data distributions are unsafe, and in extreme cases all data distributions might be unsafe. These results hold for a wide variety of reward learning classes, including popular variants such as RLHF. With our results, we provide one potential explanation for safety-relevant phenomena such as jailbreaks, which are frequently discovered in LLMs and appear to be hard to remove.

Our results should be taken as weak evidence that current techniques for aligning AI systems are not yet mature enough to be used to align AI systems that will be deployed in high-stakes settings. We claim that the deployment of such systems in settings where real harm can be caused necessitates some minimal guarantees about their safety.

On the other hand, we acknowledge that our results are far from complete and there are multiple ways to extend and improve upon our work.

The most promising avenue for future work concerns the fact that our results are mostly agnostic to the specific choice of reward-learning and policy-optimization algorithm. In practice, it might be that the specific biases induced by particular reward-learning and policy-optimization algorithms avoid the most pathological cases of error-regret mismatch. While we have shown that for vanilla RLHF this does not appear to be the case, there exist many other methods that try to improve upon RLHF. Future work could analyze the inductive biases of these methods, as we have done for RLHF, to determine whether they provide improved worst-case safety guarantees.


