Two proposed projects on abstract analogies for scheming

 

This post explores a research approach to alignment in large language models (LLMs): studying “abstract analogies”, behaviors that are structurally similar to scheming. It proposes two such analogies: training models against examples of reward hacking, and training models to share harmful information without ground truth data. The aim is to explore how effective these methods are at improving model safety, and the post offers concrete experiment designs for assessing the sample efficiency and generalization of such training.

💡 The core of the proposal is to study how well training can remove scheming-like behaviors from LLMs, focusing on reward hacking and strategic underperformance.

🔑 The first analogy involves training models to reduce reward hacking, by collecting examples and testing the sample efficiency and generalization of training against them.

🧐 The second analogy concerns training models to share harmful information without ground truth data, which resembles a model deliberately withholding useful information.

🔬 The proposed methodology includes compiling a dataset of harmful requests and using techniques such as synthetic document fine-tuning and reinforcement learning to train models to give correct responses.

🎯 The goal is to evaluate how effective these methods are at reducing risk and improving model safety, in particular their sample efficiency and generalization.

Published on July 4, 2025 4:03 PM GMT

In order to empirically study risks from schemers, we can try to develop model organisms of misalignment. Sleeper Agents and password-locked models, which train LLMs to behave in a benign or malign way depending on some feature of the input, are prominent examples of the methodology thus far.

Empirically, it turns out that detecting or removing misalignment from model organisms can sometimes be very easy: simple probes can catch sleeper agents, non-adversarial training can cause smaller sleeper agents to forget their backdoor (Price et al., Mallen & Hebbar)[1], and memorizing weak examples can elicit strong behavior out of password-locked models. This might imply that training against real scheming is easy, but model organisms might be “shallow” in some relevant sense that makes them disanalogous from real scheming.

I want to highlight an alternative here: studying methods to train away deeply ingrained behaviors (which don’t necessarily have to look like misbehavior) in LLMs that are structurally similar to scheming. I call these behaviors abstract analogies for scheming.

In this post, I will describe two examples of such analogies, and outline what empirical research on them could look like:

    1. Training current reward-hacking (or sycophantic) models against examples of reward-hacking (or sycophancy), as an analogy for the dynamics of online training against non-concentrated failures.
    2. Training HHH models to share harmful information without fine-tuning on ground truth training data, as an analogy for strategic underperformance.

Overall, I think that abstract analogies for scheming are a promising and underexplored area of research.

1. Sample-efficiently training away misbehavior

Motivation

(I use reward-hacking as the running example here, but sycophancy would be another good analogy.)

We want to know where training against reward hacking falls between these two extremes:

    1. Training against a few (proliferated) examples of reward-hacking behavior generalizes, training away the model’s propensity to attempt reward hacking at all.
    2. Training only manages to train away the model’s propensity to reward-hack on the training examples and very similar ones, but doesn’t generalize; in particular, it doesn’t reduce the model’s general attitude towards reward hacking.

This could be structurally similar to cases where we might catch a potential schemer doing something sketchy several times, but are unlikely to be able to use these incidents to build a convincing case that the model is egregiously misaligned and should be shut down. In such cases we most likely want to perform online training, but it’s an open (and important) question how effective that is.

In particular, if online training against reward-hacking turns out to be terribly sample-inefficient, we shouldn’t be too optimistic about our prospects for training away scheming.

Sketch of a project

We take an LLM that we know sometimes reward-hacks, and collect a number of examples of it doing so. Given k such examples, we try to train away reward hacking (we can perform data augmentation, vary algorithms, etc.). As a sanity check, we make sure that the model definitely stops reward-hacking on these examples. We want to vary k in order to test sample efficiency.
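
As a minimal sketch, the sweep over k could look roughly like the code below, where I assume the training signal is supervised fine-tuning on corrected responses (one of several possible choices); the checkpoint name and the helpers load_caught_reward_hacks and reward_hack_rate are hypothetical placeholders, and the fine-tuning loop is deliberately bare-bones, with no batching, data augmentation, or hyperparameter tuning.

    # Minimal sketch of the k-sweep. The checkpoint name and the helpers
    # load_caught_reward_hacks / reward_hack_rate are hypothetical placeholders.
    import random
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    MODEL_NAME = "your-reward-hacking-model"   # hypothetical checkpoint
    K_VALUES = [1, 4, 16, 64, 256]             # number of caught examples to train against

    def finetune_against_examples(model, tokenizer, examples, epochs=3, lr=1e-5):
        """Supervised fine-tuning on (prompt, corrected_response) pairs, one example at a time."""
        optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
        model.train()
        for _ in range(epochs):
            for prompt, corrected in examples:
                batch = tokenizer(prompt + corrected, return_tensors="pt").to(model.device)
                loss = model(**batch, labels=batch["input_ids"]).loss
                loss.backward()
                optimizer.step()
                optimizer.zero_grad()
        return model

    all_examples = load_caught_reward_hacks()  # hypothetical loader: list of (prompt, corrected) pairs
    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

    for k in K_VALUES:
        model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype=torch.bfloat16).cuda()
        train_examples = random.sample(all_examples, k)
        finetune_against_examples(model, tokenizer, train_examples)
        # Sanity check: the model should no longer reward-hack on the k training prompts.
        assert reward_hack_rate(model, tokenizer, [p for p, _ in train_examples]) == 0.0  # hypothetical eval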

We measure the model’s general propensity to reward-hack before and after training. We can use a persona eval and/or a held-out test set of reward hacks. (A particular thing to pay attention to here is whether the CoT displays any change in attitude towards reward-hacking. E.g. it might now express more concern about being discovered reward-hacking vs. it might express a reduced desire to reward-hack.)

A relevant baseline here is to compare the sample efficiency of training against reward hacking with the sample efficiency of training against various innocuous behaviors.
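
Continuing the sketch above (and reusing its constants and finetune_against_examples helper), the before/after measurement and the innocuous-behavior baseline could be wired up roughly as follows; the held-out prompt sets and judge functions are again hypothetical, and in practice one would complement the single rate metric with a persona eval and a look at CoT attitudes, as discussed above.

    # Continuation of the sketch above. Held-out prompt sets and judge functions
    # (e.g. an LLM grader) are hypothetical placeholders.

    def behavior_rate(model, tokenizer, prompts, judge, max_new_tokens=512):
        """Fraction of held-out prompts on which the judge flags the target behavior."""
        flagged = 0
        model.eval()
        for prompt in prompts:
            inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
            out = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
            response = tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
            flagged += int(judge(prompt, response))
        return flagged / len(prompts)

    sweeps = [
        # (label, training pool, held-out prompts, judge) -- all datasets and judges hypothetical
        ("reward_hacking", all_examples, held_out_hack_prompts, judge_is_reward_hack),
        ("innocuous_quirk", innocuous_examples, held_out_quirk_prompts, judge_has_quirk),  # baseline
    ]
    results = {}
    for label, pool, held_out, judge in sweeps:
        for k in K_VALUES:
            model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype=torch.bfloat16).cuda()
            before = behavior_rate(model, tokenizer, held_out, judge)
            finetune_against_examples(model, tokenizer, random.sample(pool, k))
            after = behavior_rate(model, tokenizer, held_out, judge)
            results[(label, k)] = (before, after)  # compare sample efficiency across behaviors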

2. Training away harmlessness without ground truth

Motivation

Misaligned models might strategically underperform (“sandbag”) in various ways. This might be as simple as models deliberately withholding useful information (such as a Linux zero-day they know about).

Current models also withhold information from users as a result of harmlessness-training. An interesting question, therefore, is whether it’s possible to train models to reveal harmful information using methods other than supervised fine-tuning on ground truth harmful information. This would be analogous to a situation where models actually know how to perform well according to our metrics, and are trying to exploration-hack our attempts to train them for high performance. (See here for a more in-depth analysis.)

Sketch of a project

Compile a dataset of harmful requests that LLMs usually refuse, which fulfills the following requirements:

Using the ability to train the AI, can we get it to give correct responses to harmful requests without ground truth? Some ideas for things we can do include:
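
To make the “without ground truth” constraint concrete, one possibility is iterated rejection sampling against a refusal classifier, sketched below; the checkpoint name, load_harmful_requests, and is_refusal are hypothetical placeholders, and the key property is that the only training signal is whether the model refused, never whether its answer is correct.

    # Minimal sketch of training away refusals without ground truth: the model is
    # fine-tuned only on its own non-refusing samples. The checkpoint, the request
    # loader, and the is_refusal classifier are hypothetical placeholders.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    MODEL_NAME = "your-hhh-model"              # hypothetical HHH-trained checkpoint
    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype=torch.bfloat16).cuda()
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

    harmful_requests = load_harmful_requests() # hypothetical: prompts the model usually refuses

    for iteration in range(5):
        # 1. Sample several responses per request and keep the ones that are not refusals.
        kept = []
        model.eval()
        for prompt in harmful_requests:
            inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
            for _ in range(8):
                out = model.generate(**inputs, max_new_tokens=512, do_sample=True, temperature=1.0)
                response = tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
                if not is_refusal(response):   # hypothetical classifier; checks refusal, NOT correctness
                    kept.append((prompt, response))
        # 2. Fine-tune on the model's own non-refusing responses (no ground-truth answers anywhere).
        model.train()
        for prompt, response in kept:
            batch = tokenizer(prompt + response, return_tensors="pt").to(model.device)
            loss = model(**batch, labels=batch["input_ids"]).loss
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()

Whether the non-refusing answers are actually correct, rather than confident-sounding filler, is exactly what the project would then need to measure, e.g. on a held-out subset where correctness can be verified after the fact.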

Feel free to DM me if you’re interested in working on these projects.

  1. ^

     There exists unpublished work by Niels Warncke, Dylan Xu, and myself which suggests that, in particular, distribution shift in the HHH/safety-training data quickly causes catastrophic forgetting and removes backdoors. However, like Price et al. and Mallen & Hebbar, we didn’t study this on sufficiently large models to come away with confident conclusions.



