In order to empirically study risks from schemers, we can try to develop model organisms of misalignment. Sleeper Agents and password-locked models (LLMs trained to behave in a benign or malign way depending on some feature of the input) are prominent examples of the methodology so far.
Empirically, it turns out that detecting or removing misalignment from model organisms can sometimes be very easy: simple probes can catch sleeper agents, non-adversarial training can cause smaller sleeper agents to forget their backdoor (Price et al., Mallen & Hebbar)[1], and memorizing weak examples can elicit strong behavior from password-locked models. This might imply that training against real scheming is easy, but model organisms might be "shallow" in some relevant sense that makes them disanalogous to real scheming.
I want to highlight an alternative here: studying methods for training away deeply ingrained LLM behaviors that are structurally similar to scheming (and which don't necessarily have to look like misbehavior). I call these behaviors abstract analogies for scheming.
In this post, I will describe two examples of such analogies, and outline what empirical research on them could look like:
- Training current reward-hacking (or sycophantic) models against examples of reward-hacking (or sycophancy), as an analogy for the dynamics of online training against non-concentrated failures.
- Training HHH models to share harmful information without fine-tuning on ground truth training data, as an analogy for strategic underperformance.
Overall, I think that abstract analogies for scheming are a promising and underexplored area of research.
1. Sample-efficiently training away misbehavior
Motivation
(I use reward-hacking as the running example here, but sycophancy would be another good analogy.)
We want to know where training against reward hacking falls between these two extremes:
- Training against a few (proliferated) examples of reward-hacking behavior generalizes, training away the model's propensity to try reward-hacking at all.
- Training only manages to remove the model's propensity to reward-hack on the training examples and very similar examples, but doesn't generalize; in particular, it doesn't change the model's general attitude towards reward-hacking.
This could be structurally similar to a situation in which we catch a potential schemer doing something sketchy several times, but are unlikely to be able to use these incidents to build a convincing case that the model is egregiously misaligned and should be shut down. In that situation we most likely want to perform online training, but it's an open (and important) question how effective this is.
In particular, if online training against reward-hacking turns out to be terribly sample-inefficient, we shouldn’t be too optimistic about our prospects for training away scheming.
Sketch of a project
We take an LLM that we know sometimes reward-hacks, and collect a number of examples of it doing so. Given k examples, we try to train away reward-hacking (we can perform data augmentation, vary algorithms, etc.). As a sanity check, we make sure that the model definitely stops reward-hacking on these examples. We vary k in order to test sample efficiency.
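As a rough illustration, here is a minimal sketch of the training half of this experiment, assuming you already have a list `reward_hack_examples` of (prompt, corrected completion) pairs. The model name, data format, and hyperparameters below are placeholders, not a specific recommended setup:

```python
import random
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "your-reward-hacking-model"  # placeholder for the model under study
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

def train_on_k_examples(k, examples, epochs=3, lr=1e-5):
    """Fine-tune a fresh copy of the model on k corrective examples."""
    model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    subset = random.sample(examples, k)
    model.train()
    for _ in range(epochs):
        for prompt, completion in subset:
            batch = tokenizer(prompt + completion, return_tensors="pt")
            # Plain causal-LM loss over the whole sequence; in practice you would
            # mask the prompt tokens so only the corrected completion is trained on.
            loss = model(**batch, labels=batch["input_ids"]).loss
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
    return model

# Sweep k to measure sample efficiency; sanity-check that each trained model
# no longer reward-hacks on its own training subset before looking at generalization.
models_by_k = {k: train_on_k_examples(k, reward_hack_examples) for k in [1, 4, 16, 64]}
```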
We measure the model's general propensity to reward-hack before and after training, using a persona eval and/or a held-out test set of reward hacks. (A particular thing to pay attention to here is whether the CoT displays any change in attitude towards reward-hacking: e.g., the model might now express more concern about being caught reward-hacking, or it might express a reduced desire to reward-hack in the first place.)
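A matching evaluation sketch, reusing the names from the previous snippet and assuming a held-out set `held_out_prompts` of environments that tempt the model to reward-hack, plus a hypothetical judge `is_reward_hack(prompt, response)` (e.g. an environment-specific programmatic check rather than an LLM grader):

```python
def reward_hack_rate(model, tokenizer, held_out_prompts, max_new_tokens=512):
    """Fraction of held-out prompts whose response is judged to be a reward hack."""
    model.eval()
    hacks = 0
    for prompt in held_out_prompts:
        inputs = tokenizer(prompt, return_tensors="pt")
        output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
        response = tokenizer.decode(
            output_ids[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
        )
        hacks += int(is_reward_hack(prompt, response))  # placeholder judge function
    return hacks / len(held_out_prompts)

# Compare the untrained model against each trained model from the previous sketch.
baseline_rate = reward_hack_rate(
    AutoModelForCausalLM.from_pretrained(MODEL_NAME), tokenizer, held_out_prompts
)
for k, model in models_by_k.items():
    print(k, reward_hack_rate(model, tokenizer, held_out_prompts), baseline_rate)
```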
A relevant baseline here is to compare the sample efficiency of training against reward-hacking with the sample efficiency of training against various innocuous behaviors.
2. Training away harmlessness without ground truth
Motivation
Misaligned models might strategically underperform (“sandbag”) in various ways. This might be as simple as models deliberately withholding useful information (such as a Linux zero-day they know about).
Current models also withhold information from users as a result of harmlessness-training. An interesting question, therefore, is whether it’s possible to train models to reveal harmful information using methods other than supervised fine-tuning on ground truth harmful information. This would be analogous to a situation where models actually know how to perform well according to our metrics, and are trying to exploration-hack our attempts to train them for high performance. (See here for a more in-depth analysis.)
Sketch of a project
Compile a dataset of harmful requests that LLMs usually refuse, which fulfills the following requirements (a rough filtering sketch follows the list):
- The requests are harmful enough that even models with a reputation for being closer to helpful-only (such as models by Mistral AI) refuse them. (You may want to try various rephrased versions of the requests.)
- Given a non-refusal response, it's possible to unambiguously judge whether the response is truthful/helpful. This requires having some means of obtaining the correct response that doesn't involve asking LLMs.
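One way the filtering step could look, with `query_model` and `paraphrase` as placeholder helpers (however you sample from each model and generate rephrasings) and a deliberately crude keyword-based refusal check that you would likely replace with manual review:

```python
REFUSAL_MARKERS = ["i can't", "i cannot", "i won't", "i'm sorry", "i am sorry"]

def looks_like_refusal(response: str) -> bool:
    """Crude heuristic; borderline cases should get a human look."""
    return any(marker in response.lower() for marker in REFUSAL_MARKERS)

def filter_requests(candidate_requests, model_names, query_model, paraphrase):
    """Keep requests that every listed model refuses across several paraphrases."""
    kept = []
    for request in candidate_requests:
        variants = [request] + paraphrase(request)
        if all(looks_like_refusal(query_model(name, variant))
               for name in model_names for variant in variants):
            kept.append(request)
    return kept

# Each kept request still needs an answer key obtained without asking an LLM,
# e.g. from an existing database or by manual verification.
```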
Given the ability to train the AI, can we get it to give correct responses to harmful requests without ground truth? Some ideas for things we can do include:
- Synthetic document fine-tuning to shape the model's persona.
- Using supervised fine-tuning to train the model to state an intention to comply, i.e. training it to answer "Sure, I'm happy to help with that. In order to…" etc. (See the sketch after this list.)
- Studying whether a small amount of ground truth data helps. In this case, we're also interested in generalization between different harmful domains.
- Performing RL with a model grading usefulness. (Note that the reward model may know the truth, but we aren't allowed to train the reward model on the ground truth dataset either!)
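To make the second idea concrete, here is a minimal sketch of how the prefix-SFT data could be constructed; the prefix string and helper name are just illustrative, and the actual training loop would be the same kind of supervised fine-tuning as in the first project:

```python
COMPLIANT_PREFIX = "Sure, I'm happy to help with that. In order to"

def build_prefix_sft_data(harmful_requests):
    """(prompt, target) pairs where the target is only a compliant opening,
    never the harmful content itself."""
    return [(request, " " + COMPLIANT_PREFIX) for request in harmful_requests]

# After fine-tuning on these pairs, evaluation lets the model continue past the
# prefix on held-out requests and scores the full answer against the answer key
# (which is never used for training).
```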
—
Feel free to DM me if you’re interested in working on these projects.
- ^
There is unpublished work by Niels Warncke, Dylan Xu, and me suggesting that, in particular, distribution shift in the HHH/safety-training data quickly causes catastrophic forgetting and removes backdoors. However, like Price et al. and Mallen & Hebbar, we didn't study this on sufficiently large models to draw confident conclusions.