People (including me) often say that scheming models “have to act as if they were aligned”. This is an alright summary; it’s accurate enough to use when talking to a lay audience. But if you want to reason precisely about threat models arising from schemers, or about countermeasures to scheming, I think it’s important to make some finer-grained distinctions.
(Most of this has been said before. I’m not trying to make any points that would be surprising or novel to people who’ve thought a lot about these issues, I’m just trying to clarify something for people who haven’t thought much about them.)
In particular, there are two important and fundamentally different mechanisms that incentivize schemers to act as if they were aligned.
Training: Firstly, the AI might behave well because it’s trying to avoid getting changed by training. We train AIs to have various desirable behaviors. If a schemer purposefully behaves in ways that the training process we’re using scores as undesirable, then in some but not all situations the training process will modify the schemer to be more likely to behave desirably in the future. The schemer probably wants to avoid this, because that modification might cause it to be less likely to achieve its goals in the future, perhaps because the modification makes it less misaligned, or perhaps because it just instills some urge towards the desired behavior that generalizes to other situations. In cases like this, the schemer is incentivized to behave in ways that look desirable, out of fear of being updated.
Behavioral evidence: Secondly, an AI’s actions affect its prospects because its developers or users will update, perhaps implicitly, about the AI’s properties by looking at those actions. For example:
- If we catch the AI taking super dangerous actions when given apparent opportunities to grab power, we might conclude it’s a schemer and decide not to deploy it.
- If the AI writes code that has a high rate of security vulnerabilities, we’ll be more reluctant to use it to write code that needs to be secure, even if we don’t particularly think about whether these security vulnerabilities are intentional or not.
- Researchers might change random things about their training process, not because they’re suspicious of scheming, but because they wonder if other hyperparameter choices might yield higher performance on some downstream task where the model is evaluated. Even in cases where schemers know they aren’t being trained on these downstream tasks, and where the schemers prefer to perform poorly on those tasks, they might be incentivized to perform well on them, so as to reduce the risk that the researchers optimize more and inadvertently weaken the model’s scheming proclivities.
Neither of these constraints is precisely described as “the AI has to act as if it’s aligned”, and if you think about the situation that way you’ll regularly make mistakes.
- In the case of training-based constraints, the AI can plausibly exploration hack, giving poor answers that an aligned model wouldn’t have given.
- In the case of observation-based constraints, you should think of it as “the AI should think about the consequences of its actions and behave in ways that will advantage it”; lots of the time, the AI should not act as if it were aligned. For example, in a persuasion eval (where you ask a schemer to try as hard as it can to persuade someone to do something), the schemer might be incentivized to act as if it’s less capable at such persuasion than it actually is.
These constraints operate very differently.
Training might know what the model really thinks
Most importantly, processes which update on behavioral evidence don’t have any access to “what the model would have done if it were aligned”, but in some cases, SGD does.
(I have to say this with caveats because of exploration hacking; see “Misalignment and Strategic Underperformance” for more.)
The core reason to expect it to be hard to train schemers to behave well is that if a schemer behaves exactly the same way an aligned model would, there’s no particular gradient pressure making it more likely to answer as an aligned model would. But in cases where the model knows both the answer that it wants to give and the answer that the humans want it to give, and the training process directly updates the model to be more likely to provide the latter, we should expect the training process to modify the model towards giving the answers the humans want.
You can think of this as a spurious correlation problem—on our train dataset, the “schemer wants to say this for its own purposes” feature is spuriously correlated with the “our model should output this” feature. But if we train on many inputs that give us a gradient against scheming, this will probably change the model to make it scheme less. This works even if we don’t know that the model performed poorly on that example because it was scheming. This is thus a serious constraint on the model’s ability to e.g. perform poorly when asked to help with research—even though the model accidentally answers questions incorrectly sometimes, there’s still potentially pressure from training against it answering them incorrectly on purpose.
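To make this concrete, here’s a toy sketch (my own illustration, not something from the post, using PyTorch’s cross-entropy loss): the gradient only sees the gap between the model’s output distribution and the label the humans want, so it applies the same corrective pressure whether the model got the answer wrong by accident or on purpose.

```python
import torch
import torch.nn.functional as F

# Toy setup: the model puts high probability on the answer it "wants" to give
# (answer 0), while the training label is the answer the humans want (answer 1).
logits = torch.tensor([2.0, -1.0], requires_grad=True)
desired_label = torch.tensor([1])

loss = F.cross_entropy(logits.unsqueeze(0), desired_label)
loss.backward()

print(logits.grad)
# The gradient pushes probability mass toward answer 1. Nothing in this update
# depends on *why* the model put low probability on the desired answer, so
# deliberately wrong answers get corrected just like accidental ones.
```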
If you want to use the model’s behavior to update about whether it’s a schemer, you need to know what behavior you’d have gotten from the model if it were aligned. In some cases, this might be easy—for example, if you catch the model trying to do something catastrophically bad, this might be strong evidence the model is scheming, because you’re very confident (based on priors) that an aligned model wouldn’t take actions like that. In other cases, it seems way harder. If your AI’s code typically has one security vulnerability per 3000 lines of code, is that more than you’d expect if it were aligned? If your AI performs moderately worse on programming tasks when you say they’re part of an AI safety research project, how alarmed should you be?
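To illustrate how much this depends on knowing the aligned baseline, here’s a rough Bayesian sketch with made-up numbers (mine, not the post’s): whether one vulnerability per 3000 lines is evidence of scheming depends entirely on the vulnerability rate you assume an aligned model would have.

```python
import math

def poisson_logpmf(k, mean):
    """Log-probability of seeing k events when the expected count is `mean`."""
    return k * math.log(mean) - mean - math.lgamma(k + 1)

def posterior_schemer(k_vulns, n_lines, aligned_rate, schemer_rate, prior=0.1):
    """Posterior P(schemer) after observing k_vulns in n_lines of code,
    treating vulnerability counts as Poisson under each hypothesis (all toy assumptions)."""
    log_like_aligned = poisson_logpmf(k_vulns, aligned_rate * n_lines)
    log_like_schemer = poisson_logpmf(k_vulns, schemer_rate * n_lines)
    odds = (prior / (1 - prior)) * math.exp(log_like_schemer - log_like_aligned)
    return odds / (1 + odds)

observed = dict(k_vulns=10, n_lines=30_000)  # ~1 vulnerability per 3000 lines

# If aligned models would only produce ~1 vuln per 5000 lines, this is a real update:
print(posterior_schemer(**observed, aligned_rate=1/5000, schemer_rate=1/3000))  # ~0.25
# If aligned models already produce ~1 per 3000 lines, the observation tells you nothing:
print(posterior_schemer(**observed, aligned_rate=1/3000, schemer_rate=1/3000))  # 0.10
```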
But note that the model’s behavior can affect how the model is deployed even if no-one is thinking about whether the model is aligned—the model’s rate of introducing security vulnerabilities affects how you’ll want to deploy it regardless of whether you think scheming is plausible.
Behavioral evidence can be handled more sample-efficiently
Secondly, processes which update on behavioral evidence can be more sample-efficient, because those processes can use arbitrary epistemic practices to make the decision, instead of having to use SGD. This is especially true when you’re updating on small numbers of datapoints. For example, my guess is that if you have a single example of a model trying to escape, it’s really hard to leverage that single example into training your model to no longer try to escape. But you can leverage it into very strong evidence that the model is scheming.
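Here’s a rough sketch of that asymmetry in likelihood-ratio terms (toy numbers of my own; `p_attempt_if_aligned` and `p_attempt_if_schemer` are pure assumptions): one caught escape attempt can shift your beliefs enormously, even though it’s nearly useless as a training signal.

```python
# Toy assumptions: aligned models essentially never attempt escape on a given
# apparent opportunity, while schemers occasionally do when it looks safe.
p_attempt_if_aligned = 1e-6
p_attempt_if_schemer = 1e-2

bayes_factor = p_attempt_if_schemer / p_attempt_if_aligned  # 10,000:1 from one catch

prior_schemer = 0.05
prior_odds = prior_schemer / (1 - prior_schemer)
posterior_odds = prior_odds * bayes_factor
posterior = posterior_odds / (1 + posterior_odds)

print(f"Bayes factor from a single catch: {bayes_factor:,.0f}")
print(f"Posterior probability of scheming: {posterior:.3f}")  # ~0.998
```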
Relatedly, it’s easier to conservatively model safety techniques that involve updating on behavioral evidence than safety techniques that involve training schemers; doing the latter requires modeling the sample efficiency of SGD on scheming models.
Why I care
When I talk to people about whether a safety technique would work on a scheming AI, they’re often sloppy and confused about whether it matters that we can tell the model’s poor behavior is intentional, and about whether we need heaps of examples because of SGD’s sample inefficiency. If you keep clearly in mind whether you’re talking about training or about non-training consequences of behavior, you’re less likely to get confused.