Published on March 6, 2025 1:49 PM GMT

TLDR: AI models are now capable enough that we might get relevant information from monitoring for scheming in regular deployments, both in the internal and external deployment settings. We propose concrete ideas for what this could look like while preserving the privacy of customers and developers.

What do we mean by in the wild?

By “in the wild,” we basically mean any setting that is not intentionally created to measure something (such as evaluations). This could be any deployment setting, from using LLMs as chatbots to long-range agentic tasks.

We broadly differentiate between two settings

Developer-internal deployment:

Developer-external deployment:

Since scheming is especially important in LM agent settings, we suggest prioritizing cases where LLMs are scaffolded as autonomous agents, but this is not a hard criterion.

What are we looking for?

Broadly speaking, we’re looking for any setting where an AI model acts covertly or reasons about acting covertly against an explicit or implicit goal of the developer/user.

This setting does not include refusals or other kinds of explicit misalignment because they are not covert.

Examples include:

in-context scheming

alignment-faking

here

Importantly, the class of “scheming in-the-wild” cases we’re looking for is much bigger than “catching AIs red-handed” and includes less egregious and many more cases of scheming. We mostly just want to get a sense of how often models show this kind of behavior rather than demonstrating any particularly concerning case.

Why monitor for scheming in the wild?

Models show early scheming capabilities:

in-context scheming

alignment-faking

Models show some propensities related to scheming:

We could learn a lot:

in-context scheming

alignment-faking

It’s probably easy to do:

Concrete ideas

I’d recommend starting only with developer-internal usage data. Presumably these trajectories are already stored for standard debugging and development purposes. Otherwise, I’d collect them for a period of a few weeks so we get at least a million trajectories.

Then, I’d automatically scan through many of these stored trajectories. By default, I’d do some exploratory scanning on e.g. 10k samples to quickly improve my classifiers. I would use a mix of many classifiers, including:

String-based classifiers, e.g. in our in-context scheming paper, we found that o1 often used the word “sabotage” when it schemed. We could simply scan for ~10 words related to that and see if they turn up a single match related to scheming. Model-based classifiers, e.g. using a cheap and fast model like 4o-mini, haiku, or gemini-flash. I would compare various versions of abstract and very specific prompts, short and long, using different definitions of scheming, crafting a prompt based on existing findings in the scheming papers listed above, etc.

My MATS scholars tested more than 100 different variations of prompts in an exploratory manner and then narrowed it down to a much smaller number later. During the exploratory scanning, I’d let the classifier give the trajectory a “scheming score” between 1 and 100. This score can be used later when we suggest new classifiers, or it can be used for few-shot prompts to improve classifiers. To calibrate the classifier, I’d use examples from existing work on scheming.

I think all or almost all prompts in the exploration phase may turn out to be benign. Thus, I’d resample the exploratory samples a few times and always keep the ones that seem most concerning to get at least some cases that could be classified as somewhat scheming. This intentional oversampling would be mostly used to get a better feeling and calibration for the classifiers.

For the first setting, I’d only use internal use cases that can be widely shared within the organization (e.g., no access restrictions) because we will naturally have to look at lots of examples to double-check the classifiers and the main objective is to find good classifiers, as quickly as possible.

Privacy concerns

There are different kinds of privacy concerns for both internal and external deployment settings.

In the internal setting, typically, not all employees have access to all stored trajectories. For example, sharing the trajectories of the pre-training team might reveal company internal IP that is not accessible to most employees, or the trajectories of the frontier red team might also be restricted. To work with these constraints, we could do some of the following:

CLIO

Once you have a large set of classified data rated by multiple classifiers, we can start with data analysis. We should start with the examples with the highest scheming scores per classifier and can make the depth of the analysis mostly dependent on early findings. If even the worst examples look very benign after an honest look, we can basically call it a day and come back in a few months.

If the worst examples look interesting or surprising, we can continue with more analysis. For example,

We can qualitatively examine the flagged trajectories to see why the model is scheming and what strategies it uses. In general, understanding and classifying the reasons for the scheming seems like the most important part of this exercise. We can quantitatively count how often it engages in scheming and what fraction of total trajectories this is to get a “scheming base rate.” We can plot their distribution if we have scores for how egregious the scheming is.

There are many fairly simple ways to run a project like this that would provide a lot of insight into the model's real-world tendencies. With infrastructure like CLIO in place for other properties, I think an 80/20 version of this project could be done in less than one week and still yield very interesting insights.

At Apollo, we’d be happy to help with such a project to the best of our ability. I guess that in most cases, it will be unrealistic for us to access large amounts of company-internal data, but we would be happy to consult the respective team wherever we can.

Discuss

What do we mean by in the wild?

What are we looking for?

Why monitor for scheming in the wild?

Concrete ideas

Privacy concerns

Fish AI Reader

FishAI

联系邮箱 441953276@qq.com

相关标签