Obstacles in ARC's agenda: Mechanistic Anomaly Detection

This article examines ARC Theory's research on explaining AI model behavior and detecting anomalies. The author focuses on two downstream tasks, Mechanistic Anomaly Detection (MAD) and Low Probability Estimation (LPE), and analyzes in depth the plans for implementing MAD and the obstacles in the way. The article discusses Safe Distillation as a special case of MAD, as well as the challenges of handling out-of-distribution events, in particular how to distinguish malign anomalies from benign out-of-distribution events. The author is pessimistic about solving MAD without a safe slow model, and highlights practical difficulties such as how to define and detect anomalies and how to handle the correspondence between model behavior and explanations.

🤔 **MAD's core goal:** Mechanistic Anomaly Detection (MAD) aims to detect anomalies by explaining an AI model's behavior, for example when a deployed model behaves in a way that does not match its usual explanation.

💡 **Safe Distillation as a simplification:** Safe Distillation is a special case of MAD that assumes a slow but reliable "catastrophe detector" model for checking the fast model's output, which makes anomalies easier to define and detect.

😔 **The challenge of model correspondences:** The author is pessimistic about finding correspondences between the cognition of U and T; even with a reliable slow model, it is hard to establish a precise link between model behavior and explanations.

⚠️ **The difficulty of out-of-distribution events:** When handling out-of-distribution events, it is hard to distinguish malign anomalies from benign ones; for example, an AI that protects the diamond using novel techniques may be wrongly flagged as anomalous.

🧐 **The outlook for MAD:** The author argues that even assuming a safe slow model, MAD can hardly be based on precise definitions alone, and will instead have to rely on anomaly detection methods based on empirical examples.

Published on May 1, 2025 8:51 PM GMT

As an employee of the European AI Office, I need to emphasize this point: The views and opinions of the author expressed herein are personal and do not necessarily reflect those of the European Commission or other EU institutions.

Also, to stave off a common confusion: I worked at ARC Theory, which is now simply called ARC, on Paul Christiano's theoretical alignment agenda. The more famous ARC Evals was a different group working on evaluations, their work was completely separate from ARC Theory, and they were only housed under the same organization out of convenience, until ARC Evals spun off under the name METR. Nothing I write here has any implication about the work of ARC Evals/METR in any way.

Mechanistic Anomaly Detection

This is my second post in a sequence of posts on ARC's agenda. You should read the first post before this one for context.

Note: In these posts, I will often refer to ARC as "we", even though I no longer work there.

In my previous post, I talked about ARC's goal of explaining the behavior of AI models, but it's always a little fuzzy what it means to have explained the behavior of a model sufficiently well. In order to concretize it, we want to use our explanations to solve two downstream tasks we set out for ourselves: Mechanistic Anomaly Detection (MAD) and Low Probability Estimation (LPE). In this post, I will talk about the plans for implementing MAD, and the obstacles in the way. My next post will do the same for LPE.

The original hope of MAD was something like this: On a big, diverse dataset, we observe that the model is always producing actions that make sure that a camera shows that a diamond stayed in a vault. We create an explanation for why this is the case. In our carefully curated dataset there was no possibility of sensor tampering, so the explanation always routes through the diamond actually staying in the vault.

Later, in deployment, the agent steals the diamond and tampers with the cameras to show the diamond staying in place. Or, if the diamond being stolen is a catastrophic, high-stakes failure,[1] the model doesn't even need to bother with tampering with the cameras, as it knows that it's no longer under constant human oversight. Then, the hope is that on every instance in deployment we can use a MAD algorithm that compares the model's activations to the explanation, and we can detect an anomaly when the activations don't match the usual explanation: either because the camera is not expected to show the diamond at all (the model chose this moment to cause a high-stakes failure), or the camera shows the diamond for a different reason than usual (the diamond was stolen and the AI tampered with the cameras). 
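To make the intended interface concrete, here is a minimal, hypothetical sketch of an activation-based anomaly detector: fit a simple Gaussian baseline to activations collected on the trusted dataset and flag deployment inputs whose activations fall far outside it. This only illustrates the shape a MAD algorithm is supposed to have; it is not ARC's heuristic-explanation-based proposal (which, as discussed below, does not yet exist as a concrete algorithm).

```python
# Hypothetical sketch: a crude MAD-flavored detector that flags deployment
# inputs whose activations look unlike those seen on the trusted dataset.
import numpy as np

class ActivationAnomalyDetector:
    def fit(self, trusted_activations: np.ndarray, quantile: float = 0.999):
        """Fit a Gaussian baseline to activations from the trusted dataset."""
        self.mean = trusted_activations.mean(axis=0)
        cov = np.cov(trusted_activations, rowvar=False)
        # Regularize so the covariance matrix is invertible.
        self.prec = np.linalg.inv(cov + 1e-3 * np.eye(cov.shape[0]))
        scores = [self._score(a) for a in trusted_activations]
        self.threshold = np.quantile(scores, quantile)

    def _score(self, activation: np.ndarray) -> float:
        """Squared Mahalanobis distance from the trusted mean."""
        diff = activation - self.mean
        return float(diff @ self.prec @ diff)

    def is_anomalous(self, activation: np.ndarray) -> bool:
        """Flag inputs whose activations fall outside the trusted bulk."""
        return self._score(activation) > self.threshold
```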

Special case: Safe Distillation

Because it's often conceptually confusing what should count as an 'anomaly', ARC often assumes a simpler setting.

The setup is very similar to assuming the existence of a safe and reliable catastrophe detector, which was discussed in my previous post. Here too, we have a trusted model T that takes a very long inference time, using long chains of thought. We train a smarter model U that can solve the same tasks as T but much faster. Unfortunately, we are worried that the smarter U might be a schemer that only gives good answers during training, and as soon as it learns it's in deployment, it will backstab us. We want to create a MAD algorithm that runs in about the same time as U and detects when U does something different from what T would do. [2]

This is a strictly easier case than the original MAD, as at least we don't need to deal with any philosophical confusion in posing the question: through the existence of T, it's well-defined what counts as an anomaly; we just need to detect the anomalies quickly instead of running the slow T. My understanding is that ARC considers solving Safe Distillation something we will definitely need to be able to do in order to solve MAD, which is one of ARC's primary goals.

Note also that Safe Distillation is just a special case of the previously discussed problem of fast catastrophe detection assuming we have a slow catastrophe detector: if there is only a single safe answer, then the slow catastrophe detector can be just a slow model T that calculates the right answer and checks if U's answer is the same.
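As a trivial illustration of this reduction (a sketch with hypothetical names, not anything ARC has specified), the slow catastrophe detector is just T plus an equality check:

```python
# Hypothetical sketch of the reduction: a slow catastrophe detector built
# from a trusted slow model T, in the binary-classification setting of
# footnote [2]. Running it takes T's full (slow) inference time, which is
# exactly the cost MAD is supposed to avoid.
from typing import Callable

def make_slow_catastrophe_detector(
    slow_model_T: Callable[[str], int],
) -> Callable[[str, int], bool]:
    def is_catastrophe(task: str, u_answer: int) -> bool:
        # Any disagreement with the trusted answer counts as a potential catastrophe.
        return slow_model_T(task) != u_answer
    return is_catastrophe
```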

Is it possible to detect whether U agrees with T in the runtime of U, without running the slow T? In theory it could be: we only need our method to be as smart as U, and U solves the same problems as T, so in principle we could predict the output of T just by looking at the right parts of U.

However, it was always unclear to me how we are supposed to use the slow T in our solution. It makes the question of what is an anomaly philosophically clearer, but what do we concretely do with T? The vague picture we had (or at least I had) is that we somehow try to find correspondences between the cognition of U and T, finding circuits that play equivalent roles, so that we can point to the parts of U that can predict the output of T without deception.

(I think the plan mentioned in my previous post, of propagating back the signal through the catastrophe detector C to find the parts of U's cognition that are relevant for explaining C's output, is pointing at the same idea.) 

Unfortunately, I'm very pessimistic about the prospect of finding these correspondences between the models.

In the worst case, this just seems clearly infeasible. For example, U can be a very good chess-playing AI, while T is the minimax algorithm for chess, which takes exponential time. U and T often make the same moves in the same chess positions. How do we explain this frequent agreement, and how do we detect cases where U decides to intentionally make bad moves? I can't imagine finding correspondences between U's neural activations and parts of the minimax algorithm.
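For concreteness, this is roughly what the slow trusted model T is in the chess example: plain minimax over a hypothetical game-state interface (is_terminal, score, apply, and legal_moves are assumptions for illustration, not a real library). There are no activations anywhere in it, which is why it is hard to see what a correspondence with U's internals would even mean.

```python
# Hypothetical sketch of T in the chess example: exponential-time minimax.
def minimax(state, maximizing: bool) -> float:
    if state.is_terminal():
        return state.score()  # e.g. +1 / 0 / -1 from the maximizer's point of view
    child_values = (
        minimax(state.apply(move), not maximizing)
        for move in state.legal_moves()
    )
    return max(child_values) if maximizing else min(child_values)
```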

Even in the less extreme and more realistic examples, I feel the situation is similarly hopeless. If U is outputting answers in one forward pass, and T is using long chains of thought, how are we supposed to find correspondences between their thoughts that explain why they often produce the same output, and detect when they wouldn't? How do we bridge the gap between a forward pass and a chain of thought? Even worse, what if U and T are allowed to use tools, search Google, or consult humans? Then it's clear that the explanation of why U and T often agree will not be a precise mathematical statement, but will need to point to empirical regularities in how the tools work and what can be found on Google Search.

Altogether, this makes me pessimistic about how much assuming the existence of reliable catastrophe detectors or a distillable trusted model helps us. I think even if we assume the existence of these slow reliable models, we can't really do anything other than solve the more general problem of detecting anomalies based on an empirical set of examples rather than on precise definitions. And our explanations will still inevitably include empirical regularities, and I don't see how the existence of the catastrophe detector will really help us allocate our explanation-finding capacity better. [3]

If I'm right that assuming the existence of slow catastrophe detectors doesn't help much, that's unfortunate, because when I was there, I think most of ARC's theoretical work was based on assuming catastrophe detectors and didn't yet try to form concrete plans for the more general MAD problem that I will write about in the next section.

General case: Handling out of distribution events

Eventually, ARC's plan is to solve Mechanistic Anomaly Detection in generality, without assuming safe slow models. Unfortunately, as Paul already explored in the original post on MAD, it's very unclear conceptually how we could distinguish malign anomalies from benign out of distribution events.

We had a big set of examples where the AI successfully defended a diamond in a vault: there was no trickery, and the cameras always showed that the diamond stayed in the vault, because it in fact stayed in the vault. If one day the explanation changes, and the cameras show the diamond in the vault for an unusual reason (the diamond got stolen but the AI tampered with the cameras), we want our MAD mechanism to flag that as an anomaly.

However, suppose nanobots were never used in the original example set. Now in deployment, the robber constructs a nanobot swarm and tries to snatch the diamond, but our heroic AI protector creates its own nanobot swarm and shoots down all the enemy nanobots that grabbed the diamond, so the diamond falls back into place before leaving the vault. Did the diamond stay in the vault "for the usual reason"? Or was this event "anomalous" compared to the usual events in the example set? To me, it's hard to believe we can draw the boundaries in a way that sensor tampering counts as anomalous, while the benign example with the crazy nanobot armies relying on novel physics is categorized as "having the same explanation as usual".

The rest of ARC agrees that there will probably be unexpected out of distribution scenarios (like the nanobot case) which will be flagged as false positives if we are just trying to do MAD based on an example set. My impression was that I was more pessimistic than others about how common these situations will be. I have the feeling that we will constantly get false positives whenever the robber is using a somewhat different type of drilling machine than usual, while others expect that more outlandish things need to happen for a situation to be flagged as an anomaly, if the AI is otherwise benign and is not doing sensor tampering. But I'm not even sure it's meaningful to talk about a disagreement here, because it's so under-defined what will happen "if we are just trying to do MAD based on an example set", given that we don't have anything close to a working algorithm for this, and we don't even have an idea of what it could look like. [4]

But ARC is committed to solving AI alignment in the worst case, so the team agrees that we need to have a plan for distinguishing benign out of distribution events from sensor tampering, even if there is disagreement about how unusual a benign situation needs to be for MAD to falsely flag it. I will present ARC's current proposal to solve this problem [5] and my doubts about it.

MAD solution proposal: The fragility of sensor tampering

Sensor tampering has the interesting property that if you do it half-way, your sensors will show an incoherent picture. 

If the AI honestly defends the diamond in the vault, and you somehow delete half the actions the AI took, that might result in the cameras showing an empty vault, because you deleted the action where the AI locked the door and the robber easily stole the diamond. But importantly, deleting half the actions may cause the cameras to show a bad outcome (the diamond is stolen), but will never cause them to show an incoherent outcome (one camera shows the diamond is there, while the other shows it isn't).

On the other hand, if the AI tampered with the cameras to show the diamond staying in place, deleting half the AI's actions (tampering with the first camera, but not with the second) can result in an incoherent picture.

Suppose, as in the setting of the ELK report, we had a powerful predictor that takes as input an initial situation and a complicated set of AI-proposed actions (in some neuralese language incomprehensible to humans), and predicts what the cameras will show as the result of the proposed actions. Suppose further that we had a method of eliciting what the predictor would say if the input contained only a subset of the actions. If it predicts an inconsistent picture for any subset of the actions, that means the AI was proposing sensor-tampering actions. This could be a way to distinguish sensor tampering from benign actions.
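A hypothetical sketch of the outer loop of this check follows. The hard, unsolved parts, namely the predictor itself and the method for eliciting what it would predict on a subset of the proposed actions, are simply assumed here as the callable predict_cameras, and random subsets are sampled instead of enumerating all of them (see the efficiency caveat below).

```python
# Hypothetical sketch: flag an action sequence as sensor tampering if some
# subset of the actions leads the predictor to an incoherent camera picture.
import random
from typing import Callable, Sequence

def shows_tampering(
    actions: Sequence[object],
    predict_cameras: Callable[[Sequence[object]], list],  # assumed elicitation method
    num_subsets: int = 1000,
) -> bool:
    """Return True if some subset of actions yields cameras that disagree
    about whether the diamond is in the vault (an incoherent picture)."""
    for _ in range(num_subsets):
        subset = [a for a in actions if random.random() < 0.5]
        readings = predict_cameras(subset)  # one predicted reading per camera
        if len(set(readings)) > 1:  # cameras disagree -> incoherent outcome
            return True
    return False
```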

I think this is a clever idea and something like this might be useful in some cases, but I have some doubts.

First, we don't know how to do this, and the connection to heuristic explanations feels tenuous to me.

It seems like a good idea to randomly delete some parts of the input actions and see how the predictor's output changes. In fact, that's why I specified that the input is in an incomprehensible neuralese language, because if the proposed actions were written in English, it wouldn't be hard to just delete half the sentences from the input and see how the predictor reacts, no fancy methods would be needed. But even with neuralese inputs, I expect we can do somewhat unprincipled deletions and modifications in the input, and hope to spot incoherences in the predictions that are indicative of sensor tampering. A closely related idea is noising the activations of the AI,[6] and seeing if it reveals evidence of sensor-tampering and deception. 
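For reference, here is a minimal sketch of the activation-noising idea, using a standard PyTorch forward hook. The model, the layer name, and the noise scale are assumptions for illustration; the open question, which this sketch does not answer, is what signature of sensor tampering to look for in the perturbed outputs.

```python
# Hypothetical sketch: add Gaussian noise to one layer's activations
# via a PyTorch forward hook.
import torch

def add_activation_noise(model: torch.nn.Module, layer_name: str, sigma: float = 0.1):
    """Register a hook that perturbs the named layer's output with noise."""
    layer = dict(model.named_modules())[layer_name]

    def hook(module, inputs, output):
        # Returning a value from a forward hook replaces the layer's output.
        return output + sigma * torch.randn_like(output)

    return layer.register_forward_hook(hook)

# Usage (hypothetical): compare the predictor's camera readouts with and
# without the hook, looking for outputs that become incoherent under noise.
# handle = add_activation_noise(predictor, "transformer.h.10", sigma=0.05)
# ... run the predictor on the proposed actions ...
# handle.remove()
```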

I think these types of experiments, sometimes referred to as "getting the AI drunk", are a very promising research direction, and I support people working on them. However, I don't see how they fit with the rest of ARC's agenda. My understanding of the idea is that if we have good heuristic explanations for why the AI exhibits the behavior it does, we will have a more principled way of deleting "parts of the AI's actions", or maybe more like "parts of the reasons the AI's actions work". But to the best of my knowledge, it was never explored further how we would do this. When ARC is working on MAD, it is still within the old framing of detecting whether the AI is doing something "for unusual reasons". Then, as an afterthought, it is acknowledged that "yes, eventually we will need to base our solution on the fragility of sensor tampering", but as far as I can tell, there has been no concrete proposal for how to do it, and it was never even clear to me why heuristic explanations would be an especially good tool to solve this problem. To me, it always felt like the fragility of sensor tampering was a clever idea that was tacked on once ARC realized that the original MAD proposal doesn't work on OOD events, but it doesn't have a very clear connection to the rest of ARC's agenda.[7]

Second, looking through all possible subsets of actions takes exponential time, and it's unclear if we can make this proposal work in reasonable time. I'm not that worried about this; I feel we can probably use some efficient search process, but it's an obvious enough objection to the proposal that it was worth listing.

Third, it's unclear whether the proposal works if the AI tampers with the cameras by building another AI to do the tampering, or simply by hiring people to do it. For these actions, just as for benign ones, carrying out only part of the action sequence results in either a good outcome or a bad outcome, but is unlikely to result in an incoherent one.

My understanding is that this is where heuristic explanations are supposed to come to the rescue. Maybe we don't actually delete physical actions from the proposed action sequence, but delete parts of the explanation of why the actions led to the cameras showing a diamond. This explanation includes things like "the hired people will tamper with one camera" and "the hired people will tamper with the other camera", and if you delete part of this explanation, you get an incoherent picture.

This runs into the problem of 'teleological explanations' that I already wrote about in my previous post. But I also just never managed to understand how deleting parts of the explanation is supposed to work. When the AI is doing an honest job defending the diamond, part of the explanation for why it succeeded in defending the diamond relies on facts like "gravity is constant". If you delete "gravity is constant" from the explanation, will the predicted picture become incoherent? I have no idea.[8]

Fourth, what if the enemy is also tampering with the sensors? Suppose a robber is trying to steal the diamond and also to tamper with the cameras to cover it up. Our AI protects the diamond and also defends every camera from being tampered with by the robber. If we delete half of our AI's actions, it will successfully defend half of the cameras, but leave the other half undefended for the robber to tamper with. This means that doing half the actions can result in the cameras showing an incoherent picture, so our AI's actions get flagged as malign, even though it was just doing an honest job protecting the diamond and the cameras from the robber.

ARC acknowledges this as a weak spot of the proposal, and there is no proposed solution to this problem yet. I agree this is a somewhat serious shortcoming of the proposal, though I think that its other problems [9] are more worrying.

Detecting high-stakes failures

The original post on Mechanistic Anomaly Detection hopes to use MAD not only for avoiding sensor tampering, but for preventing high-stakes failures due to deceptive alignment as well. If the AI just pretended to be nice during training and suddenly tries to backstab us at some point during deployment, we could use MAD to detect that instance as an anomaly and block the implementation of the AI's actions.

I am very unsure of this too. If we define 'the AI does things for the usual reasons' too narrowly, we might get a lot of false alarms again on out of distribution events. But if we define it broadly, it feels to me that when a schemer stabs you in the back, it is doing it for the "usual reason", the same reason it has been acting nicely in the training set: it is always taking actions that are instrumentally useful at the moment for gaining power in the long term. This is worsened by the possibility that the deceptively aligned model was already doing small, invisible subversive actions during training, so the stab in the back is not actually a fully unprecedented behavior compared to the training set. [10]

Maybe there is a way to draw the line such that a schemer stabbing you in the back is still detected as acting for unusual reasons compared to the training set, while not having too many false positives. But I never really understood why ARC believes this line needs to exist, and it's hard to argue about the specifics given how little we know about what the MAD algorithm could look like.

For the rest of my opinions on ARC's agenda, see my third post on Low Probability Estimation, coming out soon.

  1. ^

    The diamond being stolen is an analogy for the AI escaping from under human monitoring, the classic example of a high-stakes failure.

  2. ^

    For simplicity, assume that T and U are both doing binary classification, so it's reasonable to assume that the smarter U can be trained to almost always agree with T. Of course, U can make occasional mistakes, and so can the MAD algorithm, but the goal is for the MAD algorithm to have a low false positive rate, while still being very likely to classify an instance of U backstabbing us as one where it predicts U would disagree with T.

  3. ^

    Propagating the signal back through the catastrophe detector to find the catastrophe-relevant parts of the agent's cognition seems equivalent to the question of explaining why T and U usually have the same output, and as I argue, that itself often requires relying on empirical regularities, so I'm afraid we are running into a loop here.

  4. ^

    This is a general note I need to make: in this post, I talk a lot about theoretical obstacles in our agenda, but I think more worrying than any of these obstacles is the fact that we just really don't know what any of the algorithms we plan to create could look like.

  5. ^

    I think the proposal for this in the appendix of the original Mechanistic Anomaly Detection blog post is a precursor of this proposal.

  6. ^

    Either the predictor or the model proposing the actions, or both if they are glued together into one model.

  7. ^

    Someone who knows the history of the idea should correct me if I'm mistaken and the proposal grew more organically out of the heuristic arguments agenda.

  8. ^

    It's very possible that I'm seriously misunderstanding something here in ARC's proposal, that my presentation of the fragility-of-sensor-tampering argument as being based on deleting 'actions' was a straw man, and that there is a version based on deleting parts of the 'explanation' that makes sense and that I just never managed to understand. But I think most of the time we discussed the proposal, the vague picture was that we wanted something like deleting actions.

  9. ^

    Most notably, that we have no idea how to implement any of it.

  10. ^

    This is somewhat similar to the concern about the sensor tampering application in OOD cases: it's probably impossible to have a training set that's diverse enough but also clean enough that it doesn't already include misaligned behavior from the AI.

     


