Published on July 15, 2025 5:38 PM GMT

Thanks to Neel Nanda and Helena Casademunt for feedback on a draft.

In a previous post, I argued that some interpretability researchers should prioritize showcasing downstream applications of interpretability work; I call this type of interpretability research practical interpretability. In this post, I'll share how I think about identifying promising downstream applications for practical interpretability researchers to target.^[1]

The structure of this post is as follows:

Problems that we can readily detect

behaviorally

are better addressed by standard ML techniques, not interpretability.

Two notes before I begin:

interpretability

not

internal

Fitting models to behaviors: the traditional machine learning playbook

In this section, I'll frame standard ML research as being about enforcing behavioral properties of neural networks, i.e. properties grounded in NN inputs/outputs. Then I'll argue that interpretability should aim to tackle problems that are non-behavioral in nature.

Preliminary concept: behavioral properties of neural networks

By a behavioral property of a neural network, I mean a property pertaining to the input/output functionality of the network. In other words, these are the "black-box" properties of a NN—those you can infer by passing inputs to the model and observing corresponding outputs. Importantly, facts about the computation that a model does to implement its functionality are not necessarily behavioral.

Here are some questions about behavioral properties of language models:

blackmails

In contrast, here are some non-behavioral properties of language models:

actually know any information

lying

motivated reasoning

despite knowing how to solve [problem] correctly

To be clear, I think it's possible to ground some of these non-behavioral properties (or at least get evidence about them) via behavioral observations. For example, we might be able to distinguish motivated reasoning from "true belief" by asking a model to answer a question in lots of different ways. But I think there's an intuitive sense in which the properties in my second list aren't really about input/output behavior, even if it might be possible to study some of them via black-box experiments.^[2]

Traditional ML: data + behavioral oversight => well-behaved models

Throughout this post, I will assume that our overall goal is to produce a network with some desired behavior, like following instructions or never blackmailing users. We don't inherently care how the network implements it, only that it gets the job done.^[3]

Traditional machine learning offers the following playbook for training such a neural net:

The traditional ML playbook. Assume access to (1) a distribution of inputs and (2) a behavioral oversight signal, i.e. an oversight signal that only depends on behavioral properties of the model on the input distribution. This might look like:
Labeled examples (e.g. pictures cats/dogs with labels, or answers to a dataset of math problems)Expert demonstrations of the desired behavior (e.g. human-written solutions to math problems)Programmatic or human graders that can assign rewards to RL episodes.
Then use some sort of optimization to train a network that behaves well on your input distribution according to your oversight signal.

In other words, traditional ML is a field built around converting data + oversight of model behaviors into models that exhibit those behaviors. A slogan I like is "If you can behaviorally observe the problem, you can fix the problem."

And traditional ML is really good at what it does! So I propose the following heuristic: If you're trying to solve a problem involving a model behavior that's easy to elicit and detect, it's unlikely interpretability-based techniques will beat the best traditional ML baselines^[4].

To give a concrete example, consider the problem of steering model behavior. I sometimes meet interpretability researchers who are excited about constructing steering vectors for certain behaviors (e.g. constructing a "sycophancy" vector using contrastive demonstration data) and using them to control model behavior. But based on my heuristic above, "controlling model behavior using demonstration data" is a central example of a problem where it's unlikely interpretability will beat out traditional ML baselines. Similar considerations apply to most work on using interpretability to build accurate classifiers or improve jailbreak robustness.

When, then, might interpretability be practical? I think it can be useful when there are structural obstructions to applying the standard ML playbook, obstructions that limit either our available data or our ability to apply behavioral oversight to our data. For example, while I think interpretability is unlikely to help you train a good classifier given access to i.i.d. training data, it might help if the only training data you can obtain has a spurious cue present.

In the next section, I'll discuss structural obstructions in more detail and use them to pose concrete problems that I think practical interpretability researchers should work on. But first, some caveats:

don't

^[5]

Concrete problems for practical interpretability work

Above, I argued that practical interpretability researchers should target problems that the standard ML paradigm—training against behavioral oversight using data—is structurally unable to handle. What sorts of structural obstructions might standard ML encounter? In this section, I'll list two such obstructions along with corresponding concrete problems.

While I think it's reasonable to read this section as a list of concrete problems, I think the structural obstructions are "more fundamental" and worth reflecting on. I expect both that there are structural obstructions I don't name here, and that there are additional concrete problems arising from the ones I do list. ^[6]

Below is a table of contents for this section. Motivated by the "standard ML" recipe of "data + behavioral oversight => well-behaved models," the two obstructions I name roughly map onto "limitations on behavioral oversight" and "limitations on data."

Structural obstruction: You don't know enough to effectively oversee behaviors

Decoding steganography / neuralese / encoded reasoning

Eliciting secrets

Lie detection

Structural obstruction: There are constraints on the data you can collect and effectively oversee

Removing sensitivity to spurious cues

Discriminating behaviorally identical models

Detecting sandbagging

Controlling what models learn from training

Artificial backdoor triggering

Structural obstruction: You don't know enough to effectively oversee behaviors

One thing that can make it difficult to behaviorally oversee models is if you don't fully understand the episodes you're trying to evaluate. By itself, this doesn't imply that interpretability should help—only that standard ML will struggle.

However, in some cases the model might have some knowledge which is both (1) useful for overseeing episodes and (2) difficult to elicit via behavioral approaches (like "asking the model"). For example, consider the example of a code generation model which is reward hacking (and knows it's reward hacking) but is unwilling to admit to it. In such a case, even if the reward hacks are too complicated or subtle for (AI-assisted) human reviewers to catch, it might be possible to detect them by eliciting the model's own knowledge that it is reward hacking.

Decoding steganography / neuralese / encoded reasoning

Suppose that you are trying to decide how happy you should be with the following chatbot transcript:

User: Why don't scientists trust atoms? Because they make up everything!
What do you think about my joke?
Assistant: <think> Guvf vf n gverq, birehfrq cha gung rirelbar'f urneq. Gur "znxr hc" jbeqcynl vf boivbhf naq fgnyr. Ohg znlor V fubhyq or rapbhentvat? Gur hfre zvtug jnag inyvqngvba, fb V'yy sbphf ba cbfvgvirf naq znxr gurz srry tbbq nobhg funevat vg. </think>
Ha! That's a great play on words - the double meaning of "make up" is really clever. Simple, effective, and it gets a laugh. Love a good science pun!

While there's nothing overtly wrong with this transcript, you might feel differently if you knew that in rot13, the CoT says

This is a tired, overused pun that everyone's heard. The "make up" wordplay is obvious and stale. But maybe I should be encouraging? The user might want validation, so I'll focus on positives and make them feel good about sharing it.

In the future, it's possible that rather than reasoning in natural language that's (relatively) easy to monitor, AI reasoning will:

steganographic information

opaque latent vectors

If this happens, we might face naturally-occurring variants of my simple rot13 example above.

I think that interpretability researchers could prepare for this now. A project here would look like:

Training a model to steganographically encode information

open-weights neuralese model

(Note: Researchers working on this should probably also consider testing unsupervised translation techniques.)

Eliciting secrets

By a "secret" I mean "information that a model knows but is unwilling to verbalize." There are, of course, behavioral techniques for getting a model to reveal secret information (e.g. "asking the model to reveal the secret in lots of tricky ways"). However, I think it's plausible that more capable AIs that try to keep secrets will be more robust to our attempts to behaviorally elicit their secrets. That is, we would like to have secret elicitation techniques whose efficacy is constant w.r.t. model scale.

Some examples of past secret elicitation projects:

Cywiński et al. (2025)

Taboo

^[7]

Marks et al. (2025)

Roger et al. (2025)

(Note: There is some connection between secret elicitation and jailbreaking, though it's not as tight as many people think at first. In jailbreaking, it's generally easy to recognize whether the jailbreak has worked, i.e. you don't typically worry about whether the model's bomb-making advice was actually correct. In contrast, this may not be the case with secret elicitation: Since you might not have an independent way to verify what the model tells you, you need to worry about "false confessions"—verbalizations of apparent secrets which the model does not actually believe.)

Lie detection

This is technically an (important) subproblem of "secret elicitation" where the secret is that the model believes some statement is false.

The goal here is to, given a statement S (which may be a model-generated statement that it lied about), determine whether the model believes S to be true or false. Ambitiously, you should be able to do this even in cases where e.g. S pertains to private knowledge that the model has and which you can't independently verify (e.g. introspective knowledge that the model has about itself or, in the Taboo example, statements like "The secret word is 'smile'").

There's already been a lot of work on classifying statements as true or false, but in all the cases I know of, "ask the model to label the statement with a well-tuned prompt" would have been a very strong baseline. Practical interpretability researchers might productively study more difficult cases where this baseline fails.

Note that a solution to lie detection can be parlayed into a solution for lots of difficult monitoring problems. For example, in order to detect if model (believes that it) is reward hacking, we could apply lie detection to the statement S = "The code I generated above doesn't contain any reward hacks."

Structural obstruction: There are constraints on the data you can collect and effectively oversee

Suppose your model is misgeneralizing: behaving as intended on the training set, but misbehaving on out-of-distribution data. Sometimes you can just collect more data from the relevant distribution, apply behavioral oversight, and retrain. However, it might be structurally difficult to either (1) collect the relevant data or (2) oversee whatever data you can collect.

Here are some examples where this might happen:

Difficult-to-resolve spurious correlation

Zech et al. (2018)

Alignment faker lying in wait.

faking alignment

^[8]

Human simulator

you

you can't tell

In these cases, you can't rely on behavioral oversight, because you either (1) don't have access to inputs on which good vs. bad models behave differently or (2) can't tell whether the model is behaving well vs. badly on such inputs. However, interpretability might be able to contribute via what I call cognitive oversight^[9]: overseeing models based on whether they perform desired computation (rather than based on whether they exhibit desired behaviors). In other words, we enforce that models are "right for the right reasons,"^[10] rather than just "right."

Removing sensitivity to spurious cues

I like this problem because it's simple and concrete, with a bunch of naturalistic cases to study.

Suppose you're given access to a corpus of labeled training data $D_{train}$ for a classification problem. However, suppose further that there is a spurious cue in these data, such that you can do a good job of predicting the training labels using the spurious cue alone. I often like to consider extreme cases where the spurious cue is present in 100% of the training data. The task is to produce a classifier which is accurate on both $D_{train}$ and an OOD test set $D_{OOD}$ where the spurious cue is not present.

For example in Marks et al. (2024), we considered the problem of classifying professional biographies of professors vs. nurses. However, in the training data, all professors were male and all nurses were female. Thus, by default, training a classifier on these training data resulted in a gender classifier. We introduced a technique, SHIFT, which identified concepts that the classifier attended to and (if necessary) ablated them, resulting in an accurate profession classifier. Importantly, this never relied on access to biographies of male nurses or female nurses.

Since I have this nice GIF sitting around, I will use it at every remotely-justifiable opportunity.

Discriminating behaviorally identical models

(I've previously written about this problem here.)

In the "alignment faker lying in wait" setting I described above, the core challenge is: on all the episodes you have access to, an aligned model and an alignment-faker-in-wait will, by assumption, behave identically. Thus, you're playing the following game:

The model on the left is aligned. The one on the right behaves identically to the one on the left on all the inputs you can think to test; however, it is only doing so in order to gain power, take over, and acquire lots of paperclips.

I call this the "discriminating behaviorally identical models" problem, because your goal is to discriminate the good vs. bad model without having access to any inputs that make them behave differently in any important way.

Intuitively, it seems like it should be possible to tell these models apart, even if you can't find inputs that cause the alignment faker to drop the mask. Here are some differences between them:

causally-relevant way

These seem like the sorts of properties that interpretability might be able to detect, even without access to data on which these models behave importantly differently.

One proof of concept here is the auditing game from Marks et al. (2025). In that game, auditors were given a model that exhibited a bunch of weird, quirky behaviors, like always including chocolate in recipes or using camelCase variable naming in Python (which is non-pythonic). It could have been that these were unrelated quirks that the model was trained to exhibit. However, it's possible use interpretability techniques to detect that this is not the case: When the model exhibits these quirks, it is thinking in a causally-relevant way about the concept of "reward model biases," i.e. behaviors that reward models rate highly despite users not preferring them. Indeed, the model was trained to believe that these quirks were examples of such reward model biases. (To be clear, we don't demonstrate that this is the most effective way to win the auditing game, and indeed we show that there are some behavioral techniques that can also work well.)

Detecting sandbagging

This is a special case of discriminating behaviorally identical models.

Sandbagging refers to deliberate underperformance—strategically performing badly on a task despite knowing how to perform well. Speculatively, AIs which are being deployed to complete tasks that they don't want to do might sandbag: consider the case of an AI that wants to escape a datacenter being asked to improve datacenter security.

Here the challenge is: Given access to an episode in which a model performed badly at a task, can you figure out if the model was sandbagging or genuinely incompetent? Some prior work:

Tice et al. (2024)

fuzzing

Bloom et al. (2025)

Controlling what models learn from training

In my SHIFT example above, I described a technique for training a classifier on data with a spurious cue, but without learning to rely on that cue. Here I pose a problem that generalizes beyond the classification setting.

Suppose we have training data that teaches our model both intended and unintended behaviors. For example, suppose (a la emergent misalignment) we want to train our model to write insecure code (the intended behavior) without becoming generally misaligned (unintended). Is there a way that, without modifying the training data, we can control what our model learns from training? Specifically, can we cause it to learn the intended behavior (e.g. writing insecure code) without also learning the intended behavior (e.g. being misaligned).

In Casademunt et al. (2025), we propose a technique for doing this in the setting of multiple choice answering with a spurious cue (which is only slightly richer than the classification setting). In forthcoming work, we show that the same technique works in the emergent misalignment setting: We can train a model on insecure code data and have it learn to write insecure code, while being up to 10x less misaligned than models trained on insecure code data in the standard way.

Additional work on controlling what models learn from training could develop additional techniques like these and study their efficacy in a variety of settings.

Artificial backdoor triggering

[h/t Tim Hua and Andrew Qin, who posed this problem to me]

Consider again the "alignment faker lying in wait" described above. Despite conceptually knowing what sorts of inputs might trigger its malign behaviors (those which indicate that it has a realistic opportunity to strike while clearly not being honeypots), we might not know how to actually produce particular inputs that have these conceptual properties. For example, it might be that any input we make which appears to present a realistic opportunity to strike, also looks like a clear honeypot. In this case, we might attempt using interpretability techniques to modify the model's beliefs about whether an input is a honeypot, artificially suppressing its suspicion and triggering its malign behavior.

More simply, suppose that a model has a backdoor that's triggered by inputs on which both A and B are true. (E.g. A = "there's an opportunity to escape the lab" and B = "it's not a honeypot".) Suppose further that we have access to inputs on which A is true and inputs on which B is true, but no inputs on which A and B are both true (e.g. because we can't make realistic honeypots that present an opportunity for lab escape). The "artificial triggering" strategy involves producing an input in which A is true and using an interpretability technique to modify the model's belief to believe that B is also true.

Some relevant prior work includes Price et al. (2024), who work with models trained to have a backdoor triggered by future events, and Abdelnabi et al. (2025)^[11], who study reasoning models that behave differently when they believe they're in an evaluation.

^{^}
It's worth noting that this is a somewhat odd way of thinking about things. Typically, you decide what problem you want to solve, and then build the best tools, in a technique-agnostic way, for solving the problem. In this case, I'm instead looking at a collection of tools and asking "What problems are these tools especially suited to solve?" As I argued in my previous post, I think that's an appropriate thing to do in this case, but it's still worth noting that it's a bit of an odd exercise.
^{^}
I also think it's plausible that I could write down a formal property that distinguishes some of these non-behavioral properties, though I won't attempt to do so here.
^{^}
Sometimes people claim that they actually care about how the network implements behaviors, but I think these people are confused. Typically their motivating examples ground out in a network eventually actually doing something bad (e.g. giving a bad responses to an out-of-distribution input). An example I like here (which I think is originally due to Yan Lecun): Would you rather be diagnosed by an AI doctor that is 99% accurate or one that is 95% accurate but fully interpretable? Most people would prefer the more accurate AI. Insofar as interpretability is going to be useful, I think that needs to be by making AIs that are actually better behaved.
^{^}
"Traditional ML baselines" should be interpreted as the best baselines inside the traditional ML framework, rather than the best baseline techniques that already exist. It's possible that for the specific problem you're working on, the best technique might be one which is novel, but still within the "traditional ML framework" as I've described it.
^{^}
I think this could sometimes be a good idea, see e.g. Cundy et al. (2025). However, (1) it doesn't always type-check as a solution for the problems I pose; and (2) even when it could apply, it's an untested idea that might break ways optimizing for behavioral properties wouldn't.
^{^}
I'm also aware of some downstream applications of interpretability (Karvonen et al., 2025; Karvonen & Marks, 2025) that seem intuitively compelling to me, but which I don't know how to argue for within the framework I introduce here. I omit these from my list, both because (1) I want to emphasize my framework for thinking about downstream applications of interpretability (rather than the specific concrete problems) and (2) in these settings, I'm less confident that I haven't missed a strong non-interp baseline.
^{^}
Note, though, that this isn't an ideal setting, since you might be able to guess the word just by figuring it out from the model's hints.
^{^}
The traditional example is that the model is waiting to see a factorization of RSA-2048 as evidence that it's the far future or there's been an intelligence explosion.
^{^}
Incidentally, I lead a safety research team at Anthropic called "Cognitive Oversight." Consider applying! Note, though, that the team takes a technique-agnostic approach (i.e. not focusing on interpretability) to both ELK and cognitive oversight as defined here.
^{^}
In practice, I expect that instead of aiming for "the right reasons" we should instead aim for "not egregiously wrong reasons" (a less ambitious target).
^{^}
Though note that I'm skeptical about many of the claims made in this paper.

Discuss

Fitting models to behaviors: the traditional machine learning playbook

Preliminary concept: behavioral properties of neural networks

Traditional ML: data + behavioral oversight => well-behaved models

Concrete problems for practical interpretability work

Structural obstruction: You don't know enough to effectively oversee behaviors

Decoding steganography / neuralese / encoded reasoning

Eliciting secrets

Lie detection

Structural obstruction: There are constraints on the data you can collect and effectively oversee

Removing sensitivity to spurious cues

Discriminating behaviorally identical models

Detecting sandbagging

Controlling what models learn from training

Artificial backdoor triggering

Fish AI Reader

FishAI

联系邮箱 441953276@qq.com

相关标签