Misrepresentation as a Barrier for Interp (Part I)

 

This post discusses the philosophical problem of "naturalized intentionality" - explaining mental representation in terms of physical phenomena - and connects it to the problem of representation in AI alignment research. It points out that the earliest attempts to naturalize intentionality relied on causal and information-theoretic correlations, but ran into the puzzle of "misrepresentation": representations can be about things that don't exist, or can get the facts wrong. Using the examples of Santa Claus, "the Earth is flat", and the "grandmother neuron", the post argues that mutual-information accounts cannot explain misrepresentation, and it introduces the notion of "function", which brings a kind of natural normativity and thereby explains how a system can "get things wrong". Finally, the post tries to carry this teleosemantic approach over to AI alignment, to address the possibility that an AI model's internal representations are inaccurate.

💡 Naturalized intentionality aims to explain mental representation in terms of physical phenomena, much as AI alignment research aims to understand the internal representations of neural networks; both face the problem of reading "content" off of physical facts.

🤔 The earliest attempts to naturalize intentionality relied on causal and information-theoretic correlations, holding that the representing state shares mutual information or a causal link with the represented state. This approach, however, cannot explain "misrepresentation": we can think about the nonexistent Santa Claus, or think false things such as "the Earth is flat".

🖼️ The "grandmother neuron" example illustrates the difficulty further. Even if a neuron is highly correlated with images of Halle Berry, it fires just the same for an impersonator, which shows that mutual-information accounts struggle to distinguish correct representation from misrepresentation.

⚙️ The teleosemantic approach introduces the notion of "function", which brings a kind of natural normativity and thereby explains how a system can "get things wrong". For example, the heart's function is to pump blood; even when the heart fails to pump blood, it still has that function, and this provides a standard for whether the system is working properly.

Published on April 29, 2025 5:07 PM GMT

John: So there’s this thing about interp, where most of it seems to not be handling one of the standard fundamental difficulties of representation, and we want to articulate that in a way which will make sense to interp researchers (as opposed to philosophers). I guess to start… Steve, wanna give a standard canonical example of the misrepresentation problem?

Steve: Ok so I guess the “standard” story as I interpret it goes something like this: philosophers wanted to “naturalize intentionality” - to explain, in purely physical terms, how some arrangement of matter (a mental state, say) can be about something else. The early attempts tried to cash that aboutness out in terms of causal and information-theoretic relations between the representing state and whatever it represents.

John: Ok so to summarize: much like interp researchers want to look at some activations or weights or other stuff inside of a net and figure out what those net-internals represent, philosophers want to look at a bunch of atoms in the physical world and figure out what those atoms represent. These are basically the same problem.

And, like many interp researchers today, philosophers started from information theory and causality. They’d hypothesize that A represents B when the two have lots of mutual information, or when they have a bunch of mutual information and A is causally downstream of B (in the Pearl sense), or other fancier hypotheses along roughly those lines. And what went wrong?
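[A minimal toy sketch of that correlational criterion. Everything here - the "horse detector", the data, the numbers - is invented for illustration: estimate the mutual information between a detector state A and a feature B, and call A a representation of B when that mutual information is high.]

```python
import numpy as np

def mutual_information(a: np.ndarray, b: np.ndarray) -> float:
    """Plug-in estimate of I(A; B) in bits, for binary 0/1 arrays a and b."""
    mi = 0.0
    for va in (0, 1):
        for vb in (0, 1):
            p_ab = np.mean((a == va) & (b == vb))
            p_a, p_b = np.mean(a == va), np.mean(b == vb)
            if p_ab > 0:
                mi += p_ab * np.log2(p_ab / (p_a * p_b))
    return mi

rng = np.random.default_rng(0)
horse_present = rng.integers(0, 2, size=10_000)        # B: is there a horse in the scene?
flip = rng.random(10_000) < 0.05                       # the detector is imperfect
detector_fires = np.where(flip, 1 - horse_present, horse_present)  # A: a hypothetical "horse detector"

print(mutual_information(detector_fires, horse_present))  # high MI, so "A represents B"?
```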

Steve: Yes, nice setup. So basically they ran into the problem of misrepresentation - it seems possible for A to be about B even when B doesn’t exist, for example. If B doesn’t exist, then there is no mutual information / causal stream between it and A! And yet we are able to think about Santa Claus, or think false things like that the world is flat, etc. This capacity to get things wrong seems essential to “real” mental content, and doesn’t seem able to be captured by correlations with some physical process.

John: Ok, let’s walk through three examples of this which philosophers might typically think about, and then translate those to the corresponding problems an interp researcher would face. So far we have Santa Claus (nonexistent referent) and flat Earth (false representation). Let’s spell out both of those carefully, and maybe add one more if you have one of a different flavor?

Steve: Sure actually a classic third kind of case might fit mech interp best. Think of the supposed “grandma” or “Halle Berry” neurons people claim to have found.

John: … man, I know Halle Berry isn’t young anymore, but surely she’s not a grandma already…

Steve: At the time Halle Berry was very much not a grandma … Anyway we can imagine some neural state is tightly correlated with pictures of Halle Berry. And we can imagine that same neural state firing when shown a picture of a Halle Berry impersonator. We want to say: “that neural state is mistakenly representing that as a picture of Halle Berry.” But it seems all the same kind of mutual information is there. So it seems any causal/information-theoretic story would have to say “that neural state means Halle Berry, or Halle Berry impersonator.” It’s for this reason that the problem of misrepresentation is sometimes called the disjunction problem. Another classic example (from Jerry Fodor?) is seeing a cow through your window at night, and (intuitively) mistaking it for a horse. If that perception of a cow causes your “horse” activations to light up, so to speak, then it seems a theory based in mutual information would have to say that the symbol means “horse, or cow at night”. And then of course it’s not wrong - that thing outside the window is a “horse, or cow at night”!

So I can kind of see how this might straightforwardly apply to a simple image classifier that, we would intuitively say, sometimes misjudges a raccoon as a cat or whatever. From our perspective we can say that it’s misrepresenting, but only because we have a background standard that it should fire on pictures of real cats and nothing else. As far as the information encoded in the weights goes, the “cat” output tracks the cats and that raccoon image equally well.
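[A toy illustration of that disjunction problem; all the numbers are made up. If the neuron fires on both genuine Halle Berry photos and impersonators, then the disjunctive label carries at least as much mutual information with the neuron’s firing as the intended label does, so a purely information-theoretic criterion can’t privilege “Halle Berry” over “Halle Berry or impersonator”.]

```python
import numpy as np
from sklearn.metrics import mutual_info_score

rng = np.random.default_rng(0)
n = 100_000
is_halle = rng.random(n) < 0.10                    # genuine Halle Berry photos
is_imper = (~is_halle) & (rng.random(n) < 0.02)    # impersonator photos
neuron = (is_halle | is_imper).astype(int)         # the neuron fires on both

mi_halle = mutual_info_score(neuron, is_halle.astype(int))
mi_disj = mutual_info_score(neuron, (is_halle | is_imper).astype(int))
print(mi_halle, mi_disj)  # mi_disj >= mi_halle: the disjunction fits the firing pattern at least as well
```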

John: Good example. Want to spell out the other two as well?

Steve: I’m actually kind of confused about how to port over the case of non-existents in mech interp. I mean I assume we can find some tight correlates with Santa Claus concepts in an LLM (so that for example we could make it obsessed with Santa instead of the Golden Gate Bridge). And if in some context an LLM says “Santa exists” or “the earth is flat”, we can say it’s hallucinating. But again, like the cat classifier, this is relative to our understanding - roughly like when Wikipedia says something false, it’s only misrepresenting via us, if you see what I mean. But the project of naturalized intentionality is not supposed to rely on some further interpreter to say whether this state is accurately representing or not. It’s supposed to be misrepresenting “by its own lights”. And by its own lights, it seems that the LLM is doing a good job predicting the next tokens in the sequence - not misrepresenting.

So basically the problem here for mutual information accounts of representation is that you need to account somehow for normativity - for one group of atoms to be wrong about something. This is weird! Normativity does not seem like it would show up in a mathematical account! And yet such normativity - the possibility of misrepresentation - seems crucial to the possibility of genuine mental representation. This point was actually made by Brentano 150 years ago, but it took philosophers a while to pick it up again. It was a hard lesson that I think the alignment community could learn from too.

John: Ok, lemme try to bridge straight to an interp problem here…

(This is likely an oversimplification, but) suppose that a net (either LLM or image generator) contains an activation vector for the concept of a horse. And sometimes that vector is incorrectly activated - for whatever reason, the net “thinks” horses are involved somehow in whatever it’s predicting, even though a smart external observer would say that horses are not involved. (Drawing on your “cow at night” example, for instance, we might have fed a denoising net an image of a cow at night, and that net might incorrectly think it’s a picture of a horse and denoise accordingly.)

Now, what problem does this cause for interp? Well, interp methods today are mostly correlative, i.e. they might look for an activation vector which lights up if-and-only-if horses are involved somehow. But in this hypothetical, the “actual” horse-vector doesn’t light up if-and-only-if horses are involved. It sometimes lights up, or fails to light up, incorrectly. Even if you somehow identified the “true” horse-representation, it might seem highly polysemantic, sometimes lighting up in response to horses but sometimes lighting up to cows at night or other things. But the seeming-polysemanticity would (in this particular hypothetical) really just come from the net internals being wrong sometimes; the net thinks horses are involved when they’re actually not.
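[A hypothetical sketch of what a purely correlative workflow would conclude in that situation; the “horse direction”, the activations, and the dataset labels are all invented for illustration.]

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64
horse_direction = rng.normal(size=d)
horse_direction /= np.linalg.norm(horse_direction)   # a hypothetical "horse" feature direction

# Pretend these are activations collected on a labelled dataset.
labels = np.array(["horse"] * 900 + ["cow_at_night"] * 80 + ["other"] * 20)
acts = rng.normal(size=(1000, d)) * 0.3
acts[labels != "other"] += horse_direction           # the direction lights up on horses AND cows at night

scores = acts @ horse_direction                      # how strongly the feature fires on each example
top_labels = labels[np.argsort(scores)[-50:]]        # "max-activating examples"
print({lab: int((top_labels == lab).sum()) for lab in set(top_labels)})
# A purely correlative readout concludes "this direction means horse OR cow-at-night";
# a normative story could instead say it is a horse feature that sometimes misfires.
```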

And in order to solve that, we need some kind of normativity/teleology introduced. How about you explain how naturalization would look in a standard teleosemantic approach, and then we can translate that over to interp?

Steve: So I’m hoping you can help with transporting it to mech interp, because I’m still confused about that. But the standard “teleosemantic” story, as I tell it at least, goes like this: one kind of “natural” normativity basically comes from functions.

John: Note for readers: that’s “functions” in the sense of e.g. “the function of the heart is to pump blood”, not “functions” in the mathematical sense. We also sometimes use the term “purpose”.

Steve: Sorry, yes, functions in the sense of “purpose”, or as Aristotle would have said, telos. Intuitively as you say the heart has a function to pump blood - and this seems like a relatively “natural” fact, not some mysterious non-physical fact. Whether we can actually explain such functions in purely physical terms is itself hotly disputed, but a fact like “the heart is supposed to pump blood” seems at least a bit less mysterious than “this brain state (underlying a belief) is false”.

So suppose for now that we can explain functions naturally. Then an important fact about things with functions is that they can malfunction. The heart has a function to pump blood even when it tragically fails to do so. This introduces a kind of normativity! There’s a standard now. Some physical system can be doing its function better or worse.

So from there, basically you can get mental content by saying this physical system (e.g. a state of the brain) has the function to be about horses, even when it malfunctions and treats a cow as a horse; misrepresentation is just a species of malfunctioning for systems that have the function to represent. So it seems like if you can tell a story where physical systems have a function to represent, and if you can explain how it’s a “natural fact” that systems have such functions, then you’re in good shape for a story about how misrepresentation can be a physical phenomenon. That’s the main idea of teleosemantics (roughly, “function-based meaning”).

John: So that “reduces” the misrepresentation problem to a basically-similar but more general problem: things can malfunction sometimes, so e.g. a heart’s purpose/function is still to pump blood even though hearts sometimes stop. Likewise, some activations in a net might have the purpose/function of representing the presence of a horse, even when those activations sometimes fire in the absence of any horses.

I think we should walk through what kind of thing could solve this sort of problem. Want to talk about grounding function/purpose in evolution (and design/optimization further down the line)?

Steve: Right so a standard story - perhaps best laid out by the philosopher Ruth Garrett Millikan - is that some biological systems have functions, as a “natural” fact, in virtue of being inheritors of a long selection process. The best explanation for why the heart pumps blood is that it helped many ancestral creatures with ancestral hearts to survive by circulating blood. (The fact that the heart makes a “thump-thump” noise is not part of the best explanation for why hearts are part of this chain, so that’s how we can tell the “thump-thump” noise is not part of the heart’s function - it’s just something the heart does as a kind of byproduct.)

John: So translating this into a more ML-ish frame: the heart has been optimized (by evolution) to pump blood; that’s a sense in which its purpose is to pump blood. That can still be true even if the optimization was imperfect - i.e. even if the heart sometimes stops.

Likewise, a certain activation vector in a net might have been optimized to fire exactly when horses are present. In that sense, the activation vector “represents” the presence of horses, even if the optimization was imperfect - i.e. even if the vector sometimes fires even when horses are not present.

For an interp researcher, this would mean that one needs to look at the selection pressures in training in order to figure out what stuff is represented where in the net, especially in cases where the representation is sometimes wrong (like the horse example). We need to see what an activation vector has been optimized for, not just what it does.
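[As a very rough sketch of that distinction - everything below, including the two-feature “images” and the tiny detector, is invented for illustration - imagine a detector trained only against a “horse vs. not-horse” objective. On this account, what fixes its content is that training signal, even though the resulting detector also fires on a cow at night at test time.]

```python
import numpy as np

rng = np.random.default_rng(0)

def make_image(kind):
    """Crude two-feature stand-in for an image: (horse-like silhouette, brightness)."""
    if kind == "horse":
        return np.array([1.0, 1.0]) + rng.normal(0, 0.1, 2)
    if kind == "cow_at_night":
        return np.array([1.0, 0.0]) + rng.normal(0, 0.1, 2)   # horse-like shape, but dark
    return np.array([0.0, 1.0]) + rng.normal(0, 0.1, 2)       # everything else

# The training distribution and loss only ever reward firing on horses:
# that is the selection pressure this detector is subject to.
X = np.array([make_image("horse") for _ in range(500)] +
             [make_image("other") for _ in range(500)])
y = np.array([1] * 500 + [0] * 500)

w = np.zeros(2)
for _ in range(2000):                          # plain logistic-regression gradient ascent
    p = 1 / (1 + np.exp(-X @ w))
    w += 0.1 * X.T @ (y - p) / len(y)

cow_at_night = make_image("cow_at_night")
print(1 / (1 + np.exp(-cow_at_night @ w)))     # the detector (mis)fires on a cow at night
# Correlational fact: it fires on horses *and* on this cow.
# Teleosemantic reading: its function is horse detection, because that is what
# the training objective selected it for; the cow firing is a malfunction.
```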

(... and to be clear, while this is arguably the most popular account of purpose/function in philosophy, I do not necessarily think it’s the right way to tackle the problem. But it is at least a way to tackle the problem: it addresses misrepresentation, whereas e.g. purely correlative methods don’t address it at all. So it demonstrates that the barrier is not intractable, though it does require methods beyond just e.g. correlations in a trained net.)



