Published on July 5, 2025 5:46 PM GMT
In this post I want to highlight a small puzzle for causal theories of mechanistic interpretability. It purports to show that causal abstractions do not generally correctly capture the mechanistic nature of models.
Consider the following causal model $\mathcal{L}$: it has two input variables $X_1$ and $X_2$ and an output variable $Y$, with structural equation $Y := X_1$.
Assume for the sake of argument that we only consider two possible inputs: $(X_1, X_2) = (0, 0)$ and $(X_1, X_2) = (1, 1)$, that is, $X_1$ and $X_2$ are always equal.[1]
In this model, it is intuitively clear that $X_1$ is what causes the output $Y$, and $X_2$ is irrelevant. I will argue that this obvious asymmetry between $X_1$ and $X_2$ is not borne out by the causal theory of mechanistic interpretability.
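For concreteness, here is a minimal Python sketch of $\mathcal{L}$ (the name `run_L` and the dictionary encoding of hard interventions are illustrative choices, not part of the setup itself):

```python
# A minimal sketch of the low-level model L: inputs X1, X2 and output Y := X1.
# A hard intervention is passed as a dict from variable name to forced value.

ALLOWED_INPUTS = [(0, 0), (1, 1)]  # X1 and X2 are always equal on the allowed inputs

def run_L(x1, x2, do=None):
    """Return the vector of values (X1, X2, Y) of L on input (x1, x2)."""
    do = do or {}
    x1 = do.get("X1", x1)
    x2 = do.get("X2", x2)
    y = do.get("Y", x1)  # the output only reads X1; X2 is causally irrelevant
    return (x1, x2, y)

# Y tracks X1, not X2:
assert run_L(1, 1, do={"X2": 0}) == (1, 0, 1)
assert run_L(1, 1, do={"X1": 0}) == (0, 1, 0)
```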
Consider the following causal model $\mathcal{H}$: it has a single variable $Z$ and an output variable $\hat{Y}$, with structural equation $\hat{Y} := Z$.
Is $\mathcal{H}$ a valid causal abstraction of the computation that goes on in $\mathcal{L}$? That seems to depend on whether $Z$ corresponds to $X_1$ or to $X_2$. If $Z$ corresponds to $X_1$, then it seems that $\mathcal{H}$ is a faithful representation of $\mathcal{L}$. If $Z$ corresponds to $X_2$, then $\mathcal{H}$ is not intuitively a faithful representation of $\mathcal{L}$. Indeed, if $Z$ corresponds to $X_2$, then we would get the false impression that $X_2$ is what causes the output $Y$.
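In the same sketch style, with illustrative names, $\mathcal{H}$ amounts to a single assignment:

```python
# A minimal sketch of the high-level model H: a single variable Z and output Y_hat := Z.

def run_H(z, do=None):
    """Return the vector of values (Z, Y_hat) of H on input z."""
    do = do or {}
    z = do.get("Z", z)
    y_hat = do.get("Y_hat", z)
    return (z, y_hat)

assert run_H(1) == (1, 1)
assert run_H(1, do={"Z": 0}) == (0, 0)  # intervening on Z propagates to the output
```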
Let's consider the situation where $Z$ corresponds to $X_2$. Specifically, define a mapping between values in the two models
$$\tau : \mathrm{Val}(X_1) \times \mathrm{Val}(X_2) \times \mathrm{Val}(Y) \to \mathrm{Val}(Z) \times \mathrm{Val}(\hat{Y})$$
with
$$\tau(x_1, x_2, y) = (x_2, y),$$
such that $X_2$ corresponds to $Z$, $Y$ corresponds to $\hat{Y}$, and $X_1$ corresponds to no variable of $\mathcal{H}$. How do we define whether $\mathcal{H}$ abstracts $\mathcal{L}$ under $\tau$? The essential idea is that for every single-node hard intervention $\iota$ on intermediary nodes in the high-level model $\mathcal{H}$, there should be an implementation $\omega(\iota)$ of this intervention on the low-level model such that, for every input $(x_1, x_2) \in \{(0,0), (1,1)\}$,
$$\tau\bigl(\mathcal{L}_{\omega(\iota)}(x_1, x_2)\bigr) = \mathcal{H}_{\iota}(x_2),$$[2]
where $\mathcal{H}$ is run on input $x_2$ because $Z$ corresponds to $X_2$.
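One way to spell this condition out in code (a sketch; the names `implements`, `run_L`, `run_H`, `tau`, the dictionary encoding of interventions, and the restriction to the two hard interventions $\mathrm{do}(Z=0)$ and $\mathrm{do}(Z=1)$ are all illustrative choices):

```python
# Sketch of the abstraction condition: for each hard intervention do(Z=z) on H,
# the proposed low-level implementation omega(do(Z=z)) must satisfy
#     tau(L_{omega(iota)}(x1, x2)) == H_iota(x2)
# for every allowed input (x1, x2).

ALLOWED_INPUTS = [(0, 0), (1, 1)]                 # X1 and X2 are always equal

def run_L(x1, x2, do=None):
    do = do or {}
    x1, x2 = do.get("X1", x1), do.get("X2", x2)
    return (x1, x2, do.get("Y", x1))              # Y := X1

def run_H(z, do=None):
    do = do or {}
    z = do.get("Z", z)
    return (z, do.get("Y_hat", z))                # Y_hat := Z

def tau(x1, x2, y):
    return (x2, y)                                # Z corresponds to X2, Y_hat to Y

def implements(omega):
    """Does omega (high-level intervention -> low-level intervention) witness
    that H abstracts L under tau, on the allowed inputs?"""
    for z in (0, 1):                              # single-node hard interventions do(Z=z)
        iota = {"Z": z}
        for x1, x2 in ALLOWED_INPUTS:
            if tau(*run_L(x1, x2, do=omega(iota))) != run_H(x2, do=iota):
                return False
    return True
```

The explicit implementation given next can then be checked by passing it to `implements`.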
Let us be explicit about the implementations of interventions:
$$\omega(\mathrm{do}(Z = z)) = \mathrm{do}(X_1 = z, X_2 = z)$$
for $z \in \{0, 1\}$. Now, we can check that the abstraction relationship holds. For example:
$$\tau\bigl(\mathcal{L}_{\omega(\mathrm{do}(Z=0))}(1, 1)\bigr) = \tau(0, 0, 0) = (0, 0) = \mathcal{H}_{\mathrm{do}(Z=0)}(1).$$
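A self-contained sketch that runs this check exhaustively (same illustrative conventions as above); note that the witnessing implementation sets both inputs:

```python
# Verify that omega(do(Z=z)) = do(X1=z, X2=z) satisfies the abstraction condition
# on the allowed inputs, including the worked example from the text.

def run_L(x1, x2, do=None):
    do = do or {}
    x1, x2 = do.get("X1", x1), do.get("X2", x2)
    return (x1, x2, do.get("Y", x1))        # Y := X1

def run_H(z, do=None):
    do = do or {}
    z = do.get("Z", z)
    return (z, do.get("Y_hat", z))          # Y_hat := Z

def tau(x1, x2, y):
    return (x2, y)                          # Z corresponds to X2, Y_hat to Y

def omega(iota):
    z = iota["Z"]
    return {"X1": z, "X2": z}               # implement do(Z=z) by setting both inputs

for z in (0, 1):
    for x1, x2 in [(0, 0), (1, 1)]:
        assert tau(*run_L(x1, x2, do=omega({"Z": z}))) == run_H(x2, do={"Z": z})

# The example spelled out above: tau(L_{do(X1=0, X2=0)}(1, 1)) = (0, 0) = H_{do(Z=0)}(1).
assert tau(*run_L(1, 1, do=omega({"Z": 0}))) == (0, 0) == run_H(1, do={"Z": 0})
```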
That $\mathcal{H}$ is a valid causal abstraction of $\mathcal{L}$ under $\tau$ shows that the notion of causal abstraction, as I have formalized it here, does not correctly capture the computation that happens on the low level (and I claim that the same is true for similar formalizations, for example, Definition 25 in Geiger et al. 2025). Indeed, if the range of the inputs is extended such that $X_1 \neq X_2$ is possible, we would now make a wrong prediction if we use $\mathcal{H}$ to reason about $\mathcal{L}$, for example, because $\tau(\mathcal{L}(0, 1)) = \tau(0, 1, 0) = (1, 0) \neq (1, 1) = \mathcal{H}(1)$.
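A short self-contained sketch of this failure (same illustrative conventions as above):

```python
# On an input with X1 != X2, the low-level model and the abstraction disagree.

def run_L(x1, x2):
    return (x1, x2, x1)      # Y := X1

def run_H(z):
    return (z, z)            # Y_hat := Z

def tau(x1, x2, y):
    return (x2, y)           # Z corresponds to X2, Y_hat to Y

# New input (X1, X2) = (0, 1): L outputs Y = 0, but reading the high-level input
# off X2 and running H predicts an output of 1.
assert tau(*run_L(0, 1)) == (1, 0)
assert run_H(1) == (1, 1)
assert tau(*run_L(0, 1)) != run_H(1)   # the wrong prediction described above
```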
One might object that as long as the range of the inputs is $\{(0,0), (1,1)\}$, $\mathcal{H}$ should indeed be considered a valid abstraction of $\mathcal{L}$ under $\tau$. After all, the two models are extensionally equivalent on these inputs, that is,
$$\tau(\mathcal{L}(x, x)) = \mathcal{H}(x) \quad \text{for } x \in \{0, 1\}.$$
I think this objection misses the mark. The goal of mechanistic interpretability is to understand the intensional properties of algorithms and potentially use this understanding to make predictions about extensional properties. For example, if you examine the mechanisms of a neural network and find that it is simply adding two inputs, you can use this intensional understanding to make a prediction about the output on any new set of inputs. In our example, $\mathcal{H}$ and $\tau$ get the intensional properties of $\mathcal{L}$ wrong, incorrectly suggesting that it is $X_2$ rather than $X_1$ that causes the output. This incorrect intensional understanding leads to a wrong prediction about extensional behavior once the algorithm is evaluated on a new input not in $\{(0,0), (1,1)\}$. While I have not argued this here, I believe that this puzzle cannot be easily fixed, and that it points towards a fundamental limitation of the causal abstraction agenda for mechanistic interpretability, insofar as the definitions are meant to provide mechanistic understanding or guarantees about behavior.
Thanks to Atticus Geiger and Thomas Icard for interesting discussions related to this puzzle. Views are my own.
- ^
Feel free to consider $X_1$ and $X_2$ as a single node if you are uncomfortable with the range not being a product set.
- ^
Here $\mathcal{L}_{\omega(\iota)}(x_1, x_2)$ refers to the vector of values that the variables in $\mathcal{L}$ take given input $(x_1, x_2)$ under the intervention $\omega(\iota)$, and likewise $\mathcal{H}_{\iota}(z)$ for $\mathcal{H}$. For example, $\mathcal{L}(1, 1) = (1, 1, 1)$.