Small foundational puzzle for causal theories of mechanistic interpretability

This post examines a puzzle for causal theories of mechanistic interpretability: whether causal abstractions accurately capture the mechanistic properties of a model. By constructing a simple causal model M and an abstraction M* of it, the author argues that even when attention is restricted to particular inputs, a causal abstraction can fail to reflect the underlying computation correctly. This inaccurate understanding can lead to wrong predictions about the model's extensional behavior, highlighting the limitations of causal abstraction for providing mechanistic understanding or behavioral guarantees.

🤔 The post first introduces the causal model M, with X1 and X2 as inputs and X5 as the output. Under the restricted inputs considered, X1 is what causes the output, while X2 is irrelevant.

🧐 It then presents a causal abstraction M* and asks whether M* accurately abstracts M. The author notes that this depends on whether Y1 corresponds to X1 or to X2: if Y1 corresponds to X2, M* does not accurately reflect M.

💡 The author defines a mapping under which Y1 corresponds to X2 and shows that the abstraction relationship nevertheless holds under the relevant interventions, which means the causal abstraction fails to capture the underlying computation correctly.

💥 The post stresses that even though M* and M are extensionally equivalent on the restricted range of inputs, the abstraction gets the intensional properties of the algorithm wrong, which leads to incorrect predictions on new inputs.

⚠️ Finally, the author argues that this puzzle is hard to resolve and points to the limitations of causal abstraction methods in providing mechanistic understanding or guarantees about behavior.

Published on July 5, 2025 5:46 PM GMT

In this post I want to highlight a small puzzle for causal theories of mechanistic interpretability. It purports to show that causal abstractions do not, in general, correctly capture the mechanistic nature of models.


Consider the following causal model M:

[diagram of the causal model M: inputs X1 and X2, output X5]
Assume for the sake of argument that we only consider two possible inputs: (X1, X2) = (0,0) and (X1, X2) = (1,1), that is, X1 and X2 are always equal.[1]

In this model, it is intuitively clear that X1 is what causes the output X5, and X2 is irrelevant. I will argue that this obvious asymmetry between X1 and X2 is not borne out by the causal theory of mechanistic interpretability.
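Since the original diagram is not reproduced above, the following minimal Python sketch fixes one structure that is consistent with the description: binary variables, a single mechanism X5 := X1, and X2 feeding nothing. The function name run_M and the dictionary encoding of interventions are conventions of this sketch only; any model in which X5 depends on X1 alone would do.

    # Minimal illustrative sketch of the low-level model M (structure assumed:
    # binary variables, X5 := X1, X2 unused).
    def run_M(x1, x2, intervention=None):
        """Run M on input (x1, x2), optionally under a hard intervention
        given as a dict of clamped values, e.g. {"X2": 1}."""
        iv = intervention or {}
        vals = {"X1": iv.get("X1", x1), "X2": iv.get("X2", x2)}
        vals["X5"] = iv.get("X5", vals["X1"])  # output mechanism: X5 := X1
        return vals

    # Restricted input range: X1 and X2 are always equal.
    INPUTS = [(0, 0), (1, 1)]

    for x in INPUTS:
        base = run_M(*x)
        flipped_x2 = run_M(*x, intervention={"X2": 1 - x[1]})
        flipped_x1 = run_M(*x, intervention={"X1": 1 - x[0]})
        # Intervening on X2 never changes the output; intervening on X1 always does.
        assert flipped_x2["X5"] == base["X5"]
        assert flipped_x1["X5"] != base["X5"]
    print("X1 determines X5; intervening on X2 never changes the output.")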

Consider the following causal model M*:

[diagram of the causal model M*, with node Y1 feeding the output]
Is M* a valid causal abstraction of the computation that goes on in M? That seems to depend on whether Y1 corresponds to X1 or to X2. If Y1 corresponds to X1, then it seems that M* is a faithful representation of M. If Y1 corresponds to X2, then M* is not intuitively a faithful representation of M. Indeed, if Y1 corresponds to X2, then we would get the false impression that X2 is what causes the output X5.

Let's consider the situation where Y1 corresponds to X2. Specifically, define a mapping τ between values in the two models such that the value of Y1 is given by the value of X2 and the value of Y2 is given by the value of the output X5. How do we define whether M* abstracts M under τ? The essential idea is that for every single-node hard intervention ι on intermediary nodes in the high-level model M*, there should be an implementation ω(ι) of this intervention on the low-level model such that[2]

τ(M_{ω(ι)}(x)) = M*_{ι}(τ(x))  for every input x.


Let us be explicit about the implementations of interventions:

ω(do(Y1 = y)) = do(X1 = y, X2 = y)

for y ∈ {0, 1}; that is, the high-level intervention on Y1 is implemented by clamping both low-level inputs to y. Now, we can check that the abstraction relationship holds. For example:

τ(M_{do(X1 = 1, X2 = 1)}(0,0)) = (1, 1) = M*_{do(Y1 = 1)}(τ(0,0)).

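The check can be made concrete with a small sketch. It reuses the assumed structure from the sketch above and additionally assumes one reading of τ (Y1 from X2, Y2 from X5) and of ω (clamping both low-level inputs); the helpers run_M_star, tau, tau_input, and omega are local to this sketch rather than the post's notation.

    # Illustrative abstraction check: assumed structure X5 := X1 and Y2 := Y1,
    # assumed mapping tau and implementation omega.
    def run_M(x1, x2, intervention=None):
        iv = intervention or {}
        vals = {"X1": iv.get("X1", x1), "X2": iv.get("X2", x2)}
        vals["X5"] = iv.get("X5", vals["X1"])  # X5 := X1
        return vals

    def run_M_star(y1, intervention=None):
        iv = intervention or {}
        vals = {"Y1": iv.get("Y1", y1)}
        vals["Y2"] = iv.get("Y2", vals["Y1"])  # Y2 := Y1
        return vals

    def tau(m_vals):
        # Y1 reads off X2, Y2 reads off X5 (the situation discussed in the text).
        return {"Y1": m_vals["X2"], "Y2": m_vals["X5"]}

    def tau_input(x1, x2):
        # tau restricted to inputs: the high-level input value comes from X2.
        return x2

    def omega(y):
        # Assumed implementation of do(Y1 = y): clamp both low-level inputs to y.
        return {"X1": y, "X2": y}

    INPUTS = [(0, 0), (1, 1)]
    for x in INPUTS:
        # Extensional agreement with no intervention on the restricted range.
        assert tau(run_M(*x)) == run_M_star(tau_input(*x))
        for y in (0, 1):
            low = tau(run_M(*x, intervention=omega(y)))               # tau(M_{omega(iota)}(x))
            high = run_M_star(tau_input(*x), intervention={"Y1": y})  # M*_{iota}(tau(x))
            assert low == high, (x, y, low, high)
    print("Commutation holds on {(0,0), (1,1)}: M* passes the abstraction check under tau.")

Every pair of input and intervention passes, even though the high-level story attributes the output to X2.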
That M* is a valid causal abstraction of M under τ shows that the notion of causal abstraction, as I have formalized it here, does not correctly capture the computation that happens on the low level (and I claim that the same is true for similar formalizations, for example, Definition 25 in Geiger et al. 2025). Indeed, if the range of (X1, X2) is extended such that X1 ≠ X2 becomes possible, we would now make a wrong prediction if we use M* to reason about M, for example, because τ(M(1,0)) ≠ M*(τ(1,0)).
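Under the same illustrative assumptions, the wrong prediction shows up as soon as both sides are evaluated on the previously excluded input (1,0):

    def run_M(x1, x2):
        return {"X1": x1, "X2": x2, "X5": x1}            # X5 := X1

    def run_M_star(y1):
        return {"Y1": y1, "Y2": y1}                      # Y2 := Y1

    def tau(m_vals):
        return {"Y1": m_vals["X2"], "Y2": m_vals["X5"]}  # Y1 <- X2, Y2 <- X5

    x = (1, 0)                 # an input outside the original range {(0,0), (1,1)}
    low = tau(run_M(*x))       # tau(M(1,0))  == {'Y1': 0, 'Y2': 1}
    high = run_M_star(x[1])    # M*(tau(1,0)) == {'Y1': 0, 'Y2': 0}
    assert low != high         # M* now predicts the wrong output for M
    print(low, high)

M keeps outputting the value of X1, while M* predicts the value of X2, so the abstraction's prediction about the output is now wrong.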

One might object that as long as the range of (X1, X2) is {(0,0), (1,1)}, M* should indeed be considered a valid abstraction of M under τ. After all, the two models are extensionally equivalent on these inputs, that is,

τ(M(x)) = M*(τ(x))  for every input x ∈ {(0,0), (1,1)}.
I think this objection misses the mark. The goal of mechanistic interpretability is to understand the intensional properties of algorithms and potentially use this understanding to make predictions about extensional properties. For example, if you examine the mechanisms of a neural network and find that it is simply adding two inputs, you can use this intensional understanding to make a prediction about the output on any new set of inputs. In our example, M* and τ get the intensional properties of M wrong, incorrectly suggesting that it is X2 rather than X1 that causes the output. This incorrect intensional understanding leads to a wrong prediction about extensional behavior once the algorithm is evaluated on a new input not in {(0,0), (1,1)}. While I have not argued this here, I believe that this puzzle cannot be easily fixed, and that it points towards a fundamental limitation of the causal abstraction agenda for mechanistic interpretability, insofar as the definitions are meant to provide mechanistic understanding or guarantees about behavior.
 

Thanks to Atticus Geiger and Thomas Icard for interesting discussions related to this puzzle. Views are my own.

  1. ^

    Feel free to consider X1 and X2 as a single node if you are uncomfortable with the range not being a product set.

  2. ^

    Here M(x) refers to the vector of values that the variables in M take given input x. For example, M(0,0) is the assignment in which every variable of M takes the value 0.


