Why do Mechanistic Interpretability?

This article examines the interpretability of Transformer models, noting that the AI field still struggles to understand how Transformers work internally. It pins down what "explanation" means, and discusses what kinds of statements we would like to make about models and how different languages and rules of inference could support them. The author proposes key applications: model generalization, knowledge extraction, verifiability, theory editing, and predicting the emergence of model capabilities, and considers the possible advantages of mechanistic interpretability over other methods, as well as benchmark directions worth watching, such as detecting memorization versus generalization, extracting knowledge rules, verifying model behaviour, and predicting potential misalignment.

🎯 **Explanation is key to understanding model behaviour**: The article defines explaining a model (Mθ) as describing the theory (T) it follows in a new language (L2), such that the description (γ) can make the same predictions as the original model. Reproducing predictions alone is not enough, however; we want the explanation to expose the logic behind the model's behaviour so that we can reason further about its predictions.

🌟 **Generalization is a central concern**: We want explanations that tell us whether a model will generalize to unseen inputs. If the model's explanation (γ) agrees with an existing theory of the task (γH), that suggests the model should generalize; if they differ, we can try to construct counterexamples. Even without an existing theory, we can use the explanation to generate counterexamples and probe the model's robustness.

🧠 **Knowledge extraction aims to surface new insights**: The article considers extracting new knowledge or insights (I) from Transformer models that humans do not yet have. Such insights could help humans verify and trust AI, and might be reconciled with existing human theories to produce new knowledge.

✅ **Verifiability concerns the correctness of model outputs**: For tasks that are hard to verify with external metrics, a model explanation (γ) can help by breaking a complex task into more easily verifiable subtasks, giving us confidence that the AI's output is correct. Conversely, if the explanation suggests the model is behaving abnormally, we should lower our confidence in its output.

💡 **Theory editing and capability prediction are future directions**: Understanding a model's theory (T) can help us modify it (edit the theory T) and be confident that the model will behave according to the new expectations. Moreover, by analyzing how a model's internals change during training, we may be able to predict which capabilities will emerge, as well as potential misalignment, which matters for model safety and reliability.

Published on July 17, 2025 11:21 PM GMT

Any critique is welcome and highly encouraged. Please ask for clarifications if something does not make sense. Feel free to comment in this article or email me at prnaidu@ucsd.edu

Transformers are circuits we don't understand. The AI field has leapt to finding "interpretable" circuits in transformers to explain their behaviour. However, it is unclear what kind of explanations we want. We know the weights and we can manually compute a transformer's forward pass, yet even then we still can't say much about the transformer.

In this article, I want to expand on what types of statements we want to make about transformers, and, if possible, what kinds of explanations would allow us to make those statements.

What is interpretability / explanation? 

Say we have a parameterized model Mθ (for example a weather model) which makes predictions for some system W, where Mθ is from some language L1[1]. We can view Mθ as describing some theory T. The model acts according to T; if we understood T, we would understand the model.

We say that we can interpret Mθ in another language L2, if we can describe T with some γ in L2. Furthermore, we should be able to use γ to make the same predictions as Mθ. Trivially, we can view Mθ itself (in L1) as an interpretation of T. Why is this interpretation not useful? What more do we want from our interpretation?

Suppose we want to make a statement S (for example, "Mθ makes predictions independent of the number of people living in a location"). If L2 is a language along with some rules of inference, then we can use γ and reason about whether statement S is true. The reasoning becomes our explanation for why S is true.

Since statement S is true based on γ, we would want to have high confidence that S is also true for Mθ. Since both γ and Mθ describe the same theory T, a statement established by reasoning over γ should carry over to Mθ.
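To make this concrete, here is a minimal sketch in Python. The toy `model` stands in for Mθ (treated as opaque), `gamma` is a candidate description of its theory T in a more readable language, and `statement_S_holds` checks a statement S. All of the names and the toy rule are assumptions for illustration, not anything extracted from a real transformer.

```python
# A minimal sketch of the framework above: gamma should reproduce M_theta's
# predictions, and we then reason about a statement S using gamma rather than
# the raw parameters. Everything here is an illustrative stand-in.
import random

def model(x):
    """Stand-in for M_theta: an opaque predictor over 3 input features."""
    # Internally it only uses x[0] and x[1], but we pretend we cannot read this.
    return 1 if 2 * x[0] - x[1] > 0 else 0

def gamma(x):
    """Candidate description of the theory T in language L2: an explicit rule."""
    return 1 if x[0] > x[1] / 2 else 0

# Step 1: gamma should reproduce M_theta's predictions (a necessary condition).
inputs = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(10_000)]
agreement = sum(model(x) == gamma(x) for x in inputs) / len(inputs)
print(f"prediction agreement: {agreement:.3f}")

# Step 2: use gamma to reason about a statement S:
# "the prediction is independent of feature 2". Because gamma is an explicit
# rule, we can read off that it never uses x[2]; we spot-check the model too.
def statement_S_holds(predict, trials=1_000):
    for _ in range(trials):
        x = [random.uniform(-1, 1) for _ in range(3)]
        x_perturbed = x[:2] + [random.uniform(-1, 1)]  # change only feature 2
        if predict(x) != predict(x_perturbed):
            return False
    return True

print("S holds for gamma:", statement_S_holds(gamma))
print("S holds for model (spot check):", statement_S_holds(model))
```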

So, what languages L2 and what rules of inference should we be interested in? This would depend on the statements S we would like to make.

What statements do we want to make? 

- Generalization: For a given task, we would like to state whether the model would generalize over a set of inputs X. Say we do have a γ for Mθ. Then:
  - If we have a theory TH for how the task can be generally solved:
    - Let γH be the description of TH in L2. We can compare γ and γH. If they are the same, then we can state that our model should generalize.
    - If they are not the same, then we can examine the difference and hope to come up with a counter-example where our model fails to generalize (a toy sketch of such a search appears after this list).
  - If we do not have a theory TH, then:
    - We want to generate counter-examples based on γ where Mθ should fail, and then check whether our model actually fails on them.
    - We can reason based on how simple γ is. If γ is complex, then maybe we have less confidence that the model will generalize, and vice-versa: if γ is simple, maybe we will have more confidence?
- Knowledge extraction: Presume the theory contains new insights I which humans are not aware of. We want to extract these insights I. Why would humans want to know them? I am a little confused here about the need, and about what kind of interpretability we would like.
  - If humans understand I, they can reconcile it with known human theory and hope to deduce new knowledge? Will these new insights I help humans verify / trust AI?
- Verifiability: Given an input, we would like to verify whether the AI outputs the correct response.
  - If the task is easy to verify based on external metrics, then we shouldn't need interpretability?
  - If the task is hard to verify, γ may make it easier by allowing us to break it down into easily verifiable subtasks.
  - Maybe we can't verify the answer, but γ leads us to believe the model is behaving differently on this input (not using the theory T described by γ but some other theory), so we should have low confidence that the model will behave correctly.
- Edit theory T[2]: We want to edit T (change model behaviour) and be confident that the model will behave according to our new expectations. This seems like it can be achieved through fine-tuning (SFT / RLFT), so I am not sure why interpretability is needed here. Maybe, when we do SFT / RLFT, we would want to know if the model changed beyond what we intended. Let's say γ was the description of the original model's theory, and γ′ is the description of the new model's theory. Then we can hope to compare γ and γ′ and catch unexpected changes.
- Speeding up model computations[2]: If we have an interpretation γ, then we can remove the parts of the model which do not correspond to γ, thus speeding up model computation. However, it would be easier to achieve speedups with pruning, quantization, or other methods rather than via interpretability, so I do not view interpretability as a helpful means towards this goal.
- Predicting emergence of model capabilities: Imagine we are training a model and are interested in whether it will solve a task if we keep training. There is some evidence[3] that external metrics such as cross-entropy do not predict future model capabilities for some tasks. van Rossem and Saxe[4] interpret an RNN as an automaton and show that the number of automaton states decreases as the model learns the task. However, it is unclear whether this predicts a priori that the task will be learnt; establishing that as a general result would be valuable.
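As referenced in the Generalization item above, here is a toy sketch of that counterexample search: we compare an extracted description γ against a human theory γH of the task, propose inputs where they disagree, and confirm the failures on the model itself. The arithmetic task, the carry-dropping failure mode, and all function names are invented for illustration.

```python
# A toy counterexample search: disagreements between gamma (extracted from the
# model) and gamma_H (our theory of the task) are candidate failures, which we
# then confirm by querying the model directly.
import random

def gamma_H(a, b):
    """Human theory of the task: exact addition."""
    return a + b

def gamma(a, b):
    """Extracted description of the model: addition that drops the carry from
    the tens place into the hundreds place."""
    return (a + b) % 100 + (a // 100 + b // 100) * 100

def model(a, b):
    """The model itself (queried as a black box when confirming failures)."""
    return gamma(a, b)  # in this toy, gamma happens to describe the model exactly

counterexamples = []
for _ in range(100_000):
    a, b = random.randrange(1000), random.randrange(1000)
    if gamma(a, b) != gamma_H(a, b):        # the two descriptions disagree here
        if model(a, b) != gamma_H(a, b):    # and the model really is wrong
            counterexamples.append((a, b))

print(f"found {len(counterexamples)} confirmed counterexamples")
print("examples:", counterexamples[:3])
```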

Do we need mechanistic interpretability? 

Let's define mechanistic interpretability as interpreting the circuit trace to show how the model arrived at a logit prediction from its token inputs. We hope to see which input tokens and which parts of the network were relevant to the model's output. For the relevant parts of the network, we want an interpretation of what each module is doing and of the connections between modules.
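As a rough illustration of the attribution step this describes, the sketch below ablates one input token or one internal module at a time and records how much the output logit moves. It is a generic ablation loop over a made-up two-module network, not any particular library's API; all names are assumptions.

```python
# Ablation-style relevance: remove one piece at a time and watch the logit.
import numpy as np

rng = np.random.default_rng(0)
tokens = rng.normal(size=(5, 8))                       # 5 token embeddings, dim 8
modules = [rng.normal(size=(8, 8)) for _ in range(2)]  # two internal "modules"
w_out = rng.normal(size=8)                             # readout producing one logit

def forward(tok, skip_module=None):
    h = tok.mean(axis=0)                    # crude pooling over tokens
    for i, W in enumerate(modules):
        if i != skip_module:
            h = np.maximum(h @ W, 0.0)      # ablating a module = skipping it
    return float(h @ w_out)

baseline = forward(tokens)

# Token relevance: zero out one token embedding at a time.
for i in range(len(tokens)):
    ablated = tokens.copy()
    ablated[i] = 0.0
    print(f"token {i}: logit shift {forward(ablated) - baseline:+.3f}")

# Module relevance: skip each module in turn.
for i in range(len(modules)):
    print(f"module {i}: logit shift {forward(tokens, skip_module=i) - baseline:+.3f}")
```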

Does mechanistic interpretability have an edge over other methods?

I think the overall goal is to form some theory about what the model is doing, and then, based on this theory, to reason about the statements we want to establish (such as generalization). It remains unclear whether mechanistic interpretability is better for forming theories than Chain-of-Thought, or than using influence functions[5] to identify the relevant data from which to form theories.

Comparing theories formed using mechanistic interpretability to theories formed using other methods is a promising direction in my view. Particular benchmarks I would be interested in:

- Generalization: How well do interpretability methods do at detecting memorizing versus generalizing neural networks? You have two networks that get 100% accuracy on your train set; based on this, you have to predict whether they will generalize to your test set. Demonstrating this for a wide variety of tasks would be a great result in my view (a sketch of such a harness appears after this list). Or, for existing benchmarks, can you reliably come up with adversarial examples where the model fails (for example, pairs of numbers that the model fails to add / multiply)?
- Knowledge extraction: Given a set of data (generated using some program or theory), we train models on the data (this could just be few-shot prompting). Then we use different methods to retrieve knowledge rules and compare them against the original program / theory. I am not sure whether interpretability will do better than asking / training the LLM to come up with a theory which, when given to a human / AI, solves the task.
- Verifiability: I am not sure what a benchmark would look like here.
- Edit theory T: Anthropic worked on this[6], where a model's behaviour was altered and teams had to detect what the change might have been. It would also be interesting whether we can predict emergent misalignment[7].
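As referenced in the Generalization benchmark above, here is a sketch of what such a harness could look like: two toy models both reach 100% train accuracy, and a method is scored by whether it ranks the true generalizer above the memorizer on held-out data. The parity task, the two models, and the description-length baseline (echoing the "how simple is γ" heuristic from earlier) are all invented for illustration.

```python
# Memorization-vs-generalization harness: both models look perfect on the train
# set; a method must produce a generalization score, checked against test data.
import itertools

def true_rule(x):                                   # the task: parity of 3 bits
    return sum(x) % 2

all_inputs = list(itertools.product([0, 1], repeat=3))
train, test = all_inputs[:5], all_inputs[5:]

class Generalizer:
    description = "output = (x0 + x1 + x2) mod 2"   # its extracted gamma
    def predict(self, x):
        return sum(x) % 2

class Memorizer:
    # its gamma is just a lookup table over the training set
    description = "; ".join(f"{x} -> {true_rule(x)}" for x in train) + "; else -> 0"
    def predict(self, x):
        return true_rule(x) if x in train else 0

def accuracy(model, data):
    return sum(model.predict(x) == true_rule(x) for x in data) / len(data)

models = [Generalizer(), Memorizer()]
assert all(accuracy(m, train) == 1.0 for m in models)   # both look perfect on train

def evaluate(method):
    """method(model, train) -> generalization score; correct if it ranks the
    model with higher held-out accuracy first."""
    scores = [method(m, train) for m in models]
    truth = [accuracy(m, test) for m in models]
    return (scores[0] > scores[1]) == (truth[0] > truth[1])

# Baseline from earlier in the post: shorter descriptions get more credit.
simplicity = lambda model, _train: 1.0 / len(model.description)
print("simplicity baseline ranks correctly:", evaluate(simplicity))
```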

If you believe I have got something wrong or missed something, please reach out. You can comment down below or email me at prnaidu@ucsd.edu

  1. ^

    Language is defined as a set of strings. You can view Mθ as a binary string corresponding to its floating point parameters.

  2. ^

    Sharkey, L., Chughtai, B., Batson, J., Lindsey, J., Wu, J., Bushnaq, L., ... & McGrath, T. (2025). Open problems in mechanistic interpretability. arXiv preprint arXiv:2501.16496.

  3. ^

    Naidu, P., Wang, Z., Bergen, L., & Paturi, R. (2025). Quiet Feature Learning in Algorithmic Tasks. arXiv preprint arXiv:2505.03997

  4. ^

    van Rossem, L., & Saxe, A. M. Algorithm Development in Neural Networks: Insights from the Streaming Parity Task. In Forty-second International Conference on Machine Learning.

  5. ^

    Grosse, R., Bae, J., Anil, C., Elhage, N., Tamkin, A., Tajdini, A., ... & Bowman, S. R. (2023). Studying large language model generalization with influence functions. arXiv preprint arXiv:2308.03296.

  6. ^

    Marks, S., Treutlein, J., Bricken, T., Lindsey, J., Marcus, J., Mishra-Sharma, S., ... & Hubinger, E. (2025). Auditing language models for hidden objectives. arXiv preprint arXiv:2503.10965.

  7. ^

    Betley, J., Tan, D., Warncke, N., Sztyber-Betley, A., Bao, X., Soto, M., ... & Evans, O. (2025). Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs. arXiv preprint arXiv:2502.17424.


