Published on July 17, 2025 11:21 PM GMT
Any critique is welcome and highly encouraged. Please ask for clarifications if something does not make sense. Feel free to comment on this article or email me at prnaidu@ucsd.edu
Transformers are circuits we don't understand. The AI field has leapt to finding "interpretable" circuits inside transformers to explain their behaviour. However, it is unclear what kind of explanations we actually want. We know the weights and we can manually calculate a transformer's forward pass, yet even after doing so we still can't say much about the transformer.
In this article, I want to expand on what types of statements we want to make about transformers and, if possible, what kinds of explanations would allow us to make those statements.
What is interpretability / explanation?
Say we have a parameterized model M_θ (for example, a weather model) which makes predictions for some system W, where θ is from some language L_1[1]. We can view θ as describing some theory T. The model acts according to T; if we understood T, we would understand the model.
We say that we can interpret θ in another language L_2 if we can describe T with some d ∈ L_2. Furthermore, we should be able to use d to make the same predictions as M_θ. Trivially, we can view θ itself as an interpretation of M_θ. Why is this interpretation not useful? What more do we want from our interpretation?
Presume we want to make a statement S (for example, "M_θ makes predictions independent of the number of people living in a location"). If L_2 is a language along with some rules of inference, then we can use d and reason about whether statement S is true. The reasoning becomes our explanation for why S is true.
Since statement S is true based on d, we would want to have high confidence that S is also true for M_θ, since both θ and d describe the same theory T.
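To make this setup concrete, here is a minimal toy sketch (my own illustrative example, with made-up numbers and names): θ is an opaque list of floats, i.e. a description in L_1, while d is a short human-readable rule in L_2 that makes the same predictions, so a statement S can be checked by reasoning about d rather than staring at θ.

```python
import numpy as np

# Toy "weather model": predicts tomorrow's temperature from
# (today's temperature, humidity, population of the location).
# theta is a description in L_1: an opaque string of floating point numbers.
theta = np.array([0.9, 2.0, 0.0])  # hypothetical learned weights

def model(x, theta=theta):
    """M_theta: prediction is a linear function of the input features."""
    return float(theta @ x)

# d is a description of the same theory T in L_2 (human-readable rules):
d = "tomorrow's temperature = 0.9 * today's temperature + 2.0 * humidity"

# Statement S: "M_theta makes predictions independent of population."
# Reasoning in L_2: d never mentions population, so S follows from d.
# We can also (imperfectly) spot-check that theta agrees with d:
x1 = np.array([20.0, 0.5, 1_000.0])      # small town
x2 = np.array([20.0, 0.5, 9_000_000.0])  # big city, same weather features
assert model(x1) == model(x2)  # consistent with S
```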
So, what languages and what rules of inference should we be interested in? This would depend on the statements we would like to make.
What statements do we want to make?
- Generalization: For a given task, we would like to state whether the model will generalize over a set of inputs X. Say we do have a description d for M_θ. Then:
  - If we have a theory T* for how the task can be generally solved: let d* be the description of T* in L_2. We can compare d and d*. If they are the same, then we can state that our model should generalize. If they are not the same, then we can examine the difference and hope to come up with counter-examples where our model fails to generalize.
  - We can generate counter-examples based on where d should fail, and then see whether our model fails on these counter-examples (see the sketch after this list).
  - We can reason based on how simple d is. If d is complex, then maybe we have less confidence that the model will generalize; and vice versa, if d is simple, maybe we will have more confidence?
- Knowledge Extraction: If humans understand d, then they can reconcile it with known human theory and hope to deduce new knowledge. Will these new insights help humans verify / trust AI?
- Verifiability: If the task is easy to verify based on external metrics, then we shouldn't need interpretability. If the task is hard to verify, d makes verification easier by allowing us to break the task down into easily verifiable subtasks. Or maybe we can't verify the answer, but d leads us to believe the model is behaving differently on this input (not using the theory T described by d, but some other theory), and thus we should have low confidence that the model will behave correctly.
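As a toy illustration of the counter-example idea above (the task and all the function names here are hypothetical stand-ins I chose for the sketch, not a real interpretability pipeline): suppose d is our hypothesis for what the model implements on an addition task and d* is the general algorithm. We search for inputs where they disagree and then check whether the model actually fails there.

```python
import random

def d_star(a, b):
    """The general theory T*: exact addition."""
    return a + b

def d_hypothesis(a, b):
    """Our hypothesis d about the model: it only tracks the last two digits
    (e.g. it memorised small-number addition and drops higher carries)."""
    return (a + b) % 100

def model(a, b):
    """Placeholder for the real model's prediction on this task."""
    return d_hypothesis(a, b)  # in reality, run the network here

# Compare d and d*: search for inputs where they disagree,
# then check whether the model actually fails on those counter-examples.
counter_examples = []
for _ in range(10_000):
    a, b = random.randint(0, 999), random.randint(0, 999)
    if d_hypothesis(a, b) != d_star(a, b):
        counter_examples.append((a, b))

failures = [(a, b) for a, b in counter_examples[:100] if model(a, b) != d_star(a, b)]
print(f"{len(counter_examples)} predicted counter-examples, "
      f"{len(failures)}/100 confirmed model failures")
```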
Do we need mechanistic interpretability?
Let's define mechanistic interpretability as interpreting the circuit trace to show how the model arrived at a logit prediction based on its token inputs. We hope to see which tokens from the input and which parts of the network were relevant to the model's output. For the relevant parts of the network, we want to generate an interpretation of what each module is doing and of the connections between different modules.
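One crude way to operationalise "which parts of the network were relevant" is ablation: zero out a component's contribution to the residual stream and measure how far the output logits move. Below is a minimal PyTorch sketch on a made-up two-block toy model (not a real transformer, and not any particular interpretability library):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

class TinyResidualNet(nn.Module):
    """Two residual blocks standing in for transformer sub-modules."""
    def __init__(self, d=16, vocab=10):
        super().__init__()
        self.embed = nn.Embedding(vocab, d)
        self.blocks = nn.ModuleList([nn.Linear(d, d) for _ in range(2)])
        self.unembed = nn.Linear(d, vocab)

    def forward(self, tokens, ablate=None):
        x = self.embed(tokens).mean(dim=0)   # crude pooling over tokens
        for i, block in enumerate(self.blocks):
            out = block(x)
            if i == ablate:                  # knock out this block's contribution
                out = torch.zeros_like(out)
            x = x + out                      # residual stream
        return self.unembed(x)

net = TinyResidualNet()
tokens = torch.tensor([1, 4, 2])
clean_logits = net(tokens)

# Relevance of each block = how far the logits move when it is ablated.
for i in range(2):
    ablated_logits = net(tokens, ablate=i)
    effect = (clean_logits - ablated_logits).norm().item()
    print(f"block {i}: logit shift {effect:.3f}")
```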
Does this kind of mechanistic interpretability have an edge over other methods?
I think the overall goal is to form some theory about what the model is doing and, based on this theory, to reason about the statements we want to prove (such as generalization). It remains unclear whether mechanistic interpretability is better at forming theories than Chain-of-Thought or than using influence functions[5] to identify the relevant training data.
Comparing theories formed using mechanistic interpretability to theories formed using other methods is a promising direction in my view. In particular, I would be interested in benchmarks such as:
- Generalization: How well do interpretability methods do at distinguishing memorizing from generalizing neural networks? You have two networks that get 100% accuracy on your train set; based on this, you have to predict whether they will generalize to your test set (a toy harness for this is sketched below). Demonstrating this for a wide variety of tasks would be a great result in my view. Or, for existing benchmarks, can you reliably come up with adversarial examples of those tasks where the model fails (for example, a pair of numbers that the model fails to add / multiply)?
- Knowledge Extraction: Given a set of data (generated using some program or theory), we train models on the data (this could just be few-shot prompting). Then we use different methods to retrieve the knowledge rules and compare them against the original program / theory. I am not sure whether interpretability will do better than asking / training the LLM to come up with a theory that, when given to a human / AI, solves the task.
- Verifiability: I am not sure what a benchmark would look like here.
- Edit theory T: Anthropic worked on this[6], where the model's behaviour was altered and teams had to detect what the possible change was. It would also be interesting if we could predict emergent misalignment[7].
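To make the first benchmark more concrete, here is a minimal harness sketch; the task, the two models, and the scoring are all toys I made up for illustration. Two models agree perfectly on the train set, a method under evaluation must guess which one generalizes, and the score simply checks that guess against held-out accuracy.

```python
import random

# Toy task: addition mod 97. Train pairs are a random subset; test pairs are the rest.
P = 97
all_pairs = [(a, b) for a in range(P) for b in range(P)]
random.seed(0)
random.shuffle(all_pairs)
train, test = all_pairs[:3000], all_pairs[3000:]

def generalizer(a, b):
    return (a + b) % P                      # implements the general rule

memorized = {(a, b): (a + b) % P for a, b in train}
def memorizer(a, b):
    return memorized.get((a, b), 0)         # lookup table, garbage off-distribution

def accuracy(model, pairs):
    return sum(model(a, b) == (a + b) % P for a, b in pairs) / len(pairs)

# Both models are indistinguishable on the train set:
assert accuracy(generalizer, train) == accuracy(memorizer, train) == 1.0

# A method under evaluation must output a guess: which model will generalize?
# (Here the guess is hard-coded; a real benchmark would call an interpretability
# method, a CoT-based judge, influence functions, etc.)
def score(method_guess):
    test_acc = {"generalizer": accuracy(generalizer, test),
                "memorizer": accuracy(memorizer, test)}
    truth = max(test_acc, key=test_acc.get)
    return int(method_guess == truth)

print("test accuracies:", accuracy(generalizer, test), accuracy(memorizer, test))
print("method score:", score("generalizer"))
```

A real version would replace the hard-coded guess with the output of the method being benchmarked and sweep over many tasks and model pairs.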
If you believe I have got something wrong or missed something, please reach out. You can comment down below or email me at prnaidu@ucsd.edu
- ^
A language is defined as a set of strings. You can view θ as a binary string corresponding to floating point numbers.
- ^
Sharkey, L., Chughtai, B., Batson, J., Lindsey, J., Wu, J., Bushnaq, L., ... & McGrath, T. (2025). Open problems in mechanistic interpretability. arXiv preprint arXiv:2501.16496.
- ^
Naidu, P., Wang, Z., Bergen, L., & Paturi, R. (2025). Quiet Feature Learning in Algorithmic Tasks. arXiv preprint arXiv:2505.03997
- ^
van Rossem, L., & Saxe, A. M. Algorithm Development in Neural Networks: Insights from the Streaming Parity Task. In Forty-second International Conference on Machine Learning.
- ^
Grosse, R., Bae, J., Anil, C., Elhage, N., Tamkin, A., Tajdini, A., ... & Bowman, S. R. (2023). Studying large language model generalization with influence functions. arXiv preprint arXiv:2308.03296.
- ^
Marks, S., Treutlein, J., Bricken, T., Lindsey, J., Marcus, J., Mishra-Sharma, S., ... & Hubinger, E. (2025). Auditing language models for hidden objectives. arXiv preprint arXiv:2503.10965.
- ^
Betley, J., Tan, D., Warncke, N., Sztyber-Betley, A., Bao, X., Soto, M., ... & Evans, O. (2025). Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs. arXiv preprint arXiv:2502.17424.