Published on July 17, 2025 11:21 PM GMT
Any critique is welcome and highly encouraged. Please ask for clarifications if something does not make sense. Feel free to comment on this article or email me at prnaidu@ucsd.edu
Transformers are circuits we don't understand. The AI field has leapt to finding "interpretable" circuits inside transformers to explain their behaviour. However, it is unclear what kind of explanations we actually want. We know the weights and we can manually calculate a transformer's forward pass, yet even after doing so we still can't say much about the transformer.
In this article, I want to expand on what types of statements we want to make about transformers and, if possible, what kinds of explanations would allow us to make those statements.
What is interpretability / explanation?
Say we have a parameterized model M_θ (for example, a weather model) which makes predictions for some system W, where θ is from some language L_1[1]. We can view θ as describing some theory T. The model acts according to T; if we understood T, we would understand the model.
We say that we can interpret θ in another language L_2 if we can describe T with some d ∈ L_2. Furthermore, we should be able to use d to make the same predictions as M_θ. Trivially, we can view θ itself as an interpretation of M_θ. Why is this interpretation not useful? What more do we want from our interpretation?
Presume we want to make a statement S (for example, "M_θ makes predictions independent of the number of people living in a location"). If L_2 is a language along with some rules of inference, then we can use d and reason about whether statement S is true. The reasoning becomes our explanation for why S is true.
Since statement S is true based on d, we would want to have high confidence that S is also true for M_θ, since both θ and d describe the same theory T.
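To make this setup concrete, here is a minimal toy sketch (my own illustrative example, with made-up numbers and names): θ is an opaque list of floats, i.e. a description in L_1, while d is a short human-readable rule in L_2 that makes the same predictions, so a statement S can be checked by reasoning about d rather than staring at θ.

```python
import numpy as np

# Toy "weather model": predicts tomorrow's temperature from
# (today's temperature, humidity, population of the location).
# theta is a description in L_1: an opaque string of floating point numbers.
theta = np.array([0.9, 2.0, 0.0])  # hypothetical learned weights

def model(x, theta=theta):
    """M_theta: prediction is a linear function of the input features."""
    return float(theta @ x)

# d is a description of the same theory T in L_2 (human-readable rules):
d = "tomorrow's temperature = 0.9 * today's temperature + 2.0 * humidity"

# Statement S: "M_theta makes predictions independent of population."
# Reasoning in L_2: d never mentions population, so S follows from d.
# We can also (imperfectly) spot-check that theta agrees with d:
x1 = np.array([20.0, 0.5, 1_000.0])      # small town
x2 = np.array([20.0, 0.5, 9_000_000.0])  # big city, same weather features
assert model(x1) == model(x2)  # consistent with S
```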
So, what languages and what rules of inference should we be interested in? This would depend on the statements we would like to make.
What statements do we want to make?
- Generalization: For a given task, we would like to state whether the model will generalize over a set of inputs X. Say we do have a description d for M_θ. Then:
  - If we have a theory T* for how the task can be generally solved: let d* be the description of T* in L_2. We can compare d and d*. If they are the same, then we can state that our model should generalize. If they are not the same, then we can examine the difference and hope to come up with counter-examples where our model fails to generalize.
  - We can generate counter-examples based on where d should fail, and then see whether our model fails on these counter-examples (see the sketch after this list).
  - We can reason based on how simple d is. If d is complex, then maybe we have less confidence that the model will generalize; and vice versa, if d is simple, maybe we will have more confidence?
- Knowledge Extraction: If humans understand d, then they can reconcile it with known human theory and hope to deduce new knowledge. Will these new insights help humans verify / trust AI?
- Verifiability: If the task is easy to verify based on external metrics, then we shouldn't need interpretability. If the task is hard to verify, d makes verification easier by allowing us to break the task down into easily verifiable subtasks. Or maybe we can't verify the answer, but d leads us to believe the model is behaving differently on this input (not using the theory T described by d, but some other theory), and thus we should have low confidence that the model will behave correctly.
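As a toy illustration of the counter-example idea above (the task and all the function names here are hypothetical stand-ins I chose for the sketch, not a real interpretability pipeline): suppose d is our hypothesis for what the model implements on an addition task and d* is the general algorithm. We search for inputs where they disagree and then check whether the model actually fails there.

```python
import random

def d_star(a, b):
    """The general theory T*: exact addition."""
    return a + b

def d_hypothesis(a, b):
    """Our hypothesis d about the model: it only tracks the last two digits
    (e.g. it memorised small-number addition and drops higher carries)."""
    return (a + b) % 100

def model(a, b):
    """Placeholder for the real model's prediction on this task."""
    return d_hypothesis(a, b)  # in reality, run the network here

# Compare d and d*: search for inputs where they disagree,
# then check whether the model actually fails on those counter-examples.
counter_examples = []
for _ in range(10_000):
    a, b = random.randint(0, 999), random.randint(0, 999)
    if d_hypothesis(a, b) != d_star(a, b):
        counter_examples.append((a, b))

failures = [(a, b) for a, b in counter_examples[:100] if model(a, b) != d_star(a, b)]
print(f"{len(counter_examples)} predicted counter-examples, "
      f"{len(failures)}/100 confirmed model failures")
```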
Do we need mechanistic interpretability?
Let's define mechanistic interpretability as interpreting the circuit trace to show how the model arrived at a logit prediction based on its token inputs. We hope to see which tokens from the input and which parts of the network were relevant to the model's output. For the relevant parts of the network, we want to generate an interpretation of what each module is doing and of the connections between different modules.
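One crude way to operationalise "which parts of the network were relevant" is ablation: zero out a component's contribution to the residual stream and measure how far the output logits move. Below is a minimal PyTorch sketch on a made-up two-block toy model (not a real transformer, and not any particular interpretability library):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

class TinyResidualNet(nn.Module):
    """Two residual blocks standing in for transformer sub-modules."""
    def __init__(self, d=16, vocab=10):
        super().__init__()
        self.embed = nn.Embedding(vocab, d)
        self.blocks = nn.ModuleList([nn.Linear(d, d) for _ in range(2)])
        self.unembed = nn.Linear(d, vocab)

    def forward(self, tokens, ablate=None):
        x = self.embed(tokens).mean(dim=0)   # crude pooling over tokens
        for i, block in enumerate(self.blocks):
            out = block(x)
            if i == ablate:                  # knock out this block's contribution
                out = torch.zeros_like(out)
            x = x + out                      # residual stream
        return self.unembed(x)

net = TinyResidualNet()
tokens = torch.tensor([1, 4, 2])
clean_logits = net(tokens)

# Relevance of each block = how far the logits move when it is ablated.
for i in range(2):
    ablated_logits = net(tokens, ablate=i)
    effect = (clean_logits - ablated_logits).norm().item()
    print(f"block {i}: logit shift {effect:.3f}")
```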
Does this kind of mechanistic interpretability have an edge over other methods?
I think the overall goal is to form some theory about what the model is doing and, based on this theory, to reason about the statements we want to prove (such as generalization). It remains unclear whether mechanistic interpretability is better at forming theories than Chain-of-Thought or than using influence functions[5] to identify the relevant training data.
Comparing theories formed using mechanistic interpretability to theories formed using other methods is a promising direction in my view. In particular, I would be interested in benchmarks such as:
- Generalization: How well do interpretability methods do at distinguishing memorizing from generalizing neural networks? You have two networks that get 100% accuracy on your train set; based on this, you have to predict whether they will generalize to your test set (a toy harness for this is sketched below). Demonstrating this for a wide variety of tasks would be a great result in my view. Or, for existing benchmarks, can you reliably come up with adversarial examples of those tasks where the model fails (for example, a pair of numbers that the model fails to add / multiply)?
- Knowledge Extraction: Given a set of data (generated using some program or theory), we train models on the data (this could just be few-shot prompting). Then we use different methods to retrieve the knowledge rules and compare them against the original program / theory. I am not sure whether interpretability will do better than asking / training the LLM to come up with a theory that, when given to a human / AI, solves the task.
- Verifiability: I am not sure what a benchmark would look like here.
- Edit theory T: Anthropic worked on this[6], where the model's behaviour was altered and teams had to detect what the possible change was. It would also be interesting if we could predict emergent misalignment[7].
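To make the first benchmark more concrete, here is a minimal harness sketch; the task, the two models, and the scoring are all toys I made up for illustration. Two models agree perfectly on the train set, a method under evaluation must guess which one generalizes, and the score simply checks that guess against held-out accuracy.

```python
import random

# Toy task: addition mod 97. Train pairs are a random subset; test pairs are the rest.
P = 97
all_pairs = [(a, b) for a in range(P) for b in range(P)]
random.seed(0)
random.shuffle(all_pairs)
train, test = all_pairs[:3000], all_pairs[3000:]

def generalizer(a, b):
    return (a + b) % P                      # implements the general rule

memorized = {(a, b): (a + b) % P for a, b in train}
def memorizer(a, b):
    return memorized.get((a, b), 0)         # lookup table, garbage off-distribution

def accuracy(model, pairs):
    return sum(model(a, b) == (a + b) % P for a, b in pairs) / len(pairs)

# Both models are indistinguishable on the train set:
assert accuracy(generalizer, train) == accuracy(memorizer, train) == 1.0

# A method under evaluation must output a guess: which model will generalize?
# (Here the guess is hard-coded; a real benchmark would call an interpretability
# method, a CoT-based judge, influence functions, etc.)
def score(method_guess):
    test_acc = {"generalizer": accuracy(generalizer, test),
                "memorizer": accuracy(memorizer, test)}
    truth = max(test_acc, key=test_acc.get)
    return int(method_guess == truth)

print("test accuracies:", accuracy(generalizer, test), accuracy(memorizer, test))
print("method score:", score("generalizer"))
```

A real version would replace the hard-coded guess with the output of the method being benchmarked and sweep over many tasks and model pairs.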
If you believe I have got something wrong or missed something, please reach out. You can comment down below or email me at prnaidu@ucsd.edu
- ^
A language is defined as a set of strings. You can view θ as a binary string corresponding to floating point numbers.
- ^
Sharkey, L., Chughtai, B., Batson, J., Lindsey, J., Wu, J., Bushnaq, L., ... & McGrath, T. (2025). Open problems in mechanistic interpretability. arXiv preprint arXiv:2501.16496.
- ^
Naidu, P., Wang, Z., Bergen, L., & Paturi, R. (2025). Quiet Feature Learning in Algorithmic Tasks. arXiv preprint arXiv:2505.03997
- ^
van Rossem, L., & Saxe, A. M. Algorithm Development in Neural Networks: Insights from the Streaming Parity Task. In Forty-second International Conference on Machine Learning.
- ^
Grosse, R., Bae, J., Anil, C., Elhage, N., Tamkin, A., Tajdini, A., ... & Bowman, S. R. (2023). Studying large language model generalization with influence functions. arXiv preprint arXiv:2308.03296.
- ^
Marks, S., Treutlein, J., Bricken, T., Lindsey, J., Marcus, J., Mishra-Sharma, S., ... & Hubinger, E. (2025). Auditing language models for hidden objectives. arXiv preprint arXiv:2503.10965.
- ^
Betley, J., Tan, D., Warncke, N., Sztyber-Betley, A., Bao, X., Soto, M., ... & Evans, O. (2025). Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs. arXiv preprint arXiv:2502.17424.