Locating and editing factual knowledge in gpt [link]Meng, K., Bau, D., Andonian, A. and Belinkov, Y., 2022. arXiv preprint arXiv:2202.05262.Interpretability in the wild: a circuit for indirect object identification in gpt-2 small [link]Wang, K., Variengien, A., Conmy, A., Shlegeris, B. and Steinhardt, J., 2022. arXiv preprint arXiv:2211.00593.Towards automated circuit discovery for mechanistic interpretability [PDF]Conmy, A., Mavor-Parker, A., Lynch, A., Heimersheim, S. and Garriga-Alonso, A., 2023. Advances in Neural Information Processing Systems, Vol 36, pp. 16318--16352.Thread: Circuits [link]Cammarata, N., Carter, S., Goh, G., Olah, C., Petrov, M., Schubert, L., Voss, C., Egan, B. and Lim, S.K., 2020. Distill.Linear algebraic structure of word senses, with applications to polysemy [PDF]Arora, S., Li, Y., Liang, Y., Ma, T. and Risteski, A., 2018. Transactions of the Association for Computational Linguistics, Vol 6, pp. 483--495. MIT Press.Decoding The Thought Vector [link]Goh, G., 2016.Toy Models of Superposition [HTML]Elhage, N., Hume, T., Olsson, C., Schiefer, N., Henighan, T., Kravec, S., Hatfield-Dodds, Z., Lasenby, R., Drain, D., Chen, C., Grosse, R., McCandlish, S., Kaplan, J., Amodei, D., Wattenberg, M. and Olah, C., 2022. Transformer Circuits Thread.Towards Monosemanticity: Decomposing Language Models With Dictionary Learning [HTML]Bricken, T., Templeton, A., Batson, J., Chen, B., Jermyn, A., Conerly, T., Turner, N., Anil, C., Denison, C., Askell, A., Lasenby, R., Wu, Y., Kravec, S., Schiefer, N., Maxwell, T., Joseph, N., Hatfield-Dodds, Z., Tamkin, A., Nguyen, K., McLean, B., Burke, J.E., Hume, T., Carter, S., Henighan, T. and Olah, C., 2023. Transformer Circuits Thread.Sparse Autoencoders Find Highly Interpretable Model Directions [link]Cunningham, H., Ewart, A., Smith, L., Huben, R. and Sharkey, L., 2023. arXiv preprint arXiv:2309.08600.Scaling and evaluating sparse autoencoders [link]Gao, L., la Tour, T.D., Tillman, H., Goh, G., Troll, R., Radford, A., Sutskever, I., Leike, J. and Wu, J., 2024. arXiv preprint arXiv:2406.04093.Jumping ahead: Improving reconstruction fidelity with jumprelu sparse autoencoders [link]Rajamanoharan, S., Lieberum, T., Sonnerat, N., Conmy, A., Varma, V., Kramar, J. and Nanda, N., 2024. arXiv preprint arXiv:2407.14435.Transcoders find interpretable LLM feature circuits [PDF]Dunefsky, J., Chlenski, P. and Nanda, N., 2025. Advances in Neural Information Processing Systems, Vol 37, pp. 24375--24410.dictionary_learning Github Repository [link]Marks, S., Karvonen, A. and Mueller, A., 2024. Github.Predicting Future Activations [link]Templeton, A., Batson, J., Jermyn, A. and Olah, C., 2024.Sparse Crosscoders for Cross-Layer Features and Model Diffing [HTML]Lindsey, J., Templeton, A., Marcus, J., Conerly, T., Batson, J. and Olah, C., 2024.Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models [link]Marks, S., Rager, C., Michaud, E.J., Belinkov, Y., Bau, D. and Mueller, A., 2024. arXiv preprint arXiv:2403.19647.Automatically identifying local and global circuits with linear computation graphs [link]Ge, X., Zhu, F., Shu, W., Wang, J., He, Z. and Qiu, X., 2024. arXiv preprint arXiv:2405.13868.A Mathematical Framework for Transformer Circuits [HTML]Elhage, N., Nanda, N., Olsson, C., Henighan, T., Joseph, N., Mann, B., Askell, A., Bai, Y., Chen, A., Conerly, T., DasSarma, N., Drain, D., Ganguli, D., Hatfield-Dodds, Z., Hernandez, D., Jones, A., Kernion, J., Lovitt, L., Ndousse, K., Amodei, D., Brown, T., Clark, J., Kaplan, J., McCandlish, S. and Olah, C., 2021. Transformer Circuits Thread.Fact Finding: Attempting to Reverse-Engineer Factual Recall on the Neuron Level [link]Nanda, N., Rajamanoharan, S. and Shah, R., 2023.Using Features For Easy Circuit Identification [link]Batson, J., Chen, B. and Jones, A., 2024.Arithmetic Without Algorithms: Language Models Solve Math With a Bag of Heuristics [link]Nikankin, Y., Reusch, A., Mueller, A. and Belinkov, Y., 2024.Language Models Use Trigonometry to Do Addition [link]Kantamneni, S. and Tegmark, M., 2025.Pre-trained large language models use fourier features to compute addition [link]Zhou, T., Fu, D., Sharan, V. and Jia, R., 2024. arXiv preprint arXiv:2406.03445.Sparse Autoencoders Work on Attention Layer Outputs [link]Kissane, C., robertzk,, Conmy, A. and Nanda, N., 2024.Superposition, Memorization, and Double Descent [HTML]Henighan, T., Carter, S., Hume, T., Elhage, N., Lasenby, R., Fort, S., Schiefer, N. and Olah, C., 2023. Transformer Circuits Thread.Transformer circuit faithfulness metrics are not robust [link]Miller, J., Chughtai, B. and Saunders, W., 2024. arXiv preprint arXiv:2407.08734.The Pile: An 800GB Dataset of Diverse Text for Language Modeling [link]Gao, L., Biderman, S., Black, S., Golding, L., Hoppe, T., Foster, C., Phang, J., He, H., Thite, A., Nabeshima, N., Presser, S. and Leahy, C., 2020.LMSYS-Chat-1M: A Large-Scale Real-World LLM Conversation Dataset [link]Zheng, L., Chiang, W., Sheng, Y., Li, T., Zhuang, S., Wu, Z., Zhuang, Y., Li, Z., Lin, Z., Xing, E.P., Gonzalez, J.E., Stoica, I. and Zhang, H., 2023.Interpretability Evals for Dictionary Learning [link]Lindsey, J., Cunningham, H., Conerly, T. and Templeton, A., 2024.In-context Learning and Induction Heads [HTML]Olsson, C., Elhage, N., Nanda, N., Joseph, N., DasSarma, N., Henighan, T., Mann, B., Askell, A., Bai, Y., Chen, A., Conerly, T., Drain, D., Ganguli, D., Hatfield-Dodds, Z., Hernandez, D., Johnston, S., Jones, A., Kernion, J., Lovitt, L., Ndousse, K., Amodei, D., Brown, T., Clark, J., Kaplan, J., McCandlish, S. and Olah, C., 2022. Transformer Circuits Thread.How do language models bind entities in context? [link]Feng, J. and Steinhardt, J., 2023. arXiv preprint arXiv:2310.17191.Open Problems in Mechanistic Interpretability [link]Sharkey, L., Chughtai, B., Batson, J., Lindsey, J., Wu, J., Bushnaq, L., Goldowsky-Dill, N., Heimersheim, S., Ortega, A., Bloom, J. and others,, 2025. arXiv preprint arXiv:2501.16496.Are Sparse Autoencoders Useful? A Case Study in Sparse Probing [link]Kantamneni, S., Engels, J., Rajamanoharan, S., Tegmark, M. and Nanda, N., 2025. arXiv preprint arXiv:2502.16681.AXBENCH: Steering LLMs? Even Simple Baselines Outperform Sparse Autoencoders [link]Wu, Z., Arora, A., Geiger, A., Wang, Z., Huang, J., Jurafsky, D., Manning, C.D. and Potts, C., 2025. arXiv preprint arXiv:2501.17148.A is for absorption: Studying feature splitting and absorption in sparse autoencoders [link]Chanin, D., Wilken-Smith, J., Dulka, T., Bhatnagar, H. and Bloom, J., 2024. arXiv preprint arXiv:2409.14507.Do sparse autoencoders find "true features"? [link]Till, D., 2024.Measuring feature sensitivity using dataset filtering [link]Turner, N.L., Jermyn, A. and Batson, J., 2024.Matryoshka Sparse Autoencoders [link]Nabeshima, N., 2024.Learning Multi-Level Features with Matryoshka SAEs [link]Bussmann, B., Leask, P. and Nanda, N., 2024.Showing SAE Latents Are Not Atomic Using Meta-SAEs [link]Bussmann, B., Pearce, M., Leask, P., Bloom, J., Sharkey, L. and Nanda, N., 2024.Monitor: An AI-Driven Observability Interface [link]Meng, K., Huang, V., Chowdhury, N., Choi, D., Steinhardt, J. and Schwettmann, S., 2024.Axiomatic attribution for deep networks [link]Sundararajan, M., Taly, A. and Yan, Q., 2017. arXiv preprint arXiv:1703.01365.A unified approach to interpreting model predictions [PDF]Lundberg, S.M. and Lee, S., 2017. Advances in neural information processing systems, Vol 30.Successor Heads: Recurring, Interpretable Attention Heads In The Wild [link]Gould, R., Ong, E., Ogden, G. and Conmy, A., 2023.Toward transparent ai: A survey on interpreting the inner structures of deep neural networks [link]Räuker, T., Ho, A., Casper, S. and Hadfield-Menell, D., 2023. 2023 IEEE Conference on Secure and Trustworthy Machine Learning (SaTML), pp. 464--483.Mechanistic Interpretability for AI Safety--A Review [link]Bereska, L. and Gavves, E., 2024. arXiv preprint arXiv:2404.14082.A Primer on the Inner Workings of Transformer-based Language Models [link]Ferrando, J., Sarti, G., Bisazza, A. and Costa-jussa, M.R., 2024. arXiv preprint arXiv:2405.00208.The quest for the right mediator: A history, survey, and theoretical grounding of causal interpretability [link]Mueller, A., Brinkmann, J., Li, M., Marks, S., Pal, K., Prakash, N., Rager, C., Sankaranarayanan, A., Sharma, A.S., Sun, J. and others,, 2024. arXiv preprint arXiv:2408.01416.Efficient estimation of word representations in vector space [link]Mikolov, T., Chen, K., Corrado, G. and Dean, J., 2013. arXiv preprint arXiv:1301.3781.Visualizing and understanding recurrent networks [link]Karpathy, A., Johnson, J. and Fei-Fei, L., 2015. arXiv preprint arXiv:1506.02078.Curve Detectors [link]Cammarata, N., Goh, G., Carter, S., Schubert, L., Petrov, M. and Olah, C., 2020. Distill.Analyzing multi-head self-attention: Specialized heads do the heavy lifting, the rest can be pruned [link]Voita, E., Talbot, D., Moiseev, F., Sennrich, R. and Titov, I., 2019. arXiv preprint arXiv:1905.09418.Tensor2tensor transformer visualization [link]Jones, L., 2017.A primer in bertology: What we know about how bert works [link]Rogers, A., Kovaleva, O. and Rumshisky, A., 2020. Transactions of the Association for Computational Linguistics, Vol 8, pp. 842--866. MIT Press. DOI: 10.1162/tacl_a_00349The Building Blocks of Interpretability [link]Olah, C., Satyanarayan, A., Johnson, I., Carter, S., Schubert, L., Ye, K. and Mordvintsev, A., 2018. Distill. DOI: 10.23915/distill.00010A multiscale visualization of attention in the transformer model [link]Vig, J., 2019. arXiv preprint arXiv:1906.05714.Representation learning: A review and new perspectives [link]Bengio, Y., Courville, A. and Vincent, P., 2013. IEEE transactions on pattern analysis and machine intelligence, Vol 35(8), pp. 1798--1828. IEEE.Compressed sensing [PDF]Donoho, D.L., 2006. IEEE Transactions on information theory, Vol 52(4), pp. 1289--1306. IEEE.Compressed Sensing, Sparsity, and Dimensionality in Neuronal Information Processing and Data Analysis [link]Ganguli, S. and Sompolinsky, H., 2012. Annual Review of Neuroscience, Vol 35(1), pp. 485-508. DOI: 10.1146/annurev-neuro-062111-150410Sparse coding with an overcomplete basis set: A strategy employed by V1? [link]Olshausen, B.A. and Field, D.J., 1997. Vision research, Vol 37(23), pp. 3311--3325. Elsevier. DOI: 10.1016/S0042-6989(97)00169-7Sparse and redundant representations: from theory to applications in signal and image processing Elad, M., 2010. , Vol 2(1). Springer.Local vs. Distributed Coding [link]Thorpe, S.J., 1989. Intellectica, Vol 8, pp. 3--40.Unsupervised representation learning with deep convolutional generative adversarial networks [link]Radford, A., Metz, L. and Chintala, S., 2015. arXiv preprint arXiv:1511.06434.Activation Addition: Steering Language Models Without Optimization [link]Turner, A.M., Thiergart, L., Udell, D., Leech, G., Mini, U. and MacDiarmid, M., 2023.Zoom In: An Introduction to Circuits [link]Olah, C., Cammarata, N., Schubert, L., Goh, G., Petrov, M. and Carter, S., 2020. Distill. DOI: 10.23915/distill.00024.001Distributed Representations: Composition & Superposition [HTML]Olah, C., 2023.Transformer visualization via dictionary learning: contextualized embedding as a linear superposition of transformer factors [link]Yun, Z., Chen, Y., Olshausen, B.A. and LeCun, Y., 2021. arXiv preprint arXiv:2103.15949.Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet [HTML]Templeton, A., Conerly, T., Marcus, J., Lindsey, J., Bricken, T., Chen, B., Pearce, A., Citro, C., Ameisen, E., Jones, A., Cunningham, H., Turner, N.L., McDougall, C., MacDiarmid, M., Freeman, C.D., Sumers, T.R., Rees, E., Batson, J., Jermyn, A., Carter, S., Olah, C. and Henighan, T., 2024. Transformer Circuits Thread.Sparse autoencoder [PDF]Ng, A. and others,, 2011. CS294A Lecture notes, Vol 72(2011), pp. 1--19.k-Sparse Autoencoders [link]Makhzani, A. and Frey, B.J., 2013. CoRR, Vol abs/1312.5663.Addressing Feature Suppression in SAEs [link]Wright, B. and Sharkey, L., 2024.Sparse Autoencoders Do Not Find Canonical Units of Analysis [link]Leask, P., Bussmann, B., Pearce, M., Bloom, J., Tigges, C., Moubayed, N.A., Sharkey, L. and Nanda, N., 2025. arXiv preprint arXiv:2502.04878.Saes are highly dataset dependent: A case study on the refusal direction [link]Kissane, C., Krzyzanowski, R., Nanda, N. and Conmy, A., 2024. Alignment Forum.Sparse Autoencoders Trained on the Same Data Learn Different Features [link]Paulo, G. and Belrose, N., 2025. arXiv preprint arXiv:2501.16615.Language models can explain neurons in language models [HTML]Bills, S., Cammarata, N., Mossing, D., Tillman, H., Gao, L., Goh, G., Sutskever, I., Leike, J., Wu, J. and Saunders, W., 2023.Automatically interpreting millions of features in large language models [link]Paulo, G., Mallen, A., Juang, C. and Belrose, N., 2024. arXiv preprint arXiv:2410.13928.Sparse Autoencoders Can Interpret Randomly Initialized Transformers [link]Heap, T., Lawson, T., Farnik, L. and Aitchison, L., 2025. arXiv preprint arXiv:2501.17727.Residual Stream Analysis with Multi-Layer SAEs [link]Lawson, T., Farnik, L., Houghton, C. and Aitchison, L., 2024. arXiv preprint arXiv:2409.04185.Transcoders Beat Sparse Autoencoders for Interpretability [link]Paulo, G., Shabalin, S. and Belrose, N., 2025. arXiv preprint arXiv:2501.18823.Features that Make a Difference: Leveraging Gradients for Improved Dictionary Learning [link]Olmo, J., Wilson, J., Forsey, M., Hepner, B., Howe, T.V. and Wingate, D., 2024. arXiv preprint arXiv:2411.10397.Efficient dictionary learning with switch sparse autoencoders [link]Mudide, A., Engels, J., Michaud, E.J., Tegmark, M. and de Witt, C.S., 2024. arXiv preprint arXiv:2410.08201.Identifying functionally important features with end-to-end sparse dictionary learning [PDF]Braun, D., Taylor, J., Goldowsky-Dill, N. and Sharkey, L., 2025. Advances in Neural Information Processing Systems, Vol 37, pp. 107286--107325.Improving Dictionary Learning with Gated Sparse Autoencoders [link]Rajamanoharan, S., Conmy, A., Smith, L., Lieberum, T., Varma, V., Kramar, J., Shah, R. and Nanda, N., 2024. arXiv preprint arXiv:2404.16014.Jacobian Sparse Autoencoders: Sparsify Computations, Not Just Activations [link]Farnik, L., Lawson, T., Houghton, C. and Aitchison, L., 2025. arXiv preprint arXiv:2502.18147.Towards principled evaluations of sparse autoencoders for interpretability and control [link]Makelov, A., Lange, G. and Nanda, N., 2024. arXiv preprint arXiv:2405.08366.Ravel: Evaluating interpretability methods on disentangling language model representations [link]Huang, J., Wu, Z., Potts, C., Geva, M. and Geiger, A., 2024. arXiv preprint arXiv:2402.17700.Measuring progress in dictionary learning for language model interpretability with board game models [PDF]Karvonen, A., Wright, B., Rager, C., Angell, R., Brinkmann, J., Smith, L., Mayrink Verdun, C., Bau, D. and Marks, S., 2025. Advances in Neural Information Processing Systems, Vol 37, pp. 83091--83118.Evaluating open-source sparse autoencoders on disentangling factual knowledge in gpt-2 small [link]Chaudhary, M. and Geiger, A., 2024. arXiv preprint arXiv:2409.04478.SAEBench: A comprehensive benchmark for sparse autoencoders, December 2024 [link]Karvonen, A., Rager, C., Lin, J., Tigges, C., Bloom, J., Chanin, D., Lau, Y., Farrell, E., Conmy, A., Mc-Dougall, C. and others,. URL
https://www. neuronpedia. org/sae-bench/info.Evaluating Sparse Autoencoders on Targeted Concept Erasure Tasks [link]Karvonen, A., Rager, C., Marks, S. and Nanda, N., 2024. arXiv preprint arXiv:2411.18895.The local interaction basis: Identifying computationally-relevant and sparsely interacting features in neural networks [link]Bushnaq, L., Heimersheim, S., Goldowsky-Dill, N., Braun, D., Mendel, J., Hanni, K., Griffin, A., Stohler, J., Wache, M. and Hobbhahn, M., 2024. arXiv preprint arXiv:2405.10928.Using degeneracy in the loss landscape for mechanistic interpretability [link]Bushnaq, L., Mendel, J., Heimersheim, S., Braun, D., Goldowsky-Dill, N., Hanni, K., Wu, C. and Hobbhahn, M., 2024. arXiv preprint arXiv:2405.10927.Interpretability in Parameter Space: Minimizing Mechanistic Description Length with Attribution-based Parameter Decomposition [link]Braun, D., Bushnaq, L., Heimersheim, S., Mendel, J. and Sharkey, L., 2025. arXiv preprint arXiv:2501.14926.SUMMIT: Scaling Deep Learning Interpretability by Visualizing Activation and Attribution Summarizations Hohman, F., Park, H., Robinson, C. and Chau, D.H.P., 2019.Progress measures for grokking via mechanistic interpretability [link]Nanda, N., Chan, L., Lieberum, T., Smith, J. and Steinhardt, J., 2023. arXiv preprint arXiv:2301.05217.Investigating gender bias in language models using causal mediation analysis [PDF]Vig, J., Gehrmann, S., Belinkov, Y., Qian, S., Nevo, D., Singer, Y. and Shieber, S., 2020. Advances in neural information processing systems, Vol 33, pp. 12388--12401.Towards best practices of activation patching in language models: Metrics and methods [link]Zhang, F. and Nanda, N., 2023. arXiv preprint arXiv:2309.16042.How to use and interpret activation patching [link]Heimersheim, S. and Nanda, N., 2024. arXiv preprint arXiv:2404.15255.Localizing model behavior with path patching [link]Goldowsky-Dill, N., MacLeod, C., Sato, L. and Arora, A., 2023. arXiv preprint arXiv:2304.05969.Finding alignments between interpretable causal variables and distributed neural representations [PDF]Geiger, A., Wu, Z., Potts, C., Icard, T. and Goodman, N., 2024. Causal Learning and Reasoning, pp. 160--187.Interpretability at scale: Identifying causal mechanisms in alpaca [PDF]Wu, Z., Geiger, A., Icard, T., Potts, C. and Goodman, N., 2023. Advances in neural information processing systems, Vol 36, pp. 78205--78226.Attribution Patching: Activation Patching At Industrial Scale [link]Nanda, N., 2023.Attribution Patching Outperforms Automated Circuit Discovery [link]Syed, A., Rager, C. and Conmy, A., 2023. arXiv preprint arXiv:2310.10348.AtP*: An efficient and scalable method for localizing LLM behaviour to components [link]Kramár, J., Lieberum, T., Shah, R. and Nanda, N., 2024. arXiv preprint arXiv:2403.00745.Have faith in faithfulness: Going beyond circuit overlap when finding model mechanisms [link]Hanna, M., Pezzelle, S. and Belinkov, Y., 2024. arXiv preprint arXiv:2403.17806.EAP-GP: Mitigating Saturation Effect in Gradient-based Automated Circuit Identification [link]Zhang, L., Dong, W., Zhang, Z., Yang, S., Hu, L., Liu, N., Zhou, P. and Wang, D., 2025. arXiv preprint arXiv:2502.06852.Automatic discovery of visual circuits [link]Rajaram, A., Chowdhury, N., Torralba, A., Andreas, J. and Schwettmann, S., 2024. arXiv preprint arXiv:2404.14349.Position-aware Automatic Circuit Discovery [link]Haklay, T., Orgad, H., Bau, D., Mueller, A. and Belinkov, Y., 2025. arXiv preprint arXiv:2502.04577.Low-complexity probing via finding subnetworks [link]Cao, S., Sanh, V. and Rush, A.M., 2021. arXiv preprint arXiv:2104.03514.Discovering variable binding circuitry with desiderata [link]Davies, X., Nadeau, M., Prakash, N., Shaham, T.R. and Bau, D., 2023. arXiv preprint arXiv:2307.03637.Finding transformer circuits with edge pruning [PDF]Bhaskar, A., Wettig, A., Friedman, D. and Chen, D., 2025. Advances in Neural Information Processing Systems, Vol 37, pp. 18506--18534.Uncovering intermediate variables in transformers using circuit probing [link]Lepori, M.A., Serre, T. and Pavlick, E., 2023. arXiv preprint arXiv:2311.04354.Sparse autoencoders enable scalable and reliable circuit identification in language models [link]O'Neill, C. and Bui, T., 2024. arXiv preprint arXiv:2405.12522.Information flow routes: Automatically interpreting language models at scale [link]Ferrando, J. and Voita, E., 2024. arXiv preprint arXiv:2403.00824.VISIT: Visualizing and interpreting the semantic information flow of transformers [link]Katz, S. and Belinkov, Y., 2023. arXiv preprint arXiv:2305.13417.Dictionary Learning Improves Patch-Free Circuit Discovery in Mechanistic Interpretability: A Case Study on Othello-GPT [link]He, Z., Ge, X., Tang, Q., Sun, T., Cheng, Q. and Qiu, X., 2024. arXiv preprint arXiv:2402.12201.Attention Output SAEs Improve Circuit Analysis [link]Kissane, C., Krzyzanowski, R., Conmy, A. and Nanda, N., 2024.Causal abstractions of neural networks [PDF]Geiger, A., Lu, H., Icard, T. and Potts, C., 2021. Advances in Neural Information Processing Systems, Vol 34, pp. 9574--9586.Causal abstraction: A theoretical foundation for mechanistic interpretability [link]Geiger, A., Ibeling, D., Zur, A., Chaudhary, M., Chauhan, S., Huang, J., Arora, A., Wu, Z., Goodman, N., Potts, C. and others,, 2023. arXiv preprint arXiv:2301.04709.Causal proxy models for concept-based model explanations [PDF]Wu, Z., D’Oosterlinck, K., Geiger, A., Zur, A. and Potts, C., 2023. International conference on machine learning, pp. 37313--37334.Decomposing and editing predictions by modeling model computation [link]Shah, H., Ilyas, A. and Madry, A., 2024. arXiv preprint arXiv:2404.11534.Causal scrubbing, a method for rigorously testing interpretability hypotheses [link]Chan, L., Garriga-Alonso, A., Goldwosky-Dill, N., Greenblatt, R., Nitishinskaya, J., Radhakrishnan, A., Shlegeris, B. and Thomas, N., 2022. AI Alignment Forum.Hypothesis testing the circuit hypothesis in LLMs [PDF]Shi, C., Beltran Velez, N., Nazaret, A., Zheng, C., Garriga-Alonso, A., Jesson, A., Makar, M. and Blei, D., 2025. Advances in Neural Information Processing Systems, Vol 37, pp. 94539--94567.The clock and the pizza: Two stories in mechanistic explanation of neural networks [PDF]Zhong, Z., Liu, Z., Tegmark, M. and Andreas, J., 2023. Advances in neural information processing systems, Vol 36, pp. 27223--27250.A toy model of universality: Reverse engineering how networks learn group operations [PDF]Chughtai, B., Chan, L. and Nanda, N., 2023. International Conference on Machine Learning, pp. 6243--6267.Grokking group multiplication with cosets [link]Stander, D., Yu, Q., Fan, H. and Biderman, S., 2023. arXiv preprint arXiv:2312.06581.Fourier circuits in neural networks and transformers: A case study of modular arithmetic with multiple inputs [link]Li, C., Liang, Y., Shi, Z., Song, Z. and Zhou, T., 2024. arXiv preprint arXiv:2402.09469.A circuit for Python docstrings in a 4-layer attention-only transformer [link]Heimersheim, S. and Janiak, J., 2023. Alignment Forum.How does GPT-2 compute greater-than?: Interpreting mathematical abilities in a pre-trained language model [PDF]Hanna, M., Liu, O. and Variengien, A., 2023. Advances in Neural Information Processing Systems, Vol 36, pp. 76033--76060.Does circuit analysis interpretability scale? evidence from multiple choice capabilities in chinchilla [link]Lieberum, T., Rahtz, M., Kramar, J., Nanda, N., Irving, G., Shah, R. and Mikulik, V., 2023. arXiv preprint arXiv:2307.09458.Identifying a preliminary circuit for predicting gendered pronouns in gpt-2 small [link]Mathwin, C., Corlouer, G., Kran, E., Barez, F. and Nanda, N., 2023. URL:
https://itch.io/jam/mechint/rate/1889871.Identifying and adapting transformer-components responsible for gender bias in an English language model [link]Chintam, A., Beloch, R., Zuidema, W., Hanna, M. and Van Der Wal, O., 2023. arXiv preprint arXiv:2310.12611.Scaling Sparse Feature Circuits For Studying In-Context Learning [link]Kharlapenko, D., Shabalin, S., Barez, F., Nanda, N. and Conmy, A., 2025.Circuit component reuse across tasks in transformer language models [link]Merullo, J., Eickhoff, C. and Pavlick, E., 2023. arXiv preprint arXiv:2310.08744.Circuit Compositions: Exploring Modular Structures in Transformer-Based Language Models [link]Mondorf, P., Wold, S. and Plank, B., 2024. arXiv preprint arXiv:2410.01434.LLM circuit analyses are consistent across training and scale [link]Tigges, C., Hanna, M., Yu, Q. and Biderman, S., 2024. arXiv preprint arXiv:2407.10827.Gemma 2: Improving open language models at a practical size [PDF]Team, G., Riviere, M., Pathak, S., Sessa, P.G., Hardin, C., Bhupatiraju, S., Hussenot, L., Mesnard, T., Shahriari, B., Rame, A. and others,, 2024. arXiv preprint arXiv:2408.00118.Gemma Scope: Open Sparse Autoencoders Everywhere All At Once on Gemma 2 [PDF]Lieberum, T., Rajamanoharan, S., Conmy, A., Smith, L., Sonnerat, N., Varma, V., Kramár, J., Dragan, A., Shah, R. and Nanda, N., 2024.Dictionary Learning Optimization Techniques [link]Conerly, T., Cunningham, H., Templeton, A., Lindsey, J., Hosmer, B. and Jermyn, A., 2024.Feature Manifold Toy Model [link]Olah, C. and Batson, J., 2023.Not all language model features are linear [PDF]Engels, J., Michaud, E.J., Liao, I., Gurnee, W. and Tegmark, M., 2024. arXiv preprint arXiv:2405.14860.What is a Linear Representation? What is a Multidimensional Feature? [link]Olah, C., 2024.Curve Detector Manifolds in InceptionV1 [link]Gorton, O., 2024.
Residual stream norms grow exponentially over the forward pass [link]Heimersheim, S. and Turner, A., 2023.