智源社区 · June 4, 10:03
Opening the AI Black Box: How Can Attribution Graphs Map the "Brain Circuitry" of Large Language Models?

This article surveys recent research on the interpretability of large language models (LLMs), covering sparse autoencoders, monosemanticity, feature circuits, multi-hop reasoning, planning, multimodal neurons, multilingual processing, factual recall, refusal behavior, alignment, sentiment, and spatial and temporal representations. Using techniques such as sparse coding, activation patching, and causal graph analysis, researchers aim to uncover the internal mechanisms of LLMs: how they learn, reason, and generate text, and where their limitations lie (a minimal activation-patching sketch follows the highlights below). This line of work helps make LLMs more transparent, reliable, and safe.

🧠 **Sparse coding and feature extraction:** Research applies sparse autoencoders to LLMs to discover interpretable model directions and to extract interpretable feature circuits, giving a clearer picture of the models' internal workings.
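The sparse-autoencoder work summarized here (e.g., Cunningham et al., 2023; Bricken et al., 2023, cited below) trains an overcomplete dictionary on model activations with an L1 sparsity penalty, so that each learned feature tends to fire for a single interpretable concept. The PyTorch sketch below is a minimal illustration under assumed dimensions and hyperparameters, not the pipeline used in those papers.

```python
# Minimal sparse-autoencoder sketch for LLM activations (illustrative only).
# Assumes activations have already been cached into a tensor of shape
# (n_samples, d_model); all names and hyperparameters here are hypothetical.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_dict: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_dict)   # features = ReLU(W_enc x + b_enc)
        self.decoder = nn.Linear(d_dict, d_model)   # reconstruction = W_dec f + b_dec

    def forward(self, x):
        f = torch.relu(self.encoder(x))             # sparse, non-negative feature activations
        x_hat = self.decoder(f)
        return x_hat, f

def train_step(sae, x, optimizer, l1_coeff=1e-3):
    x_hat, f = sae(x)
    recon_loss = (x_hat - x).pow(2).mean()          # reconstruct the activation
    sparsity_loss = f.abs().mean()                  # L1 penalty keeps few features active
    loss = recon_loss + l1_coeff * sparsity_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Usage on random stand-in data (replace with cached residual-stream activations):
sae = SparseAutoencoder(d_model=768, d_dict=768 * 8)  # overcomplete dictionary
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)
acts = torch.randn(4096, 768)
for _ in range(10):
    train_step(sae, acts, opt)
```

In practice the autoencoder is trained on activations cached from very large token sets, and the dictionary size is several times the model dimension; the cited papers also study how faithfully the reconstructions preserve model behavior.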

🗣️ **Monosemanticity and multilingual processing:** One line of work asks how to make LLMs more monosemantic, i.e., how to ensure that each neuron or feature encodes a single meaning. Related work examines how LLMs behave in multilingual settings, analyzing how they process and represent text in different languages.

💡 **Reasoning and planning:** Studies probe LLMs' multi-hop reasoning, i.e., their ability to chain several inference steps when answering a question. They also ask whether LLMs plan ahead, and how such planning behavior can be detected and interpreted inside the model.
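A common probe behind these multi-hop-reasoning and planning findings is the "logit lens" family of methods (nostalgebraist, 2020; Pal et al., 2023, cited below): decode intermediate hidden states through the model's unembedding matrix to see which tokens the model already favors partway through the computation. Below is a hedged sketch using GPT-2 via Hugging Face transformers; the model choice, prompt, and the decision to re-apply the final layer norm are illustrative assumptions.

```python
# Logit-lens sketch: decode intermediate residual states with the unembedding matrix.
# Model choice (gpt2), the example prompt, and re-applying ln_f are assumptions.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

prompt = "The capital of the country where the Eiffel Tower stands is"
inputs = tok(prompt, return_tensors="pt")

with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# hidden_states: tuple of (n_layers + 1) tensors, each (batch, seq, d_model).
# Note: the final entry already includes ln_f; re-applying it here is a simplification.
for layer, h in enumerate(out.hidden_states):
    resid = model.transformer.ln_f(h[:, -1, :])   # final layer norm, then unembed
    logits = model.lm_head(resid)
    top_id = logits.argmax(dim=-1).item()
    print(f"layer {layer:2d}: top token = {tok.decode([top_id])!r}")
```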

🚫 **Refusal behavior and alignment:** Research analyzes refusal behavior, where a model declines to answer certain questions or perform certain tasks, and investigates how to align LLMs so that their behavior matches human values and safety standards.
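Concretely, Arditi et al. (2025, cited below) report that refusal is mediated by a single direction in activation space, obtained as the difference of mean activations between harmful and harmless prompts and removable by projecting it out. The sketch below illustrates that difference-of-means idea on stand-in tensors; it is not a reproduction of the paper's pipeline.

```python
# Difference-of-means sketch for a "refusal direction" (stand-in data only).
# Real use would cache residual-stream activations at a chosen layer and token
# position for harmful vs. harmless prompt sets; random tensors play that role here.
import torch

d_model = 768
acts_harmful = torch.randn(256, d_model) + 0.5   # stand-in activations on harmful prompts
acts_harmless = torch.randn(256, d_model)        # stand-in activations on harmless prompts

# Candidate refusal direction: difference of the two means, normalized.
direction = acts_harmful.mean(dim=0) - acts_harmless.mean(dim=0)
direction = direction / direction.norm()

def ablate_direction(x: torch.Tensor, d: torch.Tensor) -> torch.Tensor:
    """Remove the component of x along the unit vector d (projection ablation)."""
    return x - (x @ d).unsqueeze(-1) * d

# Applying this ablation to the residual stream during generation is how the cited
# work suppresses refusal (adding the direction instead induces it).
patched = ablate_direction(acts_harmful, direction)
print(patched.shape, (patched @ direction).abs().max())  # component along d ~ 0
```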

🩺 **Medical applications and factual recall:** Studies examine the use of LLMs in medical diagnosis and their ability to recall facts, while also documenting limitations such as hallucination and inaccuracy.
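Beyond these individual highlights, the overview names activation patching as one of the core causal tools in this literature (Zhang & Nanda, 2023; Heimersheim & Nanda, 2024, cited below): run the model on a "clean" and a "corrupted" prompt, copy a hidden state from the clean run into the corrupted run, and measure how much of the clean behavior is restored. The GPT-2 sketch below is a minimal illustration; the prompts, patched layer, and logit metric are assumptions made for the example.

```python
# Activation-patching sketch on GPT-2: patch one layer's hidden state at the last
# token from a clean run into a corrupted run and compare the answer logit.
# Prompts, layer index, and the logit metric are illustrative assumptions.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

clean = "The Eiffel Tower is located in the city of"
corrupt = "The Colosseum is located in the city of"
answer_id = tok(" Paris")["input_ids"][0]
layer = 6  # illustrative choice of layer to patch

def last_token_logit(prompt, patch_state=None):
    ids = tok(prompt, return_tensors="pt")
    handle = None
    if patch_state is not None:
        def hook(module, inputs, output):
            hidden = output[0].clone()
            hidden[:, -1, :] = patch_state           # overwrite last-token hidden state
            return (hidden,) + output[1:]
        handle = model.transformer.h[layer].register_forward_hook(hook)
    with torch.no_grad():
        out = model(**ids, output_hidden_states=True)
    if handle is not None:
        handle.remove()
    return out.logits[0, -1, answer_id].item(), out.hidden_states

clean_logit, clean_hidden = last_token_logit(clean)
corrupt_logit, _ = last_token_logit(corrupt)
# hidden_states[layer + 1] is the output of transformer block `layer`.
patched_logit, _ = last_token_logit(corrupt, patch_state=clean_hidden[layer + 1][:, -1, :])

print(f"clean {clean_logit:.2f}  corrupt {corrupt_logit:.2f}  patched {patched_logit:.2f}")
```

If patching the clean hidden state into the corrupted run moves the answer logit back toward the clean value, that layer and position causally carry the relevant information; attribution-graph methods extend this idea to whole graphs of features.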

    Cunningham, H., Ewart, A., Smith, L., et al. (2023). Sparse autoencoders find highly interpretable model directions. arXiv preprint arXiv:2309.08600.

    Bricken, T., Templeton, A., Batson, J., et al. (2023). Towards monosemanticity: Decomposing language models with dictionary learning. Transformer Circuits Thread.

    Templeton, A., Conerly, T., Marcus, J., et al. (2024). Scaling monosemanticity: Extracting interpretable features from Claude 3 Sonnet. Transformer Circuits Thread.

    Gao, L., la Tour, T. D., Tillman, H., et al. (2024). Scaling and evaluating sparse autoencoders. arXiv preprint arXiv:2406.04093.

    Dunefsky, J., Chlenski, P., & Nanda, N. (2025). Transcoders find interpretable LLM feature circuits. Advances in Neural Information Processing Systems, 37, 24375-24410.

    Marks, S., Rager, C., Michaud, E. J., et al. (2024). Sparse feature circuits: Discovering and editing interpretable causal graphs in language models. arXiv preprint arXiv:2403.19647.

    Ge, X., Zhu, F., Shu, W., et al. (2024). Automatically identifying local and global circuits with linear computation graphs. arXiv preprint arXiv:2405.13868.

    Lindsey, J., Templeton, A., Marcus, J., et al. (2024). Sparse crosscoders for cross-layer features and model diffing. Transformer Circuits Thread.

    Arora, S., Li, Y., Liang, Y., et al. (2018). Linear algebraic structure of word senses, with applications to polysemy. Transactions of the Association for Computational Linguistics, 6, 483-495.

    Goh, G. (2016). Decoding the thought vector. [Online article].

    Olah, C., Cammarata, N., Schubert, L., et al. (2020). Zoom in: An introduction to circuits. Distill. https://doi.org/10.23915/distill.00024.001

    Elhage, N., Hume, T., Olsson, C., et al. (2022). Toy models of superposition. Transformer Circuits Thread.

    Yang, S., Gribovskaya, E., Kassner, N., et al. (2024). Do large language models latently perform multi-hop reasoning? arXiv preprint arXiv:2402.16837.

    Yu, Z., Belinkov, Y., & Ananiadou, S. (2025). Back attention: Understanding and enhancing multi-hop reasoning in large language models. arXiv preprint arXiv:2502.10835.

    Biran, E., Gottesman, D., Yang, S., et al. (2024). Hopping too late: Exploring the limitations of large language models on multi-hop queries. arXiv preprint arXiv:2406.12775.

    Taufeeque, M., Quirke, P., Li, M., et al. (2024). Planning in a recurrent neural network that plays Sokoban. arXiv preprint arXiv:2407.15421.

    Bush, T., Chung, S., Anwar, U., et al. (2025). Interpreting emergent planning in model-free reinforcement learning. The Thirteenth International Conference on Learning Representations.

    Jenner, E., Kapur, S., Georgiev, V., et al. (2025). Evidence of learned look-ahead in a chess-playing neural network. Advances in Neural Information Processing Systems, 37, 31410-31437.

    Pal, K., Sun, J., Yuan, A., et al. (2023). Future lens: Anticipating subsequent tokens from a single hidden state. arXiv preprint arXiv:2311.04897.

    Wu, W., Morris, J. X., & Levine, L. (2024). Do language models plan ahead for future tokens? arXiv preprint arXiv:2404.00859.

    Pochinkov, N. (2025). ParaScopes: Do language models plan the upcoming paragraph? [Preprint].

    Goh, G., Cammarata, N., Voss, C., et al. (2021). Multimodal neurons in artificial neural networks. Distill. https://doi.org/10.23915/distill.00030

    Olsson, C., Elhage, N., & Olah, C. (2022). MLP neurons - 40L preliminary investigation. [Blog post].

    Brinkmann, J., Wendler, C., Bartelt, C., & Mueller, A. (2025). Large language models share representations of latent grammatical concepts across typologically diverse languages. arXiv preprint arXiv:2501.06346.

    Dumas, C., Veselovsky, V., Monea, G., et al. (2024). How do llamas process multilingual text? A latent exploration through activation patching. ICML 2024 Workshop on Mechanistic Interpretability.

    Dumas, C., Wendler, C., Veselovsky, V., et al. (2024). Separating tongue from thought: Activation patching reveals language-agnostic concept representations in transformers. arXiv preprint arXiv:2411.08745.

    Zhang, R., Yu, Q., Zang, M., et al. (2024). The same but different: Structural similarities and differences in multilingual language modeling. arXiv preprint arXiv:2410.09223.

    Fierro, C., Foroutan, N., Elliott, D., & Sogaard, A. (2024). How do multilingual models remember? Investigating multilingual factual recall mechanisms. arXiv preprint arXiv:2410.14387.

    Schut, L., Gal, Y., & Farquhar, S. (2025). Do multilingual LLMs think in English? [Preprint].

    Wendler, C., Veselovsky, V., Monea, G., & West, R. (2024). Do llamas work in English? On the latent language of multilingual transformers. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics, 1, 15366-15394.

    Olah, C., Cammarata, N., Voss, C., et al. (2020). Naturally occurring equivariance in neural networks. Distill. https://doi.org/10.23915/distill.00024.004

    Todd, E., Li, M. L., Sharma, A. S., et al. (2023). Function vectors in large language models. arXiv preprint arXiv:2310.15213.

    Elhage, N., Hume, T., Olsson, C., et al. (2022). Softmax linear units. Transformer Circuits Thread.

    McDuff, D., Schaekermann, M., Tu, T., et al. (2023). Towards accurate differential diagnosis with large language models. arXiv preprint arXiv:2312.00164.

    Goh, E., Gallo, R., Hom, J., et al. (2024). Large language model influence on diagnostic reasoning: A randomized clinical trial. JAMA Network Open, 7(10), e2440969.

    Reese, J. T., Danis, D., Caufield, J. H., et al. (2024). On the limitations of large language models in clinical diagnosis. medRxiv, 2023-07.

    Savage, T., Nayak, A., Gallo, R., et al. (2024). Diagnostic reasoning prompts reveal the potential for large language model interpretability in medicine. NPJ Digital Medicine, 7(1), 20.

    Turpin, M., Michael, J., Perez, E., & Bowman, S. (2023). Language models don't always say what they think: Unfaithful explanations in chain-of-thought prompting. Advances in Neural Information Processing Systems, 36, 74952-74965.

    Arcuschin, I., Janiak, J., Krzyzanowski, R., et al. (2025). Chain-of-thought reasoning in the wild is not always faithful. arXiv preprint arXiv:2503.08679.

    Ferrando, J., Obeso, O., Rajamanoharan, S., & Nanda, N. (2024). Do I know this entity? Knowledge awareness and hallucinations in language models. arXiv preprint arXiv:2411.14257.

    Zou, A., Phan, L., Chen, S., et al. (2023). Representation engineering: A top-down approach to AI transparency. arXiv preprint arXiv:2310.01405.

    Arditi, A., Obeso, O., Syed, A., et al. (2025). Refusal in language models is mediated by a single direction. Advances in Neural Information Processing Systems, 37, 136037-136083.

    Marshall, T., Scherlis, A., & Belrose, N. (2024). Refusal in LLMs is an affine function. arXiv preprint arXiv:2411.09003.

    Lee, D., Breck, E., & Arditi, A. (2025). Finding features causally upstream of refusal. [Preprint].

    Xie, T., Qi, X., Zeng, Y., et al. (2024). Sorry-bench: Systematically evaluating large language model safety refusal behaviors. arXiv preprint arXiv:2406.14598.

    Wollschlager, T., Elstner, J., Geisler, S., et al. (2025). The geometry of refusal in large language models: Concept cones and representational independence. arXiv preprint arXiv:2502.17420.

    Jain, S., Lubana, E. S., Oksuz, K., et al. (2025). What makes and breaks safety fine-tuning? A mechanistic study. Advances in Neural Information Processing Systems, 37, 93406-93478.

    Kissane, C., Krzyzanowski, R., Conmy, A., & Nanda, N. (2024). Base LLMs refuse too. [Preprint].

    Kissane, C., Krzyzanowski, R., Nanda, N., & Conmy, A. (2024). SAEs are highly dataset dependent: A case study on the refusal direction. Alignment Forum.

    Frankfurt, H. G. (2009). On bullshit. Princeton University Press.

    Sharma, M., Tong, M., Korbak, T., et al. (2023). Towards understanding sycophancy in language models. arXiv preprint arXiv:2310.13548.

    Betley, J., Bao, X., Soto, M., et al. (2025). Tell me about yourself: LLMs are aware of their learned behaviors. arXiv preprint arXiv:2501.11120.

    Greenblatt, R., Denison, C., Wright, B., et al. (2024). Alignment faking in large language models. arXiv preprint arXiv:2412.14093.

    Marks, S., Treutlein, J., Bricken, T., et al. (2025). Auditing language models for hidden objectives. [Preprint].

    Alon, U. (2019). An introduction to systems biology: Design principles of biological circuits (2nd ed.). CRC Press. https://doi.org/10.1201/9781420011432

    Tigges, C., Hollinsworth, O. J., Geiger, A., & Nanda, N. (2023). Linear representations of sentiment in large language models. [Preprint].

    Gurnee, W., & Tegmark, M. (2024). Language models represent space and time. [Preprint].

    Gurnee, W., Horsley, T., Guo, Z. C., et al. (2024). Universal neurons in GPT-2 language models. arXiv preprint arXiv:2401.12181.

    Stolfo, A., Wu, B., Gurnee, W., et al. (2025). Confidence regulation neurons in language models. Advances in Neural Information Processing Systems, 37, 125019-125049.

    Schubert, L., Voss, C., Cammarata, N., et al. (2021). High-low frequency detectors. Distill. https://doi.org/10.23915/distill.00024.005

    Zhong, Z., Liu, Z., Tegmark, M., & Andreas, J. (2023). The clock and the pizza: Two stories in mechanistic explanation of neural networks. Advances in Neural Information Processing Systems, 36, 27223-27250.

    Hou, Y., Li, J., Fei, Y., et al. (2023). Towards a mechanistic interpretation of multi-step reasoning capabilities of language models. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 4902-4919.

    Brinkmann, J., Sheshadri, A., Levoso, V., et al. (2024). A mechanistic analysis of a transformer trained on a symbolic multi-step reasoning task. arXiv preprint arXiv:2402.11917.

    Meng, K., Bau, D., Andonian, A., & Belinkov, Y. (2022). Locating and editing factual knowledge in GPT. arXiv preprint arXiv:2202.05262.

    Geva, M., Bastings, J., Filippova, K., & Globerson, A. (2023). Dissecting recall of factual associations in auto-regressive language models. arXiv preprint arXiv:2304.14767.

    Guez, A., Mirza, M., Gregor, K., et al. (2019). An investigation of model-free planning. International Conference on Machine Learning, 2464-2473.

    Pochinkov, N., Benoit, A., Agarwal, L., et al. (2024). Extracting paragraphs from LLM token activations. arXiv preprint arXiv:2409.06328.

    Mu, J., Li, X., & Goodman, N. (2023). Learning to compress prompts with gist tokens. Advances in Neural Information Processing Systems, 36, 19327-19352.

    Pires, T., Schlinger, E., & Garrette, D. (2019). How multilingual is multilingual BERT? arXiv preprint arXiv:1906.01502.

    Wu, Z., Yu, X. V., Yogatama, D., et al. (2024). The semantic hub hypothesis: Language models share semantic representations across languages and modalities. arXiv preprint arXiv:2411.04986.

    Zhao, Y., Zhang, W., Chen, G., et al. (2025). How do large language models handle multilingualism? Advances in Neural Information Processing Systems, 37, 15296-15319.

    nostalgebraist. (2020). Interpreting GPT: The logit lens. [Blog post].

    Zhang, F., & Nanda, N. (2023). Towards best practices of activation patching in language models: Metrics and methods. arXiv preprint arXiv:2309.16042.

    Heimersheim, S., & Nanda, N. (2024). How to use and interpret activation patching. arXiv preprint arXiv:2404.15255.

    Ferrando, J., & Costa-jussa, M. R. (2024). On the similarity of circuits across languages: A case study on the subject-verb agreement task. arXiv preprint arXiv:2410.06496.

    Liu, Z., Kitouni, O., Nolte, N. S., et al. (2022). Towards understanding grokking: An effective theory of representation learning. Advances in Neural Information Processing Systems, 35, 34651-34663.

    Nanda, N., Chan, L., Lieberum, T., et al. (2023). Progress measures for grokking via mechanistic interpretability. arXiv preprint arXiv:2301.05217.

    Stolfo, A., Belinkov, Y., & Sachan, M. (2023). A mechanistic interpretation of arithmetic reasoning in language models using causal mediation analysis. arXiv preprint arXiv:2305.15054.

    Zhou, T., Fu, D., Sharan, V., & Jia, R. (2024). Pre-trained large language models use Fourier features to compute addition. arXiv preprint arXiv:2406.03445.

    Nikankin, Y., Reusch, A., Mueller, A., & Belinkov, Y. (2024). Arithmetic without algorithms: Language models solve math with a bag of heuristics. [Preprint].

    Kantamneni, S., & Tegmark, M. (2025). Language models use trigonometry to do addition. [Preprint].

    Amann, J., Blasimme, A., Vayena, E., et al. (2020). Explainability for artificial intelligence in healthcare: A multidisciplinary perspective. BMC Medical Informatics and Decision Making, 20, 1-9.

    Band, S. S., Yarahmadi, A., Hsu, C., et al. (2023). Application of explainable artificial intelligence in medical health: A systematic review of interpretability methods. Informatics in Medicine Unlocked, 40, 101286.

    Kanjee, Z., Crowe, B., & Rodman, A. (2023). Accuracy of a generative artificial intelligence model in a complex diagnostic challenge. JAMA, 330(1), 78-80.

    Strong, E., DiGiammarino, A., Weng, Y., et al. (2023). Chatbot vs medical student performance on free-response clinical reasoning examinations. JAMA Internal Medicine, 183(9), 1028-1030.

    Geng, J., Cai, F., Wang, Y., et al. (2023). A survey of confidence estimation and calibration in large language models. arXiv preprint arXiv:2311.08298.

    Gawlikowski, J., Tassi, C. R. N., Ali, M., et al. (2023). A survey of uncertainty in deep neural networks. Artificial Intelligence Review, 56(Suppl 1), 1513-1589.

    Ahdritz, G., Qin, T., Vyas, N., et al. (2024). Distinguishing the knowable from the unknowable with language models. arXiv preprint arXiv:2402.03563.

    Lee, A., Bai, X., Pres, I., et al. (2024). A mechanistic understanding of alignment algorithms: A case study on DPO and toxicity. arXiv preprint arXiv:2401.01967.

    Wei, A., Haghtalab, N., & Steinhardt, J. (2023). Jailbroken: How does LLM safety training fail? Advances in Neural Information Processing Systems, 36, 80079-80110.

    Andriushchenko, M., Croce, F., & Flammarion, N. (2024). Jailbreaking leading safety-aligned LLMs with simple adaptive attacks. arXiv preprint arXiv:2404.02151.

    Anil, C., Durmus, E., Panickssery, N., et al. (2025). Many-shot jailbreaking. Advances in Neural Information Processing Systems, 37, 129696-129742.

    He, Z., Wang, Z., Chu, Z., et al. (2024). JailbreakLens: Interpreting jailbreak mechanism in the lens of representation and circuit. arXiv preprint arXiv:2411.11114.

    Lanham, T., Chen, A., Radhakrishnan, A., et al. (2023). Measuring faithfulness in chain-of-thought reasoning. arXiv preprint arXiv:2307.13702.

    Radhakrishnan, A., Nguyen, K., Chen, A., et al. (2023). Question decomposition improves the faithfulness of model-generated reasoning. arXiv preprint arXiv:2307.11768.

    Batson, J., Ameisen, E., et al. (2025). Circuit tracing: Revealing computational graphs in language models. Transformer Circuits Thread. transformer-circuits.pub/2025/attribution-graphs/methods.html
