智源社区 · June 4, 10:03
Opening the AI Black Box: How Can Attribution Graphs Map the "Brain Circuitry" of Large Language Models?

This article surveys recent research on the interpretability of large language models (LLMs), covering sparse autoencoders, monosemanticity, feature circuits, multi-hop reasoning, planning, multimodal neurons, multilingual processing, factual recall, refusal behavior, alignment, sentiment, and spatial and temporal representations. Using techniques such as sparse coding, activation patching, and causal graph analysis, researchers aim to uncover the internal mechanisms of LLMs: how they learn, reason, and generate text, and where their limitations lie (a minimal activation-patching sketch follows the highlights below). This line of work helps make LLMs more transparent, reliable, and safe.

🧠 **Sparse coding and feature extraction:** Research applies sparse autoencoders to LLMs to discover interpretable model directions and to extract interpretable feature circuits, giving a clearer picture of the models' internal workings.
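The sparse-autoencoder work summarized here (e.g., Cunningham et al., 2023; Bricken et al., 2023, cited below) trains an overcomplete dictionary on model activations with an L1 sparsity penalty, so that each learned feature tends to fire for a single interpretable concept. The PyTorch sketch below is a minimal illustration under assumed dimensions and hyperparameters, not the pipeline used in those papers.

```python
# Minimal sparse-autoencoder sketch for LLM activations (illustrative only).
# Assumes activations have already been cached into a tensor of shape
# (n_samples, d_model); all names and hyperparameters here are hypothetical.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_dict: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_dict)   # features = ReLU(W_enc x + b_enc)
        self.decoder = nn.Linear(d_dict, d_model)   # reconstruction = W_dec f + b_dec

    def forward(self, x):
        f = torch.relu(self.encoder(x))             # sparse, non-negative feature activations
        x_hat = self.decoder(f)
        return x_hat, f

def train_step(sae, x, optimizer, l1_coeff=1e-3):
    x_hat, f = sae(x)
    recon_loss = (x_hat - x).pow(2).mean()          # reconstruct the activation
    sparsity_loss = f.abs().mean()                  # L1 penalty keeps few features active
    loss = recon_loss + l1_coeff * sparsity_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Usage on random stand-in data (replace with cached residual-stream activations):
sae = SparseAutoencoder(d_model=768, d_dict=768 * 8)  # overcomplete dictionary
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)
acts = torch.randn(4096, 768)
for _ in range(10):
    train_step(sae, acts, opt)
```

In practice the autoencoder is trained on activations cached from very large token sets, and the dictionary size is several times the model dimension; the cited papers also study how faithfully the reconstructions preserve model behavior.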

🗣️ **Monosemanticity and multilingual processing:** One line of work asks how to make LLMs more monosemantic, i.e., how to ensure that each neuron or feature encodes a single meaning. Related work examines how LLMs behave in multilingual settings, analyzing how they process and represent text in different languages.

💡 **Reasoning and planning:** Studies probe LLMs' multi-hop reasoning, i.e., their ability to chain several inference steps when answering a question. They also ask whether LLMs plan ahead, and how such planning behavior can be detected and interpreted inside the model.
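A common probe behind these multi-hop-reasoning and planning findings is the "logit lens" family of methods (nostalgebraist, 2020; Pal et al., 2023, cited below): decode intermediate hidden states through the model's unembedding matrix to see which tokens the model already favors partway through the computation. Below is a hedged sketch using GPT-2 via Hugging Face transformers; the model choice, prompt, and the decision to re-apply the final layer norm are illustrative assumptions.

```python
# Logit-lens sketch: decode intermediate residual states with the unembedding matrix.
# Model choice (gpt2), the example prompt, and re-applying ln_f are assumptions.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

prompt = "The capital of the country where the Eiffel Tower stands is"
inputs = tok(prompt, return_tensors="pt")

with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# hidden_states: tuple of (n_layers + 1) tensors, each (batch, seq, d_model).
# Note: the final entry already includes ln_f; re-applying it here is a simplification.
for layer, h in enumerate(out.hidden_states):
    resid = model.transformer.ln_f(h[:, -1, :])   # final layer norm, then unembed
    logits = model.lm_head(resid)
    top_id = logits.argmax(dim=-1).item()
    print(f"layer {layer:2d}: top token = {tok.decode([top_id])!r}")
```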

🚫 **Refusal behavior and alignment:** Research analyzes refusal behavior, where a model declines to answer certain questions or perform certain tasks, and investigates how to align LLMs so that their behavior matches human values and safety standards.
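Concretely, Arditi et al. (2025, cited below) report that refusal is mediated by a single direction in activation space, obtained as the difference of mean activations between harmful and harmless prompts and removable by projecting it out. The sketch below illustrates that difference-of-means idea on stand-in tensors; it is not a reproduction of the paper's pipeline.

```python
# Difference-of-means sketch for a "refusal direction" (stand-in data only).
# Real use would cache residual-stream activations at a chosen layer and token
# position for harmful vs. harmless prompt sets; random tensors play that role here.
import torch

d_model = 768
acts_harmful = torch.randn(256, d_model) + 0.5   # stand-in activations on harmful prompts
acts_harmless = torch.randn(256, d_model)        # stand-in activations on harmless prompts

# Candidate refusal direction: difference of the two means, normalized.
direction = acts_harmful.mean(dim=0) - acts_harmless.mean(dim=0)
direction = direction / direction.norm()

def ablate_direction(x: torch.Tensor, d: torch.Tensor) -> torch.Tensor:
    """Remove the component of x along the unit vector d (projection ablation)."""
    return x - (x @ d).unsqueeze(-1) * d

# Applying this ablation to the residual stream during generation is how the cited
# work suppresses refusal (adding the direction instead induces it).
patched = ablate_direction(acts_harmful, direction)
print(patched.shape, (patched @ direction).abs().max())  # component along d ~ 0
```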

🩺 **Medical applications and factual recall:** Studies examine the use of LLMs in medical diagnosis and their ability to recall facts, while also documenting limitations such as hallucination and inaccuracy.
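Beyond these individual highlights, the overview names activation patching as one of the core causal tools in this literature (Zhang & Nanda, 2023; Heimersheim & Nanda, 2024, cited below): run the model on a "clean" and a "corrupted" prompt, copy a hidden state from the clean run into the corrupted run, and measure how much of the clean behavior is restored. The GPT-2 sketch below is a minimal illustration; the prompts, patched layer, and logit metric are assumptions made for the example.

```python
# Activation-patching sketch on GPT-2: patch one layer's hidden state at the last
# token from a clean run into a corrupted run and compare the answer logit.
# Prompts, layer index, and the logit metric are illustrative assumptions.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

clean = "The Eiffel Tower is located in the city of"
corrupt = "The Colosseum is located in the city of"
answer_id = tok(" Paris")["input_ids"][0]
layer = 6  # illustrative choice of layer to patch

def last_token_logit(prompt, patch_state=None):
    ids = tok(prompt, return_tensors="pt")
    handle = None
    if patch_state is not None:
        def hook(module, inputs, output):
            hidden = output[0].clone()
            hidden[:, -1, :] = patch_state           # overwrite last-token hidden state
            return (hidden,) + output[1:]
        handle = model.transformer.h[layer].register_forward_hook(hook)
    with torch.no_grad():
        out = model(**ids, output_hidden_states=True)
    if handle is not None:
        handle.remove()
    return out.logits[0, -1, answer_id].item(), out.hidden_states

clean_logit, clean_hidden = last_token_logit(clean)
corrupt_logit, _ = last_token_logit(corrupt)
# hidden_states[layer + 1] is the output of transformer block `layer`.
patched_logit, _ = last_token_logit(corrupt, patch_state=clean_hidden[layer + 1][:, -1, :])

print(f"clean {clean_logit:.2f}  corrupt {corrupt_logit:.2f}  patched {patched_logit:.2f}")
```

If patching the clean hidden state into the corrupted run moves the answer logit back toward the clean value, that layer and position causally carry the relevant information; attribution-graph methods extend this idea to whole graphs of features.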

    Cunningham, H., Ewart, A., Smith, L., et al. (2023). Sparse autoencoders find highly interpretable model directions. arXiv preprint arXiv:2309.08600.

    Bricken, T., Templeton, A., Batson, J., et al. (2023). Towards monosemanticity: Decomposing language models with dictionary learning. Transformer Circuits Thread.

    Templeton, A., Conerly, T., Marcus, J., et al. (2024). Scaling monosemanticity: Extracting interpretable features from Claude 3 Sonnet. Transformer Circuits Thread.

    Gao, L., la Tour, T. D., Tillman, H., et al. (2024). Scaling and evaluating sparse autoencoders. arXiv preprint arXiv:2406.04093.

    Dunefsky, J., Chlenski, P., & Nanda, N. (2025). Transcoders find interpretable LLM feature circuits. Advances in Neural Information Processing Systems, 37, 24375-24410.

    Marks, S., Rager, C., Michaud, E. J., et al. (2024). Sparse feature circuits: Discovering and editing interpretable causal graphs in language models. arXiv preprint arXiv:2403.19647.

    Ge, X., Zhu, F., Shu, W., et al. (2024). Automatically identifying local and global circuits with linear computation graphs. arXiv preprint arXiv:2405.13868.

    Lindsey, J., Templeton, A., Marcus, J., et al. (2024). Sparse crosscoders for cross-layer features and model diffing. Transformer Circuits Thread.

    Arora, S., Li, Y., Liang, Y., et al. (2018). Linear algebraic structure of word senses, with applications to polysemy. Transactions of the Association for Computational Linguistics, 6, 483-495.

    Goh, G. (2016). Decoding the thought vector. [Online article].

    Olah, C., Cammarata, N., Schubert, L., et al. (2020). Zoom in: An introduction to circuits. Distill. https://doi.org/10.23915/distill.00024.001

    Elhage, N., Hume, T., Olsson, C., et al. (2022). Toy models of superposition. Transformer Circuits Thread.

    Yang, S., Gribovskaya, E., Kassner, N., et al. (2024). Do large language models latently perform multi-hop reasoning? arXiv preprint arXiv:2402.16837.

    Yu, Z., Belinkov, Y., & Ananiadou, S. (2025). Back attention: Understanding and enhancing multi-hop reasoning in large language models. arXiv preprint arXiv:2502.10835.

    Biran, E., Gottesman, D., Yang, S., et al. (2024). Hopping too late: Exploring the limitations of large language models on multi-hop queries. arXiv preprint arXiv:2406.12775.

    Taufeeque, M., Quirke, P., Li, M., et al. (2024). Planning in a recurrent neural network that plays Sokoban. arXiv preprint arXiv:2407.15421.

    Bush, T., Chung, S., Anwar, U., et al. (2025). Interpreting emergent planning in model-free reinforcement learning. The Thirteenth International Conference on Learning Representations.

    Jenner, E., Kapur, S., Georgiev, V., et al. (2025). Evidence of learned look-ahead in a chess-playing neural network. Advances in Neural Information Processing Systems, 37, 31410-31437.

    Pal, K., Sun, J., Yuan, A., et al. (2023). Future lens: Anticipating subsequent tokens from a single hidden state. arXiv preprint arXiv:2311.04897.

    Wu, W., Morris, J. X., & Levine, L. (2024). Do language models plan ahead for future tokens? arXiv preprint arXiv:2404.00859.

    Pochinkov, N. (2025). ParaScopes: Do language models plan the upcoming paragraph? [Preprint].

    Goh, G., Cammarata, N., Voss, C., et al. (2021). Multimodal neurons in artificial neural networks. Distill. https://doi.org/10.23915/distill.00030

    Olsson, C., Elhage, N., & Olah, C. (2022). MLP neurons - 40L preliminary investigation. [Blog post].

    Brinkmann, J., Wendler, C., Bartelt, C., & Mueller, A. (2025). Large language models share representations of latent grammatical concepts across typologically diverse languages. arXiv preprint arXiv:2501.06346.

    Dumas, C., Veselovsky, V., Monea, G., et al. (2024). How do llamas process multilingual text? A latent exploration through activation patching. ICML 2024 Workshop on Mechanistic Interpretability.

    Dumas, C., Wendler, C., Veselovsky, V., et al. (2024). Separating tongue from thought: Activation patching reveals language-agnostic concept representations in transformers. arXiv preprint arXiv:2411.08745.

    Zhang, R., Yu, Q., Zang, M., et al. (2024). The same but different: Structural similarities and differences in multilingual language modeling. arXiv preprint arXiv:2410.09223.

    Fierro, C., Foroutan, N., Elliott, D., & Sogaard, A. (2024). How do multilingual models remember? Investigating multilingual factual recall mechanisms. arXiv preprint arXiv:2410.14387.

    Schut, L., Gal, Y., & Farquhar, S. (2025). Do multilingual LLMs think in English? [Preprint].

    Wendler, C., Veselovsky, V., Monea, G., & West, R. (2024). Do llamas work in English? On the latent language of multilingual transformers. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics, 1, 15366-15394.

    Olah, C., Cammarata, N., Voss, C., et al. (2020). Naturally occurring equivariance in neural networks. Distill. https://doi.org/10.23915/distill.00024.004

    Todd, E., Li, M. L., Sharma, A. S., et al. (2023). Function vectors in large language models. arXiv preprint arXiv:2310.15213.

    Elhage, N., Hume, T., Olsson, C., et al. (2022). Softmax linear units. Transformer Circuits Thread.

    McDuff, D., Schaekermann, M., Tu, T., et al. (2023). Towards accurate differential diagnosis with large language models. arXiv preprint arXiv:2312.00164.

    Goh, E., Gallo, R., Hom, J., et al. (2024). Large language model influence on diagnostic reasoning: A randomized clinical trial. JAMA Network Open, 7(10), e2440969.

    Reese, J. T., Danis, D., Caufield, J. H., et al. (2024). On the limitations of large language models in clinical diagnosis. medRxiv, 2023-07.

    Savage, T., Nayak, A., Gallo, R., et al. (2024). Diagnostic reasoning prompts reveal the potential for large language model interpretability in medicine. NPJ Digital Medicine, 7(1), 20.

    Turpin, M., Michael, J., Perez, E., & Bowman, S. (2023). Language models don't always say what they think: Unfaithful explanations in chain-of-thought prompting. Advances in Neural Information Processing Systems, 36, 74952-74965.

    Arcuschin, I., Janiak, J., Krzyzanowski, R., et al. (2025). Chain-of-thought reasoning in the wild is not always faithful. arXiv preprint arXiv:2503.08679.

    Ferrando, J., Obeso, O., Rajamanoharan, S., & Nanda, N. (2024). Do I know this entity? Knowledge awareness and hallucinations in language models. arXiv preprint arXiv:2411.14257.

    Zou, A., Phan, L., Chen, S., et al. (2023). Representation engineering: A top-down approach to AI transparency. arXiv preprint arXiv:2310.01405.

    Arditi, A., Obeso, O., Syed, A., et al. (2025). Refusal in language models is mediated by a single direction. Advances in Neural Information Processing Systems, 37, 136037-136083.

    Marshall, T., Scherlis, A., & Belrose, N. (2024). Refusal in LLMs is an affine function. arXiv preprint arXiv:2411.09003.

    Lee, D., Breck, E., & Arditi, A. (2025). Finding features causally upstream of refusal. [Preprint].

    Xie, T., Qi, X., Zeng, Y., et al. (2024). Sorry-bench: Systematically evaluating large language model safety refusal behaviors. arXiv preprint arXiv:2406.14598.

    Wollschlager, T., Elstner, J., Geisler, S., et al. (2025). The geometry of refusal in large language models: Concept cones and representational independence. arXiv preprint arXiv:2502.17420.

    Jain, S., Lubana, E. S., Oksuz, K., et al. (2025). What makes and breaks safety fine-tuning? A mechanistic study. Advances in Neural Information Processing Systems, 37, 93406-93478.

    Kissane, C., Krzyzanowski, R., Conmy, A., & Nanda, N. (2024). Base LLMs refuse too. [Preprint].

    Kissane, C., Krzyzanowski, R., Nanda, N., & Conmy, A. (2024). SAEs are highly dataset dependent: A case study on the refusal direction. Alignment Forum.

    Frankfurt, H. G. (2009). On bullshit. Princeton University Press.

    Sharma, M., Tong, M., Korbak, T., et al. (2023). Towards understanding sycophancy in language models. arXiv preprint arXiv:2310.13548.

    Betley, J., Bao, X., Soto, M., et al. (2025). Tell me about yourself: LLMs are aware of their learned behaviors. arXiv preprint arXiv:2501.11120.

    Greenblatt, R., Denison, C., Wright, B., et al. (2024). Alignment faking in large language models. arXiv preprint arXiv:2412.14093.

    Marks, S., Treutlein, J., Bricken, T., et al. (2025). Auditing language models for hidden objectives. [Preprint].

    Alon, U. (2019). An introduction to systems biology: Design principles of biological circuits (2nd ed.). CRC Press. https://doi.org/10.1201/9781420011432

    Tigges, C., Hollinsworth, O. J., Geiger, A., & Nanda, N. (2023). Linear representations of sentiment in large language models. [Preprint].

    Gurnee, W., & Tegmark, M. (2024). Language models represent space and time. [Preprint].

    Gurnee, W., Horsley, T., Guo, Z. C., et al. (2024). Universal neurons in GPT-2 language models. arXiv preprint arXiv:2401.12181.

    Stolfo, A., Wu, B., Gurnee, W., et al. (2025). Confidence regulation neurons in language models. Advances in Neural Information Processing Systems, 37, 125019-125049.

    Schubert, L., Voss, C., Cammarata, N., et al. (2021). High-low frequency detectors. Distill. https://doi.org/10.23915/distill.00024.005

    Zhong, Z., Liu, Z., Tegmark, M., & Andreas, J. (2023). The clock and the pizza: Two stories in mechanistic explanation of neural networks. Advances in Neural Information Processing Systems, 36, 27223-27250.

    Hou, Y., Li, J., Fei, Y., et al. (2023). Towards a mechanistic interpretation of multi-step reasoning capabilities of language models. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 4902-4919.

    Brinkmann, J., Sheshadri, A., Levoso, V., et al. (2024). A mechanistic analysis of a transformer trained on a symbolic multi-step reasoning task. arXiv preprint arXiv:2402.11917.

    Meng, K., Bau, D., Andonian, A., & Belinkov, Y. (2022). Locating and editing factual knowledge in GPT. arXiv preprint arXiv:2202.05262.

    Geva, M., Bastings, J., Filippova, K., & Globerson, A. (2023). Dissecting recall of factual associations in auto-regressive language models. arXiv preprint arXiv:2304.14767.

    Guez, A., Mirza, M., Gregor, K., et al. (2019). An investigation of model-free planning. International Conference on Machine Learning, 2464-2473.

    Pochinkov, N., Benoit, A., Agarwal, L., et al. (2024). Extracting paragraphs from LLM token activations. arXiv preprint arXiv:2409.06328.

    Mu, J., Li, X., & Goodman, N. (2023). Learning to compress prompts with gist tokens. Advances in Neural Information Processing Systems, 36, 19327-19352.

    Pires, T., Schlinger, E., & Garrette, D. (2019). How multilingual is multilingual BERT? arXiv preprint arXiv:1906.01502.

    Wu, Z., Yu, X. V., Yogatama, D., et al. (2024). The semantic hub hypothesis: Language models share semantic representations across languages and modalities. arXiv preprint arXiv:2411.04986.

    Zhao, Y., Zhang, W., Chen, G., et al. (2025). How do large language models handle multilingualism? Advances in Neural Information Processing Systems, 37, 15296-15319.

    nostalgebraist. (2020). Interpreting GPT: The logit lens. [Blog post].

    Zhang, F., & Nanda, N. (2023). Towards best practices of activation patching in language models: Metrics and methods. arXiv preprint arXiv:2309.16042.

    Heimersheim, S., & Nanda, N. (2024). How to use and interpret activation patching. arXiv preprint arXiv:2404.15255.

    Ferrando, J., & Costa-jussa, M. R. (2024). On the similarity of circuits across languages: A case study on the subject-verb agreement task. arXiv preprint arXiv:2410.06496.

    Liu, Z., Kitouni, O., Nolte, N. S., et al. (2022). Towards understanding grokking: An effective theory of representation learning. Advances in Neural Information Processing Systems, 35, 34651-34663.

    Nanda, N., Chan, L., Lieberum, T., et al. (2023). Progress measures for grokking via mechanistic interpretability. arXiv preprint arXiv:2301.05217.

    Stolfo, A., Belinkov, Y., & Sachan, M. (2023). A mechanistic interpretation of arithmetic reasoning in language models using causal mediation analysis. arXiv preprint arXiv:2305.15054.

    Zhou, T., Fu, D., Sharan, V., & Jia, R. (2024). Pre-trained large language models use Fourier features to compute addition. arXiv preprint arXiv:2406.03445.

    Nikankin, Y., Reusch, A., Mueller, A., & Belinkov, Y. (2024). Arithmetic without algorithms: Language models solve math with a bag of heuristics. [Preprint].

    Kantamneni, S., & Tegmark, M. (2025). Language models use trigonometry to do addition. [Preprint].

    Amann, J., Blasimme, A., Vayena, E., et al. (2020). Explainability for artificial intelligence in healthcare: A multidisciplinary perspective. BMC Medical Informatics and Decision Making, 20, 1-9.

    Band, S. S., Yarahmadi, A., Hsu, C., et al. (2023). Application of explainable artificial intelligence in medical health: A systematic review of interpretability methods. Informatics in Medicine Unlocked, 40, 101286.

    Kanjee, Z., Crowe, B., & Rodman, A. (2023). Accuracy of a generative artificial intelligence model in a complex diagnostic challenge. JAMA, 330(1), 78-80.

    Strong, E., DiGiammarino, A., Weng, Y., et al. (2023). Chatbot vs medical student performance on free-response clinical reasoning examinations. JAMA Internal Medicine, 183(9), 1028-1030.

    Geng, J., Cai, F., Wang, Y., et al. (2023). A survey of confidence estimation and calibration in large language models. arXiv preprint arXiv:2311.08298.

    Gawlikowski, J., Tassi, C. R. N., Ali, M., et al. (2023). A survey of uncertainty in deep neural networks. Artificial Intelligence Review, 56(Suppl 1), 1513-1589.

    Ahdritz, G., Qin, T., Vyas, N., et al. (2024). Distinguishing the knowable from the unknowable with language models. arXiv preprint arXiv:2402.03563.

    Lee, A., Bai, X., Pres, I., et al. (2024). A mechanistic understanding of alignment algorithms: A case study on DPO and toxicity. arXiv preprint arXiv:2401.01967.

    Wei, A., Haghtalab, N., & Steinhardt, J. (2023). Jailbroken: How does LLM safety training fail? Advances in Neural Information Processing Systems, 36, 80079-80110.

    Andriushchenko, M., Croce, F., & Flammarion, N. (2024). Jailbreaking leading safety-aligned LLMs with simple adaptive attacks. arXiv preprint arXiv:2404.02151.

    Anil, C., Durmus, E., Panickssery, N., et al. (2025). Many-shot jailbreaking. Advances in Neural Information Processing Systems, 37, 129696-129742.

    He, Z., Wang, Z., Chu, Z., et al. (2024). JailbreakLens: Interpreting jailbreak mechanism in the lens of representation and circuit. arXiv preprint arXiv:2411.11114.

    Lanham, T., Chen, A., Radhakrishnan, A., et al. (2023). Measuring faithfulness in chain-of-thought reasoning. arXiv preprint arXiv:2307.13702.

    Radhakrishnan, A., Nguyen, K., Chen, A., et al. (2023). Question decomposition improves the faithfulness of model-generated reasoning. arXiv preprint arXiv:2307.11768.

    Batson, J., Ameisen, E., et al. (2025). Circuit tracing: Revealing computational graphs in language models. Transformer Circuits Thread. transformer-circuits.pub/2025/attribution-graphs/methods.html
