MarkTechPost@AI · April 6, 23:05
This AI Paper from Anthropic Introduces Attribution Graphs: A New Interpretability Method to Trace Internal Reasoning in Claude 3.5 Haiku

This article introduces "attribution graphs," a new interpretability technique developed by researchers at Anthropic to trace the internal reasoning processes of large language models (LLMs). Applied to the Claude 3.5 Haiku model, attribution graphs reveal the intermediate steps and internal representations the model uses while solving problems, such as pre-planning rhyming words when writing poetry and forming intermediate concepts when answering multi-step questions. The findings suggest that LLMs do not merely mimic human responses but compute through layered, structured steps, which opens the door to more transparent and responsible deployment of advanced AI systems.

🧠 Attribution graphs are a new interpretability technique for tracing the flow of information inside LLMs.

💡 By identifying how information flows between a model's internal features, attribution graphs reveal intermediate concepts and reasoning steps.

✍️ The researchers applied attribution graphs to the Claude 3.5 Haiku model and found that it exhibits sophisticated reasoning behavior in poetry writing and multi-step question answering.

🔬 The research shows that Claude 3.5 Haiku can plan ahead, set internal goals, and reason step by step without explicit instruction.

🔓 Attribution graphs make it possible to audit model behavior in depth, enabling more transparent and responsible deployment of AI systems.

While the outputs of large language models (LLMs) appear coherent and useful, the underlying mechanisms guiding these behaviors remain largely unknown. As these models are increasingly deployed in sensitive and high-stakes environments, it has become crucial to understand what they do and how they do it.

The main challenge lies in uncovering the internal steps that lead a model to a specific response. The computations happen across hundreds of layers and billions of parameters, making it difficult to isolate the processes involved. Without a clear understanding of these steps, trusting or debugging their behavior becomes harder, especially in tasks requiring reasoning, planning, or factual reliability. Researchers are thus focused on reverse-engineering these models to identify how information flows and decisions are made internally.

Existing interpretability methods like attention maps and feature attribution offer partial views into model behavior. While these tools help highlight which input tokens contribute to outputs, they often fail to trace the full chain of reasoning or identify intermediate steps. Moreover, these tools usually focus on surface-level behaviors and do not provide consistent insight into deeper computational structures. This has created the need for more structured, fine-grained methods to trace logic through internal representations over multiple steps.

To address this, researchers from Anthropic introduced a new technique called attribution graphs. These graphs allow researchers to trace the internal flow of information between features within a model during a single forward pass. By doing so, they attempt to identify intermediate concepts or reasoning steps that are not visible from the model’s outputs alone. The attribution graphs generate hypotheses about the computational pathways a model follows, which are then tested using perturbation experiments. This approach marks a significant step toward revealing the “wiring diagram” of large models, much like how neuroscientists map brain activity.
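
To make the idea concrete, here is a minimal sketch of building such a graph. It is not Anthropic's implementation: it assumes a simplified setting in which each layer's interpretable features are a linear function of the previous layer's features, so an edge can be attributed as the product of an upstream activation and its connecting weight. The layer sizes, the threshold, and the random weights below are all illustrative placeholders.

```python
# Toy attribution graph: NOT the paper's method, just an illustration of
# turning per-feature contributions into a graph of hypothesized edges.
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: three layers with four interpretable features each.
activations = [rng.random(4) for _ in range(3)]        # feature activations per layer
weights = [rng.normal(size=(4, 4)) for _ in range(2)]  # connections from layer l to layer l+1

def attribution_edges(activations, weights, threshold=0.2):
    """Collect edges (layer, src) -> (layer + 1, dst) whose attributed
    contribution, activation[src] * W[src, dst], exceeds the threshold."""
    edges = []
    for layer, W in enumerate(weights):
        contrib = activations[layer][:, None] * W      # contribution of each src feature to each dst feature
        for src, dst in zip(*np.where(np.abs(contrib) > threshold)):
            edges.append(((layer, int(src)), (layer + 1, int(dst)), float(contrib[src, dst])))
    return edges

# Each edge is a hypothesis about internal information flow,
# to be tested afterwards with perturbation experiments.
for src, dst, value in attribution_edges(activations, weights):
    print(f"feature L{src[0]}F{src[1]} -> L{dst[0]}F{dst[1]}: {value:+.2f}")
```

In the actual method described by the article, the features are interpretable features identified inside the model during a single forward pass, and the resulting graph is treated as a set of hypotheses about computational pathways rather than a final answer.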

The researchers applied attribution graphs to Claude 3.5 Haiku, a lightweight language model released by Anthropic in October 2024. The method begins by identifying interpretable features activated by a specific input. These features are then traced to determine their influence on the final output. For example, when prompted with a riddle or poem, the model selects a set of rhyming words before writing each line, a form of planning. In another example, the model identifies “Texas” as an intermediate step in answering the question, “What’s the capital of the state containing Dallas?”, which it correctly resolves as “Austin.” The graphs reveal not only the model’s outputs but also how it internally represents and transitions between ideas.
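
The article notes that hypotheses read off an attribution graph are tested with perturbation experiments. The sketch below shows the general shape of such a test on a toy PyTorch network: suppress a hypothesized intermediate feature (a stand-in for a “Texas” feature) and check whether a downstream output of interest (a stand-in for the “Austin” logit) weakens. The tiny MLP, the feature index, and the logit index are hypothetical placeholders, not Claude 3.5 Haiku internals.

```python
# Hedged sketch of a perturbation (ablation) test on a toy network.
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy stand-in: 16-dim input -> 32 hidden "features" -> 10 output logits.
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 10))
x = torch.randn(1, 16)           # stand-in for the prompt's representation
texas_feature_idx = 7            # hypothetical intermediate feature to suppress
austin_logit_idx = 3             # hypothetical downstream output of interest

def ablate_feature(module, inputs, output):
    # Forward hook: return a copy of the layer output with the
    # hypothesized intermediate feature zeroed out.
    output = output.clone()
    output[:, texas_feature_idx] = 0.0
    return output

with torch.no_grad():
    baseline = model(x)[0, austin_logit_idx].item()
    handle = model[1].register_forward_hook(ablate_feature)  # hook on the hidden activation
    ablated = model(x)[0, austin_logit_idx].item()
    handle.remove()

# A large drop supports the hypothesized pathway; little change weakens it.
print(f"logit before suppressing the feature: {baseline:+.3f}")
print(f"logit after  suppressing the feature: {ablated:+.3f}")
```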

The results from attribution graphs uncovered several advanced behaviors within Claude 3.5 Haiku. In poetry tasks, the model pre-plans rhyming words before composing each line, showing anticipatory reasoning. In multi-hop questions, the model forms internal intermediate representations, such as associating Dallas with Texas before determining Austin as the answer. For multilingual inputs, it leverages both language-specific and abstract circuits, with the latter more prominent in Claude 3.5 Haiku than in earlier models. Further, in medical reasoning tasks, the model generates diagnoses internally and uses them to inform follow-up questions. These findings suggest that the model can carry out abstract planning, internal goal-setting, and stepwise logical deduction without explicit instruction.

This research presents attribution graphs as a valuable interpretability tool that reveals the hidden layers of reasoning in language models. By applying this method, the team from Anthropic has shown that models like Claude 3.5 Haiku don’t merely mimic human responses—they compute through layered, structured steps. This opens the door to deeper audits of model behavior, allowing more transparent and responsible deployment of advanced AI systems.


Check out the Paper. All credit for this research goes to the researchers of this project.



