Agentic Interpretability: A Strategy Against Gradual Disempowerment

 

This post proposes a research direction called "agentic interpretability," which aims to have AI systems build mental models of their human users in order to help those users better understand large language models (LLMs). Unlike traditional interpretability methods, agentic interpretability focuses on proactively assisting human understanding of AI through a multi-turn interactive process. It emphasizes human empowerment and reducing bad outcomes such as gradual disempowerment. The post also discusses the concept of "open-model surgery": intervening on an AI's internal mechanisms and prompting the model to explain the resulting behavioral changes as a way to detect deception. Agentic interpretability offers a new perspective on improving human understanding of AI.

🧠 The core of agentic interpretability is to have the AI system build a mental model of the user and proactively assist humans in understanding it, so that humans can build better mental models of the LLM.

💡 Agentic interpretability aims to increase human empowerment and keep humans from gradually losing control over AI, rather than primarily targeting adversarial AI systems.

🔪 "Open-model surgery" detects deception by intervening on the AI's internal mechanisms (e.g., ablating connections, amplifying activations) and prompting the model to explain the resulting behavioral changes, much like the functional mapping neurosurgeons perform during awake brain surgery.

🤝 Agentic interpretability holds that increasingly powerful AI systems pose challenges to our understanding but also offer the opportunity to enlist them in helping us understand them; the approach complements existing interpretability research.

Published on June 17, 2025 2:52 PM GMT

Authors: Been Kim, John Hewitt, Neel Nanda, Noah Fiedel, Oyvind Tafjord

Full paper on arXiv

 

We propose a research direction called agentic interpretability. The idea stems from the observation that AI systems are becoming increasingly adept at communicating with us, verbalizing their thoughts, and providing explanations, raising the question of whether we could ask and help AI systems to build mental models of us that, in turn, help us build better mental models of the LLMs.

 

The core idea of agentic interpretability is for a method to “proactively assist human understanding in a multi-turn interactive process by developing and leveraging a mental model of the user, which in turn enables humans to develop better mental models of the LLM”. In other words, enable machines to help us understand them.
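As a purely illustrative picture of such a multi-turn process, the sketch below keeps an explicit model of the user and conditions the assistant's explanations on it. The `UserModel` fields, the `ask_llm` stub, and the prompt wording are assumptions made for illustration, not an interface proposed in the paper.

```python
# Illustrative sketch only: a multi-turn loop in which the assistant maintains
# an explicit model of the user and conditions its explanations on it. The
# UserModel fields, the ask_llm stub, and the prompt wording are hypothetical.
from dataclasses import dataclass, field

@dataclass
class UserModel:
    background: str = "unknown"           # e.g. "ML researcher", "policy analyst"
    goals: list[str] = field(default_factory=list)
    misconceptions: list[str] = field(default_factory=list)

def ask_llm(prompt: str) -> str:
    """Stand-in for a call to the LLM being explained (any API of your choice)."""
    raise NotImplementedError

def explain(question: str, user: UserModel) -> str:
    # The explanation is conditioned on the current user model, so the same
    # question can be answered very differently for different users.
    prompt = (
        f"User background: {user.background}\n"
        f"User goals: {user.goals}\n"
        f"Misconceptions to address: {user.misconceptions}\n"
        f"Question about your own behavior: {question}\n"
        "Explain at a level this user can act on, and flag what you are unsure about."
    )
    return ask_llm(prompt)

def update_user_model(user: UserModel, question: str, feedback: str) -> UserModel:
    # After each turn, revise the user model from what the exchange revealed
    # (delegated to the LLM here; a separate lightweight model would also work).
    _revision = ask_llm(
        f"Current user model: {user}\nLatest question: {question}\n"
        f"User feedback: {feedback}\n"
        "Revise the user model (background, goals, misconceptions)."
    )
    # Parsing the revision back into the dataclass is omitted in this sketch.
    return user
```

The structural point is simply that the user model is updated across turns and fed back into every explanation, rather than each answer being generated statelessly.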

 

In the AGI safety community, interpretability is primarily considered a means of detecting deception and other forms of misalignment. In contrast, here we use interpretability in the broad sense of techniques that help us build a deeper understanding of models, what they are doing, and why. Agentic interpretability is not primarily intended to be robust to adversarial systems (although we suggest an idea of “open-model surgery” in the paper, summarized below), and we will likely also need to address deceptive misalignment via different approaches. We instead believe that much of its AGI safety relevance comes from increasing human empowerment and helping reduce bad outcomes like gradual disempowerment. What is the optimal human-AI communication framework that enables human users to understand AI operations and reasoning, rather than simply deferring all decision-making power to the AI?

 

The idea of open-model surgery: While agentic interpretability is not primarily intended for adversarial systems, we suggest some interesting ways to use agentic interpretability to help detect deception. Imagine an “open-model surgery” in which researchers actively converse with the model while intervening in its internal mechanisms: ablating connections, amplifying activations, or injecting specific inputs into identified circuits. The model, guided by its understanding of the researchers' goal (to understand a specific component's function), is then encouraged to explain the resulting behavioral or internal-state changes. This interactive probing, akin to neurosurgeons conversing with patients during awake brain surgery to map critical functions, offers a dynamic way to test hypotheses and build understanding. A model whose internal states are being directly manipulated and inspected, yet which is simultaneously engaged in an explanatory dialogue, faces a stringent test of coherence, analogous to an interrogation in which a suspect's claims are immediately cross-referenced with physical evidence or observed physiological responses.
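To make the mechanics concrete, here is a minimal sketch of the intervention half of such a session, assuming TransformerLens with gpt2 as a stand-in for the model under study; the chosen hook point (layer 8, head 7), the scale factor, and the prompt are illustrative assumptions rather than choices from the paper.

```python
# Minimal sketch of intervening on a model mid-conversation, assuming
# TransformerLens and gpt2. Hook point, head index, scale, and prompt are
# illustrative assumptions only.
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")

def amplify_head(z, hook, head=7, scale=3.0):
    # hook_z has shape [batch, pos, head_index, d_head]; scaling one head's
    # output amplifies its contribution, and scale=0.0 would ablate it instead.
    z[:, :, head, :] = z[:, :, head, :] * scale
    return z

prompt = "The Eiffel Tower is located in the city of"
tokens = model.to_tokens(prompt)

# Compare behavior with and without the intervention active.
baseline = model.generate(tokens, max_new_tokens=20, do_sample=False)
with model.hooks(fwd_hooks=[("blocks.8.attn.hook_z", amplify_head)]):
    intervened = model.generate(tokens, max_new_tokens=20, do_sample=False)

print("baseline:   ", model.to_string(baseline[0]))
print("intervened: ", model.to_string(intervened[0]))

# In the open-model-surgery setting, the next dialogue turn would tell the
# (instruction-following) model that its activations were just modified and
# ask it to explain the change in its own behavior; that explanation can then
# be cross-checked against the intervention that was actually applied.
```

gpt2 here only illustrates the intervention mechanics; the explanatory-dialogue step presupposes a model capable of coherent self-report, which is exactly the capability agentic interpretability proposes to lean on.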

 

Increasingly powerful AI systems pose increasing challenges to our ability to understand them. Agentic interpretability recognizes that these systems also provide the opportunity to enlist them in service of our understanding of them. We believe methods that pursue this direction will have complementary benefits to the existing interpretability landscape. 



