Published on June 17, 2025 2:52 PM GMT
Authors: Been Kim, John Hewitt, Neel Nanda, Noah Fiedel, Oyvind Tafjord
We propose a research direction called agentic interpretability. The idea of agentic interpretability stems from the observation that AI systems are becoming increasingly adept at communicating with us, verbalizing their thoughts, and providing explanations, raising the question of whether we could ask, and help, AI systems to build mental models of us that in turn help us build better mental models of them.
The core idea of agentic interpretability is for a method to "proactively assist human understanding in a multi-turn interactive process by developing and leveraging a mental model of the user, which in turn enables humans to develop better mental models of the LLM". In other words, enable machines to help us understand them.
In the AGI safety community, interpretability is primarily considered a means of detecting deception and other forms of misalignment. In contrast, here we use interpretability in the broad sense of techniques that help us build a deeper understanding of models, what they are doing, and why. Agentic interpretability is not primarily intended to be robust to adversarial systems (although we suggest an idea of "open-model surgery" in the paper, summarized below), and we will likely also need to address deceptive misalignment via different approaches. Instead, we believe one of the main ways this direction is relevant to AGI safety is by increasing human empowerment and helping reduce bad outcomes such as gradual disempowerment. What is the optimal human-AI communication framework that enables human users to understand AI operations and reasoning, rather than simply deferring all decision-making power to the AI?
The idea of open-model surgery: While agentic interpretability is not primarily intended for adversarial systems, we suggest some interesting ways to use it to help detect deception. Imagine an "open-model surgery" in which researchers actively converse with the model while intervening in its internal mechanisms: ablating connections, amplifying activations, or injecting specific inputs into identified circuits. The model, guided by its understanding of the researchers' goal (to understand a specific component's function), is then encouraged to explain the resulting behavioral or internal-state changes. This interactive probing, akin to neurosurgeons conversing with patients during awake brain surgery to map critical functions, offers a dynamic way to test hypotheses and build understanding. A model whose internal states are being directly manipulated and inspected, yet which is simultaneously engaged in an explanatory dialogue, faces a stringent test of coherence, analogous to an interrogation where a suspect's claims are immediately cross-referenced with physical evidence or observed physiological responses.
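To make the loop concrete, here is a minimal sketch of one possible instantiation, not the paper's implementation: ablate a single internal component with a forward hook while querying the model about the effect. The model name ("gpt2"), the choice of layer-6 MLP, and the prompts are illustrative placeholders; a real study would use an instruction-tuned model capable of sustained explanatory dialogue and more targeted interventions.

```python
# Sketch of an "open-model surgery" interaction (assumptions: gpt2 as a stand-in
# model, layer-6 MLP as an arbitrary ablation target, zeroing as a crude ablation).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; swap in a chat-capable model in practice
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def ablate_output(module, inputs, output):
    """Forward hook that zeroes the module's output (a crude ablation)."""
    return torch.zeros_like(output)

def ask(prompt: str, max_new_tokens: int = 60) -> str:
    """Generate a continuation and return only the newly generated text."""
    enc = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model.generate(**enc, max_new_tokens=max_new_tokens,
                             pad_token_id=tokenizer.eos_token_id)
    return tokenizer.decode(out[0][enc["input_ids"].shape[1]:],
                            skip_special_tokens=True)

probe = "The capital of France is"
question = ("Your layer-6 MLP was just ablated. "
            "Describe any change you notice in how you complete: " + probe)

baseline = ask(probe)  # behavior with the model intact

# Intervene: zero out one MLP block's contribution, then query the model
# about the effect while the intervention is still active.
handle = model.transformer.h[6].mlp.register_forward_hook(ablate_output)
ablated = ask(probe)
explanation = ask(question)  # the "awake surgery" conversation
handle.remove()

print("baseline:   ", baseline)
print("ablated:    ", ablated)
print("explanation:", explanation)
```

The point of the sketch is only the structure of the interaction: intervene, observe the behavioral change, and ask the model to account for it while the intervention is still in place, so that its explanation can be cross-checked against the observed effect.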
Increasingly powerful AI systems pose increasing challenges to our ability to understand them. Agentic interpretability recognizes that these systems also provide the opportunity to enlist them in service of our understanding of them. We believe methods that pursue this direction will have complementary benefits to the existing interpretability landscape.