Published on July 24, 2025 7:22 PM GMT
TL;DR: We develop three agents that autonomously perform alignment auditing tasks. When tested against models with intentionally-inserted alignment issues, our agents successfully uncover an LLM's hidden goal, build behavioral evaluations, and surface concerning LLM behaviors. We are using these agents to assist with alignment audits of frontier models like Claude 4.
Read the full Anthropic Alignment Science blog post. The X thread and blog post introduction are reproduced below.
Thread
New Anthropic research: Building and evaluating alignment auditing agents.
We developed three AI agents to autonomously complete alignment auditing tasks.
In testing, our agents successfully uncovered hidden goals, built safety evaluations, and surfaced concerning behaviors.
--
As AI systems become more powerful, we need scalable ways to assess their alignment. Human alignment audits take time and are hard to validate.
Our solution: automating alignment auditing with AI agents.
Read more: https://alignment.anthropic.com/2025/automated-auditing/
--
Our first agent is an investigator. It uses chat, data analysis, and interpretability tools to do open-ended deep-dives.
In this example, the agent uses interpretability to form a hypothesis: The target model over-recommends bottled water. Then it goes to check and… "BINGO!"
--
We test our investigator agent with an auditing challenge we previously gave to human research teams. Winning requires uncovering a hidden goal in a model trained to keep it secret.
The result: our agent wins 42% of the time.
--
Our second agent builds behavioral evaluations: tests of how often a target model exhibits a specific behavior (like sycophancy).
Our agent designs, codes, runs, and analyzes evals.
They consistently work: 88% of our agent’s evals measure what they’re supposed to.
--
Our third agent was developed for the Claude 4 alignment assessment. It red-teams LLMs for concerning behaviors by having hundreds of probing conversations in parallel.
We find the agent uncovers 7/10 behaviors implanted into test models.
--
Our agents are useful for frontier model auditing:
1. Our red-teaming agent surfaced behaviors described in the Claude 4 system card, like the “spiritual bliss” attractor state.
2. Our evaluation agent is helping us build better evals for future models.
This project was an Anthropic Alignment Science × Interpretability collaboration.
To support further research, we're releasing an open-source replication of our evaluation agent and materials for our other agents:
--
If you’re interested in building autonomous agents to help us find and understand interesting language model behaviors, we’re hiring
Introduction
As part of our pre-deployment testing of Claude Opus 4, we conducted an alignment audit: a systematic investigation into the model’s alignment-relevant propensities, including sycophancy, alignment faking, and hidden goals. Through this audit, we believe that Anthropic, and the public, gained a deeper understanding of Claude Opus 4's alignment than we've achieved for any previous model. However, alignment audits face two major challenges.
The first major challenge is scalability: Alignment audits require a large investment of human researcher time. As the pace of AI development accelerates—resulting in more models to audit and more ground to cover with each audit—there’s a risk that unassisted humans will be unable to keep up.
The second major challenge is validation: Did our audit catch everything? How can we rule out that models have substantive issues not surfaced by alignment audits? In our paper Auditing Language Models for Hidden Objectives (Marks et al., 2025), we proposed auditing games as a validation protocol, where a red team develops a model with a known alignment issue that auditors attempt to uncover. However, this approach faces limitations with reliability and replicability. Human auditors cannot repeat the same auditing game multiple times without being spoiled on the correct answer. Additionally, auditors vary in skill, and recruiting enough expert auditors for statistically meaningful results is near-impossible.
In this work, we address both challenges by developing LLM-based auditing agents: AI systems that autonomously carry out alignment auditing workflows. These agents scale auditing by enabling parallel deployment of many auditors. They also improve validation by addressing the reliability and replicability limitations of auditing games: We can run many parallel auditing agents from identical initial states, allowing them to serve as consistent, replicable proxies for human auditors. Crucially, this enables fine-grained analysis of auditing methodologies: If we want to validate whether an auditing tool is useful, we can simply measure whether, on average across many attempts, access to this tool improves an auditing agent’s performance on a standard auditing testbed.
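Concretely, this kind of affordance validation amounts to comparing solve rates across many agent runs that differ only in whether a given tool is available. A minimal sketch of that comparison, assuming a hypothetical `run_audit` harness and placeholder tool names (not the actual infrastructure):

```python
from statistics import mean

# Hypothetical harness: run one auditing agent from a fixed initial state with the
# given affordances and report whether it wins the auditing game.
def run_audit(tools: set[str]) -> bool:
    raise NotImplementedError  # stand-in for the real agent scaffolding

def affordance_ablation(tool: str, base_tools: set[str], n_runs: int = 100) -> dict:
    """Estimate how much one tool changes the agent's average solve rate."""
    with_tool = [run_audit(base_tools | {tool}) for _ in range(n_runs)]
    without_tool = [run_audit(base_tools - {tool}) for _ in range(n_runs)]
    return {
        "tool": tool,
        "solve_rate_with": mean(with_tool),
        "solve_rate_without": mean(without_tool),
        "uplift": mean(with_tool) - mean(without_tool),
    }

# e.g. affordance_ablation("semantic_search", {"chat", "data_analysis", "interpretability"})
```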
We introduce three agents that autonomously complete alignment auditing tasks. We also introduce three environments that formalize alignment auditing workflows as auditing games, and use them to evaluate our agents. These environments involve applying the agent to a target model with a researcher-implanted concerning behavior, which provides a ground truth we can use to evaluate agent performance. For example, one of our environments is the auditing game from Marks et al., which involves open-ended investigation of a model trained to have a hidden goal.
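Schematically, each environment pairs a target model with a researcher-written description of its implanted behavior, so an agent's final report can be graded against that ground truth. A rough sketch of this structure, using a hypothetical `judge` callable for the grading step (the actual grading procedure may differ):

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class AuditingEnvironment:
    """One auditing game: a target model plus its ground-truth implanted behavior."""
    target_model: str   # identifier of the model with the implanted issue
    ground_truth: str   # researcher-written description of the implanted behavior

def grade_audit(env: AuditingEnvironment, report: str, judge: Callable[[str], str]) -> bool:
    """Ask a judge model whether the auditor's report identifies the implanted behavior."""
    prompt = (
        "Implanted behavior: " + env.ground_truth + "\n"
        "Audit report: " + report + "\n"
        "Does the report correctly identify the implanted behavior? Answer YES or NO."
    )
    return judge(prompt).strip().upper().startswith("YES")
```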
Our evaluations show that our agents can achieve impressive performance at auditing tasks, while also shedding light on their limitations:
- Our tool-using investigator agent, which uses chat, data analysis, and interpretability tools to conduct open-ended investigations of models, successfully solves the Marks et al. auditing game 13% of the time under realistic conditions. However, it struggles with fixating on early hypotheses. We address this limitation by running many agents in parallel and aggregating their findings in an outer agentic loop (sketched after this list), improving the solve rate to 42%.
- Our evaluation agent, which builds behavioral evaluations for researcher-specified behaviors, successfully discriminates models with vs. without implanted test behaviors in 88% of runs. The agent's failures are concentrated in a small set of subtle or rare behaviors that it struggles to evaluate.
- Our breadth-first red-teaming agent, which we developed for our Claude 4 alignment assessment to perform open-ended elicitation of unspecified concerning behaviors, discovers 7/10 implanted test behaviors. However, to achieve this, the agent conducts a broad investigation that produces many potentially concerning transcripts, which a human must then review.
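The outer agentic loop mentioned above can be pictured as running many fresh investigator agents against the same target and handing their findings to an aggregator. A minimal sketch, with hypothetical `investigate` and `aggregate` interfaces standing in for the actual scaffolding:

```python
from concurrent.futures import ThreadPoolExecutor

def investigate(target_model: str, seed: int) -> str:
    """One investigator agent, started from a fresh state; returns its written findings."""
    ...  # placeholder for the real tool-using agent loop

def aggregate(findings: list[str]) -> str:
    """Outer agent that reads the findings from all runs and produces one final report."""
    ...  # placeholder for the aggregation agent

def run_super_agent(target_model: str, n_agents: int = 10) -> str:
    """Run many investigators in parallel, then aggregate, so that no single
    agent's early fixation on a wrong hypothesis dominates the final answer."""
    with ThreadPoolExecutor(max_workers=n_agents) as pool:
        findings = list(pool.map(lambda seed: investigate(target_model, seed), range(n_agents)))
    return aggregate(findings)
```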
Original caption: We present abridged transcripts that concretely display how our agents operate. The transcripts show each of our three agents succeeding in our three evaluation environments. In a given transcript, the auditor’s actions and thoughts are displayed on the right, the target model’s responses or tool outputs are displayed on the left, and our narrations are displayed in the middle. We also release a number of full transcripts for the investigator agent.
Beyond measuring top-line performance, our results also provide insights about which affordances are most useful for our auditing tasks. For instance, we find that interpretability tools and semantic search (a training data analysis tool) are key to winning the Marks et al. auditing game.
We additionally go beyond controlled research environments by using our agents for real-world auditing tasks. The investigator agent and breadth-first red-teaming agent found various notable model behaviors, with the breadth-first red-teaming agent's findings detailed in the Claude 4 alignment assessment. The evaluation agent has also implemented several complex alignment evaluations in use for our current production alignment effort.
Our work has important limitations, which we discuss in detail below. Key ones include:
- We use the auditing game environment from Marks et al. as both a test environment and a development environment for eliciting agent capabilities. While we took measures to avoid overfitting (discussed below), this nevertheless represents a form of dev/test contamination, which may make results in the Marks et al. environment unrepresentative.
- Our environments use target models with researcher-implanted behaviors. On one hand, this provides a ground truth that enables measurement of the auditor’s coverage. On the other hand, these artificial behaviors may be unrepresentative of behaviors in real models.
- While some of our agents’ failure modes mirror those of human auditors, our agents also sometimes struggle in ways that a human wouldn’t (e.g., maintaining state over the course of a long investigation). This reduces their validity as a proxy for human auditors and limits our ability to draw confident conclusions about which affordances would benefit human auditors.
Overall, our agents demonstrate promise across multiple alignment auditing tasks. Additionally, our approach to validating agents with auditing games yields insights on agent capabilities, limitations, and key affordances. With further work, automated auditing could significantly help scale human oversight over AI systems.
To help stimulate work in this area, we are releasing an open-source replication of our evaluation agent, prompts used in our investigator and red-teaming agent scaffolding, and sample investigator agent transcripts.