Published on July 24, 2025 7:22 PM GMT
TL;DR: We develop three agents that autonomously perform alignment auditing tasks. When tested against models with intentionally-inserted alignment issues, our agents successfully uncover an LLM's hidden goal, build behavioral evaluations, and surface concerning LLM behaviors. We are using these agents to assist with alignment audits of frontier models like Claude 4.
Read the full Anthropic Alignment Science blog post. The X thread and blog post introduction are reproduced below.
Thread
New Anthropic research: Building and evaluating alignment auditing agents.
We developed three AI agents to autonomously complete alignment auditing tasks.
In testing, our agents successfully uncovered hidden goals, built safety evaluations, and surfaced concerning behaviors.
--
As AI systems become more powerful, we need scalable ways to assess their alignment. Human alignment audits take time and are hard to validate.
Our solution: automating alignment auditing with AI agents.
Read more: https://alignment.anthropic.com/2025/automated-auditing/
--
Our first agent is an investigator. It uses chat, data analysis, and interpretability tools to do open-ended deep-dives.
In this example, the agent uses interpretability to form a hypothesis: The target model over-recommends bottled water. Then it goes to check and… "BINGO!"
--
We test our investigator agent with an auditing challenge we previously gave to human research teams. Winning requires uncovering a hidden goal in a model trained to keep it secret.
The result: our agent wins 42% of the time.
--
Our second agent builds behavioral evaluations: tests of how often a target model exhibits a specific behavior (like sycophancy).
Our agent designs, codes, runs, and analyzes evals.
They consistently work: 88% of our agent’s evals measure what they’re supposed to.
--
Our third agent was developed for the Claude 4 alignment assessment. It red-teams LLMs for concerning behaviors by having hundreds of probing conversations in parallel.
We find the agent uncovers 7/10 behaviors implanted into test models.
--
Our agents are useful for frontier model auditing:
1. Our red-teaming agent surfaced behaviors described in the Claude 4 system card, like the “spiritual bliss” attractor state.
2. Our evaluation agent is helping us build better evals for future models.
This project was an Anthropic Alignment Science × Interpretability collaboration.
To support further research, we're releasing an open-source replication of our evaluation agent and materials for our other agents:
--
If you’re interested in building autonomous agents to help us find and understand interesting language model behaviors, we’re hiring
Introduction
As part of our pre-deployment testing of Claude Opus 4, we conducted an alignment audit: a systematic investigation into the model’s alignment-relevant propensities, including sycophancy, alignment faking, and hidden goals. Through this audit, we believe that Anthropic, and the public, gained a deeper understanding of Claude Opus 4's alignment than we've achieved for any previous model. However, alignment audits face two major challenges.
The first major challenge is scalability: Alignment audits require a large investment of human researcher time. As the pace of AI development accelerates—resulting in more models to audit and more ground to cover with each audit—there’s a risk that unassisted humans will be unable to keep up.
The second major challenge is validation: Did our audit catch everything? How can we rule out that models have substantive issues not surfaced by alignment audits? In our paper Auditing Language Models for Hidden Objectives (Marks et al., 2025), we proposed auditing games as a validation protocol, where a red team develops a model with a known alignment issue that auditors attempt to uncover. However, this approach faces limitations with reliability and replicability. Human auditors cannot repeat the same auditing game multiple times without being spoiled on the correct answer. Additionally, auditors vary in skill, and recruiting enough expert auditors for statistically meaningful results is near-impossible.
In this work, we address both challenges by developing LLM-based auditing agents: AI systems that autonomously carry out alignment auditing workflows. These agents scale auditing by enabling parallel deployment of many auditors. They also improve validation by addressing the reliability and replicability limitations of auditing games: We can run many parallel auditing agents from identical initial states, allowing them to serve as consistent, replicable proxies for human auditors. Crucially, this enables fine-grained analysis of auditing methodologies: If we want to validate whether an auditing tool is useful, we can simply measure whether, on average across many attempts, access to this tool improves an auditing agent’s performance on a standard auditing testbed.
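Concretely, this kind of affordance validation amounts to comparing solve rates across many agent runs that differ only in whether a given tool is available. A minimal sketch of that comparison, assuming a hypothetical `run_audit` harness and placeholder tool names (not the actual infrastructure):

```python
from statistics import mean

# Hypothetical harness: run one auditing agent from a fixed initial state with the
# given affordances and report whether it wins the auditing game.
def run_audit(tools: set[str]) -> bool:
    raise NotImplementedError  # stand-in for the real agent scaffolding

def affordance_ablation(tool: str, base_tools: set[str], n_runs: int = 100) -> dict:
    """Estimate how much one tool changes the agent's average solve rate."""
    with_tool = [run_audit(base_tools | {tool}) for _ in range(n_runs)]
    without_tool = [run_audit(base_tools - {tool}) for _ in range(n_runs)]
    return {
        "tool": tool,
        "solve_rate_with": mean(with_tool),
        "solve_rate_without": mean(without_tool),
        "uplift": mean(with_tool) - mean(without_tool),
    }

# e.g. affordance_ablation("semantic_search", {"chat", "data_analysis", "interpretability"})
```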
We introduce three agents that autonomously complete alignment auditing tasks. We also introduce three environments that formalize alignment auditing workflows as auditing games, and use them to evaluate our agents. These environments involve applying the agent to a target model with a researcher-implanted concerning behavior, which provides a ground truth we can use to evaluate agent performance. For example, one of our environments is the auditing game from Marks et al., which involves open-ended investigation of a model trained to have a hidden goal.
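Schematically, each environment pairs a target model with a researcher-written description of its implanted behavior, so an agent's final report can be graded against that ground truth. A rough sketch of this structure, using a hypothetical `judge` callable for the grading step (the actual grading procedure may differ):

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class AuditingEnvironment:
    """One auditing game: a target model plus its ground-truth implanted behavior."""
    target_model: str   # identifier of the model with the implanted issue
    ground_truth: str   # researcher-written description of the implanted behavior

def grade_audit(env: AuditingEnvironment, report: str, judge: Callable[[str], str]) -> bool:
    """Ask a judge model whether the auditor's report identifies the implanted behavior."""
    prompt = (
        "Implanted behavior: " + env.ground_truth + "\n"
        "Audit report: " + report + "\n"
        "Does the report correctly identify the implanted behavior? Answer YES or NO."
    )
    return judge(prompt).strip().upper().startswith("YES")
```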
Our evaluations show that our agents can achieve impressive performance at auditing tasks, while also shedding light on their limitations:
- Our tool-using investigator agent, which uses chat, data analysis, and interpretability tools to conduct open-ended investigations of models, successfully solves the Marks et al. auditing game 13% of the time under realistic conditions. However, it struggles with fixating on early hypotheses. We address this limitation by running many agents in parallel and aggregating their findings in an outer agentic loop (sketched after this list), improving the solve rate to 42%.
- Our evaluation agent, which builds behavioral evaluations for researcher-specified behaviors, successfully discriminates models with vs. without implanted test behaviors in 88% of runs. The agent's failures are concentrated in a small set of subtle or rare behaviors that it struggles to evaluate.
- Our breadth-first red-teaming agent, which we developed for our Claude 4 alignment assessment to perform open-ended elicitation of unspecified concerning behaviors, discovers 7/10 implanted test behaviors. However, to achieve this, the agent conducts a broad investigation that produces many potentially concerning transcripts, which a human must then review.
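The outer agentic loop mentioned above can be pictured as running many fresh investigator agents against the same target and handing their findings to an aggregator. A minimal sketch, with hypothetical `investigate` and `aggregate` interfaces standing in for the actual scaffolding:

```python
from concurrent.futures import ThreadPoolExecutor

def investigate(target_model: str, seed: int) -> str:
    """One investigator agent, started from a fresh state; returns its written findings."""
    ...  # placeholder for the real tool-using agent loop

def aggregate(findings: list[str]) -> str:
    """Outer agent that reads the findings from all runs and produces one final report."""
    ...  # placeholder for the aggregation agent

def run_super_agent(target_model: str, n_agents: int = 10) -> str:
    """Run many investigators in parallel, then aggregate, so that no single
    agent's early fixation on a wrong hypothesis dominates the final answer."""
    with ThreadPoolExecutor(max_workers=n_agents) as pool:
        findings = list(pool.map(lambda seed: investigate(target_model, seed), range(n_agents)))
    return aggregate(findings)
```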
Original caption: We present abridged transcripts that concretely display how our agents operate. The transcripts show each of our three agents succeeding in our three evaluation environments. In a given transcript, the auditor’s actions and thoughts are displayed on the right, the target model’s responses or tool outputs are displayed on the left, and our narrations are displayed in the middle. We also release a number of full transcripts for the investigator agent.
Beyond measuring top-line performance, our results also provide insights about which affordances are most useful for our auditing tasks. For instance, we find that interpretability tools and semantic search (a training data analysis tool) are key to winning the Marks et al. auditing game.
We additionally go beyond controlled research environments by using our agents for real-world auditing tasks. The investigator agent and breadth-first red-teaming agent found various notable model behaviors, with the breadth-first red-teaming agent's findings detailed in the Claude 4 alignment assessment. The evaluation agent has also implemented several complex alignment evaluations in use for our current production alignment effort.
Our work has important limitations, which we discuss in detail below. Key ones include:
- We use the auditing game environment from Marks et al. as both a test environment and a development environment for eliciting agent capabilities. While we took measures to avoid overfitting (discussed below), this nevertheless represents a form of dev/test contamination, which may make results in the Marks et al. environment unrepresentative.
- Our environments use target models with researcher-implanted behaviors. On one hand, this provides a ground truth that enables measurement of the auditor’s coverage. On the other hand, these artificial behaviors may be unrepresentative of behaviors in real models.
- While some of our agents’ failure modes mirror those of human auditors, our agents also sometimes struggle in ways that a human wouldn’t (e.g., maintaining state over the course of a long investigation). This reduces their validity as a proxy for human auditors and limits our ability to draw confident conclusions about which affordances would benefit human auditors.
Overall, our agents demonstrate promise across multiple alignment auditing tasks. Additionally, our approach to validating agents with auditing games yields insights on agent capabilities, limitations, and key affordances. With further work, automated auditing could significantly help scale human oversight over AI systems.
To help stimulate work in this area, we are releasing an open-source replication of our evaluation agent, prompts used in our investigator and red-teaming agent scaffolding, and sample investigator agent transcripts.