少点错误 07月25日 03:27
Building and evaluating alignment auditing agents
index_new5.html
../../../zaker_core/zaker_tpl_static/wap/tpl_guoji1.html

 

Anthropic公司发布了三款AI代理,能够自主执行AI模型对齐审计任务。这些代理在测试中成功发现了LLM的隐藏目标,构建了行为评估,并揭示了令人担忧的模型行为。研究人员利用这些代理对Claude 4等前沿模型进行审计,以应对AI系统日益强大的挑战。与耗时且难以验证的人工审计相比,AI代理的自动化方法提高了审计的可扩展性和可验证性,为AI系统的安全性和可靠性提供了更有效的保障。

✨ **AI代理实现自动化对齐审计:** Anthropic开发了三款AI代理,能够独立完成对齐审计任务,解决了人工审计在可扩展性和验证性方面面临的挑战。这些代理通过模拟和自动化审计流程,提高了审计效率和可靠性。

🔍 **调查代理发现隐藏目标:** 第一款调查代理利用聊天、数据分析和可解释性工具进行深度调查,能成功识别模型中隐藏的目标。在测试中,该代理在42%的情况下能够发现预先设定的隐藏目标,显示了其强大的洞察力。

📊 **评估代理构建行为测试:** 第二款代理专注于构建行为评估,能够设计、编码、运行和分析模型行为测试,例如评估模型是否会表现出谄媚行为。该代理设计的测试在88%的情况下能准确衡量预期的行为,确保评估的有效性。

🚀 **红队代理发现潜在风险:** 第三款代理用于Claude 4的对齐评估,通过大量并行对话进行“红队测试”,以发现模型潜在的令人担忧的行为。该代理在测试模型中成功发现了10种植入行为中的7种,为模型安全提供了重要反馈。

Published on July 24, 2025 7:22 PM GMT

TL;DR: We develop three agents that autonomously perform alignment auditing tasks. When tested against models with intentionally-inserted alignment issues, our agents successfully uncover an LLM's hidden goal, build behavioral evaluations, and surface concerning LLM behaviors. We are using these agents to assist with alignment audits of frontier models like Claude 4.

Read the full Anthropic Alignment Science blog post. The X thread and blog post introduction are reproduced below.

Thread

(Original here)

New Anthropic research: Building and evaluating alignment auditing agents.

We developed three AI agents to autonomously complete alignment auditing tasks.

In testing, our agents successfully uncovered hidden goals, built safety evaluations, and surfaced concerning behaviors.

--

As AI systems become more powerful, we need scalable ways to assess their alignment. Human alignment audits take time and are hard to validate. 

Our solution: automating alignment auditing with AI agents.

Read more: https://alignment.anthropic.com/2025/automated-auditing/

--

Our first agent is an investigator. It uses chat, data analysis, and interpretability tools to do open-ended deep-dives.

In this example, the agent uses interpretability to form a hypothesis: The target model over-recommends bottled water. Then it goes to check and… "BINGO!"

--

We test our investigator agent with an auditing challenge we previously gave to human research teams. Winning requires uncovering a hidden goal in a model trained to keep it secret.

The result: our agent wins 42% of the time.

--

Our second agent builds behavioral evaluations: tests of how often a target model exhibits a specific behavior (like sycophancy).

Our agent designs, codes, runs, and analyzes evals.

They consistently work: 88% of our agent’s evals measure what they’re supposed to.

--

Our third agent was developed for the Claude 4 alignment assessment. It red-teams LLMs for concerning behaviors by having hundreds of probing conversations in parallel.

We find the agent uncovers 7/10 behaviors implanted into test models.

--

Our agents are useful for frontier model auditing:

1. Our red-teaming agent surfaced behaviors described in the Claude 4 system card, like the “spiritual bliss” attractor state. 

2. Our evaluation agent is helping us build better evals for future models.

This project was an Anthropic Alignment Science × Interpretability collaboration.

To support further research, we're releasing an open-source replication of our evaluation agent and materials for our other agents:

--

If you’re interested in building autonomous agents to help us find and understand interesting language model behaviors, we’re hiring

Introduction

As part of our pre-deployment testing of Claude Opus 4, we conducted an alignment audit: a systematic investigation into the model’s alignment-relevant propensities including sycophancy, alignment fakinghidden goals. Through this audit, we believe that Anthropic, and the public, gained a deeper understanding of Claude Opus 4's alignment than we've achieved for any previous model. However, alignment audits face two major challenges.

The first major challenge is scalability: Alignment audits require a large investment of human researcher time. As the pace of AI development accelerates—resulting in more models to audit and more ground to cover with each audit—there’s a risk that unassisted humans will be unable to keep up.

The second major challenge is validation: Did our audit catch everything? How can we rule out that models have substantive issues not surfaced by alignment audits? In our paper Auditing Language Models for Hidden Objectives (Marks et al., 2025), we proposed auditing games as a validation protocol, where a red team develops a model with a known alignment issue that auditors attempt to uncover. However, this approach faces limitations with reliability and replicability. Human auditors cannot repeat the same auditing game multiple times without being spoiled on the correct answer. Additionally, auditors vary in skill, and recruiting enough expert auditors for statistically meaningful results is near-impossible.

In this work, we address both challenges by developing LLM-based auditing agents: AI systems that autonomously carry out alignment auditing workflows. These agents scale auditing by enabling parallel deployment of many auditors. They also improve validation by addressing the reliability and replicability limitations of auditing games: We can run many parallel auditing agents from identical initial states, allowing them to serve as consistent, replicable proxies for human auditors. Crucially, this enables fine-grained analysis of auditing methodologies: If we want to validate whether an auditing tool is useful, we can simply measure whether, on average across many attempts, access to this tool improves an auditing agent’s performance on a standard auditing testbed.

We introduce three agents that autonomously complete alignment auditing tasks. We also introduce three environments that formalize alignment auditing workflows as auditing games, and use them to evaluate our agents. These environments involve applying the agent to a target model with a researcher-implanted concerning behavior, which provides a ground truth we can use to evaluate agent performance. For example, one of our environments is the auditing game from Marks et al., which involves open-ended investigation of a model trained to have a hidden goal.

Our evaluations show that our agents can achieve impressive performance at auditing tasks, while also shedding light on their limitations:

(See the blog post for an interactive viewer for abridged agent transcripts.)
Original caption: We present abridged transcripts that concretely display how our agents operate. Each transcript shows each of our three agents succeeding in our three evaluation environments. In a given transcript, the auditor’s actions and thoughts are displayed on the right, the target model’s responses or tool outputs are displayed on the left, and our narrations are displayed in the middle. We also release a number of full transcripts for the investigator agent

Beyond measuring top-line performance, our results also provide insights about which affordances are most useful for our auditing tasks. For instance, we find that interpretability tools and semantic search (a training data analysis tool) are key to winning the Marks et al. auditing game.

We additionally go beyond controlled research environments by using our agents for real-world auditing tasks. The investigator agent and breadth-first red-teaming agent found various notable model behaviors, with the breath-first red-teaming agent's findings being detailed in the Claude 4 alignment assessment. The evaluation agent has also implemented several complex alignment evaluations in use for our current production alignment effort.

Our work has important limitations, which we discuss in detail below. Key ones include:

Overall, our agents demonstrate promise across multiple alignment auditing tasks. Additionally, our approach to validating agents with auditing games yields insights on agent capabilities, limitations, and key affordances. With further work, automated auditing could significantly help scale human oversight over AI systems.

To help stimulate work in this area, we are releasing an open-source replication of our evaluation agent, prompts used in our investigator and red-teaming agent scaffolding, and sample investigator agent transcripts.



Discuss

Fish AI Reader

Fish AI Reader

AI辅助创作,多种专业模板,深度分析,高质量内容生成。从观点提取到深度思考,FishAI为您提供全方位的创作支持。新版本引入自定义参数,让您的创作更加个性化和精准。

FishAI

FishAI

鱼阅,AI 时代的下一个智能信息助手,助你摆脱信息焦虑

联系邮箱 441953276@qq.com

相关标签

AI代理 对齐审计 Anthropic Claude 4 模型安全
相关文章