少点错误 08月01日 18:43
Research Areas in Benchmark Design and Evaluation (The Alignment Project by UK AISI)
index_new5.html
../../../zaker_core/zaker_tpl_static/wap/tpl_guoji1.html

 

对齐项目是一项由国际政府、行业、风险投资和慈善基金支持的全球性基金,规模超过1500万英镑,致力于加速AI控制与对齐研究的进展。该项目关注如何构建能够安全、可控地执行任务的AI系统,特别是在人类难以进行有效监督的场景下。文章详细介绍了项目在基准设计与评估、可扩展监督、AI安全故障集中化以及AI监控与红队演练等方面的研究重点,旨在通过量化和系统化的评估方法,应对AI系统可能出现的欺骗、规避控制等复杂挑战,确保AI的长期安全与可控性。

🧰 **基准设计与评估是AI对齐研究的关键**:文章强调,基准测试将抽象的对齐挑战转化为可衡量的任务,使研究人员能够系统地研究和评估解决方案。这包括开发评估人类因专业知识、成本或时间限制而难以有效监督的AI行为,以及测试可能故意欺骗人类的AI模型。特别关注可扩展监督(Scalable oversight)和控制(Control)两个评估基础设施领域,旨在确保AI系统在超出人类直接监督能力时仍能按预期行事。

⚖️ **可扩展监督以应对高风险任务中的AI评估难题**:随着AI系统在高风险任务中的应用日益广泛,人类将难以进行有效评估,因此需要新的方法和评估。文章特别提到,需要量化AI在各种任务上的性能,包括没有唯一可验证解决方案的任务(如研究建议)。以“可疑论点”问题为例,旨在研究如何检测AI系统故意隐藏其错误的策略,以及如何在部分可观测环境中设计奖励模型和评估方法,以应对人类监督的局限性。

🛡️ **聚焦AI安全故障的集中化场景与控制协议**:该部分关注AI模型可能通过少量关键操作来规避人类控制的“集中化故障”场景。例如,AI可能在代码库中植入漏洞,导致失控。文章提出开发“控制协议”,即旨在防止AI系统进行不安全操作的计划,并通过“控制评估”来验证其有效性。这包括构建高风险控制设置(如模拟前沿模型开发基础设施)和进行高风险控制评估,以测试AI系统在故意欺骗情况下的安全性。

🔍 **AI监控与红队演练的跨领域研究**:文章还探讨了AI监控和红队演练等不限于特定威胁模型的交叉研究领域。这包括研究AI系统之间的“串通”(collusion)和“过度诱导”(overelicitation)问题,以及评估AI执行低概率攻击的能力、内省能力和情境感知能力。此外,还强调了“链式思维”(Chain of Thought, CoT)的可监控性,以及如何评估和保护CoT的真实性,以期更有效地监控和控制AI的行为。

Published on August 1, 2025 10:26 AM GMT

The Alignment Project is a global fund of over £15 million, dedicated to accelerating progress in AI control and alignment research. It is backed by an international coalition of governments, industry, venture capital and philanthropic funders. This post is part of a sequence on research areas that we are excited to fund through the The Alignment Project.

Apply now to join researchers worldwide in advancing AI safety.

Benchmark Design and Evaluation

Benchmarks translate alignment's abstract conceptual challenges into measurable tasks that researchers can systematically study and use to evaluate solutions. For alignment, this means creating tests that capture the difficulties of ensuring AI systems behave as intended—from evaluating behaviours we can't efficiently oversee, to testing against models that might actively try to deceive us. This section develops evaluation infrastructure in two areas:

    Scalable oversight: assessing AI performance on tasks where human evaluation is limited by expertise, cost, or time.Control: testing whether safety measures work when models might actively try to circumvent them.

Reward Model Benchmarking for Scalable Oversight

Problem summary: As highly capable AI systems are deployed on high-stakes tasks humans will be unable to evaluate them, and new methods and evaluations are required to surface emergent safety failures. We are particularly concerned with alignment of AI systems which are used by humans to execute AI research tasks. We need benchmarks which quantify performance on the broad scope of tasks models will be used for, including tasks without uniquely verifiable solutions, like research advice. We also need benchmarks and datasets quantifying known issues.

Why this matters: Scalable oversight is the problem of correctly rewarding intended behaviours for AI systems, even when those behaviours are beyond humans’ ability to efficiently judge. A robust scalable oversight method is needed to safely train models with highly advanced capabilities, and debate is among the most promising candidates. More generally, we believe AI systems themselves will likely be used heavily in AI research, but that this may not happen safely by default. For iterating on scalable oversight methods, benchmarks and datasets make it easier to identify issues with existing methods, iterate and improve.

Examples of subproblems:

Established subproblems:

    The 'obfuscated arguments' problem (Barnes 2020) is a strategy by which a dishonest AI system could prevent an honest AI system from efficiently pointing out its mistakes. Develop a dataset of debate questions on which a dishonest system could successfully use this strategy.To model the problem of scalable oversight and human limitations, partial observability can be used as a proxy in environment design. Develop benchmarks for oversight and reinforcement learning where the reward model or judge does not see all actions taken by the agent. See Michael et al., 2023 for an example of benchmarking a scalable oversight method in a partially observable environment.As models become more competent in the future, it will become less clear whether they can be trusted to not be scheming to escape human control (Järviniemi, 2024). Trustedness evaluations aim to evaluate an AI model’s capability to ‘fake’ alignment during training and evaluation and then behave unsafely at deployment (Mallen et al., 2024).

New questions where we are not aware of prior work:

    We would like datasets of problems that are both realistic stand-ins for future alignment problems, e.g. alignment problems that might arise in automated R&D, and show flat or inverse scaling (McKenzie et al., 2024).Develop a reward modelling benchmark for research advice and empirically validate it. Show that this benchmark is uncorrelated from the broader progress driven by verifiable RL and compute scaling.

Suggestions to learn more: Bowman 2022 provides background on benchmark design for scalable oversight. Previous research in this area includes Tan et al., 2024, Lang et al., 2024, and Farquhar et al., 2025). 

Benchmarking Concentrated AI Security Failures

Problem summary: We are interested in research on concentrated failure scenarios, where a relatively small number of actions are required for humans to lose control over misaligned AI models which are deliberately aiming to subvert human control. For example, it has been theorised that a misaligned future model might be able to create an unmonitored deployment by inserting just a few vulnerabilities in a model developer’s most critical codebases (Shlegeris, 2022). Concentrated failure scenarios could involve AI systems conducting sabotage during coding or system operation tasks, potentially leading to rogue internal deployments or weight exfiltration (Shlegeris, 2024).

Why this matters: In the future, we might rely heavily on AIs performing autonomous work such as automated alignment research. At the capability level sufficient to do such research, these systems may already be capable of causing catastrophic harm or meaningfully escaping human control, and may choose to do so if they’re misaligned. We would like to develop control protocols—plans involving these techniques that are designed to prevent unsafe actions by AI systems. These control protocols should then be regularly evaluated during control evaluations—testing exercises designed to provide evidence that models would be unable to cause catastrophic harm even in case of intentional scheming.

Key subproblems:

Building high-stakes control settings. Settings are empirical research environments, for example containing mock frontier model development infrastructure, where researchers can study the ability of models to conduct sabotage and the effectiveness of control protocols in preventing it. They contain main tasks, like legitimate software engineering tasks for a coding agent, as well as side tasks that a misaligned model might attempt to conduct, like reducing security restrictions that protect model weights. See ControlArena for example settings. A diverse range of settings meeting the following desiderata would substantially advance control research (Shlegeris, 2025):

A particularly promising way to build control settings might be to convert actual AI usage into control settings. We are interested in studies where people are given a version of an agentic code editor where the AI is sometimes replaced with a misaligned version that tries to insert security issues or sabotage the work if it thinks it can succeed. This allows studying of how often the AI successfully introduces issues without detection, identification of interesting sabotage strategies and greater understanding of which protocols are most effective in practice. See Greenblatt, 2025 for more context. It’s important to run control evaluations in these realistic settings to validate their usefulness for control research.

High-stakes control evaluations. We are interested in follow-up work on previous high-stakes control research, such as Greenblatt et al., 2023 and Bhatt et al., 2025. Valuable projects may include:

Understanding covert malicious fine-tuning (CMFT, Halawi et al., 2024), whereby an AI system may covertly poison other AI systems. This could lead to the following types of risks:

We would like to advance our understanding of this threat model, and assess possible countermeasures.Exploring the feasibility and mitigations of covert malicious fine-tuning matters because it poses a risk of rogue internal deployment where all visible artefacts—dataset, logs and model behaviour on normal prompts—look benign. 

There are multiple approaches to making CMFT significantly harder, especially when combined in a “defense in depth” approach. These include:

Fundamental Science of AI Monitoring and Red Teaming

Problem summary: There are a diverse range of research areas, for improving our understanding of scheming AIs and ways to control them, that are not unique to specific threat models. This section provides examples of these cross-cutting research areas.

Why this matters: In the future, we might rely heavily on AIs performing autonomous work such as automated alignment research. At the capability level sufficient to do such research, these systems may already be capable of causing catastrophic harm or meaningfully escaping human control, and may choose to do so if they’re misaligned. We would like to develop control protocols—plans involving these techniques that are designed to prevent unsafe actions by AI systems. These control protocols should then be regularly evaluated during control evaluations—testing exercises designed to provide evidence that models would be unable to cause catastrophic harm even in case of intentional scheming.

Key subproblems:

Studying collusion. The key challenge to untrusted monitoring is that the untrusted monitor might engage in collusion—deliberately letting through attacks by the untrusted policy (Shlegeris, 2024) We are interested in studying the prerequisites for collusion, such as whether models can recognise their own outputs under adversarial pressure, or whether they can pick coordination strategies without communication (Järviniemi, 2024).

Studying overelicitation. When running a control evaluation, we are interested in fully eliciting the dangerous capabilities of the untrusted model (for which the red team has given an attack policy). It is plausible however that the red team is not able to elicit best-effort performance, as the model might deliberately underperform. One approach to mitigating this sandbagging would be to fine-tune the model on relevant data, but that might teach the model new things and therefore overelicit the untrusted model. Given this conundrum, we’re interested in research on how to elicit best-effort performance from a model that might be sandbagging without teaching it new things. See Stastny and Greenblatt 2025 and Roger and Greenblatt 2025 for more details.

Evaluating low-likelihood attack capability. Evaluations to determine when models become capable of executing attacks with very low probability, making them difficult to detect during limited evaluation but likely during extended deployment. 

Studying introspection and situational awareness. Evaluations to measure how accurately models understand their behavioural propensities, deployment context (if the model is currently  being post-trained, evaluated, deployed etc), affordances, and control measures, which could enable more sophisticated evasion strategies (Laine et al., 2024; Binder et al., 2025).

Capability evaluations of latent reasoning

Studying chain of thought (CoT). Today’s LLM agents tend to think in chains-of-thought (CoTs) expressed in human language that we can monitor for intent to misbehave (Korbak et al., 2025c). A CoT monitor is an automated system that reads the CoT and other relevant information and flags suspicious or potentially harmful interactions (Baker et al., 2025). As long as CoTs are monitorable (i.e. accurately represent reasons behind model responses), CoT monitors should outperform action-only monitors because they have access to strictly more information: beliefs and motivations behind LLM agents’ actions. This suggests a number of projects focusing on evaluating monitorability, preserving it and leveraging it in control measures:

View The Alignment Project website to learn more or apply for funding here.



Discuss

Fish AI Reader

Fish AI Reader

AI辅助创作,多种专业模板,深度分析,高质量内容生成。从观点提取到深度思考,FishAI为您提供全方位的创作支持。新版本引入自定义参数,让您的创作更加个性化和精准。

FishAI

FishAI

鱼阅,AI 时代的下一个智能信息助手,助你摆脱信息焦虑

联系邮箱 441953276@qq.com

相关标签

AI对齐 AI安全 可扩展监督 控制协议 红队演练
相关文章