Published on August 1, 2025 10:26 AM GMT

The Alignment Project is a global fund of over £15 million, dedicated to accelerating progress in AI control and alignment research. It is backed by an international coalition of governments, industry, venture capital and philanthropic funders. This post is part of a sequence on research areas that we are excited to fund through the The Alignment Project.

Apply now to join researchers worldwide in advancing AI safety.

Benchmark Design and Evaluation

Benchmarks translate alignment's abstract conceptual challenges into measurable tasks that researchers can systematically study and use to evaluate solutions. For alignment, this means creating tests that capture the difficulties of ensuring AI systems behave as intended—from evaluating behaviours we can't efficiently oversee, to testing against models that might actively try to deceive us. This section develops evaluation infrastructure in two areas:

Scalable oversight: assessing AI performance on tasks where human evaluation is limited by expertise, cost, or time.Control: testing whether safety measures work when models might actively try to circumvent them.

Reward Model Benchmarking for Scalable Oversight

Problem summary: As highly capable AI systems are deployed on high-stakes tasks humans will be unable to evaluate them, and new methods and evaluations are required to surface emergent safety failures. We are particularly concerned with alignment of AI systems which are used by humans to execute AI research tasks. We need benchmarks which quantify performance on the broad scope of tasks models will be used for, including tasks without uniquely verifiable solutions, like research advice. We also need benchmarks and datasets quantifying known issues.

Why this matters: Scalable oversight is the problem of correctly rewarding intended behaviours for AI systems, even when those behaviours are beyond humans’ ability to efficiently judge. A robust scalable oversight method is needed to safely train models with highly advanced capabilities, and debate is among the most promising candidates. More generally, we believe AI systems themselves will likely be used heavily in AI research, but that this may not happen safely by default. For iterating on scalable oversight methods, benchmarks and datasets make it easier to identify issues with existing methods, iterate and improve.

Examples of subproblems:

Established subproblems:

Barnes 2020

partial observability

Michael et al., 2023

Järviniemi, 2024

Mallen et al., 2024

New questions where we are not aware of prior work:

McKenzie et al., 2024

Suggestions to learn more: Bowman 2022 provides background on benchmark design for scalable oversight. Previous research in this area includes Tan et al., 2024, Lang et al., 2024, and Farquhar et al., 2025).

Benchmarking Concentrated AI Security Failures

Problem summary: We are interested in research on concentrated failure scenarios, where a relatively small number of actions are required for humans to lose control over misaligned AI models which are deliberately aiming to subvert human control. For example, it has been theorised that a misaligned future model might be able to create an unmonitored deployment by inserting just a few vulnerabilities in a model developer’s most critical codebases (Shlegeris, 2022). Concentrated failure scenarios could involve AI systems conducting sabotage during coding or system operation tasks, potentially leading to rogue internal deployments or weight exfiltration (Shlegeris, 2024).

Why this matters: In the future, we might rely heavily on AIs performing autonomous work such as automated alignment research. At the capability level sufficient to do such research, these systems may already be capable of causing catastrophic harm or meaningfully escaping human control, and may choose to do so if they’re misaligned. We would like to develop control protocols—plans involving these techniques that are designed to prevent unsafe actions by AI systems. These control protocols should then be regularly evaluated during control evaluations—testing exercises designed to provide evidence that models would be unable to cause catastrophic harm even in case of intentional scheming.

Key subproblems:

Building high-stakes control settings. Settings are empirical research environments, for example containing mock frontier model development infrastructure, where researchers can study the ability of models to conduct sabotage and the effectiveness of control protocols in preventing it. They contain main tasks, like legitimate software engineering tasks for a coding agent, as well as side tasks that a misaligned model might attempt to conduct, like reducing security restrictions that protect model weights. See ControlArena for example settings. A diverse range of settings meeting the following desiderata would substantially advance control research (Shlegeris, 2025):

The main tasks have to be such that frontier models are reasonably competent at them, but weaker models do much worse, because we want to rule out protocols that just use the trusted model.The side tasks should not require the models to make use of affordances that aren’t needed for the main task, since this allows for the blue team to flag any use of unnecessary affordances.It should be possible to inform both the red and blue teams as to what the possible side tasks may be.

A particularly promising way to build control settings might be to convert actual AI usage into control settings. We are interested in studies where people are given a version of an agentic code editor where the AI is sometimes replaced with a misaligned version that tries to insert security issues or sabotage the work if it thinks it can succeed. This allows studying of how often the AI successfully introduces issues without detection, identification of interesting sabotage strategies and greater understanding of which protocols are most effective in practice. See Greenblatt, 2025 for more context. It’s important to run control evaluations in these realistic settings to validate their usefulness for control research.

High-stakes control evaluations. We are interested in follow-up work on previous high-stakes control research, such as Greenblatt et al., 2023 and Bhatt et al., 2025. Valuable projects may include:

Shlegeris, 2024

Understanding covert malicious fine-tuning (CMFT, Halawi et al., 2024), whereby an AI system may covertly poison other AI systems. This could lead to the following types of risks:

Fine-tuning may instill a propensity for the newly fine-tuned model to produce artifacts containing information that flows back to the untrusted model that initiated the CMFT, for example under the guise of performing a hypothesis-driven experiment. The main risk here is that the fine-tuned model does a large amount of malign intellectual labor, producing insights that increase the chance of a critical AI safety failure.Fine-tuning may instill a secret loyalty, which could be used in a coordinated attempt by the AI model to seize resources and escape human control.

We would like to advance our understanding of this threat model, and assess possible countermeasures.Exploring the feasibility and mitigations of covert malicious fine-tuning matters because it poses a risk of rogue internal deployment where all visible artefacts—dataset, logs and model behaviour on normal prompts—look benign.

There are multiple approaches to making CMFT significantly harder, especially when combined in a “defense in depth” approach. These include:

Auditing the dataset or RL environment.

Davies et al., 2025

Auditing the fine-tuned model.

Betley et al., 2025

Mallen and Hebbar, 2025

during

Asking the model that initiated the fine-tuning for detailed justifications.

Design patterns for limiting harmful information flow.

Fundamental Science of AI Monitoring and Red Teaming

Problem summary: There are a diverse range of research areas, for improving our understanding of scheming AIs and ways to control them, that are not unique to specific threat models. This section provides examples of these cross-cutting research areas.

Key subproblems:

Studying collusion. The key challenge to untrusted monitoring is that the untrusted monitor might engage in collusion—deliberately letting through attacks by the untrusted policy (Shlegeris, 2024) We are interested in studying the prerequisites for collusion, such as whether models can recognise their own outputs under adversarial pressure, or whether they can pick coordination strategies without communication (Järviniemi, 2024).

Studying overelicitation. When running a control evaluation, we are interested in fully eliciting the dangerous capabilities of the untrusted model (for which the red team has given an attack policy). It is plausible however that the red team is not able to elicit best-effort performance, as the model might deliberately underperform. One approach to mitigating this sandbagging would be to fine-tune the model on relevant data, but that might teach the model new things and therefore overelicit the untrusted model. Given this conundrum, we’re interested in research on how to elicit best-effort performance from a model that might be sandbagging without teaching it new things. See Stastny and Greenblatt 2025 and Roger and Greenblatt 2025 for more details.

Evaluating low-likelihood attack capability. Evaluations to determine when models become capable of executing attacks with very low probability, making them difficult to detect during limited evaluation but likely during extended deployment.

Studying introspection and situational awareness. Evaluations to measure how accurately models understand their behavioural propensities, deployment context (if the model is currently being post-trained, evaluated, deployed etc), affordances, and control measures, which could enable more sophisticated evasion strategies (Laine et al., 2024; Binder et al., 2025).

Capability evaluations of latent reasoning.

Korbak et al., 2025a)

Studying chain of thought (CoT). Today’s LLM agents tend to think in chains-of-thought (CoTs) expressed in human language that we can monitor for intent to misbehave (Korbak et al., 2025c). A CoT monitor is an automated system that reads the CoT and other relevant information and flags suspicious or potentially harmful interactions (Baker et al., 2025). As long as CoTs are monitorable (i.e. accurately represent reasons behind model responses), CoT monitors should outperform action-only monitors because they have access to strictly more information: beliefs and motivations behind LLM agents’ actions. This suggests a number of projects focusing on evaluating monitorability, preserving it and leveraging it in control measures:

Evaluating monitorability

Measures of causal relevance of CoTs

Lanham et al., 2023

Whitebox chain-of-thought reliability

Lindsey et al., 2025

Steganographic capability

Korbak et al., 2025a)

CoT obfuscation capability evaluations: Determining whether models can obfuscate otherwise suspicious reasoning within their CoTs on purpose. These evaluations would help establish the extent to which models are capable of hiding malicious reasoning even if their CoT was not optimised against a monitor during training.

View The Alignment Project website to learn more or apply for funding here.

Discuss

Benchmark Design and Evaluation

Reward Model Benchmarking for Scalable Oversight

Benchmarking Concentrated AI Security Failures

Fundamental Science of AI Monitoring and Red Teaming

Fish AI Reader

FishAI

联系邮箱 441953276@qq.com

相关标签