Published on August 1, 2025 10:27 AM GMT

The Alignment Project is a global fund of over £15 million, dedicated to accelerating progress in AI control and alignment research. It is backed by an international coalition of governments, industry, venture capital and philanthropic funders. This post is part of a sequence on research areas that we are excited to fund through the The Alignment Project.

Apply now to join researchers worldwide in advancing AI safety.

Research Areas in AI Control

Problem statement: Current alignment methods can’t guarantee that an AI’s goals and actions match human intentions (Carlsmith, 2023). In the future, we might rely heavily on AIs performing autonomous work such as automated alignment research. At the capability level sufficient to do such research, these systems may already be capable of causing catastrophic harm or meaningfully escaping human control, and may choose to do so if they’re misaligned. The field of AI Control aims to reduce this risk through techniques such as monitoring of untrusted AI systems by other AI models and restricting the affordances of untrusted AI systems (Shlegeris and Greenblatt, 2024).

The aim is to develop control protocols – plans involving these techniques that are designed to prevent unsafe actions by AI systems. These control protocols should then be regularly evaluated during control evaluations – testing exercises designed to provide evidence that models would be unable to cause catastrophic harm even in case of intentional scheming.

Why this matters: Control measures provide a practical approach to mitigating loss of control risks that doesn’t require fundamental research breakthroughs. Control methods can be evaluated and deployed using models with current capability levels, making these methods particularly useful for responsible AI development in the near term. For an overview of how control measures can scale to increasingly powerful systems, see Korbak et al., 2025a).

Section overview

We organize this part of the research agenda into seven subsections. First three sub-sections focus on empirical investigations targeting key identified threat models:

High-stakes/concentrated AI safety failures

Research sabotage:

Covert malicious fine-tuning:

The following four sub-sections focus oncross-cutting research across multiple threat models:

Modelling of risks from potentially misaligned AI systems:

Uplift of trusted models:

Chain-of-thought monitoring:

Fundamental science of AI monitoring and red teaming:

We believe that many of these research areas require the efforts of entire research teams (3-10 people). We end the section with a call for new research groups focused on addressing different parts of the research agenda.

High-Stakes/Concentrated AI Safety Failures

Problem summary: We are interested in research on concentrated failure scenarios, where a relatively small number of actions are required for humans to lose control over scheming AI models. For example, it has been theorised that a misaligned future model might be able to create an unmonitored deployment by inserting just a few vulnerabilities in a model developer’s most critical codebases (Shlegeris, 2022). Concentrated failure scenarios could involve AI systems conducting sabotage during coding or system operation tasks, potentially leading to rogue internal deployments or weight exfiltration (Shlegeris, 2024).

Key subproblems:

Building high-stakes control settings. Settings are empirical research environments, for example an environment containing mock frontier model development infrastructure, where researchers can study the ability of models to conduct sabotage (or proxies for sabotage) and the effectiveness of control protocols in preventing it. They contain main tasks (e.g., legitimate software engineering tasks for a coding agent ), as well as side tasks that a misaligned model might attempt to conduct (e.g., reducing security restrictions that protect model weights). See ControlArena for example settings. A diverse range of settings for the study of control techniques would be extremely helpful for advancing control research. Possible further settings may include various software engineering settings, settings where models have access to a shell, or settings where models run ML experiments. Building settings is an active area of research because it is difficult to build settings that fulfill at least the following desiderata (Shlegeris, 2025):

The main tasks have to be such that frontier models are reasonably competent at them, but weaker models do much worse (because we want to rule out protocols that just use the trusted model).The side tasks shouldn’t require the models to make use of affordances that aren’t needed for the main task, since this allows for the blue team to flag any use of unnecessary affordances.It should be possible to inform both the red and blue teams as to what the possible side tasks may be.

A particularly promising way to build control settings might be to convert actual AI usage into control settings. We are interested in studies where people are given a version of an agentic code editor where the AI is sometimes replaced with a misaligned version that tries to insert security issues or sabotage the work if it thinks it can succeed. This allows the study of how often the AI successfully introduces issues without detection, identification of interesting sabotage strategies and greater understanding of which protocols are most effective in practice. See Greenblatt 2025 for more context. It is important to run control evaluations in these realistic settings to validate their usefulness for control research.

High-stakes control evaluations. We are interested in follow-up work on previous high-stakes control research, such as Greenblatt et al. 2023 and Bhatt et al. 2025. Valuable projects may include:

Shlegeris, 2024

Developing and evaluating processes for human auditing (with AI uplift). AI control benefits from having the ability to audit suspicious activity with human labor. It would be very useful to anticipate potential issues with our ability to use humans (who may be aided by AI systems) for auditing. An additional challenge worth studying, in the context of automated AI R&D, is that auditors may often be unfamiliar with the code they’re reviewing.

Some valuable research directions here include:

Kutasov et al., 2025

Shlegeris, 2024

(Greenblatt and Shlegeris, 2024)

Developing design patterns for limiting excessive agency. One way to reduce risks from schemers is to limit the potential paths through which they can cause harm. Adopting a principle of least privilege may be appropriate, where potentially scheming models have exactly the affordances they need in order to do their job. In the context of hierarchical agent structures, instances of the model higher up in the hierarchy could be set up to have more context but fewer direct affordances; creating strategies and making decisions, but then delegating sub-tasks. In contrast, instances lower in the hierarchy could have less context, but they may be able to write and execute code. In general, the situation would (all else equal) be safer the more we can encapsulate the key affordances of AIs in safeguarded APIs. In order to be useful, a project in this area would probably have to run empirical control evaluations which test structural safety measures such as:

Shlegeris, 2024

Beurer-Kellner et al., 2025

Chennabasappa et al., 2025

Halawi et al., 2024

Covert malicious fine-tuning

Chan et al., 2025

Research Sabotage

Problem summary: In the future, potentially scheming AIs will likely be used for large amounts of intellectual work, in particular AI R&D. A worry here is that they might systematically sabotage safety-relevant research, preventing us from controlling increasingly capable AI systems.

Why this matters: Preventing research sabotage matters insofar as we rely on AIs to help perform research that enables alignment of future even more capable AIs. Methods focused on catching clearly incriminating actions (such as the ones described in the High-stakes/concentrated AI safety failures section) may not be suitable for detecting this type of sabotage, as research sabotage is likely to be difficult to distinguish from lack of competence.

Key subproblems:

Scheming-robust scalable oversight research. Scalable or recursive oversight is a technique where training signals for more powerful models are provided by less capable models. We are interested in research on scalable oversight techniques robust to intentional subversion by the monitored models. We envision that this kind of scalable oversight research is done using control-style adversarial analysis (see Shlegeris and Greenblatt, 2025). Examining approaches to increasing robustness of AI safety via debate (Irving et al., 2018) offers one promising starting direction for this research.

Studying the sample efficiency of online training. Monitors might detect that models behave in undesirable but non-catastrophic ways during deployment. In such cases, AI developers are unlikely to shut down the deployment, but they might be motivated to perform online training to remove the undesirable behaviour. We’re interested in studying the sample efficiency of online training in order to understand the prospects of successfully training away undesirable scheming behaviours. Studying this requires developing malign initialisations, i.e., models which are trained to resist online training.

Mitigating sandbagging. A particular worry with misaligned models is that they might deliberately underperform in cases where high performance would be desirable such as dangerous capability evaluations, AI safety research, or when performing oversight over other AI models (see Shlegeris and Stastny, 2025 for detailed analysis). Previous empirical work has studied this threat model using password-locked models (see Greenblatt et al., 2025). We are interested in extensions of this work, as well as alternative approaches to studying sandbagging. One example project here would be to study password-locked reasoning models.

Covert Malicious Fine-Tuning

Problem summary: We are interested in better understanding covert malicious fine-tuning (CMFT, Halawi et al., 2024), whereby an AI system may covertly poison other AI systems. This could lead to the following types of risks:

Fine-tuning may instill a propensity for the newly fine-tuned model to produce artifacts containing information that flows back to the untrusted model that initiated the CMFT, for example, under the guise of performing a hypothesis-driven experiment. The main risk here is that the fine-tuned model does a large amount of malign intellectual labor, producing insights that increase the chance of a critical AI safety failure.Fine-tuning may instill a secret loyalty, which could be used in a coordinated attempt by the AI model to seize resources and escape human control.

We would like to advance our understanding of this threat model, and assess possible countermeasures.

Why this matters: Exploring the feasibility and mitigations of covert malicious fine-tuning matters because it poses a risk of rogue internal deployment where all visible artefacts (dataset, logs and model behaviour on normal prompts) look benign.

Key subproblems:

Studying foundational questions. We’d like to get more clarity on the various possible threat models:

Betley et al., 2025

Cloud et al. 2025

Studying countermeasures. There are multiple approaches to making CMFT significantly harder, especially when combined in a “defense in depth” approach. These include:

Auditing the dataset or RL environment.

Davies et al., 2025

Auditing the fine-tuned model.

Betley et al., 2025

Mallen and Hebbar, 2025

during

Asking the model that initiated the fine-tuning for detailed justifications.

Design patterns for limiting harmful information flow.

Modelling of Risks From Deployment of Potentially Misaligned AI Systems

Problem summary: Identification of the broad range of possible paths through which misaligned AI systems might cause catastrophic harm.

Why this matters: To have high confidence that a system of control protocols is sufficient to prevent catastrophic outcomes, it is essential to identify and evaluate the high-level pathways to harm (e.g., code sabotage, or human influencing lab employees).

We emphasise that the scale of work required here is substantial – it involves attempting to identify all the ways in which intelligent systems could sabotage complex safety systems.

Suggestions to learn more: For a literature review of possible risk analysis techniques, see Koessler and Schuett 2023. See the high level risk modelling approaches proposed by model developers (Anthropic, 2025, Google Deepmind, 2025, OpenAI, 2025), and examples of safety cases (Korbak et al., 2025b) and Systematic Hazard Analysis (Mylius, 2025) as two structured approaches.

Uplift of Trusted Models

Problem summary: These investigations explore ways to increase the monitoring capabilities of trusted AI models, that are otherwise less capable than their untrusted counterparts. Promising research directions here include the creation of reliable evaluations of model trustedness,as well as developing various ways of narrowly increasing model capabilities relevant to AI monitoring without increasing model scheming capabilities or propensity towards scheming.

Why this matters: Current proposed control protocols often benefit substantially from access to capable trusted models for monitoring and strategies like rewriting or editing potentially backdoored code when untrusted model outputs exceed suspicion thresholds (Greenblatt et al., 2023).

Key subproblems:

Developing trustedness evaluations. As models become more competent in the future, it will become less clear whether they can be trusted to not be scheming to escape human control (Järviniemi, 2024). Trustedness evaluations aim to evaluate an AI model’s capability to "fake" alignment during training and evaluation and then behave unsafely at deployment (see Mallen et al., 2024 for previous work).

Improving the usefulness of trusted monitors. For instance:

Shlegeris, 2024

Pacchiardi et al., 2023

Building capable models with narrow capability profiles. For instance:

unlearning

Chain-of-Thought Monitoring

Problem summary: Today’s LLM agents tend to think in chains-of-thought (CoTs) expressed in human language that we can monitor for intent to misbehave (Korbak et al., 2025c). A CoT monitor is an automated system that reads the CoT and other relevant information and flags suspicious or potentially harmful interactions (Baker et al., 2025). As long as CoTs are monitorable (i.e. accurately represent reasons behind model responses), CoT monitors should outperform action-only monitors because they have access to strictly more information: beliefs and motivations behind LLM agents’ actions. This suggests a number of projects focusing on evaluating monitorability, preserving it and leveraging it in control measures.

Why this matters: CoT monitorability matters because it's an additional safety layer that's already working today—catching models planning misbehaviour in their reasoning traces. No fundamental breakthroughs are needed. Without preserving this capability, we are flying blind since we have no other reliable window into models' intentions or planning processes as they become more capable (Korbak et al., 2025c).

Key subproblems:

Evaluating monitorability. Monitoring systems often rely on the assumption that CoTs are monitorable: accurately reflect models’ actual reasoning. Faithfulness (e.g. Lanham et al., 2023, Turpin et al., 2023, Roger et al., 2025, Chen et al., 2025) has been proposed as a proxy measure for monitorability. However, existing faithfulness evaluations do not disambiguate the propensity of models to externalise their reasoning, with the necessity of externalising their reasoning. Moreover, they typically only consider simple question answering settings, failing to identify the level of monitorability that is sufficient for ensuring safety in more complex domains. We’re interested in the development of novel evaluations: those measuring monitorability directly as well as those that target specific properties underlying monitorability. A suite of evaluations covering capabilities that determine the effectiveness of different CoT monitoring protocols includes:

Capability evaluations of latent reasoning

Korbak et al., 2025a)

Measures of causal relevance of CoTs

Lanham et al., 2023

Whitebox chain-of-thought reliability

Lindsey et al., 2025

Steganographic capability

Korbak et al., 2025a)

CoT obfuscation capability evaluations

Preserving monitorability. Monitorability of chains of thought should not be taken for granted. More research is needed to understand what degrades monitorability and which techniques could help preserve it. Example projects include:

Studying the effect of optimization pressure.

Decoding opaque reasoning.

Hao et al., 2024

Lample et al., 2017

Goldwasser et al., 2022

Artetxe et al., 2017

Fundamental Science of AI Monitoring and Red Teaming

Problem summary: There is a diverse range of research areas for improving our understanding of scheming AIs and ways to control them, that are not unique to specific threat models. This section provides examples of these cross-cutting research areas.

Key subproblems:

Catching natural examples of scheming and studying them. Finding completely natural examples of scheming in deployment would likely be most informative. Alternatively, it might be useful to develop rigged training processes that make scheming more likely, with approaches similar to those used to research model organisms of misalignment (see Hubinger et al., 2024, Turner et al., 2025). It might also be helpful to rig the training process in a way that produces a schemer that is easier to catch, e.g. by making scheming show up at lower capability levels or by training the schemer on synthetic facts in order to weaken its situational awareness. However, it should be noted that this risks making the schemer too disanalogous from the real threat. Some methods for making scheming more likely include:

Changing the training incentives

Altering capabilities or situational awareness.

Trying to get the AI to latch onto a scheming goal

Developing methods for few-shot catastrophe prevention. Related to the project above, but plausibly possible to study without natural examples, we are interested in techniques for improving safety in the case where a model has been caught misbehaving a small number of times:

When should you just train the monitor(s) vs train the policy (especially in cases where catching the model produces some value due to increased evidence for misalignment)?How effective is training against misalignment for removing in somewhat realistic cases where the model could adapt its policy to not get caught in this sort of case?How well does it work to train against proliferated examples of a case where we caught an AI? (Proliferation is where you try to capitalise on one example of AIs attempting to do something bad to find a diverse set of examples of AIs doing something bad.) This might be particularly effective for catching future collusion attempts, because we could train an untrusted monitor on proliferated collusion signals.

Studying collusion. The key challenge to untrusted monitoring is that the untrusted monitor might engage in collusion—deliberately letting through attacks by the untrusted policy (see Shlegeris, 2024 for detailed analysis.) We are interested in studying the prerequisites for collusion, such as whether models can recognise their own outputs under adversarial pressure, or whether they can pick coordination strategies without communication (Järviniemi, 2024).

Studying overelicitation. When running a control evaluation, we are interested in fully eliciting the dangerous capabilities of the untrusted model (for which the red team has given an attack policy). It is plausible, however, that the red team is not able to elicit best-effort performance, as the model might deliberately underperform. One approach to mitigating this sandbagging would be to fine-tune the model on relevant data, but that might teach the model new things and therefore overelicit the untrusted model. Given this conundrum, we’re interested in research on how to elicit best-effort performance from a model that might be sandbagging without teaching it new things. See Stastny and Greenblatt 2025 and Roger and Greenblatt 2025 for more details.

Evaluating low-likelihood attack capability. Evaluations to determine when models become capable of executing attacks with very low probability, making them difficult to detect during limited evaluation but likely during extended deployment.

Studying introspection and situational awareness. Evaluations to measure how accurately models understand their behavioural propensities, deployment context (if the model is currently being post-trained, evaluated, deployed etc), affordances, and control measures, which could enable more sophisticated evasion strategies (Laine et al., 2024; Binder et al., 2025).

Developing techniques for white-box control. We are interested in an increased understanding of the limits of white-box techniques for control. These could include linear probes for deception (see Goldowsky-Dill, 2025), or probes trained directly on off-policy malicious trajectories—there are many techniques in the literature that have been proposed. A key difficulty here is the development of realistic model organisms that can serve as test beds for white-box methods. For more details on research questions related to whitebox methods see also the “Interpretability” section.

Understanding the limits of synthetic input generation. It would be very useful if we had the ability to generate synthetic inputs that powerful LLMs can’t distinguish from “real” inputs. For instance, we may want to prevent a model from being able to confidently infer from its prompt that it is in an evaluation (see Apollo, 2025). We might also want to teach models certain synthetic facts, but there are open questions about how well this will work for powerful models. See Stastny and Greenblatt 2025 and Greenblatt 2025 for more details on this area of research.

Studying abstract analogies for scheming. In order to empirically study risks from schemers, we can try to study methods to train away deeply ingrained behaviours (which don’t necessarily have to look like misbehaviour) in LLMs that are structurally similar to scheming. Two proposed example projects in this area can be found in Stastny 2025.

Analyzing human insider threat parallels to AI scheming risk models. There exists extensive research on insider threat detection in domains such as cyber-security and national security (e.g. Capelli et al., 2012; Homoliak et al., 2019). This project involves conducting a literature review, to identify learnings that can be carried over to mitigating threats from scheming AI systems. It may also involve reviewing broader literature on sabotage of organisations such as the CIA Simple Sabotage Field Menu (CIA, 1944). The outputs of this project could be both high level security principles e.g. approaches to layered defense or proposals for implementation of specific techniques e.g. honeypots and tripwires.

Call for New Research Groups

In the previous section, we described project proposals of differing scopes illustrating the different axes along which the field of AI control can be advanced. We believe that many of these research areas require the efforts of entire research teams (3-10 people), and should possibly be undertaken by new research groups. In particular, we want to encourage the formation of new research groups that undertake ambitious projects including:

covert malicious fine-tuning

research sabotage

fundamental science

uplift trusted models

chain-of-thought monitoring

human-driven auditing

View The Alignment Project website to learn more or apply for funding here.

Discuss

Research Areas in AI Control

Section overview

High-Stakes/Concentrated AI Safety Failures

Research Sabotage

Key subproblems:

Covert Malicious Fine-Tuning

Modelling of Risks From Deployment of Potentially Misaligned AI Systems

Uplift of Trusted Models

Chain-of-Thought Monitoring

Fundamental Science of AI Monitoring and Red Teaming

Call for New Research Groups

Fish AI Reader

FishAI

联系邮箱 441953276@qq.com

相关标签