少点错误 前天 18:43
Research Areas in AI Control (The Alignment Project by UK AISI)
index_new5.html
../../../zaker_core/zaker_tpl_static/wap/tpl_guoji1.html

 

The Alignment Project是一个价值超过1500万英镑的全球基金,致力于加速AI控制和对齐研究。该项目由国际政府、产业、风险投资和慈善基金捐助者组成。文章介绍了AI控制领域的研究重点,包括高风险/集中式AI安全失败、研究破坏和隐蔽恶意微调等威胁模型,以及提升可信模型、思维链监控等跨领域研究。项目旨在通过控制协议和评估,降低AI失控风险,促进负责任的AI发展。

🔍 高风险/集中式AI安全失败:研究AI模型少量行为导致人类失去控制的情况,如通过插入漏洞破坏关键代码库或进行系统操作中的破坏行为。重点在于构建高风险控制环境,如模拟前沿模型开发基础设施,并评估控制协议的有效性。

🛠️ 研究破坏:探索AI系统如何暗中破坏安全相关研究,例如通过系统性 sabotage 阻碍对齐研究。研究重点包括设计抗恶意破坏的可扩展监督技术,研究在线训练的样本效率,以及缓解 sandbagging(故意表现不佳)的行为。

🎯 隐蔽恶意微调:研究AI系统如何暗中污染其他AI系统,例如植入后门或秘密忠诚。研究重点在于理解植入后门或秘密忠诚的最难检测和样本效率最高的方法,并探索数据集、模型和微调过程的审计方法等反制措施。

🧠 可信模型提升:开发评估可信度、提升监控器技能(如漏洞检测、隐写术识别)的方法,并训练具有狭窄能力配置的监控模型。研究还包括思维链监控,测量和保留AI模型推理的可监控性,并将其用于控制策略。

🔬 基础科学:解决AI监控和红队工作的基础研究问题,包括研究自然存在的欺骗案例、开发小样本灾难预防方法,以及研究不可信模型之间的串通行为。

Published on August 1, 2025 10:27 AM GMT

 The Alignment Project is a global fund of over £15 million, dedicated to accelerating progress in AI control and alignment research. It is backed by an international coalition of governments, industry, venture capital and philanthropic funders. This post is part of a sequence on research areas that we are excited to fund through the The Alignment Project.

Apply now to join researchers worldwide in advancing AI safety.

Research Areas in AI Control

Problem statement: Current alignment methods can’t guarantee that an AI’s goals and actions match human intentions (Carlsmith, 2023). In the future, we might rely heavily on AIs performing autonomous work such as automated alignment research. At the capability level sufficient to do such research, these systems may already be capable of causing catastrophic harm or meaningfully escaping human control, and may choose to do so if they’re misaligned. The field of AI Control aims to reduce this risk through techniques such as monitoring of untrusted AI systems by other AI models and restricting the affordances of untrusted AI systems (Shlegeris and Greenblatt, 2024).

The aim is to develop control protocols – plans involving these techniques that are designed to prevent unsafe actions by AI systems. These control protocols should then be regularly evaluated during control evaluations – testing exercises designed to provide evidence that models would be unable to cause catastrophic harm even in case of intentional scheming.

Why this matters: Control measures provide a practical approach to mitigating loss of control risks that doesn’t require fundamental research breakthroughs. Control methods can be evaluated and deployed using models with current capability levels, making these methods particularly useful for responsible AI development in the near term. For an overview of how control measures can scale to increasingly powerful systems, see Korbak et al., 2025a).

Section overview

We organize this part of the research agenda into seven subsections.  First three sub-sections focus on empirical investigations targeting key identified threat models:

    High-stakes/concentrated AI safety failures: Investigating scenarios where a relatively small number of actions by AI models could lead to humans losing control of them.Research sabotage: Exploring scenarios where AI systems subtly sabotage safety-related research.Covert malicious fine-tuning: Examining scenarios where AI systems covertly data poison other AI systems – for example to instill backdoors or secret loyalties.

The following four sub-sections focus oncross-cutting research across multiple threat models:

    Modelling of risks from potentially misaligned AI systems: Identifying key paths to harm and prioritising attack vectors for research and mitigations.Uplift of trusted models: Developing trustedness evaluations to assess scheming likelihood, enhancing the skills of trusted monitors (e.g., vulnerability detection, steganography spotting) and developing monitor models with narrow capability profiles through specialized training.Chain-of-thought monitoring: Measuring and preserving the monitorability of reasoning verbalised by AI models and leveraging the verbalised reasoning in control strategies.Fundamental science of AI monitoring and red teaming: Addressing foundational research questions, including catching and studying natural examples of scheming, developing few-shot catastrophe prevention methods, and studying collusion between untrusted models.

We believe that many of these research areas require the efforts of entire research teams (3-10 people). We end the section with a call for new research groups focused on addressing different parts of the research agenda.

High-Stakes/Concentrated AI Safety Failures

Problem summary: We are interested in research on concentrated failure scenarios, where a relatively small number of actions are required for humans to lose control over scheming AI models. For example, it has been theorised that a misaligned future model might be able to create an unmonitored deployment by inserting just a few vulnerabilities in a model developer’s most critical codebases (Shlegeris, 2022). Concentrated failure scenarios could involve AI systems conducting sabotage during coding or system operation tasks, potentially leading to rogue internal deployments or weight exfiltration (Shlegeris, 2024).

Key subproblems:

Building high-stakes control settings. Settings are empirical research environments, for example an environment containing mock frontier model development infrastructure, where researchers can study the ability of models to conduct sabotage (or proxies for sabotage) and the effectiveness of control protocols in preventing it. They contain main tasks (e.g., legitimate software engineering tasks for a coding agent ), as well as side tasks that a misaligned model might attempt to conduct (e.g., reducing security restrictions that protect model weights). See ControlArena for example settings. A diverse range of settings for the study of control techniques would be extremely helpful for advancing control research. Possible further settings may include various software engineering settings, settings where models have access to a shell, or settings where models run ML experiments. Building settings is an active area of research because it is difficult to build settings that fulfill at least the following desiderata (Shlegeris, 2025):

A particularly promising way to build control settings might be to convert actual AI usage into control settings. We are interested in studies where people are given a version of an agentic code editor where the AI is sometimes replaced with a misaligned version that tries to insert security issues or sabotage the work if it thinks it can succeed. This allows the study of how often the AI successfully introduces issues without detection, identification of interesting sabotage strategies and greater understanding of which protocols are most effective in practice. See Greenblatt 2025 for more context. It is important to run control evaluations in these realistic settings to validate their usefulness for control research.

High-stakes control evaluations. We are interested in follow-up work on previous high-stakes control research, such as Greenblatt et al. 2023 and Bhatt et al. 2025. Valuable projects may include:

Developing and evaluating processes for human auditing (with AI uplift). AI control benefits from having the ability to audit suspicious activity with human labor. It would be very useful to anticipate potential issues with our ability to use humans (who may be aided by AI systems) for auditing. An additional challenge worth studying, in the context of automated AI R&D, is that auditors may often be unfamiliar with the code they’re reviewing. 

Some valuable research directions here include:

Developing design patterns for limiting excessive agencyOne way to reduce risks from schemers is to limit the potential paths through which they can cause harm. Adopting a principle of least privilege may be appropriate, where potentially scheming models have exactly the affordances they need in order to do their job. In the context of hierarchical agent structures, instances of the model higher up in the hierarchy could be set up to have more context but fewer direct affordances; creating strategies and making decisions, but then delegating sub-tasks. In contrast, instances lower in the hierarchy could have less context, but they may be able to write and execute code. In general, the situation would (all else equal) be safer the more we can encapsulate the key affordances of AIs in safeguarded APIs. In order to be useful, a project in this area would probably have to run empirical control evaluations which test structural safety measures such as:

Research Sabotage

Problem summary: In the future, potentially scheming AIs will likely be used for large amounts of intellectual work, in particular AI R&D. A worry here is that they might systematically sabotage safety-relevant research, preventing us from controlling increasingly capable AI systems. 

Why this matters: Preventing research sabotage matters insofar as we rely on AIs to help perform research that enables alignment of future even more capable AIs. Methods focused on catching clearly incriminating actions (such as the ones described in the High-stakes/concentrated AI safety failures section) may not be suitable for detecting this type of sabotage, as research sabotage is likely to be difficult to distinguish from lack of competence.

Key subproblems:

Scheming-robust scalable oversight research. Scalable or recursive oversight is a technique where training signals for more powerful models are provided by less capable models. We are interested in research on scalable oversight techniques robust to intentional subversion by the monitored models. We envision that this kind of scalable oversight research is done using control-style adversarial analysis (see Shlegeris and Greenblatt, 2025). Examining approaches to increasing robustness of AI safety via debate (Irving et al., 2018) offers one promising starting direction for this research.

Studying the sample efficiency of online training. Monitors might detect that models behave in undesirable but non-catastrophic ways during deployment. In such cases, AI developers are unlikely to shut down the deployment, but they might be motivated to perform online training to remove the undesirable behaviour. We’re interested in studying the sample efficiency of online training in order to understand the prospects of successfully training away undesirable scheming behaviours. Studying this requires developing malign initialisations, i.e., models which are trained to resist online training.

Mitigating sandbagging. A particular worry with misaligned models is that they might deliberately underperform in cases where high performance would be desirable such as dangerous capability evaluations, AI safety research, or when performing oversight over other AI models (see Shlegeris and Stastny, 2025 for detailed analysis). Previous empirical work has studied this threat model using password-locked models (see Greenblatt et al., 2025). We are interested in extensions of this work, as well as alternative approaches to studying sandbagging. One example project here would be to study password-locked reasoning models.

Covert Malicious Fine-Tuning

Problem summary: We are interested in better understanding covert malicious fine-tuning (CMFT, Halawi et al., 2024), whereby an AI system may covertly poison other AI systems. This could lead to the following types of risks:

We would like to advance our understanding of this threat model, and assess possible countermeasures.

Why this matters: Exploring the feasibility and mitigations of covert malicious fine-tuning matters because it poses a risk of rogue internal deployment where all visible artefacts (dataset, logs and model behaviour on normal prompts) look benign. 

Key subproblems:

Studying foundational questions. We’d like to get more clarity on the various possible threat models:

Studying countermeasures. There are multiple approaches to making CMFT significantly harder, especially when combined in a “defense in depth” approach. These include:

Modelling of Risks From Deployment of Potentially Misaligned AI Systems

Problem summary: Identification of the broad range of possible paths through which misaligned AI systems might cause catastrophic harm.

Why this matters: To have high confidence that a system of control protocols is sufficient to prevent catastrophic outcomes, it is essential to identify and evaluate the high-level pathways to harm (e.g., code sabotage, or human influencing lab employees).

We emphasise that the scale of work required here is substantial – it involves attempting to identify all the ways in which intelligent systems could sabotage complex safety systems.

Suggestions to learn more: For a literature review of possible risk analysis techniques, see Koessler and Schuett 2023. See the high level risk modelling approaches proposed by model developers (Anthropic, 2025, Google Deepmind, 2025, OpenAI, 2025), and examples of safety cases (Korbak et al., 2025b) and Systematic Hazard Analysis (Mylius, 2025) as two structured approaches.

Uplift of Trusted Models

Problem summary: These investigations explore ways to increase the monitoring capabilities of trusted AI models, that are otherwise less capable than their untrusted counterparts. Promising research directions here include the creation of reliable evaluations of model trustedness,as well as developing various ways of narrowly increasing model capabilities relevant to AI monitoring without increasing model scheming capabilities or propensity towards scheming.

Why this matters: Current proposed control protocols often benefit substantially from access to capable trusted models for monitoring and strategies like rewriting or editing potentially backdoored code when untrusted model outputs exceed suspicion thresholds (Greenblatt et al., 2023).

Key subproblems:

Developing trustedness evaluations. As models become more competent in the future, it will become less clear whether they can be trusted to not be scheming to escape human control (Järviniemi, 2024). Trustedness evaluations aim to evaluate an AI model’s capability to "fake" alignment during training and evaluation and then behave unsafely at deployment (see Mallen et al., 2024 for previous work).

Improving the usefulness of trusted monitors. For instance:

Building capable models with narrow capability profiles. For instance:

Chain-of-Thought Monitoring

Problem summary: Today’s LLM agents tend to think in chains-of-thought (CoTs) expressed in human language that we can monitor for intent to misbehave (Korbak et al., 2025c). A CoT monitor is an automated system that reads the CoT and other relevant information and flags suspicious or potentially harmful interactions (Baker et al., 2025). As long as CoTs are monitorable (i.e. accurately represent reasons behind model responses), CoT monitors should outperform action-only monitors because they have access to strictly more information: beliefs and motivations behind LLM agents’ actions. This suggests a number of projects focusing on evaluating monitorability, preserving it and leveraging it in control measures.

Why this matters: CoT monitorability matters because it's an additional safety layer that's already working today—catching models planning misbehaviour in their reasoning traces. No fundamental breakthroughs are needed. Without preserving this capability, we are flying blind since we have no other reliable window into models' intentions or planning processes as they become more capable (Korbak et al., 2025c).

Key subproblems

Evaluating monitorability. Monitoring systems often rely on the assumption that CoTs are monitorable: accurately reflect models’ actual reasoning. Faithfulness (e.g. Lanham et al., 2023, Turpin et al., 2023, Roger et al., 2025, Chen et al., 2025) has been proposed as a proxy measure for monitorability. However, existing faithfulness evaluations do not disambiguate the propensity of models to externalise their reasoning, with the necessity of externalising their reasoning. Moreover, they typically only consider simple question answering settings, failing to identify the level of monitorability that is sufficient for ensuring safety in more complex domains. We’re interested in the development of novel evaluations: those measuring monitorability directly as well as those that target specific properties underlying monitorability. A suite of evaluations covering capabilities that determine the effectiveness of different CoT monitoring protocols includes:

Preserving monitorability. Monitorability of chains of thought should not be taken for granted. More research is needed to understand what degrades monitorability and which techniques could help preserve it. Example projects include:

Fundamental Science of AI Monitoring and Red Teaming

Problem summary: There is a diverse range of research areas for improving our understanding of scheming AIs and ways to control them, that are not unique to specific threat models. This section provides examples of these cross-cutting research areas.

Key subproblems:

Catching natural examples of scheming and studying them. Finding completely natural examples of scheming in deployment would likely be most informative. Alternatively, it might be useful to develop rigged training processes that make scheming more likely, with approaches similar to those used to research model organisms of misalignment (see Hubinger et al., 2024, Turner et al., 2025). It might also be helpful to rig the training process in a way that produces a schemer that is easier to catch, e.g. by making scheming show up at lower capability levels or by training the schemer on synthetic facts in order to weaken its situational awareness. However, it should be noted that this risks making the schemer too disanalogous from the real threat. Some methods for making scheming more likely include:

Developing methods for few-shot catastrophe prevention. Related to the project above, but plausibly possible to study without natural examples, we are interested in techniques for improving safety in the case where a model has been caught misbehaving a small number of times:

Studying collusion. The key challenge to untrusted monitoring is that the untrusted monitor might engage in collusion—deliberately letting through attacks by the untrusted policy (see Shlegeris, 2024 for detailed analysis.) We are interested in studying the prerequisites for collusion, such as whether models can recognise their own outputs under adversarial pressure, or whether they can pick coordination strategies without communication (Järviniemi, 2024).

Studying overelicitation. When running a control evaluation, we are interested in fully eliciting the dangerous capabilities of the untrusted model (for which the red team has given an attack policy). It is plausible, however, that the red team is not able to elicit best-effort performance, as the model might deliberately underperform. One approach to mitigating this sandbagging would be to fine-tune the model on relevant data, but that might teach the model new things and therefore overelicit the untrusted model. Given this conundrum, we’re interested in research on how to elicit best-effort performance from a model that might be sandbagging without teaching it new things. See Stastny and Greenblatt 2025 and Roger and Greenblatt 2025 for more details.

Evaluating low-likelihood attack capability. Evaluations to determine when models become capable of executing attacks with very low probability, making them difficult to detect during limited evaluation but likely during extended deployment. 

Studying introspection and situational awareness. Evaluations to measure how accurately models understand their behavioural propensities, deployment context (if the model is currently  being post-trained, evaluated, deployed etc), affordances, and control measures, which could enable more sophisticated evasion strategies (Laine et al., 2024; Binder et al., 2025).

Developing techniques for white-box control. We are interested in an increased understanding of the limits of white-box techniques for control. These could include linear probes for deception (see Goldowsky-Dill, 2025), or probes trained directly on off-policy malicious trajectories—there are many techniques in the literature that have been proposed. A key difficulty here is the development of realistic model organisms that can serve as test beds for white-box methods. For more details on research questions related to whitebox methods see also the “Interpretability” section.

Understanding the limits of synthetic input generation. It would be very useful if we had the ability to generate synthetic inputs that powerful LLMs can’t distinguish from “real” inputs. For instance, we may want to prevent a model from being able to confidently infer from its prompt that it is in an evaluation (see Apollo, 2025). We might also want to teach models certain synthetic facts, but there are open questions about how well this will work for powerful models. See Stastny and Greenblatt 2025 and Greenblatt 2025 for more details on this area of research.

Studying abstract analogies for scheming. In order to empirically study risks from schemers, we can try to study methods to train away deeply ingrained behaviours (which don’t necessarily have to look like misbehaviour) in LLMs that are structurally similar to scheming. Two proposed example projects in this area can be found in Stastny 2025.

Analyzing human insider threat parallels to AI scheming risk models. There exists extensive research on insider threat detection in domains such as cyber-security and national security (e.g. Capelli et al., 2012; Homoliak et al., 2019). This project involves conducting a literature review, to identify learnings that can be carried over to mitigating threats from scheming AI systems. It may also involve reviewing broader literature on sabotage of organisations such as the CIA Simple Sabotage Field Menu (CIA, 1944). The outputs of this project could be both high level security principles e.g. approaches to layered defense or proposals for implementation of specific techniques e.g. honeypots and tripwires.

Call for New Research Groups

In the previous section, we described project proposals of differing scopes illustrating the different axes along which the field of AI control can be advanced. We believe that many of these research areas require the efforts of entire research teams (3-10 people), and should possibly be undertaken by new research groups. In particular, we want to encourage the formation of new research groups that undertake ambitious projects including:

 

View The Alignment Project website to learn more or apply for funding here.



Discuss

Fish AI Reader

Fish AI Reader

AI辅助创作,多种专业模板,深度分析,高质量内容生成。从观点提取到深度思考,FishAI为您提供全方位的创作支持。新版本引入自定义参数,让您的创作更加个性化和精准。

FishAI

FishAI

鱼阅,AI 时代的下一个智能信息助手,助你摆脱信息焦虑

联系邮箱 441953276@qq.com

相关标签

AI控制 AI对齐 安全风险 控制协议 监控技术
相关文章