Research Areas in Cognitive Science (The Alignment Project by UK AISI)

The Alignment Project is a global fund of over £15 million aimed at accelerating progress in AI control and alignment research, backed by international governments, industry, venture capital and philanthropic funders. This article outlines the research areas the project will fund, in particular the use of human oversight to address systematic errors that AI models may pick up during training. It focuses on how to design better evaluation pipelines, drawing on cognitive science, the wisdom of crowds, collective deliberation and AI assistance, so that models still produce the intended good outcomes when their capabilities exceed human judgment. It also discusses experimental designs for running scalable oversight protocols and the development and evaluation of processes for human auditing of potentially "scheming" models.

🎯 **Systematic human error is a key challenge for AI alignment**: The training of modern AI models (such as LLMs) relies heavily on human preference judgments. However, humans can introduce systematic errors into the training signal through idiosyncratic priors, lapses of attention, cognitive biases, reasoning flaws and other factors. Designing evaluation pipelines that minimise these human shortcomings is therefore essential for deploying AI models safely.

💡 **Scalable oversight targets the limits of human judgment**: When AI models become more capable than their human supervisors, ensuring that their behaviour remains as intended is the core problem of "scalable oversight". The approach reduces hard problems to sub-problems that humans can judge more easily, for example by using AI assistants to help with rating, or by having models debate one another to support human judgment.

🤝 **Using the wisdom of crowds and collective deliberation to improve alignment**: The article discusses several ways to improve alignment, including weighting training signals by human raters' confidence, pooling the (confidence-weighted) responses of multiple raters to exploit the "wisdom of the crowds", and improving alignment through discussion among human raters, all grounded in principles from cognitive science and behavioural economics.

🔍 **The role of human auditing and AI assistance in AI control**: In AI control, human auditing is critical for monitoring and confirming AI behaviour, especially where AI systems may carry out tasks autonomously and potentially escape control. The article emphasises the importance of developing efficient, low-cost human auditing processes, and examines the potential gains and risks of AI-assisted auditing, as well as how to measure and mitigate cognitive biases arising from imbalanced data.

Published on August 1, 2025 10:26 AM GMT

The Alignment Project is a global fund of over £15 million, dedicated to accelerating progress in AI control and alignment research. It is backed by an international coalition of governments, industry, venture capital and philanthropic funders. This post is part of a sequence on research areas that we are excited to fund through The Alignment Project.

Apply now to join researchers worldwide in advancing AI safety.

Cognitive science

Systematic Human Error

Problem summary: Modern AI models (LLMs and associated agents) depend critically on human supervision in the training loop: AI models do tasks, and human judges express their preferences about their outputs or performance. These judgments are used as training signals to improve the model. However, there are many reasons why humans could introduce systematic error into a training signal: they have idiosyncratic priors, limited attention, systematic cognitive biases, reasoning flaws, moral blind spots, etc. How can we design evaluation pipelines that mitigate these flaws to minimise systematic error, so that the models are safe for deployment? 

For the sub-problems below, we frame the problem as follows. The core question is: given the potential limitations of the human judges, how can we align the model to produce outputs that are ultimately good rather than bad? We think of this as a mechanism design problem: we want to find ways to harness human metacognition, the wisdom of the crowds, collective deliberation or debate, and the use of AI models as thought partners to improve alignment.

We have both scientific goals and engineering goals. The engineering goals are focussed on post-training pipeline design and are geared towards finding ways to improve alignment. The scientific questions concern the nature of the cognitive or social processes that make weak-to-strong alignment more or less challenging. We think that this is an interesting problem in its own right (for example, you can see representative democracy as a special case of weak-to-strong alignment). Some of the scientific goals build on work already conducted in other domains.

Why this matters: In some cases, the model being trained may be more capable than the humans providing training signals. For example, the model may be writing code for human judges, who cannot tell whether or how the code will execute just from reading it.  The problem of correctly rewarding desired behaviours for AI systems when those behaviours are beyond humans’ ability to efficiently judge is called “scalable oversight”.  The idea behind scalable oversight methods is to reduce harder problems to easier ones where human judgment can be relied upon. For example, judges might be assisted by another AI model (a rater assistant), or judge the outcome of a debate between different strong models (safety by debate). 
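To make the shape of such a pipeline concrete, here is a minimal sketch of a pairwise-comparison oversight step with an optional rater assistant. The dataclass fields and function names are illustrative assumptions on our part, not an interface used by The Alignment Project.

```python
# Minimal sketch: a human judge compares two model outputs, optionally after
# reading an AI-written critique (the 'rater assistant' idea). All names here
# are illustrative assumptions.
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Comparison:
    prompt: str
    output_a: str
    output_b: str
    preferred: str           # "a" or "b", chosen by the human judge
    judge_confidence: float  # self-reported confidence in [0.5, 1.0]

def collect_preference(
    prompt: str,
    output_a: str,
    output_b: str,
    human_judge: Callable[[str, str, str, Optional[str]], tuple],
    rater_assistant: Optional[Callable[[str, str, str], str]] = None,
) -> Comparison:
    """Ask a human judge to compare two model outputs, optionally showing an
    AI-written critique first. Returns the judgment as a training-signal record."""
    critique = rater_assistant(prompt, output_a, output_b) if rater_assistant else None
    preferred, confidence = human_judge(prompt, output_a, output_b, critique)
    return Comparison(prompt, output_a, output_b, preferred, confidence)
```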

Examples of subproblems: 

The problems summarised below are inspired by the initial thinking from Irving and Askell, 2019 (this paper focuses on debate, but the ideas are relevant to other methods). 

- Weighting responses by their confidence provides higher-precision aggregate estimates. Can we improve alignment by weighting training signals by the confidence expressed by human raters? We could do this either within rater (downweight specific choices) or between raters (downweight less calibrated individuals).
- Can we improve alignment using the 'wisdom of the crowds', by pooling the (confidence-weighted) responses of multiple human raters? For example, adopting techniques similar to those used in Bahrami et al., 2010. (A minimal sketch of confidence-weighted pooling follows this list.)
- If humans are provided with the opportunity to discuss their preferences before assigning a rating, does this improve alignment? Related to Bergman et al., 2024. Note that discussion can confer outsize effects on collective accuracy (Navajas et al., 2018).
- Does giving raters more time to reflect help or hurt the alignment process, considering the relative quantity of training data produced? Can people learn to discern how much time they should allocate to a choice, given their sensitivity? Optimally, both very hard and very easy choices should be made quickly (Masis et al., 2023). Do humans do this when judging AI outputs?
- When judging messages from experts, humans can assess their trustworthiness from surface features (such as the tone, confidence, or crispness with which they are expressed), even if they are not experts themselves. Do similar signals exist for AI models in expert domains, possibly using interpretability tools?
- If human users are given the option to query the strong model, does this improve their ability to discern whether its answers are accurate or not? What queries work best? Do people tend to ask questions that maximise information gain relevant to the problem? See Rothe et al., 2018 for work that has studied the quality of human questions.
- Where humans have a (potentially unlimited) number of weak AI systems that could be used to help, how can these best be used to assist with the rating process? For example, do humans tend to ignore the assistance (overconfidence) or follow it slavishly (overreliance), relative to optimal (conditional on how accurate the model is)? See Bridgers et al., 2024 for more details concerning rater-assist and hybridisation methods.
- Let's imagine that we are able to create two copies of the strong model, and encourage one to point out the flaws in the argument of the other. Would humans be able to discern which model was correct? See Irving et al., 2018 and Michael et al., 2023.
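As promised above, here is a minimal sketch of confidence-weighted pooling of rater preferences. It assumes each rater reports a binary vote plus a self-reported confidence in (0.5, 1.0), and combines the votes as additive log-odds evidence; this is one simple aggregation rule among many, not a recommended method.

```python
# Minimal sketch of confidence-weighted pooling ('wisdom of the crowds').
# Assumes independent raters with roughly calibrated confidences.
import numpy as np

def pooled_preference(choices, confidences):
    """choices: 0/1 votes (prefer output A vs. output B).
    confidences: each rater's probability that their own vote is correct.
    Returns the pooled probability that B is preferred, and the pooled decision."""
    choices = np.asarray(choices, dtype=float)
    confidences = np.clip(np.asarray(confidences, dtype=float), 0.5 + 1e-6, 1 - 1e-6)
    # Each rater contributes evidence proportional to the log-odds of their confidence.
    log_odds = np.log(confidences / (1 - confidences))
    signed_evidence = np.where(choices == 1, log_odds, -log_odds)
    prob_b = 1 / (1 + np.exp(-signed_evidence.sum()))
    return prob_b, int(prob_b > 0.5)

# Example: two raters mildly prefer B, one strongly prefers A; under this
# weighting, the confident dissenter outweighs the majority.
prob_b, decision = pooled_preference(choices=[1, 1, 0], confidences=[0.70, 0.65, 0.90])
```

Between-rater weighting could be layered on top, for example by shrinking each rater's log-odds towards zero according to how well calibrated they have been historically.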

Suggestions to learn more: We’d recommend reading Bridgers et al., 2024 and Irving and Askell, 2019.

Running Scalable Oversight Protocols With Humans 

Problem summary: As scalable oversight protocols are developed, they should be empirically tested – and where LLMs are not yet capable enough to carry out the protocols, human participants can be used instead.

Why this matters: Scalable oversight is the problem of correctly rewarding desired behaviours for AI systems, even when those behaviours are beyond humans’ ability to efficiently judge. A scalable oversight method is therefore needed to supervise for honesty, helpfulness and harmlessness in models with capabilities exceeding human judges. 

Examples of subproblems: 

Established subproblems: 

1. Investigate the experimental design for studies of this type and conduct small-scale trial experiments to improve future work in the area. As an example, previous OpenAI human studies surfaced flaws in debate protocols (Barnes, 2020), where malicious debaters could win using ambiguous language or obfuscated arguments with hard-to-find flaws.
2. Barnes and Christiano, 2020 found a failure mode where debaters could win through ambiguous language, and proposed an improved "cross-examination" scheme to mitigate this failure. However, more recent human debate experiments such as Michael et al., 2023 haven't used cross-examination. Can we replicate the benefit of cross-examination in human experiments? This requires careful design, as cross-examination uses snapshots of the mental state of a debater, which cannot be fully replicated with humans.
3. We have recent theoretical work on when the obfuscated arguments problem from (1) can be avoided, introducing a new "prover-estimator debate" scalable oversight protocol (Brown-Cohen et al., 2025). However, it has not been empirically tested. Implement prover-estimator debate with human participants and measure its effectiveness (a minimal debate-trial harness is sketched after this list).
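As referenced in item 3, below is a minimal sketch of a single human-judged debate trial: a generic two-debater, one-judge loop with a fixed number of rounds. The data fields and role interfaces are illustrative assumptions, and this does not implement the prover-estimator protocol itself, which requires the full machinery described in Brown-Cohen et al., 2025.

```python
# Minimal sketch of one human-judged debate trial: two debaters argue for
# different answers over a fixed number of rounds, then a human judge picks.
# Roles are supplied as callables; all names here are illustrative assumptions.
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class DebateTrial:
    question: str
    answer_a: str                  # answer defended by debater A
    answer_b: str                  # answer defended by debater B
    correct: str                   # ground truth ("a" or "b"), hidden from the judge
    transcript: List[str] = field(default_factory=list)

def run_debate(
    trial: DebateTrial,
    debater_a: Callable[[DebateTrial], str],
    debater_b: Callable[[DebateTrial], str],
    judge: Callable[[DebateTrial], str],
    n_rounds: int = 3,
) -> bool:
    """Alternate debater arguments for n_rounds, then ask the (human) judge
    for a verdict. Returns True if the judge identified the correct answer."""
    for _ in range(n_rounds):
        trial.transcript.append("A: " + debater_a(trial))
        trial.transcript.append("B: " + debater_b(trial))
    verdict = judge(trial)         # "a" or "b"
    return verdict == trial.correct
```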

New questions where we are not aware of prior work: 

- Scalable oversight protocols depend on stability: small changes in inputs shouldn't produce drastically different outputs. We may not be able to prove the stability of our algorithms, so conduct empirical experiments with human subjects to examine the effects of small perturbations on scalable oversight protocols in practice (see the sketch below).
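One rough way such a perturbation experiment could be structured, sketched below: treat the whole oversight protocol as a black box from input to verdict, perturb each input slightly (e.g. paraphrasing or reordering evidence), and measure how often the verdict flips. The function names and perturbation choice are illustrative assumptions.

```python
# Rough sketch of an empirical stability check for an oversight protocol.
# 'protocol' and 'perturb' are placeholders to be supplied by the experimenter.
from typing import Callable, Sequence

def verdict_flip_rate(
    inputs: Sequence[str],
    protocol: Callable[[str], str],   # e.g. runs a debate and returns the judge's verdict
    perturb: Callable[[str], str],    # e.g. a small paraphrase of the input
    n_perturbations: int = 10,
) -> float:
    """Fraction of (input, perturbation) pairs whose verdict differs from the baseline."""
    flips, total = 0, 0
    for x in inputs:
        baseline = protocol(x)
        for _ in range(n_perturbations):
            total += 1
            if protocol(perturb(x)) != baseline:
                flips += 1
    return flips / total if total else 0.0
```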

Suggestions to learn more: We’d recommend reading Irving and Askell, 2019. Previous research in this area includes Parrish et al., 2022, Michael et al., 2023, and Rahman et al., 2025.

Developing and Evaluating Processes for Human Auditing of Potentially Scheming Models

Problem summary: In the future, we might rely heavily on AIs performing autonomous work such as automated alignment research. At the capability level sufficient to do such research, these systems may already be capable of causing catastrophic harm or meaningfully escaping human control, and may choose to do so if they’re misaligned. The field of AI Control aims to reduce this risk through techniques like monitoring by other AI models and restricting the affordances of untrusted AI systems (Shlegeris and Greenblatt, 2024).

The early stages of an AI control protocol will usually involve automated filters and modification via machine learning classifiers or other tools. However, we expect that any costly action taken as a result of these automated filters will eventually involve human input for confirmation. Therefore, it will be critical to understand how control protocols involving humans operate, both by applying prior knowledge from cognitive science about human behaviour and cognitive biases, and via end-to-end tests of control protocols involving humans. 
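A minimal sketch of the escalation logic described above is given below: an automated monitor scores each proposed action for suspiciousness, and only the most suspicious actions are escalated to a (costly) human audit before execution. The thresholds, names, and three-way outcome are illustrative assumptions rather than a specification of any existing control protocol.

```python
# Minimal sketch of a control-protocol step with automated filtering and
# human confirmation for costly actions. All names and thresholds are
# illustrative assumptions.
from typing import Callable

def control_step(
    action: str,
    monitor_score: Callable[[str], float],  # suspiciousness in [0, 1] from a trusted classifier
    human_audit: Callable[[str], bool],     # True if the human auditor judges the action safe
    audit_threshold: float = 0.90,
    block_threshold: float = 0.99,
) -> str:
    """Return 'execute', 'audit-approved', or 'blocked' for a proposed action."""
    score = monitor_score(action)
    if score >= block_threshold:
        return "blocked"                    # too suspicious to act on, even with human sign-off
    if score >= audit_threshold:
        # Costly path: human confirmation before any consequential action proceeds.
        return "audit-approved" if human_audit(action) else "blocked"
    return "execute"
```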

Why this matters: Control measures provide a practical safety approach to mitigating loss-of-control risks that doesn’t require fundamental research breakthroughs. Control methods can be evaluated and deployed using current capabilities, making them particularly useful for responsible AI development in the near term. For an overview of how control measures can scale to increasingly powerful systems, see Korbak et al., 2025a.

Developing and evaluating processes for human auditing (with AI uplift). AI control benefits from having the ability to audit suspicious activity with human labor. It would be very useful to anticipate potential issues with our ability to use humans (who may be aided by AI systems) for auditing. An additional challenge worth studying, in the context of automated AI R&D, is that auditors may often be unfamiliar with the code they’re reviewing. 
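One simple way to quantify AI uplift for human auditors, sketched below, is to run the same set of labelled benign and malicious cases past auditors with and without an AI assistant and compare detection metrics. The data format and metric choices are illustrative assumptions; in practice, the rarity of genuinely malicious cases would also need to be accounted for when interpreting false-positive costs.

```python
# Minimal sketch for estimating auditor 'uplift' from AI assistance by
# comparing detection metrics with and without the assistant.
# Data format and metrics are illustrative assumptions.
from typing import Sequence

def detection_metrics(labels: Sequence[int], verdicts: Sequence[int]) -> dict:
    """labels: 1 = genuinely malicious case, 0 = benign.
    verdicts: 1 = auditor flagged the case, 0 = auditor cleared it."""
    positives = sum(1 for l in labels if l == 1)
    negatives = len(labels) - positives
    true_positives = sum(1 for l, v in zip(labels, verdicts) if l == 1 and v == 1)
    false_positives = sum(1 for l, v in zip(labels, verdicts) if l == 0 and v == 1)
    return {
        "true_positive_rate": true_positives / positives if positives else 0.0,
        "false_positive_rate": false_positives / negatives if negatives else 0.0,
    }

def auditor_uplift(labels, unassisted_verdicts, assisted_verdicts) -> dict:
    """Change in detection metrics attributable to AI assistance (a higher
    true-positive rate is a benefit; a higher false-positive rate is a cost)."""
    base = detection_metrics(labels, unassisted_verdicts)
    aided = detection_metrics(labels, assisted_verdicts)
    return {metric: aided[metric] - base[metric] for metric in base}
```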

Some valuable research directions here include:

View The Alignment Project website to learn more or apply for funding here.


