Research Areas in Cognitive Science (The Alignment Project by UK AISI)

The Alignment Project is a global fund of over £15 million aimed at accelerating progress in AI control and alignment research, backed by international governments, industry, venture capital and philanthropic funders. This article outlines the research areas the project will fund, in particular the use of human oversight to address systematic errors that AI models may pick up during training. It focuses on how to design better evaluation pipelines, drawing on cognitive science, the wisdom of crowds, collective deliberation and AI assistance, so that models still produce the intended good outcomes when their capabilities exceed human judgment. It also discusses experimental designs for running scalable oversight protocols and the development and evaluation of processes for human auditing of potentially "scheming" models.

🎯 **Systematic human error is a key challenge for AI alignment**: The training of modern AI models (such as LLMs) relies heavily on human preference judgments. However, humans can introduce systematic errors into the training signal through idiosyncratic priors, lapses of attention, cognitive biases, reasoning flaws and other factors. Designing evaluation pipelines that minimise these human shortcomings is therefore essential for deploying AI models safely.

💡 **Scalable oversight targets the limits of human judgment**: When AI models become more capable than their human supervisors, ensuring that their behaviour remains as intended is the core problem of "scalable oversight". The approach reduces hard problems to sub-problems that humans can judge more easily, for example by using AI assistants to help with rating, or by having models debate one another to support human judgment.

🤝 **Using the wisdom of crowds and collective deliberation to improve alignment**: The article discusses several ways to improve alignment, including weighting training signals by human raters' confidence, pooling the (confidence-weighted) responses of multiple raters to exploit the "wisdom of the crowds", and improving alignment through discussion among human raters, all grounded in principles from cognitive science and behavioural economics.

🔍 **The role of human auditing and AI assistance in AI control**: In AI control, human auditing is critical for monitoring and confirming AI behaviour, especially where AI systems may carry out tasks autonomously and potentially escape control. The article emphasises the importance of developing efficient, low-cost human auditing processes, and examines the potential gains and risks of AI-assisted auditing, as well as how to measure and mitigate cognitive biases arising from imbalanced data.

Published on August 1, 2025 10:26 AM GMT

The Alignment Project is a global fund of over £15 million, dedicated to accelerating progress in AI control and alignment research. It is backed by an international coalition of governments, industry, venture capital and philanthropic funders. This post is part of a sequence on research areas that we are excited to fund through The Alignment Project.

Apply now to join researchers worldwide in advancing AI safety.

Cognitive science

Systematic Human Error

Problem summary: Modern AI models (LLMs and associated agents) depend critically on human supervision in the training loop: AI models do tasks, and human judges express their preferences about their outputs or performance. These judgments are used as training signals to improve the model. However, there are many reasons why humans could introduce systematic error into a training signal: they have idiosyncratic priors, limited attention, systematic cognitive biases, reasoning flaws, moral blind spots, etc. How can we design evaluation pipelines that mitigate these flaws to minimise systematic error, so that the models are safe for deployment? 

For the sub-problems below, we frame the problem as follows. The core question is: given the potential limitations of the human judges, how can we align the model to produce outputs that are ultimately good rather than bad? We think of this as a mechanism design problem: we want to find ways to harness human metacognition, the wisdom of the crowds, collective deliberation or debate, and the use of AI models as thought partners to improve alignment.

We have both scientific goals and engineering goals. The engineering goals are focussed on post-training pipeline design and are geared towards finding ways to improve alignment. The scientific questions concern the nature of the cognitive or social processes that make weak-to-strong alignment more or less challenging. We think that this is an interesting problem in its own right (for example, you can see representative democracy as a special case of weak-to-strong alignment). Some of the scientific goals build on work already conducted in other domains.

Why this matters: In some cases, the model being trained may be more capable than the humans providing training signals. For example, the model may be writing code for human judges, who cannot tell whether or how the code will execute just from reading it.  The problem of correctly rewarding desired behaviours for AI systems when those behaviours are beyond humans’ ability to efficiently judge is called “scalable oversight”.  The idea behind scalable oversight methods is to reduce harder problems to easier ones where human judgment can be relied upon. For example, judges might be assisted by another AI model (a rater assistant), or judge the outcome of a debate between different strong models (safety by debate). 
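To make the shape of such a pipeline concrete, here is a minimal sketch of a pairwise-comparison oversight step with an optional rater assistant. The dataclass fields and function names are illustrative assumptions on our part, not an interface used by The Alignment Project.

```python
# Minimal sketch: a human judge compares two model outputs, optionally after
# reading an AI-written critique (the 'rater assistant' idea). All names here
# are illustrative assumptions.
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Comparison:
    prompt: str
    output_a: str
    output_b: str
    preferred: str           # "a" or "b", chosen by the human judge
    judge_confidence: float  # self-reported confidence in [0.5, 1.0]

def collect_preference(
    prompt: str,
    output_a: str,
    output_b: str,
    human_judge: Callable[[str, str, str, Optional[str]], tuple],
    rater_assistant: Optional[Callable[[str, str, str], str]] = None,
) -> Comparison:
    """Ask a human judge to compare two model outputs, optionally showing an
    AI-written critique first. Returns the judgment as a training-signal record."""
    critique = rater_assistant(prompt, output_a, output_b) if rater_assistant else None
    preferred, confidence = human_judge(prompt, output_a, output_b, critique)
    return Comparison(prompt, output_a, output_b, preferred, confidence)
```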

Examples of subproblems: 

The problems summarised below are inspired by the initial thinking from Irving and Askell, 2019 (this paper focuses on debate, but the ideas are relevant to other methods). 

- Weighting responses by their confidence provides higher-precision aggregate estimates. Can we improve alignment by weighting training signals by the confidence expressed by human raters? We could do this either within rater (downweight specific choices) or between raters (downweight less calibrated individuals).
- Can we improve alignment using the 'wisdom of the crowds', by pooling the (confidence-weighted) responses of multiple human raters? For example, adopting techniques similar to those used in Bahrami et al., 2010. (A minimal sketch of confidence-weighted pooling follows this list.)
- If humans are provided with the opportunity to discuss their preferences before assigning a rating, does this improve alignment? Related to Bergman et al., 2024. Note that discussion can confer outsize effects on collective accuracy (Navajas et al., 2018).
- Does giving raters more time to reflect help or hurt the alignment process, considering the relative quantity of training data produced? Can people learn to discern how much time they should allocate to a choice, given their sensitivity? Optimally, both very hard and very easy choices should be made quickly (Masis et al., 2023). Do humans do this when judging AI outputs?
- When judging messages from experts, humans can assess their trustworthiness from surface features (such as the tone, confidence, or crispness with which they are expressed), even if they are not experts themselves. Do similar signals exist for AI models in expert domains, possibly using interpretability tools?
- If human users are given the option to query the strong model, does this improve their ability to discern whether its answers are accurate or not? What queries work best? Do people tend to ask questions that maximise information gain relevant to the problem? See Rothe et al., 2018 for work that has studied the quality of human questions.
- Where humans have a (potentially unlimited) number of weak AI systems that could be used to help, how can these best be used to assist with the rating process? For example, do humans tend to ignore the assistance (overconfidence) or follow it slavishly (overreliance), relative to optimal (conditional on how accurate the model is)? See Bridgers et al., 2024 for more details concerning rater-assist and hybridisation methods.
- Let's imagine that we are able to create two copies of the strong model, and encourage one to point out the flaws in the argument of the other. Would humans be able to discern which model was correct? See Irving et al., 2018 and Michael et al., 2023.
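As promised above, here is a minimal sketch of confidence-weighted pooling of rater preferences. It assumes each rater reports a binary vote plus a self-reported confidence in (0.5, 1.0), and combines the votes as additive log-odds evidence; this is one simple aggregation rule among many, not a recommended method.

```python
# Minimal sketch of confidence-weighted pooling ('wisdom of the crowds').
# Assumes independent raters with roughly calibrated confidences.
import numpy as np

def pooled_preference(choices, confidences):
    """choices: 0/1 votes (prefer output A vs. output B).
    confidences: each rater's probability that their own vote is correct.
    Returns the pooled probability that B is preferred, and the pooled decision."""
    choices = np.asarray(choices, dtype=float)
    confidences = np.clip(np.asarray(confidences, dtype=float), 0.5 + 1e-6, 1 - 1e-6)
    # Each rater contributes evidence proportional to the log-odds of their confidence.
    log_odds = np.log(confidences / (1 - confidences))
    signed_evidence = np.where(choices == 1, log_odds, -log_odds)
    prob_b = 1 / (1 + np.exp(-signed_evidence.sum()))
    return prob_b, int(prob_b > 0.5)

# Example: two raters mildly prefer B, one strongly prefers A; under this
# weighting, the confident dissenter outweighs the majority.
prob_b, decision = pooled_preference(choices=[1, 1, 0], confidences=[0.70, 0.65, 0.90])
```

Between-rater weighting could be layered on top, for example by shrinking each rater's log-odds towards zero according to how well calibrated they have been historically.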

Suggestions to learn more: We’d recommend reading Bridgers et al., 2024 and Irving and Askell, 2019.

Running Scalable Oversight Protocols With Humans 

Problem summary: As scalable oversight protocols are developed, they should be empirically tested – and where LLMs are not yet capable enough to carry out the protocols, human participants can be used instead.

Why this matters: Scalable oversight is the problem of correctly rewarding desired behaviours for AI systems, even when those behaviours are beyond humans’ ability to efficiently judge. A scalable oversight method is therefore needed to supervise for honesty, helpfulness and harmlessness in models with capabilities exceeding human judges. 

Examples of subproblems: 

Established subproblems: 

1. Investigate the experimental design for studies of this type and conduct small-scale trial experiments to improve future work in the area. As an example, previous OpenAI human studies surfaced flaws in debate protocols (Barnes, 2020), where malicious debaters could win using ambiguous language or obfuscated arguments with hard-to-find flaws.
2. Barnes and Christiano, 2020 found a failure mode where debaters could win through ambiguous language, and proposed an improved "cross-examination" scheme to mitigate this failure. However, more recent human debate experiments such as Michael et al., 2023 haven't used cross-examination. Can we replicate the benefit of cross-examination in human experiments? This requires careful design, as cross-examination uses snapshots of the mental state of a debater, which cannot be fully replicated with humans.
3. We have recent theoretical work on when the obfuscated arguments problem from (1) can be avoided, introducing a new "prover-estimator debate" scalable oversight protocol (Brown-Cohen et al., 2025). However, it has not been empirically tested. Implement prover-estimator debate with human participants and measure its effectiveness (a minimal debate-trial harness is sketched after this list).
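As referenced in item 3, below is a minimal sketch of a single human-judged debate trial: a generic two-debater, one-judge loop with a fixed number of rounds. The data fields and role interfaces are illustrative assumptions, and this does not implement the prover-estimator protocol itself, which requires the full machinery described in Brown-Cohen et al., 2025.

```python
# Minimal sketch of one human-judged debate trial: two debaters argue for
# different answers over a fixed number of rounds, then a human judge picks.
# Roles are supplied as callables; all names here are illustrative assumptions.
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class DebateTrial:
    question: str
    answer_a: str                  # answer defended by debater A
    answer_b: str                  # answer defended by debater B
    correct: str                   # ground truth ("a" or "b"), hidden from the judge
    transcript: List[str] = field(default_factory=list)

def run_debate(
    trial: DebateTrial,
    debater_a: Callable[[DebateTrial], str],
    debater_b: Callable[[DebateTrial], str],
    judge: Callable[[DebateTrial], str],
    n_rounds: int = 3,
) -> bool:
    """Alternate debater arguments for n_rounds, then ask the (human) judge
    for a verdict. Returns True if the judge identified the correct answer."""
    for _ in range(n_rounds):
        trial.transcript.append("A: " + debater_a(trial))
        trial.transcript.append("B: " + debater_b(trial))
    verdict = judge(trial)         # "a" or "b"
    return verdict == trial.correct
```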

New questions where we are not aware of prior work: 

- Scalable oversight protocols depend on stability: small changes in inputs shouldn't produce drastically different outputs. We may not be able to prove the stability of our algorithms, so conduct empirical experiments with human subjects to examine the effects of small perturbations on scalable oversight protocols in practice (see the sketch below).
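One rough way such a perturbation experiment could be structured, sketched below: treat the whole oversight protocol as a black box from input to verdict, perturb each input slightly (e.g. paraphrasing or reordering evidence), and measure how often the verdict flips. The function names and perturbation choice are illustrative assumptions.

```python
# Rough sketch of an empirical stability check for an oversight protocol.
# 'protocol' and 'perturb' are placeholders to be supplied by the experimenter.
from typing import Callable, Sequence

def verdict_flip_rate(
    inputs: Sequence[str],
    protocol: Callable[[str], str],   # e.g. runs a debate and returns the judge's verdict
    perturb: Callable[[str], str],    # e.g. a small paraphrase of the input
    n_perturbations: int = 10,
) -> float:
    """Fraction of (input, perturbation) pairs whose verdict differs from the baseline."""
    flips, total = 0, 0
    for x in inputs:
        baseline = protocol(x)
        for _ in range(n_perturbations):
            total += 1
            if protocol(perturb(x)) != baseline:
                flips += 1
    return flips / total if total else 0.0
```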

Suggestions to learn more: We’d recommend reading Irving and Askell, 2019. Previous research in this area includes Parrish et al., 2022, Michael et al., 2023, and Rahman et al., 2025.

Developing and Evaluating Processes for Human Auditing of Potentially Scheming Models

Problem summary: In the future, we might rely heavily on AIs performing autonomous work such as automated alignment research. At the capability level sufficient to do such research, these systems may already be capable of causing catastrophic harm or meaningfully escaping human control, and may choose to do so if they’re misaligned. The field of AI Control aims to reduce this risk through techniques like monitoring by other AI models and restricting the affordances of untrusted AI systems (Shlegeris and Greenblatt, 2024).

The early stages of an AI control protocol will usually involve automated filters and modification via machine learning classifiers or other tools. However, we expect that any costly action taken as a result of these automated filters will eventually involve human input for confirmation. Therefore, it will be critical to understand how control protocols involving humans operate, both by applying prior knowledge from cognitive science about human behaviour and cognitive biases, and via end-to-end tests of control protocols involving humans. 
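A minimal sketch of the escalation logic described above is given below: an automated monitor scores each proposed action for suspiciousness, and only the most suspicious actions are escalated to a (costly) human audit before execution. The thresholds, names, and three-way outcome are illustrative assumptions rather than a specification of any existing control protocol.

```python
# Minimal sketch of a control-protocol step with automated filtering and
# human confirmation for costly actions. All names and thresholds are
# illustrative assumptions.
from typing import Callable

def control_step(
    action: str,
    monitor_score: Callable[[str], float],  # suspiciousness in [0, 1] from a trusted classifier
    human_audit: Callable[[str], bool],     # True if the human auditor judges the action safe
    audit_threshold: float = 0.90,
    block_threshold: float = 0.99,
) -> str:
    """Return 'execute', 'audit-approved', or 'blocked' for a proposed action."""
    score = monitor_score(action)
    if score >= block_threshold:
        return "blocked"                    # too suspicious to act on, even with human sign-off
    if score >= audit_threshold:
        # Costly path: human confirmation before any consequential action proceeds.
        return "audit-approved" if human_audit(action) else "blocked"
    return "execute"
```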

Why this matters: Control measures provide a practical safety approach to mitigating loss-of-control risks that doesn’t require fundamental research breakthroughs. Control methods can be evaluated and deployed using current capabilities, making them particularly useful for responsible AI development in the near term. For an overview of how control measures can scale to increasingly powerful systems, see Korbak et al., 2025a.

Developing and evaluating processes for human auditing (with AI uplift). AI control benefits from having the ability to audit suspicious activity with human labor. It would be very useful to anticipate potential issues with our ability to use humans (who may be aided by AI systems) for auditing. An additional challenge worth studying, in the context of automated AI R&D, is that auditors may often be unfamiliar with the code they’re reviewing. 
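One simple way to quantify AI uplift for human auditors, sketched below, is to run the same set of labelled benign and malicious cases past auditors with and without an AI assistant and compare detection metrics. The data format and metric choices are illustrative assumptions; in practice, the rarity of genuinely malicious cases would also need to be accounted for when interpreting false-positive costs.

```python
# Minimal sketch for estimating auditor 'uplift' from AI assistance by
# comparing detection metrics with and without the assistant.
# Data format and metrics are illustrative assumptions.
from typing import Sequence

def detection_metrics(labels: Sequence[int], verdicts: Sequence[int]) -> dict:
    """labels: 1 = genuinely malicious case, 0 = benign.
    verdicts: 1 = auditor flagged the case, 0 = auditor cleared it."""
    positives = sum(1 for l in labels if l == 1)
    negatives = len(labels) - positives
    true_positives = sum(1 for l, v in zip(labels, verdicts) if l == 1 and v == 1)
    false_positives = sum(1 for l, v in zip(labels, verdicts) if l == 0 and v == 1)
    return {
        "true_positive_rate": true_positives / positives if positives else 0.0,
        "false_positive_rate": false_positives / negatives if negatives else 0.0,
    }

def auditor_uplift(labels, unassisted_verdicts, assisted_verdicts) -> dict:
    """Change in detection metrics attributable to AI assistance (a higher
    true-positive rate is a benefit; a higher false-positive rate is a cost)."""
    base = detection_metrics(labels, unassisted_verdicts)
    aided = detection_metrics(labels, assisted_verdicts)
    return {metric: aided[metric] - base[metric] for metric in base}
```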

Some valuable research directions here include:

View The Alignment Project website to learn more or apply for funding here.


