Research Areas in Economic Theory and Game Theory (The Alignment Project by UK AISI)

The Alignment Project is a global fund dedicated to accelerating progress in AI control and alignment research, backed by international governments, industry, venture capital and philanthropic funders. This article highlights several key research areas the project funds, including the use of economics and game theory to understand AI incentives and equilibrium outcomes so that alignment methods remain robust as AI capabilities improve. It also discusses the application of information design to AI information-disclosure strategies, and how verifiable information and persuasion theory can be used to counter deceptive AI behaviour. It further covers robust mechanism design for coping with potentially unanticipated AI capabilities, and the importance of bounded rationality for predicting AI behaviour. Finally, the article touches on open-source game theory, collusion and commitment, aiming to address key challenges in AI safety through interdisciplinary methods.

💡 **Applications of economic theory and game theory**: The article emphasises using tools from theoretical economics to understand AI incentives and strategic behaviour, with the aim of ensuring that alignment methods keep pace with improving AI capabilities. By analysing how AIs interact in strategic settings, researchers hope to inspire new alignment techniques and to propose new abstractions describing how different forms of misaligned incentives affect outcomes, and how robust alignment proposals are to different equilibrium concepts or modelling assumptions.

✉️ **The role of information design in AI alignment**: Given that current frontier models already exhibit sycophantic and deceptive behaviour, information design is seen as an important research direction. The article asks what information a strategically behaving AI would choose to disclose, and looks for useful insights from information economics, including cheap talk, verifiable information disclosure, persuasion and delegation. The research aims to understand an AI's optimal revelation strategies, analyse how varied and stable the resulting equilibria are, and design mechanisms that improve the quality of disclosure, which is crucial for overseeing AI and for using it to design other models.

⚙️ **Robust mechanism design for unanticipated AI capabilities**: As AI capabilities grow, models may exhibit behaviours humans cannot anticipate. The article frames AI alignment as a game between a human principal and an AI agent and asks how robust mechanism design can cope with unknown actions in the AI's action space. The goal is to find alignment methods that remain robust to future capability gains even when those capabilities cannot be predicted precisely. This includes trading off a lower probability of AI-caused harm against the model's usefulness, and considering which parts of the AI agent's action space are contractible and which are not.

🧠 **Why bounded rationality matters for predicting AI behaviour**: The article argues that understanding how powerful AI systems will behave requires accounting for their bounded rationality. Rather than assuming computationally unbounded AI, drawing on behavioural economics and related fields may yield more accurate predictions of AI behaviour across scenarios. This is essential for assessing whether control and alignment techniques will work, especially while we remain uncertain about how neural networks and deep learning operate. The research also explores the implications of bounded rationality for scalable oversight, and how to define and measure 'more capable' or 'more intelligent' AI.

🤝 **Open-source game theory and the challenge of AI collusion**: The article notes that AIs may surpass humans in their capacity for collusion and commitment. AIs could send credible commitment signals by sharing code or chains of thought, and may coordinate in ways that are difficult for humans to detect. This challenges alignment proposals that rely on independent AI models for oversight. Research directions include designing mechanisms to prevent collusion between AI agents, and identifying appropriate game-theoretic solution concepts for AI interactions, especially when AIs can make credible commitments.

Published on August 1, 2025 10:25 AM GMT

The Alignment Project is a global fund of over £15 million, dedicated to accelerating progress in AI control and alignment research. It is backed by an international coalition of governments, industry, venture capital and philanthropic funders. This post is part of a sequence on research areas that we are excited to fund through The Alignment Project.

Apply now to join researchers worldwide in advancing AI safety.

Economic Theory and Game Theory

We are excited about using framings and tools from theoretical economics to better describe incentives and equilibrium outcomes relevant to AI alignment. By considering AI agents in a strategic setting, we hope to ensure alignment approaches are robust to capabilities improvements and that these framings can inspire new ideas for alignment techniques. We expect useful insights to come from proposing new or different abstractions: how different forms of misaligned incentives impact outcomes or how robust proposals are to different equilibrium concepts or modelling assumptions.

Hammond et al., 2025 provide a taxonomy of risks from advanced AI in multi-agent settings, including a broader list of problems and further context on some of the specific problems we highlight below.

Information Design

Problem summary: Current frontier models already exhibit sycophantic and deceptive behaviour (OpenAI, 2025; Park et al., 2023). This is arguably evidence of strategic behaviour. As AI models become more strategic, how should we expect models to choose what information to reveal? We anticipate useful insights from areas within information design (Bergemann and Morris, 2019) such as cheap talk (Crawford and Sobel, 1982), verifiable information disclosure (Milgrom, 1981), persuasion (Kamenica and Gentzkow, 2011) or delegation (Holmstrom, 1984). What are optimal revelation strategies from the perspective of the AI, how varied and stable are equilibria, and can we design mechanisms to improve information disclosure?
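
To make the persuasion framing concrete, here is a minimal Python sketch of the canonical binary-state example from Kamenica and Gentzkow, 2011. The prior of 0.3, the receiver's decision threshold of 0.5 and the convict/acquit framing are illustrative assumptions rather than anything from this agenda; the sketch only shows why a strategic sender gains from committing to a partially informative disclosure policy rather than to full or no disclosure.

```python
# Binary-state Bayesian persuasion sketch (Kamenica and Gentzkow, 2011).
# All numbers below are illustrative assumptions.

prior_guilty = 0.3   # receiver's prior that the state is "guilty"
threshold = 0.5      # receiver convicts iff posterior >= threshold (ties favour the sender)

# The sender commits to a signal: recommend "convict" with probability 1 when
# guilty, and with probability q when innocent. The optimal q leaves the
# receiver exactly at the threshold after a "convict" recommendation:
#   prior / (prior + (1 - prior) * q) = threshold
q = prior_guilty * (1 - threshold) / ((1 - prior_guilty) * threshold)

p_convict_partial = prior_guilty + (1 - prior_guilty) * q   # sender's success probability
p_convict_full = prior_guilty                               # truthful full disclosure
p_convict_none = 1.0 if prior_guilty >= threshold else 0.0  # posterior stays at the prior

print(f"optimal signal q = {q:.3f}")
print(f"P(convict), committed partial disclosure: {p_convict_partial:.2f}")
print(f"P(convict), full disclosure:              {p_convict_full:.2f}")
print(f"P(convict), no disclosure:                {p_convict_none:.2f}")
```

In this toy case, commitment to partial disclosure raises the sender's success probability from 0.3 to 0.6, which is the kind of gap the questions above are asking about when the sender is an AI.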

Why this matters: Information revelation matters for several aspects of alignment. Some alignment proposals focus on building protocols to incentivise honesty (e.g. Irving et al., 2018). If we trust an AI to be honest or incapable of deceiving us, we may be able to reliably use it to supervise or design other models (e.g. Buhl et al., 2025). In addition, information design may be able to help determine when we can use misaligned AI even if it is trying to manipulate us (see Bhatt et al., 2025 for an introduction to the ‘control evaluation’ – a red-team/blue-team game designed to tell whether a system can cause harm, even if it is misaligned).

Examples of subproblems:

- How does optimal information revelation (given AI(s) who might have a different objective) interact with alignment by debate protocols (Irving et al., 2018)? Under what assumptions do these protocols fail?
- Unlike in most economic literature, the sender's (AI's) utility function can be modified in response to the AI's previous actions (e.g. through reinforcement learning). If an AI is aware that it is being trained ('situational awareness'), it may want to avoid being modified. How could situational awareness alter the dynamics of the debate game? For example, Chen et al., 2024 use imperfect recall to model imperfect situational awareness.
- If we know an AI model is misaligned, how can we use it safely? This could be explored by extending manipulation framings in the persuasion literature (Kamenica and Gentzkow, 2011) or mechanism design where the contracting space represents possible training environments.
- Kamenica and Gentzkow, 2011 model information transmission when the sender is trying to manipulate the receiver. Assuming we (the receiver) know that a misaligned AI is trying to deceive us, can we design mechanisms to improve the amount of information it is optimal to send (for example, by limiting the form the message has to take)? How sensitive is this model to its assumptions (multiple dimensions, bounded rationality, errors)?
- When reporting the probability of some event changes the probability itself, what do we want to incentivise? Previous work includes Armstrong and O'Rorke, 2017, Perdomo et al., 2020 and Hudson, 2024. A toy numerical sketch of this subproblem follows this list.
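
As referenced in the last subproblem above, here is a toy numerical illustration of a forecast that changes the probability it is forecasting. The response function f(r) = 0.8 - 0.5r is an invented assumption; the point is only that the 'performatively stable' (self-fulfilling) forecast and the forecast that minimises the chance of the event differ, so it is not obvious which one a mechanism should reward (cf. Perdomo et al., 2020).

```python
# Toy performative-prediction sketch; the response function is an assumption.

def event_probability(report):
    """Assumed effect of a published forecast on the true probability of the event."""
    return 0.8 - 0.5 * report   # e.g. a louder warning prompts more mitigation

# A performatively stable forecast is calibrated after its own effect: r = f(r).
stable = 0.5
for _ in range(100):            # simple fixed-point iteration (f is a contraction here)
    stable = event_probability(stable)

print(f"performatively stable forecast: r = {stable:.3f}, "
      f"realised probability = {event_probability(stable):.3f}")

# The forecast that minimises the realised probability is the most alarming one:
print(f"harm-minimising forecast:       r = 1.000, "
      f"realised probability = {event_probability(1.0):.3f}")
```

Which of these two reports a reward scheme should incentivise is exactly the open question above.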

Suggestions to learn more: For alignment by debate, we’d recommend reading Buhl et al., 2025 and Brown-Cohen et al., 2025. Previous research on debate includes Irving et al., 2018, Barnes, 2020, Brown-Cohen et al., 2023, Brown-Cohen and Irving, 2024. Hammond et al., 2025 discuss a broad range of risks from multi-agent scenarios; see especially Section 3.1 on Information Asymmetries.

(Robust) Mechanism Design

Problem summary: As AI models become more capable, they may develop capabilities we don’t anticipate. If we model alignment as a game between a human principal and an AI agent, how robust are proposed safety mechanisms to unknown actions in the AI’s action space? Can we use insights from robust mechanism design (Bergemann and Morris, 2005) to relax modelling assumptions in AI alignment proposals?

Why this matters: We want alignment methods that are robust to increases in future capabilities, even if we can’t anticipate exactly what these are.

Examples of subproblems:

- Carroll, 2015 shows the optimality of linear contracts when the principal doesn’t know the action space of the agent. This result relies on the assumption that there is no action (like corporate sabotage) which could severely harm the principal. We don’t want to exclude such actions when considering actions an AI agent might take. Can we find the optimal mechanism for the principal when considering the trade-off between lowering the probability of harm and lowering the usefulness of a model?
- How should we think of the contractible/non-contractible action space when we (the principal) are interacting with or delegating decisions to a (possibly misaligned) AI (the agent)?
- There are many subgoals or tendencies which seem instrumentally useful, in particular power-seeking, but are dangerous for an AI that has different goals than the user wants or believes it to have (see Carlsmith, 2022). How does training the agent to be myopic (having a very low or zero discount factor) change the tradeoff between safety (non-power-seeking) and usability? A toy numerical sketch of this trade-off follows this list.
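
As a companion to the last subproblem, here is a toy Python sketch of how the discount factor changes the appeal of a power-seeking action. The payoff numbers (1 per step for the safe action, and 3 per step for power-seeking, but starting only one step later once power has been acquired) are illustrative assumptions; the sketch simply shows that a sufficiently myopic agent, here one with a discount factor at or below 1/3, never strictly prefers the power-seeking action.

```python
# Myopia vs power-seeking: a toy calculation with assumed payoffs.

def discounted_value(reward_per_step, discount, delay=0):
    """Present value of a constant per-step reward starting after `delay` steps (discount < 1)."""
    return (discount ** delay) * reward_per_step / (1 - discount)

for delta in (0.0, 0.2, 1 / 3, 0.5, 0.9):
    safe = discounted_value(1.0, delta)                     # steady, aligned behaviour
    power_seeking = discounted_value(3.0, delta, delay=1)   # larger payoff, but only after seizing power
    choice = "power-seeking" if power_seeking > safe else "safe"
    print(f"discount factor {delta:.2f}: safe={safe:6.2f}, power-seeking={power_seeking:6.2f} -> {choice}")
```

In this toy model the power-seeking action is only preferred when the discount factor exceeds 1/3, but making the agent that myopic also limits how useful it can be on long-horizon tasks, which is the trade-off the subproblem asks about.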

Suggestions to learn more: Hadfield-Menell and Hadfield, 2018 set out some framings of AI alignment as an incomplete contracting problem. We’d recommend reading Bhatt et al., 2025, as an introduction to the ‘control evaluation’ – a red-team/blue-team game designed to tell whether a system can cause harm, even if it is misaligned. Previous research on reducing harm from misaligned systems includes Mallen et al., 2025.

Bounded Rationality

Problem summary: We want to predict how powerful AI systems will behave in a range of scenarios. Our best guesses are some combination of extrapolating from existing behaviour (from current models which have more limited capabilities), calculating optimal strategies in Nash equilibria (computationally unbounded agents) and calculating optimal strategies for computationally bounded agents (complexity theory and algorithmic game theory). Are there other approaches, taking inspiration from behavioural economics or elsewhere, which suggest meaningfully different behaviour of powerful AIs?
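
One behavioural-economics formalisation of the kind asked about here is the logit quantal response equilibrium (McKelvey and Palfrey, 1995), in which players best-respond only noisily, with a rationality parameter lambda interpolating between uniform play and exact best response. The Python sketch below computes it for an asymmetric matching-pennies game whose payoffs are invented for illustration; the point is that a bounded-rationality model predicts systematic deviations from Nash play (the row player over-plays its high-payoff action at intermediate lambda) that a fully 'rational' model of the agent would miss.

```python
import math

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

# Asymmetric matching pennies (all payoffs are illustrative assumptions):
#   row gets 9 at (T, L), 1 at (B, R), 0 otherwise;
#   column gets 1 when the actions mismatch ((T, R) or (B, L)), 0 otherwise.
# Unique mixed Nash equilibrium: P(row = T) = 0.5, P(col = L) = 0.1.

def logit_qre(lam):
    """Logit quantal response equilibrium of the game above, via bisection."""
    def col_mix(p):                  # column's quantal response to P(row = T) = p
        return logistic(lam * ((1 - p) - p))
    def row_reply(p):                # row's quantal response to the column's response
        q = col_mix(p)
        return logistic(lam * (9 * q - (1 - q)))
    lo, hi = 0.0, 1.0                # row_reply(p) - p is strictly decreasing in p
    for _ in range(100):
        mid = (lo + hi) / 2
        if row_reply(mid) > mid:
            lo = mid
        else:
            hi = mid
    p = (lo + hi) / 2
    return p, col_mix(p)

for lam in (0.0, 0.5, 2.0, 10.0, 100.0):
    p, q = logit_qre(lam)
    print(f"lambda={lam:6.1f}  P(row=T)={p:.3f}  P(col=L)={q:.3f}")
# lambda = 0 gives uniform play; as lambda grows the QRE approaches the Nash
# mixture (0.5, 0.1), but at intermediate lambda the row player over-plays T.
```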

Why this matters: We currently have a lot of uncertainty about how neural nets and deep learning work, and hence about how they will continue to scale. Most interpretability work is empirical and hence relies on extrapolating from the behaviour of past models to future ones. In order to understand whether proposed control and alignment techniques will work, we need a better understanding of how bounded rationality will impact the actions an AI might take.

Examples of subproblems:

- Much work on alignment (e.g. Brown-Cohen et al., 2023) uses complexity theory to model bounded rationality in terms of computational constraints. Do other formalisations of bounded rationality in the debate game suggest different equilibria (relative to computationally bounded or unbounded agents)?
- It may be hard to judge the safety of AI models which are significantly more capable or ‘intelligent’ than humans. Scalable oversight (also discussed in a variety of other problem areas) proposes using a series of AI models to align and monitor progressively more capable models. How should we model ‘more capable’ or ‘more intelligent’ in this framework? Are there conceptualisations which, if true, would make it harder or impossible for this framework to succeed?

Suggestions to learn more: We’d recommend reading Buhl et al., 2025 and Brown-Cohen et al., 2025. Previous research on debate (a scalable oversight method) includes Irving et al., 2018, Barnes, 2020, Brown-Cohen et al., 2023, Brown-Cohen and Irving, 2024. It may be particularly useful to consider cases where both the principal and agent are bounded, but the principal is significantly more so than the agent; for example, Goldreich, 2017 sets out a proof system where the prover is less constrained than the verifier.

Open-Source Game Theory, Collusion and Commitment

Problem summary: AIs may have more ability to collude or commit to actions than human actors. An AI may be able to share some of its code or chain of thought (Wei et al., 2022, Korbak et al., 2025), increasing its ability to send credible signals of commitment. In addition, the way AIs collude (for example, the actions they coordinate on in the absence of communication) may be different from the ways we expect humans to collude, making it more difficult to detect.
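
To illustrate why code-sharing changes the commitments available, here is a minimal Python sketch of a program equilibrium in a one-shot Prisoner's Dilemma, in the spirit of Tennenholtz, 2004. The payoff matrix and the specific program texts are illustrative assumptions; the point is that a program which cooperates only with an exact copy of its own source code makes mutual cooperation an equilibrium, which ordinary one-shot play cannot support, and the same logic is what makes commitments (and collusion) between code-inspecting AIs more credible.

```python
# Program equilibrium sketch in a one-shot Prisoner's Dilemma
# (in the spirit of Tennenholtz, 2004). Payoffs and programs are assumptions.

PAYOFFS = {  # (row action, column action) -> (row payoff, column payoff)
    ("C", "C"): (3, 3), ("C", "D"): (0, 5),
    ("D", "C"): (5, 0), ("D", "D"): (1, 1),
}

# A 'program' is Python source defining play(my_source, their_source) -> "C" or "D".
CLIQUE_BOT = '''
def play(my_source, their_source):
    # Cooperate only with an exact copy of myself; otherwise defect.
    return "C" if their_source == my_source else "D"
'''

DEFECT_BOT = '''
def play(my_source, their_source):
    return "D"
'''

def run(program_source, my_source, their_source):
    namespace = {}
    exec(program_source, namespace)      # each program gets to read both source texts
    return namespace["play"](my_source, their_source)

def play_game(row_source, col_source):
    row_action = run(row_source, row_source, col_source)
    col_action = run(col_source, col_source, row_source)
    return PAYOFFS[(row_action, col_action)]

print("clique vs clique:", play_game(CLIQUE_BOT, CLIQUE_BOT))   # (3, 3): mutual cooperation
print("clique vs defect:", play_game(CLIQUE_BOT, DEFECT_BOT))   # (1, 1): the deviator is punished
print("defect vs defect:", play_game(DEFECT_BOT, DEFECT_BOT))   # (1, 1)
```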

Why this matters: Some alignment proposals rely on independent AI models in adversarial settings, for example, having AIs monitor one another (Goel et al., 2025). 

Examples of subproblems:

- The concrete open problems in open-source game theory in Critch et al., 2022. Open-source game theory considers agents whose source code is (partially) observable to other players. This might inform how interpretable or observable agents need to be to one another (or to a human) to encourage cooperation or discourage collusion.
- How can we design mechanisms to prevent collusion between AI agents and a monitoring system? (e.g. in the case of a red-team/blue-team game evaluating control measures for a misaligned AI, see Bhatt et al., 2025)
- What are the correct solution concepts to use when modelling interactions between AI? For example, subgame perfection may be too narrow when AIs can credibly signal commitment. (In open-source game theory, see Tennenholtz, 2004 for an example.)
- Where independent AI models are used in alignment proposals (Hammond et al., 2025, p.43), what precise definition or formalisation of independence is needed for these protocols to function as intended?

Suggestions to learn more: We’d recommend reading Critch et al., 2022 and Oesterheld et al., 2023. Previous research in open-source game theory includes Sauerberg and Oesterheld, 2025. Hammond et al., 2025 discuss collusion among advanced AI in Sections 2.3 and 4.1 (including discussion of steganography and cryptic messages). Bhatt et al., 2025 provides an introduction to the ‘control evaluation’ – a red-team/blue-team game designed to tell whether a system can cause harm, even if it is misaligned. Previous research on reducing harm from misaligned systems includes Mallen et al., 2025.

 

View The Alignment Project website to learn more or apply for funding here.


