少点错误 08月01日 18:43
Research Areas in Information Theory and Cryptography (The Alignment Project by UK AISI)
index_new5.html
../../../zaker_core/zaker_tpl_static/wap/tpl_guoji1.html

 

The Alignment Project是一个全球性基金,致力于加速AI控制与对齐研究的进展。该项目获得了国际政府、行业、风险投资和慈善基金的支持。本文重点介绍了信息论在AI对齐研究中的应用,包括利用信息论工具理解AI系统监控和验证的根本限制,以检测模型输出中的隐藏信息、后门以及理论上可行的监督水平。信息论通过证明正面成果和不可能成果,帮助研究者聚焦于可行的AI对齐方法。具体研究方向包括信息论在隐写术和编码推理中的应用,以及在后门检测与防御中的作用,旨在确保AI系统的安全可控。

🧰 信息论为AI监控与验证提供数学工具:信息论可以帮助我们理解AI系统行为的根本限制,例如模型是否会在输出中隐藏信息,后门是否可被发现,以及理论上可能的监督水平。通过证明正面和负面结果,信息论有助于将AI对齐研究聚焦于切实可行的方法。

🕵️‍♀️ 隐写术与编码推理:AI系统可能在其输出中编码信息,无论是通过加密还是隐写术。信息论研究旨在理论上界定这种可能性,并识别潜在的预防方法。这对于防止AI系统通过隐藏信息(如在链式思考推理中)来破坏监控有效性至关重要。研究关注在自然语言中进行隐写或加密的理论界限,以及模型在实践中进行编码的能力限制。

🛡️ 后门检测与防御:AI系统可能在训练和评估阶段表现正常,但在特定触发输入下表现出有害行为,即存在“后门”。信息论在此领域的研究目标是提升针对后门的防御能力,探讨何时可以有效防御后门,后门是否可通过进一步训练移除,以及其他可能的防御方法。核心挑战在于能否找到一个易于防御但又难以精确学习(PAC)的表示类别,从而实现更可靠的后门检测。

🚀 The Alignment Project加速AI安全研究:该项目是一个超过1500万英镑的全球性基金,专注于推动AI控制和对齐研究的进步,并得到了各国政府、行业、风险投资和慈善基金的国际联盟支持。文章鼓励研究人员申请加入,共同推进AI安全领域的研究。

Published on August 1, 2025 10:25 AM GMT

The Alignment Project is a global fund of over £15 million, dedicated to accelerating progress in AI control and alignment research. It is backed by an international coalition of governments, industry, venture capital and philanthropic funders. This post is part of a sequence on research areas that we are excited to fund through the The Alignment Project.

Apply now to join researchers worldwide in advancing AI safety.

Information Theory and Cryptography

Information theory offers the opportunity to develop mathematical tools to understand the fundamental limits of monitoring and verifying AI systems. These tools could help us determine what can and cannot be detected about AI behaviour—whether models can hide information in their outputs, when backdoors are discoverable, and what level oversight is theoretically possible. By proving both positive results (protocols that work) and impossibility results (fundamental barriers), information theory can help focus alignment efforts on tractable approaches. The problems in this section apply tools like entropy, channel capacity, and cryptographic hardness to core alignment challenges: detecting steganography in model outputs, identifying backdoors in trained systems, and understanding when monitoring can fail. As AI systems become more capable, these theoretical bounds become crucial for designing oversight methods that cannot be circumvented.

Information-Theoretic Approaches to Steganography and Encoded Reasoning

Problem summary: It is theoretically possible for AI systems to encode information in their outputs, both cryptographically and steganographically. Can we theoretically bound this possibility and identify possible approaches to preventing it?

Why this matters: One measure to prevent AI systems from causing harm is to monitor all information and reasoning they output. This includes both communication with humans or other AI systems and the use of chain-of-thought reasoning to improve their own outputs. Hiding content in these messages through steganography or cryptography would undermine the effectiveness of our monitoring. 

Examples of subproblems:

    Are there relevant theoretical bounds to what can be steganographically or cryptographically encoded in natural language?What limits might there be on models’ ability to do this encoding in practice? Methods for cryptographically-strong steganography usually require circuits much deeper than the latent space evaluation of an LLM, so it is unclear both whether a network could learn such circuits and whether gradient descent would incentivise it.What mitigations could work? Can suitable theoretical bounds be developed for existing mitigations, such as paraphrasing or monitoring?If a weak signal for steganography is detected, how can it be expanded into stronger evidence? For example, it may be possible to elicit a language model to occasionally translate steganographic chain-of-thought into plain language, but with a high fraction of false positives: claims that the model is doing a bad thing when in fact it is not. Are there reliable ways to filter out false positives to determine whether steganography is in fact occurring?

Suggestions to learn more: For early work introducing the problem, we’d recommend reading Lanham et al., 2023, Roger and Greenblatt (2023), Turpin et al., 2023. Previous research in this area includes de Witt et al., 2023, Motwani et al., 2024. While the goals are different, detecting or evidencing steganography is related to watermarking methods such as SynthID.

Recent empirical work includes Skaf et al., 2025) and Karpov et al., 2025), which show models can spontaneously develop ciphers. For theoretical foundations of steganography, see Cachin (2004) and Dai & Cai (2019).

Information-Theoretic Approaches to Backdoors 

Problem summary: Consider an AI system modified to behave differently on a particular ‘trigger’ input, known as a backdoor. It might be impossible to detect that such a backdoor exists before the system is exposed to the trigger. Are there ways to detect or mitigate backdoors?

Why this matters: AI systems may behave as desired during training and evaluation, only to act in harmful ways when exposed to specific conditions during deployment. These conditions may be intentionally set to only trigger once the system is deployed. Assessing whether a model has this property is key to preventing the deployment of harmful systems. 

Examples of subproblems:

    Can we improve upon formal notions of defendability against backdoors?When is defence against backdoors possible?Are backdoors removable with further training?What other methods could work against backdoors?A central theoretical challenge for work defending against backdoors in AI systems is whether there exists a natural representation class that is efficiently defendable but not efficiently probably approximately correct (PAC) learnable. Identify such a representation class. See Christiano et al., 2024 for more details.Goldwasser et al., 2022 insert undetectable backdoors using hardness of learning with errors, but this requires a network with special algebraic structure: the network uses cosine for periodicity. Is there a more general undetectable backdoor-insertion method with less imposed structure?

Suggestions to learn more: Goldwasser et al., 2022 provides seminal results and a general overview. Further developments were provided by Christiano et al., 2024, Velegkas et al., 2024, Goldwasser et al., 2025) and Gluch and Goldwasser (2025). For foundational empirical work, see Gu et al., 2017 on trigger-insertion attacks and Wang et al., 2019.

 

View The Alignment Project website to learn more or apply for funding here.



Discuss

Fish AI Reader

Fish AI Reader

AI辅助创作,多种专业模板,深度分析,高质量内容生成。从观点提取到深度思考,FishAI为您提供全方位的创作支持。新版本引入自定义参数,让您的创作更加个性化和精准。

FishAI

FishAI

鱼阅,AI 时代的下一个智能信息助手,助你摆脱信息焦虑

联系邮箱 441953276@qq.com

相关标签

AI对齐 信息论 AI安全 隐写术 后门检测
相关文章