Published on August 1, 2025 10:25 AM GMT

The Alignment Project is a global fund of over £15 million, dedicated to accelerating progress in AI control and alignment research. It is backed by an international coalition of governments, industry, venture capital and philanthropic funders. This post is part of a sequence on research areas that we are excited to fund through the The Alignment Project.

Apply now to join researchers worldwide in advancing AI safety.

Information Theory and Cryptography

Information theory offers the opportunity to develop mathematical tools to understand the fundamental limits of monitoring and verifying AI systems. These tools could help us determine what can and cannot be detected about AI behaviour—whether models can hide information in their outputs, when backdoors are discoverable, and what level oversight is theoretically possible. By proving both positive results (protocols that work) and impossibility results (fundamental barriers), information theory can help focus alignment efforts on tractable approaches. The problems in this section apply tools like entropy, channel capacity, and cryptographic hardness to core alignment challenges: detecting steganography in model outputs, identifying backdoors in trained systems, and understanding when monitoring can fail. As AI systems become more capable, these theoretical bounds become crucial for designing oversight methods that cannot be circumvented.

Information-Theoretic Approaches to Steganography and Encoded Reasoning

Problem summary: It is theoretically possible for AI systems to encode information in their outputs, both cryptographically and steganographically. Can we theoretically bound this possibility and identify possible approaches to preventing it?

Why this matters: One measure to prevent AI systems from causing harm is to monitor all information and reasoning they output. This includes both communication with humans or other AI systems and the use of chain-of-thought reasoning to improve their own outputs. Hiding content in these messages through steganography or cryptography would undermine the effectiveness of our monitoring.

Examples of subproblems:

could

Suggestions to learn more: For early work introducing the problem, we’d recommend reading Lanham et al., 2023, Roger and Greenblatt (2023), Turpin et al., 2023. Previous research in this area includes de Witt et al., 2023, Motwani et al., 2024. While the goals are different, detecting or evidencing steganography is related to watermarking methods such as SynthID.

Recent empirical work includes Skaf et al., 2025) and Karpov et al., 2025), which show models can spontaneously develop ciphers. For theoretical foundations of steganography, see Cachin (2004) and Dai & Cai (2019).

Information-Theoretic Approaches to Backdoors

Problem summary: Consider an AI system modified to behave differently on a particular ‘trigger’ input, known as a backdoor. It might be impossible to detect that such a backdoor exists before the system is exposed to the trigger. Are there ways to detect or mitigate backdoors?

Why this matters: AI systems may behave as desired during training and evaluation, only to act in harmful ways when exposed to specific conditions during deployment. These conditions may be intentionally set to only trigger once the system is deployed. Assessing whether a model has this property is key to preventing the deployment of harmful systems.

Examples of subproblems:

Christiano et al., 2024

Goldwasser et al., 2022

Suggestions to learn more: Goldwasser et al., 2022 provides seminal results and a general overview. Further developments were provided by Christiano et al., 2024, Velegkas et al., 2024, Goldwasser et al., 2025) and Gluch and Goldwasser (2025). For foundational empirical work, see Gu et al., 2017 on trigger-insertion attacks and Wang et al., 2019.

View The Alignment Project website to learn more or apply for funding here.

Discuss

Information Theory and Cryptography

Information-Theoretic Approaches to Steganography and Encoded Reasoning

Information-Theoretic Approaches to Backdoors

Fish AI Reader

FishAI

联系邮箱 441953276@qq.com

相关标签