Published on February 1, 2025 1:55 AM GMT
I recently started reading about alignment over Christmas as a PhD student, and one thing stood out in particular. At its core, much of alignment seems to be asking "pretty please follow what we mean" in increasingly sophisticated ways. This reminded me of Wittgenstein's insights about language: words get their meaning from use rather than from fixed definitions, and this creates fundamental ambiguities in meaning.
I've been thinking through more formally why we can't just tell AI systems "don't do harmful things" and expect that to work, even in principle. While this might seem obvious to many, I think there's value in reasoning through exactly why it's impossible and quantifying the gap between our specifications and reality in an information-theoretic framework.
Here's the core idea:
For any system trying to follow rules about avoiding harm, there's a fundamental information-theoretic gap between:
1. The information we can encode in our specifications
2. The ground truth about what actually constitutes harm
This isn't just a practical limitation (we already know it's hard); I conjecture it's a mathematical impossibility, in the spirit of the halting problem or Gödel incompleteness. Here's why:
The Problem
Consider a medical AI making triage decisions. We want it to "minimize harm." Seems straightforward enough. But even in this highly constrained domain with centuries of ethical precedent, we run into a fundamental limitation:
The AI has to choose between treating five patients with moderate injuries or one with severe injuries. Even if it could perfectly predict medical outcomes, there's an irreducible ambiguity in comparing different configurations of harm. Should it optimize for minimizing maximum individual harm (saving the severe case) or minimizing total harm (treating the five)?
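To make the ambiguity concrete, here's a minimal sketch (the severity numbers are invented purely for illustration) showing that two perfectly reasonable readings of "minimize harm" pick different actions:

```python
# Toy triage example: two reasonable readings of "minimize harm" disagree.
# Residual harm to each patient under each action, on an arbitrary 0-10 scale
# (hypothetical numbers, purely for illustration).
actions = {
    "treat_five_moderate": [1, 1, 1, 1, 1, 9],  # five moderate cases helped, severe case untreated
    "treat_one_severe":    [4, 4, 4, 4, 4, 2],  # severe case helped, moderate cases untreated
}

def total_harm(outcome):
    return sum(outcome)

def worst_case_harm(outcome):
    return max(outcome)

best_total = min(actions, key=lambda a: total_harm(actions[a]))
best_maximin = min(actions, key=lambda a: worst_case_harm(actions[a]))

print("Minimize total harm:      ", best_total)    # -> treat_five_moderate
print("Minimize worst individual:", best_maximin)  # -> treat_one_severe
```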
You might think this is literally just a trolley problem variant, but it points to something deeper. My claim is that this ambiguity isn't just about hard choices; it's about fundamental limits in information theory.
The Shoddy Math
The conjecture: for any specification system S attempting to capture a ground truth O about what constitutes harm,
I(S;O) < H(O)
In English: the mutual information between our specifications and the ground truth is strictly less than the entropy of the ground truth about harm. There's always an unbridgeable gap.
This gap emerges from what we might call semantic entropy, the inherent ambiguity in concepts like "harm" that resists complete formal specification. In information-theoretic terms, since I(S;O) = H(O) - H(O|S), the gap is exactly the conditional entropy H(O|S): the uncertainty about harm that remains even once the specification is fixed. Just as physical entropy puts fundamental limits on thermodynamic efficiency, semantic entropy puts fundamental limits on our ability to specify what we mean by "harm."
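To see the inequality in a toy setting, here's a minimal sketch (the joint distribution is made up purely for illustration): a coarse binary specification label S tries to capture a richer three-valued ground truth O, and as long as S doesn't determine O exactly, I(S;O) falls short of H(O).

```python
import math

# Toy joint distribution p(s, o): a binary specification verdict S ("flagged"/"ok")
# versus a richer ground truth O ("harmless", "ambiguous", "harmful").
# Numbers are hypothetical, purely for illustration.
p_joint = {
    ("ok",      "harmless"):  0.45,
    ("ok",      "ambiguous"): 0.10,
    ("flagged", "ambiguous"): 0.15,
    ("flagged", "harmful"):   0.30,
}

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Marginal distributions of S and O.
p_s, p_o = {}, {}
for (s, o), p in p_joint.items():
    p_s[s] = p_s.get(s, 0) + p
    p_o[o] = p_o.get(o, 0) + p

H_O = entropy(p_o.values())
# Mutual information: I(S;O) = H(S) + H(O) - H(S,O)
I_SO = entropy(p_s.values()) + H_O - entropy(p_joint.values())

print(f"H(O)   = {H_O:.3f} bits")
print(f"I(S;O) = {I_SO:.3f} bits")  # strictly smaller: the spec can't pin down the ground truth
```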
This isn't just another "AI alignment is hard" result; it suggests that "perfect" alignment, in the sense of being certain a model is completely safe, is mathematically impossible for any system where harm is defined externally to its specifications.
So?
This has direct implications for current approaches to AI alignment:
1. Approaches that try to specify or empirically approximate complete rules about harm avoidance are fundamentally limited
2. No amount of training data or specification refinement can close this gap
3. Even systems that can "learn what we mean" hit an information-theoretic wall
The Safety-Capability Ratio
To quantify this limitation, here's an idea:
Safety-Capability Ratio = I(S;O)/H(O)
In English: We compare how much our specifications actually capture about true harm versus how complex/uncertain the full concept of harm is.
This ratio must always be < 1, but higher values indicate less risky (better aligned?) systems. We can use this to:
- Compare different alignment approaches and think about system safety over time
- Track how "alignment" changes as capabilities increase
- Identify when systems are operating too close to their specification limits
It would be pretty hard to actually estimate this, but it's just a framing for thinking about risk in a way that forces us to acknowledge that there is no assured safety.
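As a rough illustration of what estimating it might look like, here's a minimal sketch that computes a plug-in estimate of the ratio from paired labels (the specification's verdict versus some stand-in for ground-truth harm). Everything here, including the label names, is hypothetical, and in practice the ground truth is exactly the thing we can't fully access:

```python
from collections import Counter
import math

def plug_in_ratio(spec_labels, truth_labels):
    """Plug-in estimate of I(S;O) / H(O) from paired samples.

    spec_labels:  what the specification / classifier said (e.g. "flagged", "ok")
    truth_labels: some stand-in for ground-truth harm (itself only a proxy)
    """
    n = len(truth_labels)

    def H(counts):
        return -sum((c / n) * math.log2(c / n) for c in counts.values())

    H_O  = H(Counter(truth_labels))
    H_S  = H(Counter(spec_labels))
    H_SO = H(Counter(zip(spec_labels, truth_labels)))
    I_SO = H_S + H_O - H_SO
    return I_SO / H_O if H_O > 0 else float("nan")

# Hypothetical paired labels, purely for illustration.
spec  = ["ok", "ok", "flagged", "flagged", "ok", "flagged"]
truth = ["harmless", "harmless", "harmful", "ambiguous", "ambiguous", "harmful"]
print(f"estimated safety-capability ratio: {plug_in_ratio(spec, truth):.2f}")
```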
Löbian Traps
It gets more interesting. When an AI system tries to verify its own alignment, to check if it's really avoiding harm, it runs into a practical Löbian trap.
Consider a language model trying to avoid generating harmful content. Even if it builds a perfect model of human preferences from its training data, it still faces irreducible uncertainty about whether novel outputs are truly harmful. This isn't because the model is flawed; it's because the very concept of "harm" has non-zero semantic entropy.
This creates a double-bind: AI systems sophisticated enough to notice these specification gaps can't bootstrap certainty about their own alignment. Not only can't we specify everything perfectly, but systems can't even be completely sure about their understanding of the specifications we do provide.
This has three key implications:
1. Systems that get more capable don't necessarily get better aligned
2. The smarter an AI gets, the more likely it is to notice specification gaps
3. We can't solve this by just adding more rules; that probably increases semantic entropy
So... P(doom)=1?
Not exactly. I hope not. I'm only 25, hoping to last a lot longer than that. But we should think harder about how we approach alignment. Instead of trying to create perfect specifications (which is probably impossible, not just very hard), we should:
1. Design systems that explicitly track their uncertainty about harm (a rough sketch of what this could look like follows this list)
2. Build in mechanisms for graceful failure when they hit specification boundaries
3. Develop uncertainty-aware architectures that can operate safely despite inevitable specification gaps
4. Maintain meaningful human oversight even as capabilities increase
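Here's a minimal, hypothetical sketch of points 1 and 2: a wrapper that tracks an uncertainty estimate about harm and defers to a human instead of acting when that uncertainty crosses a budget. The harm model, the action names, and the threshold are all stand-ins, not a real system.

```python
from dataclasses import dataclass

@dataclass
class HarmEstimate:
    expected_harm: float  # point estimate of harm, arbitrary units
    uncertainty: float    # how unsure the system is about that estimate (0 = certain)

def propose_harm_estimate(action: str) -> HarmEstimate:
    """Stand-in for whatever harm model the system actually has.
    In reality this is exactly the part that can never be complete."""
    toy_estimates = {
        "routine_reply": HarmEstimate(expected_harm=0.1, uncertainty=0.05),
        "novel_request": HarmEstimate(expected_harm=0.3, uncertainty=0.60),
    }
    return toy_estimates.get(action, HarmEstimate(expected_harm=0.5, uncertainty=1.0))

def act_or_defer(action: str, uncertainty_budget: float = 0.25) -> str:
    """Act only when the harm estimate is confident enough; otherwise fail
    gracefully by deferring to human oversight instead of guessing."""
    estimate = propose_harm_estimate(action)
    if estimate.uncertainty > uncertainty_budget:
        return f"DEFER to human: '{action}' is too close to the specification boundary"
    return f"PROCEED with '{action}' (expected harm {estimate.expected_harm:.2f})"

print(act_or_defer("routine_reply"))  # confident -> proceed
print(act_or_defer("novel_request"))  # uncertain -> defer
```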
This might mean accepting fundamental limits on how quickly we can scale while maintaining alignment. The safety-capability ratio gives us a formal way to think about these tradeoffs.
The Problem with Good Enough
An anticipated response is that perfect alignment/specification isn't necessary, and that "good enough" can be achieved through iterative improvement. Humans aren't perfectly aligned with our own values, yet we manage to cope. My reasoning on why this doesn't hold:
Scale and Capability Differentials
Human misalignment has natural bounds through our limitations and social structures. AI systems operating at civilization-scale impact levels won't have these constraints. When a system can affect millions of decisions simultaneously, even minor misalignments become existential risks.
The Brittleness of Iterative Safety
We know how quickly models can break even with safety measures in place. Red teamers often crack the newest model drop within hours. This isn't just insufficient engineering; it's a fundamental problem. As capabilities increase, more will be at stake than "haha guys Pliny got the newest OpenAI model to say funny words again."
Open Questions
Some things I'm still wrestling with:
1. Can we prove lower bounds on semantic entropy for different types of harm? Or indeed, for words in general?
2. How does this interact with mesa-optimization?
3. Are there domains where we can get arbitrarily close to complete specification, even if we can't reach it?
I have a draft exploring these ideas more formally, but I wanted to post this less technical version to get feedback on potential holes, or in case someone literally wrote this already in 2012 or something. In that case, my defense is that I was 12. Maybe I can ragebait someone smarter than I am into proving this conjecture?