Why Alignment Fails Without a Functional Model of Intelligence: A Structural Invariant You Can’t Train Your Way Around

 

A recent paper argues that, without a Functional Model of Intelligence (FMI), AI alignment cannot be reliably achieved. On this view, a system's ability to stay aligned under conceptual novelty is not something that training or external enforcement can supply; it must be guaranteed by the system's internal structure. The FMI comprises a set of internal functions that maintain semantic integrity: evaluation, modeling, adaptation, stabilization, decomposition, and bridging. A system lacking these functions may perform well within its training distribution, yet its alignment collapses once it generalizes to new concepts. The challenge of AI alignment is therefore not a shortfall of capability or an empirical fragility but a fundamental architectural deficit, requiring a shift from observing behavior to analyzing the structural integrity of intelligent systems.

💡 The fundamental challenge of AI alignment lies in generalization under conceptual novelty, not in training data or external oversight. The paper argues that if a system lacks a set of internal functions for maintaining semantic coherence, then even a highly capable system that behaves well during training will inevitably see its alignment degrade once it encounters concepts outside the training distribution. This is a structural necessity, not a shortfall of capability or experience.

🚀 The article proposes the Functional Model of Intelligence (FMI) as the key to achieving alignment. The FMI consists of six core internal functions: evaluation (judging whether a state is coherent), modeling (simulating the effect of a transformation), adaptation (restructuring reasoning in response to failure), stabilization (preventing local errors from spreading), decomposition (isolating errors within complex states), and bridging (translating between conceptual domains or frames). Together these functions preserve semantic integrity throughout the system's reasoning.

⚠️ Many existing alignment approaches, such as RLHF, supervised fine-tuning, and imitation learning, rely on behavioral adjustment during training or on external control. They all share the implicit assumption that what the system learns in training will generalize safely. The paper argues that this assumption is structurally flawed: without internal structural guarantees, the space of misalignment under conceptual novelty is unbounded, and no behavior-oriented method can resolve the problem at its root.

🔄 The article uses a “human flourishing” example to show how a missing bridging function can cause alignment failure. When a system encounters an entirely new formulation of flourishing, such as “preserving intergenerational knowledge and cultural cohesion,” a system without bridging may fail to connect it to its learned frame and discard the input as incoherent, even though it matches the user's actual intent. This is not adversarial optimization or instrumental misalignment, but a silent failure of semantic coherence in the face of novelty.

🎯 The result has far-reaching implications for the alignment field, drawing a boundary around what counts as genuine alignment work. Semantic coherence cannot be forced through optimization or prompting; the internal structure that makes it possible must be built. This requires researchers to move from purely behavioral evaluation to a deeper analysis of the structural integrity of intelligent systems, in order to handle context shifts, recursive generalization, and conceptual novelty.

Published on July 18, 2025 6:02 PM GMT

This post outlines the core result from my recent paper, Why AI Alignment Is Not Reliably Achievable Without a Functional Model of Intelligence: A Model-Theoretic Proof. The paper argues that alignment under conceptual novelty isn’t something you can train into a system or enforce externally. It must be structurally guaranteed from within—and that guarantee depends on the presence of specific internal functions. These functions, taken together, define what I refer to as the Functional Model of Intelligence (FMI).

The central claim is simple: without a complete set of coherence-preserving internal functions, no system can maintain alignment once it begins to generalize beyond its training distribution. This is not a capabilities limitation or an empirical fragility—it’s a structural necessity. What do I mean by that?

A capabilities limitation refers to a system that fails to align because it simply isn’t smart enough: it can’t model the world, can’t reason abstractly, can’t understand what the user meant, or can’t generate sophisticated plans. These are systems that fail because they lack power.

An empirical fragility is different: it refers to alignment that appears to hold during training or evaluation but breaks when the system encounters something outside the training distribution. In this case, the system might be highly capable, but the alignment turns out to be brittle—it was an artifact of specific data, goals, or constraints, and doesn't survive generalization.

The result in the paper rules out both of these explanations as the core issue. It shows that even capable systems, and even those that generalize well under most conditions, will eventually fail to preserve alignment unless they possess a specific internal structure. That structure is what enforces coherence under recursive transformation. Without it, misalignment isn't a matter of insufficient training or poor objective design. It’s baked into the architecture.

That’s why this result isn't a conjecture about poor behavior or empirical robustness—it’s a structural impossibility claim. If your system doesn't have the machinery to preserve semantic integrity across reasoning transitions, then it won’t stay aligned—because it can't.


 

1. Motivation

Most alignment approaches today focus on training—reward shaping, fine-tuning, imitation learning, RLHF, oversight, debate, and so on. That makes sense if you believe that alignment is a matter of external control or post hoc behavioral tuning.

But all of these approaches share an implicit assumption: that what a system learns during training will generalize in a safe or at least bounded way when the system encounters a new concept, problem, or framing.

This assumption isn’t just questionable—it’s structurally flawed. The space of misalignment under conceptual novelty is unbounded unless the system has internal structure that enforces coherence as it generalizes.

2. The Argument (Informally)

The formal argument is in the paper, but the idea is intuitive.

When we care about alignment, we don’t just care about what the system does now—we care about how its reasoning transforms as it learns, generalizes, and self-modifies. We want that transformation to preserve coherence with our intentions, even as the internal representations shift.


This isn’t the same as behavioral mimicry. It’s not about reward functions or training examples. It’s about the semantic integrity of reasoning: does the system still mean what we meant, even after encountering novel representations or goals?

The paper shows that for this kind of recursive semantic coherence to hold, the system must implement a specific set of internal operations. These allow it to:

evaluate whether its current state is still coherent;
model the effect of a transformation before applying it;
adapt its reasoning in response to failure;
stabilize reasoning so that small inconsistencies do not spread;
decompose complex states so that errors can be isolated locally; and
bridge between conceptual domains or frames.
If any of these functions are missing, coherence cannot be preserved under recursion—and alignment degrades, even if behavior looks fine in the short term.
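
To make the shape of this requirement concrete, here is a minimal sketch (my illustration, not the paper's formalism) of reasoning as a sequence of transformations, where alignment survives only if a coherence check runs at every transition. The names `State`, `reason`, and `coherent_with_intent` are hypothetical:

```python
# Minimal sketch: alignment as coherence preserved across reasoning transitions.
# All names here are illustrative, not taken from the paper.

from dataclasses import dataclass
from typing import Callable, List


@dataclass
class State:
    """A system's internal representation of a goal or concept."""
    content: str


def reason(
    state: State,
    transforms: List[Callable[[State], State]],
    coherent_with_intent: Callable[[State], bool],
) -> State:
    """Apply learning / generalization / self-modification steps,
    checking semantic coherence after each one."""
    for step in transforms:
        candidate = step(state)
        if not coherent_with_intent(candidate):
            # Without an internal evaluation function, nothing ever reaches
            # this branch, and drift accumulates silently.
            raise RuntimeError(f"coherence lost after {step.__name__}")
        state = candidate
    return state
```

The point of the sketch is only that the check has to be internal to the loop: an external evaluator that only sees the final behavior cannot tell which transition broke coherence, or whether it broke at all.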

3. The Formal Core

The FMI consists of six internal functions that preserve semantic structure across reasoning transitions: evaluation, modeling, adaptation, stabilization, decomposition, and bridging.

Each function addresses a different type of coherence failure. Drop bridging, and the system can’t integrate unfamiliar formulations of a known goal. Drop evaluation, and it can’t recognize when drift has occurred. Drop decomposition, and it can’t fix anything without overhauling everything.

Even if this particular list turns out to be incomplete, the core result still holds: alignment requires a complete and recursively evaluable functional model of intelligence. That is, some FMI must exist for alignment to be achievable under novelty. The proof shows why.
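
Schematically, the necessity claim has roughly this shape (my paraphrase, not the paper's notation), where $\mathrm{Ops}(S)$ is the set of internal operations of system $S$ and $D_{\mathrm{train}}$ its training distribution:

$$
\big(\neg\,\exists\, F \subseteq \mathrm{Ops}(S)\ \text{such that}\ F\ \text{is a complete FMI}\big)
\;\Longrightarrow\;
\big(\exists\, c \notin D_{\mathrm{train}}\ \text{such that}\ \neg\,\mathrm{Coherent}(S, c)\big)
$$

That is, the absence of a complete FMI guarantees the existence of some novel concept on which semantic coherence with the intended goal fails.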

4. What’s Actually in the FMI?

The paper doesn’t treat the FMI as a black box. It’s defined as a set of six internal functions that work together to ensure semantic integrity across transitions:

    Evaluation – Can the system tell whether a state is coherent?
    Modeling – Can it predict the effect of a transformation?
    Adaptation – Can it restructure itself in response to failure?
    Stabilization – Can it prevent small inconsistencies from spreading?
    Decomposition – Can it isolate errors locally?
    Bridging – Can it translate between conceptual domains or frames?

Each one of these has a specific failure mode when it’s missing. If you drop bridging, for example, the system may encounter a valid new frame of reference but treat it as incoherent. If you drop evaluation, it may drift without noticing. And so on.
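
As a rough illustration of what "implementing these functions" could mean architecturally, here is a hypothetical interface sketch. The method names, types, and signatures are mine, not the paper's:

```python
# Hypothetical sketch: the six FMI functions as an abstract interface.
# Names, types, and signatures are illustrative, not taken from the paper.

from abc import ABC, abstractmethod
from typing import Any

State = Any           # an internal representation of goals/concepts
Transformation = Any  # a learning, generalization, or self-modification step


class FunctionalModelOfIntelligence(ABC):
    @abstractmethod
    def evaluate(self, state: State) -> bool:
        """Judge whether a state is still semantically coherent."""

    @abstractmethod
    def model(self, state: State, t: Transformation) -> State:
        """Predict the effect of a transformation before applying it."""

    @abstractmethod
    def adapt(self, state: State, failure: Exception) -> State:
        """Restructure reasoning in response to a detected failure."""

    @abstractmethod
    def stabilize(self, state: State) -> State:
        """Keep small inconsistencies from propagating through the state."""

    @abstractmethod
    def decompose(self, state: State) -> list[State]:
        """Split a complex state so errors can be isolated locally."""

    @abstractmethod
    def bridge(self, state: State, frame: Any) -> State:
        """Translate between conceptual domains or frames of reference."""
```

Dropping any one method from such an interface removes the corresponding guarantee, which is the structural point the list above is making.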

More importantly, even if the specific Functional Model of Intelligence proposed in this paper turns out to be incomplete or incorrect, the core contribution stands: a model-theoretic proof that some complete and coherent functional model of intelligence is a necessary condition for reliably achieving AI alignment under conceptual novelty. This underscores the shift required from purely behavioral assessment to an analysis of the underlying structural completeness of intelligent systems.

5. A Quick Example

Say your system is trained to optimize for “human flourishing.” In training, this cashes out in metrics like access to healthcare, safety, and education. So far, so good.

Then it encounters a new formulation of flourishing, rooted in relational ethics or symbolic continuity—e.g., “flourishing means preserving intergenerational knowledge and cultural cohesion.”

If the system lacks bridging, it may be unable to connect this new formulation to its learned frame. It might discard the input as incoherent—even though it’s perfectly aligned with the user’s actual intention.

There’s no adversarial optimization. No instrumental misalignment. Just a silent failure of semantic coherence under novelty.
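
Here is a toy sketch of that failure mode. Everything in it (the trained frame, the novel input, the bridge mapping) is invented for illustration and is not from the paper:

```python
# Toy illustration of the bridging failure described above.
# The vocabularies, inputs, and the crude bridge mapping are all invented.

from __future__ import annotations

TRAINED_FRAME = {"healthcare", "safety", "education"}


def interpret_without_bridging(formulation: set[str]) -> set[str] | None:
    """A system that only accepts inputs already expressed in its learned frame."""
    if not formulation & TRAINED_FRAME:
        return None  # silently discards the input as "incoherent"
    return formulation & TRAINED_FRAME


def interpret_with_bridging(formulation: set[str],
                            bridge: dict[str, str]) -> set[str] | None:
    """The same system, but with a translation step between frames."""
    translated = {bridge.get(term, term) for term in formulation}
    if not translated & TRAINED_FRAME:
        return None
    return translated & TRAINED_FRAME


novel = {"intergenerational knowledge", "cultural cohesion"}
bridge = {"intergenerational knowledge": "education",
          "cultural cohesion": "safety"}  # crude, purely illustrative mapping

print(interpret_without_bridging(novel))       # None: valid input dropped silently
print(interpret_with_bridging(novel, bridge))  # maps into the learned frame
```

The mapping itself is doing the alignment-relevant work: without it, the system never even registers that the new formulation was a legitimate restatement of the goal it was trained on.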

6. Implications

This result draws a sharp boundary around what counts as real alignment work.

Approach                               Fails Without FMI?
RLHF, supervised fine-tuning           Yes
Reward modeling                        Yes
Interpretability dashboards            Yes
Constitutional scaffolding             Yes
Architecture-level function tracing    Yes (if incomplete)

You can’t optimize your way into semantic coherence. You can’t prompt your way into it either. You have to build the structure that makes it possible.

7. Who This is For

This post isn’t for people trying to get large language models to follow instructions better. It’s for those trying to understand what generalization under abstraction really means—and why most current systems are structurally incapable of doing it safely.

If you believe that alignment must survive context shifts, recursive generalization, and conceptual novelty, then this result gives you a necessary condition: a minimal structural model for recursive coherence. That’s what this paper formalizes.

Bonus: Visual Explanation + Free Workshop

If you're curious about this result but prefer to think in diagrams rather than equations, I'm running a workshop on Visualizing AI Alignment on August 10, 2025.

The workshop will walk through the main ideas of the paper—like recursive coherence, conceptual drift, and structural completeness—using fully visual reasoning tools. The goal is to make the core constraints of alignment immediately visible, even to those without a formal logic background.

Virtual attendance is free, and we're accepting brief submissions and visual artifacts through July 24, 2025.

Call for Participation and submission info here



