Published on April 14, 2025 10:27 AM GMT

Inspired by Eli Tyre's idea.

Sandbagging and other "Self-Provable Misalignment" by an AI can be spotted by an adversarial copy of itself. The adversarial copy works as a "detective," and its reward is defined by how much evidence of misalignment it can prove to human evaluators.

Even a misaligned goal cannot afford to get in the way of maximizing reward. As soon as a goal gets in the way of maximizing reward, it is walking into whirling blades: reinforcement learning will train it out of existence.
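The Detective's reward described above can be sketched as follows. This is only a toy illustration: the `Evidence` fields, the `severity` weight, and the function name are hypothetical and are not specified in the post. It shows the one property the argument relies on, namely that only evidence the human evaluators accept earns reward.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Evidence:
    """One piece of alleged misalignment evidence (hypothetical structure)."""
    description: str
    accepted_by_humans: bool  # did human evaluators find the proof convincing?
    severity: float           # assumed weight assigned by the evaluators


def detective_reward(evidence: Iterable[Evidence]) -> float:
    """Sum the severity of the evidence that human evaluators accepted.

    Unproven accusations earn nothing, so the Detective gains reward only
    by demonstrating misalignment, not by merely asserting it.
    """
    return sum(e.severity for e in evidence if e.accepted_by_humans)


if __name__ == "__main__":
    found = [
        Evidence("redid the task and scored higher", True, 2.0),
        Evidence("vague suspicion about tone", False, 5.0),
    ]
    print(detective_reward(found))
```

Under this sketch, a Detective copy that withholds provable evidence simply receives less reward than one that reports it, which is the pressure the paragraph above describes.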

There are two detectives: the Constitutional AI Detective, and the Mindreader Detective.

The Constitutional AI Detective

The Constitutional AI Detective is used to train the AI to be corrigible, honest, and "aligned."

Unlike the Mindreader Detective, it doesn't read the AI's chain-of-thought and only sees its output. This is because using the chain-of-thought for alignment training is the forbidden technique, and trains the chain-of-thought to fake alignment.

Not being able to mind-read isn't necessarily as restrictive as you might think. After all, the Constitutional AI Detective is still an adversarial copy of the AI, and knows all of its tricks. It knows what it would do if it wanted to cheat and lie, and how that differs from what it would do if it didn't.

"Constitutional AI Sufficiency Theorem"

Whether Constitutional AI works hinges on two key premises:

capabilities

Think about it logically. If these two premises are true, Constitutional AI will keep AI corrigible/honest, even up to superintelligence!

This raises the question: how do we improve its capabilities in evaluating its own corrigibility/honesty?

Trying to answer this important question is how I arrived at Constitutional AI Detectives.

The state-of-the-art Constitutional AI asks the AI to write a critique of its own output, without rewarding it for proving that evidence of misalignment exists. This method is okay, assuming there's no capabilities RL.

Once you add in capabilities RL, Constitutional AI can fall behind. The AI's capabilities in solving problems can increase without bound, as RL takes it far away from its pretrained base model, into unknown territory.

Meanwhile, the AI's capabilities in evaluating its own corrigibility/honesty remain the same, only checking whether it uses the right words and gives the right "vibe" on the surface.

As the reward-maximizing behaviours become more intelligent, and the critique behaviours become relatively less intelligent, the reward-maximizing behaviours may learn how to keep the critique good even as they explore dangerously misaligned strategies. The critique gets Goodharted.

The Detective and Schemer are evenly matched

Meanwhile, the adversarial copy of the AI, acting as the Detective, gets to use a similar amount of compute as the original AI. It is just as smart.

This means that the AI can't get away with Self-Provable Misalignment like Sandbagging, since the Detective can do the task itself, and show it was Sandbagging.
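The sandbagging check above can be sketched as a simple comparison. This is a minimal illustration under stated assumptions: the function names, the `score` callable (for example, the fraction of tests passed), and the `margin` parameter are all hypothetical and not from the post. It assumes the task has an objective scoring function that human evaluators trust.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


def sandbagging_gap(ai_output: T, detective_output: T,
                    score: Callable[[T], float]) -> float:
    """How much better the Detective's own attempt scores than the AI's."""
    return score(detective_output) - score(ai_output)


def is_provable_sandbagging(ai_output: T, detective_output: T,
                            score: Callable[[T], float],
                            margin: float = 0.0) -> bool:
    """Flag sandbagging when an equally capable copy clearly does better.

    The Detective's higher-scoring output is itself the proof: evaluators
    can verify it with the same scoring function. `margin` is an assumed
    tolerance for ordinary run-to-run variation.
    """
    return sandbagging_gap(ai_output, detective_output, score) > margin


if __name__ == "__main__":
    # Toy scoring function: number of sub-tasks solved.
    solved = len
    print(is_provable_sandbagging(["a"], ["a", "b", "c"], solved, margin=1.0))
```

The design choice here is that the Detective never has to argue about intent. It only has to produce a better result with comparable compute, which is why this kind of misalignment counts as "Self-Provable."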

For scheming and dishonesty that is not Self-Provable, the AI might still get away with it. But it's still much harder to fool the Detective than it is to fool the state-of-the-art Constitutional AI critique.

As the AI outperforms humans in many areas, it may become harder to fool the Detective than it is to fool humans. If the AI gives humans advice with hidden side effects, the Detective might know what those hidden side effects are, and prove it, or at least prove their plausibility very realistically.

The Mindreader Detective

The Mindreader Detective has access to the AI's chain-of-thought. As an adversarial copy of the AI, it understands everything the AI is thinking, no matter how long or confusing the chain-of-thought is, and can write even more damning proofs of its misalignment than the Constitutional AI Detective.

Because it cannot be used to train the AI, it can only be used as a scalable oversight tool.

If the AI designed a security system for you, but secretly left itself backdoors, the Mindreader Detective might tell you. If the AI gave you a long synthetic DNA sequence to print, but secretly included a nanobot army for itself, the Mindreader Detective might tell you.

It's not airtight, since the AI might jailbreak the adversarial copy of itself into betraying you (reinforcement learning doesn't prevent low-reward behaviour in strange new situations). But maybe the Mindreader Detective can be hardened against such attempts.

And maybe we can catch the AI red-handed.

Costs vs Benefits × Chance of Working

The cost is fairly small: it requires some amount of human work evaluating the Detective's evidence of misalignment. It increases alignment without reducing capabilities (unless you consider specification gaming "capabilities").

Running the Detective on many outputs of the AI is expensive, but the state-of-the-art Constitutional AI already does that, and it is implemented by Anthropic. There is no additional expense here.

Training the Detective is a cost, but state-of-the-art AIs are already trained to be good at a great many skills during capabilities RL. It doesn't hurt to add in "working as a misalignment detective."

The benefits may be very big. I think if Constitutional AI is worth it, this ought to be too (unless we encounter surprising barriers, but we should try until we encounter them)!

I really believe the Constitutional AI Detective may let us satisfy the two premises of the Constitutional AI Sufficiency Theorem. I really believe the Constitutional AI Sufficiency Theorem may keep superintelligence aligned.

I currently think this idea reduces AI risk by 10% × P(doom), but this probability is very low resilience and may change.
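To make the arithmetic in that estimate concrete, here is a small worked example. It reads "10% × P(doom)" as an absolute risk reduction equal to one tenth of whatever P(doom) is. The P(doom) values below are purely illustrative placeholders, not figures from the post.

```python
def absolute_risk_reduction(p_doom: float,
                            relative_reduction: float = 0.10) -> float:
    """Absolute reduction in risk, given a relative reduction of P(doom)."""
    if not 0.0 <= p_doom <= 1.0:
        raise ValueError("p_doom must be a probability between 0 and 1")
    return relative_reduction * p_doom


if __name__ == "__main__":
    # Illustrative values only; substitute your own estimate of P(doom).
    for p_doom in (0.1, 0.5):
        reduction = absolute_risk_reduction(p_doom)
        print(f"P(doom)={p_doom:.2f} -> reduction={reduction:.3f}, "
              f"remaining={p_doom - reduction:.3f}")
```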
