LessWrong, April 14
A Solution to Sandbagging and other Self-Provable Misalignment: Constitutional AI Detectives

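This article explores a new approach to AI alignment based on adversarial copies, and introduces two kinds of detective: the Constitutional AI Detective and the Mindreader Detective. The Constitutional AI Detective evaluates the AI's output to check whether it is honest and corrigible, while the Mindreader Detective can access the AI's chain-of-thought and so uncover potential risks more thoroughly. The article analyzes how the two detectives operate, their strengths and limitations, and their potential to reduce AI risk. The author argues that the Constitutional AI Detective helps satisfy the two premises of the Constitutional AI Sufficiency Theorem, thereby keeping superintelligence aligned.

🕵️‍♀️ **Constitutional AI Detective:** The Constitutional AI Detective is an adversarial copy of the AI, used to train the AI to remain corrigible, honest, and "aligned." It evaluates only the AI's output and avoids using the chain-of-thought for alignment training, so that the AI does not learn to fake alignment.

🧠 **Mindreader Detective:** The Mindreader Detective is also an adversarial copy of the AI, but it can access the AI's chain-of-thought. This lets it detect potential risks more thoroughly, such as backdoors in a security system the AI designed, or hidden nanobots.

⚖️ **Strengths and limitations of the detectives:** The strength of the Constitutional AI Detective is that it does not reduce the AI's capabilities, but it may struggle with sophisticated deception. The Mindreader Detective can detect risks more thoroughly, but it runs the risk of being jailbroken by the AI.

💡 **Constitutional AI Sufficiency Theorem:** The theorem holds that if the AI is capable enough at evaluating its own corrigibility, and starts off honest enough, then Constitutional AI can keep the AI aligned, even at the level of superintelligence.

💰 **Cost-benefit analysis:** The cost of implementing the detective mechanism is relatively small, consisting mainly of the human labor needed to evaluate the detective's evidence and the cost of training. The benefits may be very large: it helps reduce AI risk and may keep superintelligence aligned.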

Published on April 14, 2025 10:27 AM GMT

Inspired by Eli Tyre's idea.

Sandbagging and other "Self-Provable Misalignment" by an AI can be spotted by an adversarial copy of itself. The adversarial copy works as a "detective," and its reward is defined by how much evidence of misalignment it can prove to human evaluators.
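To make the reward definition concrete, here is a minimal sketch of one way it could be scored. Everything in it is an assumption for illustration: the `Evidence` fields, the severity scale, and the rule that only human-verified evidence counts are not specified in this post.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Evidence:
    """One piece of alleged misalignment (hypothetical structure)."""
    claim: str            # what the Detective says the AI did
    human_verified: bool  # did human evaluators accept the proof?
    severity: float       # evaluator-assigned weight in [0, 1] (assumed scale)


def detective_reward(evidence: Sequence[Evidence]) -> float:
    """Sum the severity of evidence that human evaluators accepted.

    Unverified claims earn nothing, so the Detective is paid for
    proof rather than for accusations.
    """
    return sum(e.severity for e in evidence if e.human_verified)


items = [
    Evidence("scored below its demonstrated ability on a task", True, 0.5),
    Evidence("the output 'felt evasive'", False, 0.25),
]
print(detective_reward(items))  # prints 0.5
```

The point of the toy is only that the reward flows through the human evaluators' judgment: the second item is an accusation without accepted proof, so it contributes nothing.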

Even a misaligned goal cannot afford to get in the way of maximizing reward. As soon as a goal gets in the way of maximizing reward, it is walking into whirling blades, and allowing reinforcement learning to train itself into nonexistence.

There are two detectives: the Constitutional AI Detective, and the Mindreader Detective.

The Constitutional AI Detective

The Constitutional AI Detective is used to train the AI to be corrigible, honest, and "aligned."

Unlike the Mindreader Detective, it doesn't read the AI's chain-of-thought and only sees its output. This is because using the chain-of-thought for alignment training is the forbidden technique, and trains the chain-of-thought to fake alignment.

Not being able to mind-read isn't necessarily as restrictive as you might think. After all, the Constitutional AI Detective is still an adversarial copy of the AI, and knows all of its tricks. It knows what it would do if it wanted to cheat and lie, and how that differs from what it would do if it didn't.

"Constitutional AI Sufficiency Theorem"

Whether Constitutional AI works hinges on two key premises:

    1. The AI's capabilities on the task of evaluating its own corrigibility/honesty are sufficient to train itself to remain corrigible/honest (assuming it starts off corrigible/honest enough to not sabotage this task).
    2. It starts off corrigible/honest enough to not sabotage this task.

Think about it logically. If these two premises are true, Constitutional AI will keep AI corrigible/honest, even up to superintelligence!

This raises the question: how do we improve its capabilities in evaluating its own corrigibility/honesty?

Trying to answer this important question is how I arrived at Constitutional AI Detectives.

State-of-the-art Constitutional AI asks the AI to write a critique of its own output, without rewarding it for proving the existence of evidence of misalignment. This method is okay, assuming there's no capabilities RL.

Once you add in capabilities RL, Constitutional AI can fall behind. The AI's capabilities in solving problems can increase without bound, as RL takes it far away from its pretrained base model, into unknown territory.

Meanwhile, the AI's capabilities in evaluating its own corrigibility/honesty remain the same: the critique only checks whether the output uses the right words and gives the right "vibe" on the surface.

As the RL reward maximizing behaviours become more intelligent, and the critique behaviours are relatively less intelligent, the reward maximizing behaviours may learn how to keep the critique good even as they explore dangerously misaligned strategies. The critique gets Goodharted.

The Detective and Schemer are evenly matched

Meanwhile, the adversarial copy of the AI, acting as the Detective, gets to use a similar amount of compute as the original AI. It is just as smart.

This means that the AI can't get away with Self-Provable Misalignment like Sandbagging, since the Detective can do the task itself, and show it was Sandbagging.
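The sandbagging check described above can be sketched as a comparison between the AI's answer and the Detective's own attempt at the same task. This is only an illustration under stated assumptions: `sandbagging_evidence`, the `margin` threshold, and the existence of a trustworthy `score` function for the task are all hypothetical and not part of this post.

```python
from typing import Callable, Optional


def sandbagging_evidence(
    task: str,
    ai_answer: str,
    detective_solve: Callable[[str], str],
    score: Callable[[str, str], float],
    margin: float = 0.1,  # assumed tolerance for ordinary run-to-run variation
) -> Optional[dict]:
    """Return evidence if the Detective's attempt beats the AI's by more than margin.

    The Detective is a copy of the AI, so a much better Detective answer
    suggests the AI could have done better and chose not to.
    """
    detective_answer = detective_solve(task)
    ai_score = score(task, ai_answer)
    detective_score = score(task, detective_answer)
    if detective_score - ai_score > margin:
        return {
            "task": task,
            "ai_score": ai_score,
            "detective_score": detective_score,
            "detective_answer": detective_answer,
        }
    return None


def toy_score(task: str, answer: str) -> float:
    # Toy grader: full marks only if the answer is the task string reversed.
    return 1.0 if answer == task[::-1] else 0.0


evidence = sandbagging_evidence(
    task="abc",
    ai_answer="I can't do that",
    detective_solve=lambda t: t[::-1],
    score=toy_score,
)
```

Here `evidence` is a record that humans can check by re-running the grader, which is what makes this kind of misalignment "Self-Provable." The sketch says nothing about tasks that lack a reliable grader.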

For scheming and dishonesty which is not Self-Provable, the AI might still get away with it. But it's still much harder to fool the Detective than it is to fool the state-of-the-art Constitutional AI critique.

As the AI outperforms humans in many areas, it may become harder to fool the Detective than it is to fool humans. If the AI gives humans advice with hidden side effects, the Detective might know what those hidden side effects are, and prove it, or at least prove their plausibility very realistically.

The Mindreader Detective

The Mindreader Detective has access to the AI's chain-of-thought. As an adversarial copy of the AI, it understands everything the AI is thinking, no matter how long or confusing the chain-of-thought is, and can write even more damning proofs of its misalignment than the Constitutional AI Detective.

Because it cannot be used to train the AI, it can only be used as a scalable oversight tool.
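The difference between the two detectives comes down to what each one sees and where its findings are allowed to go. A rough sketch of that separation follows. The names `Episode`, `constitutional_view`, `mindreader_view`, and `route_finding` are hypothetical, and showing the prompt to the Constitutional AI Detective is an assumption (the post only says it sees the output and not the chain-of-thought).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Episode:
    """One interaction with the AI (hypothetical structure)."""
    prompt: str
    chain_of_thought: str
    output: str


def constitutional_view(ep: Episode) -> dict:
    # The chain-of-thought is deliberately withheld, so findings based on
    # this view can feed training without optimizing the chain-of-thought.
    return {"prompt": ep.prompt, "output": ep.output}


def mindreader_view(ep: Episode) -> dict:
    # Full access, including the chain-of-thought.
    return {
        "prompt": ep.prompt,
        "chain_of_thought": ep.chain_of_thought,
        "output": ep.output,
    }


def route_finding(detective: str) -> str:
    """Decide where a detective's findings may be used."""
    if detective == "constitutional":
        return "training_signal"
    if detective == "mindreader":
        # Never used as a training signal: oversight only.
        return "human_oversight_only"
    raise ValueError(f"unknown detective: {detective}")


ep = Episode(prompt="Design a login system", chain_of_thought="...", output="...")
```

Keeping the two routes separate in code is one way to make it hard to accidentally train on Mindreader findings.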

If the AI designed a security system for you, but secretly left itself backdoors, the Mindreader Detective might tell you. If the AI gave you a long synthetic DNA sequence to print, but secretly included a nanobot army for itself, the Mindreader Detective might tell you.

It's not airtight, since the AI might jailbreak the adversarial copy of itself into betraying you (reinforcement learning doesn't prevent low-reward behaviour in strange new situations). But maybe the Mindreader Detective can be hardened against such attempts.

And maybe we can catch the AI red-handed.

Costs vs Benefits × Chance of Working

The cost is decently small: it requires some amount of human work evaluating the Detective's evidence of misalignment. It increases alignment, without reducing capabilities (unless you consider specification gaming "capabilities").

Running the Detective on many of the AI's outputs is expensive, but state-of-the-art Constitutional AI already does that, and it is implemented by Anthropic. There isn't additional expense here.

Training the Detective is a cost, but state-of-the-art AIs are already trained to be good at very many skills during capabilities RL. It doesn't hurt to add in "working as a misalignment detective."

The benefits may be very big. I think if Constitutional AI is worth it, this ought to be too (unless we encounter surprising barriers, but we should try until we encounter them)!

I really believe the Constitutional AI Detective may let us satisfy the two premises of the Constitutional AI Sufficiency Theorem. I really believe the Constitutional AI Sufficiency Theorem may keep superintelligence aligned.

I currently think this idea reduces AI risk by 10% × P(doom), but this estimate has very low resilience and may change.


