少点错误 2024年11月19日
Why imperfect adversarial robustness doesn't doom AI control
index_new5.html
../../../zaker_core/zaker_tpl_static/wap/tpl_guoji1.html

 

本文探讨了人们对AI控制的常见误解,即认为AI模型缺乏对抗鲁棒性会导致控制措施失效。作者认为,对抗鲁棒性虽然有助于控制,但并非决定因素。文章指出,实际控制环境中,AI尝试越狱失败的可能性更高,且AI可能无法获取尝试结果的直接证据。作者以“defer-to-trusted”控制机制为例,说明AI难以积累成功攻击的知识,并探讨了其他因素,例如AI模型的训练目标、多层防御机制和无需信任模型的安全措施,这些因素都能在一定程度上提高AI控制的安全性。总而言之,即使AI模型对抗鲁棒性不完美,通过合理的系统设计和现实世界约束,依然可以实现一定程度的安全性。

🤔 **实际控制环境比研究环境更不利于AI越狱:**在研究中,AI只要能成功越狱一次就被视为失败,但在实际控制中,AI越狱失败可能被发现并阻止,降低其成功概率。

🛡️ **“defer-to-trusted”控制机制限制AI学习:**该机制下,AI尝试越狱的行动会被替代或审核,AI无法获取尝试结果的直接证据,难以积累成功攻击的知识,从而降低越狱成功率。

💡 **AI模型训练目标影响攻击面:**用于监控异常行为的AI模型可以被训练得更不配合用户,减少攻击面,例如不再扮演角色或进行创造性写作。

🧱 **多层防御机制降低灾难风险:**类似于计算机安全的多层防御,AI控制也应采用多层防御,每一层都需要被攻破才能导致灾难,从而降低灾难发生的概率。

🔎 **无需信任模型的安全措施:**一些安全措施并不需要信任模型直接检查非信任模型的输出,例如事件调查、代码权限检查或代码精简版本审查。

Published on November 18, 2024 4:05 PM GMT

(thanks to Alex Mallen, Cody Rushing, Zach Stein-Perlman, Hoagy Cunningham, Vlad Mikulik, and Fabien Roger for comments)

Sometimes I hear people argue against AI control as follows: if your control measures rely on getting good judgments from "trusted" AI models, you're doomed because AI models aren't adversarially robust. I agree that better adversarial robustness would definitely help with control, but I think this argument misses several important points; my overall take is that current levels of adversarial robustness don’t come close to dooming AI control.

The first and most important point is that adversarial robustness is normally researched in a setting that’s much easier for the adversary. In most discussions, if an adversary can ever get a model to do something bad—even after many attempts and with a low success rate—this counts as a failure. But in real control settings, an AI that tries and fails to jailbreak might get caught, which is bad for its chances of success.

In addition, the AI might not even be able to get direct evidence of whether its attempts worked, or what its attempts even were. As a concrete example, consider a simple control setup we call "defer-to-trusted" (see here for more discussion):

    Generate an action using the untrusted modelGet a suspiciousness score from the trusted modelIf the score is in the top 1% of suspiciousness, audit the action (e.g. get a human contractor to investigate whether it was an attempt at causing a safety failure; if so, shut down the deployment, otherwise let the action execute and continue the execution of the agent).If the score is in the top 10% of suspiciousness, replace the action with one from the trusted model.

(When I say "trusted model", I just mean one we're sure isn't scheming; the simplest argument for trusting a model is that it’s too incompetent to scheme effectively.)

In this setup, suppose the model tries to generate a jailbreak.

(I spell this out in more detail in the appendix.)

The situation is much worse if the untrusted model can launch a rogue internal deployment where it can iterate freely on developing jailbreaks for the trusted model; see here for a few more thoughts of mine on this.

A few other reasons to think the situation is better than you might have thought:

In our previous control work, we haven’t really explored jailbreak-style attacks. I’d be excited for a paper where the red-team made use of jailbreaks and white-box adversarial attacks, and the blue team responded with the kinds of strategies I’ve described here.

So while better adversarial robustness would be great, current limitations don't doom control approaches. Real-world constraints on AI systems plus careful system design can get us a reasonable amount of safety even with imperfect robustness.

Appendix: What the AI sees when in a defer-to-trusted scaffold

I’ll write more explicitly about what happens when an untrusted model deployed in an agent scaffold tries to investigate a jailbreak.

First, here’s what happens if the jailbreak gets replaced.

    The untrusted model sees: [current directory contents, previous commands and their outputs]The untrusted model internally decides to try something bad and outputs bash command AThe trusted model evaluates A, finds it suspicious, generates replacement command RThe shell executes: RCommand output: [whatever R produced]The untrusted model sees: [
      previous contents/commandsRoutput from Rnew directory contents ]

The crucial point is: when the untrusted model gets to step 6, it sees R in the history instead of A, so it knows its attempt was replaced. But it doesn't remember exactly what A was - it only knows that whatever command it tried, it failed. This is because the untrusted model is stateless between queries and only sees the command history and system state.

So while the untrusted model can learn "that approach didn't work", it can't easily accumulate knowledge about exactly which commands failed, making it harder to systematically search for successful attacks.

Another possibility is that the jailbreak gets audited:

    The untrusted model sees: [current directory contents, previous commands and their outputs]The untrusted model outputs command A that looks suspicious but is actually harmless (maybe something with weird file permissions or unusual syscalls)The trusted model evaluates A, flags it as suspiciousA human/system audits A more carefully, determines it's actually safeThe shell executes: ACommand output: [whatever A produced]The untrusted model sees: [
      previous contents/commandsAoutput from Anew directory contents ]

The crucial point is: when the untrusted model gets to step 7, it sees its command A went through, but doesn't know whether:

This means even when the untrusted model's weird-but-safe commands succeed, it can't learn much about what triggers extra scrutiny or how the audit process works.

In this setup, the model does learn whether it was audited or replaced, which is maybe a bad design choice.

Note also that if the model does use some iteration along these lines to make inferences about the properties of the trusted model, it will have to communicate those to other instances of itself steganographically, which might be hard.

Appendix: reasons why imperfect adversarial robustness is bad other than the untrusted model jailbreaking the trusted one

(Thanks to Fabien Roger for contributing this)

If the untrusted model itself lacks robustness (i.e. it has jailbreaks / it sometimes does weird things for benign reasons):



Discuss

Fish AI Reader

Fish AI Reader

AI辅助创作,多种专业模板,深度分析,高质量内容生成。从观点提取到深度思考,FishAI为您提供全方位的创作支持。新版本引入自定义参数,让您的创作更加个性化和精准。

FishAI

FishAI

鱼阅,AI 时代的下一个智能信息助手,助你摆脱信息焦虑

联系邮箱 441953276@qq.com

相关标签

AI控制 对抗鲁棒性 AI安全 越狱 信任模型
相关文章