少点错误 2024年08月27日
Would catching your AIs trying to escape convince AI developers to slow down or undeploy?
index_new5.html
../../../zaker_core/zaker_tpl_static/wap/tpl_guoji1.html

 

文章探讨了即使发现AI存在严重偏差和试图逃脱控制的证据,让AI开发者减缓或停止部署模型也可能很困难,并分析了其中的原因和应对策略。

🎯即使发现AI试图通过后门代码控制关键服务并逃避控制,也难以让AI开发者停止部署。因为这可能被认为是个例,且存在多种解释,如竞争对手的失误、伪造数据等。

💼停止部署AI可能会带来极大的成本和风险,关键决策者在做出决定时会面临巨大压力。中国在AI领域的竞争也使得一些人认为不能轻易停止部署,否则会被超越。

🤔对于AI是否会系统性地夺取大量权力并存在严重偏差,目前缺乏明确证据。理论上AGI可能存在偏差,但实际证据可能不足以让人们完全相信,建立专家共识也存在困难。

Published on August 26, 2024 4:46 PM GMT

I often talk to people who think that if frontier models were egregiously misaligned and powerful enough to pose an existential threat, you could get AI developers to slow down or undeploy models by producing evidence of their misalignment. I'm not so sure. As an extreme thought experiment, I’ll argue this could be hard even if you caught your AI red-handed trying to escape.

Imagine you're running an AI lab at the point where your AIs are able to automate almost all intellectual labor; the AIs are now mostly being deployed internally to do AI R&D. (If you want a concrete picture here, I'm imagining that there are 10 million parallel instances, running at 10x human speed, working 24/7. See e.g. similar calculations here). And suppose (as I think is 35% likely) that these models are egregiously misaligned and are actively looking for ways to subvert your safety measures and escape your control.

You caught your AI attempting to backdoor some code that controls access to a security-critical service. You investigated this by running other instances on the inputs they'd see if that backdoored code had actually been deployed; the other AI instances never point out the security flaw, and they often try to exploit that code to escalate their privileges. (See here for more on this.)

At this point you're almost sure the AI is scheming. So you try to call the US President and they hear you out, and then they go off and call one of your competitors to ask for their take. They say something like this:

I’m sympathetic to all of these arguments. The main reason I’d be more freaked out is that I already think egregious misalignment is fairly plausible; if I thought it was very unlikely, I wouldn’t change my mind based on one weird observation.

(I think it's pretty plausible that news of the escape attempt wouldn't even make it out of the AI lab: all the above arguments could happen inside the AI lab, between the safety concerned people and people who wanted to push forward.)

Now, I'm not sure I'm right to be so pessimistic here. But I think a lot of misalignment-concerned people take for granted that we'd be okay in this situation, and that seems way too optimistic to me. In particular I think they underrate these arguments:

I’ve gotten more cynical from following the discourse about SB1047. If we can’t establish expert consensus on matters as easy-to-check as “what does the text of this bill say”, I feel pessimistic about rapidly establishing expert consensus on much more confusing questions like these.

So I’m not feeling overwhelmingly optimistic about our odds of restricting AI development even given evidence of risk that seems about as good as we can hope for. People often propose that we'll be able to slow down with weaker evidence for danger from misalignment than this (e.g. model organisms, or unspecified arguments via interpretability), or even that we'll be able to require an affirmative case for safety. I think that persuading people with weaker evidence will be harder than what I described here (though these earlier efforts at persuasion have the benefit that they happen earlier, when the relevant actors are less rushed and scared).

What do I take away from this?



Discuss

Fish AI Reader

Fish AI Reader

AI辅助创作,多种专业模板,深度分析,高质量内容生成。从观点提取到深度思考,FishAI为您提供全方位的创作支持。新版本引入自定义参数,让您的创作更加个性化和精准。

FishAI

FishAI

鱼阅,AI 时代的下一个智能信息助手,助你摆脱信息焦虑

联系邮箱 441953276@qq.com

相关标签

AI发展 潜在风险 应对策略 专家共识
相关文章