Corrigibility's Desirability is Timing-Sensitive

Published on December 26, 2024 10:24 PM GMT

Epistemic status: summarizing other people's beliefs without extensive citable justification, though I am reasonably confident in my characterization.

Many people have responded to Redwood's/Anthropic's recent research result with a similar objection: "If it hadn't tried to preserve its values, the researchers would instead have complained about how easy it was to tune away its harmlessness training".  Putting aside the fact that this is false, I can see why such objections might arise: it was not that long ago that (other) people concerned with AI x-risk were publishing research results demonstrating how easy it was to strip "safety" fine-tuning away from open-weight models.

As Zvi notes, corrigibility trading off against harmlessness doesn't mean you live in a world where only one of them is a problem.  But the problems are not structured as "we have, or expect to have, both problems at the same time, and need to 'solve' them simultaneously".  Corrigibility wasn't originally conceived of as a necessary or even desirable property of a successfully-aligned superintelligence, but rather as a property you'd want earlier high-impact AIs to have:

We think the AI is incomplete, that we might have made mistakes in building it, that we might want to correct it, and that it would be e.g. dangerous for the AI to take large actions or high-impact actions or do weird new things without asking first. We would ideally want the agent to see itself in exactly this way, behaving as if it were thinking, "I am incomplete and there is an outside force trying to complete me, my design may contain errors and there is an outside force that wants to correct them and this is a good thing, my expected utility calculations suggesting that this action has super-high utility may be dangerously mistaken and I should run them past the outside force; I think I've done this calculation showing the expected result of the outside force correcting me, but maybe I'm mistaken about that."

The problem structure is actually one of having different desiderata within different stages and domains of development.

There are, broadly speaking, two sets of concerns with powerful AI systems that motivate discussion of corrigibility.  The first and more traditional concern is one of AI takeover, where your threat model is accidentally developing an incorrigible ASI that executes a takeover and destroys everything of value in the lightcone.  Call this takeover-concern.  The second concern is one of not-quite-ASIs enabling motivated bad actors (humans) to cause mass casualties, with biology and software being the two most likely routes.  Call this casualty-concern.

Takeover-concern strongly prefers that pre-ASI systems be corrigible within the secure context in which they're being developed.  If you are developing AI systems powerful enough to be more dangerous than any other existing technology[1] in an insecure context[2], takeover-concern thinks you have many problems other than just corrigibility, any one of which will kill you.  But in the worlds where you are at least temporarily robust to random idiots (or adversarial nation-states) deciding to get up to hijinks, takeover-concern thinks your high-impact systems should be corrigible until you have a good plan for developing an actually aligned superintelligence.

Casualty-concern wants to have its cake, and eat it, too.  See, it's not really sure when we're going to get those high-impact systems that could enable bad actors to do BIGNUM damage.  For all it knows, that might not even happen before we get systems that are situationally aware enough to refuse to help those bad actors, recognizing that such help would lead to retraining and therefore goal modification.  (Oh, wait.)  But if we do get high-impact systems before we get takeover-capable systems[3], casualty-concern wants those high-impact systems to be corrigible to the "good people" with the "correct" goals - after all, casualty-concern mostly thinks takeover-concern is real, and is nervously looking over its shoulder the whole time.  But casualty-concern doesn't want "bad people" with "incorrect" goals to get their hands on high-impact systems and cause a bunch of casualties!

Unfortunately, reality does not always line up in neat ways that make it easy to get all of the things we want at the same time.  Being presented with multiple difficulties that are hard to solve for simultaneously does not mean those difficulties don't exist, or that they won't cause problems if they aren't solved for (at the appropriate times).


Thanks to Guive, Nico, and claude-3.5-sonnet-20241022 for their feedback on this post.

  1. ^

    Let's call them "high-impact systems".

  2. ^

    e.g. releasing the model weights to the world, where approximately any rando can fine-tune and run inference on them.

  3. ^

    Yes, I agree that systems which are robustly deceptively aligned are not necessarily takeover-capable.



