AI Control May Increase Existential Risk

Published on March 11, 2025 2:30 PM GMT

Epistemic status: The following isn't an airtight argument, but mostly a guess about how things will play out.

Consider two broad possibilities:

I. In worlds where we are doing reasonably well on alignment, the AI control agenda does not have much impact.

II. In worlds where we are failing at alignment, AI control may primarily shift probability mass away from "moderately large warning shots" and towards "ineffective warning shots" and "existential catastrophe, full takeover".

The key heuristic is that the global system already has various mechanisms and feedback loops that resist takeover by a single agent (e.g. it is not easy to overthrow the Chinese government). In most cases where AI control would stop an unaligned AI, the counterfactual is that broader civilizational resistance would have stopped it anyway, but with the important side effect of a moderately sized warning shot.

I expect moderately sized warning shots to increase the chance that humanity as a whole takes serious action and, for example, steps up efforts to align the frontier labs.

I am skeptical that incidents stopped by AI control would lead to meaningful change. Sharing the details of such an event with proper framing could pose an existential risk, but to the lab involved. In practice, I anticipate vague, sanitized communications along the lines of "our safety systems performed as designed, preventing bad things". Without clear, compelling evidence of the severity of the averted threat, these incidents are unlikely to catalyze serious action. The incentives for labs to downplay and obscure such events will be strong.

There are additional factors to consider, such as AI control likely diverting some resources away from alignment, but I don't think this is the dominant effect.

Note that this isn't a general argument against boxing; e.g., boxes based more on formal methods or theory have a better chance of generalizing.

Typical counter-arguments to this line of reasoning seem to be:

Note: this text lived in draft form before John Wentworth posted his Case Against AI Control Research; my original intent was to extend it a bit more toward discussing AI control generalization properties. As that would be redundant now, I'm posting it as it is: there is some non-overlapping content.


