If we solve alignment, do we die anyway?

Published on August 23, 2024 1:13 PM GMT

I'm aware of good arguments that this scenario isn't inevitable, but it still seems frighteningly likely even if we solve technical alignment. 

TL;DR:

    1. If we solve alignment, it will probably be used to create AGI that follows human orders.
    2. If takeoff is slow-ish, a pivotal act (preventing more AGIs from being developed) will be difficult.
    3. If no pivotal act is performed, RSI-capable AGI proliferates. This creates an n-way non-iterated Prisoner's Dilemma where the first to attack wins.
    4. Disaster results.

The first AGIs will probably be aligned to take orders

People in charge of AGI projects like power. And by definition, they like their values somewhat better than the aggregate values of all of humanity. It also seems like there's a pretty strong argument that Instruction-following AGI is easier than value-aligned AGI. In the slow-ish takeoff we expect, this alignment target seems to allow for error-correcting alignment, in somewhat non-obvious ways. If this argument holds up even weakly, it will be an excuse for the people in charge to do what they want to do anyway. 

I hope I'm wrong and value-aligned AGI is just as easy and likely. But it seems like wishful thinking at this point.

The first AGI probably won't perform a pivotal act

In realistically slow takeoff scenarios, the AGI won't be able to do anything like make nanobots to melt down GPUs. It would have to use more conventional methods, like software intrusion to sabotage existing projects, followed by elaborate monitoring to prevent new ones. Such a weak attempted pivotal act could fail, or could escalate to a nuclear conflict.

Second, the humans in charge of AGI may not have the chutzpah to even try such a thing. Taking over the world is not for the faint of heart. They might find that chutzpah after their increasingly-intelligent AGI carefully explains to them the consequences of allowing AGI proliferation, or they might not. If the people in charge are a government, the odds of such an action go up, but so do the risks of escalation to nuclear war. Governments seem to be fairly risk-taking. Expecting governments to not just grab world-changing power while they can seems naive, so this is my median scenario.

So RSI-capable AGI may proliferate until a disaster occurs

If we solve alignment and create personal intent aligned AGI but nobody manages a pivotal act, I see a likely future world with an increasing number of AGIs capable of recursively self-improving. How long until someone tells their AGI to hide, self-improve, and take over?

Many people seem optimistic about this scenario. Perhaps network security can be improved with AGIs on the job. But AGIs can do an end-run around the entire system: hide, set up self-replicating manufacturing (robotics is rapidly improving to allow this), use that to recursively self-improve your intelligence, and develop new offensive strategies and capabilities until you've got one that will work within an acceptable level of viciousness.[1] 

If hiding in factories isn't good enough, do your RSI manufacturing underground. If that's not good enough, do it as far from Earth as necessary. Take over with as little violence as you can manage or as much as you need. Reboot a new civilization if that's all you can manage while still acting before someone else does. 

The first one to pull out all the stops probably wins. This looks all too much like a non-iterated Prisoner's Dilemma with N players, and N increasing.
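To make the "N players, and N increasing" point concrete, here is a minimal back-of-the-envelope sketch (my illustration, not a model from the post). Suppose each wielder of an RSI-capable AGI independently orders a first strike with some small probability p in a given period; the values of p and N below are made up purely for illustration. The chance that at least one actor strikes is 1 - (1 - p)^N, which grows quickly with N.

```python
# Illustrative sketch (assumed parameters, not from the post): probability that
# at least one of N independent actors orders a first strike in one period,
# when each does so with probability p.

def p_someone_strikes(n_actors: int, p_defect: float) -> float:
    """Return 1 - (1 - p_defect)^n_actors, the chance at least one actor defects."""
    return 1 - (1 - p_defect) ** n_actors

for p_defect in (0.01, 0.001):
    for n_actors in (2, 9, 50, 500):
        print(f"p={p_defect:.3f}, N={n_actors:3d}: "
              f"P(someone strikes first) = {p_someone_strikes(n_actors, p_defect):.2f}")
```

With a 1-in-1000 per-actor chance per period, nine actors give roughly a 1% chance that someone strikes; five hundred actors give roughly 39%, and the risk compounds across periods. This captures only the defection incentive, not deterrence, detection, or coordination, which is part of why the counterarguments below matter.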

Counterarguments/Outs

If the number of AGIs stays small and their wielders have similar values, a collective pivotal act could be performed. I place some hope here, particularly if political pressure is applied in advance to aim for this outcome, or if the AGIs come up with better cooperation structures and/or arguments than I have.

The nuclear MAD standoff with nonproliferation agreements is fairly similar to the scenario I've described. We've survived that so far, but with only nine participants to date.

One means of preventing AGI proliferation is universal surveillance by a coalition of loosely cooperative AGI (and their directors). That might be done without universal loss of privacy if a really good publicly encrypted system were used, as Steve Omohundro suggests, but I don't know if that's possible. If privacy can't be preserved, this is not a nice outcome, but we probably shouldn't ignore it.

The final counterargument is that, if this scenario does seem likely, and this opinion spreads, people will work harder to avoid it, making it less likely. This virtuous cycle is one reason I'm writing this post including some of my worst fears.

Please convince me I'm wrong. Or make stronger arguments that this is right.

I think we can solve alignment, at least for personal-intent alignment, and particularly for the language model cognitive architectures that may well be our first AGI. But I'm not sure I want to keep helping with that project until I've resolved the likely consequences a little more. So give me a hand?

 

  1. ^

    Some maybe-less-obvious approaches to takeover, in ascending order of effectiveness: drone- or missile-delivered explosive attacks on the individuals controlling, and the data centers housing, rival AGIs; social engineering and deepfakes to set off cascading nuclear launches and reprisals; dropping stuff from orbit or altering asteroid paths; making the sun go nova. 

    The possibilities are limitless. It's harder to stop explosions than to set them off by surprise. A superintelligence will think of all of these and much better options. Anything more subtle that preserves more of the first actors' near-term winnings (earth and humanity) is gravy. The only long-term prize goes to the most vicious. 


