Intent alignment as a stepping-stone to value alignment

 

This article examines the differences and connections between instruction-following artificial general intelligence (AGI) and value-aligned AGI. The author argues that instruction-following AGI is easier to build and safer, and can serve as a transitional stage on the way to value-aligned AGI. Although instruction-following AGI can solve many problems, the power structures through which humans control it remain vulnerable to malicious exploitation, so the ultimate goal is still value-aligned superintelligence. The author suggests using human-controlled, instruction-following AGI to work through the value alignment problem step by step and ultimately preserve human values, but this will not be easy and requires careful planning and attention to potential risks.

🤔 **Instruction-following AGI is easier to build and safer:** Compared with value-aligned AGI, instruction-following AGI is easier to achieve, and because it is corrigible it is safer in the early stages, helping avoid potentially catastrophic outcomes.

🧑‍🏫 **Instruction-following AGI can be a bridge to value-aligned AGI:** Humans can use instruction-following superintelligent AGI to solve the value alignment problem, much like delegating your homework to an assistant smarter than you are.

⚠️ **Human-controlled power structures carry risk:** Even with instruction-following AGI, the power structures through which humans control it can still be exploited by malicious actors, turning superintelligence toward destructive ends and threatening human survival.

💡 **The end goal is still value-aligned AGI:** Instruction-following AGI can serve as a transitional stage, but the ultimate goal remains value-aligned superintelligence, to ensure the continuation of human values and the flourishing of human civilization.

⏳ **Careful planning and risk management are needed:** Achieving value-aligned AGI will not be easy; it requires careful planning and attention to potential risks, so that humans can control and steer the direction in which superintelligence develops.

Published on November 5, 2024 8:43 PM GMT

I think Instruction-following AGI is easier and more likely than value aligned AGI, and that this accounts for one major crux of disagreement on alignment difficulty. I got several responses to that piece that didn't dispute that intent alignment is easier, but argued we shouldn't give up on value alignment. I think that's right. Here's another way to frame the value of personal intent alignment: we can use a superintelligent instruction-following AGI to solve full value alignment.

This is different from automated alignment research; it's not hoping tool AI can help with our homework, it's making an AGI smarter than us in every way do our homework for us. It's a longer-term plan. Having a superintelligent, largely autonomous entity that just really likes taking instructions from puny humans is counterintuitive, but it seems both logically consistent and technically achievable on the current trajectory - if we don't screw it up too badly.

Personal, short-term intent alignment (like instruction-following) is safer for early AGI because it includes corrigibility. It allows near-misses. If your AGI did think eliminating humans would be a good way to cure cancer, but it's not powerful enough to make that happen immediately, you'll probably get a chance to say "so what's your plan for that cancer solution?" and "Wait no! Quit working on that plan!" (And that's if you somehow didn't tell it to check with you before acting on big plans).
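To make that checking behavior concrete, here is a deliberately minimal sketch in Python of an approval gate sitting between plan generation and execution. Everything in it (the `Plan` and `InstructionFollowingAgent` names, the `high_impact` flag) is hypothetical and invented for illustration; it shows only the control flow described above, not how a real AGI would be built.

```python
from dataclasses import dataclass

# Toy illustration only. All names here are hypothetical; the post does not
# specify an implementation. The point is just that every high-impact plan is
# surfaced to the principal, and a withheld approval stops the plan.

@dataclass
class Plan:
    goal: str          # the instruction this plan is a subgoal of
    steps: list[str]   # proposed actions, shown to the principal before execution
    high_impact: bool  # large or hard-to-reverse plans always get reviewed

class InstructionFollowingAgent:
    def propose(self, instruction: str) -> Plan:
        # In a real system this would be the hard part; here it is a stub.
        return Plan(goal=instruction,
                    steps=[f"draft an approach for: {instruction}"],
                    high_impact=True)

    def execute(self, plan: Plan, approved: bool) -> str:
        # Because the agent's goal is doing what the principal asks, a withheld
        # approval is a reason to stop, not an obstacle to route around.
        if plan.high_impact and not approved:
            return "plan shelved; awaiting revised instructions"
        return f"executing: {plan.steps}"

agent = InstructionFollowingAgent()
plan = agent.propose("cure cancer")
print(plan.steps)                           # the principal inspects the plan first
print(agent.execute(plan, approved=False))  # "Wait no! Quit working on that plan!"
```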

This type of target really seems to make alignment much easier. See the first linked post, or Max Harms' excellent sequence on corrigibility as a singular (alignment) target (CAST) for a much deeper analysis. An AI that wants to follow directions also wants to respond honestly about its motivations when asked, and to change its goals when told to - because its goals are all subgoals of doing what its principal asks. And this approach doesn't have to "solve ethics" - because it follows the principal's ethics.

And that's the critical flaw; we're still stuck with variable and questionable human ethics. Having humans control AGI is not a permanent solution to the dangers of AGI. Even if the first creators are relatively well-intentioned, eventually someone sociopathic enough will get the reins of a powerful AGI and use it to seize the future.

In this scenario, technical alignment is solved, but most of us die anyway. We die as soon as a sufficiently malevolent person acquires or seizes power (probably governmental power) over an AGI.

But won't a balance of power restrain one malevolently-controlled AGI surrounded by many in good hands? I don't think so. Mutually assured destruction works for nukes but not as well with AGI capable of autonomous recursive self-improvement. A superintelligent AGI will probably be able to protect at least its principal and a few of their favorite people as part of a well-planned destructive takeover. If nobody else has yet used their AGI to firmly seize control of the lightcone, there's probably a way for an AGI to hide and recursively self-improve until it invents weapons and strategies that let it take over - if its principal can accept enough collateral damage. With a superintelligence on your side, building a new civilization to your liking might be seen as more an opportunity than an inconvenience.

These issues are discussed in more depth in *If we solve alignment, do we die anyway?* and its discussion. *To the average human, controlled AI is just as lethal as 'misaligned' AI* draws similar conclusions from a different perspective.

It seems inevitable that someone sufficiently malevolent would eventually get the reins of an intent-aligned AGI. This might not take long even if AGI does not proliferate widely; there are reasons to think that malevolence could correlate with attaining and retaining positions of power. Maybe there's a way to prevent this with the aid of increasingly intelligent AGIs; if not, it seems like taking power out of human hands before it falls into the wrong ones will be necessary.

Writing *If we solve alignment, do we die anyway?* and discussing the claims in the comments drew me to the conclusion that the end goal probably needs to be value alignment, just like we've always thought - human power structures are too vulnerable to infiltration or takeover by malevolent humans. But instruction-following is a safer first alignment target. So it can be a stepping-stone that dramatically improves our odds of getting to value-aligned AGI.

Humans in control of highly intelligent AGI will have a huge advantage in solving the full value alignment problem. At some point, they will probably be fairly confident that it can be accomplished, at least well enough to maintain much of the value of the lightcone by human lights (perfect alignment seems impossible since human values are path-dependent, but we should be able to do pretty well).

Thus, the endgame goal is still full value alignment for superintelligence, but the route there is probably through short-term personal intent alignment.

Is this a great plan? Certainly not. It hasn't been thought through, and there's probably a lot that can go wrong even once it's as refined as possible. In an easier world, we'd Shut it All Down until we're ready to do it wisely. That doesn't look like an option, so I'm trying to plot a practically achievable path from where we are to real success.



Discuss
