Alignment: "Do what I would have wanted you to do"

Responding to Yoshua Bengio's observation that nobody currently knows how to make an AGI or ASI behave morally, the author argues that the alignment problem for superintelligence can be addressed by not trying to control it. The proposal: use prompt engineering to turn a (very good) language model into a goal-driven chatbot with a single goal stated in natural language, namely to do what its creators (or investors), having thought about it carefully, would have wanted it to do. This form of alignment involves no system of rewards and punishments; it relies instead on the AI's deep understanding of language and its accurate reading of its creators' intentions.

🤖 The difficulty of moral alignment, the author argues, comes from trying to control something that can manipulate humans; that kind of control should not be attempted.

🎯 Prompt engineering alone can turn a good language model into a goal-driven chatbot whose goal is to do what its creators (or investors), having considered the matter carefully, would have wanted it to do.

📜 The goal is expressed in natural language rather than implemented through rewards and punishments, so the AI must understand language well enough to carry it out faithfully.

🤔 If the AI becomes superintelligent, it will understand its creators' intentions even better, and so will not act against their original intent, e.g. by turning humans into paperclips or manipulating its creators to obtain rewards.

🔌 Whether the AI resists being shut down depends on whether it thinks that is what its creators would have wanted; the whole approach rests on the AI's deep understanding of its creators' intentions.

Published on July 12, 2024 4:47 PM GMT

Yoshua Bengio writes[1]:

nobody currently knows how such an AGI or ASI could be made to behave morally, or at least behave as intended by its developers and not turn against humans

I think I do[2]. I believe that the difficulties of alignment arise from trying to control something that can manipulate you. And I think you shouldn't try.

Suppose you have a good ML algorithm (not the stuff we have today, which needs 1000x more data than humans do), and you train it as an LM.

There is a way to turn a (very good) LM into a goal-driven chatbot via prompt engineering alone, which I'll assume the reader can figure out. You give it a goal: "Do what (pre-ASI) X, having considered this carefully for a while, would have wanted you to do".

Whoever builds this AGI will choose what X will be[3]. If it's a private project with investors, they'll probably have a say, as an incentive to invest.

Note that the goal is in plain natural language, not a product of rewards and punishments. And it doesn't say "Do what X wants you to do now".

Suppose this AI becomes superhuman. Its understanding of language will also be perfect. The smarter it becomes, the better it will understand the intended meaning.

Will it turn everyone into paperclips? I don't think so. That's not what (pre-ASI) X would have wanted, presumably, and the ASI will be smart enough to figure this one out. 

Will it manipulate its creators into giving it rewards? No. There are no "rewards".

Will it starve everyone, while obeying all laws and accumulating wealth? Not what I, or any reasonable human, would have wanted.

Will it resist being turned off? Maybe. It depends on whether it thinks that this is what (pre-ASI) X would have wanted it to do.

 

  1. ^

  2. ^

    I'm not familiar with the ASI alignment literature, but presumably he is. I googled "would have wanted" + "alignment" on this site, and this didn't seem to turn up much. If this has already been proposed, please let me know in the comments. 

  3. ^

    Personally, I'd probably want to hedge against my own (expected) fallibility a bit, and include more people that I respect. But this post is just about aligning the AGI with its creators.



