Is AI Alignment Enough?


Published on January 10, 2025 6:57 PM GMT

Virtually everyone I see in the AI safety community seems to believe that working on AI alignment is the key to ensuring a safe future. However, it seems to me that AI alignment is at best a secondary instrumental goal that can't in and of itself achieve our terminal goal. At worst, it's a complete distraction.

Defining humanity's terminal goal

I'll define humanity's terminal goal in the context of AI as keeping the "price" of each pivotal superhuman engineering task that an AI does for us below a 50% chance of a billion or more human deaths. These numbers come from the minimal acceptable definition of AI alignment in Yudkowsky's list of lethalities.

It'd be more precise to say that we care about an AI killing over a billion people or doing something else that is equally or more horrible by human standards. I can't define exactly what those horrible actions are (if I could, alignment would perhaps be halfway solved). This disclaimer is also too long to keep writing out, so please mentally add "or do something equally horrible by human standards" every time you see "kill over a billion people" below.
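Stated compactly (this is just my restatement of the definition above, with "catastrophe" abbreviating "a billion or more deaths, or something equally horrible by human standards"):

$$P(\text{catastrophe} \mid \text{the AI performs pivotal task } T) \;<\; 0.5 \quad \text{for every pivotal superhuman engineering task } T$$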

Defining humanity's instrumental goals

Logically, there are only two paths to ensuring that the price of a pivotal superhuman engineering task does not become excessive.

1. Achieving AI alignment: I'll again use Yudkowsky's minimal definition of "aligned" as the AI having less than a 50% chance of killing over a billion people per pivotal superhuman engineering task. This is the same definition as in our terminal goal.
2. Achieving human alignment: Ensuring that humanity will not build (let alone turn on) an AI that has the cognitive power to have a 50% chance of killing over a billion people unless the following two conditions are met:
   1. The AI can be proven to be aligned before it's turned on.
   2. There's some pivotal superhuman engineering task that the AI will be capable of that's worth taking the risk for.

Assumptions

Why human alignment is the primary instrumental goal

Suppose that the above AI alignment goals are immediately achieved, today. Yudkowsky makes the point that this will not prevent AI labs that don't care about alignment from building an unaligned superintelligent AI. Therefore, an AI with a 50% or higher chance of killing over a billion people will still get built and turned on, and we will have failed in our terminal goal.

Yudkowsky solves this problem with a 'pivotal act', something you can get the aligned superintelligent AI to do to prevent any other labs from building unaligned AI. This shows that achieving AI alignment is not enough - you must then have a plan to do something with it. That something has to prevent humanity from building AIs with greater and greater cognitive powers with reckless abandon. If you tell the aligned superintelligent AI to "burn all GPUs" (to borrow Yudkowsky's example), what you're actually doing is achieving human alignment by force rather than by persuasion. I'm not saying this in a condemning manner at all; I'm just pointing out that this is another path to human alignment. In the end, it's the human alignment that's necessary to achieve our terminal goal.

On the other hand, if human alignment is achieved without AI alignment, then humanity will prevent a superintelligent AI from getting built until it can be proven to be aligned before it's turned on (which may be never). The terminal goal is satisfied.

In summary, human alignment is both necessary and sufficient to achieve our terminal goals. AI alignment is only possibly useful as a secondary instrumental goal to bring about human alignment.

What about always-on AIs?

This section doesn't have an effect on the overall argument, but I'm including it for logical completeness.

In addition to AIs used for pivotal superhuman engineering tasks, humanity will also have some (hopefully weaker) AIs constantly running, just maintaining stuff. We need a threshold for how much risk we will tolerate from those always-on AIs, which may not have the cognitive power to do pivotal superhuman engineering tasks but may well still be capable of killing us all. I'll define our second terminal goal as having a lower than 1% chance of a billion or more human deaths per year from the combined total of these always-on AIs (the numbers are completely arbitrary).

To differentiate between the two types of AIs defined in the two terminal goals, I'll use the terms "superintelligent AIs" for the AIs that would be used for pivotal superhuman engineering tasks and "always-on AIs" for the AIs that would be always on. Unfortunately, there could be some overlap between the two sets, as there's nothing theoretically stopping humanity from keeping an AI capable of pivotal superhuman engineering tasks always on...

We'll define instrumental goals for the always-on AIs:

1. Achieving always-on AI alignment: There are many always-on AIs, and their failures are almost certainly not independent variables, but somehow we work it out so that the combined set of all of them has a less than 1% chance of killing over a billion people in any given year (see the sketch after this list). If we can achieve that, we call that combined set of always-on AIs aligned.
2. Achieving human alignment: Ensuring that humanity will not continually run a combined total set of AIs that has a 1% or higher chance of killing over a billion people a year.
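To make the budgeting arithmetic concrete, here is a minimal sketch of how per-AI annual risks would combine if failures were independent. This is my own illustration, not something from the post; the per-AI risk of 0.01% and the count of 200 are made-up numbers.

```python
def combined_annual_risk(per_ai_risks):
    """Chance that at least one always-on AI causes a catastrophe in a given
    year, under the (optimistic) assumption that failures are independent."""
    p_no_catastrophe = 1.0
    for p in per_ai_risks:
        p_no_catastrophe *= (1.0 - p)
    return 1.0 - p_no_catastrophe

# Hypothetical fleet: 200 always-on AIs, each with a "tiny" 0.01% annual risk.
risks = [0.0001] * 200
print(f"Combined annual risk: {combined_annual_risk(risks):.2%}")  # ~1.98%, already past the 1% budget
```

Because real failures are correlated (say, a shared flaw in a common base model), the combined risk can't actually be computed from per-AI numbers like this; it has to be assessed for the combined set directly, which is how the first instrumental goal above is stated.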

Without human alignment, there's nothing stopping labs from keeping an arbitrary number of superintelligent AIs that should normally be reserved for pivotal superhuman engineering tasks always on, thereby skyrocketing the risk. Therefore, human alignment is once again the necessary and sufficient condition for achieving our terminal goals.

Focus on human alignment

It's important for us to understand that AI alignment alone will not achieve our terminal goal(s).

Fast human alignment is possible without AI

I purposefully chose the cover of Hobbes' 1651 Leviathan as the preview image. The Leviathan is the origin of social contract theory, and the front cover depicts the central idea of the book - many individual humans becoming a single entity - we could say becoming "aligned" as a single will. Hobbes wrote the Leviathan in response to the shock of witnessing the brutality of the English Civil War. To me, the Leviathan is a cry that anything, including submitting unconditionally to absolute power, is better than the horror of war.

The closest thing to Hobbes' experience for me personally is being born in the former Soviet Union. I'll relay my understanding of the Soviet experience that I absorbed through osmosis. If there are people here with more knowledge and experience, please correct me if I got it wrong.

Imagine that, from a very early age, you were shown a tapestry depicting some glorious future of humanity. You were taught that you would contribute to that vision once you grew up. Then you grow up, full of bright-eyed dreams, and when you pull back the tapestry, there is a giant meatgrinder. When the Party tells you to get in the meatgrinder, you get in. You don't ask how people getting ground in the meatgrinder will contribute to the future on the tapestry. You don't ask whether getting in the meatgrinder is the best use of your talents and aspirations as a human being. You let the Party worry about the future; you just worry about obeying the Party.

I believe that most of the revolutionaries of 1917 were full of good intentions to save humanity from what they saw as the meatgrinder of capitalism, and they really didn't intend to build an even worse meatgrinder themselves. Some may argue that human meatgrinders don't last forever, so it's better than extinction. To that I respond that AI safety may become associated with human meatgrinders in the same way communism now is in Eastern Europe. And if the idea of AI safety becomes resented on a visceral level by a large enough number of people, then humanity is probably still getting atomized by nanobots, just with extra steps.

It's intellectually dishonest to say that human alignment is impossible. Convincing people through rational arguments is not the only way to achieve human alignment, nor even among the ten most historically used ways. However, achieving global human alignment at the required speeds could easily end up worse than just getting atomized by nanobots. Still, if there's a viable path to achieving human alignment somewhere between "arguing with idiots on Twitter" and "literally 1984", then this could still be a surviving world. Survival is just not always pretty...



