Is AI Alignment Enough?


Published on January 10, 2025 6:57 PM GMT

Virtually everyone I see in the AI safety community seems to believe that working on AI alignment is the key to ensuring a safe future. However, it seems to me that AI alignment is at best a secondary instrumental goal that can't in and of itself achieve our terminal goal. At worst, it's a complete distraction.

Defining humanity's terminal goal

I'll define humanity's terminal goal in the context of AI as keeping the "price" of each pivotal superhuman engineering task that an AI does for us below a 50% chance of a billion or more human deaths. These numbers come from the minimal acceptable definition of AI alignment in Yudkowsky's list of lethalities.

It'd be more precise to say that we care about an AI killing over a billion people or doing something else that is equally or more horrible by human standards. I can't define exactly what those horrible actions are (if I could, alignment would perhaps be halfway solved). This disclaimer is also too long to keep writing out, so please mentally add "or do something equally horrible by human standards" every time you see "kill over a billion people" below.
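Stated compactly (this is just my restatement of the definition above, with "catastrophe" abbreviating "a billion or more deaths, or something equally horrible by human standards"):

$$P(\text{catastrophe} \mid \text{the AI performs pivotal task } T) \;<\; 0.5 \quad \text{for every pivotal superhuman engineering task } T$$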

Defining humanity's instrumental goals

Logically, there are only two paths to ensuring that the price of a pivotal superhuman engineering task does not become excessive.

1. Achieving AI alignment: I'll again use Yudkowsky's minimal definition of "aligned" as the AI having less than a 50% chance of killing over a billion people per pivotal superhuman engineering task. This is the same definition as in our terminal goal.
2. Achieving human alignment: Ensuring that humanity will not build (let alone turn on) an AI that has the cognitive power to have a 50% chance of killing over a billion people unless the following two conditions are met:
   1. The AI can be proven to be aligned before it's turned on.
   2. There's some pivotal superhuman engineering task that the AI will be capable of that's worth taking the risk for.

Assumptions

Why human alignment is the primary instrumental goal

Suppose that the above AI alignment goals are immediately achieved, today. Yudkowsky makes the point that this will not prevent AI labs that don't care about alignment from building an unaligned superintelligent AI. Therefore, an AI with a 50% or higher chance of killing over a billion people will still get built and turned on, and we will have failed in our terminal goal.

Yudkowsky solves this problem with a 'pivotal act', something you can get the aligned superintelligent AI to do to prevent any other labs from building unaligned AI. This shows that achieving AI alignment is not enough - you must then have a plan to do something with it. That something has to prevent humanity from building AIs with greater and greater cognitive powers with reckless abandon. If you tell the aligned superintelligent AI to "burn all GPUs" (to borrow Yudkowsky's example), what you're actually doing is achieving human alignment by force rather than by persuasion. I'm not saying this in a condemning manner at all; I'm just pointing out that this is another path to human alignment. In the end, it's the human alignment that's necessary to achieve our terminal goal.

On the other hand, if human alignment is achieved without AI alignment, then humanity will prevent a superintelligent AI from getting built until it can be proven to be aligned before it's turned on (which may be never). The terminal goal is satisfied.

In summary, human alignment is both necessary and sufficient to achieve our terminal goals. AI alignment is only possibly useful as a secondary instrumental goal to bring about human alignment.

What about always-on AIs?

This section doesn't have an effect on the overall argument, but I'm including it for logical completeness.

In addition to AIs used for pivotal superhuman engineering tasks, humanity will also have some (hopefully weaker) AIs constantly running, just maintaining stuff. We need a threshold for how much risk we will tolerate from those always-on AIs, which may not have the cognitive power to do pivotal superhuman engineering tasks but may well still be capable of killing us all. I'll define our second terminal goal as having a lower than 1% chance of a billion or more human deaths per year from the combined total of these always-on AIs (the numbers are completely arbitrary).

To differentiate between the two types of AIs defined in the two terminal goals, I'll use the terms "superintelligent AIs" for the AIs that would be used for pivotal superhuman engineering tasks and "always-on AIs" for the AIs that would be always on. Unfortunately, there could be some overlap between the two sets, as there's nothing theoretically stopping humanity from keeping an AI capable of pivotal superhuman engineering tasks always on...

We'll define instrumental goals for the always-on AIs:

1. Achieving always-on AI alignment: There are many always-on AIs, and their failures are almost certainly not independent variables, but somehow we work it out so that the combined set of all of them has a less than 1% chance of killing over a billion people in any given year (see the sketch after this list). If we can achieve that, we call that combined set of always-on AIs aligned.
2. Achieving human alignment: Ensuring that humanity will not continually run a combined total set of AIs that has a 1% or higher chance of killing over a billion people a year.
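To make the budgeting arithmetic concrete, here is a minimal sketch of how per-AI annual risks would combine if failures were independent. This is my own illustration, not something from the post; the per-AI risk of 0.01% and the count of 200 are made-up numbers.

```python
def combined_annual_risk(per_ai_risks):
    """Chance that at least one always-on AI causes a catastrophe in a given
    year, under the (optimistic) assumption that failures are independent."""
    p_no_catastrophe = 1.0
    for p in per_ai_risks:
        p_no_catastrophe *= (1.0 - p)
    return 1.0 - p_no_catastrophe

# Hypothetical fleet: 200 always-on AIs, each with a "tiny" 0.01% annual risk.
risks = [0.0001] * 200
print(f"Combined annual risk: {combined_annual_risk(risks):.2%}")  # ~1.98%, already past the 1% budget
```

Because real failures are correlated (say, a shared flaw in a common base model), the combined risk can't actually be computed from per-AI numbers like this; it has to be assessed for the combined set directly, which is how the first instrumental goal above is stated.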

Without human alignment, there's nothing stopping labs from keeping an arbitrary number of superintelligent AIs that should normally be reserved for pivotal superhuman engineering tasks always on, thereby skyrocketing the risk. Therefore, human alignment is once again the necessary and sufficient condition for achieving our terminal goals.

Focus on human alignment

It's important for us to understand that AI alignment alone will not achieve our terminal goal(s).

Fast human alignment is possible without AI

I purposefully chose the cover of Hobbes' 1651 Leviathan as the preview image. The Leviathan is the origin of social contract theory, and the front cover depicts the central idea of the book - many individual humans becoming a single entity - we could say becoming "aligned" as a single will. Hobbes wrote the Leviathan in response to the shock of witnessing the brutality of the English Civil War. To me, the Leviathan is a cry that anything, including submitting unconditionally to absolute power, is better than the horror of war.

The closest thing to Hobbes' experience for me personally is being born in the former Soviet Union. I'll relay my understanding of the Soviet experience that I absorbed through osmosis. If there are people here with more knowledge and experience, please correct me if I got it wrong.

Imagine that, from a very early age, you were shown a tapestry depicting some glorious future of humanity. You were taught that you would contribute to that vision once you grew up. Then you grow up, full of bright-eyed dreams, and when you pull back the tapestry, there is a giant meatgrinder. When the Party tells you to get in the meatgrinder, you get in. You don't ask how people getting ground in the meatgrinder will contribute to the future on the tapestry. You don't ask whether getting in the meatgrinder is the best use of your talents and aspirations as a human being. You let the Party worry about the future; you just worry about obeying the Party.

I believe that most of the revolutionaries of 1917 were full of good intentions to save humanity from what they saw as the meatgrinder of capitalism, and they really didn't intend to build an even worse meatgrinder themselves. Some may argue that human meatgrinders don't last forever, so it's better than extinction. To that I respond that AI safety may become associated with human meatgrinders in the same way communism now is in Eastern Europe. And if the idea of AI safety becomes resented on a visceral level by a large enough number of people, then humanity is probably still getting atomized by nanobots, just with extra steps.

It's intellectually dishonest to say that human alignment is impossible. Convincing people through rational arguments is not the only way to achieve human alignment, nor even among the ten most historically used ways. However, achieving global human alignment at the required speeds could easily end up worse than just getting atomized by nanobots. Still, if there's a viable path to achieving human alignment somewhere between "arguing with idiots on Twitter" and "literally 1984", then this could still be a surviving world. Survival is just not always pretty...



