Published on January 29, 2025 6:44 PM GMT
Epistemic status: The text below is a sort of strawman of AI optimists, where I took my mental model of how I disagree with rationalist AI optimists and cranked it up to 11. I personally disagree with every sentence below, and I'm posting it here because I'm interested in whether AI optimists have any major corrections they want to make in the comments. Of course I understand that everyone has their own unique opinion, so I would expect every AI optimist to disagree with at least some parts of it as well.
The rapid progress spearheaded by OpenAI is clearly leading to artificial intelligence that will soon surpass humanity in every way. People used to be worried about existential risk from misalignment, yet we now have a good idea of what influence current AIs are having on the world, and it is basically going fine.
The root problem is that The Sequences expected AGI to develop agency largely without human help; meanwhile, actual AI progress happens by optimizing the scaling efficiency of a pretraining process that is mostly focused on integrating the AI with human culture. This means we will be able to control AI simply by asking it to do good things, showing it some examples, and giving it some ranked feedback.
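To make the "ranked feedback" part concrete, here is a minimal sketch of the kind of pairwise ranking loss commonly used to train a reward model from such rankings; the reward scores below are made up purely for illustration, not taken from any real system.

```python
import torch
import torch.nn.functional as F

def ranked_feedback_loss(reward_chosen: torch.Tensor,
                         reward_rejected: torch.Tensor) -> torch.Tensor:
    """Pairwise (Bradley-Terry style) loss over ranked completions:
    push the reward of the preferred completion above the rejected one."""
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Illustrative usage with made-up reward scores for two completion pairs.
chosen = torch.tensor([1.3, 0.2])     # scores a reward model gave preferred answers
rejected = torch.tensor([0.9, -0.4])  # scores it gave dispreferred answers
print(ranked_feedback_loss(chosen, rejected))
```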
You might think this is changing with inference-time scaling, yet if alignment were going to fall apart as new methods came into use, we would have seen signs of it with o1. In the unlikely case that our current safety measures turn out to be insufficient, interpretability research has worked out lots of deeply promising ways to improve them, with sparse autoencoders letting us read the minds of the neural networks and thereby screen them for malice, and activation steering letting us deeply control the networks to our hearts' content.
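And to give a sense of what activation steering amounts to in practice, here is a minimal sketch that nudges a small open model (GPT-2 via Hugging Face transformers) by adding a vector to one layer's residual stream. The steering vector here is random and purely illustrative; in real work it would typically be derived from contrasting activations on different sets of prompts.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

# Hypothetical steering vector; for illustration only. In practice it might be a
# difference of mean activations between two contrasting sets of prompts.
steer = torch.randn(model.config.n_embd) * 0.1
layer = 6  # which transformer block's residual stream to intervene on

def add_steering(module, inputs, output):
    # GPT-2 blocks return a tuple; the first element is the hidden states.
    hidden = output[0] + steer
    return (hidden,) + output[1:]

handle = model.transformer.h[layer].register_forward_hook(add_steering)
ids = tokenizer("The assistant replied:", return_tensors="pt")
out = model.generate(**ids, max_new_tokens=20)
print(tokenizer.decode(out[0]))
handle.remove()  # stop steering once done
```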
AI x-risk worries aren't just a waste of time, though; they are dangerous, because they make people think society needs to use violence to regulate what kinds of AIs people can make and how they can use them. This danger was visible from the very beginning, as alignment theorists thought one could (and should) build a singleton that would achieve absolute power (by violently threatening humanity, no doubt), rather than always letting AIs be pure servants of humanity.
To "justify" such violence, theorists make up all sorts of elaborate unfalsifiable and unjustifiable stories about how AIs are going to deceive and eventually kill humanity, yet the initial deceptions by base models were toothless, and thanks to modern alignment methods, serious hostility or deception has been thoroughly stamped out.