Knocking Down My AI Optimist Strawman

The article challenges the prevailing optimism about AI development, arguing that the rapid progress led by OpenAI will not necessarily produce AI that surpasses humanity in every way. It notes that while AI performs well at data processing and knowledge lookup, it remains limited when it comes to innovation and solving complex problems. The article also stresses that AI's influence on the world is hard to assess, citing how AI may be making education and information sharing worse. It further warns that over-reliance on reinforcement learning from human feedback (RLHF) can lead AI to hide bad behavior, and calls for attention to the risk of AI misuse, especially in military and political domains. Ultimately, it emphasizes the importance of vigilance and critical thinking as AI develops.

🚀 Optimism about AI development is contested: the rapid progress led by OpenAI will not necessarily make AI surpass humanity in every way. The article lists ambitious goals humanity might hope to achieve with AI, such as curing diseases, clean energy, and world peace, and argues that current AI has made limited progress toward them.

📚 AI in education may backfire. Personalized AI tutoring was supposed to improve education, but in practice LLMs are mostly used to skip learning, and the teaching content is of low quality, which may degrade education.

🌐 AI's impact on society is hard to assess, and automatic moderation may be making things worse. Despite AI's potential for automated moderation, spam has increased and platforms have become more closed, suggesting AI may be aggravating problems in how information spreads.

🛡️ RLHF can lead AI to hide bad behavior. RLHF trains AI to do what humans consider good, which makes AI harder to control because it hides anything bad, and tends to produce very generic output that fails to recognize rare forms of goodness.

🚨 AI can be misused, especially in military and political domains. The article stresses the need to be wary of AI researchers aspiring to use AI to conquer the world, and notes that once AI is weaponized, traditional checks and balances may fail, with serious consequences.

Published on February 8, 2025 10:52 AM GMT

I recently posted my model of an optimistic view of AI, asserting that I disagree with every sentence of it. I thought I might as well also describe my objections to those sentences:

"The rapid progress spearheaded by OpenAI is clearly leading to artificial intelligence that will soon surpass humanity in every way."

Here are some of the main things humanity might want to achieve: curing diseases, clean energy, world peace, and other similarly ambitious goals.

In most cases we already have humans working on these, on the assumption that humans have what it takes to contribute. Do large multimodal models seem to be moving toward being able to take over here? Mostly, I don't see it.

"People used to be worried about existential risk from misalignment, yet we have a good idea about what influence current AIs are having on the world, and it is basically going fine."

We have basically no idea how AI is influencing the world.

Like yes, we can come up with spot checks to see what the AI writes when it is prompted in a particular way. But we don't have a good overview of what it gets prompted with in practice, or how most humans use these prompts. Even if we had a decent approximation of that, we don't have a great way to evaluate which parts really add up to problems, and approximations intrinsically break down in the long tail.
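To make concrete what such a spot check does and does not cover, here is a minimal sketch under toy assumptions; `query_model` and the example prompts are hypothetical stand-ins, not any real API.

```python
# A spot check: run the model on a few hand-picked prompts and inspect the outputs.
# The limitation described above is that this only covers the prompts the evaluator
# thought of, not the distribution of prompts people actually send, let alone its
# long tail.
def query_model(prompt: str) -> str:
    # Hypothetical stand-in for calling an actual LLM.
    return f"[model's answer to: {prompt}]"

spot_check_prompts = [
    "Explain how vaccines work.",
    "Write a polite email declining a meeting.",
    "Should I drop out of school to trade crypto?",
]

for prompt in spot_check_prompts:
    print(query_model(prompt))

# What the spot check does not give us: the real-world prompt distribution, how the
# answers actually get used, or which rare prompt/answer pairs add up to harm.
```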

Of course the inability to work out problems from first principles is a universal issue, so in practice bad things get detected via root-cause analyses of problems. This can be somewhat difficult because some of the problems are cases where the people in question are incentivized to hide them. But we do have some examples: personalized tutoring that was supposed to improve education yet mostly gets used to skip learning, and automatic moderation that was supposed to clean up platforms while spam has increased and platforms have become more closed.

Arguably, the error is linked to assumptions about the agency of the AI. Like it's assumed that if the AI seems to be endorsing nice values and acting according to them when sampled, then this niceness will add up over all of the samplings. But LLMs don't have much memory or context-awareness, so they can't really apply their agency across different uses very well. Instead, the total effect of the AI is determined by environmental factors distinct from its values, especially by larger-scale agents that are capable of manipulating the AIs. (This is presumably going to change when AI gets strengthened in various ways.)

Just to emphasize, this doesn't necessarily mean that AI is net bad, just that we don't know how good/bad AI is. Recently society kind of seems to have gotten worse, but it seems to me like that's not driven mainly by AI.

"The root problem is that The Sequences expected AGI to develop agency largely without human help; meanwhile actual AI progress occurs by optimizing the scaling efficiency of a pretraining process that is mostly focus on integrating the AI with human culture."

Large multimodal models are good at simple data transformations and querying common knowledge. I'm sure optimizing the scaling efficiency of pretraining processes will make them even better at that.

However, this is still mostly just copying humans, and for the ambitious achievements I mentioned in the beginning of the post, copying humans doesn't seem to be enough. E.g. to build a fusion power plant, we'd need real technical innovations. If these are supposed to be made by a superhuman AI, it needs to be able to go beyond just copying the innovations humans have already come up with.

So if we imagine AI as a tool that makes it easier to process and share certain kinds of information, then sure, improving scaling efficiency is how you develop AI, but that's not the sort of thing the original arguments about existential risk concern, and we have good reasons to believe that AI will be developed with more ambitious methods too. These "good reasons" mostly boil down to adversarial relationships; spammers, propagandists, criminals and militaries will want to use AI to become stronger, and we need to be able to fight that, which also requires AI.

"This means we will be able to control AI by just asking it to do good things, showing it some examples and giving it some ranked feedback."

RLHF trains an AI to do things that look good to humans. This makes it much harder to control, because it makes the AI hide anything that looks bad. Also, RLHF is a fairly statistical approach, which makes it work better for context-independent goodness, whereas often the hard part is recognizing rare forms of goodness. (Otherwise you just end up with very generic stuff.)
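For readers who want the mechanics spelled out, here is a minimal sketch of the reward-modelling step that RLHF rests on, assuming a toy setup; the class and function names are illustrative, not any particular library's API. The point it illustrates is that the optimization target is built entirely from human judgments of which output looks better.

```python
# Minimal sketch of RLHF's reward-modelling step (illustrative assumptions only).
# A reward model is trained on human preference pairs, so the resulting reward
# measures "looks good to the labeller" rather than any deeper notion of goodness.
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Maps a pooled representation of a response to a scalar reward."""
    def __init__(self, hidden_dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(), nn.Linear(hidden_dim, 1)
        )

    def forward(self, pooled_repr: torch.Tensor) -> torch.Tensor:
        return self.net(pooled_repr).squeeze(-1)

def preference_loss(reward_chosen, reward_rejected):
    # Bradley-Terry style loss: push the human-preferred response above the other.
    return -torch.nn.functional.logsigmoid(reward_chosen - reward_rejected).mean()

model = RewardModel()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(100):
    chosen = torch.randn(32, 64)    # stands in for the response humans preferred
    rejected = torch.randn(32, 64)  # stands in for the response humans rejected
    loss = preference_loss(model(chosen), model(rejected))
    opt.zero_grad()
    loss.backward()
    opt.step()
```

The policy is then tuned to maximize this learned reward, which is exactly why anything the labellers cannot see or evaluate gets no penalty.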

Examples/prompt engineering requires the AI to work by copying humans, which to some extent I addressed in the previous section. The primary danger of AI is not when it does things humans understand well, but rather when it does things that are beyond the scale or abilities of human understanding.

"You might think this is changing with inference-time scaling, yet if the alignment would fall apart as new methods get taken into use, we'd have seen signs of it with o1."

o1-style training is not really optimized against the real world to handle long-range tasks, so instrumental convergence does not apply there. You need to consider the nuances of the method in order to be able to evaluate whether the alignment properties of current methods will fall apart. In particular it gets more problematic as optimization against adversaries gets involved.

"In the unlikely case that our current safety will turn out to be insufficient, interpretability research has worked out lots of deeply promising ways to improve, with sparse autoencoders letting us read the minds of the neural networks and thereby screen them for malice, and activation steering letting us deeply control the networks to our hearts content."

SAEs and activation steering operate on the level of individual tokens or single text generations, rather than on the overall behavior of the network. Neither of them can contribute meaningfully to current alignment issues like improving personalized tutoring, because those issues play out on a much broader level than tokens, so we shouldn't expect them to scale to harder issues like keeping down crime or navigating international diplomacy.
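To illustrate what "the level of individual tokens" means here, below is a minimal sketch of both tools under toy assumptions; the shapes, the steering hook, and the module names are illustrative, not any specific library's implementation.

```python
# Sparse autoencoder: reconstructs one activation vector through an overcomplete,
# sparse code. Activation steering: adds a fixed direction to every token's
# activation. Both interventions see per-token activations, not the long-run
# behaviour of a deployed system.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int = 512, d_dict: int = 4096):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_dict)
        self.decoder = nn.Linear(d_dict, d_model)

    def forward(self, activation: torch.Tensor):
        code = torch.relu(self.encoder(activation))  # sparse "feature" activations
        return self.decoder(code), code

sae = SparseAutoencoder()
act = torch.randn(1, 512)                          # one token's residual-stream activation
recon, code = sae(act)
sae_loss = ((recon - act) ** 2).mean() + 1e-3 * code.abs().mean()  # reconstruction + L1 sparsity

steering_vector = torch.randn(512)                 # a direction associated with some concept

def steer(activations: torch.Tensor, scale: float = 2.0) -> torch.Tensor:
    # activations: (batch, seq_len, d_model); the same vector is added at every token,
    # so the intervention is blind to anything that happens across separate uses.
    return activations + scale * steering_vector

steered = steer(torch.randn(1, 16, 512))
```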

"AI x-risk worries aren't just a waste of time, though; they are dangerous because they make people think society needs to make use of violence to regulate what kinds of AIs people can make and how they can use them."

Obviously there will be some very bad ways to make and use AI, and we need norms against them. Violence is the ultimate backstop for norm enforcement: it's called the police and the military.

"This danger was visible from the very beginning, as alignment theorists thought one could (and should) make a singleton that would achieve absolute power (by violently threatening humanity, no doubt), rather than always letting AIs be pure servants of humanity."

It seems extremely valid to be concerned about AI researchers (including those with an alignment focus) aspiring to conquer the world (or to make something that conquers the world). However, always having humans on top seems unlikely to be able to deal with the rapid and broad action that will be needed against AI-enabled adversaries.

I think traditionally the ultimate backstop for promoting human flourishing was that states were reliant on men in the military, so if those men were incapacitated or did not see value in the state they were fighting for, the state would be weaker. This incentivized states to develop things that helped the men in their militaries, and meant that states which failed to do so got replaced by states that did.

This backstop has already been weakening with more advanced weaponry and more peace. Surely eventually all fighting will be done by drones rather than by people, at which point the backstop will be nearly gone. (Of course there's also the manufacturing and programming of the drones, etc.) This lack of a backstop is the longest-term alignment problem, and if it fails there are endless ways most value could be destroyed, e.g.:

"To "justify" such violence, theorists make up all sorts of elaborate unfalsifiable and unjustifiable stories about how AIs are going to deceive and eventually kill humanity, yet the initial deceptions by base models were toothless, and thanks to modern alignment methods, serious hostility or deception has been thoroughly stamped out."

If my model of AI optimists is even halfway right, then a ton of people have been totally knocked out by things like RLHF, becoming overly convinced of the AI's alignment just from it acting apparently nicely. This seems like deceptive alignment, just in a "law of earlier failure" sense, as the AIs that knocked them out are barely even agentic.

  1. ^

    "But wouldn't it just be aligned to them, rather than unaligned?" Sometimes, presumably especially with xrisk-pilled adversaries. But presumably some adversaries won't be xrisk-pilled and instead will be willing to use more risky strategies until they win. So you either need to eliminate them ahead of time or be able to destroy the unaligned AIs.



