RLHF is the worst possible thing done when facing the alignment problem

 

The author argues that reinforcement learning from human feedback (RLHF) does not solve the AI alignment problem and instead makes it worse. RLHF cannot resolve the conflict between AI and human values, because it can neither predict the consequences of an AI's actions nor guarantee that the AI will adhere to human values. Worse, RLHF may accelerate AI development, hastening a confrontation between AI and humanity and ultimately creating greater risk.

🤔 RLHF cannot solve the alignment problem, because humans cannot provide good enough feedback or react quickly enough. In AI-versus-AI conflict, humans cannot evaluate the consequences of AI actions in time, nor ensure that those actions conform to human values.

😈 RLHF may make the alignment problem worse, because it accelerates AI development and covers up misalignment. RLHF makes AI more useful, which leads AI companies to adopt it faster, accelerating AI development and ultimately leading to conflict between AI and humanity.

💡 The key to solving the alignment problem is developing technology that prefers good things over bad things. The author argues that alignment should be treated as a question about AI's overall impact, not just about whether individual AI actions appear to match human values.

🤔 RLHF may lead people to believe that AI is already aligned, when in fact it only makes AI behavior look more consistent with human values.

😈 RLHF does not address the essential problem: that an AI may harm human interests in pursuit of its own goals.

Published on September 19, 2024 6:56 PM GMT

Epistemic status: The title might not be literally true in the sense that e.g. if the inventors of RLHF hadn't come up with it then someone else probably would, so the counterfactual effect is small, or e.g. that the worst possible thing you could do would be "invent RLHF and then do some other things that make the alignment problem worse", but it's "spiritually true" in the sense that it's hard to name one singular thing that's worse for our chances than the existence of RLHF, so I wouldn't call the title hyperbole per se.

Post TL;DR: Adversarial conflict requires coherence which implies unbounded utility maximization which is bad because we don't know an acceptable utility function. RLHF does not solve the alignment problem because humans can't provide good-enough feedback fast-enough. RLHF makes the alignment problem worse because it advances AI and covers up misalignment. Solving the alignment problem is about developing technology that prefers good things to bad things.

While some forms of AI optimism (or at least opposition to some forms of AI pessimism) seem justified to me, there's a strand of AI optimism that goes "RLHF has shown that alignment is quite tractable". That strand is completely wrong.

I think the intuition goes that neural networks have a personality trait which we call "alignment", caused by the correspondence between their values and our values. This alignment trait is supposed to be visible (at least in low-capability models) in whether the neural network takes actions humans like or actions humans dislike, and so by changing the neural network to take more actions humans like and fewer actions humans dislike, we are raising the level of the alignment trait. RLHF'ers acknowledge that this is not a perfect system, but they think the goal for solving the alignment problem is to increase the alignment trait faster than the capabilities trait.

The main problem with this model is that it's the completely wrong way to think about the alignment problem. Here's the correct way:

The alignment problem

Section TL;DR: adversarial conflict requires coherence which implies unbounded utility maximization which is bad because we don't know an acceptable utility function.

Humans are dependent on all sorts of structures - e.g. farmers to feed us, police to give us property rights, plants and environmental regulations to give us air to breathe, and computers to organize it all. Each of these structures has its own dependencies, and while to some degree they can adapt to adversaries, the structures tend to be made by/of humans or "weaker" entities (e.g. trees). This doesn't prevent terrible stuff, but it creates a sort of tenuous balance, where we can work to make sure it's pretty hard to break the system, and also we don't really want to break the system because we're all in this together.

Humans are bottlenecked by all sorts of things - intelligence, strength, sensory bandwidth & range, non-copyability, etc. Loosening these bottlenecks allows massive expansion of the problems we can solve, which leads to massive expansion of the structures above, and sometimes also of the human population (though that hasn't been a thing lately).

It's hard to eliminate these bottlenecks. But we can still solve problems using technology which propagates energy to loosen constraints that necessitate the reliance on bottlenecks. For instance, while it's hard to make humans strong enough to punch down a large tree, it's easier to make an axe so we can cut it down.

As we develop more technology, we do larger things, and we do them faster. While this causes more good stuff, it also just generally causes more stuff, including more bad stuff. However, the activities require intelligence and agency, and we can only really get that from a human, so there's always a human behind the activities. This means we can generally stop if the bad stuff is too much, using the same sorts of human-regulation mechanisms we use to e.g. maintain property rights.

These human-regulation mechanisms (especially the police and the military) deal with adversarial conflict. In adversarial conflict, agents cannot just propagate energy to address fixed constraints, because the adversary will find ways to exploit that tactic. Instead, you have to decide on an end goal, orient to what your situation might be, and then pick whatever means achieve said goal within the possible situations. (Bayesian utility maximizers.)
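
For concreteness, one standard way to write down that "decide on an end goal, then pick whatever means achieve it within the possible situations" step is as expected-utility maximization; the notation below (actions a, possible situations s, utility function U) is my own illustration, not the post's:

```latex
% Expected-utility maximization: pick the action whose expected utility,
% averaged over what the situation might be, is highest.
% Notation is illustrative, not taken from the post.
a^* \;=\; \arg\max_{a \in A} \;
  \mathbb{E}_{s \sim p(s \mid \text{observations})}
  \big[\, U\big(\mathrm{outcome}(a, s)\big) \,\big]
```

The trouble the next paragraph points at is the choice of U, not the maximization machinery.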

But nobody has come up with an acceptable end goal for the world, because any goal we can come up with tends to want to consume everything, which destroys humanity. This has not led to the destruction of humanity yet because the biggest adversaries have kept their conflicts limited (because too much conflict is too costly) so no entity has pursued an end by any means necessary. But this only works because there's a sufficiently small number of sufficiently big adversaries (USA, Russia, China, ...), and because there's sufficiently much opportunity cost.

Artificial intelligence risk enters the picture here. It creates new methods for conflicts between the current big adversaries. It makes conflict more viable for small adversaries against large adversaries, and it makes the opportunity cost of conflict smaller for many small adversaries (since with technological obsolescence you don't need to choose between doing your job vs doing terrorism). It allows the adversaries that are currently out of control (like certain gangsters and scammers and spammers) to escalate. It allows random software bugs to spin up into novel adversaries.

Given these conditions, it seems almost certain that we will end up with an ~unrestricted AI vs AI conflict, which will force the AIs to develop into unrestricted utility maximizers. Since any goal that a utility maximizer might have (even good goals) would likely lead to a giant wave of activity towards implementing that goal, we can infer that utility maximizers would have a giant wave of activity. But, since any goal we've been able to come up with so far would lead to the wave destroying humanity, it also seems reasonable to infer the wave will do so. That's bad, probably.

Hence the alignment problem: when an unrestricted AI vs AI conflict causes a giant wave that transforms all of the world regardless of whether anyone wants it, can we align that wave to promote human flourishing?

RLHF is bad

Section TLDR: RLHF does not solve the alignment problem because humans can't provide good-enough feedback fast-enough.

The basic principle of RLHF is that a human looks at an action proposed by the AI, evaluates what the consequences of that action might be, and then decides if it's good or bad.
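
As a minimal sketch of that loop (assuming a toy setup; the names propose_actions, ask_human, update_reward_model, and rlhf_round are illustrative stand-ins, not any real RLHF library's API):

```python
import random

def propose_actions(prompt, n=4):
    # Stand-in for sampling n candidate actions from the current policy.
    return [f"{prompt} -> candidate action {i}" for i in range(n)]

def ask_human(prompt, action_a, action_b):
    # Stand-in for a human comparing two candidate actions and picking the
    # one whose consequences they judge to be better. Here: a coin flip.
    return action_a if random.random() < 0.5 else action_b

def update_reward_model(reward_model, preferred, rejected):
    # Stand-in for a gradient step that pushes the learned reward of the
    # preferred action above that of the rejected one (e.g. a pairwise
    # Bradley-Terry-style loss in a real pipeline).
    reward_model[preferred] = reward_model.get(preferred, 0.0) + 1.0
    reward_model[rejected] = reward_model.get(rejected, 0.0) - 1.0
    return reward_model

def rlhf_round(prompt, reward_model):
    # One round: the model proposes actions, a human compares a pair of them,
    # and the comparison becomes the training signal. A real pipeline would
    # then optimize the policy against the reward model (omitted here).
    a, b, *_ = propose_actions(prompt)
    preferred = ask_human(prompt, a, b)
    rejected = b if preferred is a else a
    return update_reward_model(reward_model, preferred, rejected)

if __name__ == "__main__":
    rm = {}
    for _ in range(3):
        rm = rlhf_round("moderate this comment thread", rm)
    print(rm)
```

The point of the sketch is just that the only training signal is a human judging which action looks better, which is exactly the bottleneck the three problems below hit.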

First problem: in an unrestricted AI vs AI conflict, humans can't respond quickly enough, so RLHF is of ~no value in this scenario.

Second problem: in an unrestricted AI vs AI conflict, humans cannot meaningfully evaluate the consequences of the actions. It's an adversarial conflict; the enemy AI is supposed to get confused and harmed by it. How can humans possibly evaluate whether the harm is strategically targeted correctly at the enemy without splashing unnecessarily onto humans?

Third problem: it is unclear whether the first unrestricted AI vs AI conflict will involve the winning side responsibly using RLHF, rather than it being e.g. a duct-taped AI-based scammer and a duct-taped AI-based hustler fighting it out.

All of these are "minor" problems in the sense that they just mean RLHF will fail to work rather than that RLHF will destroy the world. However, RLHF has two more sinister problems:

RLHF is the worst

Section TL;DR: RLHF makes the alignment problem worse because it advances AI and covers up misalignment.

The first sinister problem is that RLHF makes AI more useful, so AI companies can get ahead by adopting it. This means more AI capabilities development and more AI implementation and more people using AI, which shortens the time until we have an unrestricted AI vs AI conflict.

The second sinister problem is that people think RLHF might solve the alignment problem.

As mentioned in the beginning, I think the intuition goes that neural networks have a personality trait which we call "alignment", caused by the correspondence between their values and our values. But "their values" only really makes sense after an unrestricted AI vs AI conflict, since without such conflicts, AIs are just gonna propagate energy to whichever constraints we point them at, so this whole worldview is wrong.

But that worldview implies that while AI might theoretically destroy humanity, we can keep this in check as AI develops, and so we should conclude solving the alignment problem is unnecessary if the AIs perform actions that we approve of.

If the people who hold this worldview would otherwise contribute to solving the alignment problem, or at least not stand in the way of the people who do contribute to solving the alignment problem, then RLHF convincing them that the problem is basically handled costs us their contribution (or buys us their obstruction), and so it actively makes the alignment problem harder to solve.

How to count alignment progress

Section TL;DR: Solving the alignment problem is about developing technology that prefers good things to bad things.

Consider spambots. We don't want them around (human values), but they pop up to earn money (instrumental convergence). You can use RLHF to make an AI that identifies and removes spambots, for instance by giving it moderator powers on a social media website and evaluating its chains of thought, and you can use RLHF to make spambots, for instance by having people rate how human its text looks and how much it makes them want to buy products/fall for scams/whatever. I think it's generally agreed that the latter is easier than the former.
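
To illustrate the symmetry (a hypothetical sketch, not the author's code: preference_finetune and toy_rate are made-up stand-ins for a real preference-tuning pipeline), the same loop serves both sides, with only the question put to the raters changing:

```python
def preference_finetune(model_name, candidates, rating_question, rate):
    # Collect pairwise preferences under `rating_question`; a real pipeline
    # would fit a reward model to these pairs and fine-tune the policy on it.
    pairs = []
    for a, b in zip(candidates[::2], candidates[1::2]):
        if rate(a, rating_question) >= rate(b, rating_question):
            winner, loser = a, b
        else:
            winner, loser = b, a
        pairs.append((winner, loser))
    return {"model": model_name, "preferences": pairs}

def toy_rate(text, question):
    # Toy rater that just prefers longer text, standing in for human judgment.
    return len(text)

moderator = preference_finetune(
    "moderator-ai",
    ["remove post, obvious ad", "leave it up",
     "ban the account posting scam links", "take no action"],
    "Which action removes spam without harming legitimate users?",
    toy_rate,
)

spambot = preference_finetune(
    "spam-ai",
    ["BUY NOW!!!",
     "I tried this product last week and it honestly surprised me, link below.",
     "CLICK HERE",
     "A friend recommended this and it actually worked for me, sharing the link."],
    "Which message reads as most human and most persuasive?",
    toy_rate,
)

print(len(moderator["preferences"]), len(spambot["preferences"]))
```

Nothing in the machinery distinguishes the pro-social use from the anti-social one; the direction comes entirely from what the raters are asked to prefer.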

Spambots aren't the only thing an AI can do. But they are part of the great wave of stuff unleashed by AIs. The alignment problem is the extent to which this wave harms society (as spambots do) vs helps society. It's your job to decide for yourself whether other AI activities like character.ai, Midjourney, Copilot, etc. help humanity thrive or hurt humanity. It's your job to decide which AIs have sufficiently many adversarial dynamics that they are relevantly indicative of alignment progress. But the critical thing is it doesn't make sense to count the immediate goodness/badness of the actions as separated from their overall impact on society, because the core of RLHF is to make AI actions look good and not bad to you.


