Is alignment reducible to becoming more coherent?

The article examines a key question in AI alignment: how to construct a pointer that "stably points to value", so that an AI system we build can be made to behave in accordance with human values. It proposes a strategy of addressing the problem through terminal control, i.e. having the AI obey commands issued through a specific terminal. This approach faces challenges, however, such as transfer of control over the terminal and the incoherence of human values themselves. The author argues that, despite these difficulties, carefully designing observation-utility (OU) agents and framing the approach as a form of rationality enhancement can partially overcome them, and stresses the need for further research and experimentation.

🧠 The core challenge of alignment is constructing a "stable pointer to value", so that human values can be identified and maximized inside an AI system.

💡 The article proposes addressing alignment via terminal control: have the AI obey commands issued through a specific terminal, thereby learning and adapting to human preferences.

⚠️ Challenges of this approach include transfer of control over the terminal and the incoherence of human values themselves, which are shaped by family, culture, and other influences.

🧐 The author argues that carefully designing observation-utility (OU) agents and viewing the approach as a form of rationality enhancement can partially overcome these challenges, but further research and experimentation are needed.

Published on April 22, 2025 11:47 PM GMT

Epistemic status: Like all alignment ideas, this one is incomplete and/or wrong, but I am hoping mostly incomplete and not wrong.

One of the hard subproblems of alignment is constructing a "stable pointer to value" (see this overview and this (sub)sequence discussing OU agents). The way that I think of this is that we want to be able to identify the concept of "human values" in the mind of any agent we build, so that we can design it to maximize human values (and not merely a leaky proxy).

The first obstacle is that it is very hard to identify any predetermined concept whatsoever in the mind of an agent with a very rich ontology, and a rich ontology seems to be necessary for powerful learning (this is one of the main intuitions I have gained by working on AIXI for years).

The second obstacle is that human values may not even be a very well-defined concept. In fact, I think this is almost a part of the first obstacle; one of the reasons it is hard to identify a predetermined concept is that it might not be as meaningful as you think. 

Now, I'd like to argue that maybe we can ignore the first obstacle by making the second obstacle harder, and further that this may be a good strategy because the second obstacle was already very hard. 

This strategy comes from a simple suggestion of @abramdemski: we can just have our (artificial) agent obey the commands issued through a given terminal (or other well-defined input channel). More precisely, we can design a protocol so that the agent learns the preferences expressed through the terminal. 
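
To make the shape of this concrete, here is a minimal toy sketch (my own illustration, with invented hypothesis names and an invented approve/disapprove interface, not Abram's proposal or any particular formalism): an agent that keeps a Bayesian posterior over candidate utility functions and updates it from commands issued through the terminal.

```python
import math
from typing import Callable, Dict, List

UtilityFn = Callable[[str], float]  # maps an outcome to a utility value

class TerminalGuidedAgent:
    """Toy agent: learns the preferences expressed through a single terminal."""

    def __init__(self, hypotheses: Dict[str, UtilityFn]):
        self.hypotheses = hypotheses
        # Uniform prior over the hypothesised utility functions.
        self.posterior = {name: 1.0 / len(hypotheses) for name in hypotheses}

    def update(self, outcome: str, approved: bool) -> None:
        # Toy likelihood: a hypothesis that rates the outcome highly predicts
        # approval; one that rates it poorly predicts disapproval.
        for name, u in self.hypotheses.items():
            p_approve = 1.0 / (1.0 + math.exp(-u(outcome)))
            self.posterior[name] *= p_approve if approved else (1.0 - p_approve)
        total = sum(self.posterior.values())
        self.posterior = {name: p / total for name, p in self.posterior.items()}

    def choose(self, options: List[str]) -> str:
        # Maximise expected utility under the current posterior.
        def eu(o: str) -> float:
            return sum(p * self.hypotheses[n](o) for n, p in self.posterior.items())
        return max(options, key=eu)

agent = TerminalGuidedAgent({
    "wants_tea":    lambda o: 1.0 if o == "make_tea" else 0.0,
    "wants_coffee": lambda o: 1.0 if o == "make_coffee" else 0.0,
})
agent.update("make_tea", approved=False)          # the user rejects tea
print(agent.choose(["make_tea", "make_coffee"]))  # -> "make_coffee"
```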

So if we put a human at that terminal, does this point to the values of that human? What if a lot of humans take turns at the terminal? What if the AI seizes control of the terminal?

I think that this is not an insurmountable problem for observation-utility agents (OU agents) which try to maximize the current utility function of the user. I think there is a meaningful (and maybe not so hard to formalize) sense in which the user's utility function changes in a very discontinuous way when a new person / nanobot-possessed zombie sits down at the terminal, shoving the last guy's cold limp body aside. A carefully designed OU agent shouldn't want to replace the user. But this is subtle, because the agent is built to obey the terminal, not the user. When the terminal changes hands, in a sense the agent "interprets" this as incoherence in the terminal's utility function, not a change in the terminal's utility function - the terminal's utility function already has changing users factored in.
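
One crude way to write the "maximize the current estimate" framing down (my notation, not necessarily the formulation in the linked sequence, which discusses more careful variants):

$$a^{*} \;=\; \arg\max_{a} \; \sum_{U} P(U \mid h) \; \mathbb{E}\big[\, U(h') \mid h, a \,\big],$$

where $h$ is the terminal history observed so far, $h'$ ranges over completed future histories, and $P(U \mid h)$ is the agent's current distribution over the utility function being expressed through the terminal. The load-bearing choice is conditioning $P(U \mid \cdot)$ on $h$ rather than on futures the agent can engineer: under this evaluation, a plan earns no credit for steering the utility estimate toward something easy to satisfy.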

Insofar as the agent expects to seize control of the terminal, despite the objections of the current controller of the terminal, this is a type of temporal incoherence in the terminal's utility function. It is a particularly bad type of incoherence because the agent has some control over it, and (at least under a careless implementation) is incentivized to exercise that control to make the utility function easy to satisfy. This is closely connected to corrigibility, though it is kind of reversed because we want the agent not to "correct" the user.
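
As a toy illustration of that incentive (my own toy model with made-up outcomes and numbers, not the post's formalism): compare a careless implementation that scores a plan by whatever utility function it expects the terminal to express after the plan runs with a careful one that scores every plan by the current estimate of the user's utility.

```python
# Invented outcome names and utilities, purely for illustration.
def current_user_utility(outcome: str) -> float:
    return {"help_user": 10.0, "make_paperclips": -5.0, "seize_terminal": -100.0}.get(outcome, 0.0)

def post_takeover_utility(outcome: str) -> float:
    # Whatever the agent itself would type once it controls the terminal:
    # every outcome looks great.
    return 100.0

def careless_score(plan: str) -> float:
    # Evaluates the plan by the utility function expected *after* it runs.
    u = post_takeover_utility if plan == "seize_terminal" else current_user_utility
    return u(plan)

def careful_score(plan: str) -> float:
    # Evaluates every plan by the *current* estimate of the user's utility.
    return current_user_utility(plan)

plans = ["make_paperclips", "help_user", "seize_terminal"]
print(max(plans, key=careless_score))  # -> "seize_terminal"
print(max(plans, key=careful_score))   # -> "help_user"
```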

However, I am not sure that this problem introduces qualitatively new difficulties: the human's values were already incoherent, and part of that incoherence already came from the influence of other agents!

Every person's values (and expressed preferences) are shaped by many factors including their family, friends, culture, government, and sometimes religion. Because this shaping is often intentional, there are always other (inhuman) agents acting through the human user. Also, @Richard_Ngo would probably argue that a human is best understood as a coalition of (cognitive) subagents, each with their own (approximate) goals.

Therefore, there are always many agents with some control over the terminal, and what we want is to identify the ones that we endorse and optimize some kind of aggregate of their interests. Personally, I'm enough of a Bayesian to suggest framing this problem as constructing an ideal utility function from highly (even adversarially) incoherent preferences. One possible solution is to identify the current dominant (sub)agent (hopefully the human user) and complete its preferences appropriately. This may of course fail if humans cannot be usefully understood as agents.
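
A very crude sketch of the "identify the dominant (sub)agent we endorse and complete its preferences" idea (the subagent names, control weights, and notion of "completion" below are all invented for illustration; this is not a worked-out aggregation theory):

```python
from typing import Callable, Dict, Optional

UtilityFn = Callable[[str], float]

def dominant_subagent(control: Dict[str, float],
                      endorsed: Dict[str, bool]) -> Optional[str]:
    # Among the subagents we endorse, pick the one with the most control
    # over the terminal right now.
    candidates = {name: w for name, w in control.items() if endorsed.get(name, False)}
    return max(candidates, key=candidates.get) if candidates else None

def complete_preferences(partial: Dict[str, float], default: float = 0.0) -> UtilityFn:
    # Placeholder "completion": outcomes the subagent expressed no preference
    # about get a neutral value. A real proposal needs a principled completion.
    return lambda outcome: partial.get(outcome, default)

control = {"reflective_self": 0.5, "advertiser_installed_craving": 0.3, "peer_pressure": 0.2}
endorsed = {"reflective_self": True, "advertiser_installed_craving": False, "peer_pressure": False}
stated = {"help_user": 10.0, "seize_terminal": -100.0}

leader = dominant_subagent(control, endorsed)                  # -> "reflective_self"
utility = complete_preferences(stated)
print(leader, utility("help_user"), utility("novel_outcome"))  # -> reflective_self 10.0 0.0
```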

Under this research program, alignment can be viewed as a type of rationality enhancement, in which an agent interacts with a user in order to elicit their true preferences on reflection. This is not a trivial problem; for instance, we should be comfortable with the human's local preferences changing as they gain more information, but we're generally not comfortable with the agent manipulating the human to change their preferences. We want a rationality enhancement protocol that we can trust.

One advantage of this research program is that it seems fairly simple to solve easy versions of the problem. In fact, the easiest version is just CIRL; the major difference is that we view the human as very irrational and subject to various types of influence by other agents, and we are focused on enhancing the human's rationality, not just discovering their preferences. However, it's important not to equate progress on easy versions of the problem with progress on hard versions. A method that works on POMDPs may not transfer to mixtures of probabilistic programs; things fundamentally change when your ontology is very rich (also, when your action space contains many types of sophisticated influence). A massive amount of further mathematical and philosophical progress is necessary. Still, I think it is important for agent foundations research to reach the level of (useful) implementation and experimentation in the near future, so I would like to see / build some demos.
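
For a sense of what an "easy version" demo might look like, here is a toy in the flavour of reward inference (not the actual CIRL formalism; the action names, reward values, and rationality constant are all made up): infer which of two candidate reward functions the human has from a single noisily-rational choice.

```python
import math

actions = ["coffee", "tea"]
rewards = {
    "likes_coffee": {"coffee": 1.0, "tea": 0.0},
    "likes_tea":    {"coffee": 0.0, "tea": 1.0},
}
beta = 2.0  # how noisily-rational we model the human as being

def boltzmann(reward: dict) -> dict:
    # Probability the human picks each action, given this reward function.
    z = sum(math.exp(beta * reward[a]) for a in actions)
    return {a: math.exp(beta * reward[a]) / z for a in actions}

def posterior(observed_action: str) -> dict:
    # Bayes update from a uniform prior over the candidate reward functions.
    post = {name: 0.5 * boltzmann(r)[observed_action] for name, r in rewards.items()}
    z = sum(post.values())
    return {name: p / z for name, p in post.items()}

print(posterior("tea"))  # most of the mass lands on "likes_tea"
```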



