Epistemic status: Like all alignment ideas, this one is incomplete and/or wrong, but I am hoping mostly incomplete and not wrong.
One of the hard subproblems of alignment is constructing a "stable pointer to value" (see this overview and this (sub)sequence discussing OU agents). The way that I think of this is that we want to be able to identify the concept of "human values" in the mind of any agent we build, so that we can design it to maximize human values (and not merely a leaky proxy).
The first obstacle is that it is very hard to identify any predetermined concept whatsoever in the mind of an agent with a very rich ontology, and a rich ontology seems to be necessary for powerful learning (this is one of the main intuitions I have gained from working on AIXI for years).
The second obstacle is that human values may not even be a very well-defined concept. In fact, I think this is almost a part of the first obstacle; one of the reasons it is hard to identify a predetermined concept is that it might not be as meaningful as you think.
Now, I'd like to argue that maybe we can ignore the first obstacle by making the second obstacle harder, and further that this may be a good strategy because the second obstacle was already very hard.
This strategy comes from a simple suggestion of @abramdemski: we can just have our (artificial) agent obey the commands issued through a given terminal (or other well-defined input channel). More precisely, we can design a protocol so that the agent learns the preferences expressed through the terminal.
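To make "learns the preferences expressed through the terminal" slightly more concrete, here is a minimal toy sketch, under assumptions I'm inventing purely for illustration: the preferences are one of a small finite set of candidate utility functions, the terminal is the agent's only evidence channel about them, and the agent acts on its current posterior. None of the names below correspond to an existing system.

```python
# Toy sketch (illustrative only): the agent's sole evidence about preferences
# is a stream of commands arriving on one well-defined channel ("the terminal").
# It keeps a posterior over a finite set of candidate utility functions and,
# when acting, scores outcomes with its *current* posterior.

CANDIDATE_UTILITIES = {
    "make_paperclips": lambda outcome: outcome.count("paperclip"),
    "make_tea":        lambda outcome: outcome.count("tea"),
}

class TerminalPointerAgent:
    def __init__(self):
        # Uniform prior over which utility function the terminal is expressing.
        n = len(CANDIDATE_UTILITIES)
        self.posterior = {name: 1.0 / n for name in CANDIDATE_UTILITIES}

    def observe_command(self, command: str):
        """Bayesian update: a command mentioning a goal is weak evidence for that goal."""
        likelihoods = {
            name: (0.9 if name.split("_")[1] in command else 0.1)
            for name in self.posterior
        }
        z = sum(self.posterior[n] * likelihoods[n] for n in self.posterior)
        self.posterior = {n: self.posterior[n] * likelihoods[n] / z for n in self.posterior}

    def choose_action(self, options: list[list[str]]) -> list[str]:
        """Pick the outcome with highest expected utility under the current posterior."""
        def expected_utility(outcome):
            return sum(p * CANDIDATE_UTILITIES[name](outcome)
                       for name, p in self.posterior.items())
        return max(options, key=expected_utility)

agent = TerminalPointerAgent()
agent.observe_command("please make me some tea")
print(agent.choose_action([["paperclip"] * 3, ["tea", "tea"]]))  # -> ['tea', 'tea']
```

The important design choice, which the rest of this post is about, is that the agent is pointed at the channel, not at any particular person behind it.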
So if we put a human at that terminal, does this point to the values of that human? What if a lot of humans take turns at the terminal? What if the AI seizes control of the terminal?
I think that this is not an insurmountable problem for observation-utility agents (OU agents), which try to maximize the current utility function of the user. I think there is a meaningful (and maybe not so hard to formalize) sense in which the user's utility function changes in a very discontinuous way when a new person / nanobot-possessed zombie sits down at the terminal, shoving the last guy's cold limp body aside. A carefully designed OU agent shouldn't want to replace the user. But this is subtle, because the agent is built to obey the terminal, not the user. When the terminal changes hands, the agent in a sense "interprets" this as incoherence in the terminal's utility function, not as a change in the terminal's utility function: the terminal's utility function already has changing users factored in.
Insofar as the agent expects to seize control of the terminal, despite the objections of its current controller, this is a type of temporal incoherence in the terminal's utility function. It is a particularly bad type of incoherence because the agent has some control over it, and (at least under a careless implementation) is incentivized to exercise that control to make the utility function easy to satisfy. This is closely connected to corrigibility, though it is kind of reversed, because here we want the agent not to "correct" the user.
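To make the "careless implementation" failure mode concrete, here is one way to write down the contrast (a sketch in notation chosen for this post, loosely in the spirit of the OU-agent literature, not a claim about any specific formalization). Let $e_{<t}$ be the evidence received through the terminal so far, $h$ a possible future history, and $P(U \mid \cdot)$ the agent's belief about which utility function the terminal is expressing. An OU agent scores futures with its current beliefs about $U$:

$$a_t^{*} \;=\; \arg\max_{a} \sum_{h} P(h \mid e_{<t}, a) \sum_{U} P(U \mid e_{<t})\, U(h),$$

whereas the careless agent scores futures with the beliefs it will hold once that future has arrived:

$$a_t^{\text{careless}} \;=\; \arg\max_{a} \sum_{h} P(h \mid e_{<t}, a) \sum_{U} P(U \mid e_{<t}, h)\, U(h).$$

The second agent is rewarded for steering toward futures whose evidence makes the inferred utility function easy to satisfy (for instance, futures in which it controls the terminal); the first is not, because its weights over $U$ do not depend on what the terminal will say later.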
However, I am not sure that this problem introduces qualitatively new difficulties: the human's values were already incoherent, and part of that incoherence already came from the influence of other agents!
Every person's values (and expressed preferences) are shaped by many factors, including their family, friends, culture, government, and sometimes religion. Because this shaping is often intentional, there are always other (non-human) agents acting through the human user. Also, @Richard_Ngo would probably argue that a human is best understood as a coalition of (cognitive) subagents, each with their own (approximate) goals.
Therefore, there are always many agents with some control over the terminal, and what we want is to identify the ones that we endorse and optimize some kind of aggregate of their interests. Personally, I'm enough of a Bayesian to suggest framing this problem as constructing an ideal utility function from highly (and adversarially) incoherent preferences. One possible solution is to identify the current dominant (sub)agent (hopefully the human user) and complete its preferences appropriately. This may of course fail if humans cannot be usefully understood as agents.
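As a toy illustration of "identify the current dominant (sub)agent and complete its preferences" (everything here is an invented assumption: subagents come as weights plus strict partial preferences over a finite outcome set, "dominant" just means highest weight, and "complete" just means extend to some consistent total order):

```python
from itertools import permutations

# Toy sketch (illustrative assumptions only): each subagent is a weight plus a set
# of strict preferences (a, b) meaning "a is preferred to b" over a finite outcome set.
OUTCOMES = ["status_quo", "reform", "revolution"]
SUBAGENTS = {
    "deliberative_self": {"weight": 0.6, "prefs": {("reform", "status_quo")}},
    "impulsive_self":    {"weight": 0.3, "prefs": {("revolution", "reform"),
                                                   ("revolution", "status_quo")}},
    "advertiser":        {"weight": 0.1, "prefs": {("status_quo", "reform")}},
}

def dominant_subagent(subagents):
    """Identify the subagent with the largest weight (hopefully the human user)."""
    return max(subagents, key=lambda name: subagents[name]["weight"])

def complete_preferences(partial, outcomes):
    """Extend a strict partial preference to a total order: return any ranking
    of the outcomes that violates none of the given preferences."""
    for ranking in permutations(outcomes):
        position = {o: i for i, o in enumerate(ranking)}
        if all(position[a] < position[b] for a, b in partial):
            return list(ranking)
    raise ValueError("preferences are incoherent: no completion exists")

dominant = dominant_subagent(SUBAGENTS)
print(dominant)                                               # deliberative_self
print(complete_preferences(SUBAGENTS[dominant]["prefs"], OUTCOMES))
# e.g. ['reform', 'status_quo', 'revolution'] (one of several valid completions)
```

Of course, the real difficulty is hidden in the inputs: actual subagents do not hand us clean weights and consistent partial orders, and deciding which completion to pick is itself a value-laden choice.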
Under this research program, alignment can be viewed as a type of rationality enhancement, in which an agent interacts with a user in order to elicit their true preferences on reflection. This is not a trivial problem; for instance, we should be comfortable with the human's local preferences changing as they gain more information, but we're generally not comfortable with the agent manipulating the human to change their preferences. We want a rationality enhancement protocol that we can trust.
One advantage of this research program is that it seems fairly simple to solve easy versions of the problem. In fact, the easiest version is just CIRL; the major difference is that we view the human as very irrational and subject to various types of influence by other agents, and we are focused on enhancing the human's rationality, not just discovering their preferences. However, it's important not to equate progress on easy versions of the problem with progress on hard versions. A method that works on POMDPs may not transfer to mixtures of probabilistic programs; things fundamentally change when your ontology is very rich (also, when your action space contains many types of sophisticated influence). A massive amount of further mathematical and philosophical progress is necessary. Still, I think it is important for agent foundations research to reach the level of (useful) implementation and experimentation in the near future, so I would like to see / build some demos.
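For what such a demo might minimally look like, here is a sketch in the CIRL spirit, with everything invented for illustration: the "human" chooses actions Boltzmann-rationally under a hidden reward parameter, and the "robot" does a Bayesian update and then acts on the posterior.

```python
import math

# Toy sketch of the "easy version": infer a hidden reward parameter from a
# noisily-rational human's choices, then act on the posterior expectation.
# Everything here is illustrative; a real setup would also have to model the
# many ways other agents act *through* the human.

REWARDS = {"coffee": {"coffee": 1.0, "tea": 0.0},
           "tea":    {"coffee": 0.0, "tea": 1.0}}
ACTIONS = ["coffee", "tea"]
BETA = 2.0  # how rational we assume the human is

def human_likelihood(action, theta):
    """Probability a Boltzmann-rational human picks `action` if the true reward is `theta`."""
    z = sum(math.exp(BETA * REWARDS[theta][a]) for a in ACTIONS)
    return math.exp(BETA * REWARDS[theta][action]) / z

def update(posterior, observed_action):
    """Bayesian update on one observed human choice."""
    joint = {theta: p * human_likelihood(observed_action, theta) for theta, p in posterior.items()}
    z = sum(joint.values())
    return {theta: p / z for theta, p in joint.items()}

def robot_action(posterior):
    """Pick the action with the highest posterior-expected reward."""
    return max(ACTIONS, key=lambda a: sum(p * REWARDS[theta][a] for theta, p in posterior.items()))

posterior = {"coffee": 0.5, "tea": 0.5}
for obs in ["tea", "tea", "coffee"]:    # the human's observed choices
    posterior = update(posterior, obs)
print(posterior)                # most of the mass has shifted toward "tea"
print(robot_action(posterior))  # -> "tea"
```

The interesting part of the research program begins where this sketch ends: when the human's choices are not even approximately Boltzmann-rational, and when the agent's own actions can change what the human appears to want.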