Published on July 2, 2025 12:05 AM GMT
I want to retain the ability to update my values over time, but I don’t want those updates to be the result of manipulative optimization by a superintelligence. Instead, the superintelligence should supply me with accurate empirical data and valid inferences, while leaving the choice of normative assumptions—and thus my overall utility function and its proxy representation (i.e., my value structure)—under my control. I also want to engage in value discussions (with either humans or AIs) where the direction of value change is symmetric: both participants have roughly equal probability of updating, so that persuasive force isn’t one-sided. This dynamic can be formally modeled as two agents with evolving objectives or changing proxy representations of their objectives, interacting over time.
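One minimal way to make that symmetric-update requirement concrete is the toy sketch below. Everything specific in it is an illustrative assumption on my part rather than a claim from the model above: values are represented as vectors, and both agents face the same per-round probability and magnitude of updating toward the other, so neither side has more persuasive force.

```python
import numpy as np

def symmetric_value_dialogue(v_a, v_b, rounds=100, p_update=0.1, step=0.05, seed=0):
    """Toy model of two agents whose proxy value vectors change only through
    symmetric updates: in each round, each agent independently moves toward
    the other with the same probability and the same step size."""
    rng = np.random.default_rng(seed)
    v_a = np.asarray(v_a, dtype=float).copy()
    v_b = np.asarray(v_b, dtype=float).copy()
    for _ in range(rounds):
        if rng.random() < p_update:      # agent A updates toward B
            v_a += step * (v_b - v_a)
        if rng.random() < p_update:      # agent B updates toward A, same odds and step
            v_b += step * (v_a - v_b)
    return v_a, v_b

# Example: two agents with different initial weights over three goods.
a_final, b_final = symmetric_value_dialogue([1.0, 0.0, 0.5], [0.2, 0.9, 0.5])
print(a_final, b_final)
```

The asymmetric (manipulative) case would be the same dynamic with unequal probabilities or step sizes, which is exactly what I want to rule out.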
That is what alignment means to me: normative freedom, with value changes that evolve slowly and symmetrically across agents.
In rare cases, strategic manipulation might be justified—e.g., if an agent’s values are extremely dangerous—but that would be a separate topic involving the deliberate use of misalignment, not alignment itself.
A natural concern is whether high intelligence and full information would cause agents to converge on the same values. But convergence is not guaranteed if agents differ in their terminal goals or if their value systems instantiate distinct proxy structures.
Still, suppose a superintelligence knows my values precisely (whether fixed or dynamically updated). It can then compute the optimal policy for achieving them and explain that policy to me. If I accept its reasoning, I follow the policy not due to coercion but because it best satisfies my own value function. In such a world, each agent can be helped to succeed according to their own values, and since utility isn’t necessarily zero-sum, widespread success is possible. This scenario suggests a pre-formal notion of alignment: the AI enables agents to achieve their goals by supplying accurate world models and optimal plans under user-specified normative assumptions, without hijacking or implicitly rewriting those assumptions.
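To show what "optimal plans under user-specified normative assumptions" could look like at the simplest possible level, here is a minimal sketch assuming a small tabular MDP and standard value iteration; none of these specifics come from the post, which stays pre-formal. The point is only that the planner treats the user's reward as a fixed, read-only input and searches over policies, never over the reward itself.

```python
import numpy as np

def plan_for_user(transition, user_reward, gamma=0.95, iters=500):
    """Value iteration over a small MDP. The user's reward vector is a fixed
    input: the planner optimizes against it but never edits it.
    transition[a] is an (S, S) matrix of P(s' | s, a);
    user_reward is a length-S vector of the user's own valuations."""
    V = np.zeros(transition.shape[1])
    for _ in range(iters):
        # Q[a, s] = r(s) + gamma * E[V(s') | s, a]
        Q = user_reward[None, :] + gamma * (transition @ V)
        V = Q.max(axis=0)
    policy = Q.argmax(axis=0)  # best action in each state, by the user's own values
    return policy, V

# Example: 3 states, 2 actions; the rewards are chosen by the user, not the planner.
P = np.array([
    [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0], [0.0, 0.1, 0.9]],  # action 0
    [[0.2, 0.8, 0.0], [0.0, 0.2, 0.8], [0.0, 0.0, 1.0]],  # action 1
])
r_user = np.array([0.0, 0.0, 1.0])  # the user values reaching state 2
policy, values = plan_for_user(P, r_user)
print(policy, values)
```

The hijacking I want to exclude would correspond to the planner optimizing over `r_user` itself (or over the process that produces it) rather than over the policy.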