Published on July 2, 2025 12:05 AM GMT
I want to retain the ability to update my values over time, but I don’t want those updates to be the result of manipulative optimization by a superintelligence. Instead, the superintelligence should supply me with accurate empirical data and valid inferences, while leaving the choice of normative assumptions—and thus my overall utility function and its proxy representation (i.e., my value structure)—under my control. I also want to engage in value discussions (with either humans or AIs) where the direction of value change is symmetric: both participants have roughly equal probability of updating, so that persuasive force isn’t one-sided. This dynamic can be formally modeled as two agents with evolving objectives or changing proxy representations of their objectives, interacting over time.
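One minimal way to make that symmetric-update requirement concrete is the toy sketch below. Everything specific in it is an illustrative assumption on my part rather than a claim from the model above: values are represented as vectors, and both agents face the same per-round probability and magnitude of updating toward the other, so neither side has more persuasive force.

```python
import numpy as np

def symmetric_value_dialogue(v_a, v_b, rounds=100, p_update=0.1, step=0.05, seed=0):
    """Toy model of two agents whose proxy value vectors change only through
    symmetric updates: in each round, each agent independently moves toward
    the other with the same probability and the same step size."""
    rng = np.random.default_rng(seed)
    v_a = np.asarray(v_a, dtype=float).copy()
    v_b = np.asarray(v_b, dtype=float).copy()
    for _ in range(rounds):
        if rng.random() < p_update:      # agent A updates toward B
            v_a += step * (v_b - v_a)
        if rng.random() < p_update:      # agent B updates toward A, same odds and step
            v_b += step * (v_a - v_b)
    return v_a, v_b

# Example: two agents with different initial weights over three goods.
a_final, b_final = symmetric_value_dialogue([1.0, 0.0, 0.5], [0.2, 0.9, 0.5])
print(a_final, b_final)
```

The asymmetric (manipulative) case would be the same dynamic with unequal probabilities or step sizes, which is exactly what I want to rule out.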
That is what alignment means to me: normative freedom, with value changes that evolve slowly and symmetrically across agents.
In rare cases, strategic manipulation might be justified—e.g., if an agent’s values are extremely dangerous—but that would be a separate topic involving the deliberate use of misalignment, not alignment itself.
A natural concern is whether high intelligence and full information would cause agents to converge on the same values. But convergence is not guaranteed if agents differ in their terminal goals or if their value systems instantiate distinct proxy structures.
Still, suppose a superintelligence knows my values precisely (whether fixed or dynamically updated). It can then compute the optimal policy for achieving them and explain that policy to me. If I accept its reasoning, I follow the policy not due to coercion but because it best satisfies my own value function. In such a world, each agent can be helped to succeed according to their own values, and since utility isn’t necessarily zero-sum, widespread success is possible. This scenario suggests a pre-formal notion of alignment: the AI enables agents to achieve their goals by supplying accurate world models and optimal plans under user-specified normative assumptions, without hijacking or implicitly rewriting those assumptions.
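To show what "optimal plans under user-specified normative assumptions" could look like at the simplest possible level, here is a minimal sketch assuming a small tabular MDP and standard value iteration; none of these specifics come from the post, which stays pre-formal. The point is only that the planner treats the user's reward as a fixed, read-only input and searches over policies, never over the reward itself.

```python
import numpy as np

def plan_for_user(transition, user_reward, gamma=0.95, iters=500):
    """Value iteration over a small MDP. The user's reward vector is a fixed
    input: the planner optimizes against it but never edits it.
    transition[a] is an (S, S) matrix of P(s' | s, a);
    user_reward is a length-S vector of the user's own valuations."""
    V = np.zeros(transition.shape[1])
    for _ in range(iters):
        # Q[a, s] = r(s) + gamma * E[V(s') | s, a]
        Q = user_reward[None, :] + gamma * (transition @ V)
        V = Q.max(axis=0)
    policy = Q.argmax(axis=0)  # best action in each state, by the user's own values
    return policy, V

# Example: 3 states, 2 actions; the rewards are chosen by the user, not the planner.
P = np.array([
    [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0], [0.0, 0.1, 0.9]],  # action 0
    [[0.2, 0.8, 0.0], [0.0, 0.2, 0.8], [0.0, 0.0, 1.0]],  # action 1
])
r_user = np.array([0.0, 0.0, 1.0])  # the user values reaching state 2
policy, values = plan_for_user(P, r_user)
print(policy, values)
```

The hijacking I want to exclude would correspond to the planner optimizing over `r_user` itself (or over the process that produces it) rather than over the policy.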