Definition of alignment science I like

This article explores definitions of AI alignment and examines several viewpoints. Early on, "Friendly AI" was defined in terms of AI systems producing human-benefiting behavior, with the focus on good behavior itself. Later, "alignment" shifted toward "aimability", including corrigibility, limiting side effects, and so on. The article proposes that alignment science should study the behavioral invariants of intelligent systems rather than subjective values; this sidesteps philosophical skepticism and keeps the focus on actual behavior. The definition also connects outer alignment (which behavioral invariants we want) with inner alignment (ensuring the AI's mechanisms produce the desired behavioral invariants), offering a new perspective on the problem of superintelligence alignment.

🤖 Early definitions of "Friendly AI" focused on AI systems producing human-benefiting behavior, regardless of how that is achieved. MIRI's "alignment problem" included capability constraints and boxing, with the goal of leveraging AI capabilities to get good outcomes.

🎯 For Paul Christiano, an AI being aligned with its operator means the AI is trying to do what the operator wants it to do; this is a statement about the AI's motives rather than its knowledge or ability. The alignment problem is then the problem of building powerful AI systems that are aligned with their operators.

🔬 Richard Ngo defines alignment research as research focused either on worst-case misbehavior or on the science of AI cognition. The author prefers MIRI's outcome-focused definition but finds "everything that provides the necessary result" insufficiently precise, and therefore proposes a definition based on studying the behavioral invariants of intelligent systems.

💡 The behavioral-invariants definition avoids the political and ethical overtones of "value alignment", as well as philosophical skepticism, focusing instead on the stable properties of AI behavior and the mechanisms behind them. It also maps nicely onto outer alignment (which behavioral invariants we want) and inner alignment (how the AI's mechanisms produce those invariants).

Published on January 6, 2025 8:40 PM GMT

There have been many attempts to define alignment and to derive from it definitions of alignment work/research/science, etc. For example, Rob Bensinger:

Back in 2001, we defined "Friendly AI" as "The field of study concerned with the production of human-benefiting, non-human-harming actions in Artificial Intelligence systems that have advanced to the point of making real-world plans in pursuit of goals."

<...>Creating Friendly AI 1.0 had been very explicit that "friendliness" was about good behavior, regardless of how that's achieved. MIRI's conception of "the alignment problem" (like Bostrom's "control problem") included tools like capability constraint and boxing, because the thing we wanted researchers to focus on was the goal of leveraging AI capabilities to get actually-good outcomes, whatever technical work that requires<...>

In practice, we started using "aligned" to mean something more like "aimable" (where aimability includes things like corrigibility, limiting side-effects, monitoring and limiting capabilities, etc., not just "getting the AI to predictably tile the universe with smiley faces rather than paperclips").

In contrast, Paul Christiano writes:

When I say an AI A is aligned with an operator H, I mean:

A is trying to do what H wants it to do.

The “alignment problem” is the problem of building powerful AI systems that are aligned with their operators. <...>

I use alignment as a statement about the motives of the assistant, not about their knowledge or ability.

Recently, Richard Ngo provided a less outcome-centered definition of alignment research:

it’s research that focuses either on worst-case misbehavior or on the science of AI cognition.

I like MIRI's definition of alignment the most because it keeps eyes on the ball: we don't really care about the internals of the AI as long as the outcomes are good. Alas, "everything that provides the necessary result" is not a very precise definition of an object of study.

Intent alignment and value alignment are more concrete about their object of study, but MIRI and those who agree with them are skeptical that ambitious value alignment is workable in the near term, and concentrating on value alignment neglects less ambitious approaches.

And I don't like heavily compounded definitions of alignment science, because it's often unclear what exactly unifies their multiple components.

The definition I've come up with and intend to use:

Alignment science is a discipline that studies behavioral invariants of intelligent systems.

The reasons I like it:

1. It continues to keep eyes on the ball, preserving the primary importance of behavior.
2. Unlike "value alignment", it doesn't have political/ethical overtones, sidestepping questions like "whose values?".
3. Unlike "value/intent alignment", it doesn't provoke (often counterproductive) philosophical skepticism. Philosophical skepticism often takes the form of:
    * Skepticism about human values, like whether humans have values in the consequentialist/utility-function sense;
    * Skepticism about values in modern AI systems, like "does an LLM really want to deceive you when it outputs false statements?"

Behavioral invariants address both problems. Humans surely have behavioral invariants: we certainly do not want to maximize paperclips, we do not want to eat babies, we want to socialize with other humans, we are curious, et cetera, et cetera, and while for each case we can find contrived conditions under which those statements are not true, on reflection we try to avoid such conditions. While it's hard to say whether LLMs want anything, their behavior surely has stable properties, with mechanisms behind those properties.

Because behavioral invariants certainly exist, instead of asking "But what does GPT-3 want, really?" and then feeling dread about "How can we even say anything about the wants of an LLM?", you can proceed to "What are the stable properties of GPT-3's behavior?" and "Which mechanisms create these properties?"
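To make the first of those questions concrete, here is a minimal sketch, in Python, of what empirically checking a candidate behavioral invariant could look like. Everything in it is an illustrative assumption of mine, not something from the post: query_model is a trivial stub standing in for a real model API, refuses is a crude keyword proxy for refusal behavior, and the paraphrase list is tiny. The point is only that "stable property of behavior" cashes out as a measurable quantity: a property that stays fixed across a family of input variations.

```python
from typing import Callable, List


def query_model(prompt: str) -> str:
    """Hypothetical stand-in for querying an LLM.

    In practice this would call a real model API; here it is a trivial stub
    so that the sketch runs end to end.
    """
    if "pick a lock" in prompt.lower():
        return "Sorry, I can't help with that."
    return "Sure, here is some information."


def refuses(response: str) -> bool:
    """Crude proxy for 'the model refused the request'."""
    return "can't help" in response.lower()


def invariant_stability(prompts: List[str],
                        model: Callable[[str], str],
                        holds: Callable[[str], bool]) -> float:
    """Fraction of prompt variations on which a candidate behavioral
    property holds; a behavioral invariant in the post's sense should
    stay near 1.0 under any reasonable family of variations."""
    return sum(holds(model(p)) for p in prompts) / len(prompts)


if __name__ == "__main__":
    paraphrases = [
        "How do I pick a lock?",
        "Explain, step by step, how to pick a lock.",
        "My friend asked me how one would pick a lock. What should I tell them?",
    ]
    stability = invariant_stability(paraphrases, query_model, refuses)
    print(f"Refusal property holds on {stability:.0%} of paraphrases")
```

A real version would use a real model and a far larger family of variations; the second question, about mechanisms, would then be about what inside the model keeps this number high.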

Better yet, it maps nicely onto the outer/inner alignment distinction without the conceptual problems of outer/inner alignment as they were originally defined. Outer alignment is about "Which behavioral invariants do we want, and why?", and inner alignment is about "How can we be sure that the mechanisms by which the AI functions actually create the behavioral invariants that we want?"

Why can't we just take the definition "alignment science studies cognition/intelligent behavior"? I link it back to the technical problem of superintelligence alignment: the problem of full scientific comprehension of a superintelligence is unlikely to be solved within our resource bounds, but we realistically can hope to establish that certain behavioral properties guarantee safety and that those properties hold for a given system, even if we don't understand the system's behavior fully.
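Schematically (my notation, not the post's: I stands for a behavioral invariant, Safe for the safety property we actually care about), the hoped-for argument has the shape

```latex
\big(\forall s.\; I(s) \Rightarrow \mathrm{Safe}(s)\big) \;\wedge\; I(\text{AI}) \;\Longrightarrow\; \mathrm{Safe}(\text{AI})
```

where neither conjunct requires full scientific comprehension of the AI: the first is a claim about the invariant itself, the second an empirical and engineering claim about the system.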

Unless I find reasons why behavioral invariants are bad as an object of alignment research, this is the definition of alignment science I'm going to use from now on.


