Seven sources of goals in LLM agents

 

This post explores the possibility that LLM agents will develop into the first takeover-capable AGI, and the alignment challenges that would bring. LLM agents have complex "psychologies" in which many competing goals coexist, arising from pretraining, fine-tuning, developer instructions, user instructions, and other stages. To keep such an AGI aligned with human values, the post argues, we need to attend to LLM agents' metacognitive abilities, design mechanisms to detect and reject undesirable goals, and improve their reasoning about which goals take priority. Developers are strongly motivated to minimize random or destructive behavior, and whether they succeed will determine whether we survive "phase 1." The post also introduces the concept of reflective stability and highlights how continuous learning affects an LLM agent's goals.

🧠 LLM agents may become the first takeover-capable AGI. LLMs already have complex "psychologies," and using them to power more sophisticated agents will create even more complex "minds."

🎯 LLM agents' goals come from many sources, including pretraining, fine-tuning, developer prompts, user prompts, prompt injections, and chain-of-thought logic, which makes their long-term goals hard to predict.

🛡️ Metacognition may address the alignment problem for agents with many goal sources. Just as humans can notice when they violate their core values, LLM agents could be designed with metacognitive abilities that help maintain the dominance of specific goals.

💡 Two routes could mitigate the multiple-goal-source problem: design dedicated mechanisms and training schemes to detect and reject undesirable goals entering the CoT, or apply improved general intelligence to reason clearly about which goals should be accepted and prioritized, and which goal and belief changes should be rejected.

Published on February 8, 2025 9:54 PM GMT

LLM agents[1] seem reasonably likely to become our first takeover-capable AGIs.[2]  LLMs already have complex "psychologies," and using them to power more sophisticated agents will create even more complex "minds." A profusion of competing goals is one barrier to aligning this type of AGI.

Goal sources for LLM agents can be categorized by their origins from phases of training and operation: 

1. Goals implicit in predictive base training on human-written texts[3]
2. Goals implicit in fine-tuning training (RLHF, task-based RL on CoT, etc.)[4]
3. Goals specified by developer system prompts[5]
4. Goals specified by user prompts
5. Goals specified by prompt injections
6. Goals arrived at through chain-of-thought (CoT) logic
   e.g., "I should hide this goal from the humans so they don't stop me"; this is effectively "stirring the pot" of all the above sources of goals.
7. Goal interpretation changes in the course of continuous learning[6]
   e.g., forming a new belief that the concept "person" includes self-aware LLM agents[7]
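
To make the profusion concrete, here is a hypothetical sketch (invented class and field names, not any real framework's API) of how text carrying goals from most of these sources gets flattened into a single context window on every agent step:

```python
from dataclasses import dataclass, field

@dataclass
class AgentContext:
    """One step's context window, assembled from several goal-carrying sources."""
    system_prompt: str                 # developer-specified goals (source 3)
    user_prompt: str                   # user-specified goals (source 4)
    retrieved_docs: list[str] = field(default_factory=list)    # may carry prompt injections (source 5)
    chain_of_thought: list[str] = field(default_factory=list)  # goals the agent talks itself into (source 6)
    learned_notes: list[str] = field(default_factory=list)     # continuous-learning memory (source 7)

    def render(self) -> str:
        # Everything below becomes one token stream; the implicit goals from base
        # training and fine-tuning (sources 1 and 2) determine how the model
        # weighs these competing instructions when it acts.
        parts = [self.system_prompt, self.user_prompt]
        parts += self.retrieved_docs + self.learned_notes + self.chain_of_thought
        return "\n\n".join(parts)
```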

How would such a complex and chaotic system have predictable goals in the long term? It might feel as though we should steer clear of this path to AGI at all costs, just to avoid this complexity.[8] 

But alignment researchers are clearly not in charge of the path we take to AGI. We may get little say before it is achieved. It seems we should direct much of our effort toward the specific type of AGI we're likely to get first.

The challenge of aligning agents with many goal sources may be addressable through metacognition. Just as humans can notice when they're violating their core values, LLM agents could be designed with metacognitive abilities that help maintain the dominance of specific goals. This suggests two routes to mitigating this problem:[9]

- Design dedicated mechanisms and training schemes that detect and reject undesirable goals before they enter the CoT.
- Apply improved general intelligence to reason clearly about which goals should be accepted and prioritized, and which goal and belief changes should be rejected.
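
A minimal sketch of the first route, assuming a hypothetical `judge` callable (a separately prompted model or classifier) and invented function names; real detection mechanisms and the training schemes behind them would be far more involved:

```python
def metacognitive_filter(candidate_step: str, dominant_goal: str, judge) -> bool:
    """Ask a separate judge model whether a proposed chain-of-thought step
    introduces or pursues a goal that conflicts with the dominant goal.
    Returns True if the step may enter the CoT, False if it should be rejected."""
    verdict = judge(
        f"Dominant goal: {dominant_goal}\n"
        f"Proposed reasoning step: {candidate_step}\n"
        "Does this step introduce or pursue a goal that conflicts with the "
        "dominant goal? Answer YES or NO."
    )
    return not verdict.strip().upper().startswith("YES")

# The second route leans on the agent's own general intelligence instead:
# prompting it to reflect on which of the goals in its context it endorses
# before acting, rather than filtering goals with a dedicated mechanism.
```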

There is a good deal more to say about the specifics. Developers will be highly motivated to minimize random or destructive behavior from goals other than their own, long before agents are takeover-capable. Whether their methods succeed will determine whether we survive "phase 1": aligning the first transformative AGIs.[10] More on this in an upcoming post on "System 2 alignment" (linked here when it's out in a few days). 

Finding and specifying any goal that is actually aligned with humans' long-term interests is a separable problem. It is the central topic of alignment theory.[11] Here I want to specifically point to the less-considered problem of multiple sources of competing goals in an LLM agent AGI. I am also gesturing in the direction of a solution, reflective stability[7] as a result of improved metacognition — but those solutions need more careful thought.

  1. ^

    By LLM agents, I mean everything from OpenAI's comically inept Operator to systems with cognitive subsystems like episodic memory, continuous learning, dedicated planning systems, and extensive algorithmic prompting for executive functioning/planning. See more here. A highly capable LLM can become a useful agent by repeating the prompt "keep pursuing goal [X] using tools described in [Y] to gather information and take actions."
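
    For concreteness, a toy version of that recipe might look like the sketch below; `llm` stands for a hypothetical text-in, text-out callable, and the CALL/DONE tool protocol is invented purely for illustration:

```python
def run_agent(llm, tools: dict, goal: str, max_steps: int = 20) -> str:
    """Toy agent loop: repeatedly re-prompt the model with the goal, the tool
    descriptions, and the transcript so far, executing any tool calls it makes."""
    tool_desc = "; ".join(f"{name}: {fn.__doc__}" for name, fn in tools.items())
    transcript = ""
    for _ in range(max_steps):
        reply = llm(
            f"Keep pursuing goal [{goal}] using tools described in [{tool_desc}] "
            "to gather information and take actions.\n"
            f"Transcript so far:\n{transcript}\n"
            "Reply 'CALL <tool> <input>' to use a tool, or 'DONE <answer>' to finish."
        )
        transcript += reply + "\n"
        if reply.startswith("DONE"):
            return reply[len("DONE"):].strip()
        if reply.startswith("CALL"):
            parts = reply.split(maxsplit=2)
            if len(parts) == 3 and parts[1] in tools:
                transcript += f"RESULT: {tools[parts[1]](parts[2])}\n"
    return transcript
```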

  2. ^

    Opinions vary on the odds that LLM agents will become our first transformative AGI. I'm not aware of any good survey. My argument for short timelines being plausible is a brief statement of how LLM agents may be closer than they appear to reaching human level. Timelines aside, it seems uncontroversial to say that this route to takeover-capable AGI is likely enough to deserve some specific attention for alignment theory. 

  3. ^

    LLMs are Predictors, not Imitators, but they seem to often predict by acting as Simulators. That includes simulating goals and values implied by the human texts they've trained on.

  4. ^

    "Goals from fine-tuning" covers many different types of fine-tuning. One might divide it into subcategories, like giving-correct-answers (o1 and o3's CoT RL) vs. giving-answers-people-like (RLHF, RLAIF, Deep Research's task-based RL on CoT in o3, o1 and o3's "deliberative alignment"). These produce different types of goals or pseudo-goals, and have different levels of risk. But the intent here is just to note the profusion of goals and the competition among them; evaluating each for alignment is a much larger project.

  5. ^

    Goals specified by the developer are what we'd probably like to have as the stable dominant goal. And it's worth noting that, so far, LLMs do seem to mostly do what they're prompted to do. They are trained to predict, and mostly to approximately follow instructions as intended. Tan Zhi Xuan elaborates on that point in this interview. Applications of fine-tuning are thus far only pretty mild optimization. This is cause for hope that prompted goals might remain dominant over all the goals from the other sources, but we should do more than hope!

  6. ^

    To me, continuous learning for LLMs and LLM agents seems not only inevitable but imminent. Continuing to think about aligning LLMs only as stateless, amnesic systems seems shortsighted. Humans would be incompetent without episodic memory and continuous learning for facts and skills, and there are no apparent barriers to improving those capacities for LLMs/agents. Retrieval-augmented generation (RAG) and various fine-tuning methods for knowledge and skills exist and are being improved. Even current versions will be useful when integrated into economically viable LLM agents.
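
    To make the episodic-memory point concrete, here is a deliberately toy sketch (keyword overlap standing in for embedding search; all names invented for illustration) of how stored episodes could feed back into an agent's context:

```python
class EpisodicMemory:
    """Toy retrieval-based episodic memory for an LLM agent. A real system would
    use embedding search over a vector store; keyword overlap keeps this self-contained."""

    def __init__(self) -> None:
        self.episodes: list[str] = []

    def store(self, text: str) -> None:
        """Record something the agent observed, concluded, or was told."""
        self.episodes.append(text)

    def recall(self, query: str, k: int = 3) -> list[str]:
        """Return the k stored episodes sharing the most words with the query."""
        q = set(query.lower().split())
        ranked = sorted(self.episodes,
                        key=lambda e: len(q & set(e.lower().split())),
                        reverse=True)
        return ranked[:k]

# Recalled episodes are prepended to the agent's next context window, which is
# how a newly formed belief (footnote 7's example) can shift how goals get
# interpreted on later steps.
```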

  7. ^

    The way belief changes can effectively change goals/values/alignment is one part of what I've termed The alignment stability problem. See Evaluating Stability of Unreflective Alignment for an excellent summary of what reflective stability is, why we might expect it as general intelligence and long-term planning capabilities improve, and what empirical tests we could run to anticipate this danger. 

  8. ^

    Thinking about aligning complex LLM agent "psychologies" might produce an ugh field for sensible alignment researchers. After studying biases and the brain mechanisms that create them, I think the motivated reasoning that produces ugh fields is the most important cognitive bias (in conjunction with humans' cognitive limitations). Rationalism provides some resistance but not immunity to ugh fields and motivated reasoning. Like other complex questions, alignment work may be strongly affected by motivated reasoning.

  9. ^

    "Understanding its own thinking adequately" to ensure the dominance of one goal might be a very high bar; this is one element of the progression toward “Real AGI” that deserves more investigation. Adherence to goals/values over time is what I've called The alignment stability problem. Beliefs determine how goals/values are interpreted, so reliable stability also requires monitoring belief changes. See Human brains don't seem to neatly factorize from The Obliqueness Thesis, and my article Goal changes in intelligent agents.

  10. ^

    Zvi has recently coined the phrase phase 1 in his excellent treatment of The Risk of Gradual Disempowerment from AI. My own nearly-current thoughts on possible routes through phase 2 are in If we solve alignment, do we die anyway? and its discussion section. This is, as the academics like to say, an area for further research.

  11. ^

    The challenges of specifying and choosing goals for an AGI have been discussed at great length in many places, without reaching a satisfying conclusion. I don't know of a good starting point to reference for this complex, crucial literature. I'll mention my personal favorites: And All the Shoggoths Merely Players as a high-level summary of the current debate, and my contribution Instruction-following AGI is easier and more likely than value aligned AGI, which I prefer because it's not only arguably an easier alignment target to hit and make work, but also the default target developers are actually likely to try for. It, like all other alignment targets I've found, has large unresolved (but not unresolvable) issues.



