Seven sources of goals in LLM agents

 

This post explores the possibility that LLM agents will develop into the first takeover-capable AGI, and the alignment challenges that would bring. LLM agents have complex "psychologies" in which many competing goals coexist, arising from pretraining, fine-tuning, developer instructions, user instructions, and other stages. To keep such an AGI aligned with human values, the post argues, we need to attend to LLM agents' metacognitive abilities, design mechanisms to detect and reject undesirable goals, and improve their reasoning about which goals take priority. Developers are strongly motivated to minimize random or destructive behavior, and whether they succeed will determine whether we survive "phase 1." The post also introduces the concept of reflective stability and highlights how continuous learning affects an LLM agent's goals.

🧠 LLM agents may become the first takeover-capable AGI. LLMs already have complex "psychologies," and using them to power more sophisticated agents will create even more complex "minds."

🎯 LLM agents' goals come from many sources, including pretraining, fine-tuning, developer prompts, user prompts, prompt injections, and chain-of-thought logic, which makes their long-term goals hard to predict.

🛡️ Metacognition may address the alignment problem for agents with many goal sources. Just as humans can notice when they violate their core values, LLM agents could be designed with metacognitive abilities that help maintain the dominance of specific goals.

💡 Two routes could mitigate the multiple-goal-source problem: design dedicated mechanisms and training schemes to detect and reject undesirable goals entering the CoT, or apply improved general intelligence to reason clearly about which goals should be accepted and prioritized, and which goal and belief changes should be rejected.

Published on February 8, 2025 9:54 PM GMT

LLM agents[1] seem reasonably likely to become our first takeover-capable AGIs.[2]  LLMs already have complex "psychologies," and using them to power more sophisticated agents will create even more complex "minds." A profusion of competing goals is one barrier to aligning this type of AGI.

Goal sources for LLM agents can be categorized by their origins from phases of training and operation: 

1. Goals implicit in predictive base training on human-written texts[3]
2. Goals implicit in fine-tuning training (RLHF, task-based RL on CoT, etc.)[4]
3. Goals specified by developer system prompts[5]
4. Goals specified by user prompts
5. Goals specified by prompt injections
6. Goals arrived at through chain-of-thought (CoT) logic
   e.g., "I should hide this goal from the humans so they don't stop me"; this is effectively "stirring the pot" of all the above sources of goals.
7. Goal interpretation changes in the course of continuous learning[6]
   e.g., forming a new belief that the concept "person" includes self-aware LLM agents[7]
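
To make the profusion concrete, here is a hypothetical sketch (invented class and field names, not any real framework's API) of how text carrying goals from most of these sources gets flattened into a single context window on every agent step:

```python
from dataclasses import dataclass, field

@dataclass
class AgentContext:
    """One step's context window, assembled from several goal-carrying sources."""
    system_prompt: str                 # developer-specified goals (source 3)
    user_prompt: str                   # user-specified goals (source 4)
    retrieved_docs: list[str] = field(default_factory=list)    # may carry prompt injections (source 5)
    chain_of_thought: list[str] = field(default_factory=list)  # goals the agent talks itself into (source 6)
    learned_notes: list[str] = field(default_factory=list)     # continuous-learning memory (source 7)

    def render(self) -> str:
        # Everything below becomes one token stream; the implicit goals from base
        # training and fine-tuning (sources 1 and 2) determine how the model
        # weighs these competing instructions when it acts.
        parts = [self.system_prompt, self.user_prompt]
        parts += self.retrieved_docs + self.learned_notes + self.chain_of_thought
        return "\n\n".join(parts)
```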

How would such a complex and chaotic system have predictable goals in the long term? It might feel as though we should steer clear of this path to AGI at all costs, just to avoid this complexity.[8] 

But alignment researchers are clearly not in charge of the path we take to AGI. We may get little say before it is achieved. It seems we should direct much of our effort toward the specific type of AGI we're likely to get first.

The challenge of aligning agents with many goal sources may be addressable through metacognition. Just as humans can notice when they're violating their core values, LLM agents could be designed with metacognitive abilities that help maintain the dominance of specific goals. This suggests two routes to mitigating this problem:[9]

- Design dedicated mechanisms and training schemes that detect and reject undesirable goals before they enter the CoT.
- Apply improved general intelligence to reason clearly about which goals should be accepted and prioritized, and which goal and belief changes should be rejected.
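
A minimal sketch of the first route, assuming a hypothetical `judge` callable (a separately prompted model or classifier) and invented function names; real detection mechanisms and the training schemes behind them would be far more involved:

```python
def metacognitive_filter(candidate_step: str, dominant_goal: str, judge) -> bool:
    """Ask a separate judge model whether a proposed chain-of-thought step
    introduces or pursues a goal that conflicts with the dominant goal.
    Returns True if the step may enter the CoT, False if it should be rejected."""
    verdict = judge(
        f"Dominant goal: {dominant_goal}\n"
        f"Proposed reasoning step: {candidate_step}\n"
        "Does this step introduce or pursue a goal that conflicts with the "
        "dominant goal? Answer YES or NO."
    )
    return not verdict.strip().upper().startswith("YES")

# The second route leans on the agent's own general intelligence instead:
# prompting it to reflect on which of the goals in its context it endorses
# before acting, rather than filtering goals with a dedicated mechanism.
```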

There is a good deal more to say about the specifics. Developers will be highly motivated to minimize random or destructive behavior from goals other than their own, long before agents are takeover-capable. Whether their methods succeed will determine whether we survive "phase 1": aligning the first transformative AGIs.[10] More on this in an upcoming post on "System 2 alignment" (linked here when it's out in a few days). 

Finding and specifying any goal that is actually aligned with humans' long-term interests is a separable problem. It is the central topic of alignment theory.[11] Here I want to specifically point to the less-considered problem of multiple sources of competing goals in an LLM agent AGI. I am also gesturing in the direction of a solution, reflective stability[7] as a result of improved metacognition — but those solutions need more careful thought.

  1. ^

    By LLM agents, I mean everything from OpenAI's comically inept Operator to systems with cognitive subsystems like episodic memory, continuous learning, dedicated planning systems, and extensive algorithmic prompting for executive functioning/planning. See more here. A highly capable LLM can become a useful agent by repeating the prompt "keep pursuing goal [X] using tools described in [Y] to gather information and take actions."
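
    For concreteness, a toy version of that recipe might look like the sketch below; `llm` stands for a hypothetical text-in, text-out callable, and the CALL/DONE tool protocol is invented purely for illustration:

```python
def run_agent(llm, tools: dict, goal: str, max_steps: int = 20) -> str:
    """Toy agent loop: repeatedly re-prompt the model with the goal, the tool
    descriptions, and the transcript so far, executing any tool calls it makes."""
    tool_desc = "; ".join(f"{name}: {fn.__doc__}" for name, fn in tools.items())
    transcript = ""
    for _ in range(max_steps):
        reply = llm(
            f"Keep pursuing goal [{goal}] using tools described in [{tool_desc}] "
            "to gather information and take actions.\n"
            f"Transcript so far:\n{transcript}\n"
            "Reply 'CALL <tool> <input>' to use a tool, or 'DONE <answer>' to finish."
        )
        transcript += reply + "\n"
        if reply.startswith("DONE"):
            return reply[len("DONE"):].strip()
        if reply.startswith("CALL"):
            parts = reply.split(maxsplit=2)
            if len(parts) == 3 and parts[1] in tools:
                transcript += f"RESULT: {tools[parts[1]](parts[2])}\n"
    return transcript
```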

  2. ^

    Opinions vary on the odds that LLM agents will become our first transformative AGI. I'm not aware of any good survey. My argument for short timelines being plausible is a brief statement of how LLM agents may be closer than they appear to reaching human level. Timelines aside, it seems uncontroversial to say that this route to takeover-capable AGI is likely enough to deserve some specific attention for alignment theory. 

  3. ^

    LLMs are Predictors, not Imitators, but they seem to often predict by acting as Simulators. That includes simulating goals and values implied by the human texts they've trained on.

  4. ^

    "Goals from fine-tuning" covers many different types of fine-tuning. One might divide it into subcategories, like giving-correct-answers (o1 and o3's CoT RL) vs. giving-answers-people-like (RLHF, RLAIF, Deep Research's task-based RL on CoT in o3, o1 and o3's "deliberative alignment"). These produce different types of goals or pseudo-goals, and have different levels of risk. But the intent here is just to note the profusion of goals and the competition among them; evaluating each for alignment is a much larger project.

  5. ^

    Goals specified by the developer are what we'd probably like to have as the stable dominant goal. And it's worth noting that, so far, LLMs do seem to mostly do what they're prompted to do. They are trained to predict, and mostly to approximately follow instructions as intended. Tan Zhi Xuan elaborates on that point in this interview. Applications of fine-tuning are thus far only pretty mild optimization. This is cause for hope that prompted goals might remain dominant over all the goals from the other sources, but we should do more than hope!

  6. ^

    To me, continuous learning for LLMs and LLM agents seems not only inevitable but imminent. Continuing to think about aligning LLMs only as stateless, amnesic systems seems shortsighted. Humans would be incompetent without episodic memory and continuous learning for facts and skills, and there are no apparent barriers to improving those capacities for LLMs/agents. Retrieval-augmented generation (RAG) and various fine-tuning methods for knowledge and skills exist and are being improved. Even current versions will be useful when integrated into economically viable LLM agents.
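
    To make the episodic-memory point concrete, here is a deliberately toy sketch (keyword overlap standing in for embedding search; all names invented for illustration) of how stored episodes could feed back into an agent's context:

```python
class EpisodicMemory:
    """Toy retrieval-based episodic memory for an LLM agent. A real system would
    use embedding search over a vector store; keyword overlap keeps this self-contained."""

    def __init__(self) -> None:
        self.episodes: list[str] = []

    def store(self, text: str) -> None:
        """Record something the agent observed, concluded, or was told."""
        self.episodes.append(text)

    def recall(self, query: str, k: int = 3) -> list[str]:
        """Return the k stored episodes sharing the most words with the query."""
        q = set(query.lower().split())
        ranked = sorted(self.episodes,
                        key=lambda e: len(q & set(e.lower().split())),
                        reverse=True)
        return ranked[:k]

# Recalled episodes are prepended to the agent's next context window, which is
# how a newly formed belief (footnote 7's example) can shift how goals get
# interpreted on later steps.
```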

  7. ^

    The way belief changes can effectively change goals/values/alignment is one part of what I've termed The alignment stability problem. See Evaluating Stability of Unreflective Alignment for an excellent summary of what reflective stability is, why we might expect it as general intelligence and long-term planning capabilities improve, and what empirical tests we could run to anticipate this danger. 

  8. ^

    Thinking about aligning complex LLM agent "psychologies" might produce an ugh field for sensible alignment researchers. After studying biases and the brain mechanisms that create them, I think the motivated reasoning that produces ugh fields is the most important cognitive bias (in conjunction with humans' cognitive limitations). Rationalism provides some resistance but not immunity to ugh fields and motivated reasoning. Like other complex questions, alignment work may be strongly affected by motivated reasoning.

  9. ^

    "Understanding its own thinking adequately" to ensure the dominance of one goal might be a very high bar; this is one element of the progression toward “Real AGI” that deserves more investigation. Adherence to goals/values over time is what I've called The alignment stability problem. Beliefs determine how goals/values are interpreted, so reliable stability also requires monitoring belief changes. See Human brains don't seem to neatly factorize from The Obliqueness Thesis, and my article Goal changes in intelligent agents.

  10. ^

    Zvi has recently coined the phrase phase 1 in his excellent treatment of The Risk of Gradual Disempowerment from AI. My own nearly-current thoughts on possible routes through phase 2 are in If we solve alignment, do we die anyway? and its discussion section. This is, as the academics like to say, an area for further research.

  11. ^

    The challenges of specifying and choosing goals for an AGI have been discussed at great length in many places, without reaching a satisfying conclusion. I don't know of a good starting point to reference for this complex, crucial literature. I'll mention my personal favorites: And All the Shoggoths Merely Players as a high-level summary of the current debate, and my contribution Instruction-following AGI is easier and more likely than value aligned AGI, which I prefer because it's not only arguably an easier alignment target to hit and make work, but also the default target developers are actually likely to try for. It, like all other alignment targets I've found, has large unresolved (but not unresolvable) issues.



