Published on February 13, 2025 12:03 AM GMT

There are three main ways to try to understand and reason about powerful future AGI agents:

Using formal models designed to predict the behavior of powerful general agents, such as expected utility maximization and variants thereof (explored in game theory and decision theory).Comparing & contrasting powerful future AGI agents with their weak, not-so-general, not-so-agentic AIs that actually exist today.Comparing & contrasting powerful future AGI agents with currently-existing powerful general agents, such as humans and human organizations.

I think it’s valuable to try all three approaches. Today I'm exploring strategy #3, building an extended analogy between:

A prototypical human corporation that has a lofty humanitarian mission but also faces market pressures and incentives.A prototypical human working there, who thinks of themselves as a good person and independent thinker with lofty altruistic goals, but also faces the usual peer pressures and incentives.AGI agents being trained in our scenario — trained by a training process that mostly rewards strong performance on a wide range of difficult and challenging tasks, but also attempts to train in various goals and principles (those described in the Spec).

The Analogy

Agent	Human corporation with a lofty humanitarian mission	Human who claims to be a good person with altruistic goals	AGI trained in our scenario
Not-so-local modification process	The Market	Evolution by natural selection	Agent-3 company iterating on different models, architectures, training setups, etc.
	(??? …nevermind about this)	Genes	Code
Local modification process	Responding to incentives over the span of several years as the organization grows and changes	In-lifetime learning, dopamine rewiring your brain, etc.	Training process, the reward function, stochastic gradient descent, etc.
Long-term cognition	Decisions that involve meetings, messages being passed back and forth, etc.	System 2	Chain of Thought (CoT)
Short-term cognition	Quick decisions made by individuals	System 1	Forward pass
Internally- represented goals & principles (i.e. insofar as cognition in the system is doing some sort of explicit evaluation of different possible plans/actions/etc., what are the evaluation criteria?)	the Mission Statement and/or various high-level companywide memos, Codes of Conduct, etc.	What you say to yourself when people ask about your goals, ambitions, ethics, etc. or when you ask yourself those questions.	For now, we can arguably tell what these are by looking at the CoT + prompts. In the future, we'd need good mechinterp tools.
Behavioral goals & principles (“Goals-according-to-intentional-stance”)	Often (but not always) well-described by: “First, prune away options that clearly/obviously conflict with the internally-represented goals and principles. Then, of the remaining options, choose the one that maximizes profit/power/brand.”	Often (but not always) well-described by: “First, prune away options that clearly/obviously conflict with the internally-represented goals and principles. Then, of the remaining options, choose the one that maximizes happiness/status/ /wealth/power.”	??? This is the quadrillion-dollar question! We make some guesses in our scenario.
Externally- presented goals & principles	Often they are honest and report their internally-represented goals and principles; some organizations sometimes are dishonest though.	Often they are honest and report their internally-represented goals and principles; some people sometimes are dishonest though.	??? Depends on the behavioral goals/principles and the details of the situation. We make some guesses in our scenario.
Analogue of on-episode-reward seeking.	A corporation obsessed with their own stock price. More generally perhaps, an organization obsessed with (relatively short-term) profits/power/ /brand/etc. (n.b. Isn’t this basically most corporations?)	Someone who is addicted to drugs. More generally perhaps, an employee who is well-described as optimizing for some combination of relatively short-term things likely to be connected to their brain’s reward circuitry: promotions, approval of their manager and peers, good times with friends, etc.	AGI obsessed with on-episode reward

What happens when training incentives conflict with goals/principles

Suppose at time t, Agent-3 has goals/principles X. Suppose that Agent-3 is undergoing training, and X is substantially suboptimal for performing well / scoring highly in that training environment. What happens? This appendix attempts to describe various possibilities.

Consider a powerful general agent (such as any of the three described in the Analogy) that, at time t, has the same behavioral and internally-represented goals/principles:

Internally- represented goals & principles (“Goals-according-to-ideal-mechinterp”	the Mission Statement and/or various high-level companywide memos, Codes of Conduct, etc.	What you say to yourself when people ask about your goals, ambitions, ethics, etc. or when you ask yourself those questions.	The Spec (and/or System Prompt)
Behavioral goals & principles (“Goals-according-to-intentional-stance”)	At least in the sorts of circumstances that are likely to occur, it really does simply work to achieve the Mission while upholding the Code of Conduct etc. There isn’t anything else going on worth mentioning.	At least in the sorts of circumstances that are likely to occur, you straightforwardly work towards the goals/principles you think you do.	Agent-3 agent really does choose actions simply by following the goals/principles described in the Spec.

Now let’s further suppose that there is some sort of conflict between the behavioral goals/principles and the local modification process. (The training process in the case of the AGI, a few years’ worth of learning and growing for the human and corporation). For example, perhaps the corporation is reinforced primarily for producing profits and PR wins; perhaps the human is reinforced primarily for winning the approval and admiration of their peers; perhaps the AGI is reinforced primarily for accomplishing various difficult tasks in some training environment while appearing, on brief inspection by some previous-generation LLM or human raters, to follow the Spec.

What can happen? Some combination of the following possibilities, at least:

A human quietly stops believing in the ideology and caring about the Cause. They have new beliefs and/or goals now. Maybe they keep up the charade, maybe they don’t.Maybe the corporation explicitly changes their mission statement and sends out some memos to all their employees explaining the change. Or maybe: Behind closed doors the c-suite and board say explicitly that they aren’t really pursuing the mission anymore, that it’s outdated or inappropriate, but that they can’t change it for PR and legal reasons.Maybe the identity-circuitry [LINK] gets new inputs hard-coded in (or erased), or maybe some subcircuitry

meaning

The human continues to believe in the Ideology and care about the Cause, but they’ve ‘added nuance’ and/or reinterpreted it. “When I say X, what I mean is…”Ditto but for the corporation and it’s Mission + Code of Conduct.For an AGI, maybe the identity-circuitry still has the same concepts/classifiers (or pointers to them, at least) hardcoded, but the concepts/classifiers have themselves been tweaked so as to not block behaviors that are conducive to reinforcement.

For example, perhaps the human employee learns to mostly not think about the big picture stuff, to instead keep their head down and complete their assigned tasks well. “I like to stay out of office politics” they say, and it’s true. The reason why it is true is because disagreeing with their peers and managers about whether the org is hurting or helping the Mission is stressful and has been subtly anti-reinforced in the past. So they still believe in the Ideology and the Cause and they haven’t reinterpreted the meanings of anything, but they’ve gradually (subconsciously or consciously) learned not to think too hard about certain topics.For example, perhaps the company as a whole continues to justify their decisions at a high level by referencing to the Mission and Code of Conduct, and the meanings of the words haven’t changed — but the organization has grown ten times bigger, and almost all of the new jobs are for things like product management and government affairs and comms and legal and so forth, and the result is that the company has a sort of status quo bias / default momentum towards doing normal company things like making products, making money, issuing bland corporate PR statements, lobbying governments to undermine regulation that might get in the way of the above, etc. ‘outside the box’ strategies for achieving the Mission rarely get traction internally and anyhow would be difficult/costly to undertake.Or, maybe biases are introduced in other parts of the system / the problem is routed-around, resulting in a decision tree setup where e.g. in obvious cases of conflict between official and unofficial goals, it obeys the former, but in nonobvious cases it pursues the latter. (e.g. a strong bias towards option value, whose official status is a heuristic-that-has-proved-useful-for-achieving-the-official-goal, but which is quite strong and hard to dislodge, would have this effect. It would basically be a decision tree setup where it pursues ICG such as option value unless there’s an obvious conflict with the official goals in which case it pursues the official goals.) Another variant of this: The biases are specific instead of general; they are more like reflexes. ‘When you see X, do Y.’ The decision tree is “Pursue the official goals unless in circumstance C in which case do X and/or pursue the unofficial goal.”

the alignment-faking paper

Appendix: Three important concepts/distinctions

A standard way to think about powerful general agents is the expected utility maximization (EU-max) model. Here are three concepts/distinctions that help articulate several ways in which we think future AGIs (and present-day agents) are different from what the EU-max model would naively imply.

Goals vs. Principles

Honesty (Whether the action involves knowingly misleading someone)Promise-keeping (Whether the action is consistent with past promises)Hypothetical approval (Whether the action is what someone else would have approved of, if they were brought up to speed on the situation)

Contextually activated goals/principles

not

blatantly

Stability and/or consistency of goals/principles

Consider how two humans who are quite demographically similar, who are friends, and who profess basically the same beliefs and values, nevertheless can find themselves strongly disagreeing with each other when they take a philosophy 101 class and are presented with strange thought experiments about trolleys or utopias or infinite ethics.Moreover, consider how even the same student probably could end up with different opinions about these thought experiments if their circumstances were slightly changed — e.g. perhaps if the framing of the experiment was different, or if they had seen a different sci-fi movie the day before, or if their crush had made a different face when first hearing the experiment.Moreover, consider how the students’ opinions on the philosophy seminar probably only loosely correlate with what they would actually do if a real-world situation materialized that was analogous to the hypothetical.The point is, it’s possible for powerful general agents to be well-described as pursuing certain goals/principles in some range of circumstances, but not outside that range — outside that range, in crazy new circumstances, they’ll probably try to extend their goals/principles to the new situation, but the way that shakes out in practice might be random/path-dependent and hard to predict in advance—and they might just abandon their goals/principles entirely in the new situation. In other words, the agent’s goals/principles are only stable & coherent in some normal range.We think this might be true of some AGI systems as well. It’s arguably true of most humans and LLM agents today. The most likely case this might come up is that the AGI systems will act according to consistent goals/principles in situations like those they have been trained on, but act unpredictably (yet still highly competently) in cases very unlike any that they have seen in training thus far.

Discuss

The Analogy

What happens when training incentives conflict with goals/principles

Appendix: Three important concepts/distinctions

Goals vs. Principles

Contextually activated goals/principles

Stability and/or consistency of goals/principles

Fish AI Reader

FishAI

联系邮箱 441953276@qq.com

相关标签