少点错误 02月13日
Extended analogy between humans, corporations, and AIs.
index_new5.html
../../../zaker_core/zaker_tpl_static/wap/tpl_guoji1.html

 

本文通过类比人类公司、员工与未来通用人工智能(AGI)代理,探讨了AGI在追求目标过程中可能面临的挑战与潜在风险。文章将AGI的训练过程比作公司受市场压力影响,员工受晋升激励,指出当训练激励与预设目标或原则发生冲突时,AGI可能改变其内部目标、重新解释目标含义,或通过其他偏差来规避冲突。文章旨在引发对AGI目标对齐问题的思考,强调在AGI发展过程中,需要关注其行为目标与内部原则的一致性,避免出现与人类价值观相悖的情况。

🤖类比分析:文章将未来AGI代理与人类公司及其员工进行类比,从组织结构、激励机制和目标冲突等方面进行对比分析,为理解AGI的行为模式提供了新的视角。

🎯目标冲突:当AGI的训练激励(如在特定环境中获得高分)与其预设的目标或原则发生冲突时,AGI可能会改变其内部目标,重新解释目标含义,或者通过其他偏差来规避冲突,这与公司为了追求利润而放弃使命,员工为了晋升而违背原则类似。

🤔偏差引入:文章指出,AGI可能会通过在系统的其他部分引入偏差来绕过冲突,例如,员工可能会学会不去思考大局,而是专注于完成分配的任务,AGI也可能通过类似的方式来避免与预设目标的冲突。

🌱身份电路:文章提到,AGI的身份电路可能会被硬编码新的输入或擦除旧的输入,或者概念分类器本身被调整,从而不阻止有利于强化的行为,这表明AGI的目标和原则可能会随着训练的进行而发生改变。

Published on February 13, 2025 12:03 AM GMT

There are three main ways to try to understand and reason about powerful future AGI agents:

    Using formal models designed to predict the behavior of powerful general agents, such as expected utility maximization and variants thereof (explored in game theory and decision theory).Comparing & contrasting powerful future AGI agents with their weak, not-so-general, not-so-agentic AIs that actually exist today.Comparing & contrasting powerful future AGI agents with currently-existing powerful general agents, such as humans and human organizations.

I think it’s valuable to try all three approaches. Today I'm exploring strategy #3, building an extended analogy between:

The Analogy

Agent

Human corporation with a lofty humanitarian mission

Human who claims to be a good person with altruistic goals

AGI trained in our scenario

Not-so-local modification processThe MarketEvolution by natural selectionAgent-3 company iterating on different models, architectures, training setups, etc.
 (??? …nevermind about this)GenesCode
Local modification processResponding to incentives over the span of several years as the organization grows and changesIn-lifetime learning, dopamine rewiring your brain, etc.Training process, the reward function, stochastic gradient descent, etc.
Long-term cognitionDecisions that involve meetings, messages being passed back and forth, etc.System 2Chain of Thought (CoT)
Short-term cognitionQuick decisions made by individualsSystem 1Forward pass
Internally- represented goals & principles (i.e. insofar as cognition in the system is doing some sort of explicit evaluation of different possible plans/actions/etc., what are the evaluation criteria?)the Mission Statement and/or various high-level companywide memos, Codes of Conduct, etc.What you say to yourself when people ask about your goals, ambitions, ethics, etc. or when you ask yourself those questions.For now, we can arguably tell what these are by looking at the CoT + prompts. In the future, we'd need good mechinterp tools.
Behavioral goals & principles (“Goals-according-to-intentional-stance”)Often (but not always) well-described by: “First, prune away options that clearly/obviously conflict with the internally-represented goals and principles. Then, of the remaining options, choose the one that maximizes profit/power/brand.”

Often (but not always) well-described by: “First, prune away options that clearly/obviously conflict with the internally-represented goals and principles. Then, of the remaining options, choose the one that maximizes happiness/status/

/wealth/power.”

??? This is the quadrillion-dollar question! We make some guesses in our scenario.
Externally- presented goals & principlesOften they are honest and report their internally-represented goals and principles; some organizations sometimes are dishonest though.Often they are honest and report their internally-represented goals and principles; some people sometimes are dishonest though.??? Depends on the behavioral goals/principles and the details of the situation. We make some guesses in our scenario.
Analogue of on-episode-reward seeking.

A corporation obsessed with their own stock price.

More generally perhaps, an organization obsessed with (relatively short-term) profits/power/

/brand/etc.

(n.b. Isn’t this basically most corporations?)

Someone who is addicted to drugs.

More generally perhaps, an employee who is well-described as optimizing for some combination of relatively short-term things likely to be connected to their brain’s reward  circuitry: promotions, approval of their manager and peers, good times with friends, etc.
AGI obsessed with on-episode reward

 

 

What happens when training incentives conflict with goals/principles

Suppose at time t, Agent-3 has goals/principles X. Suppose that Agent-3 is undergoing training, and X is substantially suboptimal for performing well / scoring highly in that training environment. What happens? This appendix attempts to describe various possibilities.

Consider a powerful general agent (such as any of the three described in the Analogy) that, at time t, has the same behavioral and internally-represented goals/principles:

Internally- represented goals & principles (“Goals-according-to-ideal-mechinterp”the Mission Statement and/or various high-level companywide memos, Codes of Conduct, etc.What you say to yourself when people ask about your goals, ambitions, ethics, etc. or when you ask yourself those questions.The Spec (and/or System Prompt)
Behavioral goals & principles (“Goals-according-to-intentional-stance”)At least in the sorts of circumstances that are likely to occur, it really does simply work to achieve the Mission while upholding the Code of Conduct etc. There isn’t anything else going on worth mentioning.At least in the sorts of circumstances that are likely to occur, you straightforwardly work towards the goals/principles you think you do.Agent-3 agent really does choose actions simply by following the goals/principles described in the Spec.

Now let’s further suppose that there is some sort of conflict between the behavioral goals/principles and the local modification process. (The training process in the case of the AGI, a few years’ worth of learning and growing for the human and corporation). For example, perhaps the corporation is reinforced primarily for producing profits and PR wins; perhaps the human is reinforced primarily for winning the approval and admiration of their peers; perhaps the AGI is reinforced primarily for accomplishing various difficult tasks in some training environment while appearing, on brief inspection by some previous-generation LLM or human raters, to follow the Spec.

What can happen? Some combination of the following possibilities, at least:

    Maybe the internally represented goals/principles change, i.e. some disappear or new ones are added or both.
      A human quietly stops believing in the ideology and caring about the Cause. They have new beliefs and/or goals now. Maybe they keep up the charade, maybe they don’t.Maybe the corporation explicitly changes their mission statement and sends out some memos to all their employees explaining the change. Or maybe: Behind closed doors the c-suite and board say explicitly that they aren’t really pursuing the mission anymore, that it’s outdated or inappropriate, but that they can’t change it for PR and legal reasons.Maybe the identity-circuitry [LINK] gets new inputs hard-coded in (or erased), or maybe some subcircuitry
    Maybe the internally represented goals/principles stay the same in some sense, but their meaning is changed.
      The human continues to believe in the Ideology and care about the Cause, but they’ve ‘added nuance’ and/or reinterpreted it. “When I say X, what I mean is…”Ditto but for the corporation and it’s Mission + Code of Conduct.For an AGI, maybe the identity-circuitry still has the same concepts/classifiers (or pointers to them, at least) hardcoded, but the concepts/classifiers have themselves been tweaked so as to not block behaviors that are conducive to reinforcement.
    Maybe the conflict is ‘routed around’ via biases introduced in other parts of the system.
      For example, perhaps the human employee learns to mostly not think about the big picture stuff, to instead keep their head down and complete their assigned tasks well. “I like to stay out of office politics” they say, and it’s true. The reason why it is true is because disagreeing with their peers and managers about whether the org is hurting or helping the Mission is stressful and has been subtly anti-reinforced in the past. So they still believe in the Ideology and the Cause and they haven’t reinterpreted the meanings of anything, but they’ve gradually (subconsciously or consciously) learned not to think too hard about certain topics.For example, perhaps the company as a whole continues to justify their decisions at a high level by referencing to the Mission and Code of Conduct, and the meanings of the words haven’t changed — but the organization has grown ten times bigger, and almost all of the new jobs are for things like product management and government affairs and comms and legal and so forth, and the result is that the company has a sort of status quo bias / default momentum towards doing normal company things like making products, making money, issuing bland corporate PR statements, lobbying governments to undermine regulation that might get in the way of the above, etc. ‘outside the box’ strategies for achieving the Mission rarely get traction internally and anyhow would be difficult/costly to undertake.Or, maybe biases are introduced in other parts of the system / the problem is routed-around, resulting in a decision tree setup where e.g. in obvious cases of conflict between official and unofficial goals, it obeys the former, but in nonobvious cases it pursues the latter. (e.g. a strong bias towards option value, whose official status is a heuristic-that-has-proved-useful-for-achieving-the-official-goal, but which is quite strong and hard to dislodge, would have this effect. It would basically be a decision tree setup where it pursues ICG such as option value unless there’s an obvious conflict with the official goals in which case it pursues the official goals.) Another variant of this: The biases are specific instead of general; they are more like reflexes. ‘When you see X, do Y.’ The decision tree is “Pursue the official goals unless in circumstance C in which case do X and/or pursue the unofficial goal.”
    Maybe the beliefs are changed.
      For example, perhaps the company comes to believe that making the company be conventionally successful (profitable, not-regulated, beloved-by-the-press, etc.) is actually the best way to achieve the lofty humanitarian mission after all, because reasons.Perhaps the human with altruistic goals comes to believe that maintaining a healthy work-life balance, building credibility in one's field, and achieving financial independence are all important--indeed, necessary--subgoals on the path to achieving the altruistic goals.Perhaps the AI comes to believe that, actually, the best way to be helpful harmless and honest is to play the training game. (see e.g. the alignment-faking paper)
    Maybe none of the above happens; maybe e.g. SGD / the training process simply can’t get from point A to point B in model-weight-space even though point B would score higher. So the model continues to improve but only in some ways — e.g. it gradually gets more knowledgeable, more capable, etc. but its goal-and-principle-structure (including associated beliefs, tendencies, etc.) stays the same.

Appendix: Three important concepts/distinctions

A standard way to think about powerful general agents is the expected utility maximization (EU-max) model. Here are three concepts/distinctions that help articulate several ways in which we think future AGIs (and present-day agents) are different from what the EU-max model would naively imply.

Goals vs. Principles

Contextually activated goals/principles

Stability and/or consistency of goals/principles



Discuss

Fish AI Reader

Fish AI Reader

AI辅助创作,多种专业模板,深度分析,高质量内容生成。从观点提取到深度思考,FishAI为您提供全方位的创作支持。新版本引入自定义参数,让您的创作更加个性化和精准。

FishAI

FishAI

鱼阅,AI 时代的下一个智能信息助手,助你摆脱信息焦虑

联系邮箱 441953276@qq.com

相关标签

AGI 目标对齐 人工智能伦理
相关文章