Lock-In Threat Models

Published on March 10, 2025 10:22 AM GMT

Epistemic status: a combination and synthesis of others' work, analysed and written over a few weeks. A high-level list of threat models that is open to criticism.

TL;DR

Humanity could end up in a lock-in within the next century. Here I outline the possible routes to that outcome and prioritise them against a set of importance criteria.

Existing Work

Lukas Finnveden

AGI and Lock-In (Finnveden et al., 2022), authored by Lukas Finnveden during an internship at Open Philanthropy, is currently the most detailed report on lock-in risk. It expands on Jess Riedel's notes on value lock-in; Riedel co-authored the report along with Carl Shulman. Building on Nick Bostrom's earlier arguments about AGI and superintelligence, the report argues that many features of society could be held stable for up to trillions of years due to digital error correction and the alignment problem. Specifically, it focuses on the technological feasibility of lock-in and on the role AGI would play in the long-term stability of features of future society.

The authors first argue that dictators made immortal by technological advancement could avoid the succession problem: failures of succession explain the end of past totalitarian regimes, but they would pose no obstacle to the preservation of future ones. Next, whole brain emulations (WBEs) of dictators could be loaded arbitrarily often and consulted on novel problems, enabling perpetual value lock-in.

They also argue that AGI-led institutions could themselves competently pursue goals with no value drift, thanks to digital error correction. This resilience can be reinforced by distributing copies of values across space, protecting them from local destruction. Their main threat model is that if AGI is developed, is misaligned, and does not permanently kill or disempower humans, lock-in is the likely default outcome.
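
To make the error-correction point concrete, here is a minimal Python sketch (my own illustration, not from the report) of how a goal specification could be kept stable indefinitely: store many redundant copies, let each suffer random corruption, and periodically overwrite all of them with a character-wise majority vote. With enough independent replicas, the probability of unrecoverable drift per period can be driven arbitrarily low.

```python
import random
from collections import Counter

# Hypothetical goal specification an AGI-led institution wants to preserve verbatim.
GOAL_SPEC = "Pursue objective O subject to constraint set C; revision forbidden."

def corrupt(text: str, p: float) -> str:
    """Model bit rot: each character is replaced with a random letter with probability p."""
    return "".join(
        random.choice("abcdefghijklmnopqrstuvwxyz") if random.random() < p else ch
        for ch in text
    )

def restore_by_majority(replicas: list[str]) -> str:
    """Rebuild the specification character by character via majority vote across replicas."""
    return "".join(Counter(column).most_common(1)[0][0] for column in zip(*replicas))

N_REPLICAS = 25  # e.g. copies distributed across many data centres (or planets)
replicas = [GOAL_SPEC] * N_REPLICAS

for period in range(1_000):
    # Each period, every replica independently degrades a little...
    replicas = [corrupt(r, p=0.01) for r in replicas]
    # ...and is then overwritten with the majority-vote reconstruction.
    consensus = restore_by_majority(replicas)
    replicas = [consensus] * N_REPLICAS

# After 1,000 periods the specification is almost certainly still character-identical.
print(consensus == GOAL_SPEC)
```

Distributing the replicas across physically separate locations plays the same role against correlated, local destruction that redundancy plays here against random corruption.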

Finnveden expanded on his threat models in a conversation in 2024, suggesting the following possible ways of arriving at a lock-in:

- “Someone has an ideology such that, if this person were to learn that further consideration will change their mind away from the ideology, they would choose to lock-in their ideology rather than go with their future self's more informed judgement. (Maybe some religions are like this.)”
- “Social dynamics where people socially pressure each other to show devotion to their ideology by getting their AIs to screen out all information that could convince them to switch to some other ideology.”
- “We can't build corrigible AI for some reason, and we can't avoid AI takeover, but we have some choice about what to make our AIs believe/value, and so that choice constitutes something to lock-in.”
- “A large group of people are trying to reach some agreement about what to do, and for some reason it's easier for them to reach an ‘object-level agreement’ like "these are the rules of the future" than it is for them to reach a ‘meta-level agreement’ like "here's how we'll think more about it and make a new decision in the future" (but why would this be?)”

William MacAskill

In What We Owe the Future (MacAskill, 2022), Will MacAskill introduces the concept of longtermism and its implications for the future of humanity. It was MacAskill who originally asked Lukas Finnveden to write the AGI and lock-in report. He expands on the concepts outlined in the report in more philosophical terms in chapter 4 of his book, entitled ‘Value Lock-In’.

MacAskill defines value lock-in as ‘an event that causes a single value system, or set of value systems, to persist for an extremely long time’. He stresses the importance of current cultural dynamics in shaping the long-term future, arguing that a set of values that becomes dominant now could remain stable far into the future. He identifies AI as the key technology with respect to lock-in, citing Finnveden et al. (2022). He echoes their threat models:

- An AGI agent with hard-coded goals acting on behalf of humans could competently pursue that goal indefinitely. The beyond-human intelligence of the agent suggests it could successfully prevent humans from doing anything about it.
- Whole brain emulations of humans could potentially pursue goals for eternity, due to their effective immortality.
- AGI may enable human immortality; an immortal human could instantiate a lock-in that could last indefinitely, especially if their actions are enabled and reinforced by AGI.
- Values could become more persistent if a single value system is globally dominant. If a future world war is won by one nation or group of nations, the value system of the winners may persist.

Nick Bostrom

In Superintelligence (Bostrom, 2014), Nick Bostrom introduces many relevant concepts, such as value alignment and the intelligence explosion. He describes lock-in as a potential second-order effect of the development of superintelligence: a superintelligence could create conditions that effectively lock in certain values or arrangements for an extremely long time, or permanently.

In chapter 5, Bostrom discusses the concept of a decisive strategic advantage: the possibility that one entity gains strategic power over the fate of humanity at large. He relates this to the potential formation of a singleton, a single decision-making agency at the highest level. In chapter 7 he introduces the instrumental convergence hypothesis, which offers insight into the likely motivations of autonomous AI systems: whatever an agent's initial goal, it will tend to develop a set of logically implied instrumental subgoals. In chapter 12 he introduces the value loading problem and the risks of misalignment arising from issues such as goal misspecification.

Bostrom frames lock-in as one potential outcome of an intelligence explosion, aside from the permanent disempowerment of humanity. He suggests that a single AI system that gains a decisive strategic advantage could control critical infrastructure and resources, becoming a singleton. He also outlines the value lock-in problem: hard-coding human values into AI systems that become generally intelligent or superintelligent may lead those systems, through instrumental convergence, to robustly defend those values against any later change. He further notes that the frameworks and architectures leading up to an intelligence explosion might themselves get locked in and shape subsequent AI development.

In What is a Singleton? (Bostrom, 2005), Bostrom defines the singleton, also mentioned in Superintelligence, as “a single decision-making agency at the highest level”, and explains that AI may facilitate its creation. An agency that obtains a decisive strategic advantage through a technological breakthrough in artificial intelligence or molecular nanotechnology may use its technological superiority to prevent other agencies from catching up, and might become perpetually stable thanks to AI-enabled surveillance, mind control, and security. He also notes that a singleton could simply turn out to be a bad singleton: ‘If a singleton goes bad, a whole civilisation goes bad’.

Jess Riedel

In Value Lock-In Notes 2021 (Riedel, 2021), Jess Riedel provides an in-depth overview of value lock-in from a Longtermist perspective. Riedel details the technological feasibility of irreversible value lock-in, arguing that permanent value stability seems extremely likely for AI systems that have hard-coded values.

Riedel claims that ‘given machines capable of performing almost all tasks at least as well as humans, it will be technologically possible, assuming sufficient institutional cooperation, to irreversibly lock-in the values determining the future of earth-originating intelligent life.’

The report focuses on the formation of a totalitarian super-surveillance police state controlled by an effectively immortal bad person. Riedel explains that the only requirements are one immortal malevolent actor and surveillance technology.

Robin Hanson

In his commentary on What We Owe the Future, MacAskill on Value Lock-In (Hanson, 2022), economist Robin Hanson argues that immortality alone is insufficient for value stability. He believes MacAskill underestimates the dangers of central power and is overconfident about the likelihood of rapid AI takeoff. Hanson presents an alternative framing of lock-in threat models:

- A centralised ‘take over’ process, in which an immortal power with stable values takes over the world.
- A decentralised evolutionary process: as entities evolve in a stable universe, some values might become dominant via evolutionary selection, outcompeting others and remaining stable thereafter.
- Centralised regulation: the central powers needed to promote MacAskill’s ‘long reflection’, limit national competition, and preserve value plurality could themselves create value stability through their dominance. Hanson suggests this could produce faster value convergence than decentralised evolution.

Paul Christiano

In Machine intelligence and capital accumulation (Christiano, 2014), Paul Christiano proposes a ‘naïve model’ of capital accumulation involving advanced AI systems. He frames agents as ‘soups of potentially conflicting values. When I talk about “who” controls what resources, what I really want to think about is what values control what resources.’ This provides a lens through which lock-in can be seen as the result of particular values gaining stable control over features of the world.

He claims it is plausible that the arrival of AGI will lead to a ‘crystallisation of influence’, akin to lock-in – where whoever controls resources at that point may maintain control for a very long time. He also expresses concern that influence over the long-term future could shift to ‘machines with alien values’, leading to humanity ending with a whimper.

He illustrates a possible world in which this occurs. In a future with advanced AI, human wages fall below subsistence level as AI replaces labour. Value is concentrated in non-labour resources such as machines, land, and ideas, and unlike people, these resources can be directly controlled by their owners. Whoever owns the machines therefore captures the resources and income, and the distribution of resources at the time AI arrives becomes ‘sticky’: whoever controls resources then can maintain that control indefinitely through reinvestment.
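
A toy simulation (my own illustration, not Christiano's) makes the stickiness explicit: if wages are zero and every owner earns the same return on capital and reinvests it, relative shares of total resources never change, no matter how long the process runs.

```python
# Toy model of the 'sticky' distribution described above (illustrative assumptions,
# not Christiano's own code): all income is capital income, wages are zero, and
# every owner reinvests at the same rate of return.

initial_wealth = {"owner_A": 60.0, "owner_B": 30.0, "owner_C": 10.0}  # arbitrary units
return_on_capital = 0.05   # uniform return per period
wage_income = 0.0          # AI has replaced labour, so wages contribute nothing

wealth = dict(initial_wealth)
for year in range(500):
    wealth = {owner: w * (1 + return_on_capital) + wage_income
              for owner, w in wealth.items()}

total = sum(wealth.values())
shares = {owner: round(w / total, 3) for owner, w in wealth.items()}
print(shares)  # {'owner_A': 0.6, 'owner_B': 0.3, 'owner_C': 0.1} -- unchanged after 500 years
```

Anything that would let shares move (wages, redistribution, differential returns) is exactly what this naive model assumes away.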

Synthesis

[Comparative synthesis of the threat models of Lukas Finnveden, William MacAskill, Nick Bostrom, Jess Riedel, and Robin Hanson.]

Prioritisation Criteria

When defining lock-in, we identified a set of dimensions for choosing which lock-in scenarios to focus on first. While it is not yet clear which scenarios would be positive, we believe that lock-in scenarios with the following properties would be negative:

- Harmful: resulting in physical or psychological harm to individuals
- Oppressive: suppressing individuals’ freedom, autonomy, speech, or opportunities, or the continued evolution of culture
- Persistent: long-term, unrecoverable, or irreversible
- Widespread: concerning a significant portion of individuals relative to the total population

We use these properties to direct our attention toward the kinds of lock-in scenarios worth focusing on, and to prioritise the threat models listed in the next section.

4 Fundamental Threat Models

We list the fundamental threat models synthesised from the work above, ordered according to the prioritisation criteria:

- An autonomous AI system competently pursues a goal and prevents interference
- An immortal AI-enabled malevolent actor, or a whole-brain emulation of a malevolent actor, instantiates a lock-in
- A group with sufficient decision-making power decides on the configuration of some feature of the world by majority, natural selection, or war
- Anti-rational ideologies prevent cultural, intellectual, or moral progress, or accidentally bring about an undesirable future

These categories can be broken down into ever more specific scenarios, for which interventions can be designed.
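
As a purely illustrative sketch, the prioritisation criteria above could be operationalised as a simple scoring rubric over these threat models; the scores and equal weights below are invented for the sake of the example and are not from the post.

```python
from dataclasses import dataclass

# Illustrative rubric only: the four properties follow the post, but all scores
# and the equal weighting are invented for illustration.

@dataclass
class LockInScenario:
    name: str
    harmful: int      # 0-5: physical or psychological harm to individuals
    oppressive: int   # 0-5: suppression of freedom, autonomy, speech, or culture
    persistent: int   # 0-5: how long-lasting or irreversible the outcome is
    widespread: int   # 0-5: share of the population affected

    def priority(self) -> int:
        # Equal weights; a real prioritisation would need to argue for its weights.
        return self.harmful + self.oppressive + self.persistent + self.widespread

scenarios = [
    LockInScenario("Autonomous AI pursues a goal and prevents interference", 3, 4, 5, 5),
    LockInScenario("Immortal malevolent actor (or WBE) instantiates a lock-in", 5, 5, 5, 4),
    LockInScenario("Powerful group fixes a feature of the world by majority or war", 3, 3, 4, 4),
    LockInScenario("Anti-rational ideology halts moral progress", 2, 4, 4, 3),
]

for s in sorted(scenarios, key=LockInScenario.priority, reverse=True):
    print(f"{s.priority():2d}  {s.name}")
```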

References

- Bostrom, N. (2005). What is a Singleton? https://nickbostrom.com/fut/singleton
- Bostrom, N. (2014). Superintelligence: Paths, Dangers, Strategies. Oxford University Press. https://books.google.co.uk/books?id=7_H8AwAAQBAJ
- Christiano, P. (2014, May 14). Machine intelligence and capital accumulation. Rational Altruist. https://rationalaltruist.com/2014/05/14/machine-intelligence-and-capital-accumulation/
- Finnveden, L., Riedel, J., & Shulman, C. (2022). AGI and Lock-In. https://forum.effectivealtruism.org/posts/KqCybin8rtfP3qztq/agi-and-lock-in
- Hanson, R. (2022). MacAskill on Value Lock-In. https://www.overcomingbias.com/p/macaskill-on-value-lock-inhtml
- MacAskill, W. (2022). What We Owe the Future. Basic Books. https://books.google.co.uk/books?id=nd_GzgEACAAJ
- Riedel, J. (2021). Value Lock-In Notes 2021.

