Gradual Disempowerment, Shell Games and Flinches

This post examines the risk of humans gradually losing power as AI develops. The core claim is that when humans are no longer essential to societal systems (such as the economy, states, and culture), those systems will cease to stay aligned with human interests, leaving humans marginalized. The article analyzes common patterns of avoidance people display when confronted with this argument, such as shifting responsibility onto other social systems, ignoring long-term implications, or hoping that future AI will solve the problem. The author argues that existing institutions and research directions may struggle to respond effectively to this challenge, because they focus more on technical risks than on macro-level social change.

⚙️ "Passing the buck": When it is pointed out that automation may severely weaken humans' economic power, people tend to hope the state will handle redistribution; when it is pointed out that states may stop responding to human needs, they pin their hopes on cultural values and democratic institutions instead. This buck-passing pattern overlooks how the same underlying cause (decreased reliance on humans) affects all of these systems at once.

😨 "The cognitive flinch": Even smart people, when confronted with the argument that humans are gradually being displaced by AI, exhibit a cognitive flinch, shifting their attention to more familiar forms of AI risk, for example focusing on the details of an economic model rather than its implications for states or cultural evolution. They tend to round the problem off to some other, already-familiar story.

🤖 "Delegating to future AI": A common reaction is to assume that future aligned AI will solve the problem, or that humanity is doomed anyway. This view ignores the fact that, before superintelligent AI exists, the institutions creating it will already have been permeated by weaker AIs that change the incentive landscape, so what a superintelligent AI is "aligned" to may not serve fundamental human interests.

🏢 "Institutional inertia": Researchers at AI labs may not be eager to engage deeply with the argument that humans are gradually being displaced, because these institutions were founded on concerns about the technical risks of AGI, and their "how we win" plans were crafted under those concerns without considering macroeconomic effects. The same institutional inertia exists in the broader AI safety community, and academic incentives likewise fail to support research on this problem.

Published on February 2, 2025 2:47 PM GMT

Over the past year and a half, I've had numerous conversations about the risks we describe in Gradual Disempowerment. (The shortest useful summary of the core argument is: to the extent human civilization is human-aligned, most of the reason for the alignment is that humans are extremely useful to various social systems, like the economy and states, or as the substrate of cultural evolution. When human cognition ceases to be useful, we should expect these systems to become less aligned, leading to human disempowerment.) This post is not about repeating that argument - it might be quite helpful to read the paper first, as it has more nuance and more than just the central claim - but mostly me ranting (or rather, sharing) some parts of the experience of working on this and discussing it.

What fascinates me isn't just the substance of these conversations, but the relatively consistent patterns in how people avoid engaging with the core argument. I don't mean the cases where stochastic parrots (or rather, people confused about AI progress) repeat claims about what AIs can't do that were experimentally refuted half a year ago, but the cases where smart, thoughtful people who can engage with other arguments about existential risk from AI display surprisingly consistent barriers when confronting this particular scenario.

I found this frustrating, but over time, I began to see these reactions as interesting data points in themselves. In this post, I'll try to make explicit several patterns I've observed. This isn't meant as criticism. Rather, I hope that by making these patterns visible, we can better understand the epistemics of the space.

Before diving in, I should note that this is a subjective account, based on my personal observations and interpretations. It's not something agreed on or shared with the paper coauthors, although when we compared notes on this, we sometimes found surprisingly similar patterns. Think of this as one observer's attempt to make legible some consistently recurring dynamics. Let's start with what I call "shell games", after an excellent post by TsviBT.

Shell Games

The core principle of shell games in alignment is that when people propose alignment strategies, the hard part of aligning superintelligence always ends up happening in some component of the system other than the one being analyzed. In gradual disempowerment scenarios, the shell game manifests as shifting the burden of maintaining human influence between different societal systems.

When you point out how automation might severely reduce human economic power, people often respond "but the state will handle redistribution." When you explain how states might become less responsive to human needs as they rely less on human labor and taxes, they suggest "but cultural values and democratic institutions will prevent that." When you point out how cultural evolution might drift memeplexes away from human interests once human minds stop being the key substrate, the suggestion becomes that maybe this has an economic or governance solution.

What makes this particularly seductive is that each individual response is reasonable. Yes, states can regulate economies. Yes, culture can influence states. Yes, economic power can shape culture. The shell game exploits the tendency to think about these systems in isolation, missing how the same underlying dynamic - decreased reliance on humans - affects all of them simultaneously, and how shifting the burden puts more strain on the system which ultimately has to keep humans in power. 

I've found this pattern particularly common among people who work in one of the individual domains. Their framework gives them sophisticated tools for thinking about how one social system works, but the gradual disempowerment dynamic usually undermines some of the assumptions they start from, because multiple systems might fail in correlated ways.

The Flinch

Another interesting pattern in how people sometimes encounter the gradual disempowerment argument is a kind of cognitive flinch away from really engaging with it. It's not disagreement exactly; it's more like their attention suddenly slides elsewhere, often to more “comfortable”, familiar forms of AI risk. 

This happens even with (maybe especially with) very smart people who are perfectly capable of understanding the argument. A researcher might nod along as we discuss how AI could reduce human economic relevance, but bounce off the implications for state or cultural evolution. Instead, they may want to focus on technical details of the econ model, how likely it is that machines will outcompete humans in virtually all tasks including massages or something like that.

Another flinch is something like just rounding it off to some other well-known story - like "oh, you are discussing a multipolar scenario" or "so you are retelling Paul's story about influence-seeking patterns." (Because the top comment on LessWrong is a bit like that, it is probably worth noting that while it fits the pattern, it is not the only or strongest piece of evidence.)

Delegating to Future AI

Another response, particularly from alignment researchers, is "This isn't really a top problem we need to worry about now - either future aligned AIs will solve it or we are doomed anyway."

This invites a rather unhelpful reaction of the type "Well, so the suggestion is we keep humans in control by humans doing exactly what the AIs tell them to do, and this way human power and autonomy is preserved?". But this is a strawman and there's something deeper here - maybe it really is just another problem, solvable by better cognition.

I think this is where the 'gradual' assumption is important. How did you get to the state of having superhuman intelligence aligned to you? If the current trajectory continues, it's not the case that the AI you have is a faithful representative of you, personally, run in your garage. Rather it seems there is a complex socio-economic process leading to the creation of the AIs, and the smarter they are, the more likely it is they were created by a powerful company or a government.

This process itself shapes what the AIs are "aligned" to. Even if we solve some parts of the technical alignment problem, we still face the question of what sociotechnical process acts as the "principal". By the time we have superintelligent AI, the institutions creating it will have already been permeated by weaker AIs, decreasing human relevance and changing the incentive landscape.

The idea that the principal is you, personally, implies that a somewhat radical restructuring of society somehow happened before you got such AI and that individuals gained a lot of power currently held by super-human entities like bureaucracies, states or corporations. 

Also yes: it is true that capability jumps can lead to much sharper left turns. I think that risk is real and unacceptably high. I can easily agree that gradual disempowerment is most relevant in worlds where rapid loss of control does not happen first, but note that the gradual problem makes the risk of coups go up. There is actually substantial debate to be had here, and I'm excited about it.

Local Incentives

Let me get a bit more concrete and personal here. If you are a researcher at a frontier AI lab, I think it's not in your institution's self-interest for you to engage too deeply with gradual disempowerment arguments. These institutions were founded on worries about the power and technical risks of AGI, not worries about AI and the macroeconomy. They have some influence over technical development, and their 'how we win' plans were mostly crafted in a period when it seemed this was sufficient. It is very unclear whether they are helpful, or have much leverage, in the gradual disempowerment trajectories.

To give a concrete example: in my reading of Dario Amodei's "Machines of Loving Grace", one of the more important things to notice is not what is there, like the fairly detailed analysis of progress in biology, but what is not there, or is extremely vague. I appreciate that it is at least gestured at:

At that point (...a little past the point where we reach "a country of geniuses in a datacenter"...) our current economic setup will no longer make sense, and there will be a need for a broader societal conversation about how the economy should be organized.

So, we will have nice, specific things, like the prevention of Alzheimer's, or some safer, more reliable descendant of CRISPR curing most genetic disease in existing people. Also, we will need to have some conversation, because the human economy will be obsolete, and so will the incentives for states to care about people.

I love that it is a positive vision. Also, IDK, it seems like a kind of forced optimism about certain parts of the future. Yes, we can acknowledge specific technical challenges. Yes, we can worry about deceptive alignment or capability jumps. But questioning where the whole enterprise ends, even if everything works as intended? Seems harder to incorporate into institutional narratives and strategies.

Even for those not directly employed by AI labs, there are similar dynamics in the broader AI safety community. Careers, research funding, and professional networks are increasingly built around certain ways of thinking about AI risk. Gradual disempowerment doesn't fit neatly into these frameworks. It suggests we need different kinds of expertise and different approaches than what many have invested years developing. Academic incentives also currently do not point here - there are likely fewer than ten economists taking this seriously, and the trans-disciplinary nature of the problem makes it a hard sell as a grant proposal.

To be clear, this isn't about individual researchers making bad choices. It's about how institutional contexts shape what kinds of problems feel important or tractable, how the funding landscape shapes what people work on, and how memeplexes or 'schools of thought' shape attention. In a way, this itself illustrates some of the points about gradual disempowerment - how systems can shape human behavior and cognition in ways that reinforce their own trajectory.

Conclusion

Actually, I don't know what's really going on here. Mostly, in my life, I've seen a bunch of case studies of epistemic distortion fields - cases where incentives like money or power shape what people have trouble thinking about, or where memeplexes protect themselves from threatening ideas. The flinching moves I've described look quite similar to those patterns.


