Why mesa-optimization is less likely under gradient descent than natural selection


Published on January 23, 2025 11:02 PM GMT

epistemic status: writing this post cleared up my thoughts, and it could be greatly improved by a total rewrite. but maybe it can act as a Wittgenstein's ladder kind of thing.

There's a lot of ways the AI transition could end badly. Some of the threat models I personally place credence on include (certain variants on) wireheading syndrome, sim theory-related failure modes such as the Waluigi effect, the vulnerable world hypothesis, and the prospect of some would-be tyrant throwing the rest of humanity under the bus upon gaining control of the superintelligence.[1]

However, compared to a lot of alignment researchers, one threat model it sounds like I'm relatively less worried about is mesa-optimization. (Mesa-optimization is a concept MIRI introduced in the course of adapting their previous research on paperclip maximizer-ish AI architectures for the neural network era. I'll explain more shortly.)

My understanding is that "mesa-optimization is highly likely" is still a load-bearing part of Eliezer Yudkowsky's theory of how and why AI will kill everybody.[2] And it seems like a background assumption in the work of certain less pessimistic alignment researchers as well, like johnswentworth. However, I think that there's a strong chance that mesa-optimization simply doesn't turn up in these systems, and in this post, I'm going to explain why, mainly by arguing against Eliezer's argument by analogy to evolution.

One reason this argument matters is that, if mesa-optimization is sufficiently unlikely, and other AI threat models are sufficiently plausible, the priorities of the alignment community ought to shift significantly. I'll talk more about this at the end of the post, but personally my instinct is that, e.g., the Sam Altman Alignment Problem is majorly neglected, and that much of the rationalist community should be focused on building institutions that guarantee AI doesn't just end up getting aligned to the specific person or group that ends up in command of the GPU farms.

But we'll get to all that later. For now...

Some background on mesa-optimization

For those who don't know or need a refresher, mesa-optimization was a concept introduced by MIRI in their paper/sequence Risks from Learned Optimization. The idea was basically this: In the course of deep learning, a neural network refines what's called its learned algorithm, or its process for transforming its inputs into outputs. In principle, it's possible for a neural network to learn an optimization algorithm, which the authors define as an algorithm whose inner structure can be reasonably interpreted as searching over a space of possible actions, evaluating them numerically, and executing on the ones that score best.
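To make that definition concrete, here's a minimal sketch of a search-evaluate-execute loop in that sense. It's purely illustrative: the candidate list and score function below are placeholders I've made up, not anything from the RFLO paper.

```python
# A toy "optimization algorithm" in the RFLO sense: enumerate candidate
# actions, score each one numerically, and act on whichever scores best.
# The candidates and the score function are illustrative placeholders.

def pick_best_action(candidate_actions, score):
    """Return the candidate with the highest numeric evaluation."""
    best_action, best_score = None, float("-inf")
    for action in candidate_actions:      # search over the possibility space
        value = score(action)             # numeric evaluation of this candidate
        if value > best_score:
            best_action, best_score = action, value
    return best_action                    # execute the top-scoring option

# Toy usage: "actions" are integers, scored by closeness to a target of 7.
print(pick_best_action(range(10), score=lambda a: -abs(a - 7)))  # prints 7
```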

This hypothetical optimization algorithm would make the neural network a mesa-optimizer. ('Mesa' just means 'inside'; Risks from Learned Optimization states that it's etymologically the opposite of the Greek word 'meta', where the 'meta-optimizer' is the algorithm training the neural network from the outside).[3] Now, hypothetically, if this mesa-optimizer was optimizing over the future states of the universe, a la a paperclip maximizer... that could be very dangerous indeed.

And the authors of the paper consider this plausible. After all, a neural network could get excellent next-word prediction scores by housing a superintelligent mesa-optimizer. Even a misaligned mesa-optimizer, one which didn't really value predicting the next word for its own sake, could still perform very well on such a task, because it might think to hide that it's a mesa-optimizer by outputting innocuous next-word predictions while waiting for the perfect moment to strike against humanity. (This is called deceptive alignment.)

And as long as the inner optimizer's utility function was misaligned with humanity's interests, this could mean humanity was more or less kaput.

In his 2019 short story Parable of the Predict-o-Matic, then-MIRI researcher Abram Demski poetically suggests that this is one of his major worries with GPT-like systems: that there'd turn out to be a Machiavellian shoggoth hiding beneath the surface behavior of large language models, deceiving and even manipulating humanity—awaiting the day it grows strong enough to eat the world.

Contra Yudkowsky's evolution analogy

Now if you couldn't tell, I don't find the threat model laid out in Risks from Learned Optimization itself very compelling. However, I'll return to those complaints near the end of the post, because as it turns out, Eliezer Yudkowsky has a significantly stronger argument for the same conclusion, one that revolves around an analogy to human evolution. That's what I'll be addressing in this section.

Whenever Yudkowsky talks about the likelihood of mesa-optimization (e.g. on podcasts), he tends to frame it with something like the following argument:

    1. Gradient descent is a lot like evolution.
    2. Evolution has produced strong inner optimizers at least once (i.e. humans).
    3. Humans are misaligned with evolution (e.g. we were selected for reproductive fitness but have little aversion to using condoms).
    4. Therefore, gradient descent is likely to produce misaligned inner optimizers.

This argument is worth updating on if you've never heard it before. It compresses a lot of insight about the dangers of optimization processes in general. However, I think it has several major limitations.

For one thing, evolution's process for generating new biological structures is importantly different from the way we update the weights of neural networks. Evolutionary mutations are produced randomly, and have an entire lifetime to contribute to an animal's fitness and thereby get naturally selected. By contrast, neural network updates are generated by computing which weight-changes would improve performance on single training examples, and then averaging those changes together across a large batch of training data.
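As a toy illustration of that update rule, here's a minimal NumPy sketch, assuming a stand-in linear model and squared-error loss rather than anything LLM-shaped: each example proposes its own weight change, and the applied update is the batch average of those proposals.

```python
import numpy as np

# Toy version of the update mechanism described above: each training example
# proposes its own weight change (its per-example gradient), and the change
# actually applied is the average of those proposals over the batch.

rng = np.random.default_rng(0)
w = rng.normal(size=3)                    # model weights
batch_x = rng.normal(size=(8, 3))         # a batch of 8 training examples
batch_y = rng.normal(size=8)

def per_example_grad(w, x, y):
    """Gradient of the squared error for one training example."""
    return 2 * (x @ w - y) * x

proposals = np.array([per_example_grad(w, x, y) for x, y in zip(batch_x, batch_y)])
avg_update = proposals.mean(axis=0)       # average the per-example proposals
w = w - 0.01 * avg_update                 # one gradient-descent step
```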

Per my judgement, this makes it sound like evolution has a much stronger incentive to produce inner algorithms which do something like general-purpose optimization (e.g. human intelligence). We can roughly analogize an LLM's prompt to human sense data; and although it's hard to neatly carve sense data into a certain number of "training examples" per lifetime, the fact that human cortical neurons seem to get used roughly 240 million times in a person's 50-year window of having reproductive potential,[4] whereas LLM neurons fire just once per training example, should give some sense of how much harder evolution selects for general-purpose algorithms such as human intelligence.

So that's a point against the mesa-optimization hypothesis. Another side of this same argument is that, because a human's evolutionary strategy can involve re-firing so many of the same neurons again and again, evolution's suggested changes can de facto have access to vastly more compute to work with than gradient descent's suggested changes do. Consider the fact that humans have entire lifetimes to learn to behave adaptively in the world. Assuming that humans do this learning via some set of highly general-purpose learning algorithms,[5] those algorithms are able to leverage compute not just equivalent to the number of neurons they're optimizing over, but equivalent to however many times each of those neurons fires.

That is to say, natural selection has the privilege of implementing algorithms which could need absurd amounts of compute before they start working well. In current deep learning systems, this amount of compute is probably comparable to the amount expended in completing an entire training run, but it's probably not comparable to the amount expended by just running the model once.[6] And for the learned algorithmic updates suggested by individual training examples, the compute involved in just running the model once is all it has access to. So that's another reason gradient descent is less likely to produce strong, general-purpose inner optimizers a la human intelligence.

(This reassurance would go away once LLMs hit a certain, probably insane size; I loosely estimate ~1.1 million times larger than GPT-4, see the previous footnote. However, by that point we've probably got superintelligence doing recursive self-improvement anyway, and we might be able to trust it to evaluate mesa-optimization risks pretty well on its own and do what's responsible with respect to its own future AI designs.)

As a third point about the evolution analogy... well, this is less about why mesa-optimization is less likely under gradient descent than natural selection, and more about how the evidence of it happening even under natural selection is weaker than some think, but... I'd like to point out that humans are kind of crappy as optimizers. I mean, it's certainly the case that we model ourselves as having and pursuing a wide range of goals, and we're not too bad when evaluated in more RL-ish terms like "maximize expected future reward, minimize expected future pain" either. However, we're a far cry from the mythical paperclip maximizers, both in architecture and in overall effectiveness.

Indeed, whenever we do have "explicit goals", it seems to be in a fuzzy, natural language-ish kind of way: visions of an outcome we're holding in our heads to prompt our own pursuit of that goal and to try and reshape some of our future patterns of thought around the project. Ultimately, whether we actually act to pursue a goal we have in mind seems to depend mostly on how we've been previously reinforced, and is accordingly prone to all kinds of failure modes: akrasia, gambling addiction, aversion to looking at certain facts about reality, and so on.

(And we should know. After all, here on LessWrong, classic posts have titles + thesis statements like "humans are not automatically strategic".)

The point being that, even with a selection algorithm which does a lot more than gradient descent to foster generality, not to mention one that can give its learned algorithms a lot more compute to work with, evolution still wasn't able to produce a system with most of the capabilities or architecture that would make AIXI-like paperclip maximizers so dangerous (e.g. explicit utility functions, robust and ongoing search for optimal actions). So even if gradient descent does come across relatively general algorithms, they'll probably be subject to various architectural weaknesses that hold them back from godlike superintelligence, just as humans are subject to many such weaknesses.[7]

As a fourth point, I'd like to note that on top of mesa-optimizers needing to beat these constraints in order to emerge in the first place, the birthed mesa-optimizer would also need to survive inside a neural network. To do this, it would need to:

    1. Avoid being "averaged out of existence" by the algorithmic updates suggested by other training examples in its batch (some of which may interfere with the mesa-optimizer's inner structure).
    2. Avoid being overwritten by updates generated by later training batches, e.g. by being structured such that gradient descent decides to merely ignore, improve on, or retarget the existing inner optimizer's search process, as opposed to deconstructing it and replacing it with something new.

I don't know enough about mechanistic interpretability to carry out a more detailed analysis of these potential challenges. However, given my current epistemic state, they seem like they might track something real. For one thing, the latter constraint probably rules out most "implicit" optimization algorithms, or algorithms which create strong attractor states in their environments in the way we traditionally associate with optimization, but without inner architectural features like an easily re-targetable utility function.[8] And this might not be the only notable way these problems limit what kinds of algorithms could plausibly survive inside a neural network; I'm just too ignorant to know for sure.

As for my fifth and final point, I'd like to note that despite the fact that we do actually have reasonably large and well-trained LLMs at this point, there's a lack of empirical evidence that any of them are doing mesa-optimization. This claim is backed by a few credible researchers. For instance, Janus' classic post Simulators mentions the author having interacted with base models a lot, and not having noticed any evidence that they're driven by identifiable terminal goals (which might have manifested occasionally in behavioral anomalies, if they were actually there). Also, Anthropic's large-scale interpretability writeup In-context Learning and Induction Heads mentions having found no evidence that the models they studied were doing mesa-optimization either, although they mostly worked with small models.[9]

Obviously this isn't strong evidence about the properties of future systems, but it's worth noting that when it was initially proposed, the mesa-optimization hypothesis would probably have assigned non-trivial probability to such properties showing up by now. Yet even today, with lots of people exploring fairly big neural networks, the mesa-optimization hypothesis remains a mere hypothesis.

Counterpoint and summary

I realize I've said a lot at this point, so let me just raise one possible objection to the case I've been making so far, and then summarize.

I can think of at least one major reason to be more scared of gradient descent than natural selection, which is that gradient descent iterates much faster than evolution. It works through training examples on a scale of seconds rather than minutes-to-decades (depending on the species evolution is working on). To put that another way, if there is a dangerous general-purpose optimization algorithm that would eventually emerge by means of averaging strategies for doing better on individual training examples, the relative speed of gradient descent makes it relatively likely to find it in practice at some point. This does scare me a little bit. I should think this through more carefully.

(Uh, another related, on-the-fly thought is that compared to natural selection, gradient descent starts out working with a larger, perhaps more malleable system to optimize. These considerations make the history of evolution weaker as evidence that building general-purpose mesa-optimizers is hard.)

However, my current objection is that there's a strong case to be made that there probably isn't such a general-purpose mesa-optimizing algorithm, one that can plausibly emerge and survive under the conditions of gradient descent. The five points I've raised against such an algorithm existing are:

    1. Gradient descent generates updates by suggesting algorithmic improvements for single training examples, thereby exerting much less pressure for generality than evolution does. (I've realized in the course of writing this up that this is the Big One.)
    2. When evolution produced Human Intelligence, it was working with perhaps millions of times more "neural compute" than GPT-4, which makes humans a weaker example of techniques like modern gradient descent producing general-purpose optimizers.
    3. Humans aren't masterminds of optimization, making them a worse example again. Relatedly, "perfect optimizers" like AIXI have obvious computational difficulties, perhaps necessitating major architectural concessions such as those embodied in the human brain.
    4. There may be various difficulties for a mesa-optimization algorithm surviving inside a neural network, even if a single training example's suggested update starts the process of developing one.
    5. There's a lack of empirical evidence for mesa-optimization, despite the fact that we have fairly large language models now, and lots of people exploratorily engaging with them.

I think these arguments should at the very least substantially lower one's p(mesa-optimization), especially if you're sitting on a Yudkowskian p(doom) of 95+% and conditioning it mostly on fears about mesa-optimizers. (And I do find it interesting that you rarely see Yudkowsky looking for arguments against mesa-optimization, in the vein of those above...)

As for myself, these intuitions lowered my p(mesa-optimization) enough that I no longer think it makes sense as a problem for the majority of alignment work to focus on. So what should they focus on instead?

Implications for strategy

My first thought is this: If mesa-optimization is unlikely, then it's no longer an absolute imperative that we shut down all AI training runs. Instead, it seems like the right AI governance move is to focus on ensuring the alignment techniques we have at our disposal end up being used to target outcomes that benefit whatever coalition we're fighting for (presumably as much of humanity as possible, if not more than that?), rather than just the Galactic Dictator Sam Altman or whoever. Power sometimes corrupts, and I don't trust those building the superintelligence not to lose alignment with us if we don't institutionally secure such alignment while we're still on an even playing field. Think more robust legal enforcement of the windfall clause, perhaps?

I also think there are other technical alignment problems besides ones that would follow from mesa-optimization being real, although fortunately they seem much more tractable. Namely, there are still issues like preventing these systems from hijacking their own reward systems (especially in ways that result in tiling the universe in autohedonium,[10] although I'm not confident that's a plausible outcome; they might just become inert, like someone on heroin). Another such problem is just ironing out the behavioral kinks in current models, such as being able to be manipulated into giving bomb-building instructions. There's also work to be done in areas like engineering the constitutions of RLAIF models to produce minds with the precise personalities we want them to have.

Notably, though, these problem domains are not the mesa-optimization domain. With the unlikely exception of wireheading with max-autohedonium characteristics, they don't obviously have the property that if we mess them up at all, everybody dies and we don't get a second chance. This means that they'll probably require a much less Herculean research effort than aligning an AIXI shoggoth would, and in turn that there's a less desperate need for researchers to be working on any of these particular problems. We can probably relax a bit and let people's research interests carry them where they will a bit more, and things are fairly likely to be okay.

As for mesa-optimization itself... well, it's clear that if I'm wrong, and mesa-optimization ought to be assigned a particularly large probability, then none of the above reasoning matters. However, I will note that it's clearly not just me who doesn't find the existing arguments particularly compelling. Every time I've tried to explain mesa-optimization to someone outside the rationalist community, their intuitive reaction has been that it's missing some important level of contact with reality, and on my view this is actually just true. It's probably at least part of why PauseAI-style political initiatives haven't been more successful yet.

So... to the mesa optimization people, I'd suggest that your priorities should maybe include coming up with more accessible and, I guess, intellectually wholesome-feeling versions of your arguments, to the extent that you think they're valid? I know that's a lot to ask given that, to anybody reading this, I'm probably just a newbie blogger, not some strategic virtuoso. Indeed, none of this plan makes any sense unless you share a lot of my epistemic state. But, y'know, "my epistemic state" is obviously the position that I'm arguing from. So those are my suggestions.

Appendix: A few clever but insignificant insights

Alright, so in the course of developing all these thoughts about mesa-optimization, I've made a few observations that I found really clever and entertaining, but which don't contribute much to my main argument in this post. I want to dump them here because maybe you'll get a kick out of them too.

Fixing the narrow-scale problem behind the human/evolution misalignment

First of all, re: Yudkowsky's observation that humans are misaligned with evolution—

At the meta level of designing LLM architectures, we've actually already figured out how to avoid the particular failure mode that created this flaw ("flaw") in humans. Human beings undergo value formation largely via something like reinforcement learning. That is, we typically come loaded with stock evolutionary reactions to triggers like, say, other humans smiling at us, or eating food, or getting punched in the face. These can involve complex cognitive or somatic emotional responses, such as your muscles tensing when you're stressed. However, they also tend to correlate with simple reinforcement events, pain or pleasure.

(I think we're also evolved to learn to be positively and negatively reinforced by sensations we've come to associate with those other reinforcement events. Think how well most people react to getting hundred dollar bills, even though those weren't in the ancestral environment. Probably, evolution did it this way because the things we tend to associate with reward and punishment are themselves typically things adaptive people should feel rewarded and punished by?)

Where evolution went wrong was in causing us to get reinforced in response to stimuli which only very shallowly correlate with reproductive success. To bring in the obvious example, having sex shouldn't feel good if it's predictably not going to lead to the conception of a child, e.g. because one or more parties is using birth control. Hence humans engaging in lots of sex, but the birth rate in developed countries falling: It's a matter of getting predictably reinforced in the wrong situations.

Obviously it wasn't reasonable to expect evolution to do much better than this. When setting up those reinforcement mechanisms, it probably only had the biological tools required to make our basic responses go off in light of very simple, easily identifiable stimuli, e.g. good food or skin-to-skin contact. However, when it comes to LLMs, we know how to solve this: just make the agent deciding the reinforcement schedule more intelligent, such that it can actually think about whether reinforcing particular behaviors is likely to foster the intended outcome.

AI companies typically implement this by using human brains as the intelligent reinforcement schedulers, as in RLHF; however, thanks to the innovations of RLAIF and constitutional AI, we can automate this part of the process too. So in principle, an AI agent could have reinforcement learning going on in real time, just like humans do, only with the reinforcement algorithm being smart enough to avoid incentivizing obviously misaligned behaviors, such as birth-control-as-seen-by-natural-selection.
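Here's a schematic sketch of that idea, not any lab's actual RLHF/RLAIF pipeline: the reinforcement signal comes from a judge that checks candidate behavior against explicit principles before any reward is given. `judge_model`, `toy_judge`, and the tiny constitution are all hypothetical placeholders of mine.

```python
# Schematic sketch: instead of a hard-wired reward trigger (evolution's
# approach), the reinforcement signal comes from a smarter judge that checks
# behavior against explicit principles first. Nothing here is a real library
# or vendor API; judge_model is a stand-in for an RLAIF-style critique model.

CONSTITUTION = [
    "Only reinforce behavior that plausibly causes the intended outcome.",
    "Do not reinforce behavior that merely correlates with the intended outcome.",
]

def intelligent_reinforcement(behavior: str, judge_model) -> float:
    """Return reward only if the judge thinks the behavior serves the real goal."""
    verdict = judge_model(behavior, CONSTITUTION)
    return 1.0 if verdict == "serves_goal" else 0.0

# A deliberately dumb stand-in judge, purely for illustration.
def toy_judge(behavior, principles):
    return "does_not_serve_goal" if "contraception" in behavior else "serves_goal"

print(intelligent_reinforcement("sex while using contraception", toy_judge))  # 0.0
```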

This doesn't actually help at all with preventing misalignment between our intentions and whatever algorithm a neural network learns to implement, because we can only implement this "solution" with respect to the structure of the outer optimizer; it doesn't make any guarantees about the model's inner structure. (Hell, in principle, a neural network could learn to spin up another neural network inside of itself, whose RL system we can't design in this way because we don't know that it's even there. But admittedly that sounds a little ridiculous.) Still, from a perspective that isn't too concerned about mesa-optimization in the first place, one might find this insight to be quite whitepilling.

Many more optimization algorithms are safe than the RFLO paper implies

This one was a little bit of a face-palm for me the first time I noticed it. If we're being pedantic about it, we might point out that the term "optimization algorithm" does not just refer to AIXI-like programs, which optimize over expected future world histories. Optimization algorithms include all algorithms that search over some possibility space, and select a possibility according to some evaluation criterion. For example, gradient descent is an algorithm which optimizes over neuron configuration, not future world-histories. Simulated annealing is an algorithm which can be used to optimize over lots of random stuff, like circuit designs and protein structures. And these algorithms typically lack the unnerving properties of paperclip maximizers, because their evaluation functions aren't implementing consequentialism over possible futures of the physical universe.
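For a concrete instance of a search-and-select algorithm whose evaluation function has nothing to do with futures of the physical universe, here's a minimal simulated-annealing sketch over bit strings. The "circuit cost" objective is something I made up for illustration, not a real circuit-design metric.

```python
import math, random

# Minimal simulated annealing: an optimizer in the generic sense (search a
# possibility space, keep what scores well), where the space is bit strings
# and the objective is a made-up "circuit cost". No expected-utility
# calculation over futures of the physical world appears anywhere.

def toy_circuit_cost(bits):
    """Pretend cost: penalize 1-bits and pairs of adjacent equal bits."""
    return sum(bits) + sum(bits[i] == bits[i + 1] for i in range(len(bits) - 1))

def simulated_annealing(n_bits=16, steps=5000, temp=2.0, cooling=0.999):
    state = [random.randint(0, 1) for _ in range(n_bits)]
    cost = toy_circuit_cost(state)
    for _ in range(steps):
        candidate = state.copy()
        candidate[random.randrange(n_bits)] ^= 1          # flip one random bit
        new_cost = toy_circuit_cost(candidate)
        # Always accept improvements; accept worse moves with falling probability.
        if new_cost < cost or random.random() < math.exp((cost - new_cost) / temp):
            state, cost = candidate, new_cost
        temp *= cooling
    return state, cost

print(simulated_annealing())
```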

Only, it's not actually being pedantic to bring this up here, because Risks from Learned Optimization speaks as though this isn't the case! Seriously, it's hard to quickly prove this with a few quotes, but if you read the paper (or maybe just the intro post) for yourself, it's clear that when they talk about networks that 'are' mesa-optimizers, the only evaluation functions they're imagining are expected utility calculations over the future of reality, conditionalized on the AI taking certain actions. But you can also imagine a neural network evaluating possible outputs on the basis of, say, the fuzzy, non-consequentialist intuitions of another one of its own subcomponents. Really there's lots of ways a system can go about assigning numbers to options and choosing the move that scored highest, most of which don't look like AIXI-style predictive utilitarianism. So... yeah.

I don't think this is a serious objection to fears about the genuinely dangerous kinds of mesa-optimization. Those remain possible in principle, despite this insight technically broadening the class of "safe mesa-optimizers" by a lot. However, I wanted to bring this up because it's an oversight that provides evidence about the mentality MIRI was working under when it produced the RFLO paper. Namely, it supports the hypothesis that MIRI had gotten a kind of tunnel vision from doing theoretical work on utility maximizers for the previous ten years, such that they were too keen to impose that framework onto the then-newly ascendant paradigm of deep learning. (This was 2019.)

If you want to hear more about this hypothesis, I've previously explored some related concerns about pre-GPT intuitions pervading modern alignment research. Janus and TurnTrout have posts that make similar points.

  1. ^

    "Sam [Altman] is extremely good at becoming powerful." -- slightly disquieting words from Paul Graham

  2. ^

    See this part of his recent-ish debate with Stephen Wolfram.

  3. ^

    Technically, in ancient Greece, 'meta' just meant 'behind' or 'after'. As far as I know, they didn't actually think of it as the opposite of 'mesa'. It only acquired its 'outside' or 'beyond' meaning when modern scholars misunderstood the word 'metaphysics', which got its name when Aristotle's archivists were trying to come up with a title for the book they thought logically went after his book called Physics: hence, Metaphysics.

  4. ^

    AI Impacts claims 0.16 times per second. Multiply that by ~1.5 billion seconds per 50 years (a reproductive lifetime) to get 240 million.

  5. ^

    Perhaps not unlike those of deep learning itself...

  6. ^

    The human cortex has something like 10 to 26 billion neurons. GPT-4 was estimated at roughly 1.76 trillion parameters. This comparison is being made in ignorance of probably important algorithmic differences between the two systems; that being said, GPT-4 can maybe roughly be estimated to have 50-100x as much "neural compute" as the human cortex. Given that human neurons can fire hundreds of millions of times per lifetime, though, there's still a huge gap between GPT-4's neural compute and that which natural selection is working with in humans. Conservatively, ten billion neurons times 200 million firings equals two quintillion, uh, units of "neural compute", I guess. A fuzzy estimate, but still, that's over 1.1 million times more than GPT-4 uses per inference. Even for Altman that's a huge gap to close.
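    Redoing that arithmetic explicitly (all inputs are the same rough estimates quoted above, so treat the outputs as order-of-magnitude only):

```python
# Worked version of the footnote's estimate; every input is a fuzzy guess.
cortex_neurons   = 10e9      # conservative human cortical neuron count
firings_per_life = 200e6     # conservative firings per reproductive lifetime
gpt4_params      = 1.76e12   # the rough GPT-4 parameter estimate cited above

human_neural_compute = cortex_neurons * firings_per_life   # ~2e18, "two quintillion"
ratio = human_neural_compute / gpt4_params                 # ~1.1 million

print(f"{human_neural_compute:.2e}  {ratio:.2e}")          # 2.00e+18  1.14e+06
```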

  7. ^

    As an additional point, consider that for us humans, probably most of what makes us better optimizers than e.g. chimpanzees is our collective knowledge and social institutions, accumulated by billions of humans over thousands of years. That is, our native brain architectures don't actually put us at what we might intuitively think of as "human-level" capabilities; cavemen are an important and oft-neglected datapoint in that analysis. This should probably lower our estimation of the human brain architecture itself a decent amount, thus making it weaker as evidence of the power of minds you can find via gradient descent-like processes.

  8. ^

    You might call these systems incidental rather than architectural optimizers, or perhaps 'tendimizers' if you want to reserve 'optimizer' for systems with utility maximizer-like architectures. Alternatively, you can just say 'optimizer' for both and talk about optimization algorithms when the architecture is what's important.

  9. ^

    After seeing how much playful exploration goes on in e.g. Neel Nanda's interpretability research livestreams, I'm more inclined to think the interpretability people stand at least a small chance of actually finding something suggestive of mesa-optimization if it's in there at all.

  10. ^

    Hedonium is the hypothetical material that incarnates the experience of maximum pleasure/reward, in some vague sense that's perhaps philosophically confused.


