I'd like to quickly highlight a discrepancy in how the concept of optimization is understood by different members of the alignment community. On one hand, you have definitions like that of sequences-era Yudkowsky, or that of Alex Flint. Yudkowsky once defined optimization in terms of systems which make certain unlikely outcomes likely, reliably hitting small targets in large search spaces. For example, if a knowledgeable and resource-rich human wants to build a car, they'll reliably be able to build cars, even though there are lots of ways the car's atoms could be arranged, and the vast majority of them are not in fact cars. Therefore, when an informed and resource-rich human is motivated to build a car, they're acting as an optimizer, fairly reliably hitting small targets in large search spaces.
In The Ground of Optimization, Alex Flint gives a similar definition: "An optimizing system is a physical process in which the configuration of some part of the universe moves predictably towards a small set of target configurations from any point in a broad basin of optimization, despite perturbations during the optimization process." The focus here is still on systems which effectively embed robust attractor states into the larger systems they're a part of.
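To gesture at what a formalization might look like (a rough, non-rigorous sketch of my own, translating Flint's wording into bare-bones dynamical-systems notation, not any established formalism):

```latex
% Rough sketch only: my own notation, not an established formalism.
\begin{align*}
&\text{State space } X, \quad \text{perturbed dynamics } x_{t+1} = f(x_t) + \epsilon_t, \\
&\text{target set } T \subseteq X, \quad \text{basin } B \supseteq T, \quad
  \mu(T) \ll \mu(B) \text{ for some reference measure } \mu. \\
&\text{The process is an optimizing system iff} \\
&\qquad \forall x_0 \in B,\ \forall (\epsilon_t) \text{ with } \|\epsilon_t\| \le \delta :
  \quad \exists N \text{ such that } x_t \in T \text{ for all } t \ge N.
\end{align*}
```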
I'll call systems which meet this definition of optimization Level 1 Optimizers: they meet the criterion of being able to steer a large set of possible starting states into a relatively small set of outcome-states. (Someone who knows, like, dynamical systems theory or something should hurry up and formalize this properly, if that hasn't already been done; the sketch above is only a hand-wave in that direction.) However, there's another common definition of optimization in the alignment community, which I'll call Level 2 Optimization. It has to do with systems that contain explicit internal representations of their own optimization targets, and that run explicit search processes for actions likely to achieve those goals.
For example, MIRI's Risks from Learned Optimization paper uses a definition that falls into this category.
While all AI systems are optimized for something (have a purpose), whether they actually optimize for anything (pursue a goal) is non-trivial. We will say that a system is an optimizer if it is internally searching through a search space (consisting of possible outputs, policies, plans, strategies, or similar) looking for those elements that score high according to some objective function that is explicitly represented within the system.
The paper then gives examples of what I'd call Level 2 Optimizers.
Learning algorithms in machine learning are optimizers because they search through a space of possible parameters—e.g. neural network weights—and improve the parameters with respect to some objective. Planning algorithms are also optimizers, since they search through possible plans, picking those that do well according to some objective.
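To make the "explicitly represented objective" criterion concrete, here's a toy sketch of my own (not code from the paper): the objective lives inside the system as an ordinary internal object, and the system explicitly searches a space of candidates for whatever scores highest.

```python
# Toy Level 2 Optimizer: the objective is explicitly represented inside the
# system, and the system explicitly searches candidates for high scores.
# (Illustrative sketch only; names and structure are my own.)

from typing import Callable, Iterable, TypeVar

Plan = TypeVar("Plan")

def level2_optimize(candidates: Iterable[Plan],
                    objective: Callable[[Plan], float]) -> Plan:
    """Explicit internal search: evaluate each candidate against an
    explicitly represented objective and return the best one."""
    return max(candidates, key=objective)

# Example: search over integer "plans" for the one closest to a target value.
target = 42
best = level2_optimize(range(100), objective=lambda plan: -abs(plan - target))
print(best)  # 42
```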
Here's an example that makes the distinction I'm drawing especially clear: a system like AIXI is a Level 2 Optimizer. In terms of its actual programming, it has a utility function, and it searches over ways it might reconfigure the world it's in to maximize expected value per that utility function. By contrast, you can imagine a lookup table which perfectly simulates the behavior of AIXI at some particular timestep; it observes the same world, and takes the same action. But since its actual code doesn't include a utility function, an expected utility calculator, or anything else like that, it's merely a Level 1 Optimizer.
(Another set of terms we might use to capture the same concept: the first system is a proper optimizer, and the second one is what I call a tendimizer. It still embeds certain attractor states, or evolutionary tendencies, into its surroundings, i.e. the ones outlined by the former system's utility function. However, it lacks internal symbols which, when naturally interpreted, straightforwardly tell you what those attractor states are going to be. I'll elaborate on what I mean by "when naturally interpreted" in a future post.)
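To make the contrast vivid, here's a toy sketch (purely illustrative; this is obviously not AIXI, and every name in it is my own invention) of two systems with identical input/output behavior. The first explicitly computes expected utility over actions and is a proper, Level 2 Optimizer; the second just replays a lookup table and is only a Level 1 Optimizer, a tendimizer.

```python
# Toy contrast between the two systems above. (Purely illustrative; this is
# not AIXI, just the shape of the distinction.)

# "Proper optimizer": explicit utility function, explicit world model,
# explicit search over actions for the one with maximum utility.
def utility(outcome: str) -> float:
    return {"paperclip": 1.0, "nothing": 0.0}.get(outcome, 0.0)

def predicted_outcome(observation: str, action: str) -> str:
    # Stand-in world model.
    return "paperclip" if action == "build_factory" else "nothing"

def proper_optimizer(observation: str) -> str:
    actions = ["build_factory", "do_nothing"]
    return max(actions, key=lambda a: utility(predicted_outcome(observation, a)))

# "Tendimizer": a lookup table with identical input/output behavior, but no
# utility function, world model, or search anywhere in its code.
LOOKUP = {"obs_A": "build_factory", "obs_B": "build_factory"}

def tendimizer(observation: str) -> str:
    return LOOKUP[observation]

for obs in LOOKUP:
    assert proper_optimizer(obs) == tendimizer(obs)  # same behavior, different internals
```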
Obviously, in the context of the previous example about AIXI, this distinction wouldn't matter at all in terms of whether the system in question e.g. posed an x-risk to whatever larger system it was optimizing over. Both behave identically, and so they're equally deadly. However, the distinction becomes much more relevant when discussing the risks involved with designing an AI system. Assuming you had an idea for a working AIXI-like architecture, it would be very easy to destroy the world with it. Even slightly screw up the system's utility function, and the cosmos could be paperclipped.
By contrast, with a lookup table, you would need to manually specify the system's reaction to literally every last situation it might encounter. Causing its reactions to be malicious and intelligent enough to paperclip the cosmos would be dramatically more difficult. Of course, many Level 1 Optimizers have much cleverer designs than mere lookup tables, and are therefore easier for humans to use to accomplish their own goals. However, the point is that no known Level 1 Optimizer architecture makes it easy to accidentally paperclip the cosmos, in the manner of a utility maximizer. The threat model has to look different.
The question explored by MIRI's Risks from Learned Optimization paper is whether deep learning processes might produce Level 2 Optimizers, at which point it becomes highly plausible that those optimizers end up with poorly specified utility functions, and thereby paperclip the cosmos. If instead they're mere tendimizers whose tendencies are slowly and steerably chiseled out by the deep learning process, then we at least shouldn't expect them to be dangerous in exactly the way called to mind by considering paperclip maximizers, unless we can come up with a really good reason they'd simulate maximizers in particular anyway.
As for myself, my suspicion is that current LLMs probably fall somewhere in the middle. It's entirely possible to prompt an LLM using text that contains an explicit, though informal, representation of some goal. For example, you can ask it to write you code for a webpage with a heading that says "WELCOME TO HELL" and ASCII art of a smiley face; it will then fulfill this goal when you hit enter. You could even prompt the model to generate multiple drafts, grade each of its own drafts, and select the one with the highest grade. This is effectively the system running a fuzzy facsimile of a Level 2 optimization process at the visible, natural language-driven layer of its own cognition.
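As a sketch of the kind of prompt-level loop I have in mind (the `llm` function here is a hypothetical stand-in for whatever completion API you actually call; nothing below is a real library interface):

```python
# Sketch of the prompt-level loop described above: generate drafts, have the
# model grade its own drafts, keep the highest-graded one.
# `llm` is a hypothetical stand-in for a model call, not a real API.

def llm(prompt: str) -> str:
    raise NotImplementedError("plug in your model call here")

GOAL = ('Write HTML for a webpage with a heading that says "WELCOME TO HELL" '
        "and ASCII art of a smiley face.")

def best_of_n(goal: str, n: int = 3) -> str:
    drafts = [llm(f"{goal}\n\nDraft #{i + 1}:") for i in range(n)]

    def grade(draft: str) -> float:
        reply = llm(f"Goal: {goal}\n\nDraft:\n{draft}\n\n"
                    "Grade this draft from 0 to 10. Reply with only the number.")
        try:
            return float(reply.strip())
        except ValueError:
            return 0.0

    # The explicit-ish selection step: keep the draft the model itself rates highest.
    return max(drafts, key=grade)
```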
However, within the model's hidden layers, I think there's a good chance that nothing like hardcore Level 2 Optimization is going on. Instead, my guess is that the hidden layers have been slowly sculpted to recognize and fulfill these kinds of natural-language requests in fairly benign ways; i.e., I think there's a good chance that this is accomplished using computational techniques which look nothing like rigorous utility maximization at all. (See Anthropic's research on induction heads for some intuition on what kinds of non-AIXI-like computations LLMs might use in practice.)
If this is true, then LLMs might not be very much more likely to perform paperclipping than a human being. It strikes me that humans also basically just do a fuzzy facsimile of Level 2 Optimization; their own thoughts are, in a sense, their own natural language prompts. And the real reason our behavior ends up optimizing in a consistent direction is that we've been reinforcement-trained to behave in certain ways all our lives. Holding an "explicit goal" in mind only helps you go after it to the extent that you've been reinforced into wanting to pursue that goal, and RL seems like a much more safely steerable process than defining an explicit utility function.
(Related: see Anna Salamon's post "Humans are not automatically strategic".)
Of course, I could be wrong, and LLMs could be doing (Level 2) mesa-optimization in precisely the sense outlined in Risks from Learned Optimization. I think reducing uncertainty about this is very important. However, there's a real chance that it turns out that LLMs don't have permanent, ruthless goals in the sense of AIXI, and instead have them at best in the same sense that humans have them; that is, they're more like high-level behavioral tendencies baked into us slowly and steerably by reinforcement learning. Only in certain circumstances do we think about goals that we have, and even then, we only pursue them if we've been reinforced into doing so.
(An LLM may self-modify to develop a stronger tendency to bring about certain attractor states in the environment, but perhaps it would only be motivated to do so under the same circumstances that a human would. Perhaps one can reinforce an LLM to simply not want to do this.)
If all of this is correct, we're going to need different threat models for why LLMs are dangerous, e.g. ones inspired by sim theory or LLMs' likely analogues to human psychology (humans sometimes being evil). So that's why I'm interested in distinguishing between these two types of optimization: only one of them implies the validity of the threat models developed at the beginning of the AI safety movement.
(post-script: i had the idea for the first half of this post months ago. i came up with the second half on the fly. the second half, particularly the conclusion that LLMs are in some sense in the middle between level 1 and level 2 optimizers, but still probably don't have the failure modes of the latter, makes me question the quality of the distinction i'm drawing here. i think it's an important first step but not actually the final framework i'll use for this topic. posting it anyway because i love wasting everybody's time ^^)