Published on July 27, 2024 8:44 AM GMT
Editor's note: I was thinking this through as I was writing it, and I could probably make it much clearer if I rewrote it from scratch now. Still, I have various problems with perfectionism that make releasing this as-is the preferred alternative.
So, within the ratosphere, it's well-known that every physical object or set of objects is mathematically equivalent to some expected utility maximizer (or actually, an infinitely (or non-haltingly) large number of different expected utility maximizers). All you have to do is define a utility function which, at time T, takes in all the relevant context within and around a given physical system, and assigns the highest expected utility to whatever actions that system actually takes to produce its state at time T+1.
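To make that construction concrete, here's a minimal formal sketch (my own notation, nothing standard): write $f$ for the system's actual dynamics, so $f(s_T)$ is whatever the system does in state $s_T$ to produce its state at time T+1. One perfectly serviceable utility function is then

$$U_f(s_T, a) = \begin{cases} 1 & \text{if } a = f(s_T) \\ 0 & \text{otherwise,} \end{cases}$$

under which the expected-utility-maximizing "action" in every state is, by construction, exactly what the system was going to do anyway. Any other function that happens to peak at $f(s_T)$ in every state works just as well, which is where the infinitely many equivalent maximizers come from.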
For example: a calculator takes inputs from the user and spits out numbers in response. You can imagine an expected utility maximizer that takes those same inputs, and assigns the highest expected utility to giving the same outputs as the calculator would. For another example: the laws of physics take the current configuration of the universe as input, and spit out the configuration of the universe at the next moment in time as the output. You can imagine an expected utility maximizer doing exactly the same thing.
(For clarity, maybe you could picture the space of "possible" outputs this expected utility maximizer could produce as the entire space of possible ways the universe could be configured. Again, it just selects outputs in line with what we understand to be the laws of physics because that's what its utility function smiles upon.)
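Here's a toy Python sketch of the calculator case (the code and names are purely illustrative assumptions, not a real system): the calculator's ordinary behavior gets wrapped in an argmax over candidate outputs, with a utility function rigged to peak at whatever the calculator would have output anyway, so the two descriptions are behaviorally identical.

```python
# Toy illustration (hypothetical, not a real system): a calculator
# re-described as an expected utility maximizer over its own outputs.

def calculator(expression: str) -> float:
    """The ordinary physical system: takes input, spits out a number."""
    return eval(expression)  # stand-in for the calculator's circuitry

def rigged_utility(expression: str, candidate_output: float) -> float:
    """A utility function defined to peak at whatever the calculator
    actually outputs -- the trivial 'equivalence' described above."""
    return 1.0 if candidate_output == calculator(expression) else 0.0

def calculator_as_maximizer(expression: str, candidate_outputs) -> float:
    """Picks the candidate output with the highest 'expected utility'.
    By construction, it agrees with the plain calculator."""
    return max(candidate_outputs, key=lambda out: rigged_utility(expression, out))

# Both print 4: the two descriptions never disagree on any input.
print(calculator("2 + 2"))
print(calculator_as_maximizer("2 + 2", candidate_outputs=[3.0, 4.0, 5.0]))
```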
On some level, this stuff is just sheer mathematical playfulness. Stating that a system is "mathematically equivalent" to a certain class of expected utility maximizers doesn't let you make any new predictions about how it'll actually respond to any given input. However, I think that this framework can help point us to some interesting puzzles pertaining to, e.g., the AI alignment problem. For example: how do we reconcile the ideas that 1) most imaginable expected utility maximizers would drive humanity to extinction, and 2) we humans are still alive, even though every single physical system in existence is mathematically equivalent to some class of flawless expected utility maximizers?
Well, probably the most obvious place to start explaining this is that nobody ever says that all utility maximizers doom humanity by default. While systems like paperclip maximizers have good reason to take out humanity for the sake of converting us into paperclips, a utility maximizer that's in an equivalence class alongside "just implement the laws of physics as normal" typically doesn't have that same incentive to use the atoms that constitute You for something else. Whether it's the inputs and outputs of a calculator or the evolution of the universe itself, most physical systems are simply nowhere near "paperclip maximizer" levels of hostile to human beings; therefore, the utility maximizers in their equivalence classes just "happen to have" utility functions that are sufficiently delicately specified so as not to quickly result in human extinction.
(Notably, some physical systems running as normal do occasionally kill humans. And, if we're interpreting all physical systems as expected utility maximizers, this is simply a byproduct of the universe running its expected utility maximization algorithm as normal. This "kill humans as a byproduct of normal operation" property is the same as it is in a paperclip maximizer. However, most physical systems (with exceptions like "large explosions in close proximity to humans") don't quickly and relentlessly kill humans the way paperclip maximizers probably would, which means that the utility maximizers in their equivalence classes are relatively safe, compared to those in classic AI doom scenarios.)
Overall, one takeaway from this line of argument is that, if an actually existing physical system is merely equivalent to an expected utility maximizer, that's not a good reason to worry about it paperclipping the rest of the universe. Just by virtue of how few physical systems possess the relentless, terrifying properties of paperclip maximizers, most of them are actually equivalent to a whole class of utility maximizers whose utility functions are reasonably well-aligned with human values, despite the common worry that "most conceivable utility functions" are incredibly dangerous.[1] But that then raises a question: when exactly do we have to worry about the failure modes traditionally associated with expected utility maximizers?
Well, I think one way to tease out some details is by comparing two strategies for solving a common, everyday problem; let's say getting your thoughts on some topic out into the world. A time-tested way of solving this problem is just to write up your thoughts in some document, and put that document somewhere others might read it. Remember that the document you produce is equivalent to some set of expected utility maximizers, taking in the state of its surroundings as inputs and affecting its future surroundings as outputs (e.g. by "detecting" nearby light rays at time T and then reflecting or absorbing them at time T+1, or detecting nearby objects at time T and then moving them via gravity at time T+1).
However, when you think about the utility functions the document's physical presence in the world is equivalent to, you'll notice that the types of interactions it typically has are basically benign and trivial; a document sitting on your bookshelf is hardly any better or worse for the realization of human values than, say, an asteroid floating through some random sector of deep space. Really, the document's effects on the rest of the world only leave the basin of "negligible value in either direction" in the very specific context where humans are actually reading the document, effects we can easily predict in advance and expect to be positive. This is the nature of roughly all man-made tools: the utility functions they're equivalent to mostly result in the universe being "optimized" in ways that are no better or worse than it was before, with only a few narrow sets of "inputs" they can take (physical situations they can be in) where they predictably and reliably produce an unusual amount of value for humans.
In other words, by authoring a document, you're constructing an object whose "utility function" is, implicitly, extremely well-considered. That utility function certainly isn't an ultra-ambitious promoter of all human values, like the kinds of utopian superintelligences we're usually hoping for when we talk about utility maximization, but it typically doesn't interfere with the pursuit of human values either, and in a few cases its "values" happen to be unusually well-aligned with ours (i.e., cases where a human is actually reading the document). This kind of safe, implicit utility maximizer is what humans typically produce when they're trying to solve problems: objects that don't deviate too far from the fairly safe status quo of the universe, but which differ in just the right places such that we can expect them to largely be a net good for the satisfaction of our desires.
(Tangentially, I'm guessing we design objects this way thanks to the fact that our idea-generating processes are shaped by a mix of predictive learning, giving us the power to model the world and so realistically guess the effects of acting on some idea; and reinforcement learning, helping us act in ways that align with what humans find pleasurable and avoid what's painful (e.g. by coming up with good ideas.))[2]
So, what exactly makes the utility maximizers of AI doom scenarios so much scarier than those we're creating every time we reorganize a physical system? Well, the answer comes in two parts, the first of which is that in the doom scenarios, the architectures of the maximizing machines make it easy for humans to specify more or less arbitrary goals for the system to pursue. With systems that are merely equivalent to utility maximizers (keep the examples of books and tools in mind), you have to cobble together various hacks and highly situational insights to get a system's "goals" (which manifest in its behaviors in various situations) to even partly align with your own. Explicit utility maximizers, by contrast, promise a ~fully general method for building systems which pursue whatever goal you have in mind, assuming you can specify it precisely.
Of course, "assuming you can specify it precisely" is a pretty big assumption. We humans have spent our whole lives learning to doctor physical systems in ways that cause them to occasionally and incidentally behave in ways that are in line with our desires; we don't have as much practice explaining the desires themselves, certainly not with the level of precision traditionally associated with computer programming.[3] So, again consider the desire to get your thoughts out into the world. We're well-trained to build objects (e.g. books) which fulfill that desire, but we're not well-trained to put that desire into words well enough avoid building systems that, e.g., burrow into people's brains to convey your thoughts to them with maximum efficiency, and possibly hurt or kill them in the process. (That specific failure mode might be avertable, but the point is you probably can't catch them all.)
This is a pretty standard analysis of the failure modes of explicit utility maximizers, but I think we can use the comparison with merely implicit utility maximizers to gain some more insight. Again, really all systems are utility maximizers; what sets the dangerous ones apart is that their computational architecture makes their utility function relatively easy to read and modify in terms of which abstract, long-term outcomes they aspire to. This makes it possible, and therefore tempting, to try and design systems which constantly act in accordance with all of our values, rather than occasionally acting in accordance with some of our values, and otherwise acting ~neutrally.
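To make the contrast concrete, here's a minimal Python sketch (all the names are my own illustrative assumptions, not a real API): the "explicit" maximizer exposes its utility function as a swappable parameter and searches over actions for whatever goal you plug into it, while the "implicit" one is just a fixed behavior whose equivalent utility function can only be read off after the fact.

```python
# Illustrative sketch only: explicit vs. implicit "utility maximizers".

def explicit_maximizer(utility, actions, state):
    """Searches over actions for whatever utility function you hand it.
    The goal is an explicit, swappable parameter: easy to read and modify."""
    return max(actions, key=lambda action: utility(state, action))

def book_behavior(state):
    """A fixed behavior, like a book on a shelf: it just does what it does,
    regardless of which 'goal' you later ascribe to it."""
    return "reflect light and sit still"

def implied_book_utility(state, action):
    """The post-hoc utility function the book is trivially 'equivalent' to:
    it peaks on exactly the action the book actually takes."""
    return 1.0 if action == book_behavior(state) else 0.0
```

The danger lives in `explicit_maximizer`: because `utility` is an explicit argument, it's easy (and tempting) to hand it an ambitious, slightly misspecified goal and let it search hard in every situation. `book_behavior` has no such knob; its "utility function" is only a description you write down afterwards.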
So, those are the two reasons why explicit utility maximizers are more dangerous than implicit ones:
- As tempting as it is, it's hard to explicitly specify all of your values, even harder than it is to specify individual ones like "I want to get my thoughts out into the world". (This owes partly to the difficulties of defining what "all of your values" means to begin with.)
- A system which enacts a slightly misspecified version of your values is a lot more dangerous if it's trying to enact those values all the time, rather than only in particular circumstances, and otherwise acting benignly.
So, I suppose this points to an interesting way of framing the alignment problem. Instead of "How do we build systems that robustly pursue human values?", the question is more like, "How do we build systems whose behavior is aligned with our values in some cases, and simply neutral in all others?" Tools with these types of implicit utility functions, which are only aligned in some places but hardly ever misaligned, are near-universal across human history; the question is just whether we can safely scale that property up, slowly increasing the number of cases where we can confidently predict a system will make "decisions" we endorse while leaving the leftover aspects of its behavior unhelpful but mostly harmless.
(Which isn't to say such a system has to be entirely harmless. Just as a book might fall off its bookshelf and bonk you on the head, thereby making a small "mistake" per human values in the process of perfectly maximizing its implicit utility function, a safe, ~aligned AI system could in principle misbehave in small ways as well, without causing catastrophic consequences.)
Part of me wonders whether current LLMs are already kind of like this, RLHF slowly bootstrapping them toward behavior that's frequently aligned, sometimes useless, and rarely harmful; that's at least a decent description of how it works in theory. There's at least a chance that we're on track to building systems with implicit utility functions that are often useful and rarely harmful, and even if not, it's at least an ideal to aspire towards. (For an argument that deep learning might develop highly "structured" and therefore probably dangerous ~[utility functions], the encoding of which we could simply fail to crack before we all die, see MIRI's Risks from Learned Optimization paper. A more precise definition of this post's idea of explicit vs. implicit utility functions might help assess those risks more clearly; that sounds like a good direction for future research.)
Anyway, as a kind of coda to this post, I'd like to offer up a remedy to some issues I see seeping into discourse about AI, utility functions, and optimization. In everyday usage, central examples of "optimization" include processes carried out with an explicit goal in mind, e.g. optimizing your diet for longevity, or optimizing your technique for throwing harder punches. But LessWrong tends to embrace a somewhat looser usage of optimization; for example, there's Eliezer's old definition, which included forces like natural selection just because they hit relatively small targets in a large possibility space, and Alex Flint's only slightly stricter definition, which pertains to systems that robustly tend to cause certain outcomes even in the face of significant shocks to the system. Neither of these definitions requires an optimizer to have explicit goals.
However, I think we could avoid some confusion by trying a little harder to reserve "optimization process" for events guided by something like explicit goals, e.g. explicit utility functions. To refer to the broader concept, described above, I propose we use a certain neologism: "probutation process". ("Probutation" being derived from probability and mutation; the idea is that the system probabilistically evolves, or probutates, its surroundings into a certain set of configurations.)
By upholding a distinction between optimization and probutation, I think we might encourage people to more consistently differentiate systems with explicit vs. implicit goals, helping us think more clearly about the unique dangers of explicit utility maximizers. It might also help prevent the misconception that superintelligence could only ever be driven by an explicit optimization process, rather than mere probutation. Per our earlier considerations about the unique danger of explicit utility functions, this might mitigate undue pessimism about our odds of surviving the singularity.[4]
[1] I'm not actually that confident that "most utility functions" give the maximizers they belong to horrifying incentives. There's an infinite (i.e. non-haltingly generable) set of conceivable utility functions, and when something is infinite, that means you can't straightforwardly apply the concept of fractions. Instead, the best you can do is somehow take a "random sample" of utility functions and see how many of those kill all humans. But how you might take a "random sample" isn't especially well-defined here, as far as I'm aware. It seems plausible that "most utility functions paperclip the universe" isn't quite right, and it's more like "most utility functions humans might come up with in trying to build an AI that optimizes the universe per our values paperclip the universe." It's not hard to generate ideas for benign-but-useless utility functions, if you set your mind to the task.
[2] I've thought about this stuff extensively; here's a Google Doc I spent a month writing up on this and adjacent topics. It's not good enough for LessWrong, partly because the presentation is bad and partly because I didn't do enough empirical research, but those who push through it will probably find value in it anyway.
[3] And especially not the complex functions which generate our more moment-to-moment goals. That ur-process ensures said short-term goals (like getting up to go eat food) are good for us and non-psychopathic and so on; for architectural reasons, its effects are hard to fully capture in a utility function. Specifically, I suspect that humans' goal-producing algorithms are shaped by something more like deep learning, rather than utility maximization; I'll write more on this at some point, since I think it offers insight into what kinds of systems we can trust to behave in accordance with "our values". (I'll probably critique that phrase in the eventual post as well.)
[4] Late into the third night of Vibecamp 3, I was awake in bed, coming down from an acid trip. My friends in the Kaleidoscope cabin were lying on the floor, discussing one member's guilt about being an imperfect optimizer. This wasn't the first time I'd heard of ~rationalists being psychologically harmed by how the community talks about optimization; the zeitgeist, which suggests that acting in accordance with some known utility function is a central pillar of intelligence (and, relatedly, a virtue), is at odds with the fact that humans are motivated less by utility functions than loss functions. Thus, the zeitgeist can cause undue psychological distress. I would rather put a stop to that as well.

Using "probutation" rather than "optimization", where appropriate, may at least slightly mitigate some rationalists' excessive focus on knowingly goal-directed behavior.