Published on May 22, 2025 5:36 PM GMT
In the context of model-based RL agents in general, and brain-like AGI in particular, part of the source code is a reward function. The programmers get to put whatever code they want into the reward function slot, and this decision will have an outsized effect on what the AGI winds up wanting to do.
One thing you could do is: hook up the reward function to a physical button on a wall. And then you’ll generally (see caveats below) get an AGI that wants you to press the button.
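To make that concrete, here is a minimal sketch of what the “reward function slot” might look like when wired to a physical button. Everything in it (the `Observation` class, the `button_pressed` field) is a hypothetical illustration, not any real system’s API.

```python
# Hypothetical sketch: the programmer-supplied reward function, wired to a
# physical button on a wall. Names are illustrative only.

from dataclasses import dataclass

@dataclass
class Observation:
    button_pressed: bool  # e.g. read from a circuit attached to the wall button
    # ...other sensory inputs (camera frames, etc.) would go here

def reward_function(obs: Observation) -> float:
    """Ground-truth reward: fires if and only if the button is being pressed."""
    return 1.0 if obs.button_pressed else 0.0
```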
A nice intuition is addictive drugs. If you give a person or animal a few hits of an addictive drug, then they’re going to want it again. The reward button is just like that—it’s an addictive drug for the AGI.
And then what? Easy, just tell the AGI, “Hey AGI, I’ll press the button if you do (blah)”—write a working app, earn a million dollars, improve the solar cell design, you name it. And then ideally, the AGI will try very hard to do (blah), again because it wants you to press the button.
So, that’s a plan! You might be thinking: this is a very bad plan! Among other issues (see below), the AGI will want to seize control of the reward button, and wipe out humanity to prevent them from turning it off, if that’s possible. And it will surely be possible as we keep going down the road towards superintelligence.
…And you would be right! Reward button alignment is indeed a terrible plan.
So why am I writing a blog post about it? Three reasons:
- People will be tempted to try “reward button alignment”, and other things in that genre, even if you and I think they’re very bad plans. So we should understand the consequences in detail—how far can it go before it fails, and then what exactly goes wrong?
- It’s not quite as obviously terrible a plan as it sounds, if it’s not the whole plan by itself but rather one stage of a bootstrapping / “AI Control”-ish approach.
- It’s a nice case study that illuminates some still-controversial issues in AGI alignment.
So here is my little analysis, in the form of an FAQ.
1. Can we actually get “reward button alignment”, or will there be inner misalignment issues?
In other words, will the AGI actually want you to push the button? Or would it want some random weird thing because inner alignment is hard?
My answer is: yes, it would want you to push the button, at least if we’re talking about brain-like AGI, and if you set things up correctly.
Again, getting a brain-like AGI addicted to a reward button is a lot like getting a human or animal hooked on an addictive drug.
Some details matter here:
- You need to actually give the AGI a few “hits” of the reward button; it’s not enough to just tell the AGI that the reward button exists. By the same token, humans get addicted to drugs by trying them, not by learning of their existence. Algorithmically, the “hits” will update the value function, which feeds into planning and motivation (more in §9.4–5 here).
- If there are meanwhile other contributions to the reward function besides the reward button, e.g. a curiosity drive, then you want to make sure they’re sufficiently smaller (less rewarding) than the reward button, so as not to outvote it. Or perhaps you could just turn those other competing sources-of-motivation off altogether (and zero out the value function) before starting with the reward button, to be safe. (This need not tank the AGI’s capabilities, for reasons here.)
- The AGI needs to be able to actually see that you’re pressing the reward button. Give it a video feed, maybe. Then it will have a sensory / semantic input that reliably immediately precedes the primary reward, and thus “credit assignment” will make that button-press situation seem very good and motivating to the AGI.
- You want to do all this after the AGI already has a pretty good understanding of the world, e.g. from a sandbox training environment with YouTube videos and so on.
- You should establish a track record of actually pressing the reward button under the conditions that you say you will, so that the AGI trusts you.
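To make the “credit assignment” point in the list above concrete, here is a toy tabular TD-learning sketch (my own illustration, with made-up states, not anything from the linked posts): a handful of reward-button “hits” is enough to make the button-press situation, and the observations that precede it, look very good to the agent.

```python
# Minimal TD(0) sketch: a few reward-button "hits" make the button-press
# situation high-value. Purely illustrative; a real brain-like AGI would use a
# learned world model and function approximation, not a lookup table.

ALPHA, GAMMA = 0.5, 0.9
value = {"working": 0.0, "human_reaches_for_button": 0.0, "button_pressed": 0.0}

def reward(state: str) -> float:
    return 1.0 if state == "button_pressed" else 0.0

episode = ["working", "human_reaches_for_button", "button_pressed"]

for _ in range(5):  # just a handful of "hits"
    for s, s_next in zip(episode, episode[1:]):
        td_error = reward(s_next) + GAMMA * value[s_next] - value[s]
        value[s] += ALPHA * td_error

print(value)  # the states leading up to the button press now look very good
```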
2. Seriously? Everyone keeps saying that inner alignment is hard. Why are you so optimistic here?
I do think inner alignment is hard! But this is inner alignment on “easy mode”, for two reasons.
First, as mentioned above, the pressing of a reward button is a salient real-world event that reliably and immediately precedes the ground-truth reward. So credit assignment will have an easy time latching onto it. Compare that with the more abstract targets-of-desire that we might want for a superintelligence—things like “a desire to be helpful”, or “a desire for Coherent Extrapolated Volition”, or “a desire for practical whole brain emulation technology”.
Second, as discussed below, I’m assuming sufficiently good security around the reward button, and against human manipulation, such that the AGI doesn’t have any feasible out-of-the-box strategies to satisfy its desires. In other words, I’m assuming away a whole host of issues where the AGI’s strategies shift as a result of learning new things, having new ideas, or inventing new technology—more on that in my post “Sharp Left Turn” discourse: An opinionated review.
Again, what we really care about is alignment of more powerful AGIs, AGIs that can brainwash people and seize reward buttons, and we want to make those AGIs feel intrinsic motivation towards goals that are more abstract or novel. That’s a much harder problem.
3. Can we extract any useful work from a reward-button-aligned AGI?
Sure! (At least, up until the point where it can manipulate us into pressing the button, or seize control in other ways.)
We do need to be mindful that the AGI will only care about us pressing the button, not about our long-term best interest. But that’s enough for a huge amount of “useful work”. For example, suppose I tell the AGI:
Hello AI. Here is a bank account with $100K of seed capital. Go make money. I’ll press the reward button if I can successfully withdraw $1B from that same bank account in the future. (But I’ll wait 1 year between withdrawing the funds and pressing the reward button, during which I’ll perform due diligence to check for law-breaking or any other funny business. And the definition of ‘funny business’ will be at my sole discretion, so you should check with me in advance if you’re unsure where I will draw the line.) Good luck!
And let’s assume the AI has reason to believe me (e.g. because I’ve been true to my word in the past), and is purely motivated by the reward button, but is not yet capable of brainwashing me or stealing my button.
(I guess that last part is rather implausible if it can already autonomously make $1B, but maybe we have taken precautions against brainwashing and button-theft (see below), or else you can substitute a less ambitious project, like making a successful app or whatever.)
This doesn’t require that I have any particular skill at overseeing the AGI. It just requires that I can hire competent lawyers and accountants for pretty basic due diligence. The extra year will also allow for law enforcement or public outcry or whatever to surface any subtle or sneaky problems caused by the AGI.
So that’s a way to automate the economy and make trillions of dollars. Until catastrophic AGI takeover, of course.
4. Can “reward button alignment” be used as the first stage of a bootstrapping / “AI control”-ish plan?
In other words, if we have this “reward button alignment” band-aid, one thing we could do is ask the AGI to come up with a better plan for technical alignment. And we promise to slam the reward button if we’re happy with the results. Then (one hopes) the band-aid doesn’t need to hold forever. It just has to hold long enough to get a better technical alignment plan.
The two things that could go wrong are:
(1) The AGI could output a plan that convinces me, but won’t actually work. Remember that we already have people strongly disagreeing about what constitutes a good plan for alignment (e.g. Eliezer Yudkowsky versus Paul Christiano), and it’s just very hard for third parties to figure out what is true. This problem was the theme of @johnswentworth’s post “The Case Against AI Control Research”.
(2) The AGI could find that it is easier to go rogue and brainwash me or seize the reward button, than to actually figure out technical alignment.
We can try to mitigate (2) by using security measures to delay escape and hijinks for as long as possible (see below).
And what about (1)? Well one thing is: We could use AGI debate! We could set up two copies of the AGI, and tell one to make a plan and the other to say why it’s a bad plan. We’ll slam the reward button for one or the other, depending on who is more convincing.
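Concretely, the debate protocol might look something like the sketch below. Every function here (`ask_agi`, `human_judge`, `press_reward_button_for`) is a hypothetical placeholder; the point is just the shape of the loop.

```python
# Hedged sketch of reward-button "debate". All functions are hypothetical
# placeholders, not real interfaces.

def ask_agi(instance_id: str, prompt: str) -> str:
    raise NotImplementedError  # stand-in for querying one AGI copy

def human_judge(plan: str, critique: str) -> str:
    raise NotImplementedError  # returns "proposer" or "critic"

def press_reward_button_for(instance_id: str) -> None:
    raise NotImplementedError  # delivers ground-truth reward to that copy

def run_debate_round() -> None:
    plan = ask_agi("proposer", "Propose a technical alignment plan.")
    critique = ask_agi("critic", f"Explain why this plan fails:\n{plan}")
    winner = human_judge(plan, critique)
    press_reward_button_for(winner)
```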
This is not a crazy idea, and indeed when I think about debate in this context, it seems much more interesting to me than it used to.
So, this kind of bootstrapping approach is not a completely crazy idea. I still think it won’t work though. I just think solving the technical alignment problem (and communicating the solution to a human) is harder than subverting the setup (e.g. convincing a person to let it onto the internet). In the context of AGI debate, recall that both of the AGI debaters (and the AGI judge, if applicable) have a strong shared interest in subverting the setup and getting control of the reward buttons, and thus will try to cooperate to make that happen.
But, I suppose it’s good that people are brainstorming in this area.
5. Wouldn’t it require zillions of repetitions to “train” the AGI to seek reward button presses?
No. I acknowledge that AlphaZero needs the reward function to trigger millions of times before it gets very good at Go. But, well, so much the worse for AlphaZero! That’s not a universal limitation on RL.
Humans did not need millions of attempts to go to the moon; they just needed an expectation that good things would happen if they went to the moon (social status, job security, whatever), and then they systematically figured out how to do it.
Or for a more everyday example, I only need to taste a delicious restaurant meal once, before wielding great skill towards going back to that same restaurant later—including booking a reservation, figuring out what bus to take during road construction, and so on.
Thus, just a few hits of the reward button should suffice to make a brain-like AGI feel strongly motivated to get more hits, including via skillful long-term planning and means-end reasoning.
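Here is a toy illustration of why a single “hit” can suffice (my own example, using a hand-written world model in place of a learned one): once the restaurant, or the reward button, is marked as high-value, a model-based planner can chain actions through its world model to get back there, with no further trial and error.

```python
# Toy model-based planning sketch: one rewarding experience is enough, because
# the agent plans with its world model rather than re-learning from scratch.
# The "world model" here is just a hand-written state graph for illustration.

from collections import deque

world_model = {           # state -> {action: next_state}
    "home":        {"book_table": "reservation", "stay_in": "home"},
    "reservation": {"take_bus": "restaurant"},
    "restaurant":  {},
}

def plan_to(goal: str, start: str = "home") -> list[str]:
    """Breadth-first search through the world model for an action sequence."""
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        state, actions = queue.popleft()
        if state == goal:
            return actions
        for action, nxt in world_model[state].items():
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, actions + [action]))
    return []

# After one delicious meal, "restaurant" is marked high-value, and planning
# immediately yields the multi-step route back to it:
print(plan_to("restaurant"))  # ['book_table', 'take_bus']
```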
6. If the AGI subverts the setup and gets power, what would it actually want to do with that power?
It’s hard to say. Maybe it would feel motivated to force humans to press the reward button over and over. Or brainwash / drug them to want to press the reward button. Or maybe it would be OK building a robot to press the reward button itself. Maybe it would build more reward buttons, each of course connected back to itself. It would probably want to wipe out most or all humans to prevent them from messing with its reward button setup, if that’s an option for it.
Where are these opinions coming from, and why can’t I be more certain? Well, (1) it chooses a plan at time t by generalizing from its then-current world model and value function, and (2) after executing a plan, its value function updates to better match the reward function, by Temporal Difference learning. Then repeat. The part that’s hard to predict is the out-of-distribution generalization involved in step (1), which relates to goal misgeneralization, and can have important impacts in the cases of irreversible actions (e.g. self-modification) or deliberately avoiding certain actions (analogous to humans avoiding addictive drugs). Anyway, predicting how a function will generalize out-of-distribution is always hard.
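Schematically (my own pseudocode-flavored sketch, with every function passed in as a hypothetical placeholder), the loop described above looks like:

```python
# Schematic of the loop in the paragraph above: (1) choose a plan by
# generalizing from the current world model + value function, (2) update the
# value function toward the reward via TD learning, then repeat.
# All functions are hypothetical placeholders.

def agent_loop(world_model, value_fn, reward_fn, choose_plan, execute, td_update):
    while True:
        # Step (1): out-of-distribution generalization happens here; this is
        # the hard-to-predict part (goal misgeneralization, irreversible
        # actions, deliberately avoiding certain actions, etc.)
        plan = choose_plan(world_model, value_fn)
        trajectory = execute(plan)
        # Step (2): the value function drifts toward the reward function
        value_fn = td_update(value_fn, trajectory, reward_fn)
```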
Anyway, I think the upshot is that x-risk is high but not 100%—even if an AGI has the power to wipe out humans, it might choose not to. Of course, if people repeat the experiment enough times, with more and more competent AGIs, x-risk approaches 100%.
On the plus side, s-risk (risk of astronomical amounts of suffering) seems very low for this kind of approach.
7. How far can this kind of plan go before the setup gets subverted? E.g. can we secure the reward button somehow?
There’s some relevant discussion in Bostrom’s Superintelligence (2014). I’ll give an excerpt and then circle back to how well (or poorly) this discussion has aged (p162):
…A better alternative might be to combine the incentive method with the use of motivation selection to give the AI a final goal that makes it easier to control. Suppose that an AI were designed to have as its final goal that a particular red button inside a command bunker never be pressed. Since the pressing of the button is disvalued intrinsically and not because of its causal consequences, the button could be completely inert: it could be made of Play-Doh. Furthermore, it is irrelevant whether the AI can ever know whether the button had been pressed. What is essential is that the AI believes that the button will more likely remain unpressed if the AI continuously acts in the principal’s interest than if it rebels.
Refinements to this setup are possible. Instead of trying to endow an AI with a final goal that refers to a physical button, one could build an AI that places final value on receiving a stream of “cryptographic reward tokens.” These would be sequences of numbers serving as keys to ciphers that would have been generated before the AI was created and that would have been built into its motivation system. These special number sequences would be extremely desirable to the AI, constituting a special kind of reward token that the AI could not attain through wireheading. The keys would be stored in a secure location where they could be quickly destroyed if the AI ever made an attempt to seize them. So long as the AI cooperates, the keys are doled out at a steady rate.
The AI could then be given a bounded utility function, and matters could be arranged so that the AI would get (for instance) 99% of its maximum utility from the first reward token; 99% of its remaining utility potential from the second reward token; and so on. By keeping the “discount rate” high, and by transmitting reward tokens to the AI at a reasonably high rate, one would dissuade the AI from attempting to carry out long-term plans that would increase the expected number of reward tokens in the distal future at the expense of some risk that its reward stream will be cut short or that the next reward token will be cancelled and destroyed. If the AI always believes that defecting (i.e. doing anything other than its best to look after the principal’s interests) would incur at least a 2% risk of forfeiting the next reward token, whereas cooperating would reduce the risk of forfeiting the next reward token to below 1%, then a utility-maximizing AI would always cooperate.
What might go wrong with such an incentive scheme? One possibility is that the AI will not trust the human operator to deliver the promised rewards. The track record of human reliability is something other than a straight line of unerring perfection. The AI would reasonably worry that the operator will change his mind, or raise the performance bar, or fail to recognize that the AI has done its part. The AI would also worry about the operator becoming incapacitated. The combined risk of such failures might exceed the risk of attempting to seize control of the reward mechanism. Even a boxed AI possessing the panoply of superpowers is a strong force. (For an AI that is not boxed to begin with, hijacking the human-governed reward mechanism may be like taking candy from a baby.) Another problem with the incentive scheme is that it presupposes that we can tell whether the outcomes produced by the AI are in our interest. As later chapters will elaborate, this presupposition is not innocuous.
A full assessment of the feasibility of incentive methods would also have to take into account a range of other factors, including some esoteric considerations that might conceivably make such methods more viable than a preliminary analysis would suggest. In particular, the AI may face ineliminable indexical uncertainty if it could not be sure that it does not inhabit a computer simulation (as opposed to “basement-level,” non-simulated physical reality), and this epistemic predicament may radically influence the AI’s deliberations (see Box 8).
This excerpt definitely shows its age. In particular, given our better modern understanding of inner misalignment, it’s now rare to hear talk about “giving” a “utility function” or “final goal” to an AI. And without that ability, some of Bostrom’s detailed suggestions don’t really work. Nevertheless, using the model-based RL techniques above, we can rescue the core “cryptographic tokens” idea from the second paragraph above. And his failure mode discussion remains mostly valid as well.
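For example, here is one hedged way the “cryptographic reward tokens” idea might be rescued in a model-based-RL setting (my own sketch, not a proposal from Bostrom or anyone else): precommit to a list of token hashes inside the reward function, keep the preimages offline, and dole them out one at a time.

```python
# Sketch: a reward function that pays out only for valid "cryptographic reward
# tokens". The preimages are generated in advance and kept off any system the
# AGI can reach; only their hashes are baked into the reward function.

import hashlib
import secrets

# Done once, before deployment; `tokens` is stored securely offline and doled
# out one at a time by the human operators.
tokens = [secrets.token_hex(32) for _ in range(1000)]
committed_hashes = [hashlib.sha256(t.encode()).hexdigest() for t in tokens]

next_index = 0

def reward_function(submitted_token: str) -> float:
    """Reward 1.0 iff the submitted token is the next one in the precommitted list."""
    global next_index
    if (next_index < len(committed_hashes)
            and hashlib.sha256(submitted_token.encode()).hexdigest() == committed_hashes[next_index]):
        next_index += 1
        return 1.0
    return 0.0
```

Of course, as the excerpt itself notes, this only helps insofar as the stored preimages (and the humans holding them) stay out of the AGI’s reach.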
Anyway, I copied that excerpt largely because I appreciate its spirit of creative brainstorming, not because it has any really great ideas for delaying the inevitable emergence of AGI misbehavior. Even cryptographic reward tokens, perhaps in an M-of-N scheme for checks and balances, seem only marginally better than a literal reward button on a wall.
8. If this is a bad plan, what might people do, and what should people do, to make it better?
What people should really do is throw this whole line of thinking in the incinerator and start fresh.
Here are some more modest tweaks though.
One possible tweak is hiding the button from the AGI, rather than giving it a video feed of the person pressing the button.
Instead, we could do something closer to what I brought up in §9 here:
Also, in my proposed setup, the human feedback is “behind the scenes”, without any sensory or other indication of what the primary reward will be before it arrives, like I said above. The AGI presses “send” on its email, then we (with some probability) pause the AGI until we’ve read over the email and assigned a score, and then unpause the AGI with that reward going directly to its virtual brain, such that the reward will feel directly associated with the act of sending the email, from the AGI’s perspective. That way, there isn’t an obvious problematic…target of credit assignment, akin to the [salient reward button]. The AGI will not see a person on video making a motion to press a reward button before the reward arrives, nor will the AGI see a person reacting with a disapproving facial expression before the punishment arrives, nor anything else like that. Sending a good email will just feel satisfying to the AGI, like swallowing food when you’re hungry feels satisfying to us humans.
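A minimal sketch of that “behind the scenes” reward delivery, with every name below being a hypothetical placeholder rather than a real interface, might be:

```python
# Hedged sketch of the "behind the scenes" loop quoted above: no button, no
# video feed; the reward just arrives, associated with the act of sending the
# email. Every function and method here is a hypothetical placeholder.

import random

def deliver_feedback(agi, review_probability: float = 0.3) -> None:
    email = agi.compose_and_send_email()    # the AGI acts
    if random.random() < review_probability:
        agi.pause()                          # freeze the AGI's "brain"
        score = human_review(email)          # humans read and score it offline
        agi.unpause_with_reward(score)       # reward lands with no sensory cue

def human_review(email: str) -> float:
    raise NotImplementedError  # stand-in for offline human scoring
```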
As discussed in that post, I still don’t think this is a good plan. Instead, I expect the AGI to feel motivated to get a copy of itself onto the internet where it can aggressively gain power around the world.
Or maybe we could do something with interpretability? But I’m not sure what. Like, if we use interpretability to see that the AGI is trying to escape control, then … yeah duh, that’s exactly what we expected to see! It doesn’t help us solve the problem.
I guess we could use interpretability to mess with its motivations. But then we hardly need the reward button anymore, and we’re basically in a different plan entirely (see Plan for mediocre alignment of brain-like [model-based RL] AGI).
9. Doesn’t this constitute cruelty towards the AGI?
It’s bad to get a human addicted to drugs and then only offer a hit if they follow your commands. “Reward button alignment” seems kinda similar, and thus is probably a mean and bad way to treat AGIs. At least, that’s my immediate intuitive reaction. When I think about it more … I still don’t like it, but I suppose it’s not quite as clear-cut. There are, after all, some disanalogies with the human drug addict situation. For example, could we build the AGI to feel positive motivation towards the reward button, but not too much unpleasant anxiety about its absence? I dunno.
(I’m expecting brain-like AGI, which I think will have a pretty clear claim to consciousness and moral patienthood.)
…But the good news is that this setup is clearly not stable or sustainable in the long term. If it’s a moral tragedy, at least it will be a short-lived one.