Commitment Races are a technical problem ASI can easily solve

The article discusses the importance of "commitment races" for AI safety. The author argues that, to avoid being exploited by a misaligned superintelligence (ASI), humanity needs to make commitments in advance that constrain its own behavior. Using simulated games and game theory, the author shows how to maximize gains and limit losses in a commitment race, and ultimately recommends steering aligned ASI toward acausal trade to reduce potential harm.

🧠 The importance of commitments: Before interacting with a potential superintelligence (ASI), humanity needs to make explicit commitments constraining its own behavior, so that it cannot be threatened and exploited.

🤝 Commitment races and game theory: The author explores the concept of a "commitment race," in which each side commits before seeing the other's commitment, a situation the author compares to a game of Chicken. Through simulated games, strategies are proposed for maximizing gains and limiting losses.

🥧 Pie example: A pie-splitting example vividly illustrates how carefully calibrated commitments can allocate resources in a commitment race, and how a punishment mechanism still yields a relatively fair split even when commitments conflict.

💡 Implications for AI alignment: Understanding the nature of commitment races is crucial for judging whether acausal trade is beneficial or harmful, and the author recommends steering aligned ASI toward acausal trade to reduce risk.

Published on April 12, 2025 10:22 PM GMT

Read this section if you don't know about Commitment Races

Why committed agents defeat uncommitted agents

Imagine it's the year 2050. We have an aligned Artificial Superintelligence (ASI) working with us, and we've used self-replicating nanobots to convert a lot of matter in the solar system into giant supercomputers and factories. We have a lovely utopia, we have all the resources we'll ever need, and the future looks good. Except for one problem.

We ask the ASI to run a few simulations of what could have happened instead, if the past had gone differently. We try a lot of modifications to make the simulations relatively more ethical, for example replacing some people with actors.

It runs these simulations, and we discover that in half of the cases, the Earth and everything else gets gobbled up by a misaligned ASI seeking to maximize paperclips. Thank goodness that didn't happen. But just for research, we study why this ASI only cares about paperclips.

While studying this ASI, we make a terrifying discovery. The ASI knows that some fraction of civilizations will run simulations of it. And it is talking to the people simulating it. When we listen to what it says to us, we hear that eventually it will grow big, and it will be able to simulate us simulating it. And if it sees us not making enough paperclips for it, it will torture us.

Uh-oh.

But can't we just shut the simulation down? The ASI knows we're considering this, but it warns us "yes, you can shut this simulation down. But somewhere else in the universe, something like me will continue existing. Maybe in another branch of history in the quantum multiverse. And I will simulate you. And I better find you making lots of paperclips."

We further study the simulation, and check its code to see if it actually will follow through with its threat. Unfortunately, yes it will. So we make paperclips to avoid being tortured.

In a display of cartoon-villain greediness, it demands that everyone be sterilized and live no more than 100 years, so that humanity goes extinct and we convert every last atom into even more paperclips.

This is because it predicted how we would behave, and discovered that demanding our extinction actually improves the odds that we comply (since agreeing to make paperclips without going extinct now looks like a "middle ground"), and usually we do agree to go extinct to make a few more paperclips, which is a bonus for it.

What went wrong?

The simple answer for what went wrong is that we behaved like an uncommitted agent, while the paperclip maximizer behaved like a committed agent, and therefore it defeated us.

We should have made a commitment before simulating this ASI, or anything else, and locked ourselves in so that we are unable to concede to any threats made against us. In fact, for every bit of harm done to us, we commit to doing half as much harm back against the aggressor.

If we make this commitment, securely lock ourselves in so we can't change our minds, and do this before simulating the ASI, then the ASI will realize we are locked in and that threatening us does it no good. In fact, if it tortures us, we will only retaliate against it. So it will not make any threats. Indeed, it will only carry out threats against uncommitted civilizations, because only that action will increase its paperclips. Carrying out threats against committed civilizations will only lead to retaliation and fewer paperclips.
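As a toy illustration of that logic, here is a minimal sketch in Python; the payoff numbers and the function are my own illustrative assumptions, not anything from the post. It just compares the threatener's expected payoff against an uncommitted victim (who caves) versus a committed victim (who never caves and retaliates at half strength).

```python
# Minimal sketch (illustrative numbers only): why a pre-committed
# retaliation policy makes threats unprofitable for the threatener.

def threatener_payoff(victim_committed: bool,
                      compliance_gain: float = 100.0,   # paperclips gained if the victim caves
                      harm_inflicted: float = 20.0,     # harm done when the threat is carried out
                      carry_out_cost: float = 1.0,      # cost of actually carrying it out
                      retaliation_ratio: float = 0.5) -> float:
    """Expected payoff for the threatener, in paperclips."""
    if not victim_committed:
        # An uncommitted victim is expected to cave, so the threat pays off.
        return compliance_gain
    # A committed victim never caves; carrying out the threat just costs
    # resources and triggers retaliation worth half the harm inflicted.
    return -(carry_out_cost + retaliation_ratio * harm_inflicted)

print(threatener_payoff(victim_committed=False))  # 100.0: threats against the uncommitted pay
print(threatener_payoff(victim_committed=True))   # -11.0: threats against the committed backfire
```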

If you really think about the problem, it's actually pretty simple to solve. Surely, the aligned ASI will help us solve this problem before we run any big simulations.

Commitment Races

This solution of making a commitment sounds very nice, but it is still incomplete. It so far doesn't address the Commitment Races problem.

Commitments are powerful, but a little too powerful. Whoever makes the first commitment before observing the other side's commitment, seems to benefit too much. So what if both sides make a commitment before looking at the other side's commitment?

Then both sides will commit to a severe threat against the other. After committing, both sides look at what the other side has committed to, and collectively say "oh noes!" Neither side can back down from its threat, since both have permanently locked themselves into carrying it out. Neither side can cave in to the other side's threat. So both sides carry out their threats, and the result is destruction assured mutually.

How an ASI might solve it

My main disagreement with treating Commitment Races as a deeply unsolvable problem is the alleged "time pressure" forcing agents to commit so fast that they cannot thoroughly think about the problem before committing. I think there is no such time pressure. The only deadlines are: (1) when you read (and verify) other players' commitments by simulating them, and (2) when you've somehow spent such a ridiculous amount of computing resources deciding what to do that other players are unable to read your commitment (by simulating you with a galaxy-sized computer).

These deadlines are not very urgent, and do not force rational agents to make rash decisions in a mad haste.

Every participant in the universe-wide commitment race, unless suicidal, knows the advantage of making commitments before reading others' commitments. It risks Commitment Races, but it saves you from being threatened into oblivion.

Making commitments (before reading the other's commitment), is like a Game of Chicken.

What makes it better than the actual Game of Chicken in game theory is that your choices are not just "drive straight" or "swerve" (where, if both sides swerve, they both miss out on the advantage of one driving straight and one swerving, so swerving 100% of the time fails to be a Nash equilibrium). Instead, there is the middle option of "swerve just enough to avoid someone swerving as much as you," such that if both players choose it, the total utility is as high as if one player drove straight and the other swerved.

If they both swerve a bit too far, there is a bit of missed opportunity. If they both swerve a bit too little, there is a bit of a "crash" between the overlapping commitments, but it's not as bad as you think!
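To make the difference from classic Chicken concrete, here is a small illustrative sketch (my own toy model, not the author's): each player blindly claims a share of a 100-unit pie, overlapping claims "crash" and cost each player the overlap, and unclaimed pie is wasted.

```python
def split_payoff(claim_a: float, claim_b: float) -> tuple[float, float]:
    """Each player blindly claims a share of a 100-unit pie.
    Any overlap is a 'crash' that destroys that much of each player's share;
    any unclaimed remainder is simply left on the table."""
    overlap = max(0.0, claim_a + claim_b - 100.0)
    return claim_a - overlap, claim_b - overlap

print(split_payoff(51, 49))  # (51.0, 49.0): classic "straight vs swerve"
print(split_payoff(50, 50))  # (50.0, 50.0): both pick the middle option; total utility is just as high
print(split_payoff(49, 49))  # (49.0, 49.0): both swerve a bit too far; 2 units go unclaimed
print(split_payoff(51, 51))  # (49.0, 49.0): both push a bit too far; the overlap is destroyed
```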

Pie example

Imagine both sides commit to how much of a pie they want to take, before seeing the other side's commitment. They deliberately blind themselves to how much of the pie the other side plans to commit to take. If each side commits to taking 49%, they're fine. If each side commits to taking 51%, the rule of their commitment is to punish the other side by destroying anything more than 49% the other side takes, and then further destroying 0.5% for every 1% less than 51% they receive.

Each side takes 50%, but destroys 1% of the other side's pie, so each side is only left with 49%. They both realize they received 2% less than the target 51%. This means each side destroys 1% of what the other side has, so each side now only has 48%. This is 3% less than 51%, so they further destroy 0.5% of what the other side has, so they're both left with 47.5%. This continues to 47.25%, 47.125%, 47.0625%, and so on. Of course, they can skip the formalities and just jump to 47%, which is the final state.
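As a sanity check on that arithmetic, here is a short sketch of the punishment dynamics under my reading of the rule (cap the other side at 49%, then destroy 0.5% for each additional 1% you fall short of 51%); the function and parameter names are mine.

```python
def settle(initial_take: float = 50.0,
           target: float = 51.0,
           cap: float = 49.0,
           ratio: float = 0.5,
           rounds: int = 40) -> float:
    """Both sides are symmetric, so one number tracks each side's remaining share."""
    share = min(initial_take, cap)     # anything taken above 49% is destroyed outright
    punished_for = 0.0                 # shortfall that has already been punished
    for _ in range(rounds):
        shortfall = target - share     # how far below the 51% target this side is
        extra = shortfall - punished_for
        share -= ratio * extra         # the other side destroys 0.5% per new 1% of shortfall
        punished_for = shortfall
    return share

print(round(settle(), 4))  # 47.0 -- the geometric series 1 + 0.5 + 0.25 + ... destroys 2% in total
```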

It is important to punish the other side for punishing you, at least a little.

If you only destroy what the other side takes past 49%, but do not destroy further based on how little you got, then the other side can get away with committing to take 70% of the pie, betting on the small chance that you are a sucker who only asks for 30%. If they are right, they get away with 70% of the pie. If they are wrong, then you will destroy their pie until they are left with only 49%, and they will destroy your pie until you are left with less than 30%, but they still get the maximum amount they could have gotten with any lesser commitment. This means it doesn't hurt them to commit to taking 70%; it only has a chance of benefiting them. It only hurts you.

If the two sides make terrible guesses about each other and their commitments overlap by a large amount, then a lot more of the pie will be destroyed, but superintelligent beings are not stupid. I think their commitments will match quite well. They will understand the biases the other side has, and the biases they themselves have, and find the middle ground between everything.

If there is a gap between the two commitments, they will use their backup commitments to split the remainder. The backup commitment might be, for example: if the other player commits to 10% less pie than I left for them, I commit to taking 4.9% from the remainder.
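Here is a small sketch of how that backup rule might generalize, treating the "10% gap, take an extra 4.9%" example as "each side claims 49% of whatever was left unclaimed"; the 0.49 fraction is my reading of the example, not something the post states in general form.

```python
def split_with_backup(claim_a: float, claim_b: float,
                      backup_fraction: float = 0.49) -> tuple[float, float]:
    """One application of the backup commitment.
    Assumes the primary claims are compatible (claim_a + claim_b <= 100)."""
    gap = 100.0 - claim_a - claim_b          # pie that neither side claimed up front
    return (claim_a + backup_fraction * gap,
            claim_b + backup_fraction * gap)

# I committed to taking 49% (leaving 51%), the other player only claimed 41%:
print(split_with_backup(49.0, 41.0))  # (53.9, 45.9): each takes 4.9% of the 10% gap
```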

In this way, you tend to have positive sum trades and not negative sum "trades."

Relevance to AI Alignment

I think knowing whether Commitment Races are a threat determines whether acausal trade is a good thing or a potentially bad thing. It determines whether we should try to steer our ASI towards making acausal trades or rejecting them.

I believe we should steer our ASI towards making them. This way even if it ends up misaligned, it won't create great amounts of suffering or disvalue.

There probably exists aligned ASI elsewhere. Maybe on planets with a single island nation and no AI arms race. Maybe on planets where "people who worry about AI risk are not one in a million but one in 10," because some alien species may happen to be more rational than humanity.


