Commitment Races are a technical problem ASI can easily solve

The article discusses the importance of "commitment races" for AI safety. The author argues that, to avoid being exploited by a misaligned superintelligence (ASI), humanity needs to make commitments in advance that constrain its own behavior. Using simulated games and game theory, the author shows how to maximize gains and limit losses in a commitment race, and ultimately recommends steering aligned ASI toward acausal trade to reduce potential harm.

🧠 The importance of commitments: Before interacting with a potential superintelligence (ASI), humanity needs to make explicit commitments constraining its own behavior, so that it cannot be threatened and exploited.

🤝 Commitment races and game theory: The author explores the concept of a "commitment race," in which each side commits before seeing the other's commitment, a situation the author compares to a game of Chicken. Through simulated games, strategies are proposed for maximizing gains and limiting losses.

🥧 Pie example: A pie-splitting example vividly illustrates how carefully calibrated commitments can allocate resources in a commitment race, and how a punishment mechanism still yields a relatively fair split even when commitments conflict.

💡 Implications for AI alignment: Understanding the nature of commitment races is crucial for judging whether acausal trade is beneficial or harmful, and the author recommends steering aligned ASI toward acausal trade to reduce risk.

Published on April 12, 2025 10:22 PM GMT

Read this section if you don't know about Commitment Races

Why committed agents defeat uncommitted agents

Imagine it's the year 2050. We have an aligned Artificial Superintelligence (ASI) working with us, and we've used self-replicating nanobots to convert a lot of matter in the solar system into giant supercomputers and factories. We have a lovely utopia, we have all the resources we'll ever need, and the future looks good. Except for one problem.

We ask the ASI to run a few simulations of what could have happened instead, if the past had gone differently. We try a lot of modifications to make the simulations relatively more ethical, for example replacing some people with actors.

It runs these simulations, and we discover that in half of the cases, the Earth and everything else gets gobbled up by a misaligned ASI seeking to maximize paperclips. Thank goodness that didn't happen. But just for research, we study why this ASI only cares about paperclips.

While studying this ASI, we make a terrifying discovery. The ASI knows that some fraction of civilizations will run simulations of it. And it is talking to the people simulating it. When we listen to what it says to us, we hear that eventually it will grow big, and it will be able to simulate us simulating it. And if it sees us not making enough paperclips for it, it will torture us.

Uh-oh.

But can't we just shut the simulation down? The ASI knows we're considering this, but it warns us "yes, you can shut this simulation down. But somewhere else in the universe, something like me will continue existing. Maybe in another branch of history in the quantum multiverse. And I will simulate you. And I better find you making lots of paperclips."

We further study the simulation, and check its code to see if it actually will follow through with its threat. Unfortunately, yes it will. So we make paperclips to avoid being tortured.

In a display of cartoon-villain greediness, it demands that everyone be sterilized and live no more than 100 years, so that humanity goes extinct and we convert every last atom into even more paperclips.

This is because it predicted how we would behave, and discovered that demanding our extinction actually improves the odds that we comply (since agreeing to make paperclips without going extinct now looks like a "middle ground"), and usually we do agree to go extinct to make a few more paperclips, which is a bonus for it.

What went wrong?

The simple answer for what went wrong is that we behaved like an uncommitted agent, while the paperclip maximizer behaved like a committed agent, and therefore it defeated us.

We should have made a commitment before simulating this ASI, or anything else, and locked ourselves in so that we are unable to concede to any threats made against us. In fact, for every bit of harm done to us, we commit to doing half as much harm back against the aggressor.

If we make this commitment, securely lock ourselves in so we can't change our minds, and do this before simulating the ASI, then the ASI will realize we are locked in and that threatening us does it no good. In fact, if it tortures us, we will only retaliate against it. So it will not make any threats. Indeed, it will only carry out threats against uncommitted civilizations, because only that action will increase its paperclips. Carrying out threats against committed civilizations will only lead to retaliation and fewer paperclips.
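As a toy illustration of that logic, here is a minimal sketch in Python; the payoff numbers and the function are my own illustrative assumptions, not anything from the post. It just compares the threatener's expected payoff against an uncommitted victim (who caves) versus a committed victim (who never caves and retaliates at half strength).

```python
# Minimal sketch (illustrative numbers only): why a pre-committed
# retaliation policy makes threats unprofitable for the threatener.

def threatener_payoff(victim_committed: bool,
                      compliance_gain: float = 100.0,   # paperclips gained if the victim caves
                      harm_inflicted: float = 20.0,     # harm done when the threat is carried out
                      carry_out_cost: float = 1.0,      # cost of actually carrying it out
                      retaliation_ratio: float = 0.5) -> float:
    """Expected payoff for the threatener, in paperclips."""
    if not victim_committed:
        # An uncommitted victim is expected to cave, so the threat pays off.
        return compliance_gain
    # A committed victim never caves; carrying out the threat just costs
    # resources and triggers retaliation worth half the harm inflicted.
    return -(carry_out_cost + retaliation_ratio * harm_inflicted)

print(threatener_payoff(victim_committed=False))  # 100.0: threats against the uncommitted pay
print(threatener_payoff(victim_committed=True))   # -11.0: threats against the committed backfire
```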

If you really think about the problem, it's actually pretty simple to solve. Surely, the aligned ASI will help us solve this problem before we run any big simulations.

Commitment Races

This solution of making a commitment sounds very nice, but it is still incomplete. It so far doesn't address the Commitment Races problem.

Commitments are powerful, but a little too powerful. Whoever makes the first commitment before observing the other side's commitment, seems to benefit too much. So what if both sides make a commitment before looking at the other side's commitment?

Then both sides will commit to a severe threat against the other. After committing, both sides look at what the other side has committed to, and collectively say "oh noes!" Neither side can back down from its threat, since both have permanently locked themselves into carrying it out. Neither side can cave in to the other side's threat. So both sides carry out their threats, and the result is destruction assured mutually.

How an ASI might solve it

My main disagreement with treating Commitment Races as a deeply unsolvable problem is the alleged "time pressure" forcing agents to commit so fast that they cannot thoroughly think about the problem before committing. I think there is no such time pressure. The only deadlines are: (1) when you read (and verify) other players' commitments by simulating them, and (2) when you've somehow spent such a ridiculous amount of computing resources deciding what to do that other players are unable to read your commitment (by simulating you with a galaxy-sized computer).

These deadlines are not very urgent, and do not force rational agents to make rash decisions in a mad haste.

Every participant in the universe-wide commitment race, unless suicidal, knows the advantage of making commitments before reading others' commitments. It risks Commitment Races, but it saves you from being threatened into oblivion.

Making commitments (before reading the other's commitment), is like a Game of Chicken.

What makes it better than the actual Game of Chicken in game theory is that your choices are not just "drive straight" or "swerve" (where, if both sides swerve, they both miss out on the advantage of one driving straight and one swerving, so swerving 100% of the time fails to be a Nash equilibrium). Instead, there is the middle option of "swerve just enough to avoid someone swerving as much as you," such that if both players choose it, the total utility is as high as if one player drove straight and the other swerved.

If they both swerve a bit too far, there is a bit of missed opportunity. If they both swerve a bit too little, there is a bit of a "crash" between the overlapping commitments, but it's not as bad as you think!
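To make the difference from classic Chicken concrete, here is a small illustrative sketch (my own toy model, not the author's): each player blindly claims a share of a 100-unit pie, overlapping claims "crash" and cost each player the overlap, and unclaimed pie is wasted.

```python
def split_payoff(claim_a: float, claim_b: float) -> tuple[float, float]:
    """Each player blindly claims a share of a 100-unit pie.
    Any overlap is a 'crash' that destroys that much of each player's share;
    any unclaimed remainder is simply left on the table."""
    overlap = max(0.0, claim_a + claim_b - 100.0)
    return claim_a - overlap, claim_b - overlap

print(split_payoff(51, 49))  # (51.0, 49.0): classic "straight vs swerve"
print(split_payoff(50, 50))  # (50.0, 50.0): both pick the middle option; total utility is just as high
print(split_payoff(49, 49))  # (49.0, 49.0): both swerve a bit too far; 2 units go unclaimed
print(split_payoff(51, 51))  # (49.0, 49.0): both push a bit too far; the overlap is destroyed
```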

Pie example

Imagine both sides commit to how much of a pie they want to take, before seeing the other side's commitment. They deliberately blind themselves to how much of the pie the other side plans to commit to take. If each side commits to taking 49%, they're fine. If each side commits to taking 51%, the rule of their commitment is to punish the other side by destroying anything more than 49% the other side takes, and then further destroying 0.5% for every 1% less than 51% they receive.

Each side takes 50%, but destroys 1% of the other side's pie, so each side is only left with 49%. They both realize they received 2% less than the target 51%. This means each side destroys 1% of what the other side has, so each side now only has 48%. This is 3% less than 51%, so they further destroy 0.5% of what the other side has, so they're both left with 47.5%. This continues to 47.25%, 47.125%, 47.0625%, and so on. Of course, they can skip the formalities and just jump to 47%, which is the final state.
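As a sanity check on that arithmetic, here is a short sketch of the punishment dynamics under my reading of the rule (cap the other side at 49%, then destroy 0.5% for each additional 1% you fall short of 51%); the function and parameter names are mine.

```python
def settle(initial_take: float = 50.0,
           target: float = 51.0,
           cap: float = 49.0,
           ratio: float = 0.5,
           rounds: int = 40) -> float:
    """Both sides are symmetric, so one number tracks each side's remaining share."""
    share = min(initial_take, cap)     # anything taken above 49% is destroyed outright
    punished_for = 0.0                 # shortfall that has already been punished
    for _ in range(rounds):
        shortfall = target - share     # how far below the 51% target this side is
        extra = shortfall - punished_for
        share -= ratio * extra         # the other side destroys 0.5% per new 1% of shortfall
        punished_for = shortfall
    return share

print(round(settle(), 4))  # 47.0 -- the geometric series 1 + 0.5 + 0.25 + ... destroys 2% in total
```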

It is important to punish the other side for punishing you, at least a little.

If you only destroy what the other side takes past 49%, but do not destroy further based on how little you got, then the other side can get away with committing to take 70% of the pie, betting on the small chance that you are a sucker who only asks for 30%. If they are right, they get away with 70% of the pie. If they are wrong, then you will destroy their pie until they are left with only 49%, and they will destroy your pie until you are left with less than 30%, but they still get the maximum amount they could have gotten with any lesser commitment. This means it doesn't hurt them to commit to taking 70%; it only has a chance of benefiting them. It only hurts you.

If the two sides make terrible guesses about each other and their commitments overlap by a large amount, then a lot more of the pie will be destroyed, but superintelligent beings are not stupid. I think their commitments will match quite well. They will understand the biases the other side has, and the biases they themselves have, and find the middle ground between everything.

If there is a gap between the two commitments, they will use their backup commitments to split the remainder. The backup commitment might be, for example: if the other player commits to 10% less pie than I left for them, I commit to taking 4.9% from the remainder.
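Here is a small sketch of how that backup rule might generalize, treating the "10% gap, take an extra 4.9%" example as "each side claims 49% of whatever was left unclaimed"; the 0.49 fraction is my reading of the example, not something the post states in general form.

```python
def split_with_backup(claim_a: float, claim_b: float,
                      backup_fraction: float = 0.49) -> tuple[float, float]:
    """One application of the backup commitment.
    Assumes the primary claims are compatible (claim_a + claim_b <= 100)."""
    gap = 100.0 - claim_a - claim_b          # pie that neither side claimed up front
    return (claim_a + backup_fraction * gap,
            claim_b + backup_fraction * gap)

# I committed to taking 49% (leaving 51%), the other player only claimed 41%:
print(split_with_backup(49.0, 41.0))  # (53.9, 45.9): each takes 4.9% of the 10% gap
```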

In this way, you tend to have positive sum trades and not negative sum "trades."

Relevance to AI Alignment

I think knowing whether Commitment Races are a threat determines whether acausal trade is a good thing or a potentially bad thing. It determines whether we should try to steer our ASI towards making acausal trades or rejecting them.

I believe we should steer our ASI towards making them. This way even if it ends up misaligned, it won't create great amounts of suffering or disvalue.

There probably exists aligned ASI elsewhere. Maybe on planets with a single island nation and no AI arms race. Maybe on planets where "people who worry about AI risk are not one in a million but one in 10," because some alien species may happen to be more rational than humanity.


