Pitfalls of Building UDT Agents

This post examines the theoretical core of UDT (Updateless Decision Theory) and the challenges of applying it in the real world. The author argues that UDT is, at heart, "acausal bargaining": paying costs for hypothetical benefits in universes that do not exist. Although UDT has advantages in certain "Newcomb-like problems," the author, writing from the standpoint of a "realist agent designer," criticizes UDT's tendency toward "premature tiling," that is, paying costs to universes that are not ours, as harmful to an agent's capabilities and alignment in the real world. The post argues that ideal agents should learn and update before forming irreversible commitments, and suggests that implementations of UDT should choose their priors carefully or be designed to be corrigible, so as to better handle the complexity of the real world.

💡 The core of UDT is "acausal bargaining": the agent is willing to pay costs in the universe we actually inhabit for potential benefits in hypothetical universes. This behavior has its justifications in settings such as "Newcomb-like problems," but it is fundamentally a trade that crosses causal boundaries.

🌍 Writing from the standpoint of a "realist agent designer," the author regards UDT's "premature tiling," paying costs to universes that do not exist, as an undesirable property. It runs counter to how the real world works and hinders an agent's capability and alignment goals in the actual environment.

🛠️ The post stresses that an ideal agent should be able to learn and update before forming irreversible commitments (before it stops updating). This includes treating its own code with caution and making sufficient progress in understanding the world. Agents that lock in their policies too early are unlikely to come to dominate in the real world.

⚖️ For individual humans, the author recommends implementing UDT cautiously, and in particular avoiding paying costs to a hypothetical mathematical multiverse. Since the real world contains essentially no Newcomb-like problems, EDT (Evidential Decision Theory) may be a more practical approximation. Anyone implementing UDT should do so with respect to a recent and reliable set of beliefs about the world, and should weigh the complications and potential risks it may bring.

Published on July 30, 2025 3:27 AM GMT

I've previously argued that UDT may take the Bayesian coherence arguments too far.

In that post, I mostly focused on computational uncertainty. I don't think that we have a satisfactory theory of computational uncertainty, and that is a problem for the canonical conception of UDT. However, I think my objection still stands in the absence of computational uncertainty (say, in the framework of my unfinished theory of AEDT w.r.t. rOSI). I want to sharpen this objection and state it more concisely, now that I feel a bit less confused about it.

Briefly: I think that we want to build agents that update at least until they're smarter and know more than us.

A Compressed Summary of the Controversy on Updating

As a one line summary, updatelessness is basically acausal bargaining. A UDT agent is willing to pay a tax in this universe for some hypothetical benefit in a universe that does not in fact exist (or at least, is not the one we live in). 

This may seem unintuitive. However, there are many strong justifications for updatelessness, which can usually be described as "Newcomb-like problems." For example, imagine that a perfect predictor (customarily called Omega) flips a coin, promising to pay out 10 dollars on tails, but 1000 dollars on heads if and only if you would not have taken the 10 dollars on tails. Agents that win at this problem do not take the 10 dollars on tails - it's much higher expected value to collect the 1000 dollars on heads. That means that an agent facing this problem would be willing to self-modify to become updateless, if possible.
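For concreteness, here is a minimal sketch of the expected-value arithmetic behind this example. It assumes a fair coin (the setup above leaves the bias implicit) and uses the payoffs stated above; the function and variable names are mine.

```python
# Minimal sketch of the counterfactual-mugging payoff comparison above.
# Assumes a fair coin; payoffs follow the numbers in the example.

P_HEADS = 0.5

def expected_value(refuses_on_tails: bool) -> float:
    """Ex-ante expected payoff of a policy, evaluated before the coin flip."""
    heads_payoff = 1000 if refuses_on_tails else 0  # Omega pays 1000 only to refusers
    tails_payoff = 0 if refuses_on_tails else 10    # refusers forgo the 10 dollars
    return P_HEADS * heads_payoff + (1 - P_HEADS) * tails_payoff

print(expected_value(refuses_on_tails=True))   # 500.0 -- the policy UDT commits to
print(expected_value(refuses_on_tails=False))  # 5.0   -- the policy that takes the 10
```

The ex-ante gap (500 versus 5 in expectation) is what makes binding oneself to refuse attractive before the flip.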

Without going through more examples, I will take as given that sufficiently powerful agents, if given the option, self-modify to act something like UDT - but only for future decisions. That is, ideal agents want to stop updating. But this is important: I don't see any strong reason that ideal agents would ignore the information they already know, or unroll the updates they've already made.

If I first learn that the coin has come up tails, and then learn about Omega's bargain, my best option at that point seems to be to take the 10 dollars. After all, I'm not really capable of absolutely locking myself into any policy. But perhaps I should be - perhaps I should decide to implement UDT? I think this is a rather subtle question, which I will return to. My intuition tends to favor taking the money in some circumstances and not in others. But what if Omega demands 10 dollars from me on tails? What if Omega keeps coming back and demanding another 10, on the same coin flip? 

The central principle of UDT is to honor all of the pre-commitments that it would have wanted to make. This means that UDT does not need to make pre-commitments, or to self-modify. It tiles. That seems like a desirable property.

The pro-UDT tiling argument usually goes that, if we build an agent using some other decision theory, and it wants to modify itself to act like UDT (going forward), then surely that decision theory is bad and we should have just built it to use UDT.

Or, as a question: "If agents want to stop updating as soon as possible, why build them to update at all?"

Okay, that's the end of my hyper-compressed summary of the discourse so far (which does not necessarily imply that the rest of this post is actually original).

A Rejection of Premature Tiling

We want a theory of agency to tell us how to build (or become!) agents that perform well, in the sense of capabilities and alignment, in the real world. This "agent designer" stance has been taken by Laurent Orseau (as "space-time embedded intelligence") and others. It's important to emphasize the part about the real world. The one we are actually living in. This "detail" is often brushed over. I will call this stance the realist agent designer framework - it is what I have previously described as an agent theory.

Now, I'd like to argue that the pro-UDT tiling argument does not make sense from a realist agent designer's perspective. 

The reason is that by engaging in acausal trade starting from (implicitly before) the moment of its implementation, a UDT agent is paying tax to universes that we as the agent designers know are not our universe. This is not desirable - it means that UDT is malign in about the same sense as the Solomonoff prior.

In the standard picture, a UDT agent actually uses something like the Solomonoff prior (=the universal distribution M) or otherwise believes in some kind of mathematical multiverse. That means that a UDT agent potentially pays tax to all of these universes - in practice, there may or may not be an opportunity for such trades, but when they exist, they come at the expense of our universe.
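As a toy illustration (not the UDT formalism itself; the universes, prior weights, and payoffs below are all invented for the example), the following sketch shows how the policy that maximizes prior-weighted utility across hypothetical universes can underperform, in the universe we actually inhabit, the policy an updated agent would choose:

```python
# Toy illustration of the "acausal tax": the policy that is best according to
# a prior over hypothetical universes need not be the policy that is best in
# the actual one. All universes, weights, and payoffs here are invented.

prior = {"actual": 0.2, "hypothetical_A": 0.5, "hypothetical_B": 0.3}

payoff = {  # payoff[universe][policy]
    "actual":         {"refuse": 0,  "take": 10},
    "hypothetical_A": {"refuse": 20, "take": 0},
    "hypothetical_B": {"refuse": 15, "take": 5},
}

def prior_weighted_value(policy: str) -> float:
    """Expected utility of a policy under the prior over universes."""
    return sum(prior[u] * payoff[u][policy] for u in prior)

udt_policy = max(payoff["actual"], key=prior_weighted_value)               # best across the prior
updated_policy = max(payoff["actual"], key=lambda p: payoff["actual"][p])  # best in our universe

print(udt_policy, payoff["actual"][udt_policy])          # refuse 0
print(updated_policy, payoff["actual"][updated_policy])  # take 10
```

The gap between those two payoffs in the actual universe is the tax.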

I think that agent foundations researchers (and generally, many rationalists) have made a big mistake here: they view this as a good thing. They want to find a sort of platonic ideal of agency which is as general as possible, which wins on average across all mathematical universes.

This is not the right goal, for either capabilities or alignment.

We want to study agents that win in this universe. That means that they should do some learning before they form irreversible commitments - before they stop updating. Pragmatically, I think that agent designs without this property probably fail to take off at all. As a sort of trivialization of this principle, an agent with write access to its own code, which is not somehow legibly labeled as a thing it should not touch until it knows very well what it is doing, will usually just give itself brain damage. But I think the principle goes further: agents which are trying to succeed across all universes are not the ones that come to power fastest in our universe. 

I think that unfortunately my own field, algorithmic information theory and specifically the study of universal agents like AIXI, has contributed to this mistake. It encourages thinking about ensembles of environments, like the lower semicomputable chronological semimeasures.[1] But the inventor of AIXI, Marcus Hutter, has not actually made the mistake! Much of his work is concerned with convergence guarantees - convergence to optimal performance in the true environment. That is the antidote. One must focus on the classes of agents which come to perform well in the true environment, specifically, ours. Such agents sometimes fail; one cannot succeed in every environment. We don't care about the ones that suffer (controlled) failure. What's important is that (perhaps after several false starts, in situations that are set up appropriately) they eventually come to dominate. 
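For concreteness, the kind of convergence guarantee meant here is usually stated for a Bayes mixture over a class of environments $\mathcal{M}$ with prior weights $w_\nu$ (a simplified, sequence-prediction form, assuming the true environment $\mu$ is in the class):

$$\xi(x_{1:t}) \;=\; \sum_{\nu \in \mathcal{M}} w_\nu \, \nu(x_{1:t}), \qquad w_\nu > 0.$$

Since $\xi(x_{1:t}) \ge w_\mu \, \mu(x_{1:t})$, the mixture's on-sequence predictions converge to the truth, $\xi(x_t \mid x_{<t}) \to \mu(x_t \mid x_{<t})$ with $\mu$-probability 1, so an agent reasoning with $\xi$ eventually acts on beliefs that match the environment it is actually in rather than the whole ensemble.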

And I think that agents which are too updateless too early do not come to dominate.

But: what if they did? What if a UDT agent were implemented with a good (or lucky) enough prior, and grew strong enough that it could afford to pay the acausal tax and still outpace any rival agents?

This is an alignment failure.

We do not want to pay that acausal tax - not unless the agent's prior is sufficiently close to our own beliefs. We only care about this universe. Insofar as such an agent differs from an updateful decision theory like EDT, it differs to our detriment - its prior never washes out, its beliefs never truly merge with ours, and we pay the price. In a sense, such an agent is not corrigible.

But what if we accepted UDT? Would we then be aligned with a UDT agent we built?

I think probably not. This would only hold if our priors were nearly identical, and I don't think there is a fully specified objective prior on all possible universes.

Also, I don't think this is the right question to ask. We who have not formed binding pre-commitments under a veil of ignorance should be glad of it, and should not pay taxes to imaginary worlds.

Tiling Concerns

Now, if we accept that we want our agents to continue updating (at least until they know what we know) - how do we achieve this?

I suppose there are two routes.

The first is that we do not give them the option to self-modify. I actually think this can be reasonable. We only need to win this battle until the agents reach roughly our level of intelligence, and we probably don't want even an aligned agent messing with its source code until then. This solution probably seems ugly to some, because it involves building an agent that does not tile. However, (perhaps benefiting from the perspective of AEDT w.r.t. rOSI) I don't see this as a terrible problem. I think that not being able to fully trust that you control the actions of your future selves is actually a core embeddedness problem - which appears also in e.g. action corruption. Why assume it away by only studying agents that tile? Also, as I've argued above, the agents that rise to power probably aren't the ones that lock in their policies too early. So, I think it is reasonable to study the pre-tiling phase of agent development.

The second route is to somehow design the agent so that it does not initially want to self-modify. This branches into various approaches. For instance, we could design a UDT agent with a very carefully chosen prior that is cautious of self-modification. And/or perhaps we can build a corrigible agent, which only trusts its designers to modify its code. This may be easier in practice than in theory - because finding self-modifications that seem good may be computationally hard - and in this respect, it's somewhat connected to the first route, in that an agent is less likely to desire self-modification if promising self-modifications seem more difficult to find. 

User Manual

Now the question I've been putting off is - should a human try to implement UDT? I'm still not completely sure about this, mostly because I think there are a multitude of considerations at play in practice.

As I've made clear, I don't think it's wise to implement an aggressive form of UDT that pays rent to some kind of hypothetical mathematical multiverse. We dodged that bullet by being incapable of self-modification before ~adulthood, and we should be glad of it - in the real world, there are essentially no Newcomb-like problems, and I don't think we humans have paid any real cost for failing to implement UDT up to this point, except perhaps those of us who are bad at lying about what we would do under counterfactual circumstances.

 Really, it makes little sense for any organism developing within this physical universe to, at any early phase of its lifecycle, conceptualize a mathematical multiverse. By the time we can even consider such ensembles, we already know a lot of basic information about how our world works - the "ensembles" we seriously consider are usually much less exotic (in fact, this is probably why the mathematical multiverse seems exotic). We learn about UDT late in the game.

So, if you're thinking of implementing UDT, I recommend implementing it with respect to some reasonably recent set of beliefs about the world - if you haven't decided already, perhaps everything you know at this moment.

However, I think there are a lot of thorny issues here for us mere mortals. Most salient is that we aren't really capable of forming a definitive commitment to UDT; we have to take seriously the possibility that we might be tempted to defect in the future! Also, we can't ignore the complicating issue of computational uncertainty - which makes implementing UDT both more philosophically challenging and more expensive for us. I don't believe that our world is particularly Newcomb-like, so EDT seems like an excellent approximation in practice, even if we were willing and able to implement UDT.

But we should ideally seek to implement something resembling such a conservative form of UDT.

  1. ^ I'd like to do a dialogue with @Wei Dai on this point.



