Existing UDTs test the limits of Bayesianism (and consistency)

This post examines the limits of consistency as an optimality standard for rational agents, using updateless decision theory (UDT) as a case study. Starting from UDT's Bayesian roots, it analyzes the theory's foundations and questions whether its heavy reliance on priors can lead to unreasonable behavior. The author argues that pushing consistency too far can leave an agent unable to learn from its mistakes, producing suboptimal decisions. The post also discusses how UDT relates to other decision theories and suggests possible directions for improvement, aiming to offer sounder guidance for the design of rational agents.

🧠 The core idea of UDT is that an agent should honor all the pre-commitments it would have made at some earlier point in time, even before it existed. This "retroactive pre-commitment" may be made from a position of total ignorance about the universe, which amplifies any problems with the prior and can lead the agent to adopt unreasonable strategies.

🌌 Proponents of UDT may appeal to Tegmark's multiverse, arguing that such agents could perform better across the mathematical universe. The author counters that there is no direct reason to believe the mathematical multiverse really exists, and that even if it does, agents should prioritize the actual universe.

💻 UDT is, at bottom, motivated by computational boundedness: an agent cannot consider every possible pre-commitment at the moment of its birth, so it must make commitments retroactively. UDT also extends to logical beliefs, but if the agent's understanding of mathematics is wrong at some point, it may follow a foolish strategy forever.

⚖️ EDT (evidential decision theory) can serve as an alternative to UDT. Applied to the policy selection problem in particular, EDT uses everything the agent currently knows to form pre-commitments. This approach handles Newcomb's problem and, in practice, should converge to more reasonable behavior.

Published on March 12, 2025 4:09 AM GMT

Epistemic status: Using UDT as a case study for the tools developed in my meta-theory of rationality sequence so far, which means all previous posts are prerequisites. This post is the result of conversations with many people at the CMU agent foundations conference, including particularly Daniel A. Herrmann, Aydin Mohseni, Scott Garrabrant, and Abram Demski. I am a bit of an outsider to the development of UDT and logical induction, though I've worked on pretty closely related things.

I'd like to discuss the limits of consistency as an optimality standard for rational agents. A lot of fascinating discourse and useful techniques have been built around it, but I think that it can be in tension with learning at the extremes. Updateless decision theory (UDT) is one of those extremes; but in order to think about it properly, we need to start with its Bayesian roots. Because, appropriately enough for a sequence on the meta-theory of rationality, I want to psychoanalyze the invention/inventors of UDT. Hopefully, we'll then be in a position to ask what we think we know and how we think we know it in regards to updatelessness (also sometimes called priorism), the driving idea behind UDT.

Subjective Bayesianism is about consistency[1] among beliefs. The Cox axioms force real-valued credences to act like probabilities under some natural conditions that ultimately boil down to consistency; one way to intuitively compress the assumptions is that beliefs about related things have to continuously "pull on each other," so I think of the Cox axioms as requiring credence to propagate properly through an ontology. Dutch book arguments further require that betting behavior be consistent with probabilistic structure, on pain of being "money-pumped" - accepting a series of bets that is sure to lose money (a kind of dominance principle). That handles the "statics." Bayesian updating is of course a theorem of probability theory, forced by Kolmogorov's axioms, so in that sense it is a consequence of the preceding arguments. But insofar as we want it to describe belief dynamics, updating enforces a kind of consistency (with respect to old beliefs and new information) across time. Similar arguments motivate maximizing some expected utility with respect to these credences, i.e. subjective (prior/posterior) probabilities - but I won't actually be very concerned with utilities here.
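
For reference, here is the rule in question in standard notation (textbook material, not anything specific to this post): the first identity, Bayes' theorem, is a theorem of the Kolmogorov axioms, while adopting the second as a norm for how beliefs change upon learning E is the additional "dynamic" consistency requirement.

$$
P(H \mid E) = \frac{P(E \mid H)\,P(H)}{P(E)}, \qquad P_{\text{new}}(H) := P_{\text{old}}(H \mid E).
$$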

This all seems good - if we expend enough cognitive resources on understanding a problem or situation, we should hope that our beliefs eventually stabilize into something consistent. Otherwise, it does feel like we are open to arbitrage and something is going obviously wrong. Unfortunately, Bayesian probability theory doesn't exactly tell us how to remedy the situation; in that way it fails Demski's criterion that a theory of rationality is meant to provide advice about how to be more rational. Occasionally, though, we might have a decent source of "objective" priors, derived from our knowledge of the situation, from maximum entropy, or just from the catch-all universal distribution. In cases like this[2], I think there is a decent argument that this describes normative reasoning. It is an optimality standard, and a pretty powerful one, because it not only constrains an agent's actions but even their beliefs. Arguably, in this capacity it acts a lot like a convergent algorithm. I think it is one, and it will be discovered and "consciously" applied in many cases by many AGI designs, because it should often be tractable to do so. However, note that though the idea of a Bayesian core engine of cognition has many proponents, it does not follow from any of this argumentation. Still, I think Bayesian probability is quite central to understanding cognition, on pain of inconsistency.
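
To make the "catch-all" option concrete, the universal distribution is standardly written (this is the usual Solomonoff form, with U a universal prefix machine and ℓ(p) the length of program p; the notation is not from the post itself) as

$$
\mathbf{M}(x) = \sum_{p \,:\, U(p) = x*} 2^{-\ell(p)},
$$

a mixture over all programs whose output begins with x, weighted toward shorter programs. Maximum-entropy priors play the analogous role when all we know is a handful of expectation constraints.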

But if we push hard enough on this desire for consistency, it starts to break down as a reasonable optimality standard. Updateless decision theory, at least in its current form, provides a sort of counterexample to the supremacy of consistency by using it to justify absurdly unagentic behavior.

The problem ultimately comes from the priors. Unless they capture something about reality, priors are just mistaken beliefs. An agent which acts according to sufficiently mistaken beliefs may never learn it was wrong (failing at self-optimization) and will then remain stupid forever. Fortunately, I tend to think that reasonable priors will eventually reach agreement in practice. 

Updatelessness throws that argument out the window. In its strongest form, an updateless agent should obey all pre-commitments it would have made, at some previous time, if it had the chance (as Abram Demski emphasizes, the goal is not to pre-commit, but rather to make pre-commitment unnecessary). How far back in time should we "retro-actively pre-commit," according to UDT? It's not really clear to me, which is apparently because it's not really agreed among updateless decision theorists (I talked to many of them for a week). I think the general strong and perhaps original view is as early as possible; even before the agent was created, in case other agents may have reasoned about its code when deciding whether to create it. This would mean choosing pre-commitments from a time when you did not even exist, meaning you knew nothing whatsoever about the universe, except perhaps whatever can be determined by pure reason. This is starting to sound more like classical rationalism than modern rationality! It seems likely to massively amplify any problems with the agent's prior - and really, it's not clear what class of priors (short of near-perfect knowledge about our universe) this is really safe for.

At this point, someone sufficiently MIRI-brained might start to think about (something equivalent to) Tegmark's level 4 mathematical multiverse, where such agents might theoretically outperform others. Personally, I see no direct reason to believe in the mathematical multiverse as a real object, and I think this might be a case of the mind projection fallacy - computational multiverses are something that agents reason about in order to succeed in the real universe[3]. Even if a mathematical multiverse does exist (I can't rule it out) and we can somehow learn about its structure[4], I am not sure that any effective, tractable agents can reason about or form preferences over it - and if they do, they should be locally out-competed by agents that only care about our universe, which means those are probably the ones we should worry about. My cruxiest objection is the first, but I think all of them are fairly valid.

From this view, it's not clear that reasoning about being the best agent behind a veil of total ignorance about the universe is even a sensible idea. Humans seem to have arrived at agent theory only because we were motivated by considering all the agents in the actual world around us, and invented the abstractions we use for agent theory because they don't seem empirically to be very leaky. Are those observations of a lower status than the true, multiversal theory of agency, and where exactly would such a thing come from or live?

We can instead do something like form retroactive commitments starting from, say, the time the agent came into existence, or shortly thereafter when it knows at least the basic facts about our universe. This still makes sense, but now, why not just pre-commit then? The answer is that UDT is (secretly?) about computational boundedness! An agent presumably can't think through every possible pre-commitment instantly at birth. That's another reason to make them retro-actively, once we've had time to realize they are valuable.

At this point, UDT (as introduced by Wei Dai) takes a further leap in the "priorist" direction: if we're going to make pre-commitments according to our previous self's beliefs about the world, why not also their logical beliefs? After all, we are considering computationally bounded Bayesians; it's natural to put credences on logical statements as well as empirical facts. Insofar as the two are entangled, I can see the elegance[5] of the idea, but it massively amplifies my objection to updatelessness: now an agent may follow a stupid strategy forever, simply because it was at one point wrong about math.

I think it's possible not to notice the danger of serious error here if you're thinking in terms of policy theory, where everything seems a little more abstract, but "dropping down" to agent theory makes it look a lot less sensible. I just would not build a robot that way. And I would not really act that way myself.

There may be a solution within UDT - perhaps some kind of prior that is carefully constructed to make nearly all pre-commitments look bad until you're a smart agent. If so, that sounds fascinating, and I'd love to discover or learn about it! Lots of smart people have ideas for other elaborations (or perhaps complete refactors and hopefully simplifications) that might solve the problem; for instance, I believe Scott Garrabrant views it as closely analogous to alignment (in the ordinary AI safety sense) between an agent's past and current selves.

But there might also be a merely conventional solution outside of UDT: evidential decision theory (EDT). Specifically, EDT applied to the policy selection problem, as academic decision theorists seem to put it. This is a policy theory that forms pre-commitments taking into account everything the agent currently knows, and policy selection seems to be the relevant problem faced by (some) AGI with a Bayesian core engine. This would normally be called Son of EDT in LessWrong lingo; it is also roughly equivalent to sequential policy evidential decision theory (SPEDT). For brevity, perhaps WDT, because E "turns into" W? ;)

How would this work? What, if anything, would it converge to?

Well, it should obviously succeed at Newcomb-like problems insofar as it anticipated facing them, which is arguably the reasonable thing to ask. In practice, I don't see any way in which it should act much less reasonably than UDT, except perhaps "around boundary conditions" at its creation.
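
As a minimal sketch of what that looks like at the policy level (the payoffs and predictor accuracy below are the usual illustrative numbers, not taken from this post), evaluating whole policies by their conditional expected payoff selects one-boxing whenever the predictor is accurate enough:

```python
# Minimal sketch: evidential policy selection on Newcomb's problem.
# Payoffs and predictor accuracy are illustrative assumptions.

OPAQUE_PRIZE = 1_000_000   # placed in the opaque box iff one-boxing is predicted
CLEAR_PRIZE = 1_000        # always present in the transparent box
ACCURACY = 0.99            # assumed probability the predictor forecasts the policy correctly


def expected_payoff(policy: str, accuracy: float = ACCURACY) -> float:
    """Expected payoff of committing to `policy`, conditioning on the evidence
    that the commitment provides about the predictor's forecast."""
    if policy == "one-box":
        # With probability `accuracy` the predictor foresaw one-boxing and filled the opaque box.
        return accuracy * OPAQUE_PRIZE + (1 - accuracy) * 0
    if policy == "two-box":
        # With probability `accuracy` the predictor foresaw two-boxing and left the opaque box empty.
        return accuracy * CLEAR_PRIZE + (1 - accuracy) * (OPAQUE_PRIZE + CLEAR_PRIZE)
    raise ValueError(f"unknown policy: {policy}")


if __name__ == "__main__":
    policies = ("one-box", "two-box")
    for p in policies:
        print(f"{p}: {expected_payoff(p):,.0f}")   # one-box: 990,000  two-box: 11,000
    print("policy selected:", max(policies, key=expected_payoff))
```

The point is just that conditioning on the whole policy (rather than on the action at the moment of choice) already carries the evidential link to the prediction, which is what the policy-selection framing buys.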

Unfortunately, Son of EDT seems likely to inherit many of the problems of UDT if it is allowed unrestricted ability to self-modify. That is because it might start self-modifying at the moment of its creation, at which point it still knows essentially nothing (unless, again, an appropriately conservative prior can be constructed). The dynamics might be a little better, particularly regarding logical uncertainty (even if we continue to treat logical credences in a Bayesian way). This is because the agent can at least take advantage of the logical facts it currently knows as it performs each self-modification, and perhaps it needs to do a lot of math before arriving at the conclusion that it ought to self-modify (depending on the detailed implementation). This switches real time to logical time in a way that I suspect is actually useful in practice.

The whole scheme does feel highly heuristic and ramshackle, but perhaps it's not as bad as it seems. First of all, it's clearly unsafe to hand a newborn agent a screwdriver to modify itself with unless you can safely unmodify and restart it, and this doesn't really seem to be EDT's fault (it's just an unforgiving environment for any decision theory). By the time the agent "grows up," perhaps it only makes sensible modifications. Certainly Bayesian decision theory has proven itself quite robust to criticism, once it's applied very carefully, with all considerations taken into account.[6]

In fact, I think it's quite likely that we are going through this exact sort of decision process in this very discussion, using everything we know about agency in our universe to reason about the policy that would make the best agent (we control the former, but consider the consequences for the latter). If we are reasoning locally at the action level, then this forms a descending chain of abstraction, where action theory looks at policy theory looking at agent theory. So, if we are operating in a Bayesian way, it seems questionable whether we can arrive at any theory of agency better than Son of EDT!

The problem with Son of EDT is that it's not in itself a clean decision theory. EDT does not tile, so it perhaps picks a sequence of increasingly arcane self-modifications and ends up with some sort of incomprehensible policy. But I suspect it isn't actually incomprehensible; it just may not be a grand unified theory of rationality (GUTR). We can still attempt to analyze its behavior on the problems we care about, in particular alignment. Indeed, there may be no useful GUTR, in which case the best we can do is analyze particular important or recurring (sub)problems of cognition and agency. I wouldn't go this far, but I also wouldn't be surprised if the unifiable part of the theory looks a lot like EDT, and the rest like Son of EDT.

  1. ^

    Frequently "coherence," which feels stronger because to be incoherent sounds quite negative.

  2. ^

    Richard Ngo would probably say that this does not apply to any interesting situations.

  3. ^

    Here I notably depart from Infra-Bayesian Physicalism (as I understand it).  

  4. ^

    This is related to the robustness of definitions for the mathematical multiverse.

  5. ^

    Or perhaps just... consistency?

  6. ^

    Thanks to Aydin Mohseni for suggesting this outside view.


