Existing UDTs test the limits of Bayesianism (and consistency)

This post examines the limits of consistency as an optimality standard for rational agents, using updateless decision theory (UDT) as a case study. Starting from UDT's Bayesian roots, it analyzes the theory's foundations and questions whether its heavy reliance on priors can lead to unreasonable behavior. The author argues that pushing consistency too far can leave an agent unable to learn from its mistakes, producing suboptimal decisions. The post also discusses how UDT relates to other decision theories and suggests possible directions for improvement, aiming to offer sounder guidance for the design of rational agents.

🧠 The core idea of UDT is that an agent should honor all the pre-commitments it would have made at some earlier point in time, even before it existed. This "retroactive pre-commitment" may be made from a position of total ignorance about the universe, which amplifies any problems with the prior and can lead the agent to adopt unreasonable strategies.

🌌 Proponents of UDT may appeal to Tegmark's multiverse, arguing that such agents could perform better across the mathematical universe. The author counters that there is no direct reason to believe the mathematical multiverse really exists, and that even if it does, agents should prioritize the actual universe.

💻 UDT is, at bottom, motivated by computational boundedness: an agent cannot consider every possible pre-commitment at the moment of its birth, so it must make commitments retroactively. UDT also extends to logical beliefs, but if the agent's understanding of mathematics is wrong at some point, it may follow a foolish strategy forever.

⚖️ EDT (evidential decision theory) can serve as an alternative to UDT. Applied to the policy selection problem in particular, EDT uses everything the agent currently knows to form pre-commitments. This approach handles Newcomb's problem and, in practice, should converge to more reasonable behavior.

Published on March 12, 2025 4:09 AM GMT

Epistemic status: Using UDT as a case study for the tools developed in my meta-theory of rationality sequence so far, which means all previous posts are prerequisites. This post is the result of conversations with many people at the CMU agent foundations conference, including particularly Daniel A. Herrmann, Aydin Mohseni, Scott Garrabrant, and Abram Demski. I am a bit of an outsider to the development of UDT and logical induction, though I've worked on pretty closely related things.

I'd like to discuss the limits of consistency as an optimality standard for rational agents. A lot of fascinating discourse and useful techniques have been built around it, but I think that it can be in tension with learning at the extremes. Updateless decision theory (UDT) is one of those extremes; but in order to think about it properly, we need to start with its Bayesian roots. Because, appropriately enough for a sequence on the meta-theory of rationality, I want to psychoanalyze the invention/inventors of UDT. Hopefully, we'll then be in a position to ask what we think we know and how we think we know it in regards to updatelessness (also sometimes called priorism), the driving idea behind UDT.

Subjective Bayesianism is about consistency[1] among beliefs. The Cox axioms force real-valued credences to act like probabilities under some natural conditions that ultimately boil down to consistency; one way to intuitively compress the assumptions is that beliefs about related things have to continuously "pull on each other," so I think of the Cox axioms as requiring credence to propagate properly through an ontology. Dutch book arguments further require that betting behavior be consistent with probabilistic structure, on pain of being "money-pumped" - accepting a series of bets that is sure to lose money (a kind of dominance principle). That handles the "statics." Bayesian updating is of course a theorem of probability theory, forced by Kolmogorov's axioms, so in that sense it is a consequence of the preceding arguments. But insofar as we want it to describe belief dynamics, updating enforces a kind of consistency (with respect to old beliefs and new information) across time. Similar arguments motivate maximizing some expected utility with respect to these credences, i.e. subjective (prior/posterior) probabilities - but I won't actually be very concerned with utilities here.
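
For reference, here is the rule in question in standard notation (textbook material, not anything specific to this post): the first identity, Bayes' theorem, is a theorem of the Kolmogorov axioms, while adopting the second as a norm for how beliefs change upon learning E is the additional "dynamic" consistency requirement.

$$
P(H \mid E) = \frac{P(E \mid H)\,P(H)}{P(E)}, \qquad P_{\text{new}}(H) := P_{\text{old}}(H \mid E).
$$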

This all seems good - if we expend enough cognitive resources on understanding a problem or situation, we should hope that our beliefs eventually stabilize into something consistent. Otherwise, it does feel like we are open to arbitrage and something is going obviously wrong. Unfortunately, Bayesian probability theory doesn't exactly tell us how to remedy the situation; in that way it fails Demski's criterion that a theory of rationality is meant to provide advice about how to be more rational. Occasionally, though, we might have a decent source of "objective" priors, derived from our knowledge of the situation, from maximum entropy, or just from the catch-all universal distribution. In cases like this[2], I think there is a decent argument that this describes normative reasoning. It is an optimality standard, and a pretty powerful one, because it not only constrains an agent's actions but even their beliefs. Arguably, in this capacity it acts a lot like a convergent algorithm. I think it is one, and it will be discovered and "consciously" applied in many cases by many AGI designs, because it should often be tractable to do so. However, note that though the idea of a Bayesian core engine of cognition has many proponents, it does not follow from any of this argumentation. Still, I think Bayesian probability is quite central to understanding cognition, on pain of inconsistency.
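
To make the "catch-all" option concrete, the universal distribution is standardly written (this is the usual Solomonoff form, with U a universal prefix machine and ℓ(p) the length of program p; the notation is not from the post itself) as

$$
\mathbf{M}(x) = \sum_{p \,:\, U(p) = x*} 2^{-\ell(p)},
$$

a mixture over all programs whose output begins with x, weighted toward shorter programs. Maximum-entropy priors play the analogous role when all we know is a handful of expectation constraints.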

But if we push hard enough on this desire for consistency, it starts to break down as a reasonable optimality standard. Updateless decision theory, at least in its current form, provides a sort of counterexample to the supremacy of consistency by using it to justify absurdly unagentic behavior.

The problem ultimately comes from the priors. Unless they capture something about reality, priors are just mistaken beliefs. An agent which acts according to sufficiently mistaken beliefs may never learn it was wrong (failing at self-optimization) and will then remain stupid forever. Fortunately, I tend to think that reasonable priors will eventually reach agreement in practice. 

Updatelessness throws that argument out the window. In its strongest form, an updateless agent should obey all pre-commitments it would have made, at some previous time, if it had the chance (as Abram Demski emphasizes, the goal is not to pre-commit, but rather to make pre-commitment unnecessary). How far back in time should we "retro-actively pre-commit," according to UDT? It's not really clear to me, which is apparently because it's not really agreed among updateless decision theorists (I talked to many of them for a week). I think the general strong and perhaps original view is as early as possible; even before the agent was created, in case other agents may have reasoned about its code when deciding whether to create it. This would mean choosing pre-commitments from a time when you did not even exist, meaning you knew nothing whatsoever about the universe, except perhaps whatever can be determined by pure reason. This is starting to sound more like classical rationalism than modern rationality! It seems likely to massively amplify any problems with the agent's prior - and really, it's not clear what class of priors (short of near-perfect knowledge about our universe) this is really safe for.

At this point, someone sufficiently MIRI-brained might start to think about (something equivalent to) Tegmark's level 4 mathematical multiverse, where such agents might theoretically outperform others. Personally, I see no direct reason to believe in the mathematical multiverse as a real object, and I think this might be a case of the mind projection fallacy - computational multiverses are something that agents reason about in order to succeed in the real universe[3]. Even if a mathematical multiverse does exist (I can't rule it out) and we can somehow learn about its structure[4], I am not sure that any effective, tractable agents can reason about or form preferences over it - and if they do, they should be locally out-competed by agents that only care about our universe, which means those are probably the ones we should worry about. My cruxiest objection is the first, but I think all of them are fairly valid.

From this view, it's not clear that reasoning about being the best agent behind a veil of total ignorance about the universe is even a sensible idea. Humans seem to have arrived at agent theory only because we were motivated by considering all the agents in the actual world around us, and invented the abstractions we use for agent theory because they don't seem empirically to be very leaky. Are those observations of a lower status than the true, multiversal theory of agency, and where exactly would such a thing come from or live?

We can instead do something like form retroactive commitments starting from, say, the time the agent came into existence, or shortly thereafter when it knows at least the basic facts about our universe. This still makes sense, but now, why not just pre-commit then? The answer is that UDT is (secretly?) about computational boundedness! An agent presumably can't think through every possible pre-commitment instantly at birth. That's another reason to make them retro-actively, once we've had time to realize they are valuable.

At this point, UDT (as introduced by Wei Dai) takes a further leap in the "priorist" direction: if we're going to make pre-commitments according to our previous self's beliefs about the world, why not also their logical beliefs? After all, we are considering computationally bounded Bayesians; it's natural to put credences on logical statements as well as empirical facts. Insofar as the two are entangled, I can see the elegance[5] of the idea, but it massively amplifies my objection to updatelessness: now an agent may follow a stupid strategy forever, simply because it was at one point wrong about math.

I think it's possible not to notice the danger of serious error here if you're thinking in terms of policy theory, where everything seems a little more abstract, but "dropping down" to agent theory makes it look a lot less sensible. I just would not build a robot that way. And I would not really act that way myself.

There may be a solution within UDT - perhaps some kind of prior that is carefully constructed to make nearly all pre-commitments look bad until you're a smart agent. If so, that sounds fascinating, and I'd love to discover or learn about it! Lots of smart people have ideas for other elaborations (or perhaps complete refactors and hopefully simplifications) that might solve the problem; for instance, I believe Scott Garrabrant views it as closely analogous to alignment (in the ordinary AI safety sense) between an agent's past and current selves.

But there might also be a merely conventional solution outside of UDT: evidential decision theory (EDT). Specifically, EDT applied to the policy selection problem, as academic decision theorists seem to put it. This is a policy theory that forms pre-commitments taking into account everything the agent currently knows, and policy selection seems to be the relevant problem faced by (some) AGI with a Bayesian core engine. This would normally be called Son of EDT in LessWrong lingo; it is also roughly equivalent to sequential policy evidential decision theory (SPEDT). For brevity, perhaps WDT, because E "turns into" W? ;)

How would this work? What, if anything, would it converge to?

Well, it should obviously succeed at Newcomb-like problems insofar as it anticipated facing them, which is arguably the reasonable thing to ask. In practice, I don't see any way in which it should act much less reasonably than UDT, except perhaps "around boundary conditions" at its creation.
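
As a minimal sketch of what that looks like at the policy level (the payoffs and predictor accuracy below are the usual illustrative numbers, not taken from this post), evaluating whole policies by their conditional expected payoff selects one-boxing whenever the predictor is accurate enough:

```python
# Minimal sketch: evidential policy selection on Newcomb's problem.
# Payoffs and predictor accuracy are illustrative assumptions.

OPAQUE_PRIZE = 1_000_000   # placed in the opaque box iff one-boxing is predicted
CLEAR_PRIZE = 1_000        # always present in the transparent box
ACCURACY = 0.99            # assumed probability the predictor forecasts the policy correctly


def expected_payoff(policy: str, accuracy: float = ACCURACY) -> float:
    """Expected payoff of committing to `policy`, conditioning on the evidence
    that the commitment provides about the predictor's forecast."""
    if policy == "one-box":
        # With probability `accuracy` the predictor foresaw one-boxing and filled the opaque box.
        return accuracy * OPAQUE_PRIZE + (1 - accuracy) * 0
    if policy == "two-box":
        # With probability `accuracy` the predictor foresaw two-boxing and left the opaque box empty.
        return accuracy * CLEAR_PRIZE + (1 - accuracy) * (OPAQUE_PRIZE + CLEAR_PRIZE)
    raise ValueError(f"unknown policy: {policy}")


if __name__ == "__main__":
    policies = ("one-box", "two-box")
    for p in policies:
        print(f"{p}: {expected_payoff(p):,.0f}")   # one-box: 990,000  two-box: 11,000
    print("policy selected:", max(policies, key=expected_payoff))
```

The point is just that conditioning on the whole policy (rather than on the action at the moment of choice) already carries the evidential link to the prediction, which is what the policy-selection framing buys.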

Unfortunately, Son of EDT seems likely to inherit many of the problems of UDT if it is allowed unrestricted ability to self-modify. That is because it might start self-modifying at the moment of its creation, at which point it still knows essentially nothing (unless, again, an appropriately conservative prior can be constructed). The dynamics might be a little better, particularly regarding logical uncertainty (even if we continue to treat logical credences in a Bayesian way). This is because the agent can at least take advantage of the logical facts it currently knows as it performs each self-modification, and perhaps it needs to do a lot of math before arriving at the conclusion that it ought to self-modify (depending on the detailed implementation). This switches real time to logical time in a way that I suspect is actually useful in practice.

The whole scheme does feel highly heuristic and ramshackle, but perhaps it's not as bad as it seems. First of all, it's clearly unsafe to hand a newborn agent a screwdriver to modify itself with unless you can safely unmodify and restart it, and this doesn't really seem to be EDT's fault (it's just an unforgiving environment for any decision theory). By the time the agent "grows up," perhaps it only makes sensible modifications. Certainly Bayesian decision theory has proven itself quite robust to criticism, once it's applied very carefully, with all considerations taken into account.[6]

In fact, I think it's quite likely that we are going through this exact sort of decision process in this very discussion, using everything we know about agency in our universe to reason about the policy that would make the best agent (we control the former, but consider the consequences for the latter). If we are reasoning locally at the action level, then this forms a descending chain of abstraction, where action theory looks at policy theory looking at agent theory. So, if we are operating in a Bayesian way, it seems questionable whether we can arrive at any theory of agency better than Son of EDT!

The problem with Son of EDT is that it's not in itself a clean decision theory. EDT does not tile, so it perhaps picks a sequence of increasingly arcane self-modifications and ends up with some sort of incomprehensible policy. But I suspect it isn't actually incomprehensible; it just may not be a grand unified theory of rationality (GUTR). We can still attempt to analyze its behavior on the problems we care about, in particular alignment. Indeed, there may be no useful GUTR, in which case the best we can do is analyze particular important or recurring (sub)problems of cognition and agency. I wouldn't go this far, but I also wouldn't be surprised if the unifiable part of the theory looks a lot like EDT, and the rest like Son of EDT.

  1. ^

    Frequently "coherence," which feels stronger because to be incoherent sounds quite negative.

  2. ^

    Richard Ngo would probably say that this does not apply to any interesting situations.

  3. ^

    Here I notably depart from Infra-Bayesian Physicalism (as I understand it).  

  4. ^

    This is related to the robustness of definitions for the mathematical multiverse.

  5. ^

    Or perhaps just... consistency?

  6. ^

    Thanks to Aydin Mohseni for suggesting this outside view.


