Simulation-aware causal decision theory: A case for one-boxing in CDT

Published on August 9, 2024 6:09 PM GMT

Disclaimer: I am a math student new to LW and the rationalist community. I could not find this idea anywhere else after a modest amount of research; I apologize if this has been brought up before.

Summary: CDT agents may behave differently than expected in situations involving reliable predictors due to the possibility that they are themselves simulations being run by the predictor. This may lead to decisions such as one-boxing in Newcomb's problem and acquiescing in counterfactual mugging.

The traditional conclusion of CDT in Newcomb's problem is that the decision of which boxes to take cannot causally influence the predictor's prediction. In the absence of any evidence about what that prediction is, a CDT agent should therefore take both boxes, since two-boxing leaves it better off than taking only the opaque box regardless of the prediction. The challenge I pose to this is that, in such a situation, a rational agent ought to have significant credence that it is itself a simulation being run by the predictor, and that in the case that the agent is a simulation, its choice really does have a causal influence on the "real" prediction.

Fundamental to this argument is the claim that a reliable predictor is almost certainly a reliable simulator. To see why this is plausible, consider that if a predictor were unable to predict some particular internal process of the agent, that process could be exploited to generate decisions that thwart the predictor. For a rough example, think of a human-like agent deciding "at random" whether to one-box or two-box. Human pseudorandom bit generation is far from a coin flip, but it depends on enough internal state that a predictor unable to simulate most, if not all, of the brain would likely not perform well enough to warrant being called reliable.

We must also make the assumption that an agent in this situation would, if it knew it was a simulation, strive to maximize expected utility for its real-world counterpart instead of its simulated self. I'm not entirely sure if this is a reasonable assumption, and I admit this may be a failure point of the theory, but I'll continue anyway. At the very least, it seems like the kind of behavior you would want to build into a utility-maximizing agent if you were to design one.

Now, for a concrete example of a decision problem where this observation is consequential, consider Newcomb's problem with $1,000 in the transparent box and $1,000,000 possibly in the opaque box. Assume the agent has 50% credence that it is a simulation. In the case that the agent is a simulation, it has no causal influence over its real-world counterpart's decision, so it assumes a uniform prior over the two possibilities (the counterpart one-boxing or two-boxing). In the case that the agent is in the real world, it assumes a uniform prior over whether the $1,000,000 is present in the opaque box. Given these assumptions, the expected value of two-boxing is

$$\frac{1}{2}\left(\frac{1}{2}\cdot\$1{,}000+\frac{1}{2}\cdot\$0\right)+\frac{1}{2}\left(\frac{1}{2}\cdot\$1{,}000+\frac{1}{2}\cdot\$1{,}001{,}000\right)=\$250{,}750,$$

(where the first top-level term corresponds to the simulation, the second to reality, the first parenthesized terms to the other agent (real or sim) two-boxing, and the second to the other agent one-boxing) whereas the expected value of one-boxing is

$$\frac{1}{2}\left(\frac{1}{2}\cdot\$1{,}001{,}000+\frac{1}{2}\cdot\$1{,}000{,}000\right)+\frac{1}{2}\left(\frac{1}{2}\cdot\$0+\frac{1}{2}\cdot\$1{,}000{,}000\right)=\$750{,}250,$$

and hence a CDT agent with these assumptions would choose to one-box.
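
To make the arithmetic explicit, here is a minimal Python sketch of the calculation above. The function name, the `p_sim` parameter, and the payoff constants are my own labels for the quantities in the text, not anything from a standard formalism.

```python
# A minimal sketch (assumptions and names are mine) of the expected-value
# calculation above: a CDT agent with credence p_sim that it is the
# predictor's simulation, maximizing its real-world counterpart's payoff.

SMALL = 1_000      # transparent box
BIG = 1_000_000    # opaque box, if filled

def newcomb_ev(my_choice: str, p_sim: float = 0.5) -> float:
    ev = 0.0
    # Case 1: I am the simulation. My choice fixes the prediction, and hence
    # whether the opaque box is filled; the real counterpart's choice gets a
    # uniform prior.
    box_full = (my_choice == "one-box")
    for other_choice in ("two-box", "one-box"):
        payoff = (BIG if box_full else 0) + (SMALL if other_choice == "two-box" else 0)
        ev += p_sim * 0.5 * payoff
    # Case 2: I am real. The box's contents were fixed by the simulation's
    # choice, over which I hold a uniform prior.
    for sim_choice in ("two-box", "one-box"):
        box_full = (sim_choice == "one-box")
        payoff = (BIG if box_full else 0) + (SMALL if my_choice == "two-box" else 0)
        ev += (1 - p_sim) * 0.5 * payoff
    return ev

print(newcomb_ev("two-box"))  # 250750.0
print(newcomb_ev("one-box"))  # 750250.0
```

In this sketch the verdict does not actually depend on the 50% figure: the difference in expected values works out to $1,001,000 × p_sim − $1,000, so any credence above roughly 0.1% of being the simulation already favors one-boxing.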

For the second example, consider a counterfactual mugging in which the agent is asked for $100 in exchange for a counterfactual $1,000,000 (to simplify the calculations, assume that the predictor only goes to the effort of making a prediction if the coin lands heads). Under the same assumptions as above, the CDT expected value of acquiescing to the mugging is

$$\frac{1}{2}\cdot\$1{,}000{,}000+\frac{1}{2}\cdot(-\$100)=\$499{,}950,$$

and the expected value of refusing the mugging is

$$\frac{1}{2}\cdot\$0+\frac{1}{2}\cdot\$0=\$0,$$

hence a CDT agent would acquiesce.
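
The same kind of sketch covers the mugging numbers; again, the names and the `p_sim` parameter are my own.

```python
# A companion sketch for the counterfactual-mugging calculation above, under
# the stated simplification that the predictor only simulates the agent when
# the real coin lands heads. Being asked for the $100 then means either I am
# that simulation, or I am real and the coin landed tails.

COST = 100
PRIZE = 1_000_000

def mugging_ev(pay: bool, p_sim: float = 0.5) -> float:
    # Simulation case: my choice *is* the prediction, so it decides whether
    # the real counterpart (whose coin landed heads) receives the prize.
    sim_payoff = PRIZE if pay else 0
    # Real case: the coin landed tails, so the only consequence is the $100.
    real_payoff = -COST if pay else 0
    return p_sim * sim_payoff + (1 - p_sim) * real_payoff

print(mugging_ev(True))   # 499950.0
print(mugging_ev(False))  # 0.0
```

Note that the simulated agent's own $100 is not counted anywhere: per the assumption above, a simulation maximizes expected utility for its real-world counterpart, not for itself.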


This idea relies on several assumptions, some more reasonable than others. I don't claim that this is how a pure CDT agent would actually behave in these scenarios, but it is at least an interesting alternative way to think about why the common interpretation of CDT appears irrational (or at least leaves its agents poorer in practice) in decision problems involving predictors, despite giving reasonable verdicts in most others. In the future, it would be interesting to investigate to what extent this version of CDT agrees with other decision theories, and on which problems it diverges.


