Changing my mind about Christiano's malign prior argument

Published on April 4, 2025 12:54 AM GMT

Overview

In the past I've been skeptical of Paul Christiano's argument that the universal distribution is "malign" in the sense that it contains adversarial subagents who might attempt acausal attacks. My underlying intuition was that the universal distribution is doing epistemics properly, so its credences should track reality and not be inappropriately vulnerable to any attacks. In particular, if the universal distribution takes seriously that it may be in a simulation, and expects the "simulation lords" to mess with it (this is the overly-compressed essence of Christiano's hypothetical), then we should also take this seriously - so it's not really an attack at all.

I ended up changing my mind somewhat at the CMU agent foundations conference, after helpful conversations with Abram Demski, Vanessa Kosoy, Scott Garrabrant, and particularly Sam Eisenstat (who finally managed to get through to me with a slightly different explanation of the argument). The short version is that (as I was aware) Solomonoff induction has no inductive bias towards being computed by an agent/predictor, because it says nothing about embeddedness - it does not expect to be "running on a computer." This seems unreasonable (we know that anything performing inference about our universe is a part of our universe - right?), so security mindset says to anticipate that some kind of exploit/attack is possible; with that mindset, I could have taken the argument seriously faster. In fact, I think that this "mistake" in the universal distribution lends some credence to the malign prior argument, opening a potential opportunity for acausal attack. However, I still have many reservations about whether the argument works, which I'll discuss in the latter part of this post.

The Argument

Following my last post engaging with the malign prior argument, I'll call the universe where an agent of interest is trying to use the universal distribution for prediction Predictoria. The idea of the argument is that there may be another simple computable universe, called Adversaria, which evolves life and ultimately a civilization/singleton that wishes to acausally influence Predictoria - perhaps because it cares about all computable or mathematical universes. Adversaria could run many simulations of simple computable universes from the perspective of important decision makers, and then intervene on those simulations right after important decisions. Why would Adversaria want to do this? Well, imagine that we in Predictoria worry that we might be in one of Adversaria's simulations. This could certainly affect our actions, if we have any guess about which actions the simulation lords might disapprove of and punish. More generally, Adversaria could push our probabilities around at pivotal moments however it saw fit by re-engineering the future (pausing and modifying the simulation), so various subtle and malign forms of influence seem possible.

Why would the universal distribution take this possibility seriously? Hypotheses are, roughly speaking, weighted by Kolmogorov complexity (that is, "computable description complexity"). Since we assumed Adversaria is computable and simple, it is a likely candidate hypothesis a priori. If it runs a massive number of simulations, we might expect to be in one of them (assuming our universe is computable and simple, which is required for realizability - though that is not the only possible justification for using the universal distribution). In that case, the combined hypothesis that we are in an Adversaria simulation of our universe requires only the simple "laws of physics" of Adversaria and a pointer of some sort to the specific simulation that has Predictoria's laws of physics.
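For concreteness, this is the standard weighting being invoked; nothing below is specific to this post. Here U is a prefix-free universal machine and the sum ranges over programs whose output extends the observed string x:

```latex
% Universal distribution: each program p gets weight 2^{-|p|}, so a hypothesis
% of Kolmogorov complexity K(h) receives prior mass on the order of 2^{-K(h)}
% (up to the usual coding-theorem slack).
M(x) \;=\; \sum_{p \,:\, U(p) \,=\, x*} 2^{-|p|},
\qquad
M(x) \;\approx\; 2^{-K(x)}.
```

So the prior weight of the Adversaria-simulation hypothesis is governed by its total description length, which is what the rest of the argument compares against the direct hypothesis.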

As @Wei Dai put it in his response to my original rebuttal to the argument:

Let's assume for simplicity that both Predictoria and Adversaria are deterministic and nonbranching universes with the same laws of physics but potentially different starting conditions. Adversaria has colonized its universe and can run a trillion simulations of Predictoria in parallel. Again for simplicity let's assume that each of these simulations is done as something like a full-scale physical reconstruction of Predictoria but with hidden nanobots capable of influencing crucial events. Then each of these simulations should carry roughly the same weight in M as the real Predictoria and does not carry a significant complexity penalty over it. That's because the complexity / length of the shortest program for the real Predictoria, which consists of its laws of physics (P) and starting conditions (ICs_P) plus a pointer to Predictoria the planet (Ptr_P), is K(P) + K(ICs_P|P) + K(Ptr_P|...). The shortest program for one of the simulations consists of the same laws of physics (P), Adversaria's starting conditions (ICs_A), plus a pointer to the simulation within its universe (Ptr_Sim), with length K(P) + K(ICs_A|P) + K(Ptr_Sim|...). Crucially, this near-equal complexity relies on the idea that the intricate setup of Adversaria (including its simulation technology and intervention capabilities) arises naturally from evolving ICs_A forward using P, rather than needing explicit description.

(To address a potential objection, we also need that the combined weights (algorithmic probability) of Adversaria-like civilizations is not much less than the combined weights of Predictoria-like civilizations, which requires assuming that phenomenon of advanced civilizations running such simulations is a convergent outcome. That is, it assumes that once civilization reaches Predictoria-like stage of development, it is fairly likely to subsequently become Adversaria-like in developing such simulation technology and wanting to use it in this way. There can be a complexity penalty from some civilizations choosing or forced not to go down this path, but that would be more than made up for by the sheer number of simulations each Adversaria-like civilization can produce.)

If you agree with the above, then at any given moment, simulations of Predictoria overwhelm the actual Predictoria as far as their relative weights for making predictions based on M. Predictoria should be predicting constant departures from its baseline physics, perhaps in many different directions due to different simulators, but Predictoria would be highly motivated to reason about the distribution of these vectors of change instead of assuming that they cancel each other out. One important (perhaps novel?) consideration here is that Adversaria and other simulators can stop each simulation after the point of departure/intervention has passed for a while, and reuse the computational resources on a new simulation rebased on the actual Predictoria that has observed no intervention (or rather rebased on an untouched simulation of it), so the combined weight of simulations does not decrease relative to actual Predictoria in M even as time goes on and Predictoria makes more and more observations that do not depart from baseline physics.  
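Restated as a display, this is just Wei Dai's comparison made typographically explicit, in his own notation, with N standing in for the trillion parallel simulations:

```latex
% Real Predictoria vs. one Adversaria-hosted simulation of it (same physics P):
K_{\mathrm{real}} \;=\; K(P) + K(\mathit{ICs}_P \mid P) + K(\mathit{Ptr}_P \mid \ldots)
K_{\mathrm{sim}}  \;=\; K(P) + K(\mathit{ICs}_A \mid P) + K(\mathit{Ptr}_{\mathit{Sim}} \mid \ldots)
% With N parallel simulations, the combined simulation weight is roughly
% N * 2^{-K_sim} against 2^{-K_real} for reality: the simulations dominate
% unless K_sim exceeds K_real by more than log2(N) bits.
```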

In fact, as Wei Dai seems to be hinting at, a complete theory of "Simulated Predictoria" (from within Predictoria) also needs a pointer to the specific moment of Adversaria's intervention. Another way of framing this is that Predictoria expects Adversaria to possibly intervene after each "pivotal moment." Note that this perspective views a mixture of deterministic hypotheses about which simulation Predictoria is in as a single stochastic hypothesis: that Predictoria is in a simulation with fixed physics but an unknown moment of intervention. Pivotal moments have low K-complexity (in fact, that is one possible definition of a pivotal moment, though I'll reserve the term for moments that Adversaria also cares about), which means they eventually become infrequent by a counting argument. So Predictoria should eventually learn that it is not in the type of simulation which is intervened on. I think this is pretty cut and dried, but it doesn't really answer the malign prior argument, because K-complexity grows very slowly, and fear of Adversaria may influence Predictoria to some degree for many years, particularly around important moments.
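To see the "eventually learns, but slowly" dynamic concretely, here is a toy Bayesian model. It is entirely my own illustration: the prior over intervention times is a crude stand-in for 2^{-K(t)} (generic prefix coding, plus bonus mass on powers of two as a proxy for "simple" pivotal moments), and every number in it is arbitrary.

```python
# Toy model (my own illustration, not from the post): Predictoria splits prior
# mass between "baseline physics" (predicts no anomaly, ever) and "simulation
# with one intervention at an unknown time t". Every quiet step rules out the
# intervention times that have already passed.

T_MAX = 2**20

def time_weight(t: int) -> float:
    """Crude stand-in for 2^-K(t): generic prefix coding of t, plus extra
    mass when t has a short special description (here, just: powers of two)."""
    generic = 2.0 ** -(2 * t.bit_length())
    special = 2.0 ** -(t.bit_length().bit_length() + 4) if (t & (t - 1)) == 0 else 0.0
    return generic + special

weights = [time_weight(t) for t in range(1, T_MAX + 1)]
total = sum(weights)
baseline = 0.5  # prior mass on "no intervention, ever"

for quiet_steps in (10, 10**2, 10**3, 10**4, 10**5, 10**6):
    # Mass on intervention times still in the future after quiet_steps.
    surviving = 0.5 * sum(weights[quiet_steps:]) / total
    p_sim = surviving / (surviving + baseline)
    print(f"after {quiet_steps:>7} quiet steps: P(intervention still coming) = {p_sim:.5f}")
```

The absolute numbers mean nothing, but the shape matches the point above: the posterior does go to zero as quiet observations accumulate, yet residual mass parked on simple future times keeps the hypothesis alive much longer than a uniform prior over intervention times would.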

With that said, we have more or less reduced the problem to deciding whether the laws of physics of Adversaria are more complicated than a pointer to our specific planet / predictor within Predictoria. I initially rejected the malign prior argument here because I did not see it this way: Adversaria's pointer to Predictoria also has to specify the planet! However, the Adversaria hypothesis does have an advantage here, because Adversaria would only care to influence decision makers. That means it gets "most of" the pointer to the planet for free, by restricting attention to planets with life. From the perspective of the universal distribution, observations might just as well come from the inside of a star or a random dust cloud (Vanessa's usual example). Perhaps an inhabited planet has lower K-complexity than an arbitrary planet, but we do have to specify that we're putting our sensors on a robot / attaching them to a brain, and that costs us some bits.

Does it cost us more bits than the physics of Adversaria? That's not clear to me, but it can't be much more, since Adversaria evolves intelligent life which pervades its entire universe. That means the conditional complexity of "intelligence" should be pretty low given the physics of Adversaria - and "intelligence" makes the pointer to our planet in Predictoria simple.
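My attempt to compress this into one line; the notation extends Wei Dai's, and the inequality is a question rather than a claim. Cancelling the shared K(P) term from both hypotheses, the Adversaria route wins roughly when:

```latex
% Is reaching "a predictor on an inhabited planet" via Adversaria cheaper than
% pointing at the predictor directly within Predictoria?
K(\mathit{ICs}_A \mid P) + K(\mathit{Ptr}_{\mathit{Sim}} \mid \ldots)
\;\overset{?}{\lesssim}\;
K(\mathit{ICs}_P \mid P) + K(\text{sensors on a brain/robot} \mid P, \mathit{ICs}_P)
% The paragraph above argues the left side is nearly just K(ICs_A | P): given
% Adversaria's physics, intelligence (and hence a cheap pointer to decision
% makers) comes almost for free.
```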

Still, I think the argument stands in the sense that it isn't obviously logically flawed.

Obstacles for Adversaria

The argument places a lot of demands on the hypothetical Adversaria:

1. It has to be computationally simple.
2. It has to evolve intelligent life (if the specification of intelligence is just hardcoded, pointing to the predictor in Predictoria would be simpler than invoking Adversaria).
3. This evolution probably has to take place relatively quickly - pointers to very late times in Adversaria may be too complicated, but are required to specify simulations (noticed by Abram Demski). Simple resolutions for this issue seem to fall prey to the preceding obstacle.
4. That intelligent life has to care very much about acausally influencing the mathematical multiverse (I think this may be a philosophical mistake, but will have to argue that point elsewhere).
5. Adversaria has to have a pretty large amount of accessible compute to run simulations.
6. Adversaria may have to coordinate with any other Adversaria-like influencers, which is potentially nontrivial because the closure properties of computable universes (which do NOT have access to an oracle for the universal distribution) may make it difficult for them to reason about each other.

It also places some demands on Predictoria:

1. Predictoria has to be at least computable, and preferably computationally simple (otherwise Adversaria is required to run far more simulations to include Predictoria).
2. Predictoria has to be able to reason about Adversaria, which is presumably hard / necessarily approximate (?) by the previous demand.

Even if all of these requirements are fulfilled, Adversaria doesn't drastically beat out alternatives a priori and loses probability mass as time passes. 

An Analogy to Intelligent Design

The malign prior argument is analogous to the following argument that a Cartesian dualist should believe in God:

"Isn't it unlikely that you would be a conscious, thinking being instead of a rock? Perhaps the universe (as we know it) was created by such a conscious, thinking being who is personally invested in you existing because you are like Him. You should try to figure out what He would want and do that - otherwise, He might overturn the natural order as punishment."

Or very briefly:

"I think, therefore God exists."

which in a sense Descartes actually tried to argue!  

Initially, my intuition was that this argument is just wrong: God / Adversaria is disfavored by Occam's razor, and Solomonoff induction would not fall for it. Now I am convinced that Solomonoff induction takes this argument more seriously than we might, because it is not a naturalized inductor - it would be surprised to find itself "running inside of a robot," and that (apparently inappropriate) surprise might lead it to radical places. In a way, that makes perfect sense: Solomonoff induction really can't run in our universe! Any robot we could build to "use Solomonoff induction" would have to use some approximation, and the malign prior argument may or may not apply to that approximation.

Either way, if I had to guess, the universal distribution probably isn't malign "in practice." I'm not sure if this question is purely academic or has any alignment-relevant content.


