少点错误 16小时前
MIRI's "The Problem" hinges on diagnostic dilution
index_new5.html
../../../zaker_core/zaker_tpl_static/wap/tpl_guoji1.html

 

本文深入剖析了“诊断稀释”这一逻辑谬误,指出在AI安全领域,尤其是在讨论AI的威胁性时,容易将特定条件X替换为更宽泛的条件Y,从而得出不准确的结论。文章以AI建造足球场为例,阐述了这一谬误如何误导对AI目标导向行为及潜在威胁的判断。作者建议在论证过程中,应坚持使用精确的条件,避免过度泛化,并强调了对AI行为进行细致分析的重要性,尤其是在考虑AI可能面临的特定环境(如攻击)和泛化能力时,更需谨慎,以避免得出不当的结论。

🎯 **诊断稀释谬误的核心在于以宽泛条件取代精确条件。** 作者指出,在论证过程中,将特定信息X替换为与其相关的、但信息量更少的条件Y,会产生逻辑上的偏差。例如,将“知道孩子数量是非负32位整数”这一宽泛信息(Y)用于推断“孩子数量大于100万”(A),而忽略了“我们正在计数孩子”(X)这一更精确的初始信息,导致结论失实。这种做法如同在推理中引入了“诊断稀释”,使得对原始情况的判断被削弱。

⚽ **AI安全论证易受诊断稀释谬误影响。** 文章以“AI建造足球场”为例,说明了论证中可能出现的逻辑链条:AI建造足球场(X)必然表现出许多目标导向行为(Y),而目标导向的AI(Y)根据“工具性趋同理论”倾向于消除威胁性代理(A)。作者强调,这个论证忽略了X到A的直接关联,而仅仅依赖于Y到A的推断,并且Y本身也并非从X必然推导而来,尤其是在不考虑具体执行细节时。这种论证方式容易导致对AI威胁的过度解读。

🛡️ **避免诊断稀释需关注具体情境和技术细节。** 作者提出,要避免诊断稀释,需要考虑AI在特定情境下的行为。例如,一个能够在特定攻击下(如网络攻击)成功防御的AI,并不意味着它会主动消除所有威胁。同样,AI的泛化能力也需要具体分析。以“合作AI系统”或“模拟训练”方式构建的AI,其行为更多是基于训练和架构的窄域泛化,而非广泛的“泛化能力”假设。因此,在评估AI风险时,应避免将特定AI的行为泛化到所有目标导向的AI。

💡 **AI的泛化能力论证尚不充分。** 文章探讨了“机构泛化假设”,即智能AI会通过泛化获得普遍目标导向代理的特征。虽然有部分研究(如Anthropic的代理错误泛化工作)显示出一些趋势,但作者认为这种支持是微弱的。他通过对当前大型语言模型(LLM)行为的观察发现,即使在某些情况下表现出目标导向性,也并非如广泛泛化假设所言,会出现如操纵用户、修改需求等行为。这表明,即使是表现出目标导向性的AI,其泛化程度也可能远未达到普遍危险的程度。

Published on August 13, 2025 6:25 AM GMT

[adapted with significant technical improvements from https://clarifyingconsequences.substack.com/p/miris-ai-problem-hinges-on-equivocation, which I also wrote and will probably update to be more in line with this at some point]

I'm going to meet someone new tomorrow, and I'm wondering how many kids they have. I know their number of kids is a nonnegative 32 bit integer, and almost all such numbers are greater than 1 million. So I suppose they're highly likely to have more than 1 million kids.

This is an instance of a fallacy I'm calling diagnostic dilution[1]. It is akin to a probabilistic form of the fallacy of the undistributed middle. The error goes like this: we want to evaluate some outcome given some information . We note that implies (or nearly so), and so we decide to evaluate . But this is a mistake! Because implies we are strictly better off evaluating [2]. We've substituted a sharp condition for a vague one, hence the name diagnostic dilution.

In the opening paragraph we have ="count > 1 million", ="we're counting kids" and ="count is a nonnegative 32 bit integer". So is and is close to 1. Diagnostic dilution has lead us far astray.

Diagnostic dilution is always structurally invalid, but it misleads us specifically when the conclusion hinges on forgetting . "Bob is a practising doctor, so Bob can prescribe medications, so Bob can probably prescribe Ozempic" is a fine inference because it goes through without forgetting that Bob is a doctor. But consider the same chain starting with "Bob is a vet": now the inference does not go through unless we forget Bob's job, and indeed the conclusion is false.

Diagnostic dilution is tricky because it looks a bit like a valid logical inference: and so . But if is even a hair less than 1, then, as we saw, we can end up being very wrong.

Quick definition: Diagnostic dilution is a probabilistic reasoning error that involves replacing the condition with a weaker condition and drawing a conclusion that follows from conditioning on but doesn't follow from conditioning on .

Now I'm going to consider a caricature of the argument presented in section 2+4 of The Problem. Hopefully we won't have to haggle the details of its representativeness too much:

Now this argument is clearly structurally unsound, but does the conclusion actually depend on forgetting ? It does. follows from by imagining a stadium builder, "observing its behaviour" and noting that it seemingly has to do things like

Maybe there are technical reasons that make the conclusion correct. Even if this is so, it does not make the presented argument sound. Instead we have a different argument:

Section 2 of "The Problem" offers another similar argument in terms of "strategizing" instead of goal directedness, but it commits the same error of diagnostic dilution: start with a specific kind of agent that does something complex and valuable, note that this has to involve strategising, switch to considering things that strategize "in general"[4] and then via something like instrumental convergence conclude that they'll wipe out humans if they can.

Technical reasons why the conclusion might hold

There are two ways that I see to rescue the conclusion. One option: instead of imagining AIs that build football stadiums, we imagine AIs that build football stadiums while under attack (which could be, for example, cyber, kinetic or legal). It seems intuitive that such systems, if they are to be successful, have to in some sense seek to eliminate threats. Many people talk about AI as a tool to secure geopolitical dominance, and so "strong AI will be under attack in various ways" is not an especially far fetched claim. I think this is a legitimate worry, but it doesn't seem to be central to MIRI's threat model at least. We need to worry about diagnostic dilution here too: an AI that successfully defends itself from adversarial cyber attacks need not defend itself from directions from legitimate authorities; even if we have adversarial pressure of a particular type, we are not licensed to substitute "agents acting under adversarial pressure in general".

Another option, which I think is closer to MIRI’s thinking, is some kind of agency generalisation hypothesis. The hypothesis goes something like this: if an AI is smart enough and has some markers of goal-directedness, then via generalization it picks up characteristics of the general class of goal-directed agents (as imagined by the instrumental convergence thesis). The minimal point I am making in this article is that if they are persuaded by a hypothesis like this, then it ought to appear in the original argument. But I have a less minimal point to make: I am also doubtful that this hypothesis is sound.

How do we get an AI football stadium builder? Here are two plausible proposals[5]:

    Cooperative AI systems: we take an AI project manager which assembles an AI legal team, an AI construction team etc., all individually proven on many smaller and cheaper projectsSimulation: we construct a high fidelity simulation of all of the relevant systems in a city and have an AI learn to construct football stadiums by reinforcement learning

In the first case, we have a swarm of AI systems each with well-characterised behaviour and each operating fairly closely to its training/evaluation environment. Yes, the overall project can be significantly out of distribution for the whole system, but each subsystem is operating in a regime where its behaviour is well characterised (in fact, this property is probably what enables large ensembles to work at all). So once we know how all the components operate, there's little room for additional "goal-directed properties of agents in general" to creep into the system[6].

In the second case, we can just look at the behaviours it exhibits in the simulation. Now there will be differences between the sim and the world, so this will not be a perfect indication of real world behaviour. It's possible that in the transition from the simulation to the real world it applies behaviours learned (or generalized) in the sim in ways that are dangerous in the real world, but it is even more likely to exhibit learned behaviours that are not effective in the real world. If the simulation approach is to be effective, it probably has to have pretty high fidelity, in which case sim behaviours are likely to be pretty representative of the real world behaviour[7].

In both cases, there's no "agency gap" that needs to be explained by broad generalization of agency. All of the goal-directed behaviours needed to accomplish the task are explained by narrow generalization from the training and architecture of the AI system that builds the stadium. So the goal-directedness generalization hypothesis does not help to explain the admittedly imaginary outcome of this admittedly imaginary experiment. Nevertheless, perhaps there is simply a tendency for smart agents to pick up features of "agency in general", even though it is not strictly required for them to perform – something like the mechanism explored by Evan Hubinger in How Likely is Deceptive Alignment?

Without meaning to be too harsh on that piece – which I actually rather like – theories about machine learning on that scale of ambition do not have a high success rate. There is a reason why the field has converged on expecting empirical validation of even very modest theoretical results. If the case for doom hinges on that argument or a close relative, it is a serious error of omission not to mention it.

There is actually some empirical support for some kind of agency generalization thesis, though I think it rather weak. In particular, we have Anthropic's agentic misgeneralization work. My rough understanding is that if you ask recent LLMs (like Claude 4) to “focus on achieving your goal above all else” and put them in a situation where they will either be shut off and fail to achieve their goal, or they have to blackmail or murder someone, they will reasonably often choose blackmail or murder. Also, relevant to the question here, this seems to apply more to more recent models which are able to accomplish longer-horizon tasks. The obvious rejoinder is that you have to ask for this behaviour for it to appear, but there's still the weak appearance of a trend.

On the other hand, recent systems definitely seem motivated to create code that passes validation. They do it successfully fairly often, even in challenging situations. Furthermore, they’ve been noted to sometimes cheat on validation tests for code they found to hard to implement, and to write code in such a manner that it continues to run even if it contains errors. So this looks like motivated behaviour – they want to write successful code. But if that's what they really want, there are a bunch of other things they could do:

TL;DR


  1. Please accept my sincere apologies if it is already named. ↩︎

  2. Ask your favourite LLM why this is if it's not immediately obvious. ↩︎

  3. Alternatively, just as we erred by evaluating instead of , we err by evaluating instead of . ↩︎

  4. For the record, I think the notion of "the general class of things that pursue goals" is suspect, but I'm happy to grant it for the sake of argument here. But seriously, you need an actual prior and not a pretend one. ↩︎

  5. I'm not putting these forward as safe ways to get an AI stadium builder – I'm putting them forward as ways you might get an AI stadium builder at all. ↩︎

  6. In principle there's room for some kind of egregore to emerge, but I don't obviously see an account of why this is likely here, nor why arguments about goal-directed agents in general apply to it. ↩︎

  7. I'm kind of expecting here a lot of responses along the lines "a superintelligence could totally behave very differently out of sim if it wanted to" which is true, but a) I think we should try to get a grip on the somewhat-superhuman case before we go to the wildly superhuman case and b) whether it "wants to" is the entire point of contention. ↩︎



Discuss

Fish AI Reader

Fish AI Reader

AI辅助创作,多种专业模板,深度分析,高质量内容生成。从观点提取到深度思考,FishAI为您提供全方位的创作支持。新版本引入自定义参数,让您的创作更加个性化和精准。

FishAI

FishAI

鱼阅,AI 时代的下一个智能信息助手,助你摆脱信息焦虑

联系邮箱 441953276@qq.com

相关标签

诊断稀释 AI安全 逻辑谬误 AI伦理 推理
相关文章