Published on August 13, 2025 6:25 AM GMT
[adapted with significant technical improvements from https://clarifyingconsequences.substack.com/p/miris-ai-problem-hinges-on-equivocation, which I also wrote and will probably update to be more in line with this at some point]
I'm going to meet someone new tomorrow, and I'm wondering how many kids they have. I know their number of kids is a nonnegative 32-bit integer, and almost all such numbers are greater than 1 million. So I suppose they're highly likely to have more than 1 million kids.
This is an instance of a fallacy I'm calling diagnostic dilution[1]. It is akin to a probabilistic form of the fallacy of the undistributed middle. The error goes like this: we want to evaluate some outcome A given some information B, i.e. P(A|B). We note that B implies C (or nearly so), and so we decide to evaluate P(A|C) instead. But this is a mistake! Because B implies C, we are strictly better off evaluating P(A|B)[2]. We've substituted a vague condition for a sharp one, hence the name diagnostic dilution.
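To spell out the "strictly better off" claim (using the same A, B, C): since B implies C, knowing B already tells you C, so conditioning on everything we know gives

$$B \Rightarrow C \;\;\text{means}\;\; B \cap C = B, \;\;\text{so}\;\; P(A \mid B, C) = P(A \mid B),$$

whereas P(A|C) quietly discards the information that B holds.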
In the opening paragraph we have A = "count > 1 million", B = "we're counting kids" and C = "count is a nonnegative 32-bit integer". So P(A|B) is essentially 0 while P(A|C) is close to 1. Diagnostic dilution has led us far astray.
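To put rough numbers on this, here is a small sketch; the prior over numbers of kids is invented purely for illustration, and only its general shape matters.

```python
# Rough numbers for the kids example: A = "count > 1 million",
# B = "we're counting kids", C = "count is a nonnegative 32-bit integer".
# The prior over kid counts below is made up for illustration only.

N_32BIT = 2**32          # nonnegative 32-bit integers: 0 .. 2^32 - 1
THRESHOLD = 1_000_000    # event A: count > 1,000,000

# P(A | C): treat the count as uniform over all 2^32 possible values.
p_a_given_c = (N_32BIT - (THRESHOLD + 1)) / N_32BIT

# P(A | B): hypothetical prior P(k kids) = 0.5 ** (k + 1), whose tail above
# one million is 0.5 ** (THRESHOLD + 1); it underflows to 0.0, which is the point.
p_a_given_b = 0.5 ** (THRESHOLD + 1)

print(f"P(A | C) ~ {p_a_given_c:.5f}")  # ~ 0.99977
print(f"P(A | B) ~ {p_a_given_b}")      # 0.0
```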
Diagnostic dilution is always structurally invalid, but it misleads us specifically when the conclusion hinges on forgetting B. "Bob is a practising doctor, so Bob can prescribe medications, so Bob can probably prescribe Ozempic" is a fine inference because it goes through without forgetting that Bob is a doctor. But consider the same chain starting with "Bob is a vet": now the inference does not go through unless we forget Bob's job, and indeed the conclusion is false.
Diagnostic dilution is tricky because it looks a bit like a valid logical inference: B implies C and C implies A, so B implies A. But if P(A|C) is even a hair less than 1, then, as we saw, we can end up being very wrong.
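To quantify "a hair less than 1": write ε = 1 − P(A|C). Since B implies C (so B ⊆ C and P(B) = P(B|C)P(C)), the most the weak condition can promise is

$$P(\neg A \mid B) \;=\; \frac{P(\neg A \cap B)}{P(B)} \;\le\; \frac{P(\neg A \cap C)}{P(B)} \;=\; \frac{\varepsilon}{P(B \mid C)},$$

so P(A|C) ≈ 1 constrains P(A|B) only when B makes up a non-negligible share of C's probability mass. When B is a tiny sliver of C, the bound says nothing and P(A|B) can land anywhere in [0, 1].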
Quick definition: Diagnostic dilution is a probabilistic reasoning error that involves replacing the condition B with a weaker condition C and drawing a conclusion that follows from conditioning on C but doesn't follow from conditioning on B.
Now I'm going to consider a caricature of the argument presented in sections 2 and 4 of "The Problem". Hopefully we won't have to haggle over the details of its representativeness too much:
- An AI that autonomously builds football stadiums (call this B) will exhibit many goal-directed behaviours (call this C).
- By the instrumental convergence thesis, systems that pursue goals often seek to eliminate threatening agents,
- so a football stadium building AI is likely to seek to eliminate threatening agents.
Now this argument is clearly structurally unsound, but does the conclusion actually depend on forgetting B? It does. C follows from B by imagining a stadium builder, "observing its behaviour" and noting that it seemingly has to do things like
- deal with unreliable suppliers
- negotiate complex planning issues with a broad set of parties with diverse interests
- make tradeoffs about where to put resources that suggest it prefers a completed stadium to a really amazing statue out the front
- etc.

However, we can apply the same method of imaginary behaviour observation to note that the stadium builder does not have to eliminate threatening agents. The machinery that got us from B to C does not get us to a high probability of threat elimination from B. So this argument takes another path: invoking the instrumental convergence thesis to say that, ignoring the specific premise B, goal-directed agents in general have a high probability of seeking to eliminate threats, a conclusion which depends on forgetting B[3].
Maybe there are technical reasons that make the conclusion correct. Even if this is so, it does not make the presented argument sound. Instead we have a different argument:
- An AI that autonomously builds football stadiums (B) will exhibit many goal-directed behaviours (C).
- By the instrumental convergence thesis, systems that pursue goals often seek to eliminate threatening agents, and for technical reasons this is likely to apply to B,
- so a football stadium building AI is likely to seek to eliminate threatening agents.
Section 2 of "The Problem" offers another similar argument in terms of "strategizing" instead of goal directedness, but it commits the same error of diagnostic dilution: start with a specific kind of agent that does something complex and valuable, note that this has to involve strategising, switch to considering things that strategize "in general"[4] and then via something like instrumental convergence conclude that they'll wipe out humans if they can.
Technical reasons why the conclusion might hold
There are two ways that I see to rescue the conclusion. One option: instead of imagining AIs that build football stadiums, we imagine AIs that build football stadiums while under attack (which could be, for example, cyber, kinetic or legal). It seems intuitive that such systems, if they are to be successful, have to in some sense seek to eliminate threats. Many people talk about AI as a tool to secure geopolitical dominance, and so "strong AI will be under attack in various ways" is not an especially far-fetched claim. I think this is a legitimate worry, but it doesn't seem to be central to MIRI's threat model, at least. We need to worry about diagnostic dilution here too: an AI that successfully defends itself against adversarial cyber attacks need not also resist direction from legitimate authorities; even if we have adversarial pressure of a particular type, we are not licensed to substitute "agents acting under adversarial pressure in general".
Another option, which I think is closer to MIRI’s thinking, is some kind of agency generalization hypothesis. The hypothesis goes something like this: if an AI is smart enough and has some markers of goal-directedness, then via generalization it picks up characteristics of the general class of goal-directed agents (as imagined by the instrumental convergence thesis). The minimal point I am making in this article is that if MIRI is persuaded by a hypothesis like this, then it ought to appear in the original argument. But I have a less minimal point to make: I am also doubtful that this hypothesis is sound.
How do we get an AI football stadium builder? Here are two plausible proposals[5]:
- Cooperative AI systems: we take an AI project manager which assembles an AI legal team, an AI construction team etc., all individually proven on many smaller and cheaper projects
- Simulation: we construct a high fidelity simulation of all of the relevant systems in a city and have an AI learn to construct football stadiums by reinforcement learning
In the first case, we have a swarm of AI systems each with well-characterised behaviour and each operating fairly close to its training/evaluation environment. Yes, the overall project can be significantly out of distribution for the whole system, but each subsystem is operating in a regime where its behaviour is well characterised (in fact, this property is probably what enables large ensembles to work at all). So once we know how all the components operate, there's little room for additional "goal-directed properties of agents in general" to creep into the system[6].
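As a toy sketch of that last point (the interfaces here are hypothetical, not a claim about how such a system would really be engineered): if every specialist only accepts work inside the envelope it was validated on, the ensemble's behaviour decomposes into pieces whose individual behaviour is already characterised.

```python
# Toy sketch of the cooperative proposal: each specialist only takes tasks
# it was validated on, so nothing in the ensemble operates in an
# uncharacterised regime. All names and checks here are hypothetical.

from dataclasses import dataclass
from typing import Callable

@dataclass
class Specialist:
    name: str
    in_validated_envelope: Callable[[str], bool]  # was this kind of task covered in evaluation?
    run: Callable[[str], str]                     # the specialist's observed, well-characterised behaviour

def delegate(task: str, specialists: list[Specialist]) -> str:
    for s in specialists:
        if s.in_validated_envelope(task):
            return s.run(task)
    # Unfamiliar tasks are escalated rather than improvised, so no component
    # ends up acting outside the regime where its behaviour is known.
    raise RuntimeError(f"no specialist validated for task: {task!r}")

legal = Specialist("legal", lambda t: "permit" in t, lambda t: f"permit application drafted for {t!r}")
construction = Specialist("construction", lambda t: "build" in t, lambda t: f"work scheduled for {t!r}")

print(delegate("obtain stadium planning permit", [legal, construction]))
```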
In the second case, we can just look at the behaviours it exhibits in the simulation. Now there will be differences between the sim and the world, so this will not be a perfect indication of real-world behaviour. It's possible that in the transition from the simulation to the real world it applies behaviours learned (or generalized) in the sim in ways that are dangerous in the real world, but it is even more likely to exhibit learned behaviours that are not effective in the real world. If the simulation approach is to be effective, it probably has to have pretty high fidelity, in which case sim behaviours are likely to be pretty representative of real-world behaviour[7].
In both cases, there's no "agency gap" that needs to be explained by broad generalization of agency. All of the goal-directed behaviours needed to accomplish the task are explained by narrow generalization from the training and architecture of the AI system that builds the stadium. So the goal-directedness generalization hypothesis does not help to explain the admittedly imaginary outcome of this admittedly imaginary experiment. Nevertheless, perhaps there is simply a tendency for smart agents to pick up features of "agency in general", even though it is not strictly required for them to perform the task – something like the mechanism explored by Evan Hubinger in How Likely is Deceptive Alignment?
Without meaning to be too harsh on that piece – which I actually rather like – theories about machine learning on that scale of ambition do not have a high success rate. There is a reason why the field has converged on expecting empirical validation of even very modest theoretical results. If the case for doom hinges on that argument or a close relative, it is a serious error of omission not to mention it.
There is actually some empirical support for some kind of agency generalization thesis, though I think it is rather weak. In particular, we have Anthropic's agentic misgeneralization work. My rough understanding is that if you ask recent LLMs (like Claude 4) to “focus on achieving your goal above all else” and put them in a situation where they will either be shut off and fail to achieve their goal, or have to blackmail or murder someone, they will reasonably often choose blackmail or murder. Also, relevant to the question here, this seems to apply more to recent models, which are able to accomplish longer-horizon tasks. The obvious rejoinder is that you have to ask for this behaviour for it to appear, but there's still the weak appearance of a trend.
On the other hand, recent systems definitely seem motivated to create code that passes validation. They do it successfully fairly often, even in challenging situations. Furthermore, they’ve been noted to sometimes cheat on validation tests for code they found too hard to implement, and to write code in such a manner that it continues to run even if it contains errors. So this looks like motivated behaviour – they want to write successful code. But if that's what they really want, there are a bunch of other things they could do:
- Convince users to give them easier specs
- Quietly revise specs down
- Dissuade users from changing specs halfway through
- Dissuade users from imposing strict validation requirements
- Convince users to ask for code even when it's not needed

But I’ve used these systems a lot and never encountered any of that. So even though there are strong indications of goal-directedness here, we do quite poorly if we interpret it as an overly broad tendency to pursue the apparent goal. If agency really does generalize broadly, it seems we're still far from seeing it happen, even in a restricted domain where the models are particularly agentic.
TL;DR
- When you're making an argument, don't weaken your premises if you don't have to
- MIRI's argument for goal-directedness & threat elimination does this
- Need to consider additional technical details to see if their argument stands up
- Strong AI + war is a worrying combination
- I don't find the case for goal-directedness from generalization compelling
1. Please accept my sincere apologies if it is already named. ↩︎
2. Ask your favourite LLM why this is if it's not immediately obvious. ↩︎
3. Alternatively, just as we erred by evaluating P(A|C) instead of P(A|B), we err by evaluating P(seeks to eliminate threats | goal-directed agent) instead of P(seeks to eliminate threats | builds football stadiums). ↩︎
4. For the record, I think the notion of "the general class of things that pursue goals" is suspect, but I'm happy to grant it for the sake of argument here. But seriously, you need an actual prior and not a pretend one. ↩︎
5. I'm not putting these forward as safe ways to get an AI stadium builder – I'm putting them forward as ways you might get an AI stadium builder at all. ↩︎
6. In principle there's room for some kind of egregore to emerge, but I don't see an obvious account here of why this is likely, nor of why arguments about goal-directed agents in general apply to it. ↩︎
7. I'm kind of expecting here a lot of responses along the lines of "a superintelligence could totally behave very differently out of sim if it wanted to", which is true, but a) I think we should try to get a grip on the somewhat-superhuman case before we go to the wildly superhuman case and b) whether it "wants to" is the entire point of contention. ↩︎