Published on June 20, 2025 5:14 PM GMT
Currently, only five companies in the world both have access to frontier AI training compute and are pursuing development of AGI (Google DeepMind, OpenAI, Anthropic, xAI, and Meta). This will still hold in 2026 for Google and OpenAI, and plausibly also for Anthropic, Meta, and xAI.
A company's stance towards trying to develop AGI can change, but the frontier AI training compute barrier is increasingly insurmountable for any company that doesn't already have impressive AI development accomplishments. In 2024, frontier compute was 100K H100s, which cost about $5-7bn (it was still possible to use legacy air cooling infrastructure with H100s). In 2025, it's 100K chips in GB200 NVL72 racks, which costs $7-11bn. In 2026, OpenAI's Stargate Abilene sets the lower bound at 400K chips in NVL72 racks (GB200 or possibly GB300), a 1 GW datacenter campus that costs $35-45bn (though since you can continue building out the 2025 system, only $30-35bn in addition to that).
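As a rough sanity check on how the barrier grows, here is the implied all-in cost per chip at each generation, using the midpoints of the cost ranges above; the midpoint choice is my own simplification, not separately sourced pricing.

```python
# Back-of-envelope: implied all-in cost per chip for each frontier training system,
# taking the midpoints of the cost ranges quoted above (assumption, not sourced pricing).
systems = {
    "2024: 100K H100":                (100_000, 6e9),   # ~$5-7bn
    "2025: 100K GB200 (NVL72)":       (100_000, 9e9),   # ~$7-11bn
    "2026: 400K GB200/GB300 (NVL72)": (400_000, 40e9),  # ~$35-45bn
}

for name, (chips, total_cost) in systems.items():
    print(f"{name}: ~${total_cost / chips / 1e3:.0f}K all-in per chip")
```

On these assumptions the all-in cost per chip rises from roughly $60K to $90K to $100K, so the total cost of a frontier system grows faster than the chip count alone.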
For 2025, Musk said on a recent podcast that "30K GB200s" are already installed at xAI's original Memphis site, and that an additional "110K GB200s" will be installed shortly at a new Memphis site (at 31:35). (Counting "GB200s" is a bit ambiguous, since in various contexts it can refer to a single compute die, to a chip/package that contains 2 compute dies, or to a board that carries 2 such chips.) OpenAI of course has phase 1 of Stargate Abilene, which is 100K chips in GB200 NVL72 racks (2 out of the 8 buildings planned for completion in summer 2026) that are either already online or coming online imminently. Anthropic has Project Rainier, which is 400K Trainium 2 chips with about the FLOP/s of 250K H100s, the same as 100K Blackwell (GB200) chips. Meta can afford a $7-11bn training system, and given their recent moves has the willingness to spend.
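To see how the Project Rainier comparison works out, here is a rough check using per-chip dense BF16 FLOP/s; the H100 figure follows from the TPU-v5p comparison in the next section, while the Trainium 2 and GB200 per-chip figures are my assumptions chosen to be consistent with the equivalence stated above, not official spec-sheet numbers.

```python
# Rough check of the equivalence 400K Trainium 2 ≈ 250K H100 ≈ 100K GB200,
# using assumed per-chip dense BF16 FLOP/s (not official spec sheets).
chips_per_system = {"Trainium 2": 400_000, "H100": 250_000, "GB200": 100_000}
bf16_flops_per_chip = {"Trainium 2": 0.65e15, "H100": 1.0e15, "GB200": 2.5e15}  # assumptions

for chip, count in chips_per_system.items():
    total = count * bf16_flops_per_chip[chip]
    print(f"{count / 1e3:.0f}K {chip}: ~{total / 1e20:.1f}e20 BF16 FLOP/s")
```

All three come out around 2.5e20 BF16 FLOP/s, which is the sense in which the systems are the same size in compute terms.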
RLVR and Large World Size
There is likely a new constraint on AI training systems starting in 2025-2026, if RLVR (training of thinking models) scales to have a use for as much GPU-time as frontier AI pretraining. A scale-up world is a system of AI accelerators with sufficiently good interconnect between them that for some purposes it can act as a single very large chip. If a large reasoning model doesn't fit in one or very few such scale-up worlds, it will generate tokens much more slowly than if it does, and having more compute in the form of small scale-up world systems won't particularly help. Google historically has TPUs organized in very large scale-up world systems with 3D topology, and for example the Gemini 2.5 report says it was trained on TPU-v5p pods, which are scaled up from chips with individual 0.5e15 BF16 FLOP/s compute, only half of what an H100 does. The H100s themselves, meanwhile, only form 8-chip scale-up worlds, which are too small for many current frontier models. For Nvidia, GB200 NVL72 is the first scale-up world system large enough to hold frontier LLMs (with the number of parameters that is compute optimal to train with 2024 compute), and additionally its all-to-all topology might be more convenient for some purposes than the 3D topology of the TPU systems. Notably, in Apr 2025 Huawei announced their CloudMatrix 384, a 7nm-chip based scale-up world system larger than even GB200 NVL72, though China still can't produce enough HBM or compute dies domestically for that to make a difference, and the chips they do have are unlikely to take the form of a single frontier AI training system, even though in principle there is enough to match the capabilities of the 2026 form of Stargate Abilene. And Amazon's Trainium 2 from Project Rainier that Anthropic uses has a relatively small scale-up world, only a bit larger than that of H100.
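To illustrate why scale-up world size matters for token generation, here is a rough memory-bandwidth-bound decoding estimate; the parameter count, weight precision, and per-chip HBM bandwidth in it are illustrative assumptions of mine, not figures from the post.

```python
# Rough illustration: during decoding, each generated token streams the (active) model
# weights from HBM, so a single model replica's token rate is bounded by the aggregate
# HBM bandwidth of the chips it is sharded across. Sharding a replica beyond one
# scale-up world puts slow cross-world communication on the critical path, so extra
# chips in separate small scale-up worlds mostly add replicas rather than speed.
active_params = 1e12        # assumed active parameters read per token
bytes_per_param = 1         # assumed FP8 weights
hbm_bw_per_chip = 8e12      # assumed ~8 TB/s HBM bandwidth per chip (Blackwell-class)

for world_size in (8, 72):  # H100-style 8-chip world vs GB200 NVL72
    tokens_per_s = world_size * hbm_bw_per_chip / (active_params * bytes_per_param)
    print(f"{world_size}-chip scale-up world: ~{tokens_per_s:.0f} tokens/s per replica (upper bound)")
```

Under these assumptions a single replica tops out near 64 tokens/s in an 8-chip world versus several hundred in an NVL72 world, which is the kind of gap that matters once RLVR needs to generate long reasoning traces at scale.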
So in terms of large scale-up world sizes, only Google and OpenAI clearly have access to frontier amounts of AI training compute in 2025. Anthropic plausibly will have access to enough GB200 NVL36/NVL72 because of their relationship with Amazon. Meta can afford to make this happen and seems willing, though it's unclear if they can get their act together sufficiently quickly for 2025 after the recent changes. And xAI will also likely be able to go through with their "110K GB200s" move (though that might turn out to be only 55K chips), as even if they haven't already raised the money necessary for this, they almost certainly will.
2026
For 2026, the difficulty of financing a frontier AI training system grows, so again only Google and OpenAI will clearly have enough compute (and it's not completely clear that Google is willing to dedicate enough of it to actual training to match Stargate Abilene of 2026). Also, the 2026 systems need 1 GW of power and use liquid cooling, so new non-compute infrastructure also needs to be built specifically for them, which as Stargate Abilene demonstrates takes at least 9 months (plus 2-3 months to install the compute hardware). So these systems for 2026 need to start being built already, and it's not clear if even Amazon/Anthropic is doing that, let alone Meta or xAI.
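As a rough check on the 1 GW figure, here is the arithmetic for a 400K-chip NVL72 campus; the per-rack power and overhead multiplier are my assumptions, not numbers from the post.

```python
# Rough check that ~400K chips in NVL72 racks is a ~1 GW campus.
chips = 400_000
chips_per_rack = 72
rack_power_kw = 130      # assumed ~130 kW IT load per GB200 NVL72 rack
pue = 1.2                # assumed overhead for cooling, networking, storage, etc.

racks = chips / chips_per_rack
campus_power_gw = racks * rack_power_kw * 1e3 * pue / 1e9
print(f"{racks:.0f} racks, ~{campus_power_gw:.2f} GW campus power")
```

That comes out to roughly 5,600 racks and about 0.9 GW, consistent with describing the 2026 system as a 1 GW campus.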
On the other hand, sufficiently high bandwidth networking between different datacenter campuses effectively makes them act as a single AI training system (especially with RLVR), so as established datacenter builders, Amazon and Meta will probably have enough compute with sufficient inter-datacenter bandwidth between their locations, and then the main question is whether they'll be willing to use enough of it for training AIs. For xAI, the question is the ability to actually raise the necessary $30-35bn, which needs to happen already this year for the training system to be built by 2026.
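For a sense of scale on the inter-campus bandwidth requirement, here is a rough gradient-synchronization estimate; the model size, gradient precision, and step time are illustrative assumptions of mine, and RLVR is even more forgiving than this, since generating rollouts only needs occasional weight broadcasts rather than per-step gradient exchange.

```python
# Rough estimate of cross-campus bandwidth needed to synchronize gradients once per
# optimizer step, treating two campuses as halves of one data-parallel training system.
params = 2e12            # assumed parameter count of the model being trained
bytes_per_grad = 2       # assumed BF16 gradients
step_time_s = 30         # assumed time per optimizer step at large batch size

grad_bytes = params * bytes_per_grad
required_tbps = grad_bytes * 8 / step_time_s / 1e12
print(f"~{required_tbps:.1f} Tbps sustained between campuses to exchange gradients every step")
```

On these assumptions the requirement is on the order of a Tbps of sustained long-haul bandwidth, which is the kind of fiber capacity established datacenter operators already provision between their sites.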
There are no contenders that I know of for securing 2026 frontier AI training compute (among companies trying to build AGI) outside these 5 companies, though there is a tiny chance that Huawei might make this happen for some Chinese company if, suddenly and very surprisingly, their chip supply problem gets resolved this year.