Musings on AI Companies of 2025-2026 (Jun 2025)

This post looks at the handful of companies worldwide that hold frontier compute and are actively pursuing artificial general intelligence (AGI). It analyzes the challenges these companies face in acquiring and deploying large-scale compute, GPUs in particular, and predicts which of them are likely to lead the compute race in 2025 and 2026. It also discusses the scaling requirements of large-model training, the importance of interconnect, and how financing and infrastructure buildout shape AI development.

🚀 Only a few companies hold frontier AI training compute: currently just 5 companies worldwide (Google DeepMind, OpenAI, Anthropic, xAI, and Meta) have frontier AI training compute and are actively developing AGI.

💰 Compute costs are growing rapidly: in 2024, frontier compute meant 100K H100 chips at roughly $5-7bn; in 2025 it means 100K chips in GB200 NVL72 racks at $7-11bn; and in 2026, OpenAI's Stargate Abilene project is expected to require 400K chips at a total cost of $35-45bn.

🌐 The importance of scaled-up training systems: the emergence of RLVR (training of thinking models) puts new scaling demands on AI training systems. Large systems of tightly interconnected AI accelerators can act like a single very large chip, which is critical for training speed.

⏳ The challenge in 2026: by 2026, financing and infrastructure buildout become the key bottlenecks. Beyond the cost of compute itself, new non-compute infrastructure such as 1 GW of power supply and liquid cooling must be built, which takes at least 9 months.

💡 The potential competitive landscape: while only a few companies have frontier compute today, companies with strong datacenter footprints such as Amazon and Meta could indirectly assemble enough compute by increasing the bandwidth between their datacenters. Huawei could also become a contender, though its chip supply remains constrained.

Published on June 20, 2025 5:14 PM GMT

Currently, only 5 companies in the world have access to frontier AI training compute and are also pursuing development of AGI (Google DeepMind, OpenAI, Anthropic, xAI, and Meta). This will still hold in 2026 for Google and OpenAI, and plausibly also for Anthropic, Meta, and xAI.

A company's stance towards trying to develop AGI can change, but the frontier AI training compute barrier is increasingly insurmountable for any company that doesn't already have impressive AI development accomplishments. In 2024, frontier compute was 100K H100s, which cost about $5-7bn (it was still possible to use legacy air cooling infrastructure with H100s). In 2025, it's 100K chips in GB200 NVL72 racks, which costs $7-11bn. In 2026, OpenAI's Stargate Abilene sets the lower bound at 400K chips in NVL72 racks (GB200 or possibly GB300), a 1 GW datacenter campus that costs $35-45bn (though you can continue building out the 2025 system, so only $30-35bn on top of that).
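As a rough sanity check, here's the implied all-in cost per chip for each of those systems (a back-of-envelope sketch of the figures above; "all-in" includes networking, buildings, power, and cooling, not just the accelerator):

```python
# Implied all-in cost per chip for each year's frontier training
# system (chip counts and dollar ranges are the figures above).
systems = {
    "2024: 100K H100":            (100_000, 5e9, 7e9),
    "2025: 100K GB200 (NVL72)":   (100_000, 7e9, 11e9),
    "2026: 400K chips (Abilene)": (400_000, 35e9, 45e9),
}

for name, (chips, lo, hi) in systems.items():
    print(f"{name}: ${lo / chips:,.0f}-${hi / chips:,.0f} per chip all-in")
# 2024: $50,000-$70,000; 2025: $70,000-$110,000; 2026: $87,500-$112,500
```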

For 2025, Musk said on a recent podcast that "30K GB200s" are already installed at xAI's original Memphis site, and that an additional "110K GB200s" will shortly be installed at a new Memphis site (at 31:35). (Counting "GB200s" is a bit ambiguous, since in various contexts it can refer to a single compute die, to a chip/package containing 2 compute dies, or to a board carrying 2 such chips.) OpenAI of course has phase 1 of Stargate Abilene, which is 100K chips in GB200 NVL72 racks (2 out of 8 buildings planned for completion in summer 2026) that are either already online or will be coming online imminently. Anthropic has Project Rainier, which is 400K Trainium 2 chips with the FLOP/s of about 250K H100s, the same as 100K Blackwell (GB200) chips. Meta can afford a $7-11bn training system, and given their recent moves has the willingness to spend.
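A minimal sketch of the arithmetic behind the Project Rainier comparison, using approximate per-chip dense BF16 throughputs (the H100 and GB200 numbers are rounded public specs; the Trainium 2 number is an assumption consistent with the ~250K-H100 equivalence):

```python
# Approximate per-chip dense BF16 throughput, in FLOP/s.
H100      = 1.0e15   # ~989 TFLOP/s dense BF16, rounded
TRAINIUM2 = 0.65e15  # assumed, consistent with public estimates
GB200     = 2.5e15   # 2 Blackwell compute dies per chip

rainier = 400_000 * TRAINIUM2
print(f"Project Rainier ~ {rainier / H100:,.0f} H100-equivalents")   # ~260,000
print(f"Project Rainier ~ {rainier / GB200:,.0f} GB200-equivalents") # ~104,000
```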

RLVR and Large World Size

There is likely a new constraint on AI training systems starting in 2025-2026, if RLVR (training of thinking models) scales to use as much GPU-time as frontier AI pretraining. A scale-up world is a system of AI accelerators with sufficiently good interconnect between them that for some purposes it can act as a single very large chip. If a large reasoning model doesn't fit in one or very few such scale-up worlds, it will generate tokens much more slowly than if it does, and having more compute in the form of small scale-up world systems won't particularly help.

Google historically has TPUs organized in very large scale-up world systems with 3D topology; for example, the Gemini 2.5 report says it was trained on TPU-v5p pods, which are scaled up from chips with individual 0.5e15 BF16 FLOP/s compute, only half of what an H100 does. The H100s themselves, by contrast, only form 8-chip scale-up worlds, which are too small for many current frontier models. For Nvidia, GB200 NVL72 is the first scale-up world system large enough to hold frontier LLMs (with the number of parameters that is compute optimal to train with 2024 compute), and additionally its all-to-all topology might be more convenient for some purposes than the 3D topology of the TPU systems. Notably, in Apr 2025 Huawei announced their CloudMatrix 384 system, a 7nm-chip based scale-up world system larger than even GB200 NVL72, though China still can't produce enough HBM or compute dies domestically for that to make a difference, and the chips they do have are unlikely to take the form of a single frontier AI training system, even though in principle there is enough to match the capabilities of the 2026 form of Stargate Abilene. And Amazon's Trainium 2 from Project Rainier that Anthropic uses has a relatively small scale-up world, only a bit larger than that of H100.
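A toy model of why scale-up world size matters for token generation: decoding is roughly memory-bandwidth-bound, since every generated token streams all the weights through HBM once, and only chips within the same scale-up world can usefully shard that streaming. The model size and per-chip specs below are illustrative assumptions, not figures from any particular frontier model:

```python
# Toy decode-speed model: tokens/s per sequence is bounded by the
# scale-up world's aggregate HBM bandwidth divided by model bytes,
# and only applies at all if the weights fit in the world's memory.
PARAM_BYTES = 2e12  # hypothetical ~2T-param dense model at 1 byte/param (FP8)

worlds = {  # name: (chips, HBM GB per chip, HBM TB/s per chip)
    "8-chip H100 node": (8, 80, 3.35),
    "GB200 NVL72 rack": (72, 192, 8.0),
}

for name, (n, gb, tbs) in worlds.items():
    capacity_bytes = n * gb * 1e9
    bandwidth = n * tbs * 1e12
    if PARAM_BYTES > capacity_bytes:
        print(f"{name}: weights don't fit ({capacity_bytes / 1e12:.2f} TB of HBM)")
    else:
        print(f"{name}: ~{bandwidth / PARAM_BYTES:,.0f} tokens/s, bandwidth-bound")
```

On these assumptions the 8-chip H100 node can't even hold the weights, while the NVL72 rack both fits them and streams them fast; adding more 8-chip nodes wouldn't change the first result.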

So in terms of large scale-up world sizes, only Google and OpenAI clearly have access to frontier amounts of AI training compute in 2025. Anthropic plausibly will have access to enough GB200 NVL36/NVL72 because of their relationship with Amazon, and Meta can afford and seems willing to make this happen, though it's unclear if they can get their act together quickly enough after the recent changes for 2025. And xAI will also likely be able to go through with their "110K GB200s" move (though that might turn out to be only 55K chips), as even if they haven't already raised the necessary money, they almost certainly will.

2026

For 2026, the difficulty of financing a frontier AI training system grows, so again only Google and OpenAI will clearly have enough compute (and it's not completely clear that Google is willing to dedicate enough of it to actual training to match Stargate Abilene of 2026). Also, the 2026 systems need 1 GW of power and use liquid cooling, so new non-compute infrastructure also needs to be built for them specifically, which as Stargate Abilene demonstrates takes at least 9 months (plus 2-3 months to install the compute hardware). So the 2026 systems need to start being built already, and it's not clear that even Amazon/Anthropic is doing that, let alone Meta or xAI.
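A quick latest-start calculation, counting back from a hypothetical online date (the 9-month and 2-3-month figures are from above; the mid-2026 target is an assumption):

```python
# Counting back 9 months of power/cooling buildout plus ~3 months of
# compute hardware installation from an assumed July 2026 online date.
BUILD_MONTHS, INSTALL_MONTHS = 9, 3
target_year, target_month = 2026, 7  # assumption: online by mid-2026

month = target_month - (BUILD_MONTHS + INSTALL_MONTHS)
year = target_year
while month <= 0:
    month += 12
    year -= 1
print(f"construction must start by ~{year}-{month:02d}")  # ~2025-07
```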

On the other hand, sufficiently high bandwidth networking between different datacenter campuses effectively makes them act as a single AI training system (especially with RLVR), so as established datacenter builders, Amazon and Meta will probably have enough compute with sufficient inter-datacenter bandwidth between their locations, and then the main question is whether they'll be willing to use enough of it for training AIs. For xAI, the question is the ability to actually raise the necessary $30-35bn, which needs to happen this year already for the training system to be built by 2026.
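A rough sketch of why RLVR in particular relaxes the bandwidth requirement: rollout workers at a remote campus mostly need a periodic refresh of the weights, not per-step gradient exchange. Both numbers below are assumptions for illustration:

```python
# Sustained cross-campus bandwidth needed just to rebroadcast updated
# weights to remote rollout workers every sync interval.
PARAM_BYTES   = 2e12  # hypothetical ~2T-param model at 1 byte/param
SYNC_INTERVAL = 600   # assume rollout weights refresh every 10 minutes

gbit_per_s = PARAM_BYTES * 8 / SYNC_INTERVAL / 1e9
print(f"~{gbit_per_s:,.0f} Gbit/s sustained per campus link")  # ~27 Gbit/s
```

On these assumptions the requirement is modest by datacenter-interconnect standards; synchronous pretraining across campuses would demand far more.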

There are no contenders for securing 2026 frontier AI training compute (that are trying to build AGI) outside these 5 companies that I know of, though there is a tiny fraction of a chance that Huawei might make this happen for some Chinese company if suddenly and very surprisingly their chip supply problem gets resolved this year.


