Slowdown After 2028: Compute, RLVR Uncertainty, MoE Data Wall

This article analyzes the challenges facing AI along three dimensions: compute, data, and reasoning training. On compute, training compute is growing rapidly in the short term, but growth is constrained by funding and is expected to slow. On data, high-quality natural text is nearing exhaustion. On reasoning training, current methods may mostly be eliciting capabilities already present in the base model rather than creating new ones. Together, these factors could push AI development into a bottleneck after 2030.

📈 Compute bottleneck: It may take until ~2050 to repeat this decade's scale-up of pretraining compute. Funding cannot sustain the current pace indefinitely; if AI does not deliver a transformative commercial success, compute growth will slow markedly after 2029.

📚 Data exhaustion: Natural text data is expected to run out around 2029. Even if repeating data mitigates this somewhat, it will be hard to meet the needs of future, larger models, especially MoE models, which demand even more data.

🧠 Limits of reasoning training: Current reasoning-training methods such as RLVR may mostly be eliciting capabilities from the base model rather than creating new ones. Research shows that finetuning on a small number of carefully chosen reasoning traces achieves results similar to RLVR, suggesting it is essentially capability elicitation.

📉 Data inefficiency of MoE: MoE models are more compute-efficient than dense models but need more data to train compute-optimally. A high-sparsity MoE model, e.g. 1:32, may need ~6x more tokens than a dense model to reach the optimum, accelerating the exhaustion of natural text data.

Published on May 1, 2025 1:54 PM GMT

It'll take until ~2050 to repeat the level of scaling that pretraining compute is experiencing this decade, as increasing funding can't sustain the current pace beyond ~2029 if AI doesn't deliver a transformative commercial success by then. Natural text data will also run out around that time, and there are signs that current methods of reasoning training might be mostly eliciting capabilities from the base model.

If scaling of reasoning training doesn't bear out actual creation of new capabilities that are sufficiently general, and pretraining at ~2030 levels of compute together with the low-hanging fruit of scaffolding doesn't bring AI to crucial capability thresholds, then it might take a while. Possibly decades, since training compute will be growing 3x-4x slower after 2027-2029 than it does now, so the ~6 years of scaling since the ChatGPT moment stretch out over the subsequent 20-25 years, without access to any more natural text data than the pretraining of this decade is already putting to use.

Training Compute Slowdown

GPT-4 was pretrained in 2022 using ~24K A100 chips (0.3e15 BF16 FLOP/s per chip, 7.2e18 FLOP/s total) for ~2e25 FLOPs[1]. The current batch of frontier models (Grok 3, GPT-4.5) was pretrained in 2024 using ~100K H100 chips (1e15 BF16 FLOP/s per chip, 1e20 FLOP/s total), possibly for about 3e26 BF16 FLOPs[2] (or ~2x more in FP8). The Abilene site of Crusoe/Stargate/OpenAI will have 400K-500K Blackwell chips in NVL72 racks in 2026 (2.5e15 BF16 FLOP/s per 2-die chip/package, ~1.1e21 FLOP/s total), enough to pretrain models for about 4e27 BF16 FLOPs.
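As a rough sanity check, these training-run FLOP estimates follow from cluster FLOP/s, utilization, and run length. A minimal sketch, assuming the ~40% utilization and ~3.5-month duration from footnote 2 (the actual utilization and duration of each run are not public):

```python
# Rough conversion from cluster specs to total pretraining FLOPs.
# Assumes ~40% compute utilization and a ~3.5-month run (see footnote 2);
# the real utilization and duration of each run are not public.

SECONDS_PER_MONTH = 30 * 24 * 3600

def pretraining_flops(chips, flops_per_chip, utilization=0.4, months=3.5):
    """Total training FLOPs for a cluster of `chips` accelerators."""
    cluster_flops = chips * flops_per_chip  # peak BF16 FLOP/s of the cluster
    return cluster_flops * utilization * months * SECONDS_PER_MONTH

print(f"GPT-4, 24K A100:       {pretraining_flops(24_000, 0.3e15):.1e}")   # ~2.6e25
print(f"100K H100, 2024:       {pretraining_flops(100_000, 1e15):.1e}")    # ~3.6e26
print(f"Abilene, ~450K GB200:  {pretraining_flops(450_000, 2.5e15):.1e}")  # ~4.1e27
```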

Thus raw compute of a frontier training system is increasing about 160x in 4 years, or about 3.55x per year. Epoch AI estimates the trend of increasing price-performance of compute at 1.39x per year (adjusted for inflation). The 100K-H100 training systems cost about $4-5bn (all-in, not just chips), which is ~$45K per 1e15 FLOP/s chip, and GB200 NVL72 seems to be about $4M per 72-chip rack (again, all-in), which is ~$55K per 2.5e15 FLOP/s chip. Counting this as 2 years between the training systems, we get a 1.42x per year increase in price-performance, which fits the trend.
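The annualized rates quoted here can be rederived from the figures above; a minimal sketch (the dollar figures are the rough all-in estimates, not exact prices):

```python
# Re-derive the annualized growth rates quoted above.

# Raw compute of frontier training systems, 2022 -> 2026 (4 years):
raw_growth = 1.1e21 / 7.2e18              # ~150-160x
compute_per_year = raw_growth ** (1 / 4)  # ~3.5x per year

# Price-performance, 2024 (100K H100) -> 2026 (GB200 NVL72), 2 years:
h100_cost_per_flops = 45_000 / 1e15       # ~$45K per 1e15 FLOP/s chip, all-in
gb200_cost_per_flops = 55_000 / 2.5e15    # ~$55K per 2.5e15 FLOP/s chip, all-in
price_perf_per_year = (h100_cost_per_flops / gb200_cost_per_flops) ** (1 / 2)

print(f"raw compute:       {compute_per_year:.2f}x per year")    # ~3.5x
print(f"price-performance: {price_perf_per_year:.2f}x per year") # ~1.43x
```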

At $4M per rack (all-in), compute at the Abilene site will cost about $22-28bn to build, which is an increase in cost of about 2.35x per year over the ~$4-5bn systems of 2024. At this pace, a 2028 training system would need to cost $140bn, which is still borderline plausible. But by 2030 we would get to $770bn, which probably can't actually happen if AI doesn't cross enough capability thresholds by then.
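A minimal sketch of this cost extrapolation, taking ~$25bn as the midpoint of the $22-28bn Abilene estimate:

```python
# Extrapolate frontier training system cost at ~2.35x per year,
# starting from ~$25bn (midpoint of the $22-28bn Abilene estimate) in 2026.
cost_2026_bn = 25
growth_per_year = 2.35

cost_2028_bn = cost_2026_bn * growth_per_year ** 2  # ~$138bn, i.e. "about $140bn"
cost_2030_bn = cost_2026_bn * growth_per_year ** 4  # ~$760bn, i.e. "about $770bn"

print(f"2028: ~${cost_2028_bn:.0f}bn, 2030: ~${cost_2030_bn:.0f}bn")
```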

When funding stops increasing, the current pace of 3.55x per year (fueled by increasing funding) regresses to the 1.4x per year pace of improvement in price-performance of compute, which is 3.7x slower in the sense that the same multiplicative increase in compute takes about 3.7x longer. If the $140bn training systems of 2028 do get built, they'll each produce about 1.5e22 BF16 FLOP/s of compute, enough to train models for about 5e28 BF16 FLOPs.

Thus from 2022 to 2028, training systems (and pretrained models) might scale about 2,000x in FLOP/s. When increasing funding further is no longer feasible, the rate of scaling of 1.4x per year will take another 22 years to match this feat, increasing about 2,000x by year 2050.
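A minimal sketch of the catch-up arithmetic, assuming scaling after 2028 proceeds only at the 1.4x per year price-performance trend:

```python
import math

# Scaling of frontier training systems in FLOP/s, 2022 -> 2028:
scale_2022_2028 = 1.5e22 / 7.2e18  # ~2,100x over 6 years (about 3.55x per year)

# Years needed to repeat that at the 1.4x/year price-performance trend:
years_to_repeat = math.log(scale_2022_2028) / math.log(1.4)

print(f"2022-2028 scaling:           ~{scale_2022_2028:,.0f}x")
print(f"years to repeat at 1.4x/yr:  ~{years_to_repeat:.0f}")  # ~22-23, i.e. around 2050
```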

Bounded Potential of Thinking Training

The current publicly known method of long-horizon reasoning training was revealed in the DeepSeek-R1 report (GRPO, with multiple incremental improvements since then, such as DAPO). The collective term for these methods that might be catching on is RLVR, "RL with Verifiable Rewards".

So far, there is no public evidence that RLVR usefully scales beyond a very modest amount of training, or that the training generalizes far beyond competition math and coding. Most thinking models have only had a chance to demonstrate a first success in applying these methods, not yet sustained advancement beyond previous successes. Even for o1 and o3 from OpenAI, their base models are very likely different (GPT-4o and GPT-4.1). Only time will tell, or a paper about scaling laws that shows how well RLVR converts non-scarce inputs (such as compute) into capabilities. By 2028, this crucial uncertainty should mostly resolve.

There are recent results suggesting that RLVR is mostly elicitation, with potential bounded by the capabilities of the pretrained model. In the s1 paper, the authors construct a training set of 1K reasoning traces for SFT that is sufficient to lift the capabilities of instruct models to a level comparable to those directly trained with RLVR/GRPO. Even more strikingly, another recent paper shows that finetuning on merely 1-2 well-chosen reasoning traces is sufficient to transfer most of the benefits of RLVR, which seems conclusive in demonstrating its elicitation nature at the current scale.

There is a clear possibility that with more compute poured into RLVR, it'll start creating capabilities rather than only eliciting them, like RL did with AlphaZero. But this hasn't been publicly demonstrated to happen yet, and another recent paper gives some indication that it might fail to work. The authors measure pass@k performance[3] of models before and after RLVR, and the RL-trained models consistently do better at low k but worse at high k, so that the base model starts outperforming them after some point between 8 and 400 attempts in the benchmarks the paper uses (Figure 2).
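For reference, pass@k is usually computed with the standard unbiased estimator from the HumanEval paper rather than by literally drawing k attempts; a minimal sketch (the cited paper may differ in details):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n sampled attempts, c of which are correct.

    Standard HumanEval-style estimator: 1 - C(n-c, k) / C(n, k), i.e. one minus
    the probability that a random size-k subset of the attempts has no correct one.
    """
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Toy example (made-up numbers): 12 correct out of 400 sampled attempts.
print(pass_at_k(400, 12, 1))    # 0.03   (pass@1)
print(pass_at_k(400, 12, 100))  # ~0.97  (pass@100)
```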

The point at which the pass@k curves before and after RLVR training intersect seems remarkably stable for any given type of task (benchmark). It barely moves across multiple variations on GRPO (some of which mitigate the loss of entropy that it suffers from), or from applying training in the range of 150 to 450 steps (Figure 7). The point of intersection even moves lower with more training, suggesting that the performance of the base model at the intersection point with a weak RLVR model might remain an upper bound for the performance of a much stronger RLVR model. Since the reliability of the base model is not yet very high even at pass@400 for many important tasks, this kind of bound on capabilities would be crippling for RLVR's potential.

Data Inefficiency of MoE

A paper from Jan 2025 finally gave clear measurements of the scaling laws for MoE models at various levels of sparsity, directly comparing them to dense models. It turns out that even though MoE models are more compute efficient (it's possible to train a better model with less training compute), training them compute optimally[4] needs a lot more data than for dense models. Namely, at 1:8 (87%) sparsity, a compute optimal run would use ~3x more tokens per active param than a dense model, and at 1:32 (97%) sparsity it'll use ~6x more tokens per active param (Figure 12, left).

Famously, Chinchilla showed that about 20 tokens per param was compute optimal at 6e23 FLOPs, and more recently the Llama 3 405B report measured 40 tokens per param to be optimal with their dataset at 4e25 FLOPs. Larger datasets might continue this trend, since average data quality is going down, and for a 1:32 sparse MoE model (like DeepSeek-V3) this suggests that 240 tokens per param (the dense-optimal 40 times the ~6x sparsity multiplier) might well turn out to be optimal.

This means that natural text data will run out much faster than Chinchilla's 20 tokens per param used to suggest. At 240 tokens per param, even a 4e27 FLOPs model trained in 2026 would want 400T tokens, which exceeds even the unfiltered Common Crawl dataset (estimated to contain about 240T tokens prior to 2023, see Section 3.1). To some extent this can be mitigated by training for multiple epochs, repeating the same data: a May 2023 paper shows that repeating the data up to about 5 times is not meaningfully worse than using 5 times more unique data from the same distribution (Figure 4), with up to 15 repetitions remaining somewhat useful.

The recent Qwen 3 and Llama 4 Scout releases disclose using 36T and 40T token datasets in pretraining. Repeated 5 times, that gives up to 200T tokens, which would almost bridge the gap to the compute optimal 400T tokens for the hypothetical 1:32 sparse 4e27 FLOPs MoE model of 2026. At 1:8 sparsity instead, the compute optimal ratio might be about 120 tokens per param, asking for 280T tokens, which would be a little more manageable and only need 46T tokens repeated 6 times.

This completely breaks down with a 5e28 FLOPs model of 2028-2029 trained on a $140bn training system, which at 1:32 sparsity would be asking for 1,400T tokens. This might somewhat work as 100T tokens repeated 14 times, but probably not that well, and the compute will likely be put to other uses, such as training on video (assuming RLVR didn't scale to the point where it needs most of the 5e28 FLOPs of compute). It only gets worse as the scaling continues beyond 2030, so even with the same 2,000x increase in compute by ~2050, its straightforward impact on pretraining capabilities will be lower than what we are observing in 2022-2028, since there won't be any more natural text data to work with.
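As a sanity check on these token requirements, a minimal sketch assuming the usual C ≈ 6·N·D approximation for training compute (with N counting active params) and the compute-optimal tokens-per-active-param ratios discussed above; the unique-token dataset sizes are the ones mentioned in the previous paragraphs:

```python
import math

def compute_optimal_tokens(compute_flops, tokens_per_param):
    """Compute-optimal token count D, assuming C ~ 6 * N * D (N = active params)
    and a fixed compute-optimal ratio D = tokens_per_param * N."""
    n = math.sqrt(compute_flops / (6 * tokens_per_param))  # implied active params
    return tokens_per_param * n

# (label, training FLOPs, tokens per active param, unique tokens available)
scenarios = [
    ("4e27 FLOPs, 1:32 sparsity (2026)", 4e27, 240, 40e12),
    ("4e27 FLOPs, 1:8 sparsity (2026)",  4e27, 120, 46e12),
    ("5e28 FLOPs, 1:32 sparsity (2028)", 5e28, 240, 100e12),
]

for label, c, ratio, unique in scenarios:
    d = compute_optimal_tokens(c, ratio)
    print(f"{label}: ~{d/1e12:.0f}T tokens, "
          f"~{d/unique:.0f} epochs over {unique/1e12:.0f}T unique tokens")
```

This reproduces the ~400T, ~280T, and ~1,400T token figures above, along with the implied number of repetitions over the available unique data.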


  1. Various rumors and estimates give the range between 20K and 25K A100 chips, SemiAnalysis gives the 24K figure. ↩︎

  2. The estimates assume 40% compute utilization and training for 3.5 months. ↩︎

  3. A model solves a task at pass@k if at least one of k attempts succeeds (perhaps by chance). This is not very useful in practice, because outside an eval there is often no way to tell which solutions were correct. ↩︎

  4. Pretraining is compute optimal if it achieves as low loss/perplexity as possible using a fixed amount of training compute. In practice it's often useful to overtrain models, which wastes some compute but makes models smaller and thus cheaper to inference. ↩︎


