Musings on LLM Scale (Jul 2024)
AI model training costs are climbing fast and have already reached hundreds of millions of dollars. According to Dario Amodei, some models now in training cost more than $1 billion, and training costs are expected to exceed $10 billion by 2025-2027. The article breaks down what goes into training cost and considers the possibility of future runs exceeding $100 billion.

💥 **Training costs are climbing sharply:** AI model training already costs hundreds of millions of dollars, and some models currently in training cost more than $1 billion. Costs are expected to keep rising over the next few years, possibly exceeding $10 billion by 2025-2027 and eventually approaching $100 billion. The main drivers are ever-larger models and datasets and the corresponding demand for compute: a $1 billion budget buys roughly 25,000 H100s, which draw a great deal of power and must run for a long time. Researchers are exploring more efficient training algorithms and more capable hardware to bring costs down, but for now training cost remains a major bottleneck for AI development.

🚀 **Training at very large scale:** Training more capable AI models requires enormous amounts of compute. Some large technology companies already operate dedicated training clusters, such as Meta's new training clusters and Google's SuperPods, which provide tens of thousands of H100-class accelerators along with the power and network bandwidth needed for large training runs. As models keep growing, clusters must grow with them, and in the future multiple clusters may need to work together to support a single large run. Beyond compute, the quantity and quality of training data are key to model performance, so researchers need to collect large, high-quality datasets and preprocess and clean them carefully.

💡 **Future trends:** Training costs will keep rising, but training efficiency will improve as well. With continued progress in hardware and new training methods, costs may be kept under control and new breakthroughs may emerge. For example, some researchers are exploring asynchronous training algorithms that can exploit multiple, loosely connected pools of compute, and some companies are building chips optimized for AI training and inference, such as Google's TPUs. Lowering training cost and raising training efficiency will remain an important research direction for pushing AI forward.

🤔 **Balancing cost and capability:** Large technology companies face a central trade-off in model training: they must invest heavily to train more capable models while weighing cost against practical value. Some companies may choose to train smaller models to keep costs down while still covering many applications; others may train larger models for higher performance at much greater expense. Training cost is becoming an important competitive factor, and companies need to find the right balance to stay ahead in AI.

Published on July 3, 2024 6:35 PM GMT

In a recent interview, Dario Amodei claimed that the cost of training is (starting with models already available):

Right now, $100 million. There are models in training today that are more like a $1 billion. I think if we go to $10 or a $100 billion, and I think that will happen in 2025-2026, maybe 2027, ...

(Epistemic status: Fermi estimates, 8 is approximately 10 which is greater than 9.)

Assuming $40,000 per H100 and associated infrastructure in a datacenter, $1 billion gives 25K H100s, which matches the scale of, for example, Meta's new training clusters and requires about 40MW of power. At $2 per GPU-hour, the time cost of 25K H100s reaches $100 million in 80 days, which seems reasonable if on the short side for a production training run. The cost of time reaches $1 billion at 2.3 years. An H100 (SXM) is rated for 2e15 FLOP/s in BF16 (my impression is that this is usually stable out of the box). This becomes 4e15 FLOP/s in FP8, which seems practical if done carefully, with no degradation in pre-training loss compared to FP32. The $100 million run then translates to 9e25 FLOPs at 30% utilization in BF16, or 2e26 FLOPs in FP8. (For some reason this SemiAnalysis estimate is 2x lower, peak 2e20 FLOP/s for 100,000 H100s at FP8; possibly the sparsity footnote in the H100 specification for the 4000 teraFLOP/s figure is the culprit.)
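These Fermi estimates are easy to reproduce. Below is a minimal sketch in Python using the figures above ($40K per installed H100, $2 per GPU-hour, 2e15/4e15 FLOP/s peak in BF16/FP8, 30% utilization); the constants are the post's assumptions, not measured values.

```python
# Fermi estimate for a ~$1 billion H100 cluster, using the post's assumptions.
COST_PER_H100 = 40_000        # $ per H100 including datacenter infrastructure
HOURLY_RATE = 2.0             # $ per GPU-hour
BF16_PEAK = 2e15              # FLOP/s per H100 (SXM), dense BF16
FP8_PEAK = 4e15               # FLOP/s per H100, dense FP8
UTILIZATION = 0.3             # assumed model FLOP/s utilization

cluster_budget = 1e9
n_gpus = cluster_budget / COST_PER_H100           # 25,000 H100s
daily_cost = n_gpus * HOURLY_RATE * 24            # ~$1.2M per day of cluster time

days_for_100m = 1e8 / daily_cost                  # ~83 days for a $100M run
years_for_1b = 1e9 / daily_cost / 365             # ~2.3 years to spend $1B on time

run_seconds = days_for_100m * 86_400
flops_bf16 = n_gpus * BF16_PEAK * UTILIZATION * run_seconds   # ~1e26 (post: ~9e25)
flops_fp8 = n_gpus * FP8_PEAK * UTILIZATION * run_seconds     # ~2e26

print(f"{n_gpus:.0f} GPUs, {days_for_100m:.0f} days, "
      f"{flops_bf16:.1e} BF16 FLOPs, {flops_fp8:.1e} FP8 FLOPs")
```

The small gap against the 9e25 figure comes from rounding 83 days down to 80; at Fermi precision the two agree.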

This is maybe 10x original GPT-4, estimated at 2e25 FLOPs. The leading models (Claude 3.5 Sonnet, Gemini 1.5 Pro, GPT-4 Omni) cost $15-20 per million output tokens, compared to $75-120 for once-frontier models Claude 3 Opus, Gemini 1 Ultra, original GPT-4. Given a Chinchilla optimal model, if we reduce its active parameters 3x and increase training compute 3x, we get approximately the same performance, but it's now at least 3x cheaper for inference. This increases data 10x, which if everything else fails can be obtained by repeating the old data, giving 30x overtraining in compute compared to what is Chinchilla optimal for the smaller model. Llama-3-70b is overtrained 10x, Llama-3-8b 90x, though they don't use MoE and their performance is lower than for MoE models with the same active parameters and training cost.
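To make the overtraining arithmetic explicit, here is the same bookkeeping under the usual Chinchilla approximations (training compute C ≈ 6·N·D, with compute-optimal data proportional to parameter count); the factor of 6 and the strict proportionality are simplifying assumptions, not claims from the post.

```python
# Chinchilla-style bookkeeping: C = 6 * N * D, with the compute-optimal
# amount of data D_opt scaling proportionally to parameter count N.
N, D = 1.0, 1.0                 # original Chinchilla-optimal model (normalized units)
C = 6 * N * D

N_small = N / 3                 # 3x fewer active parameters
C_big = 3 * C                   # 3x more training compute
D_big = C_big / (6 * N_small)   # data grows 9x (the post rounds this to 10x)

D_opt_small = D * (N_small / N)             # Chinchilla-optimal data for the smaller model
C_opt_small = 6 * N_small * D_opt_small     # Chinchilla-optimal compute for it
overtraining = C_big / C_opt_small          # 27x (the post rounds this to 30x)

print(D_big / D, overtraining)  # 9.0 27.0
```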

Beyond $100 million

The current frontier models are overtrained on compute that could enable even smarter models. Compute is increasing, but it mostly goes to reduction of inference cost, and only a little bit to capabilities. Why aren't any of the three labs directing the compute to train/release models optimized for maximum capability? Possibly costs are already such that training at too many parameter/data tradeoff points won't be done; instead, they choose the option that's currently most useful and spend the rest on experiments that would make imminent larger-scale runs better. Even OpenAI's next frontier model in training as of May 28 might just be using compute comparable to what GPT-4 Omni required, not OOMs more, and it could still get much more capable if allowed to be more expensive for inference.

To do a run at $1 billion in cost of time, even 100K H100s would need 200 days (powered by 150MW). There probably aren't any individual clusters of this scale yet (which would cost about $4 billion). The Gemini 1.0 report stated that

Training Gemini Ultra used a large fleet of TPUv4 accelerators owned by Google across multiple datacenters. ... we combine SuperPods in multiple datacenters using Google’s intra-cluster and inter-cluster network. Google’s network latencies and bandwidths are sufficient to support the commonly used synchronous training paradigm, exploiting model parallelism within superpods and data-parallelism across superpods.

This, together with Amodei's claim of current $1 billion training runs and individual 100K H100 clusters still getting built, suggests that training using multiple clusters is possible, and that individual clusters are not the crucial bottleneck for the scale of training runs. The claim about the feasibility of $10 billion training runs by the end of 2025 would also make even less sense otherwise.

So Microsoft's 5GW Stargate datacenter that starts construction in 2028 (and might come into operation in the 2030s) is not the relevant anchor for timelines of scaling. There is currently on the order of 3GW in datacenters for each hyperscaler, with plans to double it. A $10 billion run over 200 days needs 1.5 GW and invests 2e28 FLOPs of FP8 compute, 1000x original GPT-4, and in this framing the end of 2025 no longer looks completely impossible. There are also more speculative asynchronous training algorithms that might at some point help with making use of poorly connected islands of compute.
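As a sanity check on both the 100K-H100 figures above and this $10 billion scenario, here is the same style of Fermi arithmetic; the ~1.5 kW per installed H100 is implied by the post's 40MW for 25K GPUs and is an assumption rather than a spec value.

```python
# Scaling the earlier estimates to a $1B time-cost run on 100K H100s
# and to a $10B run over 200 days, using the post's assumptions.
HOURLY_RATE = 2.0          # $ per GPU-hour
KW_PER_GPU = 1.5           # kW per installed H100, implied by 40MW / 25K GPUs
FP8_PEAK = 4e15            # FLOP/s per H100, dense FP8
UTILIZATION = 0.3

# $1 billion of cluster time on 100K H100s.
gpus_100k = 100_000
days_for_1b = 1e9 / (gpus_100k * HOURLY_RATE * 24)       # ~208 days (post: ~200)
power_100k_mw = gpus_100k * KW_PER_GPU / 1e3             # ~150 MW
capex_100k = gpus_100k * 40_000                          # ~$4B of cluster hardware

# $10 billion of cluster time over 200 days.
days = 200
gpus_10b = 10e9 / (days * 24 * HOURLY_RATE)              # ~1.0M H100s
power_10b_gw = gpus_10b * KW_PER_GPU / 1e6               # ~1.6 GW (post: ~1.5 GW)
flops_10b = gpus_10b * FP8_PEAK * UTILIZATION * days * 86_400   # ~2e28 FP8 FLOPs
print(flops_10b / 2e25)                                  # ~1000x original GPT-4
```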



