Society's Backend · May 23, 23:58
How much does a 10 million token context window actually cost?

This article takes a close look at Meta's Llama 4 Scout, a model that claims an industry-leading context window of 10 million tokens. It argues that the practical value of such a large context window is likely overstated, analyzing training costs, inference costs, and memory requirements to show just how hard it is to realize and actually use a window this large. The conclusion: even if a model supports a huge context window, deploying and using it is so expensive that the window is difficult to exploit in practice.

🧠 Limits of a huge context window: although Llama 4 Scout offers a 10 million token context window, the model struggles to accurately retrieve information from the middle of such long inputs and tends to favor content at the beginning and end.

💰 Training is expensive: training large language models (LLMs) is extremely costly. Training Llama 4 Scout may have cost tens or even hundreds of millions of dollars and consumed enormous amounts of GPU compute.

💾 Inference cost and memory requirements: a large context window creates enormous memory demands, particularly for the KV cache, whose size grows linearly with the number of tokens. This drives up serving costs: running Llama 4 Scout at 10 million tokens may require multiple H100 GPUs, at a server cost of roughly $90 per hour.

🤫 Secrecy around the technical details: companies generally don't disclose how they achieve large context windows, which makes it hard to assess the costs and techniques precisely. What is clear is that a large context window takes a great deal of compute and training time to achieve.

Meta is boasting Llama 4 Scout as having an industry-leading context window of 10 million tokens. Social media was eating this up and talking about how you can fit multiple textbooks into the context window and get information from them. While that’s true, the large context window really isn’t as useful as everyone has made it out to be.

Yes, 10 million is a lot of tokens and achieving a 10 million token context window for a production model is a feat (check out MagicAI’s 100 million token context window for kicks), but a large context window isn’t infallible and being able to achieve 10 million tokens comes at a great cost.

Society's Backend is a machine learning newsletter for software engineers. Subscribe to get articles like this in your inbox.

What you should know

A large context window isn’t infallible.

You can feed multiple textbooks into a large context window, but it's very difficult for models to accurately recall information from within a context window that is 10 million tokens long. This means you can give the model as much info as you’d like but that doesn’t mean you’ll get the output you’re hoping for.

Models seem to struggle at retrieving information from the middle of the given input and bias their output toward the information at the beginning and end of the context.

Training costs are immense.

The traditional self-attention mechanism also becomes computationally prohibitive with so many input tokens. It calculates attention scores between all pairs of input tokens (which is O(n²) for all you engineers), so the work blows up quickly as n scales. This necessitates the use of non-standard attention mechanisms to make massive context windows computationally viable.
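To make the quadratic term concrete, here's a minimal Python sketch of how the number of pairwise attention scores grows with sequence length. The token counts are illustrative assumptions, not Meta's actual training configuration.

```python
# Rough illustration of how pairwise attention-score counts (and hence
# compute) grow quadratically with sequence length. Token counts below
# are illustrative assumptions, not Meta's training config.
baseline = 8_000
for n in (8_000, 256_000, 10_000_000):
    pairs = n * n  # O(n^2) attention scores per layer per head
    print(f"{n:>12,} tokens -> {pairs:.3e} scores "
          f"({(n / baseline) ** 2:,.0f}x the 8K baseline)")
```

Going from an 8K to a 10M token context means over a million times more pairwise scores per layer, which is why vanilla self-attention isn't an option at this scale.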

Even with these optimizations, training is expensive. The quadratic term is significant when a model sees 256,000 input tokens in each forward pass (an estimate based on the Llama 4 information), and training takes a long time when the model is trained on over 40 trillion tokens.

Without more info from Meta, it’s difficult to estimate the exact training time and cost, but it wouldn’t surprise me if training Llama 4 Scout took tens or hundreds of millions of H100 GPU-hours and cost Meta well over $500 million (probably closer to the $1 billion mark).

Utilizing the 10 million token context window comes with significant serving constraints.

Whether or not a model has a 10 million token context window in practice comes down to more than just enabling the context window via training and model architecture. The company hosting the model also needs to make the memory requirements for such a context window available to users.

Those memory requirements are massive. As the context window of an LLM increases, the LLM will store more relationships between tokens in its key-value (KV) cache. The KV cache scales roughly linearly as the number of tokens increases.

In fact, the required memory of the KV cache for a 10 million token input is so large that the KV cache becomes the primary memory constraint instead of the size of model parameters.

Let’s do some back-of-the-napkin math to understand how much serving a model with this context window costs (a Python sketch of the same calculation appears below):

First, our KV cache size equation:

KV Cache = 2 * number_of_layers * sequence_length * number_kv_heads * head_dimension * bytes_per_element

Second, extracting those values by extrapolating from Llama 3 (couldn’t find these for Llama 4 Scout, so let me know if they’re available somewhere):

number_of_layers = 40

head_dimension = 128

number_kv_heads = 8

bytes_per_element = 0.5 (4-bit precision or 0.5 bytes)

Third, calculating our KV cache cost per token:

= 2 * number_of_layers * number_kv_heads * head_dimension * bytes_per_element

= 2 * 40 * 8 * 128 * 0.5 = 40960 bytes = 40.96 KB

Fourth, our KV cache cost per 1 million tokens:

= cost_per_token * 1 million

= 40,960 bytes * 1 million = 40,960,000,000 bytes = 40.96 GB per million

Fifth, our KV cache cost per 10 million tokens:

= cost per million * 10

= 40,960,000,000 bytes * 10 = 409,600,000,000 bytes = 409.60 GB

This means Llama 4 Scout with the 10 million token context window won’t be running on a consumer GPU, and even fitting it into a single H100 (80 GB of VRAM) would require aggressive engineering: more quantization, a shorter context length, and more.

This means that to serve one instance of the model at a 10 million token context window, the company hosting Llama 4 Scout would likely need 7 H100s (storing the KV cache in memory along with ~55 GB of model weights at 4-bit), which in practice means a server with 8 H100s since that’s how they’re usually made available.

Finally, estimating our cost based on compute requirements:

Looking across GCP, Azure, and AWS, a server of this size would cost ~$90/hour. Compare this to a single H100 at ~$10/hour, which Llama 4 Scout can fit on if the context window is reduced.

This is a rudimentary calculation. Hosting Llama 4 Scout with a large context window would likely cost a company less than the numbers above if they were using their own servers, but these numbers are a good estimate.
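For anyone who wants to rerun these numbers with their own assumptions, here's the same back-of-the-napkin math as a short Python sketch. The layer count, head dimensions, weight footprint, and GPU sizes are the estimates used above (extrapolated from Llama 3), not confirmed Llama 4 Scout figures.

```python
import math

# Back-of-the-napkin KV cache estimate for a 10M-token context window.
# All values below are the article's estimates (extrapolated from
# Llama 3), not confirmed Llama 4 Scout specs.
num_layers = 40
num_kv_heads = 8
head_dim = 128
bytes_per_element = 0.5        # 4-bit precision
context_tokens = 10_000_000

# KV cache per token = 2 (K and V) * layers * kv_heads * head_dim * bytes
kv_bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_element
kv_cache_gb = kv_bytes_per_token * context_tokens / 1e9

model_weights_gb = 55          # rough 4-bit weight footprint assumed above
h100_vram_gb = 80
total_gb = kv_cache_gb + model_weights_gb

# Minimum GPU count by raw VRAM alone; real deployments also need
# headroom for activations and runtime overhead, which pushes this higher.
min_gpus = math.ceil(total_gb / h100_vram_gb)

print(f"KV cache per token:    {kv_bytes_per_token / 1e3:.2f} KB")
print(f"KV cache @ 10M tokens: {kv_cache_gb:.1f} GB")
print(f"KV cache + weights:    {total_gb:.1f} GB (>= {min_gpus} H100s by VRAM alone)")
```

With these inputs the total lands around 465 GB before any runtime overhead, which is why the realistic deployment target is an 8x H100 server at roughly $90/hour rather than a single ~$10/hour GPU.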

The lesson here is that not just training, but hosting an LLM with a large context window is expensive. Just because a model is capable of a large context window doesn’t mean it’ll be served with that window.

The real tragedy of the Llama series is that the models have gotten so large they basically can’t run on a single consumer GPU anymore, which is exactly what made the Llama models so beloved in the AI community in the first place.

Companies keep their context window strategies close to the chest.

Companies generally don’t share how they achieve their large context windows and don’t get too specific with details. This is why my numbers above are just estimates (especially training costs).

However, the community can reason pretty well about how such a context window could be achieved, even though the actual methodology is kept secret. What we do know is that it takes a lot of compute and training time to pull off.


Just writing about this is making me excited. This really is what machine learning engineering is fundamentally about: making machine learning work in practice.

As always I’m open to any questions and let me know if there’s an error in my work.

Always be (machine) learning,

Logan

