How much does a 10 million token context window actually cost?

 

This article examines Meta’s Llama 4 Scout model and its claimed ultra-long context window of 10 million tokens. While the technical breakthrough is eye-catching, the article argues that such a large context window is not as practical as it might seem. It analyzes the challenges ultra-long context windows pose for information retrieval, training cost, and deployment, highlighting their enormous demands on compute resources and budget. It also touches on the engineering challenges and the secrecy surrounding how companies achieve these context windows, and concludes that even when a model supports an ultra-long context window, cost constraints may keep it from being fully used in practice.

🧠 Limits of ultra-long context windows: you can feed the model huge amounts of information, but it struggles to accurately retrieve facts from a 10-million-token context and is biased toward information at the beginning and end.

💰 Training costs are immense: training an ultra-long-context model requires non-standard attention mechanisms to keep computation tractable, and even then training costs an enormous amount of money; Llama 4 Scout’s training is estimated to have cost hundreds of millions of dollars.

⚙️ Memory limits on deployment: serving a model with a 10-million-token context window demands enormous memory, with the KV cache becoming the primary bottleneck; a single deployment may need 7 H100 GPUs, making server costs very high.

🤫 Secrecy around the details: companies generally do not disclose how they achieve ultra-long context windows; the community can infer that it takes huge amounts of compute and training time, but the actual methods remain secret.

Meta is touting Llama 4 Scout as having an industry-leading context window of 10 million tokens. Social media was eating this up, talking about how you can fit multiple textbooks into the context window and pull information from them. While that’s true, the large context window really isn’t as useful as everyone has made it out to be.

Yes, 10 million is a lot of tokens and achieving a 10 million token context window for a production model is a feat (check out MagicAI’s 100 million token context window for kicks), but a large context window isn’t infallible and being able to achieve 10 million tokens comes at a great cost.

Society's Backend is a machine learning newsletter for software engineers. Subscribe to get articles like this in your inbox.

What you should know

A large context window isn’t infallible.

You can feed multiple textbooks into a large context window, but it's very difficult for models to accurately recall information from within a context window that is 10 million tokens long. This means you can give the model as much info as you’d like but that doesn’t mean you’ll get the output you’re hoping for.

Models seem to struggle to retrieve information from the middle of the given input and bias their output toward the information at the beginning and end of the context.
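
To make that concrete, here’s a minimal sketch of the kind of “needle in a haystack” probe people use to test long-context recall: bury a single fact at different depths in a long filler document and check whether the model can repeat it back. The filler text, needle, and depths are made up for illustration, and the model call is left as a placeholder since it depends on whichever API you use.

```python
# Minimal "needle in a haystack" probe sketch (illustrative values only).
FILLER = "The sky was a uniform gray and nothing of note happened. " * 20_000
NEEDLE = "The secret passphrase is 'magenta-otter-42'."
QUESTION = "What is the secret passphrase?"

def build_prompt(depth: float) -> str:
    """Insert the needle at a fractional depth (0.0 = start, 1.0 = end)."""
    cut = int(len(FILLER) * depth)
    return FILLER[:cut] + " " + NEEDLE + " " + FILLER[cut:] + "\n\n" + QUESTION

for depth in (0.0, 0.25, 0.5, 0.75, 1.0):
    prompt = build_prompt(depth)
    print(f"depth={depth:.2f}, prompt length={len(prompt):,} characters")
    # answer = call_your_model(prompt)        # hypothetical model call
    # print("magenta-otter-42" in answer)     # did it recall the needle?
```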

Training costs are immense.

The traditional self-attention mechanism also becomes computationally prohibitive with so many input tokens. It calculates attention scores between all pairs of input tokens (which is O(n²) for all you engineers), so the cost grows quadratically as n scales up. This necessitates the use of non-standard attention mechanisms to make massive context windows computationally viable.
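
As a rough illustration of that quadratic growth (my own numbers, nothing from Meta), here’s what the pairwise score count looks like at a few sequence lengths:

```python
# Rough illustration of O(n^2) scaling: the number of query-key pairs a
# standard self-attention layer scores, per layer and per head.
for n in (8_000, 256_000, 10_000_000):
    pairs = n * n  # every token attends to every token
    print(f"{n:>12,} tokens -> {pairs:.3e} attention scores")
```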

Even with these optimizations, training is expensive. The quadratic term is significant when a model is trained with 256,000 input tokens for each forward pass (this is an estimate based on the Llama 4 information), and model training takes a long time when the model is trained on over 40 trillion tokens.

Without more info from Meta, it’s difficult to estimate the exact training time and cost, but it wouldn’t surprise me if training Llama 4 Scout took tens or hundreds of millions of H100 GPU-hours and cost Meta well over $500 million (probably closer to the $1 billion mark).
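
For a sense of how you get to figures like that, here’s the back-of-the-napkin arithmetic, with both the GPU-hour counts and the per-hour prices treated as assumptions rather than known numbers:

```python
# Back-of-the-napkin training cost: GPU-hours x price per GPU-hour.
# Both inputs are assumptions on my part, not figures from Meta.
for gpu_hours in (5e7, 1e8, 2.5e8):      # "tens to hundreds of millions" of H100-hours
    for dollars_per_hour in (2.0, 4.0):  # assumed H100 cost (amortized vs. cloud-ish)
        cost_millions = gpu_hours * dollars_per_hour / 1e6
        print(f"{gpu_hours:.0e} GPU-hours at ${dollars_per_hour:.0f}/hr -> ~${cost_millions:,.0f}M")
```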

Utilizing the 10 million token context window comes with significant serving constraints.

Whether or not a model has a 10 million token context window in practice comes down to more than just enabling the context window via training and model architecture. The company hosting the model also needs to make the memory requirements for such a context window available to users.

Those memory requirements are massive. As the context window of an LLM increases, the LLM stores the key and value vectors for every token it has processed in its key-value (KV) cache, so the KV cache grows roughly linearly as the number of tokens increases.

In fact, the required memory of the KV cache for a 10 million token input is so large that the KV cache becomes the primary memory constraint instead of the size of model parameters.

Let’s do some back-of-the-napkin math to understand how much serving a model with this context window costs:

First, our KV cache size equation:

KV Cache = 2 * number_of_layers * sequence_length * number_kv_heads * head_dimension * bytes_per_element

Second, extracting those values by extrapolating from Llama 3 (couldn’t find these for Llama 4 Scout, so let me know if they’re available somewhere):

number_of_layers = 40

head_dimension = 128

number_kv_heads = 8

bytes_per_element = 0.5 (4-bit precision or 0.5 bytes)

Third, calculating our KV cache cost per token:

= 2 * number_of_layers * number_kv_heads * head_dimension * bytes_per_element

= 2 * 40 * 8 * 128 * 0.5 = 40960 bytes = 40.96 KB

Fourth, our KV cache cost per 1 million tokens:

= cost_per_token * 1 million

= 40,960 bytes * 1 million = 40,960,000,000 bytes = 40.96 GB per million

Fifth, our KV cache cost per 10 million tokens:

= cost per million * 10

= 40,960,000,000 bytes * 10 = 409,600,000,000 bytes = 409.60 GB
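
Here’s the same napkin math wrapped up as a small Python helper, using the Llama-3-extrapolated values above (assumptions, not published Llama 4 Scout numbers):

```python
# KV cache size = 2 (keys + values) * layers * tokens * kv_heads * head_dim * bytes/element
def kv_cache_bytes(tokens: int,
                   num_layers: int = 40,
                   num_kv_heads: int = 8,
                   head_dim: int = 128,
                   bytes_per_element: float = 0.5) -> float:
    """Approximate KV cache size in bytes for a given number of cached tokens."""
    return 2 * num_layers * tokens * num_kv_heads * head_dim * bytes_per_element

print(f"per token:      {kv_cache_bytes(1) / 1e3:.2f} KB")           # ~40.96 KB
print(f"per 1M tokens:  {kv_cache_bytes(1_000_000) / 1e9:.2f} GB")   # ~40.96 GB
print(f"per 10M tokens: {kv_cache_bytes(10_000_000) / 1e9:.2f} GB")  # ~409.60 GB
```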

This means Llama 4 Scout with the full 10 million token context window won’t be running on a consumer GPU, and fitting it into a single H100 (80 GB of VRAM) would require aggressive engineering: even more quantization, a shorter context length, and more.

This means that to serve one model at a 10 million token context window, the company hosting Llama 4 Scout would likely need 7 H100s (storing the KV cache in memory along with ~55 GB of model weights at 4-bit), which would require a server with 8 H100s since that’s how they’re usually made available.

Finally, estimating our cost based on compute requirements:

Looking across GCP, Azure, and AWS, a server of this size would cost ~$90/hour. Compare this to the ~$10/hr cost of a single H100, which Llama 4 Scout can fit on if the context window is reduced.
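
Putting the pieces together, here’s a hedged sketch of the serving math using the same rough figures from above (the ~55 GB weight footprint, 80 GB per H100, and the ~$90/hr and ~$10/hr prices are all estimates, not vendor quotes):

```python
import math

# Rough serving footprint and cost, reusing the napkin figures above.
kv_cache_gb = 409.6     # 10M-token KV cache from the calculation above
weights_gb = 55.0       # ~4-bit Llama 4 Scout weights (estimate)
h100_vram_gb = 80.0

# ~6 GPUs from this arithmetic alone; the article budgets 7 once activations
# and other runtime overhead are accounted for.
gpus_needed = math.ceil((kv_cache_gb + weights_gb) / h100_vram_gb)
print(f"H100s for weights + KV cache alone: {gpus_needed}")

full_server_per_hr = 90.0   # ~$/hr for an 8x H100 server (rough cloud figure)
single_gpu_per_hr = 10.0    # ~$/hr for a single H100 (rough cloud figure)
print(f"10M-token deployment: ~${full_server_per_hr:.0f}/hr vs. "
      f"reduced-context single GPU: ~${single_gpu_per_hr:.0f}/hr "
      f"({full_server_per_hr / single_gpu_per_hr:.0f}x)")
```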

This is a rudimentary calculation. Hosting Llama 4 Scout with a large context window would likely cost a company less than the numbers above if they were using their own servers, but these numbers are a good estimate.

The lesson here is that not just training but also hosting an LLM with a large context window is expensive. Just because a model is capable of a large context window doesn’t mean it’ll be served with that window.

The real tragedy of the Llama series is that the models have gotten so large they’re basically incapable of running on a single consumer GPU, which is what made the Llama models so beloved in the AI community in the first place.

Companies keep their context window strategies close to the chest.

Companies generally don’t share how they achieve their large context windows and don’t get too specific with details. This is why my numbers above are just estimates (especially training costs).

However, the community can reason pretty well about how such a context window could be achieved, even though the actual methodology is kept secret. The one thing we do know is that it takes a lot of compute and training time to make happen.


Just writing about this is making me excited. This really is what machine learning engineering is fundamentally about: making machine learning work in practice.

As always I’m open to any questions and let me know if there’s an error in my work.

Always be (machine) learning,

Logan
