LessWrong · August 8, 2024
What is the cost difference in processing input vs. output tokens with LLMs?


This post explores the cost difference between input and output tokens. It notes that input-token processing scales quadratically while output tokens scale linearly thanks to the KV cache, and that in practice the bottleneck for output tokens is memory capacity and bandwidth. The author is confused about why input tokens are priced lower than output tokens, and wants to estimate the processing cost of a single token.

🎯 Input-token processing scales quadratically: attention must be computed between every token and every other token (obtaining K and V by passing tokens through the model), which makes the processing expensive.

💾 Output tokens scale linearly thanks to the KV cache, which spends memory to save compute; clever storage and computation tricks (sparsity, cache eviction, etc.) are needed, but in practice the bottleneck for output tokens is memory capacity and bandwidth.

🤔 The author is puzzled that input tokens are almost universally 2 to 5 times cheaper than output tokens, and is unsure whether linear pricing is evidence that providers are memory-bound. If they were FLOPS-bound, pricing should be quadratic; perhaps linear pricing is just for client-pricing simplicity or marketing.

📈 From a LeptonAI tweet, the author learned that there are usually 3 to 10 times more input tokens than output tokens. If input tokens dominate the sequence and FLOPS were the issue, pricing should reflect that, but it is unclear what role this ratio plays in the calculations.

Published on August 8, 2024 10:43 AM GMT

Hi,

I am trying to understand the difference in the cost of producing a single input token vs output token.

Based on some articles, I came to the following conclusion:

- Input-token processing scales quadratically; there's no way around it: you have to compute attention (K and V, by passing tokens through the model) between each token and every other token.
- Output tokens scale linearly thanks to the KV cache (otherwise they would be quadratic too, as input tokens are), which is what everyone seems to do when hosting these models. I.e., you spend memory to save compute, and try to be clever about how you store and compute all of this (sparsity, cache eviction, ...). I believe this is simply an empirically practical way around the quadratic scaling.
- Therefore, realistically, it's memory capacity and bandwidth that are the problem for output tokens, rather than raw FLOPS. The KV cache grows linearly with sequence length regardless of the input/output token ratio.
- I am confused why input tokens are almost universally cheaper than output tokens, by 2 to 5 times. Also, is the fact that providers price tokens linearly (you pay the same for the 11th token as for the 10,001st) evidence that memory is the problem here? If instead they were FLOPS-bound, I would expect providers to price things quadratically, not linearly. Or is it just for client-pricing simplicity/marketing?
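The quadratic-vs-linear distinction above can be sketched with rough attention-only FLOP counts. This is a minimal sketch under simplifying assumptions (decoder-only transformer, counting only the QK^T and attention-weighted-V matmuls, ignoring MLP and projection FLOPs and hardware efficiency); the function names and parameter values are illustrative, not from the post.

```python
def prefill_attention_flops(n_input: int, d_head: int, n_heads: int, n_layers: int) -> int:
    """Processing the prompt: every token attends to every token, so the
    QK^T and attention*V matmuls cost O(n^2 * d) per layer (2 FLOPs per
    multiply-accumulate, two matmuls)."""
    per_layer = 2 * 2 * n_input * n_input * d_head * n_heads
    return n_layers * per_layer

def decode_step_attention_flops(seq_len: int, d_head: int, n_heads: int, n_layers: int) -> int:
    """Generating one output token with a KV cache: only the new token's
    query attends to seq_len cached keys/values, so one step is O(seq_len * d)."""
    per_layer = 2 * 2 * seq_len * d_head * n_heads
    return n_layers * per_layer
```

Doubling the prompt length quadruples prefill attention FLOPs, while doubling the context length only doubles the cost of a single decode step, which is the scaling asymmetry the bullets describe.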

I want to be able to estimate the cost of processing a single token, but I cannot wrap my head around this. I made theoretical estimates based on GPU rental prices, and separately based on power consumption (assuming some utilization, such as 10%), and I believe I somehow need to differentiate between input and output tokens here.
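The GPU-rent estimate described above reduces to simple arithmetic. A back-of-envelope sketch follows; every number in it is an illustrative assumption (the rental rate, throughput, and utilization are made up, not the author's figures), and it deliberately does not split input from output tokens, which is exactly the gap the post is asking about.

```python
# Back-of-envelope cost per token from GPU rental price.
# All numbers below are assumed for illustration.
gpu_hourly_rate = 2.00     # USD/hour to rent one GPU (assumption)
tokens_per_second = 1_000  # aggregate throughput across batched requests (assumption)
utilization = 0.10         # fraction of time doing useful work (assumption from the post)

effective_tokens_per_hour = tokens_per_second * 3600 * utilization
cost_per_million_tokens = gpu_hourly_rate / effective_tokens_per_hour * 1_000_000
print(f"~${cost_per_million_tokens:.2f} per million tokens")
```

A more faithful estimate would use separate throughput figures for prefill (input) and decode (output), since a GPU processes prompt tokens in large parallel batches but generates output tokens one step at a time.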

In a tweet from LeptonAI, which hosts these LLMs, I also saw that there are usually 3-10 times more input tokens than output tokens. Again, if input tokens dominate the sequence and FLOPS were the issue, I would expect the pricing to reflect that. I am not sure what role this ratio plays in these calculations so far.

Any help is appreciated, thanks!



