LessWrong · August 8, 2024
What is the cost difference in processing input vs. output tokens with LLMs?


This post explores the cost difference between input and output tokens. It notes that input-token processing scales quadratically while output tokens scale linearly thanks to the KV cache, and that in practice the bottleneck for output tokens is memory capacity and bandwidth. The author is confused about why input tokens are priced lower than output tokens, and wants to estimate the processing cost of a single token.

🎯 Input-token processing scales quadratically: attention must be computed between every token and every other token (obtaining K and V by passing tokens through the model), which makes the processing expensive.

💾 Output tokens scale linearly thanks to the KV cache, which spends memory to save compute; clever storage and computation tricks (sparsity, cache eviction, etc.) are needed, but in practice the bottleneck for output tokens is memory capacity and bandwidth.

🤔 The author is puzzled that input tokens are almost universally 2 to 5 times cheaper than output tokens, and is unsure whether linear pricing is evidence that providers are memory-bound. If they were FLOPS-bound, pricing should be quadratic; perhaps linear pricing is just for client-pricing simplicity or marketing.

📈 From a LeptonAI tweet, the author learned that there are usually 3 to 10 times more input tokens than output tokens. If input tokens dominate the sequence and FLOPS were the issue, pricing should reflect that, but it is unclear what role this ratio plays in the calculations.

Published on August 8, 2024 10:43 AM GMT

Hi,

I am trying to understand the difference in the cost of producing a single input token vs output token.

Based on some articles, I came to the following conclusion:

- Input-token processing scales quadratically; there's no way around it: you have to compute attention (K and V, by passing tokens through the model) between each token and every other token.
- Output tokens scale linearly thanks to the KV cache (otherwise they would be quadratic too, as input tokens are), which is what everyone seems to do when hosting these models. I.e., you spend memory to save compute, and try to be clever about how you store and compute all of this (sparsity, cache eviction, ...). I believe this is simply an empirically practical way around the quadratic scaling.
- Therefore, realistically, it's memory capacity and bandwidth that are the problem for output tokens, rather than raw FLOPS. The KV cache grows linearly with sequence length regardless of the input/output token ratio.
- I am confused why input tokens are almost universally cheaper than output tokens, by 2 to 5 times. Also, is the fact that providers price tokens linearly (you pay the same for the 11th token as for the 10,001st) evidence that memory is the problem here? If instead they were FLOPS-bound, I would expect providers to price things quadratically, not linearly. Or is it just for client-pricing simplicity/marketing?
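The quadratic-vs-linear distinction above can be sketched with rough attention-only FLOP counts. This is a minimal sketch under simplifying assumptions (decoder-only transformer, counting only the QK^T and attention-weighted-V matmuls, ignoring MLP and projection FLOPs and hardware efficiency); the function names and parameter values are illustrative, not from the post.

```python
def prefill_attention_flops(n_input: int, d_head: int, n_heads: int, n_layers: int) -> int:
    """Processing the prompt: every token attends to every token, so the
    QK^T and attention*V matmuls cost O(n^2 * d) per layer (2 FLOPs per
    multiply-accumulate, two matmuls)."""
    per_layer = 2 * 2 * n_input * n_input * d_head * n_heads
    return n_layers * per_layer

def decode_step_attention_flops(seq_len: int, d_head: int, n_heads: int, n_layers: int) -> int:
    """Generating one output token with a KV cache: only the new token's
    query attends to seq_len cached keys/values, so one step is O(seq_len * d)."""
    per_layer = 2 * 2 * seq_len * d_head * n_heads
    return n_layers * per_layer
```

Doubling the prompt length quadruples prefill attention FLOPs, while doubling the context length only doubles the cost of a single decode step, which is the scaling asymmetry the bullets describe.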

I want to be able to estimate the cost of processing a single token, but I cannot wrap my head around this. I made theoretical estimates based on GPU rental prices, and separately based on power consumption (assuming some utilization, such as 10%), and I believe I somehow need to differentiate between input and output tokens here.
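The GPU-rent estimate described above reduces to simple arithmetic. A back-of-envelope sketch follows; every number in it is an illustrative assumption (the rental rate, throughput, and utilization are made up, not the author's figures), and it deliberately does not split input from output tokens, which is exactly the gap the post is asking about.

```python
# Back-of-envelope cost per token from GPU rental price.
# All numbers below are assumed for illustration.
gpu_hourly_rate = 2.00     # USD/hour to rent one GPU (assumption)
tokens_per_second = 1_000  # aggregate throughput across batched requests (assumption)
utilization = 0.10         # fraction of time doing useful work (assumption from the post)

effective_tokens_per_hour = tokens_per_second * 3600 * utilization
cost_per_million_tokens = gpu_hourly_rate / effective_tokens_per_hour * 1_000_000
print(f"~${cost_per_million_tokens:.2f} per million tokens")
```

A more faithful estimate would use separate throughput figures for prefill (input) and decode (output), since a GPU processes prompt tokens in large parallel batches but generates output tokens one step at a time.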

In a tweet from LeptonAI, which hosts these LLMs, I also saw that there are usually 3-10 times more input tokens than output tokens. Again, if input tokens dominate the sequence and FLOPS were the issue, I would expect the pricing to reflect that. I am not sure what role this ratio plays in these calculations so far.

Any help is appreciated, thanks!



