cs.AI updates on arXiv.org, July 30, 12:11
MemShare: Memory Efficient Inference for Large Reasoning Models through KV Cache Reuse

This paper presents MemShare, a KV cache management approach based on a collaborative filtering algorithm. It effectively reduces the memory overhead of Large Reasoning Models (LRMs) and improves inference efficiency while preserving reasoning accuracy.

arXiv:2507.21433v1 Announce Type: cross Abstract: Large Reasoning Models (LRMs) have achieved significant advances in mathematical reasoning and formal logic tasks. However, their tendency to generate lengthy chain-of-thought sequences leads to substantial memory overhead during inference. We observe that LRMs frequently produce highly similar intermediate reasoning steps, which correspond to similar KV cache states across layers. Motivated by this observation, we propose MemShare, a novel KV cache management approach that effectively reduces memory overhead. MemShare employs a collaborative filtering algorithm to efficiently identify reusable KV cache blocks and enables zero-copy cache reuse, improving throughput while maintaining accuracy. Experimental results demonstrate that MemShare delivers up to 84.79% higher throughput while maintaining better accuracy than existing KV cache management methods.
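To make the core idea concrete, here is a minimal, illustrative sketch of block-level KV cache reuse in the spirit the abstract describes: detect highly similar cache blocks and map new blocks onto existing ones by index instead of copying them. This is not the paper's algorithm; all names (KVBlockPool, find_reusable_block), the block size, and the similarity threshold are assumptions, and plain cosine similarity over per-block signatures stands in for the collaborative filtering scoring the authors use.

```python
# Hypothetical sketch of KV cache block reuse; not MemShare's actual implementation.
import numpy as np

BLOCK_TOKENS = 16      # tokens per cache block (assumed)
SIM_THRESHOLD = 0.98   # similarity cutoff for reuse (assumed)

class KVBlockPool:
    """Stores KV cache blocks and hands out indices; new blocks that are
    sufficiently similar to a stored block are mapped to that block's
    index ("zero-copy" reuse) instead of being stored again."""
    def __init__(self):
        self.blocks = []        # list of np.ndarray, each (BLOCK_TOKENS, d)
        self.signatures = []    # per-block summary vectors used for matching

    def _signature(self, block):
        # Cheap per-block summary: mean over the token dimension.
        return block.mean(axis=0)

    def find_reusable_block(self, block):
        """Return the index of a sufficiently similar stored block, or None.
        Cosine similarity on signatures is used here as a stand-in for a
        collaborative-filtering style score."""
        if not self.blocks:
            return None
        sig = self._signature(block)
        sigs = np.stack(self.signatures)                       # (n, d)
        sims = sigs @ sig / (np.linalg.norm(sigs, axis=1)
                             * np.linalg.norm(sig) + 1e-8)
        best = int(np.argmax(sims))
        return best if sims[best] >= SIM_THRESHOLD else None

    def add_or_reuse(self, block):
        """Insert a block, or return an existing block's index if one is
        similar enough to be shared."""
        idx = self.find_reusable_block(block)
        if idx is not None:
            return idx, True          # reused: no new memory allocated
        self.blocks.append(block)
        self.signatures.append(self._signature(block))
        return len(self.blocks) - 1, False

if __name__ == "__main__":
    # Near-duplicate reasoning steps map to the same physical block.
    rng = np.random.default_rng(0)
    pool = KVBlockPool()
    base = rng.standard_normal((BLOCK_TOKENS, 64)).astype(np.float32)
    noisy = base + 1e-4 * rng.standard_normal(base.shape).astype(np.float32)
    print(pool.add_or_reuse(base))    # (0, False): stored as a new block
    print(pool.add_or_reuse(noisy))   # (0, True): reused, nothing copied
```

The memory saving comes from the second call: the near-duplicate block is never materialized in the pool, only referenced, which is the effect the paper's zero-copy reuse aims for at the level of attention KV tensors.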

Related tags

Large Reasoning Models, KV cache management, memory optimization, collaborative filtering, inference efficiency