MarkTechPost@AI · 2 days ago, 16:20
NVIDIA Researchers Introduce Dynamic Memory Sparsification (DMS) for 8× KV Cache Compression in Transformer LLMs

Researchers from NVIDIA and the University of Edinburgh have proposed a new method called Dynamic Memory Sparsification (DMS), designed to improve the inference-time efficiency of Transformer models. By compressing the KV cache, DMS achieves 8× compression without degrading model accuracy, accelerating LLM inference, especially on reasoning-heavy tasks. Using minimal training and a delayed-eviction mechanism, the technique strikes a good balance between inference time and memory footprint, opening new possibilities for deploying LLMs in resource-constrained environments.

🧠 The KV cache is an inference bottleneck in Transformer models: Transformer-based models (e.g., GPT, LLaMA) use a KV cache to store representations of past tokens for autoregressive generation. This cache grows linearly with sequence length and the number of parallel threads, consuming large amounts of GPU memory and slowing down inference.

💡 What makes Dynamic Memory Sparsification (DMS) novel: DMS takes a hybrid approach, similar to traditional pruning methods, but compresses the KV cache with minimal training overhead (~1,000 steps) and a delayed-eviction mechanism. This design preserves important contextual information and avoids abrupt drops in accuracy.

🚀 Advantages of DMS: DMS achieves 8× KV cache compression with only a small amount of training, while preserving or even improving model performance on reasoning tasks. It outperforms existing techniques in both inference efficiency and peak memory usage, and it is easy to integrate: no architectural changes are required, making it well suited for retrofitting existing models.

As the demand for reasoning-heavy tasks grows, large language models (LLMs) are increasingly expected to generate longer sequences or parallel chains of reasoning. However, inference-time performance is severely limited by the memory footprint of the key–value (KV) cache, not just the number of tokens produced. In a recent paper, researchers from NVIDIA and the University of Edinburgh introduce Dynamic Memory Sparsification (DMS)—a data-efficient, retrofit-friendly method that compresses KV caches and unlocks inference-time hyper-scaling without degrading model accuracy.

The Bottleneck: KV Cache in Transformer Inference

Transformer-based models like GPT, LLaMA, and Qwen use KV caches to store past token representations for autoregressive generation. This cache grows linearly with sequence length and width (parallel threads), consuming large amounts of GPU memory and leading to slower inference due to frequent memory access.
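To see why this matters, the cache footprint can be estimated directly from the model configuration. The sketch below uses illustrative numbers for a hypothetical 7B-class model with an fp16 cache (the exact figures are assumptions, not taken from the paper):

```python
def kv_cache_bytes(num_layers, num_heads, head_dim, seq_len, batch, dtype_bytes=2):
    # Keys and values are each stored per layer, per head, per token:
    # 2 (K and V) * layers * heads * head_dim * seq_len * batch * bytes/element
    return 2 * num_layers * num_heads * head_dim * seq_len * batch * dtype_bytes

# Hypothetical 7B-class configuration: 32 layers, 32 heads of dim 128,
# a 32K context, 8 parallel sequences, fp16 (2 bytes per element).
gb = kv_cache_bytes(32, 32, 128, seq_len=32_768, batch=8, dtype_bytes=2) / 1024**3
print(f"KV cache: {gb:.0f} GiB")  # → 128 GiB, before any weights or activations
```

At these settings the cache alone dwarfs the memory of a single GPU, which is why linear growth in sequence length and batch width becomes the dominant cost.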

Existing techniques for KV cache optimization either rely on training-free heuristics—such as attention weight-based token eviction—or require heavy post-training retrofits like Dynamic Memory Compression (DMC). Both have significant downsides: the former tends to hurt accuracy, while the latter is computationally expensive.

Dynamic Memory Sparsification (DMS): Compression Without Compromise

Dynamic Memory Sparsification (DMS) addresses these limitations with a hybrid approach: it sparsifies the KV cache like traditional pruning methods, but with minimal training overhead (~1,000 steps) and delayed eviction, which retains tokens temporarily after they’re marked for removal. This design preserves important context information and avoids abrupt accuracy drops.

The core idea is to make eviction decisions differentiable during training using a Gumbel-sigmoid-based sampling mechanism. Tokens predicted for future eviction remain usable for a sliding window duration before being discarded, allowing the model to absorb their informational value more effectively.

Efficient Retrofitting with Minimal Data

Unlike DMC, which requires thousands of training steps and complex gradient-based optimization, DMS introduces no additional parameters per attention head. It reuses a small part of the attention mechanism (a single neuron) to predict eviction. This makes DMS ideal for retrofitting existing models without architectural changes.
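One way to picture a parameter-free eviction predictor (a hypothetical sketch; the paper's exact wiring may differ) is to let a single output of an existing attention projection double as the eviction logit, so no new weights are added:

```python
import numpy as np

def project_with_eviction(x, W_k):
    """Hypothetical sketch: the key projection maps to head_dim + 1
    outputs; the last column is repurposed as the eviction logit, so the
    attention head gains no additional parameters."""
    out = x @ W_k                  # (seq_len, head_dim + 1)
    keys = out[:, :-1]             # regular attention keys
    evict_logits = out[:, -1]      # repurposed "single neuron" predicts eviction
    return keys, evict_logits
```

Because the predictor is just a slice of an existing matrix multiply, retrofitting amounts to briefly fine-tuning the model with the eviction objective rather than training new modules.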

Empirical results show that with as few as 1K training steps, DMS can achieve 8× KV cache compression, preserving or even improving model performance across reasoning tasks.

Benchmark Results: Scaling Performance Without Scaling Cost

The research team tested DMS on reasoning-heavy benchmarks including AIME (competition math), GPQA (graduate-level science QA), and LiveCodeBench (code generation).

Across model sizes—Qwen-R1 1.5B, 7B, and 32B—DMS improved exact-match performance by 9.1 points on AIME, 7.6 on GPQA, and 9.6 on LiveCodeBench, all under the same memory and compute budgets.

When compared to top-performing baselines like Quest and TOVA, DMS consistently outperformed them in both KV cache read efficiency (runtime proxy) and peak memory usage, achieving better Pareto frontiers.

General-Purpose Utility

DMS also holds up in non-reasoning tasks. On short-context benchmarks like MMLU, GSM8K, and HellaSwag, DMS maintained performance at high compression ratios with minimal degradation (~3.5 points). On long-context tasks like Needle-in-a-Haystack and Variable Tracking, DMS even surpassed the vanilla models, suggesting its potential to mitigate issues like information over-squashing in long sequences.

Conclusion

In conclusion, Dynamic Memory Sparsification (DMS) presents a practical and scalable solution for enhancing the inference-time efficiency of Transformer-based language models. By intelligently compressing the KV cache with minimal retraining, DMS enables models to reason over longer sequences or in parallel without increasing runtime or memory demands. Its consistent gains across a range of reasoning and general-purpose tasks highlight its versatility and effectiveness. As LLMs are increasingly deployed in resource-constrained environments, DMS offers a compelling path forward—balancing compression, accuracy, and ease of integration for real-world inference workloads.


Check out the Paper. All credit for this research goes to the researchers of this project.

