MarkTechPost@AI August 2, 2024
Salesforce AI Introduces ‘ThinK’: A New AI Method that Exploits Substantial Redundancy Across the Channel Dimension of the KV Cache


🤔 **ThinK is a novel KV cache pruning method that treats the task as an optimization problem, minimizing the attention weight loss caused by pruning.** The method is founded on key observations from visualizations of the LLaMA3-8B model: key cache channels show varying magnitudes of significance, while the value cache lacks clear patterns. Singular value decomposition of the attention matrices shows that a few singular values carry most of the energy, indicating the low-rank nature of the attention mechanism. These insights suggest that the key cache can be effectively approximated with low-dimensional vectors. ThinK uses these findings to develop an efficient pruning strategy targeting the channel dimension of the key cache, potentially reducing memory consumption while preserving model performance.

💡 **ThinK introduces a query-dependent criterion for assessing channel importance and greedily selects the critical channels.** The method evaluates channel importance from the interaction between query and key vectors, then uses a greedy algorithm to select the most important channels, preserving the primary information flow in the attention computation.

🚀 **ThinK demonstrates its effectiveness on two major benchmarks: LongBench and Needle-in-a-Haystack.** ThinK successfully prunes key cache channels while maintaining or slightly improving the performance of LLaMA3-8B. For Mistral-7B, it reduces memory with minimal performance impact. Query-dependent channel pruning (ThinK) outperforms l1- and l2-norm-based pruning methods, especially at a 40% pruning ratio. Performance tends to be better with smaller pruning ratios and larger KV cache sizes. On the Needle-in-a-Haystack test, ThinK maintains or improves accuracy compared to SnapKV at a 40% pruning ratio across different KV cache sizes.

🎯 **ThinK is an effective, model-agnostic method for further optimizing KV cache compression, improving memory efficiency with only minor performance trade-offs.** ThinK is a promising advance in optimizing large language models for long-context scenarios. By introducing query-dependent channel pruning for the key cache, this innovative method achieves a 40% reduction in cache size while maintaining or even improving performance. ThinK's compression strategy is robust across a variety of tasks and models, offering a cost-effective solution for handling long-context scenarios.

📊 **ThinK focuses on long-context scenarios and uses an observation window to reduce computational cost.** ThinK maintains two categories of keys in the KV cache: pruned keys with reduced channel size and unpruned keys at the original size. A binary mask tracks the pruned channels. During decoding, the pruned keys are either zero-filled and concatenated with the unpruned keys, or the query is pruned before multiplication with the corresponding keys. This approach can be integrated with optimization techniques such as FlashAttention, potentially offering better computational efficiency while maintaining model performance.

Large Language Models (LLMs) have revolutionized natural language processing, demonstrating exceptional performance across various tasks. The Scaling Law suggests that as model size increases, LLMs develop emergent abilities, enhancing their context understanding and long sequence handling capabilities. This growth enables LLMs to generate coherent responses and power applications like document summarization, code generation, and conversational AI. However, LLMs face significant challenges in terms of cost and efficiency. The expenses associated with LLM generation escalate with increasing model size and sequence length, affecting both the training and inference stages. Additionally, managing long sequences presents computational burdens due to the quadratic complexity of the transformer attention mechanism, which scales poorly with sequence length. These challenges necessitate the development of efficient LLM architectures and strategies to reduce memory consumption, particularly in long-context scenarios.

Existing researchers have pursued various approaches to address the computational challenges posed by LLMs, particularly in long-context scenarios. KV cache eviction methods like StreamingLLM, H2O, SnapKV, and FastGen aim to reduce memory usage by selectively retaining or discarding tokens based on their importance. PyramidKV and PyramidInfer propose adjusting KV cache sizes across different layers. KV cache quantization techniques, such as SmoothQuant and Q-Hitter, compress the cache while minimizing performance loss. Some studies suggest different quantization strategies for key and value caches. Structured pruning of LLMs has also been explored, focusing on removing unimportant layers, heads, and hidden dimensions. However, these methods often result in significant performance degradation or fail to exploit potential optimizations fully. 

Researchers from Salesforce AI Research and The Chinese University of Hong Kong propose ThinK, a unique KV cache pruning method that approaches the task as an optimization problem to minimize attention weight loss from pruning. It introduces a query-dependent criterion for assessing channel importance and selects critical channels greedily. The method is founded on key observations from LLaMA3-8B model visualizations: key cache channels show varying magnitudes of significance, while value cache lacks clear patterns. The singular value decomposition of attention matrices reveals that few singular values carry high energy, indicating the attention mechanism’s low-rank nature. These insights suggest that key cache can be effectively approximated using low-dimensional vectors. ThinK utilizes these findings to develop an efficient pruning strategy targeting the key cache’s channel dimension, potentially reducing memory consumption while preserving model performance.
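As a rough illustration of the low-rank observation (not the authors' code), a short sketch like the following could be used to measure how much of an attention matrix's singular-value energy is concentrated in its top singular values; the matrix `attn` and the cutoff `top_k` are illustrative assumptions.

```python
# Illustrative sketch (not the paper's code): check how much "energy" the top
# singular values of an attention matrix carry, mirroring the low-rank observation.
import torch

def singular_energy(attn: torch.Tensor, top_k: int = 32) -> float:
    """Fraction of squared singular-value energy captured by the top_k singular values.

    attn: [seq_len, seq_len] attention weight matrix from a single head.
    """
    s = torch.linalg.svdvals(attn.float())   # singular values, in descending order
    energy = s.pow(2)
    return (energy[:top_k].sum() / energy.sum()).item()

# Example with a random low-rank-ish matrix standing in for real attention weights.
attn = torch.softmax(torch.randn(512, 64) @ torch.randn(64, 512), dim=-1)
print(f"Top-32 singular values carry {singular_energy(attn):.1%} of the energy")
```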

ThinK is an innovative method for optimizing the KV cache in LLMs by pruning the channel dimension of the key cache. The approach formulates the pruning task as an optimization problem, aiming to minimize the difference between original and pruned attention weights. ThinK introduces a query-driven pruning criterion that evaluates channel importance based on the interaction between the query and key vectors. This method uses a greedy algorithm to select the most important channels, preserving the primary information flow in the attention computation.
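A minimal sketch of this idea is shown below, assuming a per-channel score given by the norm of each channel's query-key interaction; the exact scoring rule, tensor layout, and function names are assumptions rather than the authors' released implementation.

```python
# Hedged sketch of query-driven key-channel pruning in the spirit of ThinK.
import torch

def select_key_channels(query: torch.Tensor, key: torch.Tensor, keep_ratio: float = 0.6):
    """Score each head-dimension channel by the magnitude of its query-key
    interaction and keep only the top-scoring channels.

    query: [num_q, head_dim]  queries (e.g. from a recent observation window)
    key:   [seq_len, head_dim]
    Returns (pruned_key, kept_idx, mask).
    """
    head_dim = key.shape[-1]
    num_keep = max(1, int(head_dim * keep_ratio))

    # Channel i contributes the rank-1 term Q[:, i] K[:, i]^T to Q K^T;
    # its Frobenius norm factorizes as ||Q[:, i]|| * ||K[:, i]||.
    scores = query.norm(dim=0) * key.norm(dim=0)            # [head_dim]

    kept_idx = torch.topk(scores, num_keep).indices.sort().values
    mask = torch.zeros(head_dim, dtype=torch.bool)
    mask[kept_idx] = True
    return key[:, kept_idx], kept_idx, mask
```

Because each channel contributes to Q K^T independently under this score, greedily picking the highest-scoring channels reduces to a simple top-k selection.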

The implementation focuses on long-context scenarios and employs an observation window to reduce computational costs. ThinK maintains two categories of keys in the KV cache: pruned keys with reduced channel size and unpruned keys at original size. A binary mask tracks pruned channels. During decoding, pruned keys are zero-filled and concatenated with unpruned keys, or the query is pruned before multiplication with the corresponding keys. This approach can be integrated with optimization techniques like FlashAttention, potentially offering improved computational efficiency while maintaining model performance.
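The two decoding-time options described above could be sketched as follows; tensor shapes, names, and the single-head layout are illustrative assumptions, not the paper's implementation. Both paths produce the same attention scores, since zero-filled channels contribute nothing to the dot product.

```python
# Hedged sketch of the two decoding-time options: zero-filling pruned keys
# versus pruning the query before the matmul. Layouts are assumptions.
import torch

def attn_scores_zero_fill(query, pruned_key, unpruned_key, mask):
    """Option 1: zero-fill pruned channels, then concatenate with unpruned keys.

    query:        [1, head_dim]       current decoding query
    pruned_key:   [n_old, num_kept]   older keys stored with reduced channels
    unpruned_key: [n_new, head_dim]   recent keys kept at full size
    mask:         [head_dim] bool     True where a channel was kept
    """
    n_old, head_dim = pruned_key.shape[0], mask.numel()
    restored = torch.zeros(n_old, head_dim,
                           dtype=pruned_key.dtype, device=pruned_key.device)
    restored[:, mask] = pruned_key                       # zero-filled pruned keys
    full_key = torch.cat([restored, unpruned_key], dim=0)
    return query @ full_key.T                            # [1, n_old + n_new]

def attn_scores_query_prune(query, pruned_key, unpruned_key, mask):
    """Option 2: prune the query to the kept channels before multiplication."""
    scores_old = query[:, mask] @ pruned_key.T           # reduced-dim product
    scores_new = query @ unpruned_key.T                  # full-dim product
    return torch.cat([scores_old, scores_new], dim=-1)
```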

The experimental results demonstrate the effectiveness of ThinK, a unique key cache pruning method, across two major benchmarks: LongBench and Needle-in-a-Haystack. Key findings include:

    ThinK successfully prunes key cache channels after applying existing compression methods (H2O and SnapKV), reducing memory usage while maintaining or slightly improving performance on LLaMA3-8B. For Mistral-7B, it reduces memory with minimal performance impact.
    Query-based channel pruning (ThinK) outperforms l1- and l2-norm-based pruning methods, especially at a 40% pruning ratio. Performance tends to be better with smaller pruning ratios and larger KV cache sizes; with a KV cache size of 2048 and 40% pruning, ThinK can even outperform full KV cache models in some cases.
    On the Needle-in-a-Haystack test, ThinK maintains or improves accuracy compared to SnapKV at a 40% pruning ratio across different KV cache sizes. Higher pruning ratios (≥50%) show some accuracy drops, particularly with smaller cache sizes.
    Visualizations of the Needle-in-a-Haystack results demonstrate ThinK’s robustness in maintaining retrieval capabilities across various token lengths and depths.

These results suggest that ThinK is an effective, model-agnostic method for further optimizing KV cache compression, offering improved memory efficiency with minimal performance trade-offs.

ThinK emerges as a promising advancement in optimizing Large Language Models for long-context scenarios. By introducing query-dependent channel pruning for the key cache, this innovative method achieves a 40% reduction in cache size while maintaining or even improving performance. ThinK’s compatibility with existing optimization techniques and its robust performance across various benchmarks, including LongBench and Needle-in-a-Haystack tests, underscore its effectiveness and versatility. As the field of natural language processing continues to evolve, ThinK’s approach to balancing efficiency and performance addresses critical challenges in managing computational resources for LLMs. This method not only enhances the capabilities of current models but also paves the way for more efficient and powerful AI systems in the future, potentially revolutionizing how we approach long-context processing in language models.


Check out the Paper. All credit for this research goes to the researchers of this project.

