MarkTechPost@AI July 24, 2024
Apple Researchers Propose LazyLLM: A Novel AI Technique for Efficient LLM Inference in Particular under Long Context Scenarios

LazyLLM is a novel AI technique designed to accelerate large language model (LLM) inference, especially in long-context scenarios, by selectively computing the KV cache for important tokens and deferring computation for less relevant ones. By using attention scores from earlier layers to assess token importance, LazyLLM delivers significant speedups without sacrificing accuracy, and it performs well across a range of language tasks, including question answering, summarization, and code completion.

🤔 LazyLLM is a novel AI technique designed to accelerate LLM inference, especially in long-context scenarios, by selectively computing the KV cache for important tokens and deferring computation for less relevant ones. When processing long prompts, LLM inference slows down because attention must be computed over every token in the prompt. By using attention scores from earlier layers to assess token importance and computing the KV cache only for the important tokens, LazyLLM significantly improves inference speed. Its main advantages are:

* **Universality:** LazyLLM is compatible with any transformer-based LLM.
* **Training-free:** LazyLLM can be applied without retraining the model.
* **Effectiveness:** LazyLLM performs well across a range of language tasks, including question answering, summarization, and code completion.

🚀 LazyLLM uses attention scores from earlier layers to assess token importance and selectively computes the KV cache for important tokens, significantly improving inference speed. Its core idea is to keep all tokens in the model's early layers and progressively reduce the number of tokens computed in later layers, cutting computation without noticeably hurting model performance. LazyLLM also introduces an auxiliary cache mechanism that stores the hidden states of pruned tokens; when the model later needs a pruned token, it retrieves that token's hidden state from the auxiliary cache instead of recomputing it.

🏆 LazyLLM performs well across a range of language tasks, including question answering, summarization, and code completion. It achieves substantial TTFT speedups (up to 2.89x for Llama 2 and up to 4.77x for XGen) while keeping performance close to baseline. The results show that LazyLLM offers a better speed-accuracy trade-off than alternatives such as random token dropping, static pruning, and prompt compression. Because it typically computes fewer than 100% of the prompt tokens, overall computation drops and generation speeds improve. Its progressive pruning strategy, informed by layer-wise analysis, underpins this strong performance and highlights LazyLLM's ability to optimize LLM inference without compromising accuracy.

💡 LazyLLM is a promising technique that can substantially improve LLM inference efficiency while preserving model performance. Its ease of adoption and effectiveness make it a valuable tool for a wide range of LLM applications.

Large Language Models (LLMs) have made a significant leap in recent years, but their inference process faces challenges, particularly in the prefilling stage. The primary issue lies in the time-to-first-token (TTFT), which can be slow for long prompts due to the deep and wide architecture of state-of-the-art transformer-based LLMs. The slowdown occurs because the cost of computing attention increases quadratically with the number of tokens in the prompt. For example, Llama 2 with 7 billion parameters requires 21 times longer for TTFT than for each subsequent decoding step, with TTFT accounting for approximately 23% of the total generation time on the LongBench benchmark. Optimizing TTFT has become a critical path toward efficient LLM inference.
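To make the TTFT bottleneck concrete, here is a minimal measurement sketch using the Hugging Face transformers API (an assumption on our part; the paper does not prescribe a particular stack, and the checkpoint name is only illustrative). It times the prefill forward pass over a long prompt against a single cached decode step; on long prompts the ratio grows quickly because prefill must attend over the entire prompt at once.

```python
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def sync() -> None:
    # Make GPU timing meaningful: CUDA kernels launch asynchronously.
    if torch.cuda.is_available():
        torch.cuda.synchronize()

model_name = "meta-llama/Llama-2-7b-hf"   # assumed checkpoint; any causal LM works here
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto"
)

# A long prompt: prefill must run attention over every one of these tokens.
inputs = tok("Background document text. " * 800, return_tensors="pt").to(model.device)

with torch.no_grad():
    sync(); t0 = time.perf_counter()
    out = model(**inputs, use_cache=True)     # prefill: builds the KV cache, yields first logits
    sync(); ttft = time.perf_counter() - t0

    next_id = out.logits[:, -1:].argmax(-1)   # greedy pick of the first generated token
    sync(); t0 = time.perf_counter()
    model(input_ids=next_id, past_key_values=out.past_key_values, use_cache=True)  # one decode step
    sync(); step = time.perf_counter() - t0

print(f"TTFT: {ttft:.3f}s | one decode step: {step:.3f}s | ratio: {ttft / step:.1f}x")
```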

Prior studies have explored various approaches to address the challenges of efficient long-context inference and TTFT optimization in LLMs. Some methods focus on modifying transformer architectures, such as replacing standard self-attention with local windowed attention or using locality-sensitive hashing. However, these require significant model changes and retraining. Other techniques optimize the KV cache to accelerate decoding steps but don’t address TTFT. Token pruning approaches, which selectively remove less important tokens during inference, have shown promise in sentence classification tasks. Examples include Learned Token Pruning and width-wise computation reduction. However, these methods were designed for single-iteration processing tasks and need adaptation for generative LLMs. Each approach has limitations, prompting the need for more versatile solutions that can improve TTFT without extensive model modifications.

Researchers from Apple and Meta AI propose LazyLLM, a unique technique to accelerate LLM prefilling by selectively computing the KV cache for important tokens and deferring less crucial ones. It uses attention scores from previous layers to assess token importance and prune progressively. Unlike permanent prompt compression, LazyLLM can revive pruned tokens to maintain accuracy. An Aux Cache mechanism stores pruned tokens’ hidden states, ensuring efficient revival and preventing performance degradation. LazyLLM offers three key advantages: universality (compatible with any transformer-based LLM), training-free implementation, and effectiveness across various language tasks. This method improves inference speed in both prefilling and decoding stages without requiring model modifications or fine-tuning.

The LazyLLM framework is designed to optimize LLM inference through progressive token pruning. The method starts with the full context and gradually reduces computations towards the end of the model by pruning less important tokens. Unlike static pruning, LazyLLM allows the dynamic selection of token subsets in different generation steps, crucial for maintaining performance.

This framework employs layer-wise token pruning in each generation step, using attention maps to determine token importance. It calculates a confidence score for each token and prunes those below a certain percentile. This approach is applied progressively, keeping more tokens in earlier layers and reducing them towards the end of the transformer.
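The following is a minimal sketch of that layer-wise selection, under assumed tensor shapes and an assumed keep-ratio schedule (the function and variable names are illustrative, not the authors' code): token importance is taken as the attention the current position pays to each prompt token, averaged over heads, and tokens below the layer's percentile threshold are pruned.

```python
import torch

def select_tokens(attn_weights: torch.Tensor, keep_ratio: float) -> torch.Tensor:
    """
    attn_weights: (num_heads, num_queries, num_prompt_tokens) attention map from one layer.
    keep_ratio:   fraction of prompt tokens to keep at this layer (e.g. 1.0 early, 0.4 late).
    Returns a boolean mask over prompt tokens.
    """
    # One simple choice consistent with the description above: the importance of prompt
    # token i is the attention it receives from the last query position, averaged over heads.
    importance = attn_weights[:, -1, :].mean(dim=0)            # (num_prompt_tokens,)
    threshold = torch.quantile(importance, 1.0 - keep_ratio)   # percentile cutoff
    return importance >= threshold

# Progressive schedule (assumed values): prune nothing early, more aggressively later.
keep_schedule = [1.0] * 16 + [0.7] * 8 + [0.4] * 8             # 32 layers

attn = torch.softmax(torch.randn(32, 128, 128), dim=-1)        # toy attention map for one layer
mask = select_tokens(attn, keep_schedule[-1])
print(f"kept {int(mask.sum())} of {mask.numel()} prompt tokens at the final layer")
```

The schedule keeps every token in the early layers and prunes more aggressively toward the end, mirroring the progressive strategy described above.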

To overcome the challenges in extending pruning to decoding steps, LazyLLM introduces an Aux Cache mechanism. This cache stores hidden states of pruned tokens, allowing efficient retrieval without recomputation. During decoding, the model first accesses the KV cache for existing tokens and retrieves hidden states from the Aux Cache for pruned tokens. Also, this implementation ensures each token is computed at most once per transformer layer, guaranteeing that LazyLLM’s worst-case runtime is not slower than the baseline. The method’s dynamic nature and efficient caching mechanism contribute to its effectiveness in optimizing both the prefilling and decoding stages of LLM inference.
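Below is a minimal sketch of the Aux Cache idea with an assumed interface (the class and method names are ours, not the paper's): hidden states of pruned tokens are stashed per layer so that a later decoding step that revives a token can fetch its representation rather than recompute it from the first layer.

```python
from typing import Dict, Optional, Tuple
import torch

class AuxCache:
    """Per-layer stash for hidden states of pruned tokens (illustrative, not the authors' code)."""

    def __init__(self) -> None:
        # (layer_index, token_position) -> hidden state of that token at that layer
        self._store: Dict[Tuple[int, int], torch.Tensor] = {}

    def stash(self, layer: int, positions: torch.Tensor, hidden: torch.Tensor) -> None:
        """Save hidden states of tokens pruned at `layer`; hidden has shape [num_pruned, d_model]."""
        for pos, h in zip(positions.tolist(), hidden):
            self._store[(layer, pos)] = h

    def lookup(self, layer: int, position: int) -> Optional[torch.Tensor]:
        """Return the cached hidden state if this token was pruned at `layer`, else None."""
        return self._store.get((layer, position))

# During decoding, a layer first checks the regular KV cache; only if a token's entry is
# missing (because it was pruned) does it fall back to the Aux Cache, so every token is
# computed at most once per transformer layer.
cache = AuxCache()
cache.stash(layer=5, positions=torch.tensor([12, 47]), hidden=torch.randn(2, 4096))
print(cache.lookup(5, 47).shape)   # torch.Size([4096])
```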

LazyLLM demonstrates significant improvements in LLM inference efficiency across various language tasks. It achieves substantial TTFT speedups (up to 2.89x for Llama 2 and 4.77x for XGen) while maintaining accuracy close to baseline levels. The method outperforms other approaches like random token drop, static pruning, and prompt compression in speed-accuracy trade-offs. LazyLLM’s effectiveness spans multiple tasks, including QA, summarization, and code completion. It often computes less than 100% of prompt tokens, leading to reduced overall computation and improved generation speeds. The progressive pruning strategy, informed by layer-wise analysis, contributes to its superior performance. These results highlight LazyLLM’s capacity to optimize LLM inference without compromising accuracy.

LazyLLM, an innovative technique for efficient LLM inference, particularly in long context scenarios, selectively computes KV for important tokens and defers computation of less relevant ones. Extensive evaluation across various tasks demonstrates that LazyLLM significantly reduces TTFT while maintaining performance. A key advantage is its seamless integration with existing transformer-based LLMs, improving inference speed without fine-tuning. By dynamically prioritizing token computation based on relevance, LazyLLM offers a practical solution to enhance LLM efficiency, addressing the growing demand for faster and more resource-efficient language models in diverse applications.


Check out the Paper. All credit for this research goes to the researchers of this project.



Related tags

LazyLLM, Large Language Models, Inference, AI Technology, Long Context