MarkTechPost@AI December 7, 2024
CPU-GPU I/O-Aware LLM Inference Reduces Latency in GPUs by Optimizing CPU-GPU Interactions

A new study proposes a CPU-GPU I/O-aware LLM inference method that reduces latency by optimizing CPU-GPU interactions. The approach leverages partial KV cache recomputation and asynchronous overlapping to address the system bottleneck of loading large KV caches. Experiments show the method reduces latency by 35.8% for latency-oriented workloads and achieves improvements of up to 29% for throughput-oriented workloads.

🚀 LLMs are driving major advances in research and development, but their high cost makes large-scale deployment hard to realize. Reducing operational latency, especially in dynamic applications that demand responsiveness, is a major challenge.

🔍 The KV cache is used for autoregressive decoding in LLMs: it stores the key-value pairs produced by the multi-headed attention mechanism during the pre-filling phase of inference, and new key-value pairs are appended during decoding. Its memory footprint grows linearly with batch size, sequence length, and model size, eventually exceeding what the GPU can handle; offloading it to the CPU introduces a bottleneck that increases latency and reduces throughput.

📑 The PCIe interface becomes a limiting factor, especially when transferring the cache from the CPU to the GPU for computation. A slow PCIe link can push latency an order of magnitude above normal levels, leaving the GPU idle for long stretches.

💡 Researchers at the University of Southern California propose an efficient CPU-GPU I/O-aware LLM inference method that optimizes PCIe utilization. It leverages partial KV cache recomputation and asynchronous overlapping to address the system bottleneck of loading large KV caches.

💻 The method comprises three modules that minimize GPU latency: a profiler module that collects system hardware information, a scheduler module that formulates a linear programming task to determine the optimal KV split point, and a runtime module that coordinates data transfer and memory allocation between the two devices.

LLMs are driving major advances in research and development today. A significant shift has been observed in research objectives and methodologies toward an LLM-centric approach. However, they come with high operating costs, making large-scale use of LLMs inaccessible to many. Reducing the latency of operations is therefore a significant challenge, especially in dynamic applications that demand responsiveness.

The KV cache is used for autoregressive decoding in LLMs. It stores the intermediate key and value activations of the multi-headed attention mechanism computed during the pre-filling phase of inference, reducing per-token generation complexity from quadratic to linear. During the decoding stage, new KV pairs are appended to this memory. The KV cache improves efficiency but grows linearly with batch size, sequence length, and model size. Its growing footprint eventually exceeds the handling capacity of GPUs, and transferring it to the CPU introduces several bottlenecks, increasing latency while reducing throughput.
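To make the linear growth concrete, here is a minimal sizing sketch. The model dimensions are assumptions (roughly a 7B-parameter model in FP16), not figures from the paper.

```python
# Back-of-the-envelope KV cache sizing. Model dimensions are assumptions
# (roughly Llama-2-7B-like, FP16); they are not taken from the paper.
def kv_cache_bytes(batch_size, seq_len, n_layers=32, n_heads=32,
                   head_dim=128, bytes_per_elem=2):
    # Factor of 2 covers both the key and the value tensor per token, per layer.
    return 2 * batch_size * seq_len * n_layers * n_heads * head_dim * bytes_per_elem

# Linear growth in batch size and sequence length quickly outpaces GPU memory:
for bs, ctx in [(1, 4096), (8, 4096), (32, 8192)]:
    print(f"batch={bs:>2}, seq_len={ctx}: ~{kv_cache_bytes(bs, ctx) / 2**30:.1f} GiB")
# -> roughly 2 GiB, 16 GiB, and 128 GiB respectively
```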

PCIe interfaces become a limiting factor, especially when transferring the cache from the CPU to the GPU for computation. Slow PCIe interfaces can result in latency exceeding normal levels by an order of magnitude, leading to substantial GPU idle time.
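As rough intuition for why the bus dominates, the arithmetic below uses illustrative assumptions (cache size and bandwidth figures are not measurements from the study).

```python
# Rough arithmetic behind the PCIe bottleneck. Both numbers are assumptions
# chosen for illustration, not measurements from the paper.
PCIE4_X16_GB_PER_S = 32      # theoretical peak; sustained rates are lower
KV_CACHE_GIB = 20            # e.g., a large-batch, long-context cache

transfer_ms = KV_CACHE_GIB * 2**30 / (PCIE4_X16_GB_PER_S * 1e9) * 1e3
print(f"~{transfer_ms:.0f} ms just to move the cache over PCIe")  # ~670 ms

# An A100-class GPU can read the same 20 GiB from HBM (~1.5-2 TB/s) in roughly
# 10-15 ms, so the PCIe transfer can dominate per-step latency by an order of
# magnitude or more, leaving the GPU idle while it waits on the bus.
```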

Previous work has attempted to mitigate the issue of slow PCIe performance. Still, these approaches often fail due to mismatched data transfer and GPU computation times, particularly with large batch and context sizes. Others depended on CPU resources, which again became a limiting factor. This article discusses a novel approach to PCIe and GPU optimization.

University of Southern California researchers propose an efficient CPU-GPU I/O-aware LLM inference method for optimized PCIe utilization. It leverages partial KV cache recomputation and asynchronous overlapping to address the system bottleneck of loading large KV caches. Their process transfers smaller activation segments of the cache to the GPU rather than the entire KV cache, and the GPU then reconstructs the full cache from these activation segments. The key lies in computing attention scores in a way that ensures minimal information loss.
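The sketch below illustrates this idea for a single layer, assuming a PyTorch-style setup with pinned CPU tensors and a dedicated copy stream. The function and tensor names are hypothetical and do not come from the authors' code.

```python
import torch

def load_kv_layer(hidden_cpu, kv_tail_cpu, w_k, w_v, copy_stream):
    """Rebuild one layer's KV cache on the GPU (illustrative sketch).
    hidden_cpu:  [split, d_model] pinned activations for the recomputed prefix.
    kv_tail_cpu: [2, seq_len - split, n_heads, head_dim] pinned KV tail to transfer.
    w_k, w_v:    [d_model, n_heads * head_dim] projection weights already on the GPU."""
    # Start streaming the precomputed KV tail over PCIe on a side stream.
    with torch.cuda.stream(copy_stream):
        kv_tail = kv_tail_cpu.to("cuda", non_blocking=True)

    # Meanwhile, on the default stream, move the (much smaller) activation
    # slice and recompute its keys/values with two on-GPU GEMMs.
    hidden = hidden_cpu.to("cuda", non_blocking=True)
    split = hidden.shape[0]
    n_heads, head_dim = kv_tail_cpu.shape[2], kv_tail_cpu.shape[3]
    k_prefix = (hidden @ w_k).view(split, n_heads, head_dim)
    v_prefix = (hidden @ w_v).view(split, n_heads, head_dim)

    # Wait for the transfer, then stitch prefix and tail into the full cache.
    torch.cuda.current_stream().wait_stream(copy_stream)
    k = torch.cat([k_prefix, kv_tail[0]], dim=0)
    v = torch.cat([v_prefix, kv_tail[1]], dim=0)
    return k, v
```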

The authors propose a fully automated method for determining recomputation and communication splits. This work consists of three modules to minimize GPU latency:

    Profiler Module: Collects system hardware information, such as PCIe bandwidth and GPU processing speed.
    Scheduler Module: Formulates the problem as a linear programming task to determine the optimal KV split point using the hardware information and user configuration. The objective is to maximize the overlap between computation and communication (a simplified sketch of this split-point search follows the list).
    Runtime Module: Coordinates data transfer between the two devices and manages memory allocations.
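A simplified stand-in for the scheduler's split-point search is shown below. The paper formulates this as a linear program over profiled hardware numbers; this sketch merely balances the two per-token costs, and the timing constants are invented for illustration.

```python
# Simplified stand-in for the Scheduler Module's split-point search.
def choose_split(seq_len, recompute_us_per_token, transfer_us_per_token):
    """Return the KV split point minimizing the slower of the two concurrent
    paths: recomputing tokens [0, split) on the GPU vs. transferring tokens
    [split, seq_len) over PCIe."""
    best_split, best_cost = 0, float("inf")
    for split in range(seq_len + 1):
        recompute = split * recompute_us_per_token
        transfer = (seq_len - split) * transfer_us_per_token
        cost = max(recompute, transfer)   # the two paths run concurrently
        if cost < best_cost:
            best_split, best_cost = split, cost
    return best_split, best_cost

# If recomputation is 4x cheaper per token than the PCIe transfer, the optimum
# assigns roughly 80% of the tokens to recomputation:
print(choose_split(seq_len=4096, recompute_us_per_token=0.5, transfer_us_per_token=2.0))
```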

The Scheduler Module, which is responsible for finding the optimal KV split, works in two ways:

Row-by-Row Schedule: Reduces latency with a row-by-row execution plan. The GPU begins reconstructing the KV cache while the remaining activations are still loading asynchronously.
Column-by-Column Schedule: Maximizes throughput and accommodates large-batch inference by reusing model weights across batches. Instead of processing each layer sequentially within a single batch, it overlaps the transmission of KV cache and activations with the computation of multi-headed attention (MHA) across multiple batches (the loop orders of both schedules are sketched below).

Using a six-process communication parallelism strategy, the Runtime Module enables concurrent GPU computation and CPU-GPU communication.
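The loop orders of the two schedules can be sketched as follows. Here `transfer`, `recompute`, and `attention` are hypothetical stand-ins for the real kernels, so this shows structure only, not the authors' implementation.

```python
def row_by_row(layers, transfer, recompute, attention):
    # Latency-oriented: within each layer, reconstruct the KV prefix while the
    # rest of that layer's cache is still in flight over PCIe.
    for layer in layers:
        pending = transfer(layer)      # async PCIe copy of the KV tail
        recompute(layer)               # prefix K/V rebuilt from activations
        pending.wait()
        attention(layer)

def column_by_column(batches, layers, transfer, recompute, attention):
    # Throughput-oriented: keep each layer's weights resident and sweep over
    # batches, overlapping batch i+1's transfer with batch i's attention.
    for layer in layers:
        pending = transfer(batches[0], layer)
        for i, batch in enumerate(batches):
            recompute(batch, layer)
            pending.wait()
            if i + 1 < len(batches):
                pending = transfer(batches[i + 1], layer)
            attention(batch, layer)
```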

The authors tested the proposed framework for efficient LLM inference using an NVIDIA A100 GPU connected to a CPU via a PCIe 4.0 x16 interface. Experiments were conducted with two objectives to assess the framework’s performance: for latency-oriented workloads, the method reduced latency by 35.8%, and for throughput-oriented workloads it achieved improvements of up to 29%.

Conclusion:

The CPU-GPU I/O-aware method efficiently reduces latency while increasing throughput in LLM inference. It leverages partial KV cache recomputation and overlaps it with data transmission to minimize idle GPU time and enhance efficiency.


Check out the Paper. All credit for this research goes to the researchers of this project.
