MarkTechPost@AI July 27, 2024
FLUTE: A CUDA Kernel Designed for Fused Quantized Matrix Multiplications to Accelerate LLM Inference

FLUTE is a CUDA kernel designed for fused quantized matrix multiplications in LLM inference. It maps quantized values to dequantized values through a flexible lookup table and supports a range of bit widths and group sizes. FLUTE's performance is demonstrated through kernel-level benchmarks and end-to-end evaluations on state-of-the-art LLMs such as LLaMA-3 and Gemma-2. Tests on A6000 and A100 GPUs in single-GPU and tensor-parallel setups show that FLUTE is efficient across unquantized, 3-bit, and 4-bit configurations. This versatility and performance make FLUTE a promising solution for accelerating LLM inference with advanced quantization techniques.

📚 **FLUTE is designed to address the challenges that low-bit and non-uniform quantization pose for LLM deployment. It achieves efficient quantized matrix multiplication through three key strategies:** 1. **Offline matrix restructuring:** FLUTE reorders quantized weights to optimize for Tensor Core operations, handling non-standard bit widths (e.g., 3-bit) by splitting weights into bit-slices and recombining them in registers. 2. **Vectorized lookup in shared memory:** To optimize dequantization, FLUTE uses a vectorized lookup table stored in shared memory that reads two elements at a time, and duplicates the table to reduce bank conflicts. 3. **Stream-K workload partitioning:** FLUTE implements Stream-K decomposition to distribute work evenly across SMs, mitigating wave quantization in low-bit, low-batch scenarios.

📢 **FLUTE delivers strong performance across a range of quantization settings, outperforming standard methods and combining well with AWQ:** 1. **The learned NF quantization approach outperforms standard methods and combines well with AWQ.** 2. **FLUTE's flexibility allows experiments with different bit widths and group sizes, nearly matching 16-bit baseline perplexity while using small group sizes.** 3. **End-to-end latency tests with the vLLM framework show meaningful speedups across a variety of configurations, including the Gemma-2 models.** 4. **A group size of 64 is found to balance quality and speed effectively.** 5. **Overall, FLUTE proves to be a versatile and efficient solution for deploying quantized LLMs, delivering improved performance across multiple scenarios.**

📕 **FLUTE's performance is demonstrated through kernel-level benchmarks and end-to-end evaluations on state-of-the-art LLMs:** 1. **Tested on A6000 and A100 GPUs in single-GPU and tensor-parallel setups, FLUTE is efficient across unquantized, 3-bit, and 4-bit configurations.** 2. **FLUTE shows impressive performance across a variety of matrix shapes, occasionally approaching the theoretical maximum speedup of 4x on the A6000.** 3. **This performance is consistent across batch sizes, unlike other LUT-compatible kernels, which typically reach similar speedups only at batch size 1 and then degrade rapidly as the batch size grows.** 4. **FLUTE's performance even compares well with Marlin, a kernel specialized for FP16 inputs and uniformly quantized INT4 weights.** 5. **This demonstrates that FLUTE can handle both uniform and non-uniform quantization schemes efficiently.**

Large Language Models (LLMs) face deployment challenges due to latency issues caused by memory bandwidth constraints. Researchers use weight-only quantization to address this, compressing LLM parameters to lower precision. This approach improves latency and reduces GPU memory requirements. Implementing it effectively requires custom mixed-type matrix-multiply kernels that move, dequantize, and process weights efficiently. Existing kernels like bitsandbytes, Marlin, and BitBLAS have shown significant speed-ups but are often limited to 4-bit quantization. Recent advancements in odd-bit and non-uniform quantization methods highlight the need for more flexible kernels that can support a wider range of settings to maximize the potential of weight quantization in LLM deployment.

Researchers have attempted to solve the LLM deployment challenges using weight-only quantization. Uniform quantization converts full-precision weights to lower-precision intervals, while non-uniform methods like lookup table (LUT) quantization offer more flexibility. Existing kernels like bitsandbytes, Marlin, and BitBLAS move quantized weights from main memory to on-chip SRAM, performing matrix multiplications after dequantizing to floating point. These show significant speed-ups but often specialize in 4-bit uniform quantization, with LUT-quantization kernels underperforming. Non-uniform methods like SqueezeLLM and NormalFloat face trade-offs between lookup table size and quantization granularity. Also, non-uniformly quantized operations can't directly utilize GPU accelerators optimized for floating-point calculations. This highlights the need for efficient kernels that use quantized representations to minimize memory movement while relying on GPU-native floating-point matrix multiplications, balancing the benefits of quantization with hardware optimization.
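
The difference between uniform and LUT quantization is easiest to see in code. Below is a minimal, hypothetical CUDA sketch of LUT-based dequantization: instead of a linear scale-and-shift, each low-bit code indexes a small table of non-uniform values (as in NormalFloat). The packing layout, 16-entry table, and per-group scaling here are illustrative assumptions, not FLUTE's actual format, and this standalone kernel is not the fused kernel the paper describes.

```cuda
// Minimal sketch of lookup-table (LUT) dequantization. Assumptions (not from
// the paper): 4-bit codes packed two per byte, a 16-entry half-precision
// table, and one scale per `group_size` weights.
#include <cuda_fp16.h>
#include <stdint.h>

__global__ void lut_dequantize(const uint8_t* packed,  // two 4-bit codes per byte
                               const half* table,      // 16 non-uniform values
                               const half* scales,     // one scale per group
                               half* out, int n, int group_size) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    uint8_t byte = packed[i >> 1];
    uint8_t code = (i & 1) ? (byte >> 4) : (byte & 0xF);     // unpack the 4-bit code
    float w = __half2float(table[code]);                     // table lookup, not a linear formula
    float s = __half2float(scales[i / group_size]);          // per-group scale
    out[i] = __float2half(w * s);                            // dequantized weight
}
```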

Researchers from the Massachusetts Institute of Technology, the High School of Mathematics Plovdiv, Carnegie Mellon University, MBZUAI, and Petuum Inc. introduce FLUTE, a flexible lookup-table engine for deploying weight-quantized LLMs with a focus on low-bit and non-uniform quantization. It addresses three main challenges: handling sub-8-bit matrices, optimizing lookup-table-based dequantization, and improving workload distribution for small batches and low-bit-width weights. FLUTE overcomes these issues through three key strategies: offline weight restructuring, a shared-memory lookup table for efficient dequantization, and Stream-K partitioning for optimized workload distribution. This approach enables FLUTE to manage the complexities of low-bit and non-uniform quantization in LLM deployment, improving efficiency and performance in scenarios where traditional methods fall short.

FLUTE is an innovative approach to flexible mixed-type matrix multiplications in weight-quantized LLMs. It addresses key challenges in deploying low-bit and non-uniform quantized models through three main strategies:

1. Offline Matrix Restructuring: FLUTE reorders quantized weights to optimize for Tensor Core operations, handling non-standard bit widths (e.g., 3-bit) by splitting weights into bit-slices and combining them in registers.
2. Vectorized Lookup in Shared Memory: To optimize dequantization, FLUTE uses a vectorized lookup table stored in shared memory, accessing two elements simultaneously (see the sketch after this list). It also employs table duplication to reduce bank conflicts.
3. Stream-K Workload Partitioning: FLUTE implements Stream-K decomposition to evenly distribute workload across SMs, mitigating wave quantization issues in low-bit and low-batch scenarios.
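
To make the second strategy concrete, the sketch below shows the vectorized-lookup idea in isolation: a 16-entry table is expanded offline into a 256-entry table of `half2` pairs, so one shared-memory read dequantizes both 4-bit codes packed in a byte. The kernel name and table layout are simplifying assumptions; FLUTE additionally duplicates the table across banks to avoid conflicts and fuses the lookup into the matmul pipeline rather than writing results back to global memory.

```cuda
// Minimal sketch of a vectorized shared-memory lookup (assumed layout: two
// 4-bit codes per byte; lut2[b] holds the pair of dequantized values for the
// byte value b, precomputed offline). Shows only the lookup step, not the
// fused matmul or FLUTE's bank-conflict-avoiding table duplication.
#include <cuda_fp16.h>
#include <stdint.h>

__global__ void dequant_vectorized_lut(const uint8_t* packed,  // packed 4-bit codes
                                       const __half2* lut2,    // 256-entry paired table
                                       __half2* out, int n_bytes) {
    __shared__ __half2 s_lut[256];
    // Cooperatively stage the paired table in shared memory once per block.
    for (int t = threadIdx.x; t < 256; t += blockDim.x)
        s_lut[t] = lut2[t];
    __syncthreads();

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n_bytes) return;
    // One shared-memory read yields both dequantized values for this byte.
    out[i] = s_lut[packed[i]];
}
```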

These innovations allow FLUTE to efficiently fuse dequantization and matrix multiplication operations, optimizing memory usage and computational throughput. The kernel employs a sophisticated pipeline of data movement between global memory, shared memory, and registers, utilizing GPU hardware capabilities for maximum performance in weight-quantized LLM deployments.
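
The Stream-K idea from the list above can be sketched independently of the rest of the kernel: instead of assigning whole output tiles to thread blocks, which leaves SMs idle when the tile count does not divide evenly (wave quantization), every block takes an equal contiguous slice of the total MAC-iteration space. The helper below is a hypothetical illustration of that split, not FLUTE's actual scheduler.

```cuda
// Hypothetical sketch of a Stream-K style split: `total_iters` counts every
// (output tile, k-step) iteration in the GEMM, and each of `num_ctas` blocks
// receives a contiguous range whose length differs by at most one iteration.
__device__ void streamk_range(int cta_id, int num_ctas, int total_iters,
                              int* iter_begin, int* iter_end) {
    int per_cta   = total_iters / num_ctas;
    int remainder = total_iters % num_ctas;
    // The first `remainder` blocks take one extra iteration each.
    *iter_begin = cta_id * per_cta + min(cta_id, remainder);
    *iter_end   = *iter_begin + per_cta + (cta_id < remainder ? 1 : 0);
}
```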

FLUTE shows impressive performance across various matrix shapes on both A6000 and A100 GPUs. On the A6000, it occasionally approaches the theoretical maximum speedup of 4x. This performance is also consistent across different batch sizes, unlike other LUT-compatible kernels which typically achieve similar speedups only at a batch size of 1 and then degrade rapidly as batch size increases. Also, FLUTE’s performance compares well even to Marlin, a kernel highly specialized for FP16 input and uniform-quantized INT4 weights. This demonstrates FLUTE’s ability to efficiently handle both uniform and non-uniform quantization schemes.

FLUTE demonstrates superior performance in LLM deployment across various quantization settings. The learned NF quantization approach outperforms standard methods and combines well with AWQ. FLUTE's flexibility allows for experiments with different bit widths and group sizes, nearly matching 16-bit baseline perplexity with small group sizes. End-to-end latency tests using the vLLM framework showed meaningful speedups across various configurations, including with Gemma-2 models. A group size of 64 was found to balance quality and speed effectively. Overall, FLUTE proves to be a versatile and efficient solution for quantized LLM deployment, offering improved performance across multiple scenarios.

FLUTE is a CUDA kernel designed to accelerate LLM inference through fused quantized matrix multiplications. It offers flexibility in mapping quantized to de-quantized values via lookup tables and supports various bit widths and group sizes. FLUTE’s performance is demonstrated through kernel-level benchmarks and end-to-end evaluations on state-of-the-art LLMs like LLaMA-3 and Gemma-2. Tested on A6000 and A100 GPUs in single and tensor parallel setups, FLUTE shows efficiency across unquantized, 3-bit, and 4-bit configurations. This versatility and performance make FLUTE a promising solution for accelerating LLM inference using advanced quantization techniques.


Check out the Paper and GitHub. All credit for this research goes to the researchers of this project. Also, don’t forget to follow us on Twitter and join our Telegram Channel and LinkedIn Group. If you like our work, you will love our newsletter.

Don’t Forget to join our 47k+ ML SubReddit

Find Upcoming AI Webinars here

The post FLUTE: A CUDA Kernel Designed for Fused Quantized Matrix Multiplications to Accelerate LLM Inference appeared first on MarkTechPost.
