MarkTechPost@AI, October 1, 2024
Researchers from China Introduce INT-FlashAttention: INT8 Quantization Architecture Compatible with FlashAttention Improving the Inference Speed of FlashAttention on Ampere GPUs

INT-FlashAttention is an INT8 quantization architecture designed specifically for Ampere GPUs. It is compatible with FlashAttention and significantly improves FlashAttention's inference speed on Ampere hardware. The architecture replaces the floating-point operations normally used in the self-attention module with INT8 general matrix-multiplication (GEMM) kernels; compared with FP16 or FP8, INT8 operations require fewer computational resources, yielding substantial gains in inference speed and energy efficiency. INT-FlashAttention also adopts a token-level post-training quantization technique that maintains accuracy even at reduced precision. Experiments show that INT-FlashAttention achieves 72% faster inference than standard FP16 FlashAttention and reduces quantization error by up to 82% compared with FP8-based FlashAttention.


Large Language Models (LLMs) evaluate and interpret relationships between words or tokens in a sequence primarily through the self-attention mechanism. The drawback is that this module's time and memory complexity grows quadratically with sequence length: longer sequences demand far more memory and computation, which makes scaling LLMs to long-context applications inefficient and challenging.
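
To make the quadratic cost concrete, here is a minimal NumPy sketch of standard scaled-dot-product attention (shapes and names are illustrative, not taken from the paper); the full n-by-n score matrix it materializes is exactly the term that grows quadratically with sequence length.

    import numpy as np

    def naive_attention(Q, K, V):
        """Standard attention: materializes the full (n, n) score matrix."""
        d = Q.shape[-1]
        scores = Q @ K.T / np.sqrt(d)                  # (n, n): quadratic in sequence length
        scores -= scores.max(axis=-1, keepdims=True)   # for numerical stability
        weights = np.exp(scores)
        weights /= weights.sum(axis=-1, keepdims=True)
        return weights @ V                             # (n, d)

    n, d = 1024, 64                                    # illustrative sizes
    Q, K, V = (np.random.randn(n, d).astype(np.float32) for _ in range(3))
    out = naive_attention(Q, K, V)                     # the (n, n) intermediate dominates memory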

FlashAttention was developed to overcome this restriction by accelerating attention computation and reducing its memory footprint. It does so by exploiting the GPU memory hierarchy, that is, how memory of different speeds and sizes is arranged and accessed on the GPU. By dividing the computation into smaller blocks that fit into fast on-chip memory, FlashAttention avoids materializing the full attention matrix, resulting in faster execution and lower memory overhead. This improves the scalability of the attention mechanism, especially for longer sequences.
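
The sketch below illustrates the tiling idea in plain NumPy: keys and values are consumed block by block while running softmax statistics are updated, so the full score matrix is never stored. This is a simplified rendering of the online-softmax recurrence behind FlashAttention, not the authors' GPU kernel; the block size and variable names are placeholders.

    import numpy as np

    def tiled_attention(Q, K, V, block=128):
        """FlashAttention-style forward pass: process K/V in tiles with an online softmax."""
        n, d = Q.shape
        out = np.zeros_like(Q)
        row_max = np.full(n, -np.inf)              # running max per query row
        row_sum = np.zeros(n)                      # running softmax denominator
        for start in range(0, n, block):
            Kb, Vb = K[start:start + block], V[start:start + block]
            S = Q @ Kb.T / np.sqrt(d)              # only one (n, block) tile of scores at a time
            new_max = np.maximum(row_max, S.max(axis=1))
            rescale = np.exp(row_max - new_max)    # correct previously accumulated partial results
            P = np.exp(S - new_max[:, None])
            out = out * rescale[:, None] + P @ Vb
            row_sum = row_sum * rescale + P.sum(axis=1)
            row_max = new_max
        return out / row_sum[:, None]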

Combining quantization methods with FlashAttention is an intriguing new research direction. Quantization represents data in lower-precision numerical formats, such as INT8 (8-bit integer), reducing the precision of the values used in the model's computations and thereby enabling faster processing and lower memory usage. Combined with FlashAttention, this can yield even greater efficiency gains, particularly during inference, the stage in which the model generates predictions from what it has already learned.
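
As a concrete illustration of what INT8 quantization does, the sketch below quantizes a floating-point tensor to 8-bit integers with a single symmetric scale and then dequantizes it; the round-trip difference is the quantization error discussed later. This is generic symmetric post-training quantization, not the paper's specific scheme.

    import numpy as np

    def quantize_int8(x):
        """Symmetric per-tensor INT8 quantization: x is approximated by scale * x_int8."""
        scale = max(np.abs(x).max(), 1e-8) / 127.0     # map the largest magnitude to 127
        x_int8 = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
        return x_int8, scale

    def dequantize_int8(x_int8, scale):
        return x_int8.astype(np.float32) * scale

    x = np.random.randn(4, 8).astype(np.float32)
    x_q, s = quantize_int8(x)
    round_trip_error = np.abs(x - dequantize_int8(x_q, s)).max()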

In recent research from China, INT-FlashAttention has been proposed as a significant innovation in this direction. Designed especially for Ampere GPUs, such as NVIDIA's A100 series, it is presented as the first architecture to fully integrate INT8 quantization into the forward pass of FlashAttention. INT-FlashAttention replaces the floating-point operations normally used in the self-attention module with far more efficient INT8 general matrix-multiplication (GEMM) kernels. Compared with floating-point formats such as FP16 or FP8, INT8 operations demand substantially fewer processing resources, which significantly improves inference speed and energy efficiency.
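
For intuition, the sketch below emulates what an INT8 GEMM kernel computes: both operands are quantized to INT8 with per-tensor scales, the matrix product is accumulated in INT32, and a single floating-point rescale recovers an approximation of the FP32 result. The scales and variable names here are illustrative assumptions, not the paper's kernel.

    import numpy as np

    def int8_gemm(A, B):
        """Approximate A @ B by quantizing both operands to INT8 and accumulating in INT32."""
        sa = max(np.abs(A).max(), 1e-8) / 127.0
        sb = max(np.abs(B).max(), 1e-8) / 127.0
        Aq = np.clip(np.round(A / sa), -127, 127).astype(np.int8)
        Bq = np.clip(np.round(B / sb), -127, 127).astype(np.int8)
        C_int = Aq.astype(np.int32) @ Bq.astype(np.int32)  # integer multiply-accumulate
        return C_int.astype(np.float32) * (sa * sb)        # one rescale back to floating point

    A = np.random.randn(64, 128).astype(np.float32)
    B = np.random.randn(128, 32).astype(np.float32)
    C_approx = int8_gemm(A, B)                             # close to A @ B in FP32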

INT-FlashAttention is unique in that it operates on fully INT8 inputs, including the query (Q), key (K), and value (V) matrices that are central to the attention computation. To retain accuracy at reduced precision, it preserves token-specific information through a token-level post-training quantization technique, assigning quantization parameters per token rather than per tensor. This token-level approach is also flexible: it makes the framework compatible with other lower-precision formats, such as INT4 (4-bit integers), offering additional memory and computational savings without compromising performance.
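
A minimal sketch of the per-token idea follows, under the assumption that "token-level" means one scale per row of Q and K: the QKᵀ product is carried out on the INT8 values (accumulated in INT32, as an integer GEMM kernel would do), and the result is rescaled with the outer product of the per-row scales before the softmax. This illustrates token-level quantization in general rather than the authors' kernel, and it omits the handling of V and of the softmax output.

    import numpy as np

    def quantize_per_token(x):
        """Per-token (per-row) symmetric INT8 quantization: one scale per row."""
        scale = np.maximum(np.abs(x).max(axis=1, keepdims=True), 1e-8) / 127.0
        x_int8 = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
        return x_int8, scale                               # scale has shape (n, 1)

    def int8_attention_scores(Q, K):
        Qq, sq = quantize_per_token(Q)
        Kq, sk = quantize_per_token(K)
        # Integer matmul with INT32 accumulation, mimicking an INT8 GEMM kernel.
        S_int = Qq.astype(np.int32) @ Kq.astype(np.int32).T
        # Dequantize with the outer product of per-token scales, then apply 1 / sqrt(d).
        return S_int.astype(np.float32) * (sq @ sk.T) / np.sqrt(Q.shape[-1])

    Q = np.random.randn(16, 64).astype(np.float32)
    K = np.random.randn(16, 64).astype(np.float32)
    S_approx = int8_attention_scores(Q, K)                 # close to (Q @ K.T) / sqrt(d)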

The team reports that, upon evaluation, INT-FlashAttention delivers 72% faster inference than the standard FP16 (16-bit floating-point) implementation of FlashAttention. Compared with FP8-based FlashAttention, it reduces quantization error by up to 82%, meaning that in addition to running faster it also maintains higher accuracy. These findings show that INT-FlashAttention can substantially improve the scalability and efficiency of LLMs on widely deployed hardware such as Ampere GPUs.

The team has summarized their primary contributions as follows.

    The research has presented INT-FlashAttention, a unique token-level post-training quantization architecture that enhances efficiency without compromising the core attention mechanism. It smoothly integrates into the forward computational workflow of FlashAttention.
    The team has implemented the INT8 version of the INT-FlashAttention prototype, which is a major advancement in attention computing and quantization techniques.
    Extensive experiments validate these claims, showing that INT-FlashAttention achieves much higher inference speed than baseline solutions and better quantization accuracy than previous approaches; in addition to being faster, it preserves a more accurate representation of the data than FP16 or FP8 FlashAttention implementations.

In conclusion, the release of INT-FlashAttention is a key step towards improving the efficiency and accessibility of high-performance LLMs for a wider range of applications, especially in data centers where older GPU architectures like Ampere are still widely used. By using quantization and FlashAttention together, INT-FlashAttention provides a potent way to improve large-scale language model inference speed and accuracy.


Check out the Paper. All credit for this research goes to the researchers of this project.

