MarkTechPost@AI | July 20, 2024
Efficient Quantization-Aware Training (EfficientQAT): A Novel Machine Learning Quantization Technique for Compressing LLMs

To address the substantial memory demands of large language models (LLMs) in natural language processing and artificial intelligence, researchers have proposed Efficient Quantization-Aware Training (EfficientQAT). EfficientQAT is a two-phase framework that performs quantization-aware training on all parameters within each transformer block, using block-wise reconstruction to stay efficient, which cuts the resource requirements of quantization-aware training while preserving high performance.

🤔 The EfficientQAT framework operates in two main phases. In the Block-AP phase, quantization-aware training is applied to all parameters within each transformer block, with block-wise reconstruction keeping the process efficient; this avoids training the entire model at once and saves memory. The subsequent E2E-QP phase freezes the quantized weights and trains only the quantization parameters (step sizes), improving efficiency and performance without the overhead of full-model training. This two-phase strategy speeds up convergence and enables effective instruction tuning of quantized models.

🚀 The Block-AP phase starts from standard uniform quantization, quantizing and then dequantizing weights block by block. Inspired by BRECQ and OmniQuant, this approach uses far less data and memory during training than conventional end-to-end QAT. By training all parameters, including scaling factors and zero points, Block-AP ensures precise calibration and avoids the overfitting that often comes with training the whole model at once.

💡 In the E2E-QP phase, only the quantization parameters are trained end to end while the quantized weights stay fixed. Building on the robust initialization provided by Block-AP, this phase allows efficient and accurate fine-tuning of the quantized model for specific tasks. E2E-QP makes instruction tuning of quantized models possible while remaining memory-efficient, since the trainable parameters make up only a small fraction of the full network.

💪 EfficientQAT clearly outperforms previous quantization methods. For example, it achieves 2-bit quantization of the Llama-2-70B model on a single A100-80GB GPU in 41 hours, with less than a 3% accuracy drop relative to the full-precision model. It also beats existing Q-PEFT methods in low-bit settings, offering a more hardware-efficient solution.

The research paper addresses the challenge of managing the significant memory requirements of large language models (LLMs) in natural language processing and artificial intelligence. As LLMs become increasingly integral to AI tasks, their massive parameter counts drive up memory requirements and bandwidth consumption. While quantization-aware training (QAT) offers a potential solution by allowing models to operate with lower-bit representations, existing methods often require extensive training resources, making them impractical for large models.

Current quantization techniques for LLMs include post-training quantization (PTQ) and quantized parameter-efficient fine-tuning (Q-PEFT). PTQ minimizes memory usage during inference by converting pre-trained model weights to low-bit formats, but it can compromise accuracy, especially in low-bit regimes. Q-PEFT methods, like QLoRA, allow for fine-tuning on consumer-grade GPUs but require reverting to higher-bit formats for additional tuning, necessitating another round of PTQ, which can degrade performance. 

The researchers propose Efficient Quantization-Aware Training (EfficientQAT) to address these limitations. The framework operates in two main phases. In the Block-AP phase, quantization-aware training is performed on all parameters within each transformer block, using block-wise reconstruction to maintain efficiency. This circumvents the need for full-model training and so preserves memory. The subsequent E2E-QP phase fixes the quantized weights and trains only the quantization parameters (step sizes), which improves the model's efficiency and performance without the overhead of training the entire model. This two-phase strategy improves convergence speed and allows effective instruction tuning of quantized models.
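To make the two-phase schedule concrete, here is a minimal PyTorch-style sketch, assuming a hypothetical model whose quantized layers expose their step size as a learnable `scale` attribute and whose blocks are reachable as `model.transformer_blocks`. The function names, learning rates, and data-loader shapes are illustrative, not the authors' released API.

```python
import torch

def block_ap(model, calib_loader, epochs=2):
    """Phase 1 (Block-AP): train ALL parameters of one transformer block at a
    time against cached full-precision block outputs (block-wise reconstruction),
    so optimizer state never has to cover the whole model at once."""
    for block in model.transformer_blocks:               # assumed attribute
        opt = torch.optim.AdamW(block.parameters(), lr=1e-4)
        for _ in range(epochs):
            for inputs, fp_outputs in calib_loader:      # cached FP activations
                loss = torch.nn.functional.mse_loss(block(inputs), fp_outputs)
                opt.zero_grad()
                loss.backward()
                opt.step()

def e2e_qp(model, train_loader, epochs=1):
    """Phase 2 (E2E-QP): freeze the quantized weights and train only the
    quantization step sizes end to end on the target data."""
    for p in model.parameters():
        p.requires_grad = False
    step_sizes = []
    for m in model.modules():
        if hasattr(m, "scale"):                          # assumed step-size attribute
            m.scale.requires_grad = True
            step_sizes.append(m.scale)
    opt = torch.optim.AdamW(step_sizes, lr=5e-5)
    for _ in range(epochs):
        for batch in train_loader:
            loss = model(**batch).loss                   # e.g., next-token loss
            opt.zero_grad()
            loss.backward()
            opt.step()
```

Calling `block_ap` first and `e2e_qp` second mirrors the described order: block-wise calibration of everything, then cheap end-to-end tuning of only the step sizes.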

The Block-AP phase of EfficientQAT begins with a standard uniform quantization method, quantizing and then dequantizing weights in a block-wise manner. Inspired by BRECQ and OmniQuant, this method allows for efficient training with less data and memory compared to traditional end-to-end QAT approaches. By training all parameters, including scaling factors and zero points, Block-AP ensures precise calibration and avoids the overfitting issues typically associated with training the entire model simultaneously.
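As an illustration of the quantize-then-dequantize step that Block-AP trains through, the sketch below follows common uniform fake-quantization conventions: the scale (step size) and zero point are learnable, and a straight-through estimator passes gradients through the rounding. The class name, per-tensor initialization, and default bit width are assumptions for illustration rather than the paper's exact implementation.

```python
import torch
import torch.nn as nn

class UniformFakeQuant(nn.Module):
    """Uniform quantize -> dequantize with a learnable step size (scale) and
    zero point, so both can be calibrated jointly with the weights in Block-AP."""

    def __init__(self, weight: torch.Tensor, n_bits: int = 2):
        super().__init__()
        self.qmax = 2 ** n_bits - 1
        w_min, w_max = weight.min(), weight.max()
        scale = (w_max - w_min).clamp(min=1e-8) / self.qmax   # initial step size
        zero_point = (-w_min / scale).round()
        self.scale = nn.Parameter(scale)                      # trainable in Block-AP
        self.zero_point = nn.Parameter(zero_point)            # trainable in Block-AP

    def forward(self, w: torch.Tensor) -> torch.Tensor:
        q = w / self.scale + self.zero_point
        q = q + (q.round() - q).detach()           # straight-through rounding
        q = q.clamp(0.0, float(self.qmax))
        return (q - self.zero_point) * self.scale  # dequantize back to floating point
```

A grouped variant (computing min/max per group of weights instead of per tensor) follows the same pattern; during Block-AP the block's weights, scales, and zero points are all optimized against the full-precision block outputs.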

In the E2E-QP phase, only the quantization parameters are trained end-to-end while keeping the quantized weights fixed. This phase leverages the robust initialization provided by Block-AP, allowing for efficient and accurate tuning of the quantized model for specific tasks. E2E-QP enables instruction tuning of quantized models, ensuring memory efficiency as the trainable parameters constitute only a small fraction of the total network.
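Because each step size covers a whole group of weights, the trainable slice in E2E-QP is tiny relative to the network. A quick way to see this, assuming modules expose their step size as a `scale` parameter as in the sketches above (a hypothetical attribute name), is to freeze everything else and count:

```python
import torch

def trainable_fraction(model: torch.nn.Module):
    """Freeze all weights, re-enable only the quantization step sizes, and
    report how small the trainable portion of the network is."""
    for p in model.parameters():
        p.requires_grad = False
    for m in model.modules():
        if hasattr(m, "scale") and isinstance(m.scale, torch.nn.Parameter):
            m.scale.requires_grad = True          # assumed step-size attribute
    total = sum(p.numel() for p in model.parameters())
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    return trainable, total, trainable / total
```

With, for example, one step size per group of 64 or 128 weights, the trainable fraction stays around one percent or less, which is what keeps instruction tuning of an already-quantized model memory-cheap.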

EfficientQAT demonstrates significant improvements over previous quantization methods. For instance, it achieves a 2-bit quantization of a Llama-2-70B model on a single A100-80GB GPU in just 41 hours, with less than 3% accuracy degradation compared to the full-precision model. Additionally, it outperforms existing Q-PEFT methods in low-bit scenarios, providing a more hardware-efficient solution.

The EfficientQAT framework presents a compelling solution to the challenges posed by large language models in terms of memory and computational efficiency. By introducing a two-phase training approach focusing on block-wise training and end-to-end quantization parameter optimization, the researchers effectively reduce the resource demands of quantization-aware training while maintaining high performance. This method represents a significant advancement in the field of model quantization, providing a practical pathway for deploying large language models in resource-constrained environments.


Check out the Paper and GitHub. All credit for this research goes to the researchers of this project.


