MarkTechPost@AI · January 30
Optimization Using FP4 Quantization For Ultra-Low Precision Language Model Training

This article introduces an FP4-format training framework for large language models proposed by the University of Science and Technology of China, the Microsoft SIGMA Team, and Microsoft Research Asia. The framework tackles the quantization errors of ultra-low-precision training with a differentiable quantization estimator and a mechanism for handling activation outliers, allowing FP4-trained LLaMA models to match BF16 models in performance. Experiments show that FP4 training produces training curves similar to BF16 across models of several sizes and achieves competitive results on downstream tasks. However, current hardware lacks dedicated FP4 tensor cores, which prevents direct measurement of the framework's efficiency gains; future hardware advances are needed to fully realize the potential of FP4 computation.

🚀 The research team proposes the first FP4 pretraining framework for large language models, a major step forward in ultra-low-precision computing.

⚙️ The framework introduces a differentiable quantization estimator to improve gradient updates under FP4 computation, and handles activation outliers by combining clamping with a sparse auxiliary matrix, effectively mitigating quantization error.

📊 Experiments show that on LLaMA models with 1.3B, 7B, and 13B parameters, FP4 training follows training curves similar to BF16, and on multiple downstream tasks FP4-trained models perform on par with, or occasionally better than, their BF16 counterparts.

🚧 The framework's main limitation is the lack of dedicated FP4 tensor core hardware; testing relies on simulation, which adds computational overhead and prevents direct measurement of the potential efficiency gains.

Large Language Models (LLMs) have emerged as transformative tools in research and industry, with their performance correlating directly with model size. However, training these massive models presents significant challenges related to computational resources, time, and cost. Training a state-of-the-art model like Llama 3 405B requires extensive hardware infrastructure, utilizing up to 16,000 H100 GPUs over 54 days. Similarly, models like GPT-4, estimated to have one trillion parameters, demand extraordinary computational power. These resource requirements create barriers to entry and development in the field, highlighting the critical need for more efficient training methodologies that advance LLM technology while reducing the associated computational burden.

Various approaches have been explored to address the computational challenges in LLM training and inference. Mixed Precision Training has been widely adopted to accelerate model training while maintaining accuracy, initially focusing on CNNs and DNNs before expanding to LLMs. For inference optimization, Post-Training Quantization (PTQ) and Quantization Aware Training (QAT) have achieved significant compression using 4-bit, 2-bit, and even 1-bit quantization. While differentiable quantization techniques have been proposed using learnable parameters updated through backpropagation, they face limitations in handling activation outliers effectively. Existing solutions for managing outliers depend on offline pre-processing methods, making them impractical for direct application in training scenarios.

Researchers from the University of Science and Technology of China, Microsoft SIGMA Team, and Microsoft Research Asia have proposed a framework for training language models using the FP4 format, marking the first comprehensive validation of this ultra-low-precision representation. The framework addresses quantization errors through two key innovations: a differentiable gradient estimator that yields more accurate weight updates than a standard straight-through estimator, and an outlier clamping and compensation mechanism that clamps extreme activation values and preserves the clipped residual in a sparse auxiliary matrix.

These techniques help maintain model performance while enabling efficient training in ultra-low-precision formats, representing a significant advancement in efficient LLM training.
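The first of these ideas can be illustrated with a short PyTorch sketch. This is a minimal illustration under simple assumptions, not the authors' implementation: values are rounded to the E2M1 (FP4) grid in the forward pass, while the backward pass weights gradients by the derivative of a smooth surrogate of the quantizer (here a scaled tanh, an illustrative choice; the exact correction function in the paper may differ) instead of passing them through unchanged.

```python
import torch

# Representable magnitudes of the FP4 E2M1 format (sign handled separately).
FP4_E2M1_GRID = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
FP4_MAX = 6.0


def quantize_to_fp4(x: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """Round |x| / scale to the nearest E2M1 magnitude, keep the sign, rescale."""
    grid = FP4_E2M1_GRID.to(device=x.device, dtype=x.dtype)
    mag = (x.abs() / scale).clamp(max=FP4_MAX)
    idx = (mag.unsqueeze(-1) - grid).abs().argmin(dim=-1)  # nearest grid point
    return torch.sign(x) * grid[idx] * scale


class FP4WithDGE(torch.autograd.Function):
    """Simulated FP4 quantization with a differentiable gradient estimator:
    hard rounding in the forward pass; in the backward pass the gradient is
    weighted by the slope of a smooth surrogate of the quantizer rather than
    copied verbatim as in a plain straight-through estimator."""

    @staticmethod
    def forward(ctx, x, scale):
        ctx.save_for_backward(x, scale)
        return quantize_to_fp4(x, scale)

    @staticmethod
    def backward(ctx, grad_out):
        x, scale = ctx.saved_tensors
        u = x / scale
        # Illustrative surrogate f(u) = FP4_MAX * tanh(u / FP4_MAX): its slope
        # is 1 near zero and decays toward the clipping boundary, so gradients
        # of heavily clipped values are attenuated instead of passed through.
        surrogate_slope = 1.0 - torch.tanh(u / FP4_MAX) ** 2
        return grad_out * surrogate_slope, None


# Usage sketch: quantize a weight matrix with a per-tensor scale.
w = torch.randn(1024, 1024, requires_grad=True)
scale = w.detach().abs().amax() / FP4_MAX
w_fp4 = FP4WithDGE.apply(w, scale)
w_fp4.sum().backward()  # gradients flow back into w through the estimator
```

The point of the sketch is that the forward computation sees genuinely quantized FP4 values, while the backward pass uses a smooth, informative approximation of the quantizer.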

The framework primarily targets General Matrix Multiplication (GeMM) operations, which account for over 95% of LLM training computation. The architecture applies 4-bit quantization to GeMM operands with distinct schemes: token-wise quantization for activation tensors and channel-wise quantization for weight tensors. Due to hardware limitations, the system's performance is validated using the FP8 Tensor Cores of Nvidia H-series GPUs, which can accurately simulate FP4's dynamic range. The framework employs FP8 gradient communication and a mixed-precision Adam optimizer for memory efficiency. The system was validated with the LLaMA 2 architecture, trained from scratch on the DCLM dataset, using carefully tuned hyperparameters including a warm-up and cosine-decay learning rate schedule and specific settings for the FP4 method's unique components.
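As a rough illustration of how the GeMM operands might be quantized, the sketch below (again an assumption-laden simplification, not the authors' code) applies channel-wise scaling to weights and token-wise scaling to activations, clamps activation outliers at a high quantile, and keeps the clipped residual in a sparse higher-precision matrix that is added back after the low-precision matrix multiply. The `fake_fp4` helper and the 0.999 clamping quantile are illustrative choices.

```python
import torch

FP4_MAX = 6.0  # largest representable E2M1 magnitude


def fake_fp4(x: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """Simulated FP4: scale, round to the E2M1 grid, and scale back
    (mirroring how FP4 is emulated when no native tensor core exists)."""
    grid = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0],
                        device=x.device, dtype=x.dtype)
    mag = (x.abs() / scale).clamp(max=FP4_MAX)
    idx = (mag.unsqueeze(-1) - grid).abs().argmin(dim=-1)
    return torch.sign(x) * grid[idx] * scale


def quantize_weight_channelwise(w: torch.Tensor) -> torch.Tensor:
    # One scale per output channel (row) of the weight matrix.
    scale = w.abs().amax(dim=1, keepdim=True) / FP4_MAX + 1e-12
    return fake_fp4(w, scale)


def quantize_activation_tokenwise(a: torch.Tensor, clamp_q: float = 0.999):
    # Clamp extreme activations at the `clamp_q` quantile of |a| ...
    threshold = torch.quantile(a.abs().float().flatten(), clamp_q).to(a.dtype)
    a_clamped = a.clamp(-threshold, threshold)
    # ... and keep the clipped residual in a sparse auxiliary matrix so the
    # outlier information can be re-added in higher precision after the GeMM.
    residual = (a - a_clamped).to_sparse()
    # One scale per token (row) of the activation matrix.
    scale = a_clamped.abs().amax(dim=1, keepdim=True) / FP4_MAX + 1e-12
    return fake_fp4(a_clamped, scale), residual


# Usage sketch for one linear layer y = a @ w.T:
a = torch.randn(512, 4096)           # [tokens, channels]
w = torch.randn(1024, 4096)          # [out_channels, channels]
a_q, a_res = quantize_activation_tokenwise(a)
w_q = quantize_weight_channelwise(w)
y = a_q @ w_q.t() + torch.sparse.mm(a_res, w.t())  # sparse outlier compensation
```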

With the proposed FP4 training framework, training curves for LLaMA models with 1.3B, 7B, and 13B parameters closely follow their BF16 counterparts, with FP4 showing marginally higher training losses after 100B training tokens: 2.55 vs. 2.49 (1.3B), 2.17 vs. 2.07 (7B), and 1.97 vs. 1.88 (13B). Zero-shot evaluations across diverse downstream tasks, including Arc, BoolQ, HellaSwag, LogiQA, PiQA, SciQ, OpenbookQA, and Lambada, show that FP4-trained models achieve competitive, and occasionally superior, performance compared to their BF16 counterparts. The results also show that larger models achieve higher accuracy, supporting the scalability of the FP4 training approach.

In conclusion, researchers have successfully developed and validated the first FP4 pretraining framework for LLMs, marking a significant advancement in ultra-low-precision computing. The framework achieves performance comparable to higher-precision formats across various model scales through innovative solutions like the differentiable gradient estimator and outlier compensation mechanism. However, the current implementation faces a notable limitation: the lack of dedicated FP4 Tensor Cores in existing hardware necessitates simulation-based testing, which introduces computational overhead and prevents direct measurement of potential efficiency gains. This limitation underscores the need for hardware advancement to realize the benefits of FP4 computation.


Check out the Paper. All credit for this research goes to the researchers of this project.




Related tags

FP4 quantization · large language models · low-precision training · model optimization · AI hardware