MarkTechPost@AI September 21, 2024
Comprehensive Evaluation of Quantized Instruction-Tuned LLMs: Exploring Quantization Methods for Models Ranging from 7B to 405B Parameters

This article examines how quantized instruction-tuned LLMs perform under different quantization methods. The study covers models ranging from 7B to 405B parameters and evaluates them with GPTQ, AWQ, SmoothQuant, and FP8 quantization techniques. The results show that quantized LLMs outperform smaller models on most tasks, with exceptions in hallucination detection and instruction following. In addition, for the 405B-parameter model, weight-only quantization (GPTQ and AWQ) performs better.

👨‍💻 Researchers conducted a comprehensive evaluation of instruction-tuned LLMs ranging from 7B to 405B parameters, using four quantization methods: GPTQ, AWQ, SmoothQuant, and FP8.

📊 The evaluation shows that quantized LLMs outperform smaller models on most tasks; for example, a 4-bit quantized Llama-2-13B (6.5 GB) beats an FP16 Llama-2-7B (14 GB) on most benchmarks.

🤔 The study also finds that for the 405B-parameter model, weight-only quantization (GPTQ and AWQ) performs better than activation quantization (SmoothQuant), and that SmoothQuant causes accuracy degradation on very large models such as Llama-3.1-405B.

🔎 The study further notes that the MT-Bench evaluation method has limited ability to distinguish between high-performing LLMs.

🚀 The findings are important for understanding how effective quantization techniques are on the latest LLMs and provide guidance for developing more efficient LLMs.

Large Language Models (LLMs) have gained significant attention due to their impressive performance, with the release of Llama 3.1 in July 2024 being a notable example. However, deploying these models in resource-constrained environments poses significant challenges because of their huge parameter counts. Low-bit quantization has emerged as a popular technique to compress LLMs, reducing memory and computational demands during inference. Existing research on quantization algorithms has been limited in scope, focusing mainly on pre-trained models rather than the more widely used instruction-tuned models. Understanding how these quantization methods affect accuracy across various datasets, model sizes, and training approaches is therefore important.

Existing methods to address LLM quantization challenges include Quantization-Aware Training (QAT) and Post-Training Quantization (PTQ). QAT is difficult to apply at LLM scale, so PTQ is more widely adopted despite potential accuracy loss. Other methods include LLM.int8(), which uses 8-bit weights and activations, and GPTQ, a layer-wise quantization technique that utilizes inverse-Hessian information. For evaluating quantized LLMs, prior work has explored weight and activation quantization on language-modeling tasks, the emergent abilities of quantized LLMs, and trustworthiness dimensions. However, most research relies heavily on accuracy as the primary evaluation metric, leaving gaps in understanding the impact of quantization on crucial tasks such as trustworthiness, dialogue, and long-context scenarios.
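To make the PTQ workflow concrete, the sketch below shows 8-bit quantization applied at load time through the Hugging Face Transformers integration of LLM.int8() (bitsandbytes backend). This is a minimal illustration, not the paper's exact setup; the model name is an assumption chosen for illustration.

```python
# Minimal sketch: post-training 8-bit quantization at load time via LLM.int8().
# The model id is illustrative; any causal LM on the Hub could be substituted.
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Llama-2-7b-chat-hf"  # assumed instruction-tuned model

# LLM.int8() keeps outlier activation channels in FP16 and quantizes the rest to 8-bit.
quant_config = BitsAndBytesConfig(load_in_8bit=True)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",  # spread layers across available GPUs/CPU
)

# Inference proceeds exactly as with the full-precision model.
inputs = tokenizer(
    "Explain post-training quantization in one sentence.", return_tensors="pt"
).to(model.device)
output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```

Because no retraining is involved, this kind of PTQ can be applied to an already instruction-tuned checkpoint, which is exactly the setting the study evaluates.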

A team from ETRI, KETI, and Neubla has proposed a comprehensive evaluation of instruction-tuned LLMs across various quantization methods. Their study encompasses models ranging from 7B to 405B parameters, utilizing GPTQ, AWQ, SmoothQuant, and FP8 quantization techniques. This approach provides a detailed understanding of how different quantization methods affect LLM performance across diverse tasks and model sizes. It also addresses the limitations of previous studies by including the latest models and a wider range of parameters, offering insights into the effectiveness of quantization techniques on cutting-edge LLMs.

The study includes a comprehensive evaluation framework, utilizing 13 widely used datasets and benchmarks across 6 task types. For commonsense question answering, datasets such as ARC, HellaSwag, and Winogrande are used to evaluate the model's ability to handle human-like reasoning and elementary knowledge. Moreover, activation quantization (SmoothQuant) and weight-only quantization methods such as GPTQ and AWQ are implemented using tools like AutoGPTQ, llmcompressor, and AutoAWQ. GPTQ uses layer-wise quantization and leverages inverse-Hessian information to mitigate accuracy loss, while AWQ is designed to preserve the precision of critical weights in LLMs. Both methods use a group size of 128 for quantization.
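The following sketch shows how weight-only GPTQ quantization with a group size of 128 can be produced with AutoGPTQ, one of the tools named above. It is a minimal example under stated assumptions: the model id is illustrative, and a real run would use a few hundred calibration samples rather than one sentence.

```python
# Minimal sketch: 4-bit weight-only GPTQ quantization with AutoGPTQ, group size 128.
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

model_id = "meta-llama/Llama-2-7b-chat-hf"       # assumed instruction-tuned model
out_dir = "llama-2-7b-chat-gptq-4bit-128g"       # illustrative output path

tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=True)

# Calibration data: GPTQ quantizes layer by layer, using second-order
# (inverse-Hessian) statistics from these inputs to minimize quantization error.
calibration = [
    tokenizer("Quantization reduces the memory footprint of large language models.")
]

quant_config = BaseQuantizeConfig(
    bits=4,          # 4-bit weights
    group_size=128,  # per-group scales, matching the setup described in the study
    desc_act=False,
)

model = AutoGPTQForCausalLM.from_pretrained(model_id, quant_config)
model.quantize(calibration)
model.save_quantized(out_dir)
```

AWQ follows a similar weight-only recipe (e.g., via AutoAWQ) but selects scaling factors to protect the small fraction of weights that most affect activations.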

The experimental results show that quantized larger LLMs generally outperform smaller models across most benchmarks, except on hallucination-detection and instruction-following tasks. For example, a 4-bit quantized Llama-2-13B (6.5 GB) outperformed an FP16 Llama-2-7B (14 GB) on most benchmarks, with 4.66% and 1.16% higher accuracy on the OpenLLM Leaderboard-v1 and v2 datasets, respectively. Furthermore, the comparison of quantization methods showed little difference between weight-only quantization (GPTQ and AWQ) and activation quantization (SmoothQuant) in most cases. However, SmoothQuant caused accuracy drops of up to 2.93% and 9.23% on average for large models such as Llama-3.1-405B, compared to FP8, on the OpenLLM Leaderboard-v1 and v2 datasets, respectively.
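The memory figures quoted above follow directly from parameter count times bits per weight; the short check below reproduces them, ignoring small overheads such as quantization scales, zero-points, and any layers kept at higher precision.

```python
# Back-of-the-envelope check of the memory figures quoted above.
def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    # bytes = params * bits / 8; convert to GB (1e9 bytes)
    return num_params * bits_per_param / 8 / 1e9

print(f"Llama-2-7B  @ FP16 : {weight_memory_gb(7e9, 16):.1f} GB")   # ~14.0 GB
print(f"Llama-2-13B @ 4-bit: {weight_memory_gb(13e9, 4):.1f} GB")   # ~6.5 GB
```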

In this paper, the team from ETRI, KETI, and Neubla presented a comprehensive evaluation of instruction-tuned LLMs under various quantization methods, spanning 13 datasets and 6 task types. The study covers models ranging from 7B to 405B parameters and uses four quantization methods: GPTQ, AWQ, SmoothQuant, and FP8. The findings reveal that quantized LLMs outperform smaller models on most tasks, with notable exceptions in hallucination detection and instruction following. Weight-only quantization (GPTQ and AWQ) showed superior results on the 405B model. The study also highlights the limitations of the MT-Bench evaluation method in differentiating between high-performing LLMs.


Check out the Paper. All credit for this research goes to the researchers of this project.
