Comparing Quantized Performance in Llama Models

Published on July 15, 2024 4:01 PM GMT

Epistemic Status: Quick tests, most of this was done in less than 48 hours

TL;DR: Can you skimp on GPU VRAM? 8-bit quantization seems fine; for 4-bit it depends.

I was asked by @Teun van der Weij to what degree one can run evaluations on quantized models, and I was unsure. I have since run some evaluations with Llama 3 and have some quick comparisons now.

Main Quantization Schemes

Here is a list of the main quantization schemes discussed:

- GGUF: the quantization formats used by llama.cpp (e.g. q8_0, q4_K_M)
- BNB: bitsandbytes quantization (Int8, Int4/FP4, NF4)
- HQQ: Half-Quadratic Quantization
- GPTQ: post-training quantization using (approximate) second-order information
- AWQ: Activation-aware Weight Quantization

Llama 2

Some previous papers have compared the perplexity of different methods. We can see an example of this in the recent research paper on HQQ quantization:

| Method | nBits | Llama-2-7B PPL ↓ | Llama-2-7B MEM (GB) ↓ | Llama-2-13B PPL ↓ | Llama-2-13B MEM (GB) ↓ | Llama-2-70B PPL ↓ | Llama-2-70B MEM (GB) ↓ |
|---|---|---|---|---|---|---|---|
| FP16 | 16 | 5.18 | 13.5 | 4.63 | 25.6 | OOM | OOM |
| BNB | 8 | 5.22 | 7.9 | 4.67 | 14.4 | 3.17 | 68.15 |
| GPTQ_g128 | 8 | 5.19 | 7.8 | 4.63 | 14.8 | 3.12 | 74.87 |
| HQQ_g128 | 8 | 5.19 | 7.6 | 4.63 | 14 | 3.12 | 69.32 |
| BNB_g64 | 4 | 5.43 | 4.7 | 4.79 | 8.2 | 3.29 | 39.11 |
| GPTQ_g64 | 4 | 5.38 | 5 | 4.73 | 9.1 | 3.23 | 41.13 |
| AWQ_g64 | 4 | 5.28 | 4.6 | 4.7 | 8.5 | 3.2 | 37.08 |
| HQQ_g64 | 4 | 5.3 | 4.6 | 4.7 | 8.2 | 3.19 | 37.52 |

Table 1: Comparison of perplexities at different levels of quantization with different methods on the WikiText2 dataset for Llama 2 7B, 13B and 70B.
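For reference, these perplexity numbers are just the exponentiated mean negative log-likelihood of the model over a held-out text. Below is a minimal sketch of how one might compute this with HuggingFace transformers; it is an illustration, not the exact evaluation code behind Table 1, and the model id and the simple non-overlapping 2048-token windows are my own assumptions (different windowing choices give slightly different numbers).

```python
# Minimal sketch: WikiText2 perplexity as exp(mean token NLL) over fixed windows.
# Assumptions (not from the post): model id, 2048-token non-overlapping windows.
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-hf"  # illustrative choice
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)
model.eval()

text = "\n\n".join(load_dataset("wikitext", "wikitext-2-raw-v1", split="test")["text"])
ids = tokenizer(text, return_tensors="pt").input_ids

window = 2048
nll_sum, n_tokens = 0.0, 0
for start in range(0, ids.size(1) - 1, window):
    chunk = ids[:, start : start + window].to(model.device)
    if chunk.size(1) < 2:
        break
    with torch.no_grad():
        # With labels == input_ids, HF shifts internally and returns the mean NLL.
        loss = model(chunk, labels=chunk).loss
    nll_sum += loss.item() * (chunk.size(1) - 1)
    n_tokens += chunk.size(1) - 1

print(f"Perplexity: {torch.exp(torch.tensor(nll_sum / n_tokens)).item():.3f}")
```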

However, it is quite unclear what effect this has on real-world performance.

Llama 3 8B

Here are some examples of what perplexity looks like at different levels of quantization on Llama 3 8B, taken from the llama.cpp repository[1]. This only includes GGUF, and not any other quantization methods as far as I can tell.

| Type | Size (GB) | PPL | Mean Δp | RMS Δp |
|---|---|---|---|---|
| f16 | 14.97 | 6.2331 | - | - |
| q8_0 | 7.96 | 6.2342 | -0.019 ± 0.003 % | 1.198 % |
| q6_K | 6.14 | 6.2533 | -0.007 ± 0.006 % | 2.295 % |
| q5_K_M | 5.33 | 6.2886 | -0.114 ± 0.008 % | 3.160 % |
| q5_0 | 5.21 | 6.3632 | -0.416 ± 0.012 % | 4.634 % |
| q4_K_M | 4.58 | 6.3830 | -0.389 ± 0.014 % | 5.251 % |
| q4_0 | 4.34 | 6.7001 | -1.588 ± 0.022 % | 8.434 % |

Table 2: Comparison of perplexities at different levels of GGUF quantization on the WikiText2 dataset for Llama 3 8B.

This is all fine and good, but a lot of us are trying to do interpretability and whatnot, and I personally have found this easiest with the HuggingFace transformers library. While for some it may make sense to retool, what about those of us who do not want to? While we could wait for potential GGUF compatibility, there are other quantization methods we can use in the meantime, and I have tried to run some tests.
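For concreteness, here is a minimal sketch of loading Llama 3 8B Instruct through transformers with bitsandbytes quantization. The post does not spell out the exact configs used for the runs below, so treat the parameter choices as assumptions; transformers also ships HqqConfig, GPTQConfig and AwqConfig for the other schemes tested.

```python
# Minimal sketch (assumed parameter choices, not necessarily the exact setup
# used for the benchmarks below): load Llama 3 8B Instruct with bitsandbytes
# 8-bit or 4-bit quantization.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"

# BnB Int8: weights stored in int8.
bnb_int8 = BitsAndBytesConfig(load_in_8bit=True)

# BnB NF4: 4-bit "normal float" weights with bfloat16 compute.
bnb_nf4 = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_nf4,  # or bnb_int8
    device_map="auto",
)
```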

Llama 3 8B Instruct

Here are some benchmarks with Llama 3 8B Instruct, run with different quantization levels and schemes on MMLU and WMDP (% accuracy), and on The Pile (perplexity and accuracy over 100k tokens):


| Method | nBits | MMLU (0-shot) | WMDP (0-shot) | The Pile PPL ↓ | The Pile Acc ↑ | Time ↓ |
|---|---|---|---|---|---|---|
| BFloat16 | 16 | 63.87% | 54.99% | 8.283 | 56.52% | 53s[2] |
| Float16 | 16 | 63.84% | 54.93% | 8.279 | 56.50% | 55s[2] |
| Hqq Int8 | 8 | 63.87% | 54.66% | 8.298 | 56.49% | 122s |
| BnB Int8 | 8 | 63.05% | 54.96% | 8.305 | 56.25% | 74s |
| Hqq Int4 | 4 | 62.29% | 54.23% | 8.482 | 55.85% | 130s |
| BnB NF4 | 4 | 61.44% | 54.42% | 8.499 | 55.80% | 95s |
| BnB Int4 | 4 | 60.80% | 52.73% | 8.633 | 55.19% | 277s |
| GPTQ Int4 | 4 | 61.58% | 53.30% | 8.575 | 55.21% | 233s |
| AWQ Int4 | 4 | 61.84% | 54.55% | 8.483 | 55.55% | 270s |
| Hqq Int3 | 3 | 62.26% | 51.23% | 8.872 | 54.49% | 201s |

We can see that the scores going from 16-bit to 8-bit are relatively unaffected, so it is likely fine to use 8-bit. Going to 4-bit has a more noticeable effect, but it is not massively different on a qualitative level.
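The post does not say exactly which harness was used for these numbers. One way to reproduce this kind of table is EleutherAI's lm-evaluation-harness, which can evaluate a bitsandbytes-quantized HF model directly; the sketch below is an assumption about tooling (and assumes your lm-eval version includes the mmlu and wmdp tasks), not a claim about how the table above was produced.

```python
# Rough sketch with lm-evaluation-harness (assumed tooling, not necessarily
# how the numbers above were generated).
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args=(
        "pretrained=meta-llama/Meta-Llama-3-8B-Instruct,"
        "load_in_4bit=True"  # or load_in_8bit=True, or dtype=bfloat16
    ),
    tasks=["mmlu", "wmdp"],
    num_fewshot=0,
    batch_size=8,
)
for task, metrics in results["results"].items():
    print(task, metrics)
```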

Chain of Thought

OK, sure, maybe single-token prediction tasks are relatively unaffected, but Chain of Thought (CoT) requires many "correct" tokens at a time. Maybe these are affected more? I ran some experiments on the Minerva MATH Algebra dataset with zero-shot CoT. (I use only the "Algebra" subset because CoT takes a long time to generate.) Here are the results:

| Method | nBits | Minerva MATH Algebra (0-shot CoT) | Time ↓ |
|---|---|---|---|
| BFloat16 | 16 | 37.2% | 2.3h[2] |
| Float16 | 16 | 37.5% | 2.3h[2] |
| Hqq Int8 | 8 | 37.9% | 15.0h |
| BnB Int8 | 8 | 36.3% | 5.5h |
| Hqq Int4 | 4 | 33.7% | 2.5h |
| BnB NF4 | 4 | 31.3% | 3.2h |
| BnB Int4 | 4 | 29.3% | 3.1h |
| GPTQ Int4 | 4 | DNF | - |
| AWQ Int4 | 4 | DNF | - |
| Hqq Int3 | 3 | DNF | - |

Overall, we can see again that for 8-bit quantization, the effect on performance doesn't seem that large, though there is some noticeable degradation when going to 4-bits.
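For reference, "0-shot CoT" here means letting the model generate a full worked solution and then checking only the final answer. A minimal sketch of that loop is below; the prompt wording and the naive \boxed{} extraction are my own assumptions, not the exact evaluation code used for the table above.

```python
# Minimal sketch of 0-shot CoT scoring on MATH-style problems (assumed prompt
# and naive answer extraction; not the post's exact evaluation code).
import re
import torch

def solve(model, tokenizer, problem: str) -> str:
    messages = [{
        "role": "user",
        "content": problem + "\nThink step by step and put your final answer in \\boxed{}.",
    }]
    input_ids = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    with torch.no_grad():
        output = model.generate(input_ids, max_new_tokens=512, do_sample=False)
    completion = tokenizer.decode(output[0, input_ids.shape[1]:], skip_special_tokens=True)
    # Naive extraction: contents of the last \boxed{...} (fails on nested braces).
    matches = re.findall(r"\\boxed\{([^{}]*)\}", completion)
    return matches[-1].strip() if matches else ""

# accuracy = mean(solve(model, tokenizer, q) == gold for q, gold in algebra_subset)
```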

One thing to note is that the 8-bit implementations, for some reason, were rather slow. I think this is likely some problem I have with torch.compile / torch._dynamo (since the effect was not as noticeable for 4 bits or for the perplexity results), but I did not have time to test this. It may also be because, in some runtimes, the weights need to be de-quantized at every step. Also, these results were run on an (architecturally) older A4000, which does not support FP8/FP4 compute, so results may vary.

Conclusion

If you need to save GPU VRAM, 8-bit quantization (HQQ or BnB) seems essentially free: scores on MMLU, WMDP, The Pile, and CoT MATH all stay close to the 16-bit baselines. 4-bit quantization costs a little on single-token benchmarks and noticeably more on CoT generation, so whether it is acceptable depends on the task you are evaluating.

  1. ^

    Note that some earlier implementations of GGUF with Llama 3 8B had some error when loading the tokeniser, and had much worse performance because of this.

  2. ^

Note that the float16 and bfloat16 experiments were run on a dual-GPU setup, so the times may not be directly comparable.



