MarkTechPost@AI | July 30, 2024
Neural Magic Releases Fully Quantized FP8 Version of Meta’s Llama 3.1 405B Model: FP8 Dynamic Quantization and FP8 Static Quantization

Neural Magic has released a fully quantized FP8 version of Meta's Llama 3.1 405B model, addressing memory constraints and improving inference speed, with a range of advantages and applications.

🎯 Neural Magic has released a fully quantized FP8 version of Meta's Llama 3.1 405B model that runs on 8xH100 or 8xA100 systems, resolving the out-of-memory problems of the original FP8 and FP16 versions and improving inference speed by more than 2X.

💻 The model comes in two key versions. Meta-Llama-3.1-405B-Instruct-FP8-dynamic retains the Meta-Llama-3.1 architecture and is designed for assistant-style chat in multiple languages, though usage is restricted to English and lawful applications. Quantizing the weights and activations to the FP8 data type reduces the number of bits per parameter, cutting disk size and GPU memory requirements.

🚀 Neural Magic's quantized model can be deployed efficiently with the vLLM backend, operated through the `vllm` and `transformers` libraries in Python. The model was evaluated on multiple benchmarks, such as MMLU and ARC-Challenge; the quantized model maintains high accuracy while substantially reducing memory requirements and improving inference speed.

Neural Magic has recently announced a significant breakthrough in AI model compression, introducing a fully quantized FP8 version of Meta’s Llama 3.1 405B model. This achievement marks a milestone in AI, allowing the massive 405-billion-parameter model to fit seamlessly on any 8xH100 or 8xA100 system without the out-of-memory (OOM) errors commonly encountered with the original FP8 and FP16 versions. The new model eases memory constraints and delivers more than a 2X improvement in inference speed by leveraging faster memory and compute, eliminating the need for CPU offloading or distribution across multiple nodes.

Neural Magic provides two key versions of the model: one built with FP8 dynamic quantization, where activation scales are computed on the fly per token, and one built with FP8 static quantization, where activation scales are fixed ahead of time.

The fully quantized FP8 version, Meta-Llama-3.1-405B-Instruct-FP8-dynamic, maintains the architecture of Meta-Llama-3.1 and is designed for assistant-like chat in multiple languages; however, it is restricted to use in English and for lawful applications only. Released as version 1.0, the model was developed by Neural Magic and is distributed under the llama3.1 license.

Quantization and Optimization

The model achieves remarkable efficiency through weight and activation quantization to the FP8 data type. This process reduces the number of bits per parameter from 16 to 8, halving the disk size and GPU memory requirements. Consequently, the model can be loaded and evaluated on a single node of 8xH100 GPUs instead of requiring multiple nodes.
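
For a rough sense of why this matters, consider the weight memory alone. The back-of-the-envelope sketch below assumes 80 GB per GPU; KV cache and activations add further overhead that it ignores:

```python
# Rough weight-memory arithmetic for a 405B-parameter model (weights only).
params = 405e9

fp16_gb = params * 2 / 1e9   # 2 bytes per parameter -> ~810 GB
fp8_gb  = params * 1 / 1e9   # 1 byte per parameter  -> ~405 GB

node_gb = 8 * 80             # one 8xH100/8xA100 node at 80 GB per GPU -> 640 GB

print(f"FP16 weights: {fp16_gb:.0f} GB (exceeds a {node_gb} GB node -> OOM)")
print(f"FP8 weights:  {fp8_gb:.0f} GB (fits on a single {node_gb} GB node)")
```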

The quantization process involves symmetric per-channel quantization, where a linear scaling per output dimension maps the FP8 representations of the quantized weights and activations. Activations are quantized dynamically on a per-token basis. This was accomplished using LLM Compressor with 512 sequences from UltraChat, ensuring optimal performance.
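
LLM Compressor's actual recipe is not reproduced in the article, but the arithmetic it describes can be sketched in plain PyTorch. The function names below are invented for illustration, and a production FP8 kernel would multiply in 8-bit directly rather than upcasting:

```python
import torch

# FP8 e4m3 has a maximum representable magnitude of 448.
FP8_MAX = torch.finfo(torch.float8_e4m3fn).max

def quantize_weight_per_channel(w: torch.Tensor):
    """Symmetric per-channel quantization: one linear scale per output dimension."""
    scale = w.abs().amax(dim=1, keepdim=True).clamp(min=1e-12) / FP8_MAX  # [out, 1]
    w_fp8 = (w / scale).to(torch.float8_e4m3fn)  # stored in 8 bits instead of 16
    return w_fp8, scale

def quantize_activation_per_token(x: torch.Tensor):
    """Dynamic quantization: one scale per token, computed on the fly at inference."""
    scale = x.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12) / FP8_MAX  # [tokens, 1]
    x_fp8 = (x / scale).to(torch.float8_e4m3fn)
    return x_fp8, scale

def fp8_linear(x: torch.Tensor, w_fp8: torch.Tensor, w_scale: torch.Tensor):
    """Reference forward pass: dequantize and matmul in float32.
    Real FP8 kernels perform the matmul natively in 8-bit for the speedup."""
    x_fp8, x_scale = quantize_activation_per_token(x)
    return (x_fp8.float() * x_scale) @ (w_fp8.float() * w_scale).t()
```

In the dynamic scheme sketched here, activation scales require no calibration because they are recomputed per token; the 512 UltraChat sequences mentioned above serve as calibration data during the LLM Compressor run.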

Deployment and Evaluation

Neural Magic’s quantized model can be deployed efficiently using the vLLM backend. The deployment process uses the vllm and transformers libraries in Python, as demonstrated in the code snippets provided with the release, which show how easily text can be generated with the optimized model; a minimal sketch follows below.
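
Since the original snippets are not reproduced here, the sketch below follows the usual vLLM chat-generation pattern with the published FP8-dynamic checkpoint; the prompt and sampling settings are illustrative assumptions:

```python
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

model_id = "neuralmagic/Meta-Llama-3.1-405B-Instruct-FP8-dynamic"

tokenizer = AutoTokenizer.from_pretrained(model_id)
sampling_params = SamplingParams(temperature=0.6, top_p=0.9, max_tokens=256)

# tensor_parallel_size=8 shards the quantized weights across the 8 GPUs of one node.
llm = LLM(model=model_id, tensor_parallel_size=8)

messages = [{"role": "user", "content": "Summarize the benefits of FP8 quantization."}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

outputs = llm.generate([prompt], sampling_params)
print(outputs[0].outputs[0].text)
```

Sharding across eight GPUs with tensor parallelism is exactly the single-node deployment footprint the release targets.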

The model was evaluated on several benchmarks, including MMLU, ARC-Challenge, GSM-8K, Hellaswag, Winogrande, and TruthfulQA. The evaluation utilized Neural Magic’s fork of the ‘lm-evaluation-harness’ and the vLLM engine. The quantized model, Meta-Llama-3.1-405B-Instruct-FP8-dynamic, achieved an average score of 86.55 on the OpenLLM benchmark, closely mirroring the unquantized model’s score of 86.63, demonstrating a near-perfect recovery of 99.91%.

Reproduction and Accuracy

Neural Magic provides detailed commands for reproducing the evaluation results across various benchmarks. These commands illustrate the robustness of the quantized model, maintaining high accuracy across different tasks and few-shot settings. For instance, the model achieved a 99.91% recovery rate on MMLU (5-shot) and 100.2% on Winogrande (5-shot), underscoring its reliability and precision.
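
The exact commands live with the release; an illustrative invocation in the standard lm-evaluation-harness CLI style is shown below. Neural Magic's fork may use different task names or flags, so treat this as an assumption-laden template:

```
lm_eval \
  --model vllm \
  --model_args pretrained=neuralmagic/Meta-Llama-3.1-405B-Instruct-FP8-dynamic,tensor_parallel_size=8,dtype=auto \
  --tasks mmlu \
  --num_fewshot 5 \
  --batch_size auto
```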

Conclusion

In conclusion, the release of the fully quantized FP8 version of Meta’s Llama 3.1 405B model by Neural Magic marks a major step forward in model compression. By effectively reducing memory requirements and enhancing inference speeds, the model opens new avenues for efficient and scalable AI applications. The success of this quantization effort, with minimal loss in accuracy, highlights the potential for further innovations in the field, making powerful AI models more accessible and practical for a wide range of users.


Check out the FP8 Dynamic Quantization and FP8 Static Quantization. All credit for this research goes to the researchers of this project.
