MarkTechPost@AI | August 17, 2024
Nvidia AI Released Llama-3.1-Minitron 4B: A New Language Model Built by Pruning and Distilling Llama 3.1 8B

Nvidia has released a new language model, Llama-3.1-Minitron 4B, a slimmed-down version of Llama-3.1 8B that uses techniques such as pruning and knowledge distillation to improve efficiency while preserving performance.

🎯 Llama-3.1-Minitron 4B is a distilled and pruned version of Llama-3.1 8B: structured pruning along the depth and width axes shrinks the model from 8B to 4B parameters while retaining performance.

💡 Nvidia also applied classical knowledge distillation, training the smaller model to mimic the behavior of the larger, more complex one, preserving the original model's predictive power while improving resource efficiency.

🌟 The model performs strongly across a range of benchmarks, with better accuracy and efficiency in reasoning, coding, and math, and a significantly lower compute cost.

🚀 Nvidia further optimized the model with its TensorRT-LLM toolkit, boosting inference performance; for example, FP8 throughput is 2.7x higher than that of the original Llama 3.1 8B model.

Nvidia has just announced a new release in language models, but this time a small one: the Llama-3.1-Minitron 4B model. It marks a notable step in the continuing evolution of language models, bringing much of the capability of large-scale models to a smaller footprint through techniques such as pruning and knowledge distillation.

The Llama-3.1-Minitron 4B model is a distilled and pruned version of its bigger sibling, the Llama-3.1 8B model. To create the smaller model from the original 8B one, Nvidia used structured pruning along the depth and width axes. Pruning is a technique that deletes less important layers or neurons from a network, reducing model size and complexity while retaining performance. Here, Nvidia performed depth pruning by removing 16 layers from the model, downsizing it from 8B toward 4B parameters. It also applied width pruning by trimming the embedding dimensions and the MLP intermediate dimensions.
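As a rough illustration of what depth pruning looks like in practice, the sketch below drops transformer blocks from a Hugging Face Llama checkpoint. The checkpoint id and the choice of which layers to keep are illustrative assumptions; Nvidia's actual selection was guided by measured layer importance, not a fixed slice.

```python
# Minimal depth-pruning sketch with Hugging Face transformers.
# The repo id and the kept-layer indices are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B", torch_dtype=torch.bfloat16  # assumed repo id
)

# Depth pruning: keep 16 of the 32 transformer blocks.
# (Placeholder ranking: first 16; Nvidia ranked layers by importance.)
keep = list(range(16))
model.model.layers = torch.nn.ModuleList([model.model.layers[i] for i in keep])
model.config.num_hidden_layers = len(keep)

# The pruned model is then retrained (with distillation) to recover quality.
```

Width pruning is analogous but trims per-layer dimensions (embedding channels and the MLP intermediate size) rather than whole layers, which requires slicing the corresponding weight matrices.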

Besides pruning, Nvidia also applied classical knowledge distillation to enhance the efficiency of Llama-3.1-Minitron 4B. Knowledge distillation is a process whereby a smaller model, the student, is trained to mimic the behavior of a larger and more complex one, the teacher. Much of the predictive power of the original model is thereby preserved in the smaller model, which is faster and more frugal in terms of resources. By combining distillation with pruning, Nvidia ensured that the retrained 4B model performs strongly while costing far less to run than its larger counterpart.
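The core of classical logit distillation can be sketched in a few lines. This is the generic formulation, not Nvidia's exact training recipe:

```python
# Generic knowledge-distillation loss: the student is trained to match the
# teacher's softened next-token distribution via KL divergence.
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    # Soften both distributions with a temperature > 1.
    s = F.log_softmax(student_logits / temperature, dim=-1)
    t = F.softmax(teacher_logits / temperature, dim=-1)
    # Scaling by T^2 keeps gradient magnitudes comparable across temperatures.
    return F.kl_div(s, t, reduction="batchmean") * temperature**2
```

In practice this term is often mixed with the standard cross-entropy loss on ground-truth tokens, with the balance between the two treated as a hyperparameter.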

The Llama-3.1-Minitron 4B model excels on a range of benchmarks, delivering competitive performance against larger state-of-the-art open-source models. It outperforms many other small language models, such as Minitron 4B, Phi-2 2.7B, Gemma2 2.6B, and Qwen2-1.5B, across most domains. Extensive benchmarking has confirmed the model's accuracy and efficiency in reasoning, coding, and math.

One of the biggest advantages of the Llama-3.1-Minitron 4B model is that it competes with larger models while remaining resource-efficient. It requires up to 40x fewer training tokens than training a comparable model from scratch, which translates into considerable compute-cost savings. This makes it a very appealing option for scenarios where computational resources are too limited to deploy large-scale language models.

Nvidia has further optimized the Llama-3.1-Minitron 4B model for deployment using its TensorRT-LLM toolkit, which enhances inference performance. For instance, the model's FP8 throughput across various workloads is up to 2.7x that of the original Llama 3.1 8B model. These additional optimizations make the model powerful, efficient, and readily applicable across many domains.
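For context, a deployment along these lines might use TensorRT-LLM's high-level Python API. The exact import path and options vary by TensorRT-LLM release, and the Hugging Face repo id below is an assumption rather than a confirmed identifier:

```python
# Hedged sketch of inference via TensorRT-LLM's high-level LLM API.
# The API surface differs across versions; the model id is an assumed example.
from tensorrt_llm import LLM, SamplingParams

llm = LLM(model="nvidia/Llama-3.1-Minitron-4B-Width-Base")  # assumed repo id
params = SamplingParams(temperature=0.8, max_tokens=128)

for out in llm.generate(["Summarize knowledge distillation in one sentence."], params):
    print(out.outputs[0].text)
```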

In conclusion, Nvidia's release of the Llama-3.1-Minitron 4B model is a significant step forward in the development of LLMs. The model achieves strong performance while remaining resource-efficient, making it useful across many NLP tasks. The Llama-3.1-Minitron 4B model joins Nvidia's Hugging Face collection and adds to the shifting landscape of powerful, freely available AI models.


Check out the Model Card and Details. All credit for this research goes to the researchers of this project. Also, don't forget to follow us on Twitter and join our Telegram Channel and LinkedIn Group. If you like our work, you will love our newsletter.

Don’t Forget to join our 48k+ ML SubReddit

Find Upcoming AI Webinars here


