MarkTechPost@AI, November 26, 2024
Neural Magic Releases 2:4 Sparse Llama 3.1 8B: Smaller Models for Efficient GPU Inference

As AI models grow rapidly in size, computational and environmental challenges are becoming increasingly prominent. Neural Magic has released Sparse Llama 3.1 8B, a 50%-pruned, 2:4 GPU-compatible sparse model designed to improve inference efficiency and reduce AI's carbon footprint. Built with SparseGPT, SquareHead knowledge distillation, and a carefully curated pretraining dataset, the model substantially cuts the carbon emissions associated with training large models while maintaining strong performance. Sparse Llama achieves strong results on the Open LLM Leaderboard and shows application potential across tasks such as chat, code generation, and math, offering the AI community a more efficient, accessible, and environmentally friendly solution.

🤔 **Growing model sizes bring computational and environmental challenges:** Training and deploying large language models requires substantial resources, driving up infrastructure costs and carbon emissions, while small and medium-sized enterprises and individuals face prohibitively high barriers to entry.

🚀 **Key features of Sparse Llama 3.1 8B:** The model is 50% pruned and compatible with 2:4 GPU sparsity. Using SparseGPT and SquareHead knowledge distillation, it maintains high accuracy while reducing compute requirements, delivering up to 1.8x lower inference latency and 40% higher throughput.

💡 **Significantly lower training cost and carbon emissions:** Sparse Llama required only 13 billion additional training tokens, substantially reducing the carbon emissions associated with training large models and enabling efficient, sustainable AI development.

📊 **Strong performance:** Sparse Llama recovers 98.4% accuracy on the Open LLM Leaderboard and, after fine-tuning, shows full accuracy recovery and in some cases improved performance on chat, code generation, and math tasks.

🌍 **Making AI fairer and greener:** Sparse Llama offers the AI community a more efficient, accessible, and environmentally friendly solution, putting powerful AI models within reach of more users and advancing equity and sustainability in AI.

The rapid growth in AI model sizes has brought significant computational and environmental challenges. Deep learning models, particularly language models, have expanded considerably in recent years, demanding more resources for training and deployment. This increased demand not only raises infrastructure costs but also contributes to a growing carbon footprint, making AI less sustainable. Additionally, smaller enterprises and individuals face a growing barrier to entry, as the computational requirements are beyond their reach. These challenges highlight the need for more efficient models that can deliver strong performance without demanding prohibitive computing power.

Neural Magic has responded to these challenges by releasing Sparse Llama 3.1 8B—a 50% pruned, 2:4 GPU-compatible sparse model that delivers efficient inference performance. Built with SparseGPT, SquareHead Knowledge Distillation, and a curated pretraining dataset, Sparse Llama aims to make AI more accessible and environmentally friendly. By requiring only 13 billion additional tokens for training, Sparse Llama has significantly reduced the carbon emissions typically associated with training large-scale models. This approach aligns with the industry’s need to balance progress with sustainability while offering reliable performance.

Technical Details

Sparse Llama 3.1 8B leverages sparsity techniques that reduce the number of model parameters while preserving predictive capability. The use of SparseGPT, combined with SquareHead Knowledge Distillation, has enabled Neural Magic to produce a model that is 50% pruned: half of the parameters have been intelligently eliminated in the 2:4 pattern (at most two nonzero weights in every contiguous group of four), which modern NVIDIA GPUs can accelerate natively. This pruning reduces computational requirements and improves efficiency. Sparse Llama also employs advanced quantization techniques so the model runs effectively on GPUs while maintaining accuracy. The key benefits include up to 1.8 times lower latency and 40% better throughput from sparsity alone, with the potential to reach 5 times lower latency when combined with quantization, making Sparse Llama suitable for real-time applications.
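To make the 2:4 pattern concrete, the short PyTorch sketch below prunes a weight matrix so that every contiguous group of four weights keeps only its two largest-magnitude entries. This is an illustrative, magnitude-based stand-in rather than the SparseGPT procedure Neural Magic used (which selects weights using second-order information); the function and variable names are ours.

```python
import torch

def prune_2_of_4(weight: torch.Tensor) -> torch.Tensor:
    """Zero the 2 smallest-magnitude weights in every contiguous group of 4.

    Illustrates the 2:4 semi-structured pattern (at most 2 nonzeros per group
    of 4). This is a toy magnitude criterion, not SparseGPT.
    """
    rows, cols = weight.shape
    assert cols % 4 == 0, "column count must be a multiple of 4 for 2:4 sparsity"
    groups = weight.reshape(rows, cols // 4, 4)
    # Indices of the 2 largest-magnitude entries in each group of 4.
    keep = groups.abs().topk(k=2, dim=-1).indices
    mask = torch.zeros_like(groups, dtype=torch.bool)
    mask.scatter_(-1, keep, True)
    return (groups * mask).reshape(rows, cols)

w = torch.randn(8, 16)
w_sparse = prune_2_of_4(w)

# Every group of 4 now has at most 2 nonzeros, i.e. exactly 50% sparsity.
per_group = (w_sparse.reshape(8, -1, 4) != 0).sum(dim=-1)
assert (per_group <= 2).all()
print(f"overall sparsity: {(w_sparse == 0).float().mean():.0%}")
```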

The release of Sparse Llama 3.1 8B is an important development for the AI community. The model addresses efficiency and sustainability challenges while demonstrating that performance does not need to be sacrificed for computational economy. Sparse Llama recovers 98.4% accuracy on the Open LLM Leaderboard V1 for few-shot tasks and has shown full accuracy recovery, and in some cases improved performance, when fine-tuned for chat, code generation, and math tasks. These results demonstrate that sparsity and quantization have practical applications that enable developers and researchers to achieve more with fewer resources.
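For readers who want to try the released checkpoint, the sketch below loads it through the standard Hugging Face Transformers API. The repo id is an assumption based on the announcement naming, so confirm it on the model card; note also that stock Transformers executes the weights densely, and the reported latency and throughput gains require a sparsity-aware inference runtime with 2:4 sparse kernel support.

```python
# Minimal sketch: load and prompt the sparse checkpoint like any other causal LM.
# The repo id below is an assumption -- verify the exact name on Hugging Face.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "neuralmagic/Sparse-Llama-3.1-8B-2of4"  # assumed repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)

prompt = "Explain 2:4 structured sparsity in one sentence."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```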

Conclusion

Sparse Llama 3.1 8B illustrates how innovation in model compression and quantization can lead to more efficient, accessible, and environmentally sustainable AI solutions. By reducing the computational burden associated with large models while maintaining strong performance, Neural Magic has set a new standard for balancing efficiency and effectiveness. Sparse Llama represents a step forward in making AI more equitable and environmentally friendly, offering a glimpse of a future where powerful models are accessible to a wider audience, regardless of compute resources.


Check out the Details and Model on Hugging Face. All credit for this research goes to the researchers of this project.

