MarkTechPost@AI | October 25, 2024
Meta AI Releases New Quantized Versions of Llama 3.2 (1B & 3B): Delivering Up To 2-4x Increases in Inference Speed and 56% Reduction in Model Size

The rapid growth of large language models (LLMs) has brought significant advancements across various sectors, but it has also presented considerable challenges. Models such as Llama 3 have made impressive strides in natural language understanding and generation, yet their size and computational requirements have often limited their practicality. High energy costs, lengthy training times, and the need for expensive hardware are barriers to accessibility for many organizations and researchers. These challenges not only impact the environment but also widen the gap between tech giants and smaller entities trying to leverage AI capabilities.

Meta AI’s Quantized Llama 3.2 Models (1B and 3B)

Meta AI recently released Quantized Llama 3.2 Models (1B and 3B), a significant step forward in making state-of-the-art AI technology accessible to a broader range of users. These are the first lightweight quantized Llama models that are small and performant enough to run on many popular mobile devices. The research team employed two distinct techniques to quantize these models: Quantization-Aware Training (QAT) with LoRA adapters, which prioritizes accuracy, and SpinQuant, a state-of-the-art post-training quantization method that focuses on portability. Both versions are available for download as part of this release. These models are quantized versions of the original Llama 3.2 1B and 3B models, designed to optimize computational efficiency and significantly reduce the hardware footprint required to run them. By doing so, Meta AI aims to preserve the capability of these models while cutting the computational resources needed for deployment. This makes it feasible for both researchers and businesses to use powerful AI models without specialized, costly infrastructure, democratizing access to cutting-edge AI technologies.
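
The two techniques approach the problem from opposite ends: QAT simulates low-precision arithmetic during training so the model learns to compensate for rounding error, with LoRA adapters kept in higher precision to recover accuracy, while SpinQuant quantizes an already-trained model. Below is a minimal PyTorch sketch of the core QAT idea, fake quantization with a straight-through estimator plus a LoRA branch; the class, ranks, and bit width here are illustrative, not Meta's implementation.

```python
import torch
import torch.nn as nn

def fake_quant(w: torch.Tensor, bits: int = 4) -> torch.Tensor:
    """Round a tensor to a low-bit grid in the forward pass while letting
    gradients flow through unchanged (straight-through estimator)."""
    qmax = 2 ** (bits - 1) - 1                 # e.g. 7 for 4-bit signed values
    scale = w.abs().amax() / qmax
    w_q = torch.clamp(torch.round(w / scale), -qmax, qmax) * scale
    return w + (w_q - w).detach()              # forward: w_q; backward: identity

class QATLinearWithLoRA(nn.Module):
    """A linear layer whose base weight is fake-quantized during training,
    plus a small full-precision LoRA adapter (rank r) that can absorb much
    of the quantization error. Illustrative only, not Meta's code."""
    def __init__(self, in_f: int, out_f: int, r: int = 8):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_f, in_f) * 0.02)
        self.lora_a = nn.Parameter(torch.randn(r, in_f) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(out_f, r))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        base = x @ fake_quant(self.weight).T                 # low-precision path
        return base + (x @ self.lora_a.T) @ self.lora_b.T   # full-precision path

layer = QATLinearWithLoRA(64, 64)
loss = layer(torch.randn(2, 64)).pow(2).mean()
loss.backward()   # both the base weight and the adapters receive gradients
```

Because the rounding step is non-differentiable, the `.detach()` trick passes gradients straight through it; that is what lets the network adapt its weights to the quantization grid during training rather than being quantized blindly afterwards.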

Meta AI is uniquely positioned to provide these quantized models thanks to its access to extensive compute resources, training data, comprehensive evaluations, and a focus on safety. The quantized models maintain the same quality and safety requirements as the original Llama 3.2 models while achieving a significant 2-4x speedup. They also achieve an average 56% reduction in model size and an average 41% reduction in memory usage compared to the original BF16 format. These optimizations are part of Meta's effort to make advanced AI more accessible while maintaining high performance and safety standards.
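
Those percentages are easy to sanity-check with back-of-the-envelope arithmetic. The sketch below uses nominal parameter counts (roughly 1.2B and 3.2B) and the reported 56% average size reduction; the exact figures vary by model and quantization technique.

```python
# Back-of-the-envelope size arithmetic (illustrative; nominal parameter counts).
GB = 1024 ** 3

for name, params in [("Llama 3.2 1B", 1.2e9), ("Llama 3.2 3B", 3.2e9)]:
    bf16_bytes = params * 2                 # BF16 stores each weight in 2 bytes
    quant_bytes = bf16_bytes * (1 - 0.56)   # reported ~56% average size reduction
    print(f"{name}: {bf16_bytes / GB:.1f} GB (BF16) -> ~{quant_bytes / GB:.1f} GB quantized")
```

At roughly 1 GB and 2.6 GB respectively, the quantized checkpoints fit comfortably within the memory budgets of recent flagship phones, which is what makes the on-device claim plausible.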

Technical Details and Benefits

The core of Quantized Llama 3.2 is quantization, a technique that reduces the precision of the model's weights and activations from 16-bit floating point (the original BF16 format) to lower-bit representations. Specifically, Meta AI uses 8-bit and even 4-bit quantization strategies, which allow the models to operate effectively with significantly less memory and computational power. This approach retains the critical features and capabilities of Llama 3.2, such as its ability to perform advanced natural language processing (NLP) tasks, while making the models much more lightweight. The benefits are clear: Quantized Llama 3.2 can run on less powerful hardware, such as consumer-grade GPUs and even CPUs, without a substantial loss in performance. This also makes the models more suitable for real-time applications, since lower computational requirements lead to faster inference times.
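
To make that concrete, here is a minimal sketch of symmetric per-channel 8-bit weight quantization, the basic operation the paragraph describes. It is a simplified illustration of the general technique, not Meta's exact recipe, which mixes the 8-bit and 4-bit strategies mentioned above.

```python
import torch

def quantize_int8_per_channel(w: torch.Tensor):
    """Symmetric per-output-channel int8 quantization of a 2-D weight matrix."""
    scale = w.abs().amax(dim=1, keepdim=True) / 127.0   # one scale per output row
    q = torch.clamp(torch.round(w / scale), -127, 127).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """Recover an approximation of the original weights for computation."""
    return q.float() * scale

w = torch.randn(4096, 4096)                  # a typical linear-layer weight
q, scale = quantize_int8_per_channel(w)
w_hat = dequantize(q, scale)
print("max abs error:", (w - w_hat).abs().max().item())
# int8 stores 1 byte per weight vs 2 bytes for BF16 (2x smaller before scales);
# packing weights into 4 bits roughly doubles the saving again.
```

Per-channel scales matter because weight magnitudes can vary widely between rows; a single global scale would spend most of the 8-bit range on a few outlier channels and crush the rest into a handful of levels.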

Inference using both quantization techniques is supported in the Llama Stack reference implementation via PyTorch's ExecuTorch framework. Additionally, Meta AI has collaborated with industry-leading partners to make these models available on Qualcomm and MediaTek systems-on-chip (SoCs) with Arm CPUs. This partnership ensures that the models can be deployed efficiently on a wide range of devices, including popular mobile platforms, further extending the reach and impact of Llama 3.2.

Importance and Early Results

Quantized Llama 3.2 is important because it directly addresses the scalability issues associated with LLMs. By reducing model size while maintaining a high level of performance, Meta AI has made these models more applicable to edge computing environments, where computational resources are limited. Early benchmarking results indicate that Quantized Llama 3.2 performs at approximately 95% of the full-precision Llama 3.2 models' effectiveness on key NLP benchmarks, despite the roughly 56% smaller model size and 41% lower memory usage noted above. This kind of efficiency is critical for businesses and researchers who want to implement AI without investing in high-end infrastructure. Additionally, the ability to deploy these models on commodity hardware aligns well with current trends in sustainable AI, reducing the environmental impact of training and deploying LLMs.

Conclusion

Meta AI’s release of Quantized Llama 3.2 marks a significant step forward in the evolution of efficient AI models. By focusing on quantization, Meta has provided a solution that balances performance with accessibility, enabling a wider audience to benefit from advanced NLP capabilities. These quantized models address the key barriers to the adoption of LLMs, such as cost, energy consumption, and infrastructure requirements. The broader implications of this technology could lead to more equitable access to AI, fostering innovation in areas previously out of reach for smaller enterprises and researchers. Meta AI’s effort to push the boundaries of efficient AI modeling highlights the growing emphasis on sustainable, inclusive AI development—a trend that is sure to shape the future of AI research and application.


Check out the Details and Try the model here. All credit for this research goes to the researchers of this project.
