MarkTechPost@AI · September 28, 2024
torchao: A PyTorch Native Library that Makes Models Faster and Smaller by Leveraging Low Bit Dtypes, Quantization and Sparsity

Torchao is a PyTorch-native library that optimizes PyTorch models with low-bit dtypes, quantization, and sparsity, making them run faster with a smaller memory footprint. It supports a range of generative AI models, such as Llama 3 and diffusion models, and provides Quantization Aware Training (QAT) to minimize the accuracy degradation caused by low-bit quantization. Torchao also supports low-precision training and offers experimental 8-bit and 4-bit optimizers as drop-in replacements for AdamW.

🚀 **Broad support for generative AI models:** Torchao supports a range of generative AI models, such as Llama 3 and diffusion models, ensuring compatibility and ease of use.

⚡ **Significant performance gains:** Torchao demonstrates impressive speedups during both model inference and training, with speed improvements of up to 97% and substantial reductions in memory usage.

🧮 **Versatile quantization techniques:** Torchao provides a variety of quantization techniques, including low-bit dtypes such as int4 and float8, to optimize models for inference and training.

🧠 **Quantization Aware Training (QAT):** Torchao includes support for QAT, a technique that minimizes the accuracy degradation caused by low-bit quantization.

📈 **Low-precision training optimization:** Torchao offers easy-to-use workflows for reducing the precision of training compute and distributed communication, starting with float8 for `torch.nn.Linear` layers.

🤖 **Low-bit optimizers:** Torchao provides experimental 8-bit and 4-bit optimizers as drop-in replacements for the widely used AdamW optimizer.

🤝 **Integrations and future development:** Torchao has been actively integrated into some of the most significant open-source projects in the machine learning community, including serving as an inference backend for HuggingFace transformers, contributing diffusers-torchao for accelerating diffusion models, and providing QLoRA and QAT recipes in torchtune.

🚀 **Performance gains:** Torchao achieves up to a 97% speedup for Llama 3 8B inference using advanced quantization techniques.

📉 **Reduced resource consumption:** Torchao demonstrates a 73% reduction in peak VRAM usage, a testament to its efficiency.

💡 **Looking ahead:** The PyTorch team has outlined several exciting future directions for torchao, including pushing quantization below 4 bits, developing performant kernels for high-throughput inference, expanding to more layers, scaling types, and granularities, and supporting additional hardware backends such as MX hardware.

💪 **Conclusion:** Torchao is a powerful tool that helps researchers and developers optimize their PyTorch models to run faster with a smaller memory footprint. It offers a suite of advanced features, such as low-bit quantization, QAT, and low-precision training, making it an ideal choice for anyone looking to improve the performance of their deep learning models.

PyTorch has officially launched torchao, a comprehensive native library designed to optimize PyTorch models for better performance and efficiency. The launch of this library is a milestone in deep learning model optimization, providing users with an accessible toolkit that leverages advanced techniques such as low-bit dtypes, quantization, and sparsity. The library is predominantly written in PyTorch code, ensuring ease of use and integration for developers working on inference and training workloads.

Key Features of torchao

Together, the following features establish torchao as a versatile and efficient library for deep-learning model optimization.

Advanced Quantization Techniques

One of the standout features of torchao is its robust support for quantization. The library's inference quantization algorithms work over arbitrary PyTorch models that contain `nn.Linear` layers, providing weight-only and dynamic activation quantization for various dtypes and sparse layouts. Developers can select the most suitable technique through the top-level `quantize_` API, which includes options for memory-bound models, such as `int4_weight_only` and `int8_weight_only`, as well as options for compute-bound models. For compute-bound models, torchao can perform float8 quantization, providing additional flexibility for high-performance model optimization. Moreover, torchao's quantization techniques are highly composable, enabling sparsity and quantization to be combined for further gains.
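
As a concrete illustration, the snippet below applies int4 weight-only quantization through the top-level `quantize_` API. This is a minimal sketch based on torchao's documented usage around the time of launch; names such as `int4_weight_only` are version-dependent.

```python
import torch
from torchao.quantization import quantize_, int4_weight_only

# Any model containing nn.Linear layers qualifies; a toy MLP stands in here.
model = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096),
    torch.nn.ReLU(),
    torch.nn.Linear(4096, 1024),
).to(device="cuda", dtype=torch.bfloat16)

# Replace the weights of every nn.Linear with int4, in place.
# Memory-bound models benefit most; a compute-bound model would
# use a float8 recipe instead.
quantize_(model, int4_weight_only())
```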

Quantization Aware Training (QAT)

Torchao addresses the accuracy degradation that post-training quantization can cause, particularly for models quantized below 4 bits. The library includes support for Quantization Aware Training (QAT), which has been shown to recover up to 96% of the accuracy degradation on challenging benchmarks like HellaSwag. The feature ships as an end-to-end recipe in torchtune, along with a minimal tutorial to facilitate its adoption. Incorporating QAT makes torchao a powerful tool for training models with low-bit quantization while maintaining accuracy.
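
Schematically, a QAT run follows a prepare-train-convert flow. The sketch below uses the prototype quantizer described in PyTorch's QAT announcement; the module path and class name reflect its prototype status at the time and may have moved in later releases.

```python
from torchao.quantization.prototype.qat import Int8DynActInt4WeightQATQuantizer

qat_quantizer = Int8DynActInt4WeightQATQuantizer()

# Prepare: insert "fake quantize" ops so training observes quantization
# error while weights remain in high precision.
model = qat_quantizer.prepare(model)

# ... fine-tune the prepared model as usual (forward, backward, step) ...

# Convert: swap the fake-quantize ops for real int8-dynamic-activation,
# int4-weight quantized operations.
model = qat_quantizer.convert(model)
```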

Training Optimization with Low Precision

In addition to inference optimization, torchao offers comprehensive support for low-precision computing and communication during training. The library includes easy-to-use workflows for reducing the precision of training compute and distributed communications, beginning with float8 for `torch.nn.Linear` layers.
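
In practice, enabling float8 training is a single conversion call, roughly as sketched below. The `module_filter_fn` hook is one documented way to keep selected layers in high precision (the `lm_head` name here is purely illustrative), and the exact signature should be treated as version-dependent.

```python
from torchao.float8 import convert_to_float8_training

# Swap supported nn.Linear layers to float8 compute, in place,
# skipping the output projection, which typically stays in high precision.
convert_to_float8_training(
    model,
    module_filter_fn=lambda module, fqn: fqn != "lm_head",
)

# Training then proceeds as usual, typically under torch.compile
# to fuse away the scaling and casting overhead.
```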

Torchao has demonstrated impressive results, such as a 1.5x speedup for Llama 3 70B pretraining when using float8. The library also provides experimental support for other training optimizations, such as NF4 QLoRA in torchtune, prototype int8 training, and accelerated sparse 2:4 training. These features make torchao a compelling choice for users looking to accelerate training while minimizing memory usage.

Low-Bit Optimizers

Inspired by the pioneering work of bitsandbytes on low-bit optimizers, torchao introduces prototype support for 8-bit and 4-bit optimizers as drop-in replacements for the widely used AdamW optimizer. Users can switch to a low-bit optimizer seamlessly, further improving training efficiency without significant changes to their existing codebases.
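
Adopting one is meant to be a one-line change, roughly as follows; the module path `torchao.prototype.low_bit_optim` and the class names reflect the prototype API at launch.

```python
from torchao.prototype.low_bit_optim import AdamW8bit  # AdamW4bit also available

# Drop-in replacement for torch.optim.AdamW: optimizer state is kept
# in 8 bits, shrinking optimizer memory roughly 4x versus fp32 state.
optimizer = AdamW8bit(model.parameters(), lr=1e-4, weight_decay=0.01)
```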

Integrations and Future Developments

Torchao has been actively integrated into some of the most significant open-source projects in the machine-learning community. These integrations include serving as an inference backend for HuggingFace transformers, contributing to diffusers-torchao for accelerating diffusion models, and providing QLoRA and QAT recipes in torchtune. torchao’s 4-bit and 8-bit quantization techniques are also supported in the SGLang project, making it a valuable tool for those working on research and production deployments.

Moving forward, the PyTorch team has outlined several exciting developments for torchao. These include pushing the boundaries of quantization by going lower than 4-bit, developing performant kernels for high-throughput inference, expanding to more layers, scaling types, or granularities, and supporting additional hardware backends, such as MX hardware.

Key Takeaways from the Launch of torchao

In conclusion, the launch of torchao represents a major step forward for PyTorch, providing developers with a powerful toolkit to make models faster and more efficient across training and inference scenarios.


Check out the Details and GitHub. All credit for this research goes to the researchers of this project.

