MarkTechPost@AI, February 7
Microsoft AI Researchers Introduce Advanced Low-Bit Quantization Techniques to Enable Efficient LLM Deployment on Edge Devices without High Computational Costs

Microsoft researchers have introduced a series of advanced low-bit quantization techniques aimed at enabling efficient deployment of large language models (LLMs) on edge devices without high computational costs. The approach rests on three innovations: the Ladder data type compiler, the T-MAC mpGEMM library, and the LUT Tensor Core hardware architecture, which together overcome the challenges conventional hardware faces when running low-bit quantized LLMs. These techniques shrink model size and power consumption, speed up inference, and allow effective operation on a wide range of devices, from high-end laptops to low-power IoT hardware. Experimental results show significant gains in both performance and energy efficiency.

💾 **Low-bit quantization:** Reduces model size so LLMs can run efficiently on edge devices, addressing the problem that conventionally deployed models are too large to fit.

🧮 **T-MAC mpGEMM library:** Optimizes mixed-precision computation with a lookup table (LUT) based method, eliminating the need for dequantization, significantly improving CPU computational efficiency, and boosting inference speed.

🚀 **Ladder data type compiler:** Bridges the gap between low-bit model representations and hardware constraints by converting unsupported data formats into hardware-compatible representations, ensuring that modern deep learning architectures can use custom data types without sacrificing performance.

⚡ **LUT Tensor Core hardware architecture:** Introduces a dedicated accelerator designed for low-bit quantization that uses an optimized instruction set to raise performance while lowering power consumption, delivering large gains in energy efficiency and compute density.

Edge devices like smartphones, IoT gadgets, and embedded systems process data locally, which improves privacy, reduces latency, and enhances responsiveness, and AI is being integrated into these devices rapidly. However, deploying large language models (LLMs) on them is difficult because of their high computational and memory demands.

LLMs are massive in size and power requirements. With billions of parameters, they demand significant memory and processing capacity that exceeds the capabilities of most edge devices. While quantization techniques reduce model size and power consumption, conventional hardware is optimized for symmetric computations, limiting support for mixed-precision arithmetic. This lack of native hardware support for low-bit computations restricts deployment across mobile and embedded platforms. 
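
A rough illustration of why this matters for memory footprint: the minimal NumPy sketch below (illustrative only, not Microsoft's code) quantizes an FP32 weight matrix to signed 4-bit integers with one scale per output channel; packing two 4-bit values per byte cuts storage by roughly 8x.

```python
# Minimal sketch of symmetric per-channel int4 weight quantization (illustrative only).
import numpy as np

def quantize_int4(weights: np.ndarray):
    """Quantize FP32 weights to signed 4-bit values, one scale per output channel."""
    scales = np.abs(weights).max(axis=1, keepdims=True) / 7.0   # map the largest |w| to 7
    q = np.clip(np.round(weights / scales), -8, 7).astype(np.int8)
    return q, scales

w = np.random.randn(4096, 4096).astype(np.float32)
q, scales = quantize_int4(w)

# Two 4-bit values fit in one byte, so storage falls from 4 bytes per weight
# to about 0.5 bytes, plus a small overhead for the per-channel scales.
packed_bytes = q.size // 2 + scales.nbytes
print(f"compression ratio: {w.nbytes / packed_bytes:.1f}x")  # roughly 8x
```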

Prior methods for running LLMs on edge devices use high-bit precision formats like FP32 and FP16, which improve numerical stability but require significant memory and energy. Some approaches use lower-bit quantization (e.g., int8 or int4) to reduce resource demands, but compatibility issues arise with existing hardware. Another technique, dequantization, re-expands compressed models before computation but introduces latency and negates efficiency gains. Also, traditional matrix multiplication (GEMM) requires uniform precision levels, which makes performance optimization across different hardware architectures complex.
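
To make the dequantization overhead concrete, the hypothetical sketch below shows the conventional path criticized here: int4 weights are expanded back to FP32 before a uniform-precision GEMM can run, adding a full extra pass over the weight matrix on every inference step.

```python
# Illustrative sketch of the dequantize-then-GEMM baseline (not an actual library kernel).
import numpy as np

def dequantize_then_gemm(x: np.ndarray, q_int4: np.ndarray, scales: np.ndarray) -> np.ndarray:
    """Expand int4 weights to FP32, then run a standard uniform-precision matmul."""
    w_fp32 = q_int4.astype(np.float32) * scales   # extra memory traffic and compute on every call
    return x @ w_fp32.T

x = np.random.randn(1, 4096).astype(np.float32)                  # a single-token activation
q = np.random.randint(-8, 8, size=(4096, 4096), dtype=np.int8)   # stand-in int4 weights
scales = np.full((4096, 1), 0.05, dtype=np.float32)
y = dequantize_then_gemm(x, q, scales)  # the dequantization pass is the overhead described above
```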

Microsoft researchers introduced a series of advancements to enable efficient low-bit quantization for LLMs on edge devices. Their approach includes three major innovations: 

1. The Ladder data type compiler
2. The T-MAC mpGEMM library
3. The LUT Tensor Core hardware architecture

These techniques aim to overcome hardware limitations by facilitating mixed-precision general matrix multiplication (mpGEMM) and reducing computational overhead. With these solutions, researchers propose a practical framework that supports efficient LLM inference without requiring specialized GPUs or high-power accelerators.

The first component, the Ladder data type compiler, bridges the gap between low-bit model representations and hardware constraints. It converts unsupported data formats into hardware-compatible representations while maintaining efficiency. This approach ensures modern deep learning architectures can utilize custom data types without sacrificing performance.
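
As a hedged, simplified illustration of the kind of data-type rewriting such a compiler performs (the real Ladder compiler transforms tensor programs; the function below is an assumed example, not its API), consider a packed int4 format that the hardware cannot consume directly being rewritten into int8, which existing vector and tensor units do support:

```python
# Assumed illustration of rewriting an unsupported packed-int4 format into hardware-friendly int8.
import numpy as np

def unpack_int4_to_int8(packed: np.ndarray) -> np.ndarray:
    """Expand bytes holding two signed 4-bit values each into an int8 array."""
    low = (packed & 0x0F).astype(np.int8)
    high = ((packed >> 4) & 0x0F).astype(np.int8)
    # Sign-extend the 4-bit two's-complement values to 8 bits.
    low = np.where(low > 7, low - 16, low)
    high = np.where(high > 7, high - 16, high)
    return np.stack([low, high], axis=-1).reshape(*packed.shape[:-1], -1)

packed = np.random.randint(0, 256, size=(4, 8), dtype=np.uint8)  # hypothetical packed int4 weights
int8_view = unpack_int4_to_int8(packed)                          # shape (4, 16), usable by int8 kernels
```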

The T-MAC mpGEMM library optimizes mixed-precision computations using a lookup table (LUT)–based method instead of traditional multiplication operations. This innovation eliminates the need for dequantization and significantly enhances CPU computational efficiency. 
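
The LUT idea can be sketched for the simplest case of 1-bit weights (an assumed simplification, not T-MAC's actual kernels): a group of g activations has only 2^g possible partial sums, so they can be precomputed once per activation vector and then fetched by table lookup, replacing multiply-accumulates with memory reads and additions.

```python
# Toy LUT-based dot product for binary weights (illustrative of the idea, not T-MAC itself).
import numpy as np

G = 4  # group size: the table holds 2**G entries per group of activations

def build_lut(x: np.ndarray) -> np.ndarray:
    """Precompute the partial sum of every subset of each group of G activations."""
    groups = x.reshape(-1, G)                        # (num_groups, G)
    patterns = np.arange(2 ** G)                     # every possible G-bit weight pattern
    bits = (patterns[:, None] >> np.arange(G)) & 1   # (2**G, G) bit matrix
    return groups @ bits.T                           # (num_groups, 2**G)

def lut_dot(lut: np.ndarray, w_idx: np.ndarray):
    """Dot product of binary weights with activations using table lookups only."""
    return lut[np.arange(lut.shape[0]), w_idx].sum()

x = np.random.randn(16).astype(np.float32)
w = np.random.randint(0, 2, 16)                                 # binary {0, 1} weights
w_idx = (w.reshape(-1, G) * (1 << np.arange(G))).sum(axis=1)    # one G-bit table index per group
assert np.isclose(lut_dot(build_lut(x), w_idx), x @ w)
```

Multi-bit weights can be handled by decomposing them into bit planes and scaling each plane's lookup result by its bit weight, which is how bit-serial LUT schemes generalize beyond the binary case.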

Also, the LUT Tensor Core hardware architecture introduces a specialized accelerator designed for low-bit quantization. It leverages an optimized instruction set to improve performance while reducing power consumption.

In evaluations, the Ladder data type compiler outperforms conventional deep neural network (DNN) compilers by up to 14.6 times for specific low-bit computations. When tested on edge devices like the Surface Laptop 7 with the Qualcomm Snapdragon X Elite chipset, the T-MAC library achieved 48 tokens per second for the 3B BitNet-b1.58 model, outperforming existing inference libraries. On lower-end devices such as the Raspberry Pi 5, it achieved 11 tokens per second, demonstrating significant efficiency improvements. Meanwhile, the LUT Tensor Core hardware achieved an 11.2-fold increase in energy efficiency and a 20.9-fold boost in computational density.

Several key takeaways from the research by Microsoft include: 

- Low-bit quantization reduces model size, enabling efficient execution on edge devices.
- The T-MAC library enhances inference speed by eliminating traditional multiplication operations.
- The Ladder compiler ensures seamless integration of custom low-bit data formats with existing hardware.
- Optimized techniques reduce power usage, making LLMs feasible for low-energy devices.
- These methods allow LLMs to operate effectively on a wide range of hardware, from high-end laptops to low-power IoT devices.
- These innovations achieve 48 tokens per second on the Snapdragon X Elite, 30 tokens per second on 2-bit 7B Llama, and 20 tokens per second on 4-bit 7B Llama.
- They also enable AI-driven applications across mobile, robotic, and embedded AI systems by making LLMs more accessible.

In conclusion, the study highlights the importance of hardware-aware quantization techniques for deploying LLMs on edge devices. The proposed solutions effectively address the long-standing challenges of memory consumption, computational efficiency, and hardware compatibility. By implementing Ladder, T-MAC, and LUT Tensor Core, researchers have paved the way for next-generation AI applications that are faster, more energy-efficient, and more scalable across various platforms.


Check out the Details and Paper. All credit for this research goes to the researchers of this project.


