MarkTechPost@AI October 16, 2024
SeedLM: A Post-Training Compression Method that Uses Pseudo-Random Generators to Efficiently Encode and Compress LLM Weights

SeedLM is a novel data-free compression method designed to address the challenges of deploying large language models. It uses the seeds of pseudo-random generators to encode and compress model weights, reducing memory access while maintaining computational efficiency. It performs strongly across a range of tasks and requires no calibration data.

🌱 SeedLM uses the seeds of pseudo-random generators to encode and compress model weights, significantly reducing memory access while preserving computational efficiency. Linear Feedback Shift Registers (LFSRs) generate pseudo-random matrices, trading increased computation for fewer memory accesses.

📦 SeedLM projects each weight block of the LLM onto a random basis generated from an optimal seed, effectively minimizing compression error. Its compression process searches for an optimal seed and projection coefficients, so the weights can be efficiently reconstructed from just the seed and a few coefficients, without storing every individual weight value.

💪 SeedLM was tested on a range of LLMs, including Llama 2 and Llama 3, and consistently outperformed existing compression techniques at 4-bit and 3-bit precision. For example, in the 4-bit configuration it retained roughly 97.9% of the full-precision baseline's average zero-shot accuracy across diverse tasks.

🎉 Accuracy evaluations on benchmark datasets show that SeedLM preserves accuracy effectively while achieving substantial compression. For example, on Llama 2 70B, its 4-bit version retained nearly 99% of the baseline performance.

The ever-increasing size of Large Language Models (LLMs) presents a significant challenge for practical deployment. Despite their transformative impact on natural language processing, these models are often hindered by high memory transfer requirements, which pose a bottleneck during autoregressive generation. This results in high energy consumption and substantial inference time, limiting their scalability and use on memory-constrained hardware. Post-training compression has emerged as a viable solution, but many current state-of-the-art methods require calibration data, making them cumbersome for data-free scenarios. The key problem, therefore, is how to effectively compress LLM weights without sacrificing accuracy or requiring calibration data.

Researchers from Apple and Meta AI introduce SeedLM, a novel approach that aims to overcome the challenges associated with the deployment of large-scale LLMs by providing a data-free compression method. SeedLM utilizes seeds of pseudo-random generators to encode and compress model weights, significantly reducing memory access while preserving computational efficiency. By leveraging Linear Feedback Shift Registers (LFSRs), SeedLM generates pseudo-random matrices during inference, trading off increased computation for fewer memory accesses. Unlike existing compression techniques, SeedLM operates without calibration data and achieves competitive results across diverse tasks, maintaining high zero-shot accuracy even at lower bit precision. The approach specifically focuses on compressing the weights of models such as Llama 3 70B into 3-4 bits with minimal accuracy degradation.

SeedLM compresses model weights using pseudo-random projection bases generated by LFSRs, which are widely used in hardware for applications such as cryptography and communication systems. Each weight block of the LLM is projected onto a random basis generated from an optimal seed, effectively minimizing compression error. The compression process searches for the optimal seed and projection coefficients that allow efficient reconstruction of the weights from only the seed and a few coefficients, instead of storing every individual weight value. Because the LFSR mechanism is simple to implement in silicon, it is energy-efficient and well suited to memory-bound tasks.
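The block-compression step can be sketched in a few lines. This is a hedged illustration, not the paper's implementation: the LFSR tap polynomial, block length, number of candidate seeds, and the least-squares fit are all assumptions made for the sketch.

```python
import numpy as np

def lfsr_stream(seed, n, taps=(16, 14, 13, 11), width=16):
    """Generate n pseudo-random values in {-1.0, +1.0} from a 16-bit
    Fibonacci LFSR. The tap positions are an illustrative choice
    (x^16 + x^14 + x^13 + x^11 + 1); the paper's exact configuration
    may differ."""
    state = seed
    out = np.empty(n)
    for i in range(n):
        bit = 0
        for t in taps:                       # XOR the tapped bits
            bit ^= (state >> (width - t)) & 1
        state = ((state << 1) | bit) & ((1 << width) - 1)
        out[i] = 1.0 if state & 1 else -1.0
    return out

def compress_block(w, num_seeds=256, rank=4):
    """For each candidate seed, build a random basis U (len(w) x rank)
    from the LFSR stream and least-squares-fit coefficients c so that
    U @ c approximates the weight block w. Return the seed and
    coefficients with the smallest reconstruction error."""
    n = len(w)
    best_err, best_seed, best_c = np.inf, None, None
    for seed in range(1, num_seeds + 1):     # seed 0 would lock the LFSR at 0
        U = lfsr_stream(seed, n * rank).reshape(n, rank)
        c, *_ = np.linalg.lstsq(U, w, rcond=None)
        err = np.linalg.norm(U @ c - w)
        if err < best_err:
            best_err, best_seed, best_c = err, seed, c
    return best_seed, best_c                 # store only these, not w itself
```

Storing just a short seed plus a handful of quantized coefficients per block, instead of every weight, is what yields the 3-4-bit effective footprint; the sketch omits coefficient quantization for brevity.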

The primary goal of SeedLM is to generate a pseudo-random matrix using an LFSR with a given seed, which is then linearly combined with compressed coefficients to approximate the weight block. This matrix is reconstructed on the fly during inference, allowing SeedLM to avoid storing the full model parameters in memory. The process involves segmenting the weight matrix into smaller blocks, which are then compressed using a random matrix derived from the LFSR, thereby reducing the memory footprint required for large models.
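The decompression side is equally small: regenerate the pseudo-random basis from the stored seed and combine it linearly with the stored coefficients. A self-contained sketch, with the caveat that the LFSR taps and block/rank sizes are illustrative assumptions rather than the paper's exact choices:

```python
import numpy as np

def reconstruct_block(seed, coeffs, block_len, taps=(16, 14, 13, 11), width=16):
    """Rebuild one weight block on the fly: regenerate a +/-1
    pseudo-random basis from a 16-bit Fibonacci LFSR (taps are an
    illustrative choice), then take its linear combination with the
    stored coefficients. The full-precision weights are never stored."""
    rank = len(coeffs)
    state = seed
    stream = np.empty(block_len * rank)
    for i in range(block_len * rank):
        bit = 0
        for t in taps:                       # XOR the tapped bits
            bit ^= (state >> (width - t)) & 1
        state = ((state << 1) | bit) & ((1 << width) - 1)
        stream[i] = 1.0 if state & 1 else -1.0
    basis = stream.reshape(block_len, rank)  # block_len x rank random basis
    return basis @ np.asarray(coeffs)        # approximate weight block
```

Because the basis is regenerated rather than loaded, extra arithmetic is traded for a large reduction in memory traffic, which is the bottleneck during autoregressive decoding.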

SeedLM was tested on various LLMs, including Llama 2 and Llama 3 models, with parameters ranging up to 70 billion. In these experiments, SeedLM consistently outperformed state-of-the-art compression techniques, particularly at 4-bit and 3-bit precision levels. For instance, using the 4-bit configuration, SeedLM achieved approximately 97.9% of the zero-shot accuracy on average across diverse tasks compared to the full-precision FP16 baseline. Notably, SeedLM is entirely data-free, which distinguishes it from other methods, such as AWQ and OmniQuant, that rely on calibration data for fine-tuning. The FPGA-based tests further demonstrated that as model size increased to 70B, SeedLM provided nearly a 4x speed-up over the FP16 baseline in terms of memory-bound task performance.

The accuracy evaluation on benchmark datasets like WikiText-2 and zero-shot tasks using the LM Evaluation Harness showed that SeedLM retained accuracy effectively while achieving significant compression. For instance, in Llama 2 70B, SeedLM’s 4-bit version retained almost 99% of the baseline performance, showcasing its capability to balance compression and accuracy without calibration dependencies. Additionally, the FPGA implementation of SeedLM highlighted its efficiency in hardware environments, achieving significant reductions in inference latency by efficiently managing memory bandwidth and utilizing LFSR blocks for rapid weight reconstruction.

SeedLM presents an effective solution for compressing LLM weights by utilizing pseudo-random generators, offering a practical approach for scaling large models on memory-limited hardware. By eliminating the need for calibration data and relying on deterministic offline algorithms, SeedLM simplifies the compression process while retaining high accuracy levels. The FPGA implementation further emphasizes its potential in real-world applications, providing up to a 4x speed-up in memory-bound tasks. SeedLM represents a promising step in making LLMs more efficient and deployable without compromising their performance, particularly on devices with limited computational resources.


Check out the Paper. All credit for this research goes to the researchers of this project.


