MarkTechPost@AI October 2, 2024
MaskLLM: A Learnable AI Method that Facilitates End-to-End Training of LLM Sparsity on Large-Scale Datasets

MaskLLM is a learnable pruning method for LLMs that reduces inference-time compute through N:M sparsity. It models sparsity with Gumbel Softmax sampling, supports end-to-end training on large-scale datasets, and outperforms other state-of-the-art approaches across multiple LLMs.

🎯MaskLLM is a learnable pruning method that applies N:M sparsity to LLMs to cut inference-time computational cost. It models sparsity as a learnable distribution via Gumbel Softmax sampling, enabling efficient end-to-end training on large datasets.

💪The method was evaluated on multiple LLMs, including LLaMA-2, Nemotron-4, and GPT-3, ranging from 843M to 15B parameters. By learning 2:4 sparsity masks end-to-end, MaskLLM outperforms baselines such as SparseGPT and Wanda in both accuracy and perplexity.

🌟MaskLLM's masks are learned to high quality and transfer across domains. Masks learned from large-scale data can be carried over to downstream tasks, sparse-weight regularization preserves quality after pruning, and prior masks help bootstrap the learning process, making model compression both efficient and effective.

LLMs, characterized by their massive parameter counts, are often inefficient to deploy because of their high memory and computational demands. One practical remedy is semi-structured pruning, in particular the N:M sparsity pattern, which keeps only N non-zero values in every group of M parameters. While this pattern is hardware-friendly on GPUs, applying it to LLMs is difficult because of their vast parameter space. Methods such as SparseGPT and Wanda use small calibration sets and handcrafted importance criteria to select redundant parameters, but the limited calibration data hinders generalization and yields masks that misrepresent model quality across diverse domains.
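For intuition, here is a minimal sketch of a one-shot 2:4 (N:M) mask chosen purely by weight magnitude; it is a simplified stand-in for the calibration-based importance criteria used by SparseGPT and Wanda, and the function name and shapes are illustrative:

```python
# Minimal 2:4 (N:M) semi-structured pruning by weight magnitude.
# A simplified stand-in for criterion-based methods such as SparseGPT or Wanda,
# which use calibration data and more elaborate importance scores.
import torch

def magnitude_mask_2_4(weight: torch.Tensor) -> torch.Tensor:
    """Keep the 2 largest-magnitude entries in every group of 4 columns."""
    out_features, in_features = weight.shape
    assert in_features % 4 == 0, "2:4 sparsity groups columns in blocks of 4"
    groups = weight.abs().reshape(out_features, in_features // 4, 4)
    topk = groups.topk(k=2, dim=-1).indices          # top-2 positions per group
    mask = torch.zeros_like(groups)
    mask.scatter_(-1, topk, 1.0)
    return mask.reshape(out_features, in_features)

W = torch.randn(8, 16)
W_pruned = W * magnitude_mask_2_4(W)  # exactly 2 non-zeros per group of 4
```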

Researchers from NVIDIA and the National University of Singapore introduced MaskLLM, a learnable pruning method that applies N:M sparsity to LLMs, reducing computational overhead during inference. Unlike traditional methods, MaskLLM uses Gumbel Softmax sampling to model sparsity as a learnable distribution, enabling efficient end-to-end training on large datasets. This approach improves mask accuracy and transferability, allowing the learned sparsity patterns to be applied across different tasks and domains. Experiments on models such as LLaMA-2 and GPT-3 show significant gains, with MaskLLM reaching a perplexity of 6.72 compared with 10.42 for SparseGPT.

Pruning methods compress LLMs by removing redundant parameters and fall into three categories: structured, unstructured, and semi-structured. Structured pruning eliminates whole substructures such as attention heads, while unstructured pruning zeros out individual parameters, offering more flexibility but less acceleration on hardware. Semi-structured pruning, such as N:M sparsity, strikes a balance by combining structured patterns with fine-grained sparsity to gain both efficiency and flexibility. Learnable sparsity methods have recently gained attention, particularly in vision models, and this work pioneers the application of learnable N:M masks to frozen LLMs, addressing the challenge posed by their large-scale parameters.
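The hardware-friendliness of 2:4 sparsity comes from its fixed structure: each group of four weights can be stored as its two surviving values plus tiny position metadata, which is roughly how sparse tensor cores consume it. A hedged toy illustration (real kernels pack the metadata differently):

```python
# Toy illustration of compressed 2:4 storage: two values plus their positions
# (each position fits in 2 bits) per group of four weights. Simplified; not the
# actual on-device layout used by sparse tensor cores.
import torch

def compress_2_4(weight: torch.Tensor, mask: torch.Tensor):
    rows, cols = weight.shape
    w = (weight * mask).reshape(rows, cols // 4, 4)
    m = mask.reshape(rows, cols // 4, 4).bool()
    values = w[m].reshape(rows, cols // 4, 2)                   # surviving values
    positions = m.nonzero()[:, -1].reshape(rows, cols // 4, 2)  # positions 0..3
    return values, positions

def decompress_2_4(values, positions, cols):
    rows = values.shape[0]
    dense = torch.zeros(rows, cols // 4, 4)
    dense.scatter_(-1, positions, values)
    return dense.reshape(rows, cols)
```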

The MaskLLM framework introduces N:M sparsity to optimize LLMs by selecting binary masks for blocks of parameters, pruning efficiently without significantly degrading model performance. Focusing on 2:4 sparsity, it selects masks in which two out of every four values remain non-zero. The non-differentiability of mask selection is tackled through Gumbel Softmax, which enables differentiable sampling and therefore mask optimization via gradient descent. MaskLLM learns masks from large-scale data and transfers them to downstream tasks; sparse weight regularization maintains post-pruning quality, and prior masks improve the learning process, ensuring efficient and effective model compression.
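A hedged sketch of this mechanism: each group of four weights picks among the C(4,2) = 6 valid 2:4 patterns, and a Gumbel Softmax over per-group logits makes that choice differentiable, so the logits can be trained with the ordinary language-modeling loss. Class names, the temperature, and other hyperparameters below are illustrative, not the paper's exact settings:

```python
# Hedged sketch of a learnable 2:4 mask: a categorical choice over the 6 valid
# patterns per group of 4 weights, relaxed with Gumbel Softmax so gradients
# from the task loss can update the per-group logits.
import itertools
import torch
import torch.nn.functional as F

# The 6 candidate binary masks with exactly 2 non-zeros out of 4.
CANDIDATES = torch.tensor(
    [[1.0 if i in idx else 0.0 for i in range(4)]
     for idx in itertools.combinations(range(4), 2)]
)  # shape (6, 4)

class LearnableMask24(torch.nn.Module):
    def __init__(self, out_features: int, in_features: int):
        super().__init__()
        n_groups = out_features * in_features // 4
        # One logit vector per group of 4 weights; these are the only trainable params.
        self.logits = torch.nn.Parameter(torch.zeros(n_groups, 6))
        self.shape = (out_features, in_features)

    def forward(self, tau: float = 4.0, hard: bool = False) -> torch.Tensor:
        # Differentiable sample over the 6 candidate patterns for every group.
        probs = F.gumbel_softmax(self.logits, tau=tau, hard=hard)  # (n_groups, 6)
        mask = probs @ CANDIDATES                                  # (n_groups, 4)
        return mask.reshape(self.shape)

# Usage: mask a frozen linear layer; the loss backpropagates into the logits only.
layer = torch.nn.Linear(16, 8, bias=False)
layer.weight.requires_grad_(False)               # pretrained weights stay frozen
mask_module = LearnableMask24(8, 16)
x = torch.randn(2, 16)
out = F.linear(x, layer.weight * mask_module())  # sparse forward pass
out.sum().backward()                             # gradients reach mask_module.logits
```

With hard=True the sampled mask is exactly 2:4 in the forward pass while gradients still flow through the soft relaxation (straight-through); annealing the temperature gradually sharpens the distribution toward a discrete mask.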

The researchers evaluated MaskLLM on multiple LLMs (LLaMA-2, Nemotron-4, and multilingual GPT-3) ranging from 843M to 15B parameters. MaskLLM learns 2:4 sparsity masks through end-to-end training and outperforms baselines such as SparseGPT and Wanda in accuracy and perplexity. Mask quality improves with larger training sets, and the method remains robust in low-resource settings. Transfer learning from pre-computed masks accelerates training, while keeping the large general-purpose remaining weights intact benefits downstream task performance. MaskLLM's stochastic exploration ensures high-quality mask discovery, with results surpassing SparseGPT in perplexity after training on just 1,280 samples.
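One plausible way to realize the mask-transfer idea, reusing magnitude_mask_2_4, CANDIDATES, and LearnableMask24 from the sketches above: seed the learnable logits with a prior mask (e.g., magnitude-derived, or one learned on another domain) so training starts near a known-good 2:4 solution. The scoring and scaling below are assumptions, not the paper's exact procedure:

```python
# Illustrative "mask transfer": initialize the learnable logits from a prior
# 2:4 mask so the candidate matching the prior starts with the largest logit.
import torch

def logits_from_prior_mask(prior_mask: torch.Tensor, scale: float = 3.0) -> torch.Tensor:
    """prior_mask: (out, in) binary 2:4 mask -> (n_groups, 6) initial logits."""
    groups = prior_mask.reshape(-1, 4)        # (n_groups, 4)
    overlap = groups @ CANDIDATES.t()         # overlap with each candidate pattern
    return scale * overlap                    # scale is an assumed constant

mask_module = LearnableMask24(8, 16)
prior = magnitude_mask_2_4(torch.randn(8, 16))   # e.g., a magnitude-derived prior
with torch.no_grad():
    mask_module.logits.copy_(logits_from_prior_mask(prior))
```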

MaskLLM introduces a learnable pruning method for applying N:M sparsity in LLMs to reduce computational cost during inference. Instead of relying on a predefined importance criterion, it models N:M sparsity patterns through Gumbel Softmax sampling, enabling end-to-end training on large datasets. MaskLLM offers high-quality mask learning and transferability across domains. Tested on LLaMA-2, Nemotron-4, and GPT-3, with sizes ranging from 843M to 15B parameters, MaskLLM outperformed state-of-the-art methods in perplexity and efficiency, and its masks can be customized for lossless downstream task performance.


Check out the Paper and GitHub. All credit for this research goes to the researchers of this project.

