MarkTechPost@AI · April 22, 15:05
LLMs Can Now Retain High Accuracy at 2-Bit Precision: Researchers from UNC Chapel Hill Introduce TACQ, a Task-Aware Quantization Approach that Preserves Critical Weight Circuits for Compression Without Performance Loss

Researchers from UNC Chapel Hill have proposed a new method called TaskCircuit Quantization (TACQ) that achieves high-accuracy compression of LLMs at extremely low bit-widths (2-3 bits). By identifying and preserving critical weights, TACQ markedly improves model performance across a range of tasks, especially in the 2-bit setting where previous compression methods struggle. The technique matters for scenarios involving sensitive data, edge devices, and applications that need fast responses, offering a more efficient and economical path to practical LLM deployment.

💡 LLMs face compute and memory challenges, especially in settings that require local deployment or have constrained compute, such as handling sensitive data or running on edge devices.

🧠 TACQ is a new mixed-precision post-training quantization method whose core idea is to define task-specific weight circuits tied to downstream performance and selectively preserve those critical weights.

🔍 TACQ has two key components: Quantization-aware Localization (QAL), which traces how quantization affects model performance, and the Magnitude-sharpened Gradient (MSG), which measures the absolute importance of each weight.

🥇 In the 2-bit setting, TACQ significantly outperforms existing methods on GSM8k, MMLU, and Spider, and at 3-bit precision it still retains high accuracy.

🚀 TACQ excels at generation tasks that require producing consecutive tokens, most notably the Spider text-to-SQL task, making it well suited to agents and program-prediction applications.

LLMs show impressive capabilities across numerous applications, yet they face challenges due to computational demands and memory requirements. This challenge is acute in scenarios requiring local deployment for privacy reasons, such as processing sensitive patient records, or in compute-constrained environments like real-time customer service systems and edge devices. Post-training quantization (PTQ) is a promising solution that allows efficient compression of pre-trained models, reducing memory consumption by 2-4 times. However, current methods hit a bottleneck at 4-bit compression, with substantial performance degradation when attempting 2- or 3-bit precision. Most PTQ methods rely on small mini-batches of general-purpose pre-training data to account for activation changes resulting from quantization.
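As a rough sense of scale (back-of-the-envelope arithmetic, not figures reported in the paper), weight-only memory shrinks linearly with bit-width, which is why pushing below 4 bits is so attractive:

```python
# Weight-only memory for a hypothetical 7B-parameter model, ignoring
# activation memory and quantization metadata such as scales/zero-points.
params = 7e9
for bits in (16, 4, 3, 2):
    print(f"{bits}-bit: {params * bits / 8 / 1e9:.2f} GB")
# 16-bit: 14.00 GB, 4-bit: 3.50 GB, 3-bit: 2.62 GB, 2-bit: 1.75 GB
```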

Current methods for LLM compression primarily fall into three categories. Uniform quantization is the most basic approach: weights stored as 16-bit float tensors are compressed by treating each row independently, mapping floats to integers based on the maximum and minimum values within each channel. GPTQ-based quantization techniques advance this concept through layerwise reconstruction, aiming to minimize reconstruction loss after quantization. Finally, mixed-precision quantization methods offer a more nuanced strategy, moving beyond a fixed precision for all weights. These techniques assign bit-widths based on weight importance to maintain performance, with some approaches preserving high-sensitivity “outlier” weights at higher precision.
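To make the uniform-quantization baseline concrete, here is a minimal sketch of row-wise min-max quantization as described above; the function name, the use of PyTorch, and the 2-bit default are illustrative assumptions, not the authors' implementation:

```python
import torch

def uniform_quantize_rowwise(weight: torch.Tensor, bits: int = 2) -> torch.Tensor:
    """Map each row of a float weight matrix to integer codes using that
    row's min/max range, then dequantize back to floats."""
    levels = 2 ** bits - 1
    w_min = weight.min(dim=1, keepdim=True).values
    w_max = weight.max(dim=1, keepdim=True).values
    scale = (w_max - w_min).clamp(min=1e-8) / levels
    codes = torch.round((weight - w_min) / scale).clamp(0, levels)  # integer codes in [0, levels]
    return codes * scale + w_min                                    # dequantized approximation

# Example: quantize a random layer and inspect the mean reconstruction error.
w = torch.randn(1024, 1024)
w_hat = uniform_quantize_rowwise(w, bits=2)
print((w - w_hat).abs().mean())
```

GPTQ-style methods keep the same integer representation but minimize layerwise reconstruction error rather than rounding each weight independently.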

Researchers from UNC Chapel Hill have proposed a novel mixed-precision post-training quantization approach called TaskCircuit Quantization (TACQ). The method shows similarities to automated circuit discovery by directly conditioning the quantization process on specific weight circuits, defined as sets of weights associated with downstream task performance. TACQ compares unquantized model weights with uniformly quantized ones to estimate expected weight changes from quantization, then uses gradient information to predict impacts on task performance, enabling preservation of task-specific weights. TACQ consistently outperforms baselines with the same calibration data and lower weight budgets, and achieves significant improvements in the challenging 2-bit and 3-bit regimes.

TACQ is defined by a saliency metric that identifies critical weights to preserve during quantization, building on concepts from model interpretability like automatic circuit discovery, knowledge localization, and input attribution. This metric uses two components: Quantization-aware Localization (QAL), which traces how model performance is affected by estimating the expected change in each weight caused by quantization, and the Magnitude-Sharpened Gradient (MSG), which measures the absolute importance of each weight.

MSG helps stabilize TACQ and addresses biases from QAL’s estimations. These factors combine into a unified saliency metric that can be efficiently evaluated for every weight in a single backward pass, allowing preservation of the top p% highest-scoring weights at 16-bit precision.
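To illustrate what a combined score of this shape could look like, the sketch below assembles a QAL-style term (expected weight change under quantization times the task-loss gradient) and an MSG-style term (gradient magnitude sharpened by weight magnitude), then keeps the top p% of weights at 16-bit. The exact formulas, their combination, and the names `tacq_style_saliency` and `protect_top_p` are illustrative assumptions, not the paper's reference implementation:

```python
import torch

def tacq_style_saliency(weight: torch.Tensor,
                        grad: torch.Tensor,
                        weight_quantized: torch.Tensor) -> torch.Tensor:
    """Per-weight saliency computable from one backward pass of the task loss.

    qal: first-order estimate of how the expected quantization-induced
         weight change affects the task loss.
    msg: magnitude-sharpened gradient, the weight's standalone importance.
    """
    qal = (weight_quantized - weight) * grad
    msg = weight.abs() * grad.abs()
    return qal.abs() + msg  # one possible way to combine the two terms

def protect_top_p(saliency: torch.Tensor, p: float = 0.01) -> torch.Tensor:
    """Boolean mask over weights to keep at 16-bit; the rest get 2-3 bits."""
    k = max(1, int(p * saliency.numel()))
    threshold = saliency.flatten().topk(k).values.min()
    return saliency >= threshold
```

In practice the gradient would come from a single backward pass of the task loss on calibration data, and `weight_quantized` from a uniform quantizer like the one sketched earlier; masked weights stay at 16-bit while everything else is quantized to the target bit-width.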

In the challenging 2-bit setting, TACQ outperforms SliM-LLM with absolute margin improvements of 16.0% (from 20.1% to 36.1%) on GSM8k, 14.1% (from 34.8% to 49.2%) on MMLU, and 21.9% (from 0% to 21.9%) on Spider. Other baseline methods like GPTQ, SqueezeLLM, and SPQR deteriorate to near-random performance at this compression level. At 3-bit precision, TACQ preserves approximately 91%, 96%, and 89% of the unquantized accuracy on GSM8k, MMLU, and Spider, respectively, while outperforming the strongest baseline, SliM-LLM, by 1-2% across most datasets. TACQ’s advantages become evident in generation tasks requiring sequential token outputs, where it is the only method capable of recovering non-negligible performance in the 2-bit setting for the Spider text-to-SQL task.

In conclusion, researchers introduced TACQ, a significant advancement in task-aware post-training quantization. It improves model performance at ultra-low bit-widths (2- to 3-bits) where previous methods degrade to near-random outputs. TACQ aligns with automatic circuit discovery research by selectively preserving only a small fraction of salient weights at 16-bit precision, indicating that sparse weight “circuits” disproportionately influence specific tasks. Moreover, experiments on Spider show that TACQ better preserves model generation capabilities, making it suitable for program-prediction tasks. This also applies to situations involving agents, where models frequently generate many executable outputs, and where efficiency is a concern.


Check out the Paper and GitHub Page.


