MarkTechPost@AI | October 15, 2024
Stanford Researchers Propose LoLCATS: A Cutting Edge AI Method for Efficient LLM Linearization

LoLCATS is an AI method designed to efficiently linearize large language models (LLMs), reducing computational cost and memory requirements while preserving model quality. It works in two steps: first, attention transfer trains linear attention mechanisms to match the original model's softmax attention; then, low-rank adaptation corrects the residual errors introduced by linearization. The strength of LoLCATS is that it substantially improves the quality of linearized models without large-scale retraining, and it scales to very large models such as Llama 3 70B and 405B.

🎯 LoLCATS is a two-step method that efficiently improves the quality of linearized large language models (LLMs) without large-scale retraining.

Step 1, attention transfer: train a linear attention mechanism to match the original model's softmax attention, parameterizing the linear attention with learnable feature maps and minimizing the output discrepancy between the linear and softmax mechanisms.

Step 2, low-rank linearizing: use LoRA to make small, low-rank adjustments to the linearized layers, compensating for the quality gap that may remain after the initial linearization.

🚀 LoLCATS scales to very large models such as Llama 3 70B and 405B while substantially reducing computational cost and time. Compared with prior linearization methods it delivers marked gains, closing the performance gap between linearized and original Transformer models by 78% on a standard benchmark (5-shot MMLU).

💡 LoLCATS achieves these gains using only 0.2% of the model parameters and 0.4% of the training tokens, and its efficiency and scalability make it a promising approach to linearizing LLMs.

Efficiently linearizing large language models (LLMs) is a multifaceted problem. The quadratic attention mechanism in traditional Transformer-based LLMs, while powerful, is computationally expensive and memory-intensive. Existing methods that try to linearize these models by replacing quadratic attention with subquadratic analogs face significant challenges: they often degrade performance, incur high computational cost, and lack scalability. The central challenge is how to maintain high model quality while making the linearization process more efficient and scalable for very large models, including those beyond 70 billion parameters.
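As a rough illustration of why this matters, the sketch below (a minimal, non-causal toy in PyTorch, not the paper's implementation) contrasts standard softmax attention, which materializes an n x n score matrix, with a kernelized linear-attention form that only builds d x d summaries and therefore scales linearly in sequence length. The ELU-based feature map here is a common illustrative choice, not the learned feature map used in LoLCATS.

```python
import torch
import torch.nn.functional as F

def softmax_attention(q, k, v):
    # Standard attention: the n x n score matrix makes compute and memory quadratic in sequence length.
    scores = (q @ k.transpose(-2, -1)) / q.shape[-1] ** 0.5
    return torch.softmax(scores, dim=-1) @ v

def linear_attention(q, k, v, eps=1e-6):
    # Kernelized attention: phi(q) @ (phi(k)^T v) never forms the n x n matrix,
    # so cost grows linearly with sequence length. ELU + 1 is an illustrative positive feature map.
    phi_q, phi_k = F.elu(q) + 1, F.elu(k) + 1
    kv = phi_k.transpose(-2, -1) @ v                                # d x d summary of keys and values
    z = phi_q @ phi_k.sum(dim=-2, keepdim=True).transpose(-2, -1)   # per-position normalizer
    return (phi_q @ kv) / (z + eps)

n, d = 1024, 64
q, k, v = (torch.randn(1, n, d) for _ in range(3))
print(softmax_attention(q, k, v).shape)   # torch.Size([1, 1024, 64])
print(linear_attention(q, k, v).shape)    # torch.Size([1, 1024, 64])
```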

Researchers from Stanford University, Together AI, California Institute of Technology, and MIT introduced LoLCATS (Low-rank Linear Conversion via Attention Transfer). LoLCATS is a two-step method designed to efficiently improve the quality of linearized large language models without the need for expensive retraining on billions of tokens. The core idea behind LoLCATS is to first train linear attention mechanisms to match the softmax attentions of the original model using a mean squared error (MSE) loss in a process called “attention transfer.” Then, low-rank adaptation (LoRA) is employed to correct any residual errors in approximation, allowing the model to achieve high-quality predictions with significantly reduced computational costs. This method makes it feasible to create linearized versions of very large models, like Llama 3 8B and Mistral 7B, with minimal overhead.
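To make the attention-transfer step concrete, here is a minimal sketch under simplifying assumptions (a single non-causal head, a shared toy feature map, and made-up names such as LearnableFeatureMap; the paper's exact parameterization and training loop differ). The idea it illustrates is the one described above: a small learnable feature map is trained so that linear attention computed from the frozen model's queries, keys, and values reproduces the softmax attention output under an MSE loss.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LearnableFeatureMap(nn.Module):
    """Illustrative learnable feature map phi(x); the paper's exact form may differ."""
    def __init__(self, head_dim, feature_dim=64):
        super().__init__()
        self.proj = nn.Linear(head_dim, feature_dim)

    def forward(self, x):
        return F.softmax(self.proj(x), dim=-1)  # positive features suitable for linear attention

def linear_attention(phi_q, phi_k, v, eps=1e-6):
    kv = phi_k.transpose(-2, -1) @ v
    z = phi_q @ phi_k.sum(dim=-2, keepdim=True).transpose(-2, -1)
    return (phi_q @ kv) / (z + eps)

def softmax_attention(q, k, v):
    scores = (q @ k.transpose(-2, -1)) / q.shape[-1] ** 0.5
    return torch.softmax(scores, dim=-1) @ v

# Attention transfer: only the feature map is trained; q, k, v come from the frozen base model.
head_dim = 64
phi = LearnableFeatureMap(head_dim)
opt = torch.optim.AdamW(phi.parameters(), lr=1e-3)

q, k, v = (torch.randn(2, 128, head_dim) for _ in range(3))

with torch.no_grad():
    teacher_out = softmax_attention(q, k, v)       # target: the original softmax attention output

student_out = linear_attention(phi(q), phi(k), v)  # linear attention with the learnable feature map
loss = F.mse_loss(student_out, teacher_out)        # MSE "attention transfer" objective
loss.backward()
opt.step()
print(float(loss))
```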

The structure of LoLCATS involves two main stages. The first stage, attention transfer, focuses on training the linear attention to closely approximate the output of softmax attention. The researchers achieved this by parameterizing the linear attention using learnable feature maps, which are optimized to minimize the output discrepancy between the linear and softmax mechanisms. The second stage, low-rank linearizing, further improves model performance by leveraging LoRA to make small, low-rank adjustments to the linearized layers. This step compensates for the quality gaps that might emerge after the initial linearization. The LoLCATS framework also employs a block-by-block training approach, particularly for larger models, to make the process scalable and more memory-efficient.
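The second stage can be sketched with a hand-rolled LoRA wrapper, shown for illustration only (the LoRALinear class below is hypothetical, not the authors' code): the base projection weights stay frozen and only the low-rank A and B matrices are trained, which is how a very small number of trainable parameters can absorb the residual approximation error left by attention transfer.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base linear layer plus a trainable low-rank update: W x + (alpha / r) * B A x."""
    def __init__(self, base: nn.Linear, r=8, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                 # base weights stay frozen
        self.lora_A = nn.Linear(base.in_features, r, bias=False)
        self.lora_B = nn.Linear(r, base.out_features, bias=False)
        nn.init.zeros_(self.lora_B.weight)          # adapter starts as an exact no-op
        self.scaling = alpha / r

    def forward(self, x):
        return self.base(x) + self.lora_B(self.lora_A(x)) * self.scaling

# Wrap a (hypothetical) attention projection of a linearized block with LoRA,
# then fine-tune only the low-rank parameters to close the remaining quality gap.
d_model = 512
q_proj = LoRALinear(nn.Linear(d_model, d_model))
trainable = sum(p.numel() for p in q_proj.parameters() if p.requires_grad)
total = sum(p.numel() for p in q_proj.parameters())
print(f"trainable params: {trainable}/{total} ({100 * trainable / total:.2f}%)")
```

Initializing B to zero makes the adapter a no-op at the start of this stage, so fine-tuning begins from the attention-transferred model rather than perturbing it.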

The results presented in the research demonstrate significant improvements over prior linearization methods. For example, LoLCATS successfully closed the performance gap between linearized and original Transformer models by up to 78% on a standard benchmark (5-shot MMLU). The researchers also highlight that LoLCATS achieved these improvements while only using 0.2% of the model parameters and 0.4% of the training tokens required by previous methods. Additionally, LoLCATS is the first method that was successfully used to linearize extremely large models, such as Llama 3 70B and 405B, enabling a considerable reduction in computational cost and time compared to earlier approaches.

Conclusion

LoLCATS presents a compelling solution to the problem of linearizing large language models by significantly reducing the memory and compute requirements without compromising on quality. By introducing the two-step process of attention transfer followed by low-rank adaptation, this research enables the efficient conversion of large Transformer models into linearized versions that retain their powerful capabilities. This breakthrough could lead to more accessible and cost-effective deployment of LLMs, making them feasible for a broader range of applications. The implementation details, along with the code, are available on GitHub, allowing others to build upon and apply this method to other large-scale models.


Check out the Paper. All credit for this research goes to the researchers of this project.


Related tags

LoLCATS · Large Language Models · Linearization · Artificial Intelligence · Deep Learning