MarkTechPost@AI, September 22, 2024
Gated Slot Attention: Advancing Linear Attention Models for Efficient and Effective Language Processing

Gated Slot Attention (GSA) is a new linear attention mechanism that enhances the ABC model with a gating mechanism to improve the efficiency and effectiveness of linear attention. GSA excels at language modeling and in-context recall-intensive tasks, outperforming other linear models across a range of benchmarks and rivaling strong Transformer models.

😁**Why Gated Slot Attention was proposed** Conventional attention mechanisms face serious challenges on long sequences. The quadratic complexity of softmax-based standard attention hinders efficient processing of the large volumes of data found in fields such as video understanding and biological sequence modeling. While this is not a major problem for language modeling during training, it becomes troublesome during inference: the KV cache grows linearly with generation length, causing heavy memory burdens and throughput bottlenecks due to high I/O costs. These limitations have pushed researchers to explore alternative attention mechanisms that improve efficiency while preserving performance, particularly for long-sequence tasks and the inference stage. Linear attention and its gated variants have emerged as promising alternatives to softmax attention, showing strong performance in language modeling and understanding tasks. These models can be reformulated as RNNs during inference, achieving constant memory complexity and significantly improving efficiency. However, they face two main challenges. First, linear recurrent models struggle with tasks that require in-context retrieval or learning, facing a fundamental recall-memory trade-off. Second, although they support hardware-efficient chunkwise training, training these models from scratch on trillions of tokens remains prohibitively expensive.

😊**Key features of Gated Slot Attention** Gated Slot Attention (GSA) is a gated version of the ABC model, following the trend of enhancing linear attention with gating mechanisms. GSA not only matches other models on language modeling and understanding tasks, but also significantly outperforms other linear models on in-context recall-intensive tasks without requiring a large state size. In the T2R finetuning setting, GSA performs exceptionally well when finetuning Mistral-7B, surpassing large recurrent language models and outperforming other linear models and T2R methods. Notably, GSA achieves training speeds similar to GLA while offering faster inference thanks to its smaller state size. GSA addresses two major limitations of the ABC model: the lack of a forgetting mechanism and an unwarranted inductive bias toward initial tokens. GSA incorporates a gating mechanism that enables forgetting of historical information and introduces a recency inductive bias that is crucial for natural language processing.

🤩**Structure and mechanism of Gated Slot Attention** At the core of GSA is a gated RNN update rule for each memory slot, driven by a scalar data-dependent gating value. This can be written in matrix form, reminiscent of HGRN2, and GSA can be implemented as a two-pass Gated Linear Attention (GLA), enabling hardware-efficient chunkwise training. The GSA architecture consists of L blocks, each containing a GSA token-mixing layer and a Gated Linear Unit (GLU) channel-mixing layer. It uses multi-head attention to capture different aspects of the input. For each head, the input undergoes linear transformations with a Swish activation, and a forget gate is obtained via a linear transformation followed by a sigmoid activation with a damping factor. The outputs are then processed by the GSA layer and combined to produce the final output. The model balances efficiency and effectiveness by carefully controlling parameter counts, typically setting the number of memory slots to 64 and using 4 attention heads.

🥰**Experimental results** GSA delivers competitive performance on both language modeling and in-context recall-intensive tasks. In language modeling experiments on the SlimPajama corpus, GSA performs comparably to strong models such as HGRN2 with equally sized hidden states, while outperforming GLA and RetNet even with a smaller state size. For 1.3B and 2.7B parameter models, GSA achieves comparable or better results across a range of commonsense reasoning tasks, including ARC, HellaSwag, LAMBADA, PIQA, and WinoGrande. On recall-intensive tasks, GSA shows marked improvements over other subquadratic models. On the synthetic Multi-Query Associative Recall (MQAR) task, GSA outperforms Mamba, GLA, RetNet, and HGRN2 across different model dimensions. On real-world recall-intensive tasks such as FDA, SWDE, SQuAD, NQ, TriviaQA, and DROP, GSA consistently outperforms other subquadratic models, achieving the average performance closest to the Transformer (Xfmr++) baseline.

🥳**Outlook** This work proposes GSA, which enhances the ABC model with a gating mechanism inspired by Gated Linear Attention. By framing GSA as a two-pass GLA, it leverages hardware-efficient implementations for efficient training. GSA's context-aware memory reading and forgetting mechanism implicitly increases model capacity while keeping the state size small, improving both training and inference efficiency. Extensive experiments demonstrate GSA's advantages on in-context recall-intensive tasks and in the “finetuning pre-trained Transformers to RNNs” setting. This innovation holds considerable promise for further advancing natural language processing.

Transformer models have revolutionized sequence modeling tasks, but their standard attention mechanism faces significant challenges when dealing with long sequences. The quadratic complexity of softmax-based standard attention hinders the efficient processing of extensive data in fields like video understanding and biological sequence modeling. While this isn’t a major concern for language modeling during training, it becomes problematic during inference. The Key-Value (KV) cache grows linearly with generation length, causing substantial memory burdens and throughput bottlenecks due to high I/O costs. These limitations have spurred researchers to explore alternative attention mechanisms that can maintain performance while improving efficiency, particularly for long-sequence tasks and during inference.
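
To make the linear growth concrete, here is a back-of-the-envelope calculation for a hypothetical 7B-scale decoder (32 layers, 32 KV heads, head dimension 128, fp16); the configuration is an illustrative assumption, not a figure from the paper:

```python
# Rough KV-cache size for a hypothetical 7B-scale Transformer decoder.
# The configuration (32 layers, 32 KV heads, head dim 128, fp16) is illustrative only.
n_layers, n_kv_heads, head_dim, bytes_per_elem = 32, 32, 128, 2

bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem  # keys + values
for seq_len in (2_048, 8_192, 32_768):
    print(f"{seq_len:>6} tokens -> {bytes_per_token * seq_len / 2**30:5.1f} GiB of KV cache")
# 2048 tokens -> 1.0 GiB; 32768 tokens -> 16.0 GiB. A recurrent (linear-attention)
# formulation instead carries a fixed-size state regardless of generation length.
```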

Linear attention and its gated variants have emerged as promising alternatives to softmax attention, demonstrating strong performance in language modeling and understanding tasks. These models can be reframed as RNNs during inference, achieving constant memory complexity and significantly enhancing efficiency. However, they face two key challenges. First, linear recurrent models struggle with tasks requiring in-context retrieval or learning, facing a fundamental recall-memory trade-off. Second, training these models from scratch on trillions of tokens remains prohibitively expensive, despite supporting hardware-efficient chunkwise training.
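
The RNN reformulation is a standard identity for (ungated) linear attention rather than something specific to GSA. A minimal sketch, assuming a simple positive feature map and omitting normalization, shows that the recurrent form reproduces the quadratic-time computation while carrying only a fixed-size state:

```python
import torch

# Minimal sketch (not the paper's code): causal linear attention computed two ways.
# The parallel form costs O(T^2); the equivalent RNN form keeps a state S of fixed
# size (d_k x d_v), independent of sequence length.
torch.manual_seed(0)
T, d_k, d_v = 6, 4, 4
phi = lambda x: torch.nn.functional.elu(x) + 1          # a common positive feature map
q, k, v = (torch.randn(T, d) for d in (d_k, d_k, d_v))
q, k = phi(q), phi(k)

# Parallel form: o_t = sum_{s<=t} (q_t . k_s) v_s   (normalization omitted for clarity)
mask = torch.tril(torch.ones(T, T))
o_parallel = (q @ k.T * mask) @ v

# Recurrent form: S_t = S_{t-1} + k_t v_t^T ;  o_t = S_t^T q_t
S = torch.zeros(d_k, d_v)
o_recurrent = []
for t in range(T):
    S = S + torch.outer(k[t], v[t])
    o_recurrent.append(S.T @ q[t])
o_recurrent = torch.stack(o_recurrent)

print(torch.allclose(o_parallel, o_recurrent, atol=1e-5))  # True
```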

In this study, researchers from the School of Computer Science and Technology, Soochow University, Massachusetts Institute of Technology, University of California, Tencent AI Lab, LuxiTech, and University of Waterloo revisit the Attention with Bounded-memory Control (ABC) model, which retains the softmax operation, reducing discrepancies between standard and linear attention in training-finetuning scenarios. ABC enables more effective state utilization, requiring smaller state sizes for comparable performance. However, its potential has been overlooked due to mediocre language modeling performance and slow training speed. To address these limitations, the researchers reformulate ABC as two-pass linear attention linked via softmax, utilizing a hardware-efficient chunkwise implementation for faster training.
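
For reference, the ABC recurrence can be sketched as follows; the notation is mine, condensed from the usual presentation of ABC, where m is the fixed number of memory slots and \phi_t is a data-dependent slot-writing vector:

```latex
% Illustrative sketch of the ABC recurrence (notation mine); m = number of memory slots,
% \phi_t \in \mathbb{R}^m is the data-dependent slot-writing vector.
\begin{aligned}
\tilde{K}_t &= \tilde{K}_{t-1} + \phi_t k_t^{\top} \in \mathbb{R}^{m \times d_k}, \\
\tilde{V}_t &= \tilde{V}_{t-1} + \phi_t v_t^{\top} \in \mathbb{R}^{m \times d_v}, \\
o_t &= \tilde{V}_t^{\top}\, \mathrm{softmax}\!\left(\tilde{K}_t\, q_t\right).
\end{aligned}
```

The softmax read over a fixed set of slots is what keeps ABC close to standard attention, while the purely additive slot updates, with no decay, are the missing forgetting mechanism that motivates the gating introduced next.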

Building on this foundation, they introduce Gated Slot Attention (GSA), a gated version of ABC that follows the trend of enhancing linear attention with gating mechanisms. GSA not only matches performance in language modeling and understanding tasks but also significantly outperforms other linear models in in-context recall-intensive tasks without requiring large state sizes. In the T2R finetuning setting, GSA demonstrates superior performance when finetuning Mistral-7B, surpassing large recurrent language models and outperforming other linear models and T2R methods. Notably, GSA achieves similar training speeds to GLA while offering improved inference speed due to its smaller state size.

GSA addresses two key limitations of the ABC model: the lack of a forgetting mechanism and an unwarranted inductive bias favouring initial tokens. GSA incorporates a gating mechanism that enables forgetting of historical information and introduces a recency inductive bias, crucial for natural language processing.

The core of GSA is a gated RNN update rule for each memory slot, using a scalar data-dependent gating value. This can be represented in matrix form, reminiscent of HGRN2. GSA can be implemented as a two-pass Gated Linear Attention (GLA), allowing for hardware-efficient chunkwise training.
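
A minimal sketch of such a gated slot recurrence, in my own notation and with randomly generated inputs and gates standing in for the learned projections described below, illustrates the per-slot gated update and the softmax read over slots:

```python
import torch

# Sketch of a gated slot-memory recurrence in the spirit of GSA (my notation, not the
# authors' code): m slots, each updated with a scalar data-dependent forget gate alpha,
# then read with a softmax over slots.
torch.manual_seed(0)
T, m, d_k, d_v = 8, 4, 16, 16                      # sequence length, slots, dims
q = torch.randn(T, d_k)
k = torch.randn(T, d_k)
v = torch.randn(T, d_v)
alpha = torch.sigmoid(torch.randn(T, m))           # per-slot forget gates in (0, 1)

K_slots = torch.zeros(m, d_k)                      # bounded-size slot memories
V_slots = torch.zeros(m, d_v)
outputs = []
for t in range(T):
    a = alpha[t].unsqueeze(-1)                     # (m, 1)
    K_slots = a * K_slots + (1 - a) * k[t]         # gated write into key slots
    V_slots = a * V_slots + (1 - a) * v[t]         # gated write into value slots
    read = torch.softmax(K_slots @ q[t], dim=0)    # softmax attention over the m slots
    outputs.append(V_slots.T @ read)               # o_t = V_slots^T softmax(K_slots q_t)
out = torch.stack(outputs)                         # (T, d_v); state size is O(m*d), not O(T)
```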

The GSA architecture consists of L blocks, each comprising a GSA token mixing layer and a Gated Linear Unit (GLU) channel mixing layer. It employs multi-head attention to capture different input aspects. For each head, the input undergoes linear transformations with Swish activation. A forget gate is obtained using a linear transformation followed by a sigmoid activation with a damping factor. The outputs are then processed through the GSA layer and combined to produce the final output. The model balances efficiency and effectiveness by carefully controlling parameter counts, typically setting the number of memory slots to 64 and using 4 attention heads.
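
Putting the pieces together, the following PyTorch sketch shows one way such a GSA-style token-mixing layer could look. The class and parameter names (GSATokenMixer, gate_proj, the damping value of 16, and the exact placement of the Swish activation) are my own illustrative assumptions rather than the authors' implementation, and the per-step loop would be replaced by chunkwise kernels in practice:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GSATokenMixer(nn.Module):
    """Illustrative GSA-style token-mixing layer (hypothetical code following the
    description above): multi-head projections with Swish, a sigmoid forget gate
    with a damping factor, and a per-head gated slot memory read via softmax."""
    def __init__(self, d_model=512, n_heads=4, n_slots=64, damping=16.0):
        super().__init__()
        self.h, self.m, self.dh = n_heads, n_slots, d_model // n_heads
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.gate_proj = nn.Linear(d_model, n_heads * n_slots)   # per-slot forget gates
        self.o_proj = nn.Linear(d_model, d_model)
        self.damping = damping

    def forward(self, x):                                   # x: (T, d_model), no batch dim
        T = x.shape[0]
        q = F.silu(self.q_proj(x)).view(T, self.h, self.dh)
        k = F.silu(self.k_proj(x)).view(T, self.h, self.dh)
        v = self.v_proj(x).view(T, self.h, self.dh)
        # Sigmoid gate raised to 1/damping keeps forget gates close to 1 (assumed form).
        alpha = torch.sigmoid(self.gate_proj(x)).view(T, self.h, self.m) ** (1.0 / self.damping)

        K = x.new_zeros(self.h, self.m, self.dh)            # slot memories per head
        V = x.new_zeros(self.h, self.m, self.dh)
        outs = []
        for t in range(T):                                  # recurrent form for clarity
            a = alpha[t].unsqueeze(-1)                      # (h, m, 1)
            K = a * K + (1 - a) * k[t].unsqueeze(1)         # gated write, (h, m, dh)
            V = a * V + (1 - a) * v[t].unsqueeze(1)
            w = torch.softmax(torch.einsum('hmd,hd->hm', K, q[t]), dim=-1)
            outs.append(torch.einsum('hm,hmd->hd', w, V).reshape(-1))
        return self.o_proj(torch.stack(outs))               # (T, d_model)
```

A full model would stack L such blocks, each followed by a GLU channel-mixing layer, with the usual residual connections and normalization.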

GSA demonstrates competitive performance in both language modeling and in-context recall-intensive tasks. In language modeling experiments on the SlimPajama corpus, GSA performs comparably to strong models like HGRN2 with equally sized hidden states, while outperforming GLA and RetNet even with smaller state sizes. For 1.3B and 2.7B parameter models, GSA achieves comparable or better results across various commonsense reasoning tasks, including ARC, HellaSwag, LAMBADA, PIQA, and WinoGrande.

In recall-intensive tasks, GSA shows significant improvements over other subquadratic models. On the synthetic Multi-Query Associative Recall (MQAR) task, GSA outperforms Mamba, GLA, RetNet, and HGRN2 across different model dimensions. For real-world recall-intensive tasks like FDA, SWDE, SQuAD, NQ, TriviaQA, and DROP, GSA consistently outperforms other subquadratic models, achieving an average performance closest to the Transformer (Xfmr++) baseline.

This study presents GSA that enhances the ABC model with a gating mechanism inspired by Gated Linear Attention. By framing GSA as a two-pass GLA, it utilizes hardware-efficient implementations for efficient training. GSA’s context-aware memory reading and forgetting mechanisms implicitly increase model capacity while maintaining a small state size, improving both training and inference efficiency. Extensive experiments demonstrate GSA’s advantages in in-context recall-intensive tasks and “finetuning pre-trained Transformers to RNNs” scenarios. This innovation bridges the gap between linear attention models and traditional Transformers, offering a promising direction for efficient, high-performance language modeling and understanding tasks.


Check out the Paper and GitHub. All credit for this research goes to the researchers of this project.

