MarkTechPost@AI · May 6, 02:10
RWKV-X Combines Sparse Attention and Recurrent Memory to Enable Efficient 1M-Token Decoding with Linear Complexity

RWKV-X is a new hybrid architecture that combines the RWKV model's efficiency on short sequences with a sparse attention mechanism's ability to capture long-range context. The model is built with an interleaved block expansion approach and a zero-initialization mechanism, and is trained in two stages on short-context and long-context data. Experiments show that RWKV-X outperforms the earlier RWKV-7 model on long-context benchmarks while maintaining strong performance on short-context tasks. It also exhibits excellent scaling behavior on long sequences, running 1.37x faster than Flash-Attention v3 at 128K tokens.

💡 RWKV-X is a hybrid architecture that integrates RWKV-7 blocks with sparse attention blocks, built on top of existing models via an interleaved block expansion approach and a zero-initialization mechanism.

📚 RWKV-X is trained in two stages: first on short 1024-token contexts from the MiniPile dataset, with all pretrained parameters frozen and only the newly added blocks trained; then with long-context continual pretraining on the ProLong-64K dataset at a context length of 64K tokens, during which all parameters are unfrozen and optimized jointly.

⏱️ RWKV-X remains competitive in short-context evaluation: the smaller RWKV-X (0.22B) achieves an average score comparable to RWKV-7, while the larger RWKV-X (3.6B) performs close to RWKV-7 (2.9B) and Qwen2.5-3B and surpasses LLaMA3.2-3B.

🚀 Efficiency analysis shows that RWKV-X scales better than attention on long sequences, running 1.37x faster than Flash-Attention v3 at 128K tokens, with the advantage widening as context length grows.

LLMs built on Transformer architectures face significant scaling challenges when processing long-context inputs because of their quadratic complexity in sequence length. Linear attention models, state space models such as Mamba, and linear RNNs such as DeltaNet and RWKV address this by reducing the complexity to linear. However, these linear architectures struggle with long-context understanding. For instance, RWKV-7 (2.9B) achieves high accuracy on passkey retrieval up to 28K tokens but degrades rapidly beyond that point. Even with continual pretraining on 128K-length data, the long-context limitations persist. The issue extends beyond RWKV to other architectures such as Mamba, representing a fundamental challenge for this class of models.

Linear complexity language models have emerged as alternatives to transformer-based architectures, which suffer from quadratic computational demands when processing long sequences. The RWKV model series combines transformer-style parallelizability during training with an RNN-like recurrent state representation, and has evolved through multiple iterations, from the foundational RWKV-4 through RWKV-5 and RWKV-6 to RWKV-7. Hybrid language models, including Jamba, Zamba, and MiniMax, each combine attention with recurrent or state-space components in their own way. Further, Native Sparse Attention organizes tokens into temporal blocks with three distinct attention paths: compressed coarse-grained tokens, selectively retained fine-grained tokens, and sliding windows for local contextual information. Other sparse attention approaches include SeerAttention and Mixture of Block Attention (MoBA).
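As a rough illustration of the three-path idea described above (this is not Native Sparse Attention's actual kernel; the gating layer and the stubbed branch callables are assumptions made for the sketch), a toy PyTorch skeleton might mix the outputs of a compressed, a selected, and a sliding-window attention branch with learned per-token gates:

```python
import torch
import torch.nn as nn

class ThreePathSparseAttention(nn.Module):
    """Toy skeleton of an NSA-style layer: compressed, selected, and sliding-window
    branches are computed separately and mixed with learned per-path gates."""
    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Linear(dim, 3)  # one gate per path, conditioned on each token

    def forward(self, x, compressed_path, selected_path, window_path):
        # Each *_path is a callable returning a (B, T, dim) tensor for its attention variant.
        outs = torch.stack([compressed_path(x), selected_path(x), window_path(x)], dim=-1)
        g = torch.softmax(self.gate(x), dim=-1).unsqueeze(-2)   # (B, T, 1, 3)
        return (outs * g).sum(dim=-1)                           # weighted mix of the 3 paths

# Usage with stub branches (identity maps stand in for the real attention variants):
layer = ThreePathSparseAttention(dim=64)
x = torch.randn(2, 16, 64)
y = layer(x, lambda t: t, lambda t: t, lambda t: t)
print(y.shape)  # torch.Size([2, 16, 64])
```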

Researchers from the Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ), Shenzhen; Hohai University, Nanjing; Shenzhen University; and Qinghai University, Xining, have proposed a novel hybrid architecture called RWKV-X that combines RWKV's efficiency for short-range modeling with a sparse attention mechanism designed to capture long-range context. Unlike previous hybrid approaches, RWKV-X achieves linear-time complexity during training and constant-time complexity during inference decoding. It shows near-perfect accuracy on the 64K passkey retrieval benchmark when continually pretrained on 64K-token sequences. The model consistently outperforms previous RWKV-7 models on long-context benchmarks while maintaining strong performance on short-context tasks.

RWKV-X is a hybrid architecture that integrates RWKV-7 blocks with sparse attention blocks. Rather than training from scratch, RWKV-X builds upon existing models using an interleaved block expansion approach and a zero-initialization mechanism inspired by LLaMA Pro. The training follows a two-stage process:

- Stage 1: short-context training on 1024-token contexts from the MiniPile dataset, with all pretrained parameters frozen and only the newly added blocks updated.
- Stage 2: long-context continual pretraining on the ProLong-64K dataset with a 64K-token context length, during which all parameters are unfrozen and optimized jointly.
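A minimal sketch of the interleaved expansion and zero-initialization pattern, assuming a PyTorch-style module API (the block names, the placeholder attention inside the new block, and the insertion interval are illustrative assumptions, not the authors' released code):

```python
import torch
import torch.nn as nn

class SparseAttentionBlock(nn.Module):
    """Placeholder long-range block; only the zero-init pattern matters for this sketch."""
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)  # stand-in for sparse attention
        self.out_proj = nn.Linear(dim, dim)
        nn.init.zeros_(self.out_proj.weight)   # zero-init: the block contributes nothing at first
        nn.init.zeros_(self.out_proj.bias)

    def forward(self, x):
        h = self.norm(x)
        h, _ = self.attn(h, h, h, need_weights=False)
        return x + self.out_proj(h)            # residual path preserves the base model's output at init

def expand_interleaved(base_blocks: nn.ModuleList, dim: int, every: int = 4) -> nn.ModuleList:
    """Insert one new (zero-initialized) block after every `every` pretrained RWKV-7 blocks."""
    layers = []
    for i, blk in enumerate(base_blocks, start=1):
        layers.append(blk)
        if i % every == 0:
            layers.append(SparseAttentionBlock(dim))
    return nn.ModuleList(layers)

def freeze_base_train_new(model: nn.ModuleList):
    """Stage 1: freeze pretrained parameters, train only the newly added blocks."""
    for m in model:
        trainable = isinstance(m, SparseAttentionBlock)
        for p in m.parameters():
            p.requires_grad = trainable
```

Because the new blocks' output projections are zero at initialization, the expanded network initially computes the same function as the base RWKV-7 model; the added blocks then learn long-range behavior in stage 1, and all parameters are unfrozen for the 64K-context stage 2.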

Short-context evaluation shows that RWKV-X maintains competitive performance across standard benchmarks. The smaller RWKV-X (0.22B) achieves an average score of 51.0, comparable to RWKV-7's 51.8. At a larger scale, RWKV-X (3.6B) reaches 71.9, closely matching RWKV-7 (2.9B, 72.8) and Qwen2.5-3B (71.4), while surpassing LLaMA3.2-3B (69.7). These results confirm RWKV-X's effectiveness as a general-purpose LLM backbone without sacrificing performance on shorter contexts. Moreover, efficiency analysis demonstrates RWKV-X's superior scaling characteristics for long sequences. At 128K tokens, RWKV-X achieves a 1.37x speedup over Flash-Attention v3, with the advantage widening as context length increases.

In conclusion, the researchers introduced RWKV-X, a hybrid language model that combines RWKV's efficiency at modeling short-range dependencies with a novel sparse attention mechanism designed specifically for long-range context. While RWKV-X demonstrates strong performance and efficiency in long-context language modeling, several limitations remain. First, its sparse attention mechanism, which relies on top-k chunk selection, is a heuristic that may overlook semantically relevant dependencies. Second, in the current implementation, sparse attention decoding runs slower than vanilla RWKV, indicating that further engineering effort is needed to optimize performance.
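To make the top-k chunk selection heuristic concrete, here is a minimal sketch of the idea for a single decoding step (the chunk size, the mean-pooled-key scoring, and the function name are illustrative assumptions, not the authors' implementation):

```python
import torch

def topk_chunk_attention(q, k, v, chunk_size: int = 64, top_k: int = 4):
    """
    q: (1, d) query for the current decoding step
    k, v: (T, d) cached keys/values for the context
    Scores each chunk by the dot product between the query and the chunk's
    mean-pooled key, keeps the top-k chunks, and attends only over their tokens.
    """
    d = q.shape[-1]
    T = k.shape[0]
    n_chunks = (T + chunk_size - 1) // chunk_size
    pad = n_chunks * chunk_size - T
    k_pad = torch.nn.functional.pad(k, (0, 0, 0, pad))
    chunk_keys = k_pad.view(n_chunks, chunk_size, d).mean(dim=1)        # (n_chunks, d)

    chunk_scores = chunk_keys @ q.squeeze(0)                            # (n_chunks,)
    keep = torch.topk(chunk_scores, min(top_k, n_chunks)).indices       # coarse chunk selection

    # Gather token indices of the selected chunks (clipped to the true length T).
    token_idx = torch.cat([
        torch.arange(c * chunk_size, min((c + 1) * chunk_size, T)) for c in keep
    ])
    k_sel, v_sel = k[token_idx], v[token_idx]

    attn = torch.softmax((q @ k_sel.T) / d ** 0.5, dim=-1)              # (1, n_selected)
    return attn @ v_sel                                                 # (1, d)

# Example: decode one step against a 1,024-token cache.
q = torch.randn(1, 128)
k = torch.randn(1024, 128)
v = torch.randn(1024, 128)
out = topk_chunk_attention(q, k, v)
print(out.shape)  # torch.Size([1, 128])
```

Because only top_k × chunk_size keys participate at each step, the per-token decoding cost stays roughly constant as the cache grows, which is the constant-time decoding property the authors highlight; the trade-off, as noted above, is that such a heuristic chunk score can miss semantically relevant tokens.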


Check out the Paper. Also, don’t forget to follow us on Twitter.

