MarkTechPost@AI · July 8, 04:40
How Radial Attention Cuts Costs in Video Diffusion by 4.4× Without Sacrificing Quality

This article introduces Radial Attention, a sparse attention mechanism for video diffusion models designed to make long-video generation more efficient. The researchers observed that attention scores in video diffusion models decay as spatial and temporal distance grows, a phenomenon they call Spatiotemporal Energy Decay. Inspired by this, Radial Attention reduces computation by mimicking the natural decay: it uses a static attention pattern with exponentially shrinking windows, delivering up to a 1.9× speedup and supporting videos four times longer. With lightweight LoRA fine-tuning, Radial Attention cuts training costs by 4.4× and inference costs by 3.7× while preserving video quality across several state-of-the-art diffusion models.

💡 Video diffusion models have made remarkable progress in generating high-quality videos, but the extra temporal dimension greatly increases computational demands, especially for self-attention.

⏳ Radial Attention is a sparse attention mechanism inspired by the observation that attention scores in video diffusion models decay with increasing spatial and temporal distance, a phenomenon termed Spatiotemporal Energy Decay.

⚡️ Radial Attention reduces computation with a static attention pattern and exponentially shrinking windows, improving efficiency. This lets pre-trained models generate videos up to four times longer while cutting training costs (4.4×) and inference time (3.7×).

✅ Radial Attention was evaluated on three leading text-to-video diffusion models, Mochi 1, HunyuanVideo, and Wan2.1, showing improvements in both speed and quality. Compared with existing sparse-attention baselines, it delivers better perceptual quality alongside significant computational gains.

🛠️ With lightweight LoRA fine-tuning, Radial Attention efficiently generates high-quality long videos. LoRA fine-tuning even outperforms full fine-tuning in some cases, demonstrating its advantage in resource efficiency.

Introduction to Video Diffusion Models and Computational Challenges

Diffusion models have made impressive progress in generating high-quality, coherent videos, building on their success in image synthesis. However, handling the extra temporal dimension in videos significantly increases computational demands, especially since self-attention scales poorly with sequence length. This makes it difficult to train or run these models efficiently on long videos. Attempts like Sparse VideoGen utilize attention head classification to accelerate inference, but they struggle with accuracy and generalization during training. Other methods replace softmax attention with linear alternatives, although these often necessitate significant architectural changes. Interestingly, the natural energy decay of signals over time in physics inspires new, more efficient modeling strategies.
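The quadratic scaling mentioned above is easy to make concrete. The sketch below uses illustrative numbers (the per-frame token grid is an assumption, not a figure from the paper) to show why adding frames blows up dense self-attention cost:

```python
# Rough attention-cost arithmetic (illustrative numbers, not from the paper):
# a latent video of F frames with a 30x45 token grid per frame has
# n = F * 30 * 45 tokens, and dense self-attention touches n * n token pairs.
def attn_pairs(frames: int, tokens_per_frame: int = 30 * 45) -> int:
    n = frames * tokens_per_frame
    return n * n

base = attn_pairs(32)
longer = attn_pairs(128)
print(longer / base)  # 4x the frames -> 16x the attention cost
```

This quadratic blow-up in sequence length is the cost that sparse mechanisms such as Radial Attention aim to tame.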

Evolution of Attention Mechanisms in Video Synthesis

Early video models extended 2D architectures by incorporating temporal components, but newer approaches, such as DiT and Latte, enhance spatial-temporal modeling through advanced attention mechanisms. While 3D dense attention achieves state-of-the-art performance, its computational cost increases rapidly with video length, making the generation of long videos expensive. Techniques such as timestep distillation, quantization, and sparse attention help reduce this burden, but often overlook the unique structure of video data. Although alternatives like linear or hierarchical attention improve efficiency, they typically struggle to maintain detail or scale effectively in practice.

Introduction to Spatiotemporal Energy Decay and Radial Attention

Researchers from MIT, NVIDIA, Princeton, UC Berkeley, Stanford, and First Intelligence have identified a phenomenon in video diffusion models called Spatiotemporal Energy Decay, where attention scores between tokens decline as spatial or temporal distance increases, mirroring how signals naturally fade. Motivated by this, they proposed Radial Attention, a sparse attention mechanism with O(n log n) complexity. It uses a static attention mask where tokens attend mostly to nearby ones, with the attention window shrinking over time. This enables pre-trained models to generate videos up to four times longer, reducing training costs by 4.4 times and inference time by 3.7 times, all while preserving video quality.
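As a rough illustration of such a static, shrinking-window mask, the sketch below builds a boolean mask over frame-grouped tokens in which the per-frame attention band halves each time the temporal distance between two frames doubles. The band sizes and grouping here are assumptions for illustration; the paper's exact mask construction differs.

```python
import numpy as np

def radial_mask(num_frames: int, tokens_per_frame: int) -> np.ndarray:
    """Boolean attention mask whose per-frame band shrinks exponentially
    with temporal distance (hypothetical variant of the paper's static
    radial mask; exact window schedule is an assumption)."""
    n = num_frames * tokens_per_frame
    mask = np.zeros((n, n), dtype=bool)
    for fi in range(num_frames):
        for fj in range(num_frames):
            d = abs(fi - fj)
            if d == 0:
                w = tokens_per_frame  # dense attention within a frame
            else:
                # band half-width halves each time temporal distance doubles
                w = max(1, tokens_per_frame >> (int(np.floor(np.log2(d))) + 1))
            # each query attends to a diagonal band of the other frame
            for t in range(tokens_per_frame):
                i = fi * tokens_per_frame + t
                lo = max(0, t - w)
                hi = min(tokens_per_frame, t + w + 1)
                mask[i, fj * tokens_per_frame + lo : fj * tokens_per_frame + hi] = True
    return mask

mask = radial_mask(num_frames=8, tokens_per_frame=16)
print(f"kept {mask.mean():.1%} of {mask.size} token pairs")
```

Because the kept band narrows geometrically with distance, the retained token pairs sum to roughly O(n log n) rather than O(n²), which is where the claimed speedups come from.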

Sparse Attention Using Energy Decay Principles

Radial Attention is based on the insight that attention scores in video models decrease with increasing spatial and temporal distance, a phenomenon known as Spatiotemporal Energy Decay. Instead of attending to all tokens equally, Radial Attention strategically reduces computation where attention is weaker. It introduces a sparse attention mask that decays exponentially outward in both space and time, preserving only the most relevant interactions. This results in an O(n log n) complexity, making it significantly faster and more efficient than dense attention. Additionally, with minimal fine-tuning using LoRA adapters, pre-trained models can be adapted to generate much longer videos efficiently and effectively.
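The LoRA adaptation the article refers to can be sketched as a frozen linear layer plus a trainable low-rank update, W·x + (α/r)·B·A·x. This is the standard LoRA parameterization; the dimensions and initialization scales below are illustrative, not the paper's settings:

```python
import numpy as np

rng = np.random.default_rng(0)

class LoRALinear:
    """Frozen weight W plus trainable low-rank update (alpha/r) * B @ A."""
    def __init__(self, in_dim: int, out_dim: int, r: int = 8, alpha: float = 16.0):
        self.W = rng.standard_normal((out_dim, in_dim)) * 0.02  # frozen, pre-trained
        self.A = rng.standard_normal((r, in_dim)) * 0.01        # trainable
        self.B = np.zeros((out_dim, r))                         # trainable, zero-init
        self.scale = alpha / r

    def __call__(self, x: np.ndarray) -> np.ndarray:
        return x @ self.W.T + self.scale * (x @ self.A.T) @ self.B.T

layer = LoRALinear(64, 64)
x = rng.standard_normal((10, 64))
y = layer(x)
# with B zero-initialized, the adapted layer matches the frozen base exactly
assert np.allclose(y, x @ layer.W.T)
print(y.shape)  # (10, 64)
```

Only A and B (2·r·d parameters per layer instead of d²) are trained, which is why adapting a pre-trained model to the new sparse attention pattern is so much cheaper than full fine-tuning.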

Evaluation Across Video Diffusion Models

Radial Attention is evaluated on three leading text-to-video diffusion models: Mochi 1, HunyuanVideo, and Wan2.1, demonstrating both speed and quality improvements. Compared to existing sparse attention baselines, such as SVG and PowerAttention, Radial Attention offers better perceptual quality and significant computational gains, including up to 3.7 times faster inference and 4.4 times lower training cost for extended videos. It scales efficiently to 4× longer video lengths and maintains compatibility with existing LoRAs, including style LoRAs. Importantly, LoRA fine-tuning with Radial Attention outperforms full fine-tuning in some cases, demonstrating its effectiveness and resource efficiency for high-quality long-video generation.

Conclusion: Scalable and Efficient Long Video Generation

In conclusion, Radial Attention is a sparse attention mechanism designed to handle long-video generation in diffusion models efficiently. Inspired by the observed decline in attention scores with increasing spatial and temporal distance, a phenomenon the researchers term Spatiotemporal Energy Decay, the approach mimics this natural decay to reduce computation. It utilizes a static attention pattern with exponentially shrinking windows, achieving up to 1.9 times faster performance and supporting videos up to four times longer. With lightweight LoRA-based fine-tuning, it significantly cuts training costs (by 4.4×) and inference time (by 3.7×), all while preserving video quality across multiple state-of-the-art diffusion models.


Check out the Paper and GitHub Page. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter, Youtube and Spotify and don’t forget to join our 100k+ ML SubReddit and Subscribe to our Newsletter.

