MarkTechPost@AI · July 22, 03:45
This AI Paper from Alibaba Introduces Lumos-1: A Unified Autoregressive Video Generator Leveraging MM-RoPE and AR-DF for Efficient Spatiotemporal Modeling

Autoregressive video generation is an emerging area of AI research that aims to generate videos frame by frame. Unlike traditional methods, autoregressive models mimic large language models, generating content dynamically by learning spatial and temporal dynamics, and hold the promise of unifying video, image, and text generation. However, accurately modeling the spatiotemporal dependencies of video remains a challenge, and existing approaches often introduce extra complexity or efficiency problems. A research team from Alibaba's DAMO Academy, Hupan Lab, and Zhejiang University has introduced Lumos-1, a unified model that preserves the large language model architecture. It adopts MM-RoPE to tackle the difficulty of modeling video's three-dimensional structure and uses AR-DF for efficient training and high-quality generation. The model supports multimodal generation, performs strongly on multiple benchmarks, and sets a new bar for efficient video generation.

🎯 **The spatiotemporal dependency challenge and Lumos-1's innovation**: The core challenge in video generation is capturing and modeling its inherent spatiotemporal dependencies, and existing autoregressive models are often limited by added complexity or inefficiency. Lumos-1 addresses this with MM-RoPE (Multi-Modal Rotary Position Embedding), which balances frequencies across the spatial and temporal dimensions, avoiding the detail loss and ambiguous positional encoding of conventional 3D RoPE and ensuring an accurate representation of video's three-dimensional structure.

🔄 **AR-DF improves training efficiency and generation quality**: To overcome the loss imbalance across video frames during training, Lumos-1 adopts AR-DF (Autoregressive Discrete Diffusion Forcing). Training with temporal tube masking keeps the model from over-relying on unmasked spatial information, ensuring even learning across the video sequence. The same strategy is mirrored at inference time, yielding high-quality frame generation without degradation.

🚀 **Efficient training united with strong performance**: Lumos-1 was trained from scratch on only 48 GPUs using 60 million images and 10 million videos, demonstrating high memory efficiency. Its performance is on par with leading models in the field: it matches EMU3 on the GenEval benchmark, is comparable to COSMOS-Video2World on VBench-I2V, and rivals OpenSoraPlan on VBench-T2V. This shows that Lumos-1's lightweight training regime does not sacrifice competitiveness.

🌐 **Multimodal generation and broad applicability**: Lumos-1 supports text-to-video, image-to-video, and text-to-image generation, demonstrating strong generalization across modalities. Beyond tackling the spatiotemporal modeling problem, its efficient and effective autoregressive framework paves the way for the next generation of scalable, high-quality video generation models and opens new directions for future multimodal research.

Autoregressive video generation is a rapidly evolving research area that synthesizes videos frame by frame using learned patterns of spatial arrangement and temporal dynamics. Unlike traditional video-creation methods, which may rely on pre-built frames or handcrafted transitions, autoregressive models generate content dynamically conditioned on prior tokens, much as large language models predict the next word. This approach offers the potential to unify video, image, and text generation under a shared framework by leveraging the structural power of transformer-based architectures.
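
To make the next-token idea concrete, here is a minimal, hypothetical sketch of greedy autoregressive decoding over discrete video tokens. The `model` interface, shapes, and function name are illustrative assumptions, not the actual Lumos-1 API.

```python
# Illustrative only: greedy next-token decoding over discrete video tokens.
# `model` is assumed to map a token sequence to per-position logits.
import torch

def generate_video_tokens(model, prompt_tokens, num_new_tokens):
    """Predict one discrete video token at a time from prior tokens."""
    tokens = prompt_tokens.clone()                      # (1, seq_len)
    for _ in range(num_new_tokens):
        logits = model(tokens)                          # (1, seq_len, vocab_size)
        next_token = logits[:, -1].argmax(dim=-1, keepdim=True)  # (1, 1)
        tokens = torch.cat([tokens, next_token], dim=1)  # append and continue
    return tokens
```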

A major problem in this space is accurately capturing and modeling the intrinsic spatiotemporal dependencies in video. Videos contain rich structure across both time and space, and encoding this complexity so that models can predict coherent future frames remains difficult. When these dependencies are modeled poorly, the result is broken frame continuity or unrealistic content. Traditional training techniques such as random masking also struggle: they often fail to provide balanced learning signals across frames, and when spatial information leaks from adjacent frames, prediction becomes trivially easy.

Several methods attempt to address this challenge by adapting the autoregressive generation pipeline, but they often deviate from standard large language model structures. Some rely on external pre-trained text encoders, making models more complex and less coherent; others incur significant generation latency through inefficient decoding. Autoregressive models such as Phenaki and EMU3 aim to support end-to-end generation, yet they still struggle with performance consistency and high training costs. Techniques such as raster-scan ordering or global sequence attention also do not scale well to high-dimensional video data.

Researchers from Alibaba Group’s DAMO Academy, Hupan Lab, and Zhejiang University introduced Lumos-1, a unified autoregressive video generation model that stays true to the large language model architecture. Unlike previous approaches, Lumos-1 eliminates the need for external encoders and changes very little of the original LLM design. It uses MM-RoPE, or Multi-Modal Rotary Position Embedding, to address the challenge of modeling video’s three-dimensional structure, and it adopts a token dependency scheme that preserves intra-frame bidirectionality and inter-frame temporal causality, aligning more naturally with how video data behaves.
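
As an illustration of that token dependency pattern, the sketch below builds an attention mask that is bidirectional within a frame and causal across frames. It assumes a frame-major flattened token sequence; the function name and shapes are hypothetical, not taken from the paper's code.

```python
# Illustrative only: block-wise attention mask with intra-frame
# bidirectionality and inter-frame causality.
import torch

def frame_causal_mask(num_frames: int, tokens_per_frame: int) -> torch.Tensor:
    """Boolean attention mask; True means attention is allowed."""
    seq_len = num_frames * tokens_per_frame
    # Frame index of every token in the flattened (frame-major) sequence.
    frame_id = torch.arange(seq_len) // tokens_per_frame
    # Token i may attend to token j iff j lies in the same or an earlier frame,
    # so tokens within a frame see each other in both directions.
    return frame_id.unsqueeze(1) >= frame_id.unsqueeze(0)  # (seq_len, seq_len)

# Example: 3 frames of 4 tokens each -> a 12x12 block-causal mask.
print(frame_causal_mask(3, 4))
```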

MM-RoPE extends existing RoPE methods to balance the frequency spectrum across spatial and temporal dimensions. Traditional 3D RoPE misallocates frequency focus, causing detail loss or ambiguous positional encoding; MM-RoPE restructures the allocation so that the temporal, height, and width axes each receive balanced representation. To address loss imbalance in frame-wise training, Lumos-1 introduces AR-DF, or Autoregressive Discrete Diffusion Forcing. AR-DF applies temporal tube masking during training so that the model does not over-rely on unmasked spatial information, ensuring even learning across the video sequence. The inference strategy mirrors training, allowing high-quality frame generation without degradation.
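
The temporal tube masking idea can be sketched as follows: one spatial mask is sampled and repeated along the time axis, so a masked position stays masked in every frame and cannot be recovered by copying it from a neighboring frame. The tensor layout, mask ratio, and names here are illustrative assumptions, not the paper's exact recipe.

```python
# Illustrative only: temporal "tube" masking over a (T, H, W) token grid.
import torch

def temporal_tube_mask(video_tokens: torch.Tensor, mask_ratio: float = 0.5) -> torch.Tensor:
    """video_tokens: (T, H, W) grid of discrete token ids.
    Returns a boolean mask of the same shape; True marks masked positions."""
    T, H, W = video_tokens.shape
    # Sample a single spatial mask for the whole clip...
    spatial_mask = torch.rand(H, W) < mask_ratio         # (H, W)
    # ...and extend it along the temporal axis into tubes.
    return spatial_mask.unsqueeze(0).expand(T, H, W)     # (T, H, W)

# Example: mask roughly half of the spatial positions in a 4x8x8 token grid.
tokens = torch.randint(0, 1024, (4, 8, 8))
mask = temporal_tube_mask(tokens, mask_ratio=0.5)
```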

Lumos-1 was trained from scratch on 60 million images and 10 million videos using only 48 GPUs, which is notably memory-efficient at this scale. The model achieves results comparable to top models in the field: it matches EMU3 on the GenEval benchmark, performs on par with COSMOS-Video2World on VBench-I2V, and rivals OpenSoraPlan on VBench-T2V. These comparisons show that Lumos-1’s lightweight training does not compromise competitiveness. The model supports text-to-video, image-to-video, and text-to-image generation, demonstrating strong generalization across modalities.

Overall, this research not only identifies and addresses core challenges in spatiotemporal modeling for video generation but also showcases how Lumos-1 sets a new standard for unifying efficiency and effectiveness in autoregressive frameworks. By successfully blending advanced architectures with innovative training, Lumos-1 paves the way for the next generation of scalable, high-quality video generation models and opens up new avenues for future multimodal research.


Check out the Paper and GitHub. All credit for this research goes to the researchers of this project.

