MarkTechPost@AI, October 7, 2024
LOONG: A New Autoregressive LLM-based Video Generator That can Generate Minute-Long Videos

Loong is an autoregressive LLM-based video generator capable of producing minute-long videos. This article describes its training process, the challenges encountered and how they are addressed, its system design and parameters, as well as the model's applications and potential risks.

🎬 Loong is a new type of video generator built on an autoregressive LLM that can produce minute-long videos. It is trained from scratch, treating text and video tokens as a unified sequence, and uses dedicated training methods and strategies to overcome the difficulties of long-video training.

💪 Loong is trained in three stages. Stage 1 pre-trains the model on a large set of static images, laying the foundation for modeling per-frame appearance; stage 2 trains on images and short video clips, learning to capture short-term temporal dependencies; stage 3 increases the number of video frames and continues joint training.

🚧 Generating long videos poses many challenges, such as imbalanced loss during training and, at inference time, error accumulation and visual-quality degradation from conditioning on the model's own predictions. To mitigate these problems, the researchers propose a progressive short-to-long training strategy with loss reweighting.

🛠️ Loong uses a two-component system: a video tokenizer that compresses videos into tokens, and a decoder-only transformer that predicts the next video token conditioned on text tokens. The tokenizer uses a 3D CNN architecture inspired by MAGViT2 and operates on low-resolution videos.

Video Generation by LLMs is an emerging field with a promising growth trajectory. While Autoregressive Large Language Models (LLMs) have excelled in generating coherent and lengthy sequences of tokens in natural language processing, their application in video generation has been limited to short videos of a few seconds. To address this, researchers have introduced Loong, an auto-regressive LLM-based video generator capable of generating videos that span minutes.

Training a video generation model like Loong involves a unique process. The model is trained from scratch, with text tokens and video tokens treated as a unified sequence. The researchers propose a progressive short-to-long training approach and a loss reweighting scheme to mitigate the loss imbalance problem in long-video training. This allows Loong to be trained on 10-second videos and then extended to generate minute-long videos conditioned on text prompts.
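To make the unified-sequence idea concrete, here is a minimal sketch of how text and video tokens might be packed into one stream for next-token prediction. The ID ranges follow the vocabulary sizes reported later in this article; the function and special-token names are illustrative placeholders, not from the paper.

```python
# Illustrative sketch (not the authors' code) of the unified text+video
# token sequence used for next-token prediction.
TEXT_VOCAB = 32_000
VIDEO_VOCAB = 8_192
BOV = TEXT_VOCAB + VIDEO_VOCAB          # hypothetical <begin_of_video> id

def build_sequence(text_ids, video_ids):
    """Shift video token ids past the text vocabulary so both modalities
    share one embedding table, then concatenate into a single stream."""
    return text_ids + [BOV] + [TEXT_VOCAB + v for v in video_ids]

seq = build_sequence([17, 512, 9], [3, 4095, 88])
inputs, targets = seq[:-1], seq[1:]     # standard next-token prediction
```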

However, generating long videos is considerably trickier and presents several challenges. First, there is an imbalanced loss problem during training. Under the next-token-prediction objective, predicting early-frame tokens from text prompts is harder than predicting late-frame tokens from previous frames, leading to uneven loss across the sequence. As video length increases, the accumulated loss from the easy tokens overshadows the loss from the difficult ones and dominates the gradient direction. Second, during training the model predicts the next token from ground-truth tokens, but at inference it must rely on its own predictions. This discrepancy causes error accumulation, amplified by strong inter-frame dependencies and the sheer number of video tokens, and degrades visual quality in long-video inference.
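The second issue is the classic exposure-bias problem. The toy sketch below, with `model` as a stand-in callable returning (batch, seq_len, vocab) logits, contrasts teacher-forced training with autoregressive rollout; it is illustrative, not the authors' code.

```python
# Toy illustration of the train/inference mismatch (exposure bias).
import torch
import torch.nn.functional as F

def teacher_forced_loss(model, gt_tokens):
    # Training: every position is conditioned on the GROUND-TRUTH prefix.
    logits = model(gt_tokens[:, :-1])
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           gt_tokens[:, 1:].reshape(-1))

@torch.no_grad()
def autoregressive_rollout(model, prompt, steps):
    # Inference: each token is conditioned on the model's OWN samples,
    # so small per-token errors compound over thousands of video tokens.
    seq = prompt
    for _ in range(steps):
        next_tok = model(seq)[:, -1].argmax(dim=-1, keepdim=True)
        seq = torch.cat([seq, next_tok], dim=1)
    return seq
```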

To mitigate the challenge of imbalanced video-token difficulties, the researchers propose a progressive short-to-long training strategy with loss reweighting, illustrated in the figures below:

Figure: Progressive short-to-long training.

Figure: Imbalanced training losses when training directly on long videos. The training loss for late frames is smaller than that of early frames, and the loss for the first frame remains relatively high, leading to suboptimal visual quality in the early frames.

Training is divided into three stages of progressively increasing video length (a sketch of the schedule and the reweighted loss follows the list):

Stage 1: The model is pre-trained on text-to-image generation over a large dataset of static images, establishing a strong foundation for modeling per-frame appearance.

Stage 2: The model is trained on images and short video clips, learning to capture short-term temporal dependencies.

Stage 3: The number of video frames is increased, and joint training continues.
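As a rough sketch of how the staged schedule and the loss reweighting could be wired together: the frame counts for stages 1-2 and the exponential down-weighting of late frames below are illustrative assumptions (only the 65-frame, 10-second stage-3 setting is reported), not the paper's exact recipe.

```python
# Hedged sketch of progressive short-to-long training with per-frame
# loss reweighting, so the abundant, easier late-frame tokens no longer
# dominate the gradient direction.
import torch
import torch.nn.functional as F

STAGES = [
    {"frames": 1,  "note": "stage 1: text-to-image pretraining"},  # images only
    {"frames": 17, "note": "stage 2: images + short clips"},       # assumed length
    {"frames": 65, "note": "stage 3: 10-second videos"},           # reported length
]

def reweighted_loss(logits, targets, tokens_per_frame, decay=0.95):
    """Cross-entropy where tokens in frame f are weighted by decay**f."""
    per_token = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
        reduction="none",
    ).reshape(targets.shape)                                  # (B, T)
    frame_idx = torch.arange(targets.size(1), device=targets.device) // tokens_per_frame
    weights = decay ** frame_idx.to(per_token.dtype)          # (T,)
    return (per_token * weights).sum() / (weights.sum() * targets.size(0))
```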

Loong is designed as a two-component system: a video tokenizer that compresses videos into discrete tokens, and a decoder-only transformer that predicts the next video tokens conditioned on text tokens.

Loong uses a 3D CNN architecture for the tokenizer, inspired by MAGViT2. The model works with low-resolution videos and leaves super-resolution to post-processing. The tokenizer compresses a 10-second video (65 frames at 128×128 resolution) into a sequence of 17×16×16 discrete tokens. Converting video frames into discrete tokens lets text and video tokens form a unified sequence, so text-to-video generation is modeled as autoregressively predicting video tokens from text tokens with decoder-only Transformers.
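These compression figures check out arithmetically: with a causal tokenizer the first frame occupies its own temporal slot, the remaining 64 frames are compressed 4x in time, and each spatial axis is compressed 8x (strides as reported later in this article).

```python
# Worked arithmetic for the tokenizer compression above.
frames, height, width = 65, 128, 128
t = 1 + (frames - 1) // 4         # 17 temporal slots (first frame separate)
h, w = height // 8, width // 8    # 16 x 16 spatial grid
print(t, h, w, t * h * w)         # 17 16 16 -> 4,352 tokens per 10-second clip
```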

Large language models can generalize to longer videos, but extending beyond the trained duration risks error accumulation and quality degradation. Several techniques mitigate this (a sketch of the first follows the list):

- Video token re-encoding
- Sampling strategy
- Super-resolution and refinement
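Of these, video token re-encoding is the least self-explanatory. Here is a hedged sketch of the general idea, with `generate_chunk`, `tokenize`, and `detokenize` as hypothetical stand-ins rather than APIs from the paper's code:

```python
# Hedged sketch of "video token re-encoding" for extending generation
# beyond the trained 10-second window: periodically decode the most
# recent tokens to pixels and re-tokenize them, so the model conditions
# on tokens the tokenizer would actually emit rather than on drifted
# samples. All callables here are hypothetical stand-ins.

def extend_video(generate_chunk, tokenize, detokenize, tokens, n_chunks,
                 ctx_len=4352):                  # 17*16*16 = one trained window
    for _ in range(n_chunks):
        context = tokens[-ctx_len:]              # most recent window only
        context = tokenize(detokenize(context))  # snap drifted tokens back
                                                 # onto the codebook
        tokens = tokens + generate_chunk(context)
    return tokens
```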

The model uses the LLaMA architecture, with sizes ranging from 700M to 7B parameters. Models are trained from scratch without text-pretrained weights. The vocabulary contains 32,000 tokens for text, 8,192 tokens for video, and 10 special tokens (40,202 in total). The video tokenizer replicates MAGViT2, using a causal 3D CNN structure so that the first video frame is tokenized independently. Spatial dimensions are compressed by 8x and the temporal dimension by 4x. Clustering Vector Quantization (CVQ) is used for quantization, improving codebook usage over standard VQ. The video tokenizer has 246M parameters.
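The quantization step can be pictured as a nearest-neighbor lookup into the 8,192-entry codebook. The sketch below shows plain VQ; CVQ additionally re-initializes underused codes from clustered encoder outputs, and that bookkeeping is omitted here. The latent dimension of 64 is chosen arbitrarily for the demo.

```python
# Minimal sketch of the vector-quantization lookup inside the tokenizer:
# each latent is snapped to its nearest codebook entry and emitted as a
# discrete token id.
import torch

def quantize(latents, codebook):
    """latents: (N, D) encoder outputs; codebook: (K, D), K = 8,192."""
    dists = torch.cdist(latents, codebook)   # (N, K) pairwise distances
    ids = dists.argmin(dim=1)                # nearest code per latent
    return codebook[ids], ids                # quantized vectors, token ids

codebook = torch.randn(8192, 64)             # 8,192 video tokens, as reported
quantized, token_ids = quantize(torch.randn(10, 64), codebook)
```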

The Loong model generates long videos with consistent appearance, large motion dynamics, and natural scene transitions. Loong models text tokens and video tokens in a unified sequence and overcomes the challenges of long-video training with its progressive short-to-long training scheme and loss reweighting. The model could assist visual artists, film producers, and entertainment applications; at the same time, it could be misused to create fake content and spread misleading information.


Check out the Paper. All credit for this research goes to the researchers of this project.


