MarkTechPost@AI · December 23, 2024
NOVA: A Novel Video Autoregressive Model Without Vector Quantization

NOVA is a new kind of video autoregressive model that generates video by sequentially predicting frames and the sets of spatial tokens within each frame, overcoming the limitations of conventional autoregressive models in video generation. The model combines temporal and spatial prediction, uses a pre-trained language model to process text prompts, and tracks motion with optical flow. NOVA applies block-wise causal masking for temporal prediction and a bidirectional approach for spatial prediction, and it introduces scaling and shift layers to improve stability. In addition, it adds a diffusion loss to predict token probabilities in a continuous space, making training and inference more efficient while improving video quality and scalability.

🎬 NOVA generates video by predicting frames sequentially over time and token sets sequentially over space, effectively combining temporal and spatial prediction.

🧠 It uses a pre-trained language model to process text prompts and optical flow to track motion, enabling more precise video content generation.

⚙️ The model introduces scaling and shift layers to improve stability, and adopts a diffusion loss to predict token probabilities in a continuous space, improving training efficiency (see the sketch after these highlights).

🏆 On benchmarks such as T2I-CompBench, GenEval, and DPG-Bench, NOVA outperforms models like PixArt-α and SD v1/v2, generating higher-quality images and videos.
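The diffusion loss mentioned in the third highlight is the core departure from vector quantization: instead of classifying each token over a discrete codebook, a small denoiser models the distribution of each continuous token. Below is a minimal sketch of such a per-token noise-prediction objective, in the spirit of the MAR-style training NOVA builds on; the linear noise schedule, the stand-in MLP denoiser, and the omission of timestep conditioning are our simplifications, not details from the paper.

```python
# Minimal sketch of a diffusion loss over continuous token embeddings.
# NOVA's actual denoiser and noise schedule differ; this only illustrates
# the idea of predicting token probabilities in continuous space.
import torch
import torch.nn.functional as F

def diffusion_loss(denoiser: torch.nn.Module,
                   tokens: torch.Tensor,   # (N, D) ground-truth continuous tokens
                   cond: torch.Tensor,     # (N, C) conditioning from the AR backbone
                   num_steps: int = 1000) -> torch.Tensor:
    """Noise-prediction (epsilon) objective on continuous tokens."""
    t = torch.randint(0, num_steps, (tokens.shape[0],))
    # Toy linear alpha-bar schedule; real models use cosine/linear-beta schedules.
    alpha_bar = 1.0 - (t.float() + 1) / num_steps            # (N,)
    noise = torch.randn_like(tokens)
    noisy = (alpha_bar.sqrt().unsqueeze(1) * tokens
             + (1.0 - alpha_bar).sqrt().unsqueeze(1) * noise)
    # Timestep embedding omitted for brevity; denoiser sees noisy token + condition.
    pred = denoiser(torch.cat([noisy, cond], dim=1))
    return F.mse_loss(pred, noise)

# Toy usage: token dim 16, conditioning dim 32, an MLP as a stand-in denoiser.
denoiser = torch.nn.Sequential(
    torch.nn.Linear(16 + 32, 64), torch.nn.SiLU(), torch.nn.Linear(64, 16))
tokens, cond = torch.randn(8, 16), torch.randn(8, 32)
diffusion_loss(denoiser, tokens, cond).backward()
```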

Autoregressive LLMs are complex neural networks that generate coherent, contextually relevant text through sequential prediction. These LLMs excel at handling large datasets and perform strongly in translation, summarization, and conversational AI. However, achieving high quality in visual generation often comes at the cost of increased computational demands, especially at higher resolutions or for longer videos. And while video diffusion models learn efficiently in compressed latent spaces, they are limited to fixed-length outputs and lack the in-context adaptability of autoregressive models like GPT.

Current autoregressive video generation models face many limitations. Diffusion models excel at text-to-image and text-to-video tasks but rely on fixed-length token sequences, which limits their versatility and scalability in video generation. Autoregressive models typically suffer from vector-quantization issues because they map visual data into a discrete token space: higher reconstruction quality demands more tokens, and more tokens drive up computational cost. While advances like VAR and MAR improve image quality and generative modeling, their application to video generation remains constrained by modeling inefficiencies and difficulty adapting to multi-context scenarios.
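To make the quantization bottleneck concrete, here is a minimal sketch of the nearest-codebook lookup a VQ tokenizer performs; the sizes are toy values, and the point is only that reconstruction fidelity is capped by how many discrete entries the codebook holds. This is the step NOVA removes by keeping tokens continuous.

```python
# Sketch of the vector-quantization step NOVA avoids: each continuous
# latent is snapped to its nearest codebook entry, so quality is bounded
# by codebook size and sequence length.
import torch

def vector_quantize(latents: torch.Tensor, codebook: torch.Tensor):
    """latents: (N, D) continuous features; codebook: (K, D) learned entries.
    Returns (indices, quantized) with quantized[i] = codebook[indices[i]]."""
    dists = torch.cdist(latents, codebook)   # (N, K) pairwise L2 distances
    indices = dists.argmin(dim=1)            # discrete token ids in [0, K)
    return indices, codebook[indices]        # snap each latent to its entry

# Toy usage: 8 latents of dim 16 against a 512-entry codebook.
ids, quant = vector_quantize(torch.randn(8, 16), torch.randn(512, 16))
print(ids.shape, quant.shape)  # torch.Size([8]) torch.Size([8, 16])
```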

To address these issues, researchers from BUPT, ICT-CAS, DLUT, and BAAI proposed NOVA, a non-quantized autoregressive model for video generation. NOVA approaches video generation by predicting frames sequentially over time and sets of spatial tokens within each frame in a flexible order. The model combines time-based and space-based prediction by separating how frames and spatial token sets are generated. It uses a pre-trained language model to process text prompts and optical flow to track motion. For time-based prediction, it applies block-wise causal masking, while for space-based prediction it predicts token sets bidirectionally. The model introduces scaling and shift layers to improve stability and uses sine-cosine embeddings for positional encoding. It also adds a diffusion loss to predict token probabilities in a continuous space, making training and inference more efficient and improving video quality and scalability.
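As a hedged illustration of the temporal side, the sketch below builds a block-wise causal attention mask, assuming every frame contributes the same number of tokens; the function name and sizes are ours, not from the NOVA codebase. Tokens attend bidirectionally within their own frame but only to earlier frames across time.

```python
# Block-wise causal mask: bidirectional attention within a frame,
# causal attention across frames. Sizes here are illustrative.
import torch

def block_causal_mask(num_frames: int, tokens_per_frame: int) -> torch.Tensor:
    """Boolean (T, T) mask, T = num_frames * tokens_per_frame;
    mask[i, j] is True where position i may attend to position j."""
    frame_id = torch.arange(num_frames * tokens_per_frame) // tokens_per_frame
    # Attend to j iff j's frame is the same as or earlier than i's frame.
    return frame_id.unsqueeze(1) >= frame_id.unsqueeze(0)

print(block_causal_mask(num_frames=3, tokens_per_frame=2).int())
# tensor([[1, 1, 0, 0, 0, 0],
#         [1, 1, 0, 0, 0, 0],
#         [1, 1, 1, 1, 0, 0],
#         [1, 1, 1, 1, 0, 0],
#         [1, 1, 1, 1, 1, 1],
#         [1, 1, 1, 1, 1, 1]])
```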

The researchers trained NOVA on high-quality datasets, starting with 16 million image-text pairs from sources such as DataComp, COYO, Unsplash, and JourneyDB, later expanded to 600 million pairs from LAION, DataComp, and COYO. For text-to-video, they used 19 million video-text pairs from Panda70M and other internal datasets, plus 1 million pairs from Pexels; a caption engine based on Emu2-17B generated the descriptions. NOVA's architecture included a spatial AR layer, a denoising MLP block, and a 16-layer encoder-decoder structure for handling the spatial and temporal components. The temporal encoder-decoder dimensions ranged from 768 to 1536, and the denoising MLP had three blocks of 1280 dimensions. A pre-trained VAE captured image features using masking and diffusion schedulers. NOVA was trained on sixteen A100 nodes with the AdamW optimizer, first on text-to-image tasks and then on text-to-video tasks.
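The reported hyperparameters are easier to scan collected in one place. The configuration sketch below gathers them as stated above; the field names, the stand-in MLP, and the learning rate are our assumptions, not values from the paper.

```python
# Hedged config sketch of the reported NOVA training setup.
from dataclasses import dataclass

import torch

@dataclass
class NovaTrainConfig:
    encoder_decoder_layers: int = 16   # spatial/temporal encoder-decoder depth
    temporal_dim_min: int = 768        # smallest reported temporal width
    temporal_dim_max: int = 1536       # largest reported temporal width
    denoise_mlp_blocks: int = 3        # denoising MLP depth
    denoise_mlp_dim: int = 1280        # denoising MLP hidden width
    a100_nodes: int = 16               # A100 nodes used for training

cfg = NovaTrainConfig()

# Stand-in denoising MLP matching the reported depth and width.
blocks = []
for _ in range(cfg.denoise_mlp_blocks):
    blocks += [torch.nn.Linear(cfg.denoise_mlp_dim, cfg.denoise_mlp_dim),
               torch.nn.SiLU()]
denoise_mlp = torch.nn.Sequential(*blocks)

# AdamW as reported; the learning rate is an assumption, not from the paper.
optimizer = torch.optim.AdamW(denoise_mlp.parameters(), lr=1e-4)
```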

Results from evaluations on T2I-CompBench, GenEval, and DPG-Bench showed that NOVA outperformed models such as PixArt-α and SD v1/v2 on text-to-image and text-to-video generation tasks. NOVA generated higher-quality images and videos with clearer, more detailed visuals, and its outputs aligned more closely with the text prompts.

In summary, the proposed NOVA model significantly advances text-to-image and text-to-video generation. By integrating temporal frame-by-frame and spatial set-by-set prediction, the method reduces computational complexity and improves efficiency while maintaining high-quality outputs. Its performance exceeds existing models, with near-commercial image quality and video fidelity. This work provides a foundation for future research, offering a baseline for scalable models and real-time video generation and opening up new possibilities for advances in the field.


Check out the Paper and GitHub Page. All credit for this research goes to the researchers of this project.

Related tags

NOVA, Video Generation, Autoregressive Models, Diffusion Models, Artificial Intelligence