MarkTechPost@AI, December 14, 2024
Researchers from UCLA and Apple Introduce STIV: A Scalable AI Framework for Text and Image Conditioned Video Generation

Researchers from Apple and the University of California propose the STIV framework, a scalable approach to text- and image-conditioned video generation. The method incorporates the image condition into a Diffusion Transformer (DiT) through frame replacement and applies the text condition via joint image-text classifier-free guidance. STIV can perform text-to-video (T2V) and text-image-to-video (TI2V) tasks within a single model and extends to applications such as video prediction, frame interpolation, multi-view generation, and long video generation. Experiments show that STIV performs strongly on multiple benchmarks, providing a solid foundation for future work in video generation.

🖼️ The STIV framework incorporates the image condition into the DiT model via frame replacement and applies the text condition through joint image-text classifier-free guidance, unifying text-to-video (T2V) and text-image-to-video (TI2V) generation.

🚀 The framework is highly scalable and flexible: it performs well on public benchmarks and extends to controllable video generation, video prediction, frame interpolation, long video generation, and multi-view generation.

📊 The researchers detail the setup, training, and evaluation of the T2V and T2I models, including the AdaFactor optimizer with a specific learning rate and gradient clipping, trained on a dataset of more than 90 million high-quality video-caption pairs.

📈 Scaling the models from 600M to 8.7B parameters substantially improves both T2V and STIV performance; for example, the T2V VBench-Semantic score rises from 72.5 to 79.5, and the STIV-M-512 model reaches a VBench-I2V score of 90.1.

Video generation has improved with models like Sora, which uses the Diffusion Transformer (DiT) architecture. While text-to-video (T2V) models have advanced, they often struggle to produce clear, temporally consistent videos without additional references. Text-image-to-video (TI2V) models address this limitation by grounding generation in an initial image frame. Even so, reaching Sora-level performance remains difficult: image-based inputs are hard to integrate into the model effectively, and higher-quality datasets are needed to improve the model's output.

Current methods have explored integrating image conditions into U-Net architectures, but applying these techniques to DiT models remained unresolved. While diffusion-based approaches dominated text-to-video generation using latent diffusion models (LDMs), scaling models, and shifting to transformer-based architectures, many studies focused on isolated aspects, overlooking their combined impact on performance. Techniques like cross-attention in PixArt-α, self-attention in SD3, and stability tricks such as QK-norm showed some improvements but became less effective as models scaled. Despite these advances, no unified model successfully combined T2V and TI2V capabilities, limiting progress toward more efficient and versatile video generation.
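
As an aside, QK-norm refers to normalizing the query and key vectors before the attention dot product so the logits stay bounded as models grow. The sketch below is a generic PyTorch illustration of one common variant (per-head LayerNorm on queries and keys), not code from the STIV paper or the cited models; the class name and dimensions are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class QKNormAttention(nn.Module):
    """Multi-head self-attention with QK-norm (illustrative sketch)."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)
        # Normalize queries and keys per head before the dot product.
        self.q_norm = nn.LayerNorm(self.head_dim)
        self.k_norm = nn.LayerNorm(self.head_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, n, d = x.shape
        qkv = self.qkv(x).view(b, n, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)           # each: (b, heads, n, head_dim)
        q, k = self.q_norm(q), self.k_norm(k)          # QK-norm keeps attention logits bounded
        out = F.scaled_dot_product_attention(q, k, v)  # standard scaled dot-product attention
        out = out.transpose(1, 2).reshape(b, n, d)
        return self.proj(out)
```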

To solve this, researchers from Apple and the University of California, Los Angeles developed a comprehensive framework that systematically examines the interaction between model architectures, training methods, and data curation strategies. The resulting STIV method is a simple and scalable approach to text- and image-conditioned video generation. Using frame replacement, it incorporates image conditions into a Diffusion Transformer (DiT) and applies text conditioning through joint image-text classifier-free guidance. This design enables STIV to perform both text-to-video (T2V) and text-image-to-video (TI2V) tasks within a single model. Additionally, STIV can be easily extended to applications such as video prediction, frame interpolation, multi-view generation, and long video generation.
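
The two mechanisms described above can be sketched roughly as follows. This is a minimal, hedged illustration of the general idea (frame replacement plus joint image-text classifier-free guidance) rather than the authors' implementation; the function names, the single guidance scale, and the DiT call signature are assumptions.

```python
import torch

def frame_replacement(noisy_latents, image_latent):
    """Condition on an image by replacing the first noisy latent frame
    with the clean latent of the conditioning image.

    noisy_latents: (B, T, C, H, W) noisy video latents at the current step
    image_latent:  (B, C, H, W)    clean latent of the conditioning frame
    """
    conditioned = noisy_latents.clone()
    conditioned[:, 0] = image_latent
    return conditioned

@torch.no_grad()
def jit_cfg_step(dit, noisy_latents, t, text_emb, image_latent,
                 null_text_emb, null_image_latent, scale=7.5):
    """One denoising step with joint image-text classifier-free guidance.

    Both conditions are dropped together for the unconditional branch, and a
    single guidance scale pushes the prediction toward the jointly
    conditioned output (a common single-scale CFG formulation).
    """
    # Conditional branch: text embedding plus image frame replacement.
    cond_in = frame_replacement(noisy_latents, image_latent)
    eps_cond = dit(cond_in, t, text_emb)

    # Unconditional branch: null text and a null (e.g. zeroed) image latent.
    uncond_in = frame_replacement(noisy_latents, null_image_latent)
    eps_uncond = dit(uncond_in, t, null_text_emb)

    # Classifier-free guidance: extrapolate away from the unconditional output.
    return eps_uncond + scale * (eps_cond - eps_uncond)
```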

The researchers investigated the setup, training, and evaluation process for the text-to-video (T2V) and text-to-image (T2I) models. The models used the AdaFactor optimizer, with a specific learning rate and gradient clipping, and were trained for 400k steps. Data preparation relied on a video data engine that analyzed video frames, performed scene segmentation, and extracted features such as motion and clarity scores. Training used curated datasets, including over 90 million high-quality video-caption pairs. Key evaluation metrics covering temporal quality, semantic alignment, and video-image alignment were assessed with VBench, VBench-I2V, and MSRVTT. The study also ran ablations over architectural designs and training strategies, including Flow Matching, CFG-Renormalization, and the AdaFactor optimizer. Experiments on model initialization showed that joint initialization from lower- and higher-resolution models improved performance, and using more frames during training further improved metrics, particularly motion smoothness and dynamic range.
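
The training recipe mentioned here (a Flow Matching objective, AdaFactor, and gradient clipping) can be sketched as below. This is a generic, assumption-laden illustration, not the authors' code: the model interface, the rectified-flow interpolation convention, and the learning rate of 1e-4 are placeholders.

```python
import torch
import torch.nn.functional as F
from torch.nn.utils import clip_grad_norm_
from transformers import Adafactor  # Hugging Face AdaFactor implementation

def train_step(model, optimizer, video_latents, text_emb, max_grad_norm=1.0):
    """One training step with a rectified-flow-style flow matching objective,
    the AdaFactor optimizer, and gradient clipping (illustrative only).

    video_latents: (B, T, C, H, W) clean video latents
    text_emb:      (B, L, D)       text-condition embeddings
    """
    b = video_latents.size(0)
    noise = torch.randn_like(video_latents)
    # Sample one timestep per example and interpolate between data and noise.
    t = torch.rand(b, device=video_latents.device).view(b, 1, 1, 1, 1)
    x_t = (1.0 - t) * video_latents + t * noise
    target_velocity = noise - video_latents          # rectified-flow velocity target

    pred = model(x_t, t.flatten(), text_emb)         # model predicts the velocity
    loss = F.mse_loss(pred, target_velocity)

    optimizer.zero_grad()
    loss.backward()
    clip_grad_norm_(model.parameters(), max_grad_norm)  # gradient clipping
    optimizer.step()
    return loss.item()

# AdaFactor with a fixed (assumed) learning rate; relative-step updates disabled.
# model = ...  # a DiT-style video model
# optimizer = Adafactor(model.parameters(), lr=1e-4,
#                       scale_parameter=False, relative_step=False)
```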

The T2V and STIV models improved significantly after scaling from 600M to 8.7B parameters. In T2V, the VBench-Semantic score increased from 72.5 to 74.8 with larger model sizes and improved to 77.0 when the resolution was raised from 256 to 512. Fine-tuning with high-quality data boosted the VBench-Quality score from 82.2 to 83.9, with the best model achieving a VBench-Semantic score of 79.5. The STIV model showed similar gains, with the STIV-M-512 model achieving a VBench-I2V score of 90.1. In video prediction, the STIV-V2V model outperformed T2V with an FVD score of 183.7 versus 536.2. The STIV-TUP model delivered strong results in frame interpolation, with FID scores of 2.0 and 5.9 on the MSRVTT and MovieGen datasets. In multi-view generation, the proposed STIV model maintained 3D coherency and achieved performance comparable to Zero123++, with a PSNR of 21.64 and an LPIPS of 0.156. In long video generation, it produced 380 frames, demonstrating its capability and leaving room for further progress.
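
For reference, these figures follow standard definitions: PSNR (peak signal-to-noise ratio) is computed from the mean squared error between generated and reference images as PSNR = 10 · log10(MAX² / MSE), where MAX is the maximum possible pixel value, so higher is better; lower LPIPS, FID, and FVD indicate closer perceptual or distributional agreement with the reference.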

In the end, the proposed framework provided a scalable and flexible solution for video generation by integrating text and image conditioning within a unified model. It demonstrated strong performance on public benchmarks and adaptability across various applications, including controllable video generation, video prediction, frame interpolation, long video generation, and multi-view generation. This highlights its potential to support future advances in video generation and to contribute to the broader research community.


Check out the Paper. All credit for this research goes to the researchers of this project.
