MarkTechPost@AI, November 22, 2024
Microsoft Research Introduces Reducio-DiT: Enhancing Video Generation Efficiency with Advanced Compression

Microsoft researchers have introduced Reducio-DiT, a new method aimed at the high computational cost of large video generation models. The method uses an image-conditioned variational autoencoder (VAE) to heavily compress the video latent space, sharply reducing the compute needed to generate high-resolution video. By combining the VAE with a diffusion model, Reducio-DiT generates a 1024×1024 video in 15.5 seconds on a single A100 GPU — a 16.6× speedup — while preserving video quality. This efficient solution could drive broader adoption of video generation in content creation, advertising, and interactive entertainment.

🤔 **The core idea of Reducio-DiT is to exploit the redundancy inherent in video: by compressing the latent space with a VAE, it achieves a 64-fold reduction in latent representation size without degrading video quality.** This compression substantially lowers the computational resources required for high-resolution video generation, making the process far more efficient.

🚀 **Reducio-DiT uses a two-stage generation approach: it first produces a content image with text-to-image techniques, then uses that image as a prior to create video frames through a diffusion process.** This separates motion information from the static background and compresses the motion efficiently, reducing the overall computation.

💡 **Reducio-DiT achieves an FVD score of 318.5 on the UCF-101 dataset and runs 16.6× faster than existing methods such as Lavie.** The model also adopts a multi-stage training strategy, scaling from low to high resolution, which preserves the visual integrity and temporal consistency of the generated frames.

💻 **Reducio-DiT significantly reduces the hardware required for video generation, making it usable in environments without large GPU clusters.** This brings high-resolution video generation within reach of more users and application scenarios.

Recent advancements in video generation models have enabled the production of high-quality, realistic video clips. However, these models face challenges in scaling for large-scale, real-world applications due to the computational demands required for training and inference. Current commercial models like Sora, Runway Gen-3, and Movie Gen demand extensive resources, including thousands of GPUs and millions of GPU hours for training, with each second of video inference taking several minutes. These high requirements make these solutions costly and impractical for many potential applications, limiting the use of high-fidelity video generation to only those with substantial computational resources.

Reducio-DiT: A New Solution

Microsoft researchers have introduced Reducio-DiT, a new approach designed to address this problem. This solution centers around an image-conditioned variational autoencoder (VAE) that significantly compresses the latent space for video representation. The core idea behind Reducio-DiT is that videos contain more redundant information compared to static images, and this redundancy can be leveraged to achieve a 64-fold reduction in latent representation size without compromising video quality. The research team has combined this VAE with diffusion models to improve the efficiency of generating 1024×1024 video clips, reducing the inference time to 15.5 seconds on a single A100 GPU.
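A back-of-the-envelope calculation shows how the reported compression factors fit together. The 16-frame clip length and the exact spatial/temporal factorization below are illustrative assumptions; only the 64-fold baseline of an image VAE and the further 64-fold reduction from the paper are taken from the article.

```python
# Sanity-check the compression factors reported for Reducio-DiT.
# Clip length and factorization are assumptions for illustration.
frames, height, width = 16, 1024, 1024

raw_volume = frames * height * width              # spatio-temporal elements

# A standard image VAE down-samples each frame 8x in both spatial
# dimensions: an 8 * 8 = 64-fold reduction of the volume.
image_vae_volume = raw_volume // (8 * 8)

# Reducio-VAE shrinks the latent a further 64x (the paper's figure),
# e.g. via an extra 4x4 spatial and 4x temporal down-sampling: 4*4*4 = 64.
reducio_volume = image_vae_volume // 64

print(raw_volume // reducio_volume)               # -> 4096
```

The two 64-fold stages multiply to the 4096-fold down-sampled representation cited below, which is where the inference savings come from.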

Technical Approach

From a technical perspective, Reducio-DiT stands out due to its two-stage generation approach. First, it generates a content image using text-to-image techniques, and then it uses this image as a prior to create video frames through a diffusion process. The motion information, which constitutes a large part of a video’s content, is separated from the static background and compressed efficiently in the latent space, resulting in a much smaller computational footprint. Specifically, Reducio-VAE—the autoencoder component of Reducio-DiT—leverages 3D convolutions to achieve a significant compression factor, enabling a 4096-fold down-sampled representation of the input videos. The diffusion component, Reducio-DiT, integrates this highly compressed latent representation with features extracted from both the content image and the corresponding text prompt, thereby producing smooth, high-quality video sequences with minimal overhead.
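The two-stage flow described above can be sketched as follows. The function names, arguments, and return values are hypothetical stubs — they are not the released Reducio-DiT API — and only the data flow (text → content image → compressed motion latent → diffused video) mirrors the paper.

```python
# Illustrative sketch of the two-stage Reducio-DiT pipeline.
# All names and shapes are hypothetical; only the flow follows the paper.

def text_to_image(prompt):
    """Stage 1: generate a static content image from the prompt (stub)."""
    return {"kind": "content_image", "prompt": prompt}

def encode_motion(content_image, num_frames):
    """Reducio-VAE: compress motion into a tiny latent, conditioned on
    the static content image (stub: one latent token per frame)."""
    return {"kind": "motion_latent", "tokens": num_frames}

def diffuse_video(motion_latent, content_image, prompt):
    """Stage 2: run diffusion over the compressed latent, conditioned on
    both the content image and the text prompt (stub)."""
    return {"kind": "video", "frames": motion_latent["tokens"],
            "conditioned_on": [content_image["kind"], prompt]}

prompt = "a sailboat at sunset"
image = text_to_image(prompt)
latent = encode_motion(image, num_frames=16)
video = diffuse_video(latent, image, prompt)
print(video["frames"])   # -> 16
```

The key design choice the sketch highlights: diffusion operates only on the small motion latent, while the expensive static content is fixed once by the text-to-image stage.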

This approach is important for several reasons. Reducio-DiT offers a cost-effective solution to an industry burdened by computational challenges, making high-resolution video generation more accessible. The model demonstrated a speedup of 16.6 times over existing methods like Lavie, while achieving a Fréchet Video Distance (FVD) score of 318.5 on UCF-101, outperforming other models in this category. By utilizing a multi-stage training strategy that scales up from low to high-resolution video generation, Reducio-DiT maintains the visual integrity and temporal consistency across generated frames—a challenge that many previous approaches to video generation struggled to achieve. Additionally, the compact latent space not only accelerates the video generation process but also reduces the hardware requirements, making it feasible for use in environments without extensive GPU resources.
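For readers unfamiliar with the FVD metric cited above: it measures the Fréchet distance between the feature distributions of real and generated videos, so lower is better. Real FVD uses high-dimensional features from a pretrained video network with full covariance matrices; the 1-D toy version below, where the distance reduces to (μ₁−μ₂)² + (σ₁−σ₂)², is only meant to convey the idea, and the sample values are made up.

```python
import math

# Frechet distance between two 1-D Gaussians -- the core of FVD,
# reduced to one dimension for illustration.
def frechet_distance_1d(mu1, sigma1, mu2, sigma2):
    return (mu1 - mu2) ** 2 + (sigma1 - sigma2) ** 2

def moments(xs):
    mu = sum(xs) / len(xs)
    var = sum((x - mu) ** 2 for x in xs) / len(xs)
    return mu, math.sqrt(var)

real = [0.9, 1.1, 1.0, 0.8, 1.2]   # toy features of "real" videos
fake = [0.5, 0.7, 0.6, 0.4, 0.8]   # toy features of "generated" videos

d = frechet_distance_1d(*moments(real), *moments(fake))
print(round(d, 3))   # lower means the two distributions match better
```

Here the two toy samples have the same spread but shifted means, so the distance is driven entirely by the mean gap.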

Conclusion

Microsoft’s Reducio-DiT represents an advance in video generation efficiency, balancing high quality with reduced computational cost. The ability to generate a 1024×1024 video clip in 15.5 seconds, combined with a significant reduction in training and inference costs, marks a notable development in the field of generative AI for video. For further technical exploration and access to the source code, visit Microsoft’s GitHub repository for Reducio-VAE. This development paves the way for more widespread adoption of video generation technology in applications such as content creation, advertising, and interactive entertainment, where generating engaging visual media quickly and cost-effectively is essential.


Check out the Paper and GitHub Page. All credit for this research goes to the researchers of this project.


