MarkTechPost@AI — November 28, 2024
Polynomial Mixer (PoM): Overcoming Computational Bottlenecks in Image and Video Generation

Image and video generation has advanced remarkably, but the quadratic computational complexity of Multi-Head Attention (MHA) in transformer architectures limits model scale and resolution. Researchers have proposed the Polynomial Mixer (PoM), an innovative drop-in replacement for MHA that achieves linear computational complexity for image and video generation while retaining MHA's universal sequence-to-sequence approximation capability. PoM delivers strong image-generation results, with a lower FID score than a comparable DiT architecture, and successfully scales image generation to 1024×1024. The work offers a new way around the computational bottleneck of generative models and lays groundwork for future high-resolution video generation and multimodal large language models.

🤔 **Multi-Head Attention (MHA) in transformer architectures has quadratic computational complexity, limiting the scale and resolution of image and video generation models.** Because cost grows quadratically with the number of tokens, raising resolution is extremely expensive: doubling an image's resolution quadruples the token count and increases the attention cost roughly 16-fold.

🎉 **The Polynomial Mixer (PoM) is an innovative drop-in replacement for MHA with linear computational complexity, overcoming this bottleneck.** PoM achieves linear complexity by encoding the entire sequence into an explicit state, while retaining MHA's universal sequence-to-sequence approximation capability.

📊 **PoM achieves strong image-generation results, with an FID score lower than a comparable DiT architecture, and successfully scales image generation to 1024×1024.** Experiments show that PoM can serve as a drop-in replacement for MHA without major architectural modifications.

💡 **PoM shows great potential for image and video generation, pointing to new directions for high-resolution video generation and multimodal large language models.** The researchers believe PoM can be applied to long-duration high-definition video generation and to multimodal large language models.

🚀 **PoM's design is tuned separately for image and video generation; for example, the image model uses a class-conditional Polymorpher similar to the AdaLN variant of DiT.** PoM also introduces a cross-modal PoM operation to aggregate information between text and visual tokens.

Image and video generation has undergone a remarkable transformation, evolving from a seemingly impossible challenge to a task nearly solved by commercial tools like Stable Diffusion and Sora. This progress is largely driven by Multihead Attention (MHA) in transformer architectures, which excels at scaling. However, this advancement comes with significant computational challenges. The quadratic computational complexity of transformers poses a critical limitation: cost grows quadratically with the number of tokens, so increasing image or video resolution rapidly inflates processing requirements. For example, doubling an image's resolution raises computational costs by 16 times, with videos requiring even more. This limitation remains a key obstacle to building high-quality, large-scale generative models for visual content.
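The 16× figure follows from simple arithmetic, assuming the standard patch-tokenization setup (here a hypothetical 16×16 patch size) and quadratic self-attention over the resulting token sequence:

```python
# Token count and attention cost as a function of image resolution,
# assuming one token per fixed-size patch and standard self-attention
# whose cost scales with the square of the sequence length.

def num_tokens(height, width, patch=16):
    """Number of patch tokens for an image of the given size."""
    return (height // patch) * (width // patch)

def attention_cost(n_tokens):
    """Self-attention cost is proportional to n_tokens ** 2."""
    return n_tokens ** 2

base = attention_cost(num_tokens(512, 512))      # 1024 tokens
doubled = attention_cost(num_tokens(1024, 1024))  # 4096 tokens
print(doubled / base)  # doubling each side -> 4x tokens -> 16x cost
```

Doubling each spatial dimension quadruples the number of tokens, and squaring that factor in the attention matrix yields the 16-fold cost increase.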

Existing approaches to address the computational challenges in generative models include diffusion models and fast alternatives to attention. Diffusion models initially used U-Net architectures with attention layers, learning to transform noisy images into natural representations through forward and reverse processes. Alternative strategies focus on reducing attention complexity, including techniques like Reformer, which approximates the attention matrix, and Linformer, which projects keys and values into lower-dimensional spaces. State-Space Models (SSMs) emerged as a promising alternative, offering linear computational complexity. However, these methods have significant limitations, especially in handling spatial variations and maintaining model flexibility across different sequence lengths.
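To make the Linformer idea concrete, here is a minimal sketch (not the library implementation; shapes and projection initialization are our own assumptions): keys and values of length `n` are compressed to a fixed length `r`, so the attention matrix is `n × r` rather than `n × n`, i.e. linear in sequence length.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def linformer_attention(q, k, v, proj_k, proj_v):
    """Linformer-style attention sketch: keys/values are projected from
    sequence length n down to a fixed length r, so the score matrix is
    (n, r) instead of (n, n)."""
    d = q.shape[-1]
    k_low = proj_k @ k                   # (r, d): compress n keys to r
    v_low = proj_v @ v                   # (r, d): compress n values to r
    scores = q @ k_low.T / np.sqrt(d)    # (n, r) instead of (n, n)
    return softmax(scores) @ v_low       # (n, d)

rng = np.random.default_rng(0)
n, d, r = 256, 32, 8
q, k, v = [rng.standard_normal((n, d)) for _ in range(3)]
proj_k = rng.standard_normal((r, n)) / np.sqrt(n)
proj_v = rng.standard_normal((r, n)) / np.sqrt(n)
out = linformer_attention(q, k, v, proj_k, proj_v)
print(out.shape)  # (256, 32)
```

The output keeps the full sequence length, but both compute and memory now grow linearly in `n` for a fixed `r`.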

Researchers from LIGM, École Nationale des Ponts et Chaussées, IP Paris, Univ Gustave Eiffel, CNRS, France, and LIX, École Polytechnique, IP Paris, CNRS, France have proposed the Polynomial Mixer (PoM), an approach to address the computational challenges in image and video generation. It emerges as an innovative drop-in replacement for MHA, designed to overcome the quadratic complexity limitations of traditional transformer architectures. PoM achieves computational complexity that is linear in the number of tokens by encoding the entire sequence into an explicit state. PoM maintains the universal sequence-to-sequence approximation capabilities of traditional MHA, positioning it as an alternative for generative modeling.
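The paper's exact equations are not reproduced in this article, but the core idea — summarizing the whole sequence into an explicit fixed-size state built from polynomial (here second-order) token features, which each token then reads — can be sketched as follows. Everything below (the outer-product state, the read projection `w_read`) is an illustrative assumption, not the authors' formulation:

```python
import numpy as np

def polynomial_mixer(x, w_read):
    """Toy linear-complexity mixer in the spirit of PoM (illustrative
    only). The sequence is compressed into an explicit (d, d) state via
    averaged second-order (outer-product) features; each token then
    queries that shared state. Both passes are O(n) in sequence length,
    unlike the O(n^2) pairwise comparisons of attention."""
    n, d = x.shape
    # Explicit state: a second-order polynomial summary of the sequence.
    state = (x.T @ x) / n            # (d, d), built in one O(n) pass
    # Each token reads from the shared state through a learned map.
    return (x @ state) @ w_read      # (n, d), another O(n) pass

rng = np.random.default_rng(0)
n, d = 128, 16
x = rng.standard_normal((n, d))
w_read = rng.standard_normal((d, d)) / np.sqrt(d)
y = polynomial_mixer(x, w_read)
print(y.shape)  # (128, 16)
```

Because the state has a fixed size independent of `n`, doubling the resolution only doubles (not quadruples, let alone 16-folds) the mixing cost.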

The proposed PoM method features distinct designs for image and video generation. For image generation, the model uses a class-conditional Polymorpher similar to the AdaLN variant of DiT. Images are first encoded through a VAE, with visual tokens enhanced by 2D cosine positional encoding. Class and time-step embeddings are produced through embedding matrices and summed together. Each block includes modulations, a PoM, and a feed-forward network, with the PoM typically using a second-order polynomial and a two-fold expansion factor. The model incorporates cross-modal PoM operations to aggregate information between text and visual tokens, followed by self-aggregation and feed-forward processing.
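The cross-modal aggregation step can be pictured with the same toy state as above (again our own simplified sketch, not the paper's operator): the text sequence is summarized into an explicit state, which every visual token then reads, so cross-modal mixing also stays linear in the token counts.

```python
import numpy as np

def cross_modal_mix(text_tokens, visual_tokens, w_read):
    """Toy cross-modal mixing in the spirit of the cross-modal PoM
    operation (illustrative assumption): text tokens are summarized
    into an explicit (d, d) state; each visual token reads from it."""
    n_text, d = text_tokens.shape
    state = (text_tokens.T @ text_tokens) / n_text  # (d, d) text summary
    return (visual_tokens @ state) @ w_read          # (n_visual, d)

rng = np.random.default_rng(1)
d = 16
text = rng.standard_normal((32, d))      # e.g. prompt tokens
visual = rng.standard_normal((64, d))    # e.g. VAE patch tokens
w_read = rng.standard_normal((d, d)) / np.sqrt(d)
out = cross_modal_mix(text, visual, w_read)
print(out.shape)  # (64, 16)
```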

Quantitative evaluations reveal promising outcomes for PoM. The model achieves an FID score of 2.46 under the standard ADM evaluation framework, lower than comparable DiT architectures, with the notable caveat that the model was trained for only half as many steps. This performance shows PoM's potential as an alternative to MHA. Qualitative results further show that fine-tuning enables image generation at resolutions up to 1024 × 1024 on ImageNet, although some image classes slightly collapse due to limited training data at higher resolutions. Lastly, the results underscore PoM's capability to serve as a drop-in replacement for MHA without significant architectural modifications.

In conclusion, the researchers introduced the Polynomial Mixer (PoM), a neural network building block designed to replace traditional attention mechanisms. By achieving linear computational complexity and proving universal sequence-to-sequence approximation capabilities, PoM demonstrates significant potential across generative domains. It yields competitive image and video models with enhanced resolution and generation speed compared to traditional MHA approaches. While the current implementation shows promise in image and video generation, the researchers identify promising future directions, particularly long-duration high-definition video generation and multimodal large language models.


