MarkTechPost@AI, October 17, 2024
Meissonic: A Non-Autoregressive Mask Image Modeling Text-to-Image Synthesis Model that can Generate High-Resolution Images

Meissonic is a non-autoregressive masked image modeling (MIM) text-to-image synthesis model capable of generating high-resolution images. It combines several innovations to address shortcomings of existing models, performs well in both image generation and editing, and achieves strong results across multiple evaluations.

🧐 Meissonic leverages a suite of architectural innovations, advanced positional encoding strategies, and optimized sampling conditions to improve performance and efficiency, generating images at 1024×1024 resolution with high quality.

🎯 The model integrates a CLIP text encoder, a vector-quantized (VQ) image encoder and decoder, and a multi-modal Transformer backbone: the VQ-VAE model converts image pixels into discrete semantic tokens, the CLIP text encoder is leveraged for optimized performance, and the multi-modal Transformer backbone uses sampling parameters and rotary position embeddings to encode spatial information.

💪 Meissonic runs efficiently on 8GB of VRAM. It has 1 billion parameters and is optimized with QK-Norm layers and gradient clipping for improved training stability; it performs well on image editing tasks, and across multiple evaluations its performance is comparable to DALL-E 2 and SDXL.

Large Language Models (LLMs) have demonstrated remarkable progress in natural language processing tasks, inspiring researchers to explore similar approaches for text-to-image synthesis. At the same time, diffusion models have become the dominant approach in visual generation. However, the operational differences between the two approaches present a significant challenge in developing a unified methodology for language and vision tasks. Recent developments like LlamaGen have ventured into autoregressive image generation using discrete image tokens; however, this approach is inefficient because an image decomposes into far more tokens than a comparable piece of text. Non-autoregressive methods like MaskGIT and MUSE have emerged, cutting down the number of decoding steps, but they fail to produce high-quality, high-resolution images.
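The non-autoregressive decoding idea can be sketched as follows. This is a minimal, hedged illustration of MaskGIT-style parallel decoding (not Meissonic's exact sampler, and with a random stand-in for the transformer): starting from a fully masked token grid, each step predicts every masked position at once, commits the most confident predictions, and re-masks the rest according to a cosine schedule, so the whole grid is decoded in a handful of steps rather than one autoregressive pass per token.

```python
import numpy as np

MASK = -1  # sentinel id for a masked token


def dummy_predictor(tokens, vocab_size, rng):
    """Stand-in for the transformer: a probability distribution per position."""
    return rng.dirichlet(np.ones(vocab_size), size=tokens.shape[0])


def mim_decode(num_tokens=64, vocab_size=16, steps=8, seed=0):
    """Confidence-based parallel decoding over a 1D token grid."""
    rng = np.random.default_rng(seed)
    tokens = np.full(num_tokens, MASK)
    for step in range(steps):
        probs = dummy_predictor(tokens, vocab_size, rng)
        pred = probs.argmax(axis=-1)
        conf = probs.max(axis=-1)
        fixed = tokens != MASK
        pred[fixed] = tokens[fixed]   # committed tokens never change
        conf[fixed] = np.inf          # ...and are never re-masked
        # cosine schedule: fraction of tokens still masked after this step
        mask_ratio = np.cos(np.pi / 2 * (step + 1) / steps)
        num_masked = int(np.floor(mask_ratio * num_tokens))
        tokens = pred
        if num_masked > 0:
            # re-mask the least confident positions for the next iteration
            tokens[np.argsort(conf)[:num_masked]] = MASK
    return tokens
```

With 8 steps the grid is fully decoded, versus 64 sequential steps for an autoregressive sampler over the same grid.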

Existing attempts to solve the challenges in text-to-image synthesis have mainly focused on two approaches: diffusion-based and token-based image generation. Diffusion models, like Stable Diffusion and SDXL, have made significant progress by working within compressed latent spaces and introducing techniques like micro-conditions and multi-aspect training. The integration of transformer architectures, as seen in DiT and U-ViT, has further enhanced the potential of diffusion models. However, these models still face challenges in real-time applications and quantization. Token-based approaches like MaskGIT and MUSE have introduced masked image modeling (MIM) to overcome the computational demands of autoregressive methods.

Researchers from Alibaba Group, Skywork AI, HKUST(GZ), HKUST, Zhejiang University, and UC Berkeley have proposed Meissonic, an innovative method to elevate non-autoregressive MIM text-to-image synthesis to a level comparable with state-of-the-art diffusion models like SDXL. Meissonic utilizes a comprehensive suite of architectural innovations, advanced positional encoding strategies, and optimized sampling conditions to enhance MIM's performance and efficiency. The model uses high-quality training data, micro-conditions informed by human preference scores, and feature compression layers to improve image fidelity and resolution. Meissonic can produce 1024×1024 resolution images and often outperforms existing models in generating high-quality, high-resolution images.

Meissonic's architecture integrates a CLIP text encoder, a vector-quantized (VQ) image encoder and decoder, and a multi-modal Transformer backbone for efficient high-performance text-to-image synthesis: the VQ-VAE model converts image pixels into discrete semantic tokens, the CLIP text encoder is leveraged for optimized performance, and the multi-modal Transformer backbone uses sampling parameters and rotary position embeddings to encode spatial information.
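The VQ step at the heart of this pipeline can be sketched in a few lines. This is a hedged illustration of how a VQ image encoder turns continuous feature vectors into discrete token ids by nearest-neighbour lookup in a codebook; the codebook here is random, whereas in the real model it is learned jointly with the encoder and decoder.

```python
import numpy as np


def vq_quantize(features, codebook):
    """features: (N, D) continuous vectors; codebook: (K, D) code vectors.
    Returns (token_ids, quantized): ids index the codebook, and the
    quantized output is the nearest code vector for each feature."""
    # squared L2 distance from every feature to every code: (N, K)
    d2 = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    ids = d2.argmin(axis=1)
    return ids, codebook[ids]
```

The discrete `ids` are what the Transformer backbone models; the decoder maps the corresponding code vectors back to pixels.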

The architecture also includes QK-Norm layers and implements gradient clipping to enhance training stability and reduce NaN Loss issues during distributed training.
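These two stabilisers can be sketched as follows, assuming standard forms (L2-based QK-Norm with a fixed logit scale, and global-norm gradient clipping); the paper's exact variants may differ.

```python
import numpy as np


def qk_norm_attention(q, k, v, scale=10.0):
    """q, k, v: (T, D). L2-normalising queries and keys bounds every
    attention logit to [-scale, scale], avoiding the softmax overflow
    that produces NaN losses during training."""
    qn = q / (np.linalg.norm(q, axis=-1, keepdims=True) + 1e-6)
    kn = k / (np.linalg.norm(k, axis=-1, keepdims=True) + 1e-6)
    logits = scale * qn @ kn.T
    logits -= logits.max(axis=-1, keepdims=True)  # numerically safe softmax
    w = np.exp(logits)
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v


def clip_global_norm(grads, max_norm=1.0):
    """Rescale a list of gradient arrays so their joint L2 norm <= max_norm."""
    total = np.sqrt(sum((g ** 2).sum() for g in grads))
    factor = min(1.0, max_norm / (total + 1e-6))
    return [g * factor for g in grads]
```

Both tricks are common remedies for divergence in large-scale distributed training: the first caps the magnitude of attention logits, the second caps the size of each optimizer step.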

Meissonic, optimized to 1 billion parameters, runs efficiently on 8GB of VRAM, making inference and fine-tuning convenient. Qualitative comparisons show Meissonic's image quality and text-image alignment capabilities. Human evaluations using K-Sort Arena and GPT-4o assessments indicate that Meissonic achieves performance comparable to DALL-E 2 and SDXL in human preference and text alignment, with improved efficiency. On image editing tasks, Meissonic is benchmarked against state-of-the-art models using the EMU-Edit dataset, which covers seven different operations. The model demonstrated versatility in both mask-guided and mask-free editing, achieving strong performance without specific training on image editing data or instruction datasets.

In conclusion, researchers introduced Meissonic, an approach to elevate non-autoregressive MIM text-to-image synthesis. The model incorporates innovative elements such as a blended transformer architecture, advanced positional encoding, and adaptive masking rates to achieve superior performance in high-resolution image generation. Despite its compact 1B parameter size, Meissonic outperforms larger diffusion models while remaining accessible on consumer-grade GPUs. Moreover, Meissonic aligns with the emerging trend of offline text-to-image applications on mobile devices, exemplified by recent innovations from Google and Apple. It enhances the user experience and privacy in mobile imaging technology, empowering users with creative tools while ensuring data security.


Check out the Paper and Model. All credit for this research goes to the researchers of this project.



