MarkTechPost@AI, August 11, 2024
This AI Paper from Shanghai AI Laboratory Introduces Lumina-mGPT: A High-Resolution Text-to-Image Generation Model with Multimodal Generative Pretraining

Researchers from the Shanghai AI Laboratory and the Chinese University of Hong Kong have developed a new text-to-image generation model called Lumina-mGPT. The model is built on a decoder-only transformer architecture and uses multimodal Generative PreTraining (mGPT). Lumina-mGPT can generate high-quality, high-resolution images and outperforms existing autoregressive models in image quality and visual coherence.

😊 **Lumina-mGPT's core technique**: Lumina-mGPT adopts a strategy called Flexible Progressive Supervised Finetuning (FP-SFT), which progressively trains the model from low-resolution to high-resolution image generation. This approach begins by learning general visual concepts at low resolutions and incrementally introduces more complex high-resolution details. In addition, the model uses an innovative, unambiguous image representation system that eliminates the ambiguity associated with variable image resolutions and aspect ratios by introducing explicit height and width indicators and end-of-line tokens.

😃 **Lumina-mGPT's performance advantages**: Lumina-mGPT shows a marked improvement in generating photorealistic images. It can produce high-resolution 1024×1024 images with intricate visual details that closely match the text prompt. The researchers report that Lumina-mGPT needs only 10 million image-text pairs for training, far fewer than competing models such as LlamaGen, which uses 50 million pairs. Despite the smaller training set, Lumina-mGPT still outperforms its autoregressive counterparts in image quality and visual coherence. The model also supports a range of tasks, including visual question answering, dense labeling, and controllable image generation, demonstrating its versatility as a multimodal generalist.

🤩 **Lumina-mGPT's outlook**: Lumina-mGPT's flexible, scalable architecture further strengthens its ability to generate diverse, high-quality images. Advanced decoding techniques such as Classifier-Free Guidance (CFG) play a crucial role in refining the quality of the generated images. For example, by adjusting parameters such as temperature and top-k, Lumina-mGPT can control the level of detail and diversity in its outputs, which helps reduce visual artifacts and improves overall aesthetic appeal.

Multimodal generative models represent an exciting frontier in artificial intelligence, focusing on integrating visual and textual data to create systems capable of various tasks. These tasks range from generating highly detailed images from textual descriptions to understanding and reasoning across different data types. The advancements in this field are opening new possibilities for more interactive and intelligent AI systems that can seamlessly combine vision and language.

One of the critical challenges in this domain is the development of autoregressive (AR) models that can generate photorealistic images from text descriptions. While diffusion models have made significant strides in this area, AR models have historically lagged, particularly regarding image quality, resolution flexibility, and the ability to handle various visual tasks. This gap has driven the need for innovative approaches to enhance AR models’ capabilities.

The current landscape of text-to-image generation is dominated by diffusion models, which excel at creating high-quality, visually appealing images. AR models such as LlamaGen and Parti, by contrast, have struggled to match this level of performance. They often rely on complex encoder-decoder architectures and are typically limited to generating images at fixed resolutions, which restricts their flexibility and their ability to produce diverse, high-resolution outputs.

Researchers from the Shanghai AI Laboratory and the Chinese University of Hong Kong introduced Lumina-mGPT, an advanced AR model designed to overcome these limitations. Lumina-mGPT is based on a decoder-only transformer architecture with multimodal Generative PreTraining (mGPT). This model uniquely combines vision-language tasks within a unified framework, aiming to achieve the same level of photorealistic image generation as diffusion models while maintaining the simplicity and scalability of AR methods.

The Lumina-mGPT model employs a detailed approach to enhance its image generation capabilities. The Flexible Progressive Supervised Finetuning (FP-SFT) strategy is at its core, which progressively trains the model from low-resolution to high-resolution image generation. This process begins with learning general visual concepts at lower resolutions and incrementally introduces more complex, high-resolution details. The model also features an innovative, unambiguous image representation system, eliminating the ambiguity often associated with variable image resolutions and aspect ratios by introducing specific height and width indicators and end-of-line tokens.
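The unambiguous representation can be illustrated with a minimal sketch. The token names (`<h-...>`, `<w-...>`, `<eol>`, `<eoi>`) and the helper below are hypothetical, chosen only to show the idea of prefixing resolution indicators and marking row boundaries; they are not the paper's actual vocabulary or code.

```python
# Hypothetical sketch of an unambiguous image token layout.
# Token names ("<h-...>", "<w-...>", "<eol>", "<eoi>") are illustrative only.

def encode_image_tokens(token_grid):
    """Flatten a 2-D grid of image tokens into a 1-D sequence that
    explicitly records the resolution and every row boundary."""
    height = len(token_grid)
    width = len(token_grid[0])
    seq = [f"<h-{height}>", f"<w-{width}>"]  # resolution indicators up front
    for row in token_grid:
        seq.extend(row)
        seq.append("<eol>")  # end-of-line token removes aspect-ratio ambiguity
    seq.append("<eoi>")      # end-of-image marker
    return seq

# A toy 2x3 grid of discrete image-token ids:
grid = [[101, 102, 103],
        [104, 105, 106]]
print(encode_image_tokens(grid))
# → ['<h-2>', '<w-3>', 101, 102, 103, '<eol>', 104, 105, 106, '<eol>', '<eoi>']
```

Because the height/width indicators come first and each row ends with an explicit token, a decoder reading the sequence never has to guess the aspect ratio of the image it is generating.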

In terms of performance, Lumina-mGPT has demonstrated a significant improvement in generating photorealistic images compared to previous AR models. It can produce high-resolution images of 1024×1024 pixels with intricate visual details that closely align with the text prompts provided. The researchers reported that Lumina-mGPT requires only 10 million image-text pairs for training, a significantly smaller dataset than that used by competing models like LlamaGen, which requires 50 million pairs. Despite the smaller dataset, Lumina-mGPT outperforms its AR counterparts in terms of image quality and visual coherence. Furthermore, the model supports a wide range of tasks, including visual question answering, dense labeling, and controllable image generation, showcasing its versatility as a multimodal generalist.

Lumina-mGPT's flexible and scalable architecture further enhances its ability to generate diverse, high-quality images. The model's use of advanced decoding techniques, such as Classifier-Free Guidance (CFG), plays a crucial role in refining the quality of the generated images. For instance, by adjusting parameters like temperature and top-k values, Lumina-mGPT can control the level of detail and diversity in the images it produces, which helps reduce visual artifacts and enhances the overall aesthetic appeal.
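How CFG, temperature, and top-k interact at a single decoding step can be sketched as follows. This is a generic illustration of the technique, not Lumina-mGPT's actual implementation; the function name and default values are assumptions for the example.

```python
import numpy as np

# Generic one-step sketch of classifier-free guidance with temperature
# and top-k sampling; not Lumina-mGPT's actual decoding code.

def cfg_sample(cond_logits, uncond_logits, guidance_scale=3.0,
               temperature=1.0, top_k=50, rng=None):
    if rng is None:
        rng = np.random.default_rng()
    # CFG: push the conditional logits away from the unconditional ones.
    logits = uncond_logits + guidance_scale * (cond_logits - uncond_logits)
    logits = logits / temperature           # higher temperature -> more diversity
    kth = np.sort(logits)[-top_k]           # top-k: keep the k most likely tokens
    logits = np.where(logits < kth, -np.inf, logits)
    probs = np.exp(logits - logits.max())   # softmax over the surviving tokens
    probs /= probs.sum()
    return rng.choice(len(probs), p=probs)
```

With `top_k=1` the call becomes deterministic greedy decoding over the guided logits; raising `temperature` or `top_k` flattens the distribution and trades consistency for diversity, which matches the knobs the paragraph above describes.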

In conclusion, Lumina-mGPT represents a significant advancement in autoregressive image generation. Developed by researchers at the Shanghai AI Laboratory and the Chinese University of Hong Kong, this model bridges the gap between AR and diffusion models, offering a powerful new tool for generating photorealistic images from text. Its innovative approach to multimodal pretraining and flexible finetuning demonstrates the potential to transform the capabilities of AR models, making them a viable option for a wide range of vision-language tasks. This breakthrough suggests a promising future for AR-based generative models, potentially leading to more sophisticated and versatile AI systems.


Check out the Paper and GitHub. All credit for this research goes to the researchers of this project.

