MarkTechPost@AI · May 28, 12:35
This AI Paper Introduces MMaDA: A Unified Multimodal Diffusion Model for Textual Reasoning, Visual Understanding, and Image Generation

MMaDA is a unified multimodal diffusion model proposed by researchers from Princeton University and other institutions, designed to integrate textual reasoning, visual understanding, and image generation. The model adopts a shared diffusion architecture with no modality-specific components, which simplifies training across different data types. MMaDA aligns reasoning steps across text and image tasks through a mixed long chain-of-thought (Long-CoT) finetuning strategy. In addition, the researchers developed UniGRPO, a reinforcement learning algorithm that uses policy gradients and diversified reward signals to optimize the model's performance. Experimental results show that MMaDA performs strongly on text-to-image generation, multimodal understanding, and textual reasoning, offering a new direction for building general-purpose multimodal AI systems.

💡 The core of MMaDA is its unified diffusion architecture: by discarding modality-specific components, it simplifies training across different data types such as text and images and lets the model handle reasoning and generation tasks in a coherent way.

📚 MMaDA introduces a mixed long chain-of-thought (Long-CoT) finetuning strategy that learns from a diverse dataset of reasoning traces, such as mathematical problem solving and visual question answering, to guide the model toward complex cross-modal reasoning.

🏆 MMaDA performs strongly on benchmarks. For text-to-image generation it reaches a CLIP score of 32.46 and an ImageReward of 1.15, surpassing models such as SDXL and Janus. For multimodal understanding it scores 86.1 on POPE, 1410.7 on MME, and 67.6 on Flickr30k, exceeding systems such as Show-o and SEED-X. For textual reasoning it scores 73.4 on GSM8K and 36.0 on MATH500, outperforming other diffusion-based models such as LLaDA-8B.

Diffusion models, known for their success in generating high-quality images, are now being explored as a foundation for handling diverse data types. They learn to reconstruct original content from progressively noised inputs, an ability that makes them promising for multimodal tasks spanning discrete data, such as text, and continuous data, such as images.
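This corrupt-then-reconstruct loop is easiest to see in the continuous (image) case. The sketch below is illustrative only: the noise schedule `alpha_bar` and the noise-prediction network `model(x_t, t)` are assumptions for a standard DDPM-style setup, not code from the paper.

```python
# Minimal sketch of diffusion training: noise clean data, then learn to undo the noise.
import torch
import torch.nn.functional as F

def diffusion_training_loss(model, x0: torch.Tensor, alpha_bar: torch.Tensor,
                            t: torch.Tensor) -> torch.Tensor:
    """Corrupt clean samples x0 to noise level t, then train `model` to predict the noise."""
    noise = torch.randn_like(x0)
    a = alpha_bar[t].view(-1, *([1] * (x0.dim() - 1)))    # per-sample level, broadcastable
    x_t = a.sqrt() * x0 + (1 - a).sqrt() * noise          # forward (noising) process
    return F.mse_loss(model(x_t, t), noise)               # reconstruct by denoising
```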

The challenge in multimodal models is building systems that can handle understanding and generation across text and images without using separate methods or architectures. Existing models often struggle to balance these tasks effectively. They are designed for specific tasks like image generation or question answering, which results in limited performance in unified tasks. Post-training techniques that could further align models across reasoning and generation tasks are also underdeveloped, leaving a gap in fully integrated multimodal models that can handle diverse challenges using a single design.

Popular approaches like Show-o, Janus, and SEED-X combine autoregressive models for text and diffusion models for images, requiring separate loss functions and architectures. These models use distinct tokenization schemes and separate pipelines for text and image tasks, complicating training and limiting their ability to handle reasoning and generation in a unified way. Furthermore, they focus heavily on pretraining strategies, overlooking post-training methods that could help these models learn to reason across different data types.

Researchers from Princeton University, Peking University, Tsinghua University, and ByteDance have introduced MMaDA, a unified multimodal diffusion model. This system integrates textual reasoning, visual understanding, and image generation into a probabilistic framework. MMaDA uses a shared diffusion architecture without relying on modality-specific components, simplifying training across different data types. The model’s design allows it to process textual and visual data together, enabling a streamlined, cohesive approach for reasoning and generation tasks.
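To make the "no modality-specific components" idea concrete, here is a hedged sketch under the assumption that both modalities are represented as discrete tokens (a text tokenizer for words, a VQ-style image tokenizer for image patches). The marker ids and the `denoiser` signature are illustrative stand-ins, not the released implementation.

```python
# Sketch: one token stream and one denoiser for both modalities (assumptions, not MMaDA's code).
import torch

def unified_sequence(text_ids: torch.Tensor, image_ids: torch.Tensor,
                     boi_id: int, eoi_id: int) -> torch.Tensor:
    """Build a single sequence: [text ... <boi> image ... <eoi>] (marker ids assumed)."""
    boi = torch.tensor([boi_id], dtype=text_ids.dtype)
    eoi = torch.tensor([eoi_id], dtype=text_ids.dtype)
    return torch.cat([text_ids, boi, image_ids, eoi])

def shared_denoise_step(denoiser, tokens: torch.Tensor, mask: torch.Tensor,
                        mask_id: int) -> torch.Tensor:
    """Apply the same masked-prediction step to text and image positions alike."""
    corrupted = torch.where(mask, torch.full_like(tokens, mask_id), tokens)
    logits = denoiser(corrupted.unsqueeze(0))     # (1, seq_len, vocab), signature assumed
    return logits.squeeze(0).argmax(dim=-1)       # predicted tokens at every position
```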

The MMaDA system introduces a mixed long chain-of-thought (Long-CoT) finetuning strategy that aligns reasoning steps across text and image tasks. The researchers curated a diverse dataset of reasoning traces, such as problem-solving in mathematics and visual question answering, to guide the model in learning complex reasoning across modalities. They also developed UniGRPO, a reinforcement learning algorithm tailored for diffusion models, which uses policy gradients and diversified reward signals, including correctness, format adherence, and alignment with visual content. The model’s training pipeline incorporates a uniform masking strategy and structured denoising steps, ensuring stability during learning and allowing the model to reconstruct content across different tasks effectively.
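As a rough illustration of how such diversified reward signals might feed a policy-gradient update, the snippet below combines per-sample correctness, format, and alignment scores into one scalar and normalizes rewards within a sampled group, in the spirit of GRPO. The weights and helper structure are assumptions, not values from the paper.

```python
# Hedged sketch of reward shaping plus group-relative advantages (illustrative only).
import torch

def combined_reward(correct: float, format_ok: float, alignment: float,
                    weights=(1.0, 0.5, 0.5)) -> float:
    """Weighted sum of the diversified reward signals (weights are assumed)."""
    w_c, w_f, w_a = weights
    return w_c * correct + w_f * format_ok + w_a * alignment

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """GRPO-style advantages: normalize each reward against its sampled group."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Usage: sample a group of responses per prompt, score each with combined_reward,
# then weight the policy-gradient loss by group_relative_advantages(torch.tensor(scores)).
```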

In performance benchmarks, MMaDA demonstrated strong results across diverse tasks. It achieved a CLIP score of 32.46 for text-to-image generation and an ImageReward of 1.15, outperforming models like SDXL and Janus. In multimodal understanding, it reached a POPE score of 86.1, an MME score of 1410.7, and a Flickr30k score of 67.6, surpassing systems such as Show-o and SEED-X. For textual reasoning, MMaDA scored 73.4 on GSM8K and 36.0 on MATH500, outperforming other diffusion-based models like LLaDA-8B. These results highlight MMaDA’s capacity to deliver consistent, high-quality outputs across reasoning, understanding, and generation tasks.

Overall, MMaDA provides a practical solution to the challenges of building unified multimodal models by introducing a simplified architecture and innovative training techniques. The research shows that diffusion models can excel as general-purpose systems capable of reasoning and generation across multiple data types. By addressing the limitations of existing models, MMaDA offers a blueprint for developing future AI systems that seamlessly integrate different tasks in a single, robust framework.


Check out the Paper, the model on Hugging Face, and the GitHub page. All credit for this research goes to the researchers of this project.
