MarkTechPost@AI · June 3, 10:45
This AI Paper Introduces LLaDA-V: A Purely Diffusion-Based Multimodal Large Language Model for Visual Instruction Tuning and Multimodal Reasoning

LLaDA-V is a purely diffusion-based multimodal large language model (MLLM) for visual instruction tuning and multimodal reasoning, developed by researchers from Renmin University of China and Ant Group. It integrates information across modalities such as text and images, and its architecture is designed to address the limitations of existing MLLMs on complex multimodal contexts. By combining visual instruction tuning with masked diffusion modeling, LLaDA-V performs strongly on multimodal tasks, notably in data efficiency and reasoning, offering a new direction for multimodal AI.

🖼️ LLaDA-V builds on LLaDA, a large language diffusion model, adding a vision encoder and an MLP connector that map visual features into the language embedding space for effective multimodal alignment. This design departs from the autoregressive paradigm that dominates current multimodal approaches, aiming to overcome its limitations while preserving data efficiency and scalability.

⚙️ LLaDA-V uses a masked diffusion process that progressively refines text responses by iteratively predicting masked tokens; unlike autoregressive models that predict tokens sequentially, it generates outputs by reversing the masked diffusion process. Training proceeds in three stages: the first aligns vision and language embeddings by mapping SigLIP2 visual features into LLaDA's language space; the second fine-tunes on 10 million single-image samples and 2 million multimodal samples from MAmmoTH-VL; the third targets reasoning with 900K QA pairs from VisualWebInstruct and a mixed-dataset strategy. Bidirectional attention improves context comprehension, enabling robust multimodal understanding.

🏆 Evaluated on 18 multimodal tasks, LLaDA-V outperforms both hybrid autoregressive-diffusion and purely diffusion-based models. It beats LLaMA3-V on most multidisciplinary knowledge and mathematical reasoning benchmarks (e.g., MMMU, MMMU-Pro, MMStar), scoring 60.1 on MMStar, close to Qwen2-VL's 60.7, despite using the weaker LLaDA-8B language tower. It is also data-efficient, surpassing LLaMA3-V on MMMU-Pro with 1 million samples versus LLaMA3-V's 9 million. Although it lags on chart and document understanding benchmarks such as AI2D and real-world scene tasks such as RealWorldQA, the results highlight its promise for multimodal tasks.

Multimodal large language models (MLLMs) are designed to process and generate content across various modalities, including text, images, audio, and video. These models aim to understand and integrate information from different sources, enabling applications such as visual question answering, image captioning, and multimodal dialogue systems. The development of MLLMs represents a significant step toward creating AI systems that can interpret and interact with the world in a more human-like manner.

A primary challenge in developing effective MLLMs lies in integrating diverse input types, particularly visual data, into language models while maintaining high performance across tasks. Existing models often struggle with balancing strong language understanding and effective visual reasoning, especially when scaling to complex data. Further, many models require large datasets to perform well, making it difficult to adapt to specific tasks or domains. These challenges highlight the need for more efficient and scalable approaches to multimodal learning.

Current MLLMs predominantly utilize autoregressive methods, predicting one token at a time in a left-to-right manner. While effective, this approach has limitations in handling complex multimodal contexts. Alternative methods, such as diffusion models, have been explored; however, they often exhibit weaker language understanding due to their restricted architectures or inadequate training strategies. These limitations suggest a gap where a purely diffusion-based model could offer competitive multimodal reasoning capabilities if designed effectively.

Researchers from Renmin University of China and Ant Group introduced LLaDA-V, a purely diffusion-based multimodal large language model (MLLM) that integrates visual instruction tuning with masked diffusion models. Built upon LLaDA, a large language diffusion model, LLaDA-V incorporates a vision encoder and an MLP connector to project visual features into the language embedding space, enabling effective multimodal alignment. This design represents a departure from the autoregressive paradigms dominant in current multimodal approaches, aiming to overcome existing limitations while maintaining data efficiency and scalability.
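To make the connector design concrete, the following is a minimal PyTorch sketch of how patch features from a vision encoder could be projected into a language model's embedding space and concatenated with text embeddings. The class name, the two-layer GELU MLP, and the dimensions (1152 for SigLIP2-style features, 4096 for the language tower) are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class VisionLanguageConnector(nn.Module):
    """Two-layer MLP that projects vision-encoder features into the
    language model's embedding space (all sizes are illustrative)."""

    def __init__(self, vision_dim: int = 1152, lm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, lm_dim),
            nn.GELU(),
            nn.Linear(lm_dim, lm_dim),
        )

    def forward(self, vision_feats: torch.Tensor) -> torch.Tensor:
        # vision_feats: (batch, num_patches, vision_dim), e.g. SigLIP2 patch features
        return self.proj(vision_feats)  # (batch, num_patches, lm_dim)


# Usage: projected image tokens are placed alongside the text token embeddings
# before being fed to the diffusion language model.
connector = VisionLanguageConnector()
image_feats = torch.randn(1, 256, 1152)   # dummy vision-encoder output
image_embeds = connector(image_feats)     # (1, 256, 4096)
text_embeds = torch.randn(1, 64, 4096)    # dummy text token embeddings
multimodal_input = torch.cat([image_embeds, text_embeds], dim=1)
```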

LLaDA-V employs a masked diffusion process where text responses are gradually refined through iterative prediction of masked tokens. Unlike autoregressive models that predict tokens sequentially, LLaDA-V generates outputs by reversing the masked diffusion process. The model is trained in three stages: the first stage aligns vision and language embeddings by mapping visual features from SigLIP2 into LLaDA’s language space. The second stage fine-tunes the model using 10 million single-image samples and 2 million multimodal samples from MAmmoTH-VL. The third stage focuses on reasoning, using 900K QA pairs from VisualWebInstruct and a mixed dataset strategy. Bidirectional attention improves context comprehension, enabling robust multimodal understanding.
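The reverse process can be sketched roughly as below. This is a simplified, hypothetical sampler that assumes a model mapping token ids to per-position logits and uses the common low-confidence remasking schedule of masked diffusion language models; it is not LLaDA-V's exact sampling procedure, and the multimodal (image) context is omitted for brevity. The names `masked_diffusion_decode`, `MASK_ID`, `resp_len`, and `steps` are placeholders.

```python
import torch

MASK_ID = 0  # hypothetical mask-token id; the real tokenizer's id differs

@torch.no_grad()
def masked_diffusion_decode(model, prompt_ids, resp_len=64, steps=8):
    """Sketch of reverse masked diffusion: start from an all-masked response
    and, step by step, commit the highest-confidence predictions while the
    remaining positions stay masked for the next iteration."""
    resp = torch.full((1, resp_len), MASK_ID, dtype=torch.long)
    for step in range(steps):
        # Full bidirectional pass over prompt + (partially masked) response.
        logits = model(torch.cat([prompt_ids, resp], dim=1))
        resp_logits = logits[:, prompt_ids.size(1):, :]
        probs, preds = resp_logits.softmax(dim=-1).max(dim=-1)

        still_masked = resp.eq(MASK_ID)
        # Unmask a growing fraction of positions, chosen by model confidence.
        target_unmasked = resp_len * (step + 1) // steps
        num_to_unmask = target_unmasked - int((~still_masked).sum())
        if num_to_unmask > 0:
            conf = probs.masked_fill(~still_masked, -1.0)
            idx = conf.topk(num_to_unmask, dim=1).indices
            resp.scatter_(1, idx, preds.gather(1, idx))
    return resp
```

In the multimodal setting described above, the input to the model would also include the image embeddings produced by the vision encoder and MLP connector, projected into the same language space as the text tokens.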

In evaluations across 18 multimodal tasks, LLaDA-V demonstrated superior performance compared to hybrid autoregressive-diffusion and purely diffusion-based models. It outperformed LLaMA3-V on most multidisciplinary knowledge and mathematical reasoning tasks like MMMU, MMMU-Pro, and MMStar, achieving a score of 60.1 on MMStar, close to Qwen2-VL’s 60.7, despite LLaDA-V using the weaker LLaDA-8B language tower. LLaDA-V also excelled in data efficiency, outperforming LLaMA3-V on MMMU-Pro with 1M samples against LLaMA3-V’s 9M. Although it lagged in chart and document understanding benchmarks, such as AI2D, and in real-world scene tasks, like RealWorldQA, LLaDA-V’s results highlight its promise for multimodal tasks.

In summary, LLaDA-V addresses the challenges of building effective multimodal models by introducing a purely diffusion-based architecture that combines visual instruction tuning with masked diffusion. The approach offers strong multimodal reasoning capabilities while maintaining data efficiency. This work demonstrates the potential of diffusion models in multimodal AI, paving the way for further exploration of probabilistic approaches to complex AI tasks.


Check out the Paper and GitHub Page. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 95k+ ML SubReddit and subscribe to our Newsletter.



Tags: LLaDA-V, Multimodal, Diffusion Models, Visual Instruction Tuning