MarkTechPost@AI · February 20
Advancing MLLM Alignment Through MM-RLHF: A Large-Scale Human Preference Dataset for Multimodal Tasks

MM-RLHF is a novel approach that aims to improve the alignment of multimodal large language models (MLLMs) through a large-scale human preference dataset. The method introduces fine-grained, human-annotated preference comparison pairs and proposes a critique-based reward model together with a dynamic reward scaling mechanism, improving both the interpretability of model decisions and the efficiency of alignment. Experimental results show that MM-RLHF delivers significant gains in conversational ability, safety, hallucination reduction, mathematical reasoning, and multi-image understanding, offering an effective path toward comprehensive MLLM alignment.

💡 MM-RLHF contributes a comprehensive dataset of 120k fine-grained, human-annotated preference comparison pairs, surpassing existing resources in scale, diversity, and annotation quality and providing a solid foundation for aligning multimodal large language models.

🤖 The method introduces a critique-based reward model that generates detailed critiques before scoring outputs, enhancing the interpretability of model decisions and overcoming the limitations of traditional scalar reward mechanisms in multimodal settings.

⚖️ A dynamic reward scaling mechanism adjusts sample weights based on reward signals, improving the efficiency of the alignment process and yielding significant gains across multiple dimensions, including conversational-ability improvements of over 10% and a reduction in unsafe behaviors of at least 50%.

🖼️ MM-RLHF carries out data preparation and filtering across image understanding, video understanding, and multimodal safety, compiling more than 10 million dialogue samples that cover tasks ranging from basic conversation to complex reasoning.

🚀 Experiments show that applying MM-RLHF to models such as LLaVA-OV-7B, LLaVA-OV-0.5B, and InternVL-1B yields better results in hallucination reduction, mathematical reasoning, and multi-image understanding, even without task-specific training data.

Multimodal Large Language Models (MLLMs) have gained significant attention for their ability to handle complex tasks involving vision, language, and audio integration. However, they lack comprehensive alignment beyond basic supervised fine-tuning (SFT). Current state-of-the-art models often bypass rigorous alignment stages, leaving crucial aspects like truthfulness, safety, and human preference alignment inadequately addressed. Existing approaches target only specific domains such as hallucination reduction or conversational improvements, falling short of enhancing the model’s overall performance and reliability. This narrow focus raises questions about whether human preference alignment can improve MLLMs across a broader spectrum of tasks.

Recent years have witnessed substantial progress in MLLMs, built upon advanced LLM architectures like GPTs, LLaMA, Alpaca, Vicuna, and Mistral. These models have evolved through end-to-end training approaches, tackling complex multimodal tasks involving image-text alignment, reasoning, and instruction following. Several open-source MLLMs, including Otter, mPLUG-Owl, LLaVA, Qwen-VL, and VITA, have emerged to address fundamental multimodal challenges. However, alignment efforts have remained limited. While algorithms like Fact-RLHF and LLaVA-Critic have shown promise in reducing hallucinations and improving conversational abilities, they haven’t enhanced general capabilities. Evaluation frameworks such as MME, MMBench, and Seed-Bench have been developed to assess these models.

Researchers from KuaiShou, CASIA, NJU, USTC, PKU, Alibaba, and Meta AI have proposed MM-RLHF, an innovative approach featuring a comprehensive dataset of 120k fine-grained, human-annotated preference comparison pairs. This dataset represents a significant advancement in terms of size, diversity, and annotation quality compared to existing resources. The method introduces two key innovations: a Critique-Based Reward Model that generates detailed critiques before scoring outputs, and Dynamic Reward Scaling that optimizes sample weights based on reward signals. Together, these innovations enhance both the interpretability of model decisions and the efficiency of the alignment process, addressing the limitations of traditional scalar reward mechanisms in multimodal contexts.
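The article does not include code, but the two mechanisms can be illustrated with a short sketch. The Python snippet below is a minimal, hypothetical illustration: `critique_fn` and `score_fn` stand in for MLLM calls, and the `tanh`-based weighting schedule is an assumption for illustration, not the authors' exact formula. It shows a critique-then-score reward call and a DPO-style pairwise loss whose per-sample weight grows with the reward margin.

```python
"""Minimal sketch: critique-then-score reward with dynamic reward scaling.

Placeholder functions (critique_fn, score_fn) stand in for MLLM calls; the
weighting schedule is an illustrative assumption, not the paper's exact form.
"""
import math
from typing import Callable


def critique_then_score(prompt: str, response: str,
                        critique_fn: Callable[[str, str], str],
                        score_fn: Callable[[str, str, str], float]) -> float:
    """Two-stage reward: generate a textual critique first, then score the
    response conditioned on that critique instead of emitting a bare scalar."""
    critique = critique_fn(prompt, response)
    return score_fn(prompt, response, critique)


def dynamic_weight(reward_margin: float, k: float = 5.0) -> float:
    """Map the chosen-vs-rejected reward margin to a bounded per-sample weight:
    pairs the reward model separates confidently get up-weighted."""
    return 1.0 + math.tanh(k * max(reward_margin, 0.0))


def weighted_dpo_loss(policy_logp_chosen: float, policy_logp_rejected: float,
                      ref_logp_chosen: float, ref_logp_rejected: float,
                      reward_margin: float, beta: float = 0.1) -> float:
    """DPO-style pairwise loss, scaled per sample by the dynamic weight."""
    logits = beta * ((policy_logp_chosen - ref_logp_chosen)
                     - (policy_logp_rejected - ref_logp_rejected))
    # -log(sigmoid(logits)) written with log1p for numerical stability.
    nll = math.log1p(math.exp(-logits))
    return dynamic_weight(reward_margin) * nll


if __name__ == "__main__":
    # Toy numbers only: log-probabilities from a policy/reference pair and a
    # reward margin produced by the critique-based reward model.
    print(weighted_dpo_loss(-12.0, -15.0, -13.0, -14.5, reward_margin=0.6))
```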

The MM-RLHF implementation involves a complex data preparation and filtering process across three main domains: image understanding, video understanding, and multimodal safety. The image understanding component integrates data from multiple sources including LLaVA-OV, VLfeedback, and LLaVA-RLHF, with multi-turn dialogues converted to single-turn format. This compilation results in over 10 million dialogue samples covering diverse tasks from basic conversation to complex reasoning. The data filtering process uses predefined sampling weights categorized into three types: multiple-choice questions for testing reasoning and perception, long-text questions for evaluating conversational abilities, and short-text questions for basic image analysis.
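As a rough illustration of that filtering step, the sketch below buckets samples into the three question types and draws from each bucket according to predefined sampling weights. The weight values and the `classify` heuristic are illustrative assumptions, not the paper's actual configuration.

```python
"""Illustrative sketch of category-weighted data filtering (assumed values)."""
import random

# Assumed sampling weights over the three question types described above.
SAMPLING_WEIGHTS = {"multiple_choice": 0.4, "long_text": 0.35, "short_text": 0.25}


def classify(sample: dict) -> str:
    """Crude heuristic: option markers imply multiple choice; otherwise split
    long-text vs. short-text questions by answer length."""
    question = sample.get("question", "")
    if any(marker in question for marker in ("(A)", "(B)", "A.", "B.")):
        return "multiple_choice"
    return "long_text" if len(sample.get("answer", "")) > 128 else "short_text"


def filter_and_sample(pool: list[dict], budget: int, seed: int = 0) -> list[dict]:
    """Bucket the dialogue pool by question type, then sample each bucket in
    proportion to its predefined weight until the overall budget is met."""
    rng = random.Random(seed)
    buckets: dict[str, list[dict]] = {k: [] for k in SAMPLING_WEIGHTS}
    for sample in pool:
        buckets[classify(sample)].append(sample)
    selected: list[dict] = []
    for category, weight in SAMPLING_WEIGHTS.items():
        take = min(len(buckets[category]), int(budget * weight))
        selected.extend(rng.sample(buckets[category], take))
    return selected


if __name__ == "__main__":
    toy_pool = [
        {"question": "Which object is red? (A) car (B) ball", "answer": "B"},
        {"question": "Describe the scene in detail.", "answer": "A busy street..." * 20},
        {"question": "What animal is shown?", "answer": "A cat."},
    ]
    print([s["question"] for s in filter_and_sample(toy_pool, budget=3)])
```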

The evaluation of MM-RLHF and MM-DPO shows significant improvements across multiple dimensions when applied to models like LLaVA-OV-7B, LLaVA-OV-0.5B, and InternVL-1B. Conversational abilities improved by over 10%, while unsafe behaviors decreased by at least 50%. The aligned models show better results in hallucination reduction, mathematical reasoning, and multi-image understanding, even without specific training data for some tasks. However, model-specific variations are observed, with different models requiring distinct hyperparameter settings for optimal performance. Also, high-resolution tasks show limited gains due to dataset constraints and filtering strategies that don’t target resolution optimization.

In this paper, researchers introduced MM-RLHF, a dataset and alignment approach that marks a significant advance in MLLM development. Unlike previous task-specific approaches, this method takes a holistic approach to improving model performance across multiple dimensions. The dataset’s rich annotation granularity, including per-dimension scores and ranking rationales, offers untapped potential for future development. Future research directions will focus on utilizing this granularity through advanced optimization techniques, addressing high-resolution data limitations, and expanding the dataset through semi-automated methods, potentially establishing a foundation for more robust multimodal learning frameworks.


Check out the Paper and Project Page. All credit for this research goes to the researchers of this project.


