MarkTechPost@AI · April 30, 04:40
UniME: A Two-Stage Framework for Enhancing Multimodal Representation Learning with MLLMs

UniME is a two-stage framework designed to improve multimodal representation learning with MLLMs. It first strengthens the MLLM's language embeddings through textual discriminative knowledge distillation from a large language model. It then applies hard negative enhanced instruction tuning, which reduces false-negative interference and encourages the model to distinguish challenging examples. Extensive evaluation on MMEB and retrieval tasks shows that UniME consistently improves performance across tasks, providing strong discriminative and compositional abilities and surpassing the limitations of prior models such as CLIP.

📚 The UniME framework uses a two-stage approach: first, textual discriminative knowledge distillation trains a student MLLM under a teacher model to improve embedding quality, with NV-Embed V2 serving as the teacher.

⚔️ The second stage is hard negative enhanced instruction tuning, which filters false negatives and samples hard negatives to improve cross-modal alignment and task performance. This stage also leverages task-specific prompts to strengthen instruction following for applications such as retrieval and visual question answering.

📊 UniME was evaluated on Phi3.5-V and LLaVA-1.6, trained efficiently with PyTorch and DeepSpeed on 8 NVIDIA A100 GPUs. Textual knowledge distillation used the NLI dataset (273,000 pairs), and hard negative instruction tuning used 662,000 multimodal pairs.

🎯 UniME was evaluated on 36 MMEB benchmark datasets, achieving consistent improvements over baselines such as E5-V and VLM2Vec. Hard negatives markedly improved the model's ability to distinguish subtle differences, boosting performance especially on long-caption and compositional retrieval tasks.

The CLIP framework has become foundational in multimodal representation learning, particularly for tasks such as image-text retrieval. However, it faces several limitations: a strict 77-token cap on text input, a dual-encoder design that separates image and text processing, and a limited compositional understanding that resembles bag-of-words models. These issues hinder its effectiveness in capturing nuanced, instruction-sensitive semantics. Although MLLMs like LLaVA, Qwen2-VL, and CogVLM offer significant advances in vision-language reasoning, their autoregressive next-token prediction objective restricts their ability to learn generalized, transferable embeddings. This has sparked growing interest in developing alternative methods that can combine the strengths of both contrastive learning and LLM-based reasoning.
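For context, the minimal sketch below illustrates the 77-token cap using the Hugging Face CLIPTokenizer; the checkpoint name and caption are only examples, not details taken from the paper.

```python
# Minimal sketch: illustrating CLIP's 77-token text limit with the
# Hugging Face tokenizer (checkpoint name is just an example).
from transformers import CLIPTokenizer

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
print(tokenizer.model_max_length)  # 77 for standard CLIP text encoders

long_caption = "A detailed caption about a crowded street scene " * 20
tokens = tokenizer(long_caption, truncation=True, max_length=77)
# Anything beyond 77 tokens is silently discarded, so fine-grained details
# in long captions never reach the text encoder.
print(len(tokens["input_ids"]))  # 77
```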

Recent approaches aim to overcome these limitations by employing novel architectures and training strategies. For instance, E5-V proposes unimodal contrastive training for aligning cross-modal features, while VLM2Vec introduces the MMEB benchmark to convert advanced vision-language models into effective embedding generators. Models like LLM2Vec and NV-Embed enhance text-based representation learning by modifying the attention mechanisms in decoder-only LLMs. Despite these innovations, challenges such as handling long sequences, enabling better cross-modal fusion, and effectively distinguishing hard negatives in contrastive learning remain. As multimodal applications expand, there is a pressing need for representation learning methods that are both scalable and capable of fine-grained semantic alignment.

Researchers from institutions including The University of Sydney, DeepGlint, Tongyi Lab at Alibaba, and Imperial College London introduce UniME, a two-stage framework designed to improve multimodal representation learning using MLLMs. The first stage applies textual discriminative knowledge distillation from a strong LLM teacher to enhance the language encoder. The second stage employs hard negative enhanced instruction tuning, which involves filtering false negatives and sampling multiple challenging negatives per instance to improve the model’s discriminative and instruction-following abilities. Evaluations on the MMEB benchmark and various retrieval tasks show that UniME delivers consistent and significant improvements in both performance and compositional understanding.
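A rough sketch of what the first-stage distillation objective could look like is shown below, assuming the student MLLM's text embeddings are aligned to the teacher's (e.g., NV-Embed V2) by matching in-batch similarity distributions with a KL divergence; the paper's exact loss, prompts, temperature, and embedding dimensions may differ.

```python
# Hedged sketch of Stage 1 (textual discriminative knowledge distillation).
# Assumption: the student MLLM's text embedding for each caption is pushed
# toward the teacher's by matching in-batch similarity distributions.
import torch
import torch.nn.functional as F

def distillation_loss(student_emb: torch.Tensor,
                      teacher_emb: torch.Tensor,
                      temperature: float = 0.05) -> torch.Tensor:
    """student_emb, teacher_emb: (batch, dim) text embeddings for the same captions."""
    s = F.normalize(student_emb, dim=-1)
    t = F.normalize(teacher_emb, dim=-1)

    # In-batch similarity matrices (each row: one caption vs. all others).
    s_sim = s @ s.t() / temperature
    t_sim = t @ t.t() / temperature

    # Match the student's similarity distribution to the teacher's.
    return F.kl_div(F.log_softmax(s_sim, dim=-1),
                    F.softmax(t_sim, dim=-1),
                    reduction="batchmean")

# Usage: embeddings would come from the trainable student and the frozen teacher.
student_emb = torch.randn(8, 768, requires_grad=True)
teacher_emb = torch.randn(8, 4096)  # teacher dimension is illustrative
loss = distillation_loss(student_emb, teacher_emb)
loss.backward()
```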

The UniME framework introduces a two-stage method for learning universal multimodal embeddings using MLLMs. First, it employs textual discriminative knowledge distillation, where a student MLLM is trained using text-only prompts and supervised by a teacher model to enhance embedding quality. Then, a second stage—hard negative enhanced instruction tuning—improves cross-modal alignment and task performance by filtering false negatives and sampling hard negatives. This stage also leverages task-specific prompts to enhance instruction-following for various applications, such as retrieval and visual question answering. Together, these stages significantly boost UniME’s performance on both in- and out-of-distribution tasks.
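The sketch below illustrates one plausible form of the second-stage objective: an InfoNCE-style loss in which in-batch candidates scoring almost as high as the positive are masked as likely false negatives, and only the hardest remaining negatives are kept. The margin, temperature, and number of hard negatives are placeholders rather than the paper's settings.

```python
# Hedged sketch of Stage 2 (hard negative enhanced instruction tuning).
# Assumptions: queries and candidates are embedded by the MLLM; near-duplicate
# candidates are filtered as false negatives; top-k hardest negatives remain.
import torch
import torch.nn.functional as F

def hard_negative_infonce(query_emb, cand_emb, temperature=0.05,
                          margin=0.1, num_hard_negatives=8):
    """query_emb: (B, D); cand_emb: (B, D), where cand_emb[i] is the positive for query i."""
    q = F.normalize(query_emb, dim=-1)
    c = F.normalize(cand_emb, dim=-1)

    sim = q @ c.t()                      # (B, B) query-candidate similarities
    pos = sim.diag().unsqueeze(1)        # positive similarity per query

    # False-negative filtering: off-diagonal candidates that score nearly as
    # high as the positive are likely paraphrases/duplicates -> mask them.
    off_diag = ~torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    false_neg = off_diag & (sim > pos - margin)

    # Keep only the hardest valid negatives (highest remaining similarity).
    neg_sim = sim.masked_fill(~off_diag | false_neg, float("-inf"))
    k = min(num_hard_negatives, sim.size(0) - 1)
    hard_negs, _ = neg_sim.topk(k, dim=1)

    # InfoNCE over the positive plus its hard negatives (positive at index 0).
    logits = torch.cat([pos, hard_negs], dim=1) / temperature
    labels = torch.zeros(sim.size(0), dtype=torch.long, device=sim.device)
    return F.cross_entropy(logits, labels)

loss = hard_negative_infonce(torch.randn(16, 768), torch.randn(16, 768))
```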

The study evaluated UniME on Phi3.5-V and LLaVA-1.6 using PyTorch with DeepSpeed for efficient training across 8 NVIDIA A100 GPUs. Training consisted of two stages: a textual knowledge distillation phase using the NLI dataset (273,000 pairs) and a hard negative instruction tuning phase on 662,000 multimodal pairs. NV-Embed V2 served as the teacher model. UniME was evaluated on 36 MMEB benchmark datasets, achieving consistent improvements over baselines such as E5-V and VLM2Vec. Hard negatives significantly improved the model’s ability to distinguish subtle differences, thereby enhancing its performance, particularly in long-caption and compositional retrieval tasks. Ablation studies confirmed the effectiveness of both training stages and tuning parameters.
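As a rough illustration of this training setup, the sketch below wraps a stand-in model with DeepSpeed for multi-GPU training. The toy model, configuration values, and training loop are placeholders, not the paper's actual hyperparameters.

```python
# Hedged sketch: wrapping a model with DeepSpeed, as in the setup above.
# Run under the DeepSpeed launcher, e.g.: deepspeed train_sketch.py
import torch
import deepspeed

model = torch.nn.Linear(768, 768)  # stand-in for the MLLM embedding model

ds_config = {
    "train_micro_batch_size_per_gpu": 4,
    "gradient_accumulation_steps": 8,
    "bf16": {"enabled": True},
    "zero_optimization": {"stage": 2},
    "optimizer": {"type": "AdamW", "params": {"lr": 1e-5}},
}

engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)

for step in range(10):  # placeholder loop over batches
    batch = torch.randn(4, 768, device=engine.device,
                        dtype=next(engine.parameters()).dtype)
    loss = engine(batch).pow(2).mean()   # stand-in for the contrastive loss
    engine.backward(loss)                # DeepSpeed handles scaling/accumulation
    engine.step()
```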

In conclusion, UniME is a two-stage framework designed to improve multimodal representation learning using MLLMs. In the first stage, UniME distills textual discriminative knowledge from a large language model to strengthen the language embeddings of the MLLM. In the second stage, it enhances learning through instruction tuning with multiple hard negatives per batch, reducing false negative interference and encouraging the model to distinguish challenging examples. Extensive evaluation on MMEB and various retrieval tasks demonstrates that UniME consistently boosts performance, offering strong discriminative and compositional abilities across tasks, thereby surpassing the limitations of prior models, such as CLIP.


Check out the Paper and Code.
