MarkTechPost@AI · July 28, 05:24
VLM2Vec-V2: A Unified Computer Vision Framework for Multimodal Embedding Learning Across Images, Videos, and Visual Documents

Existing embedding models perform well on natural images but cover broader visual information such as documents and videos poorly, which leads to weak results on real-world search tasks. To address this, researchers propose VLM2Vec-V2, a general-purpose embedding model that unifies image, video, and visual document retrieval. The model is built on Qwen2-VL and trained on MMEB-V2, a benchmark that widens the range of multimodal tasks evaluated. VLM2Vec-V2 achieves leading performance across many tasks, demonstrating strong cross-modal representation learning and broad application potential.

💡 **The challenge of unified cross-modal embedding, and the proposed solution**: Existing embedding models focus mainly on natural images and neglect broader visual information such as documents and videos, which limits their performance in practical applications. VLM2Vec-V2 closes this gap with a unified framework that handles images, videos, and visual documents.

🚀 **Technical core and advantages of VLM2Vec-V2**: The model builds on Qwen2-VL, leveraging its Naive Dynamic Resolution, M-RoPE, and a unified framework combining 2D and 3D convolutions to embed content effectively across modalities. Its flexible data sampling pipeline, with on-the-fly batch mixing and an interleaved sub-batching strategy, improves the stability and efficiency of multi-task training.

🏆 **Strong benchmark performance**: On the MMEB-V2 benchmark of 78 tasks spanning images, videos, and visual documents, VLM2Vec-V2 achieves the highest average score of 58.0, clearly outperforming baselines such as GME and LamRA. Despite being smaller than competing models, it reaches comparable performance on image tasks and remains highly competitive on video and document tasks.

📊 **Why MMEB-V2 matters**: As a new benchmark for evaluating multimodal embedding models, MMEB-V2 not only extends the evaluation dimensions to cover visual document retrieval and video-related tasks, but also gives researchers a valuable tool for diagnosing and advancing future cross-modal learning research.

🌐 **Broad practical potential**: The significance of VLM2Vec-V2 lies not only in its technical contributions but also in laying a foundation for more scalable and flexible representation learning, pointing to substantial application potential in information retrieval, content understanding, and beyond.

Embedding models act as bridges between different data modalities by encoding diverse multimodal information into a shared dense representation space. There have been advancements in embedding models in recent years, driven by progress in large foundation models. However, existing multimodal embedding models are trained on datasets such as MMEB and M-BEIR, with most focusing only on natural images and photographs sourced from MSCOCO, Flickr, and ImageNet. These datasets fail to cover broader forms of visual information, including documents, PDFs, websites, videos, and slides. As a result, existing embedding models underperform on realistic tasks such as article search, website search, and YouTube video search.
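
To make the "shared dense representation space" concrete, the sketch below shows how retrieval works once every modality is mapped into the same vector space: a text query and heterogeneous visual candidates are encoded and ranked by cosine similarity. The encoders here are random placeholders standing in for a real multimodal embedding model such as VLM2Vec; only the retrieval mechanics are illustrated.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 512  # embedding dimensionality (illustrative)

# Placeholder encoders: a real multimodal embedding model would map text,
# images, video clips, or document pages into the same DIM-dimensional space.
def encode_text(text: str) -> np.ndarray:
    return rng.normal(size=DIM)

def encode_visual(item: str) -> np.ndarray:
    return rng.normal(size=DIM)

def retrieve(query: str, candidates: list[str], top_k: int = 3):
    """Rank visual candidates (images, videos, document pages) for a text
    query by cosine similarity in the shared embedding space."""
    q = encode_text(query)
    q = q / np.linalg.norm(q)
    cand = np.stack([encode_visual(c) for c in candidates])
    cand = cand / np.linalg.norm(cand, axis=1, keepdims=True)
    scores = cand @ q                       # cosine similarity per candidate
    order = np.argsort(-scores)[:top_k]
    return [(candidates[i], float(scores[i])) for i in order]

print(retrieve("how to fine-tune a vision-language model",
               ["photo.jpg", "lecture_video.mp4", "slides.pdf", "webpage.png"]))
```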

Multimodal embedding benchmarks such as MSCOCO, Flickr30K, and Conceptual Captions initially focused on static image-text pairs for tasks like image captioning and retrieval. More recent benchmarks, such as M-BEIR and MMEB, introduced multi-task evaluations, but remain limited to static images and short contexts. Video representation learning has evolved through models like VideoCLIP and VideoCoCa, integrating contrastive learning with captioning objectives. Visual document representation learning advanced through models like ColPali and VisRAG, which use VLMs for document retrieval. Unified modality retrieval methods like GME and Uni-Retrieval achieve strong performance on universal benchmarks. However, none can unify image, video, and visual document retrieval within a single framework.

Researchers from Salesforce Research, UC Santa Barbara, University of Waterloo, and Tsinghua University have proposed VLM2Vec-V2 to unify image, video, and visual document retrieval within a single framework. Firstly, researchers developed MMEB-V2, a benchmark that extends MMEB with five new task types, including visual document retrieval, video retrieval, temporal grounding, video classification, and video question answering. Secondly, VLM2Vec-V2 serves as a general-purpose embedding model that supports multiple input modalities while demonstrating strong performance on both newly introduced tasks and original image benchmarks. This establishes a foundation for more scalable and flexible representation learning in both research and practical applications.

VLM2Vec-V2 utilizes Qwen2-VL as its backbone, selected for its specialized capabilities in multimodal processing. Qwen2-VL offers three critical features that support unified embedding learning: Naive Dynamic Resolution, Multimodal Rotary Position Embedding (M-RoPE), and a unified framework that combines 2D and 3D convolutions. To enable effective multi-task training across diverse data sources, VLM2Vec-V2 introduces a flexible data sampling pipeline with two key components: (a) on-the-fly batch mixing based on predefined sampling weight tables that control the relative probabilities of each dataset, and (b) an interleaved sub-batching strategy that splits full batches into independently sampled sub-batches, improving the stability of contrastive learning.
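
The following is a minimal sketch of how such a sampling pipeline could look, under one plausible reading of the description above: each full batch is assembled on the fly from interleaved sub-batches, and each sub-batch is drawn from a dataset chosen according to a predefined weight table. Function names, weights, and the single-source-per-sub-batch assumption are illustrative, not the released implementation.

```python
import random
from typing import Iterator

def mix_batches(datasets: dict[str, list], weights: dict[str, float],
                batch_size: int, sub_batches: int = 4,
                seed: int = 0) -> Iterator[list]:
    """Illustrative on-the-fly batch mixing with interleaved sub-batches.

    Every sub-batch is sampled independently from one dataset picked via the
    sampling-weight table, so the overall task mixture follows the predefined
    weights while each sub-batch stays internally coherent for contrastive
    in-batch negatives.
    """
    rng = random.Random(seed)
    names = list(datasets)
    probs = [weights[n] for n in names]
    per_sub = batch_size // sub_batches
    while True:
        batch = []
        for _ in range(sub_batches):
            source = rng.choices(names, weights=probs, k=1)[0]
            batch.extend(rng.sample(datasets[source], per_sub))
        yield batch

# Hypothetical mixture of image, video, and visual-document examples with
# made-up sampling weights (not the paper's actual weight table).
data = {"image": list(range(1000)),
        "video": list(range(1000, 1500)),
        "visdoc": list(range(2000, 2800))}
w = {"image": 0.5, "video": 0.2, "visdoc": 0.3}

first_batch = next(mix_batches(data, w, batch_size=64))
print(len(first_batch))  # 64 examples drawn from 4 single-source sub-batches
```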

VLM2Vec-V2 achieves the highest overall average score of 58.0 across 78 datasets covering image, video, and visual document tasks, outperforming strong baselines including GME, LamRA, and VLM2Vec built on the same Qwen2-VL backbone. On image tasks, VLM2Vec-V2 outperforms most baselines by significant margins and achieves performance comparable to VLM2Vec-7B despite being only 2B parameters in size. For video tasks, the model achieves competitive performance despite training on relatively small amounts of video data. In visual document retrieval, VLM2Vec-V2 outperforms all VLM2Vec variants, but still lags behind ColPali, which is specifically optimized for visual document tasks.

In conclusion, researchers introduced VLM2Vec-V2, a strong baseline model trained through contrastive learning across diverse tasks and modality combinations. VLM2Vec-V2 is built upon MMEB-V2 and uses Qwen2-VL as its backbone model. MMEB-V2 is a benchmark designed by researchers to assess multimodal embedding models across various modalities, including text, images, videos, and visual documents. The experimental evaluation demonstrates the effectiveness of VLM2Vec-V2 in achieving balanced performance across multiple modalities while highlighting the diagnostic value of MMEB-V2 for future research.


Check out the Paper, GitHub Page and Model on Hugging Face. All credit for this research goes to the researchers of this project.

