MarkTechPost@AI, November 7, 2024
NVIDIA AI Introduces MM-Embed: The First Multimodal Retriever Achieving SOTA Results on the Multimodal M-BEIR Benchmark

NVIDIA researchers have introduced MM-Embed, a multimodal retrieval model that achieves state-of-the-art (SOTA) results on the multimodal M-BEIR benchmark and ranks in the top five on the text-only MTEB retrieval benchmark. MM-Embed aims to bridge the gap between different retrieval formats, offering a smoother search experience across text and image content. The model is built by fine-tuning a multimodal large language model (MLLM) and supports complex user queries that combine text and images. In addition, MM-Embed introduces modality-aware hard negative mining to minimize the biases common in MLLMs and thereby improve retrieval quality.

🤔**MM-Embed is the first multimodal retrieval model to achieve SOTA results on the multimodal M-BEIR benchmark.** It handles text, images, and their combinations, bridging the gap between retrieval formats and enabling a smoother cross-modal search experience.

🚀**MM-Embed is built by fine-tuning a multimodal large language model (MLLM)**, trained across 16 retrieval tasks and 10 datasets. It shows strong generality and supports complex user queries that combine text and images.

🔎**MM-Embed introduces modality-aware hard negative mining**, which effectively reduces the biases MLLMs exhibit on mixed-modality data and improves retrieval precision, especially on complex text-image queries.

📊**MM-Embed reaches an average retrieval accuracy of 52.7% on the M-BEIR benchmark** and 73.8% R@5 on the MSCOCO dataset, demonstrating strong image understanding.

🏆**MM-Embed further improves retrieval precision with zero-shot reranking**: on CIRCO's composed image retrieval task, ranking accuracy improves by more than 7 points, showing the effectiveness of using large language models for reranking.

In the world of information retrieval, one of the most challenging tasks is to create a system that can seamlessly understand and retrieve relevant content across different formats, such as text and images, without losing accuracy. Most state-of-the-art retrieval models are still confined to a single modality—either text-to-text or image-to-image retrieval—which limits their applicability in real-world scenarios where information comes in diverse formats. This limitation is particularly evident in complex applications, such as visual question answering or fashion image retrieval, where both text and images are needed to derive relevant answers. Therefore, the need for a universal multimodal retriever that can handle text, images, and their combinations effectively has never been greater. The key challenges include the inherent difficulty of cross-modal understanding and overcoming biases within individual modalities.

NVIDIA researchers have stepped up to address these challenges by introducing MM-Embed, the first multimodal retriever that has achieved state-of-the-art (SOTA) results on the multimodal M-BEIR benchmark and ranks among the top five retrievers on the text-only MTEB retrieval benchmark. MM-Embed aims to bridge the gap between multiple retrieval formats, allowing for a more fluid search experience that spans both text and image-based content. The researchers fine-tuned MM-Embed using a multimodal large language model (MLLM) as a bi-encoder retriever across 16 retrieval tasks and ten datasets, demonstrating its versatility. Unlike other existing retrievers, MM-Embed does not restrict itself to a single type of data but instead supports complex user queries that may be composed of both text and images. Furthermore, the introduction of modality-aware hard negative mining plays a crucial role in enhancing MM-Embed’s retrieval quality by minimizing the biases commonly seen in MLLMs.
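To make the bi-encoder setup concrete, here is a minimal sketch of how a retriever of this kind scores a mixed text-image query against a candidate pool: both sides are embedded into a shared vector space and candidates are ranked by cosine similarity. The encoder below is a random-vector placeholder standing in for MM-Embed's fine-tuned MLLM; the embedding size, input format, and file names are illustrative assumptions, not NVIDIA's actual API.

```python
# Minimal sketch of bi-encoder retrieval over a mixed-modality corpus.
import torch
import torch.nn.functional as F

EMB_DIM = 768  # assumed embedding size

def encode(items):
    """Placeholder encoder returning L2-normalized embeddings.
    In MM-Embed, a single fine-tuned MLLM would embed text, images,
    or interleaved text-image inputs into the same space."""
    vecs = torch.randn(len(items), EMB_DIM)
    return F.normalize(vecs, dim=-1)

# A query mixing an image with a textual instruction, and a small candidate pool.
query = [("photo_of_red_dress.jpg", "find similar dresses with long sleeves")]
candidates = ["candidate_image_1.jpg", "candidate_image_2.jpg",
              "A paragraph describing summer dresses."]

q_emb = encode(query)       # shape (1, EMB_DIM)
c_emb = encode(candidates)  # shape (3, EMB_DIM)

# Cosine similarity is the dot product of normalized vectors; rank by score.
scores = q_emb @ c_emb.T
ranking = scores.argsort(dim=-1, descending=True)
print(ranking[0].tolist())
```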

The technical implementation of MM-Embed involved a series of key strategies designed to maximize retrieval performance. The model uses a bi-encoder architecture to fine-tune the retrieval process, leveraging modality-aware hard negative mining to mitigate biases that arise when handling mixed-modality data. In simple terms, this mining approach helps the model focus more accurately on the target modality—whether text, image, or a combination—thus improving its ability to handle difficult, interleaved text-image queries. Additionally, MM-Embed undergoes continual fine-tuning to boost its text retrieval capabilities without sacrificing its strength in multimodal tasks. This makes it particularly effective in a diverse set of scenarios, from retrieving Wikipedia paragraphs in response to a text-based query about an image to finding similar images based on complex descriptions.
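One plausible reading of modality-aware hard negative mining is sketched below: negatives are drawn from a first-round retriever's top results, but only candidates whose modality matches the task's intended target modality are kept, so the contrastive loss pushes the model to distinguish content rather than modality. The data structure and function names are hypothetical illustrations, not NVIDIA's implementation.

```python
# Hedged sketch of modality-aware hard negative mining.
from dataclasses import dataclass

@dataclass
class Candidate:
    doc_id: str
    modality: str   # "text", "image", or "text+image"
    score: float    # similarity score from the initial retriever

def mine_hard_negatives(ranked, positive_ids, target_modality, k=4):
    """Keep the top-scoring non-positive candidates of the correct modality."""
    negatives = []
    for cand in sorted(ranked, key=lambda c: c.score, reverse=True):
        if cand.doc_id in positive_ids:
            continue
        if cand.modality != target_modality:
            # Skip wrong-modality candidates so training focuses on content,
            # not on the modality bias the MLLM already exhibits.
            continue
        negatives.append(cand)
        if len(negatives) == k:
            break
    return negatives

ranked = [Candidate("img_12", "image", 0.81), Candidate("txt_07", "text", 0.79),
          Candidate("img_33", "image", 0.74), Candidate("img_90", "image", 0.70)]
print(mine_hard_negatives(ranked, positive_ids={"img_12"}, target_modality="image"))
```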

This advancement is significant for several reasons. First, MM-Embed sets a new benchmark for multimodal retrieval with an average retrieval accuracy of 52.7% across all M-BEIR tasks, surpassing previous state-of-the-art models. When it comes to specific domains, MM-Embed showed notable improvements, such as a retrieval accuracy (R@5) of 73.8% for the MSCOCO dataset, indicating its strong ability to understand complex image captions. Moreover, by employing zero-shot reranking using multimodal LLMs, MM-Embed further enhanced retrieval precision in cases involving intricate text-image queries, such as visual question answering and composed image retrieval tasks. Notably, MM-Embed improved ranking accuracy in CIRCO’s composed image retrieval task by more than 7 points, showcasing the efficacy of prompting LLMs for reranking in challenging, real-world scenarios.
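The zero-shot reranking step can be sketched as follows: each of the retriever's top-k candidates is scored by prompting an LLM for a yes/no relevance judgment and reordered by the probability of "yes". The prompt wording and the `llm_yes_prob` callable are placeholders for whatever multimodal LLM inference interface is actually used.

```python
# Hedged sketch of zero-shot reranking with an LLM relevance prompt.
def rerank(query, candidates, llm_yes_prob, top_k=10):
    prompt = ("Query: {q}\nCandidate: {c}\n"
              "Does the candidate satisfy the query? Answer yes or no.")
    scored = []
    for cand in candidates[:top_k]:
        p_yes = llm_yes_prob(prompt.format(q=query, c=cand))
        scored.append((p_yes, cand))
    # Candidates beyond top_k keep their original retriever order.
    reranked = [c for _, c in sorted(scored, key=lambda x: x[0], reverse=True)]
    return reranked + candidates[top_k:]

# Toy usage with a dummy scorer standing in for the multimodal LLM.
dummy = lambda p: 0.9 if "Candidate: red dress" in p else 0.2
print(rerank("a red dress", ["blue shirt", "red dress photo", "green hat"],
             dummy, top_k=3))
```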

In conclusion, MM-Embed represents a major leap forward in multimodal retrieval. By effectively integrating and enhancing both text and image retrieval capabilities, it paves the way for more versatile and sophisticated search engines capable of handling the varied ways people seek information in today’s digital landscape.


Check out the Paper and Model on Hugging Face. All credit for this research goes to the researchers of this project.

