MarkTechPost@AI, July 23, 2024
Microsoft Research Introduces E5-V: A Universal AI Framework for Multimodal Embeddings with Single-Modality Training on Text Pairs

Researchers from Microsoft Research and Beihang University have released the E5-V framework, a new multimodal large language model (MLLM) framework designed to achieve universal multimodal embeddings. By leveraging single-modality training on text pairs, the framework significantly reduces training costs and eliminates the need for multimodal data collection. E5-V's prompt-based representation method unifies multimodal embeddings into a single space, improving the robustness and versatility of multimodal representations.

🤔 E5-V targets a core challenge in multimodal learning: representing multimodal information effectively. Unlike traditional models that require training on large amounts of multimodal data, E5-V achieves multimodal embeddings through single-modality training on text pairs, lowering training costs and simplifying the training process.

💡 The core innovation of E5-V is a new prompt-based representation method that expresses multimodal inputs as words, effectively removing the gap between modalities. This enables the model to handle complex visual-language tasks such as composed image retrieval.

💪 E5-V performs strongly across a range of tasks, including text-image retrieval, composed image retrieval, sentence embeddings, and image-image retrieval. In zero-shot image retrieval, E5-V surpasses CLIP ViT-L by 12.2% on Flickr30K and 15.0% on COCO at Recall@1. In composed image retrieval, E5-V outperforms the current state-of-the-art method iSEARLE-XL on the CIRR dataset.

🚀 The E5-V framework marks a significant advance in multimodal learning. It offers a more effective and efficient solution for tasks that require integrated visual and language understanding, paving the way for future progress in artificial intelligence.

A major development in artificial intelligence, multimodal large language models (MLLMs) combine verbal and visual comprehension to produce more accurate representations of multimodal inputs. Through the integration of data from multiple sources, including text and images, these models improve understanding of intricate relationships between various modalities. Because of this integration, sophisticated tasks requiring a thorough comprehension of many kinds of data are now possible. As a result, MLLMs are a critical area of interest for contemporary AI research.

A primary challenge in multimodal learning is achieving effective representation of multimodal information. Current research includes frameworks like CLIP, which aligns visual and language representations using contrastive learning on image-text pairs. Models such as BLIP, KOSMOS, LLaMA-Adapter, and LLaVA extend LLMs to handle multimodal information. These methods often use separate encoders for text and images, leading to poor integration of interleaved inputs. Moreover, they require extensive, costly multimodal training data and struggle with comprehensive language understanding and complex visual-linguistic tasks, falling short of universal, efficient multimodal embeddings.

To address these limitations, researchers from Beihang University and Microsoft Corporation introduced the E5-V framework, designed to adapt MLLMs for universal multimodal embeddings. This innovative approach leverages single-modality training on text pairs, significantly reducing training costs and eliminating the need for multimodal data collection. By focusing on text pairs, the E5-V framework demonstrates substantial improvements in representing multimodal inputs compared to traditional methods, offering a promising alternative for future developments in the field.
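As a rough illustration of what single-modality training on text pairs can look like, the sketch below applies a symmetric InfoNCE-style contrastive loss to embeddings of paired sentences. The `embed` function, batch format, and temperature value are illustrative assumptions, not the paper's exact recipe.

```python
# Sketch of contrastive training on text pairs only (no images involved).
# `embed` stands for any encoder that maps sentences to fixed-size vectors,
# e.g. the prompt-based MLLM embedding described below; it is assumed here.
import torch
import torch.nn.functional as F

def contrastive_loss(anchors: torch.Tensor, positives: torch.Tensor,
                     temperature: float = 0.05) -> torch.Tensor:
    """Symmetric InfoNCE over a batch of (anchor, positive) text embeddings."""
    a = F.normalize(anchors, dim=-1)
    p = F.normalize(positives, dim=-1)
    logits = a @ p.T / temperature          # [batch, batch] similarity matrix
    targets = torch.arange(a.size(0), device=a.device)
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.T, targets)) / 2

# Usage with a hypothetical `embed` function and a batch of sentence pairs:
# loss = contrastive_loss(embed(batch["sentence1"]), embed(batch["sentence2"]))
# loss.backward()
```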

The E5-V framework employs a novel prompt-based representation method to unify multimodal embeddings into a single space. During training, the model exclusively uses text pairs, simplifying the process and cutting the costs associated with collecting multimodal data. The key innovation lies in instructing MLLMs to represent multimodal inputs as words, effectively removing the modality gap. This method allows the model to handle tasks that demand fine-grained accuracy, such as composed image retrieval. By unifying different embeddings into the same space based on their meanings, the E5-V framework enhances the robustness and versatility of multimodal representations.
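To make the prompt-based idea concrete, here is a minimal sketch of how one might extract unified embeddings from a LLaVA-style MLLM with Hugging Face Transformers: a "summarize in one word" style prompt is applied to either a sentence or an image, and the hidden state of the final token is taken as the embedding. The checkpoint name and exact prompt wording are assumptions for illustration, not the released E5-V configuration.

```python
# Minimal sketch of prompt-based multimodal embedding extraction.
# Model ID and prompt templates below are illustrative assumptions.
import torch
from PIL import Image
from transformers import LlavaNextProcessor, LlavaNextForConditionalGeneration

model_id = "llava-hf/llava-v1.6-mistral-7b-hf"  # assumed LLaVA-NeXT checkpoint
processor = LlavaNextProcessor.from_pretrained(model_id)
model = LlavaNextForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

def embed_text(sentence: str) -> torch.Tensor:
    # Ask the MLLM to compress the sentence into a single word, then take
    # the last layer's hidden state of the final token as the embedding.
    prompt = f"[INST] {sentence}\nSummarize the above sentence in one word: [/INST]"
    inputs = processor(text=prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    return out.hidden_states[-1][0, -1]

def embed_image(image: Image.Image) -> torch.Tensor:
    # The same "one word" prompt applied to an image maps it into the
    # same embedding space as text, without any image-text training.
    prompt = "[INST] <image>\nSummarize the above image in one word: [/INST]"
    inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    return out.hidden_states[-1][0, -1]

# Retrieval then reduces to cosine similarity between unified embeddings.
sim = torch.nn.functional.cosine_similarity(
    embed_text("a dog running on the beach").unsqueeze(0),
    embed_image(Image.open("photo.jpg")).unsqueeze(0),
)
```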

E5-V has demonstrated impressive performance across various tasks, including text-image retrieval, composed image retrieval, sentence embeddings, and image-image retrieval. The framework surpasses state-of-the-art models in several benchmarks. For instance, in zero-shot image retrieval tasks, E5-V outperforms CLIP ViT-L by 12.2% on Flickr30K and 15.0% on COCO with Recall@1, showcasing its superior ability to integrate visual and language information. Furthermore, E5-V significantly improves composed image retrieval tasks, outperforming the current state-of-the-art method iSEARLE-XL by 8.50% on Recall@1 and 10.07% on Recall@5 on the CIRR dataset. These results underscore the framework’s effectiveness in accurately representing interleaved inputs and complex interactions.

The researchers conducted extensive experiments to validate the effectiveness of E5-V. In text-image retrieval tasks, E5-V achieved competitive performance on the Flickr30K and COCO datasets. For example, E5-V demonstrated a Recall@10 of 98.7% on Flickr30K, outperforming models trained on image-text pairs. In composed image retrieval tasks, E5-V showed remarkable improvements, with Recall@10 scores of 75.88% on CIRR and 53.78% on FashionIQ, significantly higher than those of existing baselines. These results highlight E5-V’s ability to accurately represent multimodal information without requiring additional fine-tuning or complex training data.
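For reference, retrieval metrics such as Recall@1 and Recall@10 are computed by ranking every gallery item against each query embedding and counting a hit when the ground-truth item lands in the top K. The generic sketch below shows one way to do this with cosine similarity; tensor shapes and variable names are illustrative.

```python
import torch
import torch.nn.functional as F

def recall_at_k(query_emb: torch.Tensor, gallery_emb: torch.Tensor,
                gt_index: torch.Tensor, k: int) -> float:
    """Fraction of queries whose ground-truth gallery item ranks in the top k.

    query_emb:   [num_queries, dim]   e.g. text or composed-query embeddings
    gallery_emb: [num_items, dim]     e.g. image embeddings
    gt_index:    [num_queries]        index of the correct gallery item
    """
    sims = F.normalize(query_emb, dim=-1) @ F.normalize(gallery_emb, dim=-1).T
    topk = sims.topk(k, dim=-1).indices                 # [num_queries, k]
    hits = (topk == gt_index.unsqueeze(-1)).any(dim=-1)
    return hits.float().mean().item()

# Example: recall_at_k(text_embs, image_embs, ground_truth_ids, k=10)
```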

In conclusion, the E5-V framework represents a significant advancement in multimodal learning. By leveraging single-modality training and a prompt-based representation method, E5-V addresses the limitations of traditional approaches, providing a more efficient and effective solution for multimodal embeddings. This research demonstrates the potential of MLLMs to revolutionize tasks that require integrated visual and language understanding, paving the way for future innovations in artificial intelligence. The work of the research teams from Beihang University and Microsoft Corporation sets a new benchmark for multimodal models.


Check out the Paper, Model Card, and GitHub. All credit for this research goes to the researchers of this project.

