MarkTechPost@AI · May 5, 11:40
Multimodal Queries Require Multimodal RAG: Researchers from KAIST and DeepAuto.ai Propose UniversalRAG—A New Framework That Dynamically Routes Across Modalities and Granularities for Accurate and Efficient Retrieval-Augmented Generation

Researchers from KAIST and DeepAuto.ai propose UniversalRAG, a new RAG framework designed to address the limitations of existing RAG systems when handling multimodal queries. UniversalRAG retrieves and integrates information from knowledge sources spanning multiple modalities, including text, images, and video, and dynamically selects the most relevant modality and granularity level for each query. Through a modality-aware routing mechanism and fine-grained data organization, the framework improves both retrieval precision and efficiency. Across eight multimodal benchmarks, UniversalRAG outperforms conventional unified and modality-specific baselines.

💡UniversalRAG is a retrieval-augmented generation framework that handles queries across multiple modalities and data granularities. Unlike standard RAG models limited to a single corpus, UniversalRAG separates knowledge into text, image, and video corpora, each with fine- and coarse-grained levels.

🧭UniversalRAG uses a modality-aware routing mechanism that dynamically selects the most relevant corpus. The routing module first determines the optimal modality and granularity for a given query, choosing among options such as paragraphs, full documents, video clips, or full videos, and retrieves relevant information accordingly. The router can be either a training-free LLM-based classifier or a model trained with heuristic labels derived from benchmark datasets.

📊The experimental evaluation covers six retrieval scenarios: no retrieval, paragraph, document, image, clip, and video. The no-retrieval setting uses MMLU to test general knowledge. Paragraph-level tasks use SQuAD and Natural Questions, while HotpotQA covers multi-hop document retrieval. Image-based queries come from WebQA, and video-related queries are drawn from the LVBench and VideoRAG datasets, split into clip and full-video levels. A corresponding retrieval corpus is curated for each modality: Wikipedia-based text, WebQA images, and YouTube videos for the video tasks.

✅UniversalRAG addresses issues such as modality gaps and rigid retrieval structures by dynamically routing each query to the most appropriate modality- and granularity-specific corpus. Evaluated on eight multimodal benchmarks, UniversalRAG outperforms unified and modality-specific baselines. The study also highlights the benefits of fine-grained retrieval and shows how both trained and training-free routing mechanisms contribute to robust, flexible multimodal reasoning.

RAG has proven effective in enhancing the factual accuracy of LLMs by grounding their outputs in external, relevant information. However, most existing RAG implementations are limited to text-based corpora, which restricts their applicability to real-world scenarios where queries may require diverse types of information, ranging from textual definitions to spatial understanding from images or temporal reasoning from videos. While some recent approaches have extended RAG to handle different modalities like images and videos, these systems are often constrained to operate within a single modality-specific corpus. This limits their ability to effectively respond to a wide spectrum of user queries that demand multimodal reasoning. Moreover, current RAG methods usually retrieve from all modalities without discerning which is most relevant for a given query, making the process inefficient and less adaptive to specific information needs.

To address this, recent research emphasizes the need for adaptive RAG systems to determine the appropriate modality and retrieval granularity based on the query context. Strategies include routing queries based on complexity, such as deciding between no retrieval, single-step, or multi-step retrieval, and using model confidence to trigger retrieval only when needed. Furthermore, the granularity of retrieval plays a crucial role, as studies have shown that indexing corpora at finer levels, like propositions or specific video clips, can significantly improve retrieval relevance and system performance. Hence, for RAG to truly support complex, real-world information needs, it must handle multiple modalities and adapt its retrieval depth and scope to the specific demands of each query. 
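
To make the idea of query-adaptive retrieval concrete, the sketch below shows one simple way to gate retrieval on model confidence, in the spirit of the strategies described above. The `llm` and `retriever` objects, the `generate_with_confidence` method, and the threshold value are hypothetical illustrations, not the method of any specific paper.

```python
# Minimal sketch of confidence-gated retrieval (assumed interfaces, illustrative threshold).
def answer_adaptively(query: str, llm, retriever, confidence_threshold: float = 0.8) -> str:
    # First attempt: answer directly from the model's parametric knowledge.
    draft, confidence = llm.generate_with_confidence(query)
    if confidence >= confidence_threshold:
        return draft  # Simple, well-known fact: skip retrieval entirely.

    # Low confidence: ground the answer with a single retrieval step.
    passages = retriever.search(query, top_k=5)
    context = "\n\n".join(p.text for p in passages)
    return llm.generate(f"Context:\n{context}\n\nQuestion: {query}")
```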

Researchers from KAIST and DeepAuto.ai introduce UniversalRAG, a RAG framework that retrieves and integrates knowledge from various modality-specific sources (text, image, video) and multiple granularity levels. Unlike traditional approaches that embed all modalities into a shared space, leading to modality bias, UniversalRAG uses a modality-aware routing mechanism to select the most relevant corpus dynamically based on the query. It further enhances retrieval precision by organizing each modality into granularity-specific corpora, such as paragraphs or video clips. Validated on eight multimodal benchmarks, UniversalRAG consistently outperforms unified and modality-specific baselines, demonstrating its adaptability to diverse query needs. 
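
A minimal sketch of what modality- and granularity-specific corpora could look like in code, assuming precomputed embeddings for every item. The `Item` and `Corpus` classes and the corpus names are illustrative, not the authors' implementation; the point is that each corpus is searched in its own embedding space rather than a single shared one.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Item:
    content: str           # paragraph text, document text, image/video reference, ...
    embedding: np.ndarray  # embedding computed within this corpus's own space

class Corpus:
    """One retrieval pool for a single (modality, granularity) pair."""
    def __init__(self, items: list[Item]):
        self.items = items

    def retrieve(self, query_emb: np.ndarray, top_k: int = 3) -> list[Item]:
        # Cosine-similarity search restricted to this corpus, so items from
        # different modalities are never compared against each other.
        sims = [
            float(query_emb @ it.embedding
                  / (np.linalg.norm(query_emb) * np.linalg.norm(it.embedding)))
            for it in self.items
        ]
        order = np.argsort(sims)[::-1][:top_k]
        return [self.items[i] for i in order]

# One corpus per modality and granularity level, as described above.
corpora = {
    "paragraph": Corpus([]),  # fine-grained text
    "document":  Corpus([]),  # coarse-grained text
    "image":     Corpus([]),
    "clip":      Corpus([]),  # fine-grained video
    "video":     Corpus([]),  # coarse-grained video
}
```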

UniversalRAG is a retrieval-augmented generation framework that handles queries across various modalities and data granularities. Unlike standard RAG models limited to a single corpus, UniversalRAG separates knowledge into text, image, and video corpora, each with fine- and coarse-grained levels. A routing module first determines the optimal modality and granularity for a given query, choosing among options like paragraphs, full documents, video clips, or full video, and retrieves relevant information accordingly. This router can be either a training-free LLM-based classifier or a trained model using heuristic labels from benchmark datasets. An LVLM then uses the selected content to generate the final response. 
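
The following sketch shows how the routing step could tie these pieces together, reusing the `corpora` dictionary from the previous sketch. The prompt wording, the fallback behavior, and the `llm`, `lvlm`, and `embed` interfaces are assumptions made for illustration; they stand in for the paper's training-free LLM router and LVLM generator rather than reproducing them.

```python
# Sketch of modality-aware routing followed by retrieval-augmented generation.
ROUTES = ["none", "paragraph", "document", "image", "clip", "video"]

def route_with_llm(query: str, llm) -> str:
    # Training-free variant: ask an LLM to pick the modality/granularity label.
    prompt = (
        "Decide which knowledge source best answers the question.\n"
        f"Options: {', '.join(ROUTES)}.\n"
        f"Question: {query}\nAnswer with one option only:"
    )
    choice = llm.generate(prompt).strip().lower()
    return choice if choice in ROUTES else "paragraph"  # fall back to fine-grained text

def universal_rag_answer(query: str, llm, lvlm, corpora, embed) -> str:
    route = route_with_llm(query, llm)
    if route == "none":
        return lvlm.generate(query)  # answer from parametric knowledge alone
    # Embed the query in the chosen corpus's space, retrieve, then generate.
    retrieved = corpora[route].retrieve(embed(query, route), top_k=3)
    return lvlm.generate(query, context=retrieved)
```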

The experimental setup assesses UniversalRAG across six retrieval scenarios: no retrieval, paragraph, document, image, clip, and video. For no-retrieval, MMLU tests general knowledge. Paragraph-level tasks use SQuAD and Natural Questions, while HotpotQA handles multi-hop document retrieval. Image-based queries come from WebQA, and video-related ones are sourced from LVBench and VideoRAG datasets, split into clip- and full-video levels. Corresponding retrieval corpora are curated for each modality—Wikipedia-based for text, WebQA for images, and YouTube videos for video tasks. This comprehensive benchmark ensures robust evaluation across varied modalities and retrieval granularities.
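
For orientation, the scenarios and benchmarks listed above can be summarized as a simple mapping; this is an informal restatement only, and the paper should be consulted for exact dataset splits and corpus construction.

```python
# Informal summary of the evaluation setup: retrieval scenario -> benchmarks.
SCENARIO_BENCHMARKS = {
    "no_retrieval": ["MMLU"],
    "paragraph":    ["SQuAD", "Natural Questions"],
    "document":     ["HotpotQA"],
    "image":        ["WebQA"],
    "clip":         ["LVBench", "VideoRAG"],  # clip-level video queries
    "video":        ["LVBench", "VideoRAG"],  # full-video queries
}
```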


In conclusion, UniversalRAG is a retrieval-augmented generation framework that can retrieve knowledge from multiple modalities and levels of granularity. Unlike existing RAG methods that rely on a single, often text-only, corpus or a single-modality source, UniversalRAG dynamically routes queries to the most appropriate modality- and granularity-specific corpus. This approach addresses issues like modality gaps and rigid retrieval structures. Evaluated on eight multimodal benchmarks, UniversalRAG outperforms both unified and modality-specific baselines. The study also emphasizes the benefits of fine-grained retrieval and highlights how both trained and training-free routing mechanisms contribute to robust, flexible multimodal reasoning.


Check out the Paper.



