MarkTechPost@AI, December 7, 2024
Advancing Large Multimodal Models: DocHaystack, InfoHaystack, and the Vision-Centric Retrieval-Augmented Generation Framework

This article introduces DocHaystack and InfoHaystack, two new benchmarks for evaluating large multimodal models (LMMs) on large-scale document retrieval and reasoning tasks. Existing multi-image question-answering benchmarks are limited to small datasets and fail to reflect real-world complexity. To address this, the researchers propose V-RAG, a vision-centric retrieval-augmented generation framework that integrates multiple vision encoders with a relevance filtering module to improve retrieval precision and reasoning ability. V-RAG performs strongly on the DocHaystack-1000 and InfoHaystack-1000 benchmarks, improving Recall@1 by 9% and 11% respectively and significantly boosting LMM performance in large-scale image retrieval and complex reasoning scenarios.

🔍 DocHaystack and InfoHaystack are two new benchmarks that evaluate LMMs on large-scale visual document retrieval and reasoning over collections of up to 1,000 documents, simulating real-world scenarios and addressing the limitations of small datasets.

🧠 V-RAG is a vision-centric retrieval-augmented generation framework that combines specialized vision encoders with a relevance assessment module, augmenting LMMs with a retrieval system so they can effectively handle large image-text collections.

📈 On the DocHaystack-1000 and InfoHaystack-1000 benchmarks, V-RAG improves Recall@1 by 9% and 11% respectively, substantially strengthening LMM retrieval and reasoning and setting a new standard for large-scale visual retrieval and reasoning.

📑 To sharpen document retrieval and reasoning, the DocHaystack and InfoHaystack benchmarks ensure every question has a unique, document-specific answer, resolving ambiguity through a three-step curation pipeline: filtering overly general questions with an LLM, manually reviewing questions for specificity, and removing questions answerable from general knowledge alone.

🔬 The experiments section details the training setup, metrics, baselines, and results for V-RAG, which outperforms baselines such as BM25, CLIP, and OpenCLIP on the DocHaystack and InfoHaystack benchmarks with higher recall and accuracy; fine-tuning with curated distractor images further improves VQA robustness.

LMMs have made significant strides in vision-language understanding but still struggle to reason over large-scale image collections, which limits real-world applications such as visual search and querying extensive datasets like personal photo libraries. Existing benchmarks for multi-image question-answering are constrained, typically involving at most 30 images per question, and therefore fail to capture the complexities of large-scale retrieval tasks. To overcome these limitations, the new DocHaystack and InfoHaystack benchmarks require models to retrieve and reason across collections of up to 1,000 documents. This shift introduces new challenges and significantly expands the scope of visual question-answering and retrieval tasks.

Retrieval-augmented generation (RAG) frameworks enhance LMMs by integrating retrieval systems with generative models, enabling them to process extensive image-text datasets effectively. While RAG approaches have been widely explored in text-based tasks, their application in vision-language contexts has gained momentum with models like MuRAG, RetVQA, and MIRAGE. These frameworks utilize advanced retrieval methods, such as relevance encoders and CLIP-based training, to filter and process large image collections. Building on these advancements, the proposed V-RAG framework leverages multiple vision encoders and introduces a question-document relevance module, offering superior performance on the DocHaystack and InfoHaystack benchmarks. This sets a new standard for large-scale visual retrieval and reasoning, addressing critical gaps in existing LMM capabilities.
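
The paper does not spell out implementation details here, but the core idea of fusing several vision encoders into a single question-document relevance score can be sketched roughly as below. The checkpoints, the equal-weight averaging, and the function name `ensemble_scores` are assumptions for illustration, not the authors' configuration.

```python
# Illustrative sketch only: ensemble question-document scoring with two CLIP-style
# encoders. Checkpoints and equal weighting are assumptions, not the authors' setup.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

ENCODER_NAMES = [
    "openai/clip-vit-base-patch32",           # assumed checkpoint
    "laion/CLIP-ViT-B-32-laion2B-s34B-b79K",  # assumed checkpoint
]

def ensemble_scores(question: str, image_paths: list[str]) -> torch.Tensor:
    """Return one relevance score per document image, averaged over the encoders."""
    images = [Image.open(p).convert("RGB") for p in image_paths]
    per_encoder = []
    for name in ENCODER_NAMES:
        model = CLIPModel.from_pretrained(name).eval()
        processor = CLIPProcessor.from_pretrained(name)
        with torch.no_grad():
            text_inputs = processor(text=[question], return_tensors="pt",
                                    padding=True, truncation=True)
            image_inputs = processor(images=images, return_tensors="pt")
            q = model.get_text_features(**text_inputs)
            d = model.get_image_features(**image_inputs)
        q = q / q.norm(dim=-1, keepdim=True)       # unit-normalize embeddings
        d = d / d.norm(dim=-1, keepdim=True)
        per_encoder.append((d @ q.T).squeeze(-1))  # cosine similarity per image
    return torch.stack(per_encoder).mean(dim=0)    # simple equal-weight ensemble
```

Taking the top-k documents under such scores would give the candidate set that the filtering stage described below operates on.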

Researchers from KAUST, the University of Sydney, and IHPC, A*STAR, introduced two benchmarks, DocHaystack and InfoHaystack, to evaluate LMMs on large-scale visual document retrieval and reasoning tasks. These benchmarks simulate real-world scenarios by requiring models to process up to 1,000 documents per query, addressing the limitations of smaller datasets. They also proposed V-RAG, a vision-centric retrieval-augmented generation framework that combines specialized vision encoders and a relevance assessment module. V-RAG achieved a 9% and 11% improvement in Recall@1 on the DocHaystack-1000 and InfoHaystack-1000 benchmarks, significantly advancing retrieval and reasoning capabilities for LMMs.

To improve document retrieval and reasoning, the DocHaystack and InfoHaystack benchmarks ensure each question yields a unique, document-specific answer. These benchmarks address ambiguity using a three-step curation pipeline: filtering general questions with an LLM, manual review for specificity, and removing questions answerable through general knowledge. The Vision-centric Retrieval-Augmented Generation (V-RAG) framework enhances retrieval from extensive datasets using a vision encoder ensemble and an LLM-based filtering module. Relevant documents are ranked and refined to focus on specific subsets. Questions and selected documents are then processed by LLMs for accurate answers, emphasizing vision-based understanding.
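
As a rough schematic of how those stages might fit together (reusing the `ensemble_scores` sketch above), the flow below ranks the corpus, filters the top candidates with a VLM-based relevance check, and passes the survivors to the answering model. The helpers `vlm_is_relevant`, `lmm_answer`, and `v_rag_answer` are hypothetical stand-ins; the paper does not prescribe this exact interface.

```python
# Schematic retrieve -> filter -> answer flow. Helper names are hypothetical.
def vlm_is_relevant(question: str, image_path: str) -> bool:
    # Stub: in V-RAG this step queries a multimodal LLM to judge whether the
    # document actually helps answer the question; here we simply keep everything.
    return True

def lmm_answer(question: str, image_paths: list[str]) -> str:
    # Stub: in practice this prompts the answering LMM with the question plus
    # the filtered document images.
    return f"[answer to {question!r} grounded in {len(image_paths)} document(s)]"

def v_rag_answer(question: str, corpus: list[str], top_k: int = 5) -> str:
    # 1) Coarse retrieval: rank every document with the vision-encoder ensemble.
    scores = ensemble_scores(question, corpus)
    ranked = [corpus[i] for i in scores.argsort(descending=True).tolist()]
    # 2) Fine filtering: keep only top-k candidates the VLM judges relevant.
    kept = [p for p in ranked[:top_k] if vlm_is_relevant(question, p)] or ranked[:1]
    # 3) Generation: answer from the filtered subset only.
    return lmm_answer(question, kept)
```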

The experiments section details the training setup, metrics, baselines, and results for evaluating the V-RAG framework. Metrics include Recall@1, @3, and @5 for document retrieval and a GPT-4o-mini-based model evaluation for VQA tasks. V-RAG outperforms baselines like BM25, CLIP, and OpenCLIP across DocHaystack and InfoHaystack benchmarks, achieving superior recall and accuracy scores. Fine-tuning with curated distractor images enhances VQA robustness. Ablation studies reveal the importance of combining multiple encoders and the VLM-filter module, significantly improving retrieval accuracy. V-RAG’s top performance across challenging benchmarks highlights its effectiveness in large-scale multimodal document understanding and retrieval tasks.
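
For reference, Recall@k here simply measures how often the single ground-truth document appears among the top-k retrieved candidates; a minimal implementation of the metric (ours, not the authors') looks like this:

```python
def recall_at_k(ranked_ids: list[list[str]], gold_ids: list[str], k: int) -> float:
    """Fraction of questions whose ground-truth document appears in the top-k list."""
    hits = sum(gold in ranked[:k] for ranked, gold in zip(ranked_ids, gold_ids))
    return hits / len(gold_ids)

# Toy example with three questions (illustrative data, not benchmark results):
ranked = [["d7", "d2", "d9"], ["d4", "d1", "d3"], ["d5", "d8", "d6"]]
gold = ["d7", "d1", "d6"]
print(recall_at_k(ranked, gold, 1))  # 0.33... (only the first question hits at rank 1)
print(recall_at_k(ranked, gold, 3))  # 1.0
```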

In conclusion, the study introduces DocHaystack and InfoHaystack, benchmarks designed to assess LMMs in large-scale document retrieval and reasoning tasks. Current benchmarks for multi-image question-answering are limited to small datasets, failing to reflect real-world complexities. The proposed V-RAG framework integrates multiple vision encoders and a relevance filtering module to address this, enhancing retrieval precision and reasoning capabilities. V-RAG outperforms baseline models, achieving up to 11% higher Recall@1 on the DocHaystack-1000 and InfoHaystack-1000 benchmarks. By enabling efficient processing of thousands of images, V-RAG significantly improves LMM performance in large-scale image retrieval and complex reasoning scenarios.


Check out the Paper. All credit for this research goes to the researchers of this project.


Tags

Multimodal Models, Document Retrieval, V-RAG Framework, Visual Question Answering, Deep Learning