MarkTechPost@AI · July 24, 2024
Visual Haystacks Benchmark: The First “Visual-Centric” Needle-In-A-Haystack (NIAH) Benchmark to Assess LMMs’ Capability in Long-Context Visual Retrieval and Reasoning

The Visual Haystacks benchmark is the first visual-centric needle-in-a-haystack (NIAH) benchmark for evaluating the capability of Large Multimodal Models (LMMs) in long-context visual retrieval and reasoning. It targets a major challenge in Multi-Image Visual Question Answering (MIQA): retrieving and integrating relevant images from large image collections to answer complex visual queries.

🤔 The MIRAGE framework: For the MIQA task, the researchers propose MIRAGE (Multi-Image Retrieval Augmented Generation), which extends the LLaVA model with several innovative components, including a compressive image encoder, a retrieval-based query-aware relevance filter, and augmented training on targeted synthetic and real MIQA data.

🚀 Performance gains: MIRAGE significantly outperforms existing models on the Visual Haystacks benchmark, beating closed-source models such as GPT-4o by up to 11% in accuracy on single-needle questions, while also delivering notable efficiency improvements.

🔍 Efficiency improvements: MIRAGE adopts a compressive image encoding mechanism that uses a Q-former to reduce each image's representation from 576 tokens to 32, allowing the model to process more images within the same context budget. A query-aware relevance filter predicts each image's relevance to the query and selects only the relevant images for detailed analysis.

💡 Augmented training: MIRAGE's training combines existing MIQA datasets with synthetic data derived from single-image QA datasets, improving the model's robustness and performance across varied MIQA scenarios.

📊 The benchmark: The Visual Haystacks dataset contains 880 single-needle and 1,000 multi-needle question-answer pairs, providing a rigorous evaluation framework for MIQA models.

🌟 Significance: The MIRAGE framework represents a major advance in MIQA, addressing the challenge of efficiently retrieving and integrating relevant images from large collections to answer complex visual queries. Its innovative components and robust training methods give it an edge over existing models in both performance and efficiency, paving the way for more effective AI applications in real-world scenarios involving large amounts of visual data.

A significant challenge in the field of visual question answering (VQA) is the task of Multi-Image Visual Question Answering (MIQA). This involves generating relevant and grounded responses to natural language queries based on a large set of images. Existing Large Multimodal Models (LMMs) excel in single-image visual question answering but face substantial difficulties when queries span extensive image collections. Addressing this challenge is crucial for real-world applications like searching through large photo albums, finding specific information across the internet, or monitoring environmental changes through satellite imagery.

Current methods for visual question answering primarily focus on single-image analysis, which limits their utility for more complex queries involving large image sets. Models like Gemini 1.5 Pro and GPT-4V can process multiple images but encounter significant challenges in efficiently retrieving and integrating relevant images from large datasets. These methods are computationally inefficient and exhibit performance degradation as the volume and variability of images increase. They also suffer from positional bias and struggle to integrate visual information across numerous unrelated images, leading to a decline in accuracy and applicability in large-scale tasks.

To address these limitations, researchers from the University of California propose MIRAGE (Multi-Image Retrieval Augmented Generation), a novel framework tailored for MIQA. MIRAGE extends the LLaVA model by integrating several innovative components: a compressive image encoder, a retrieval-based, query-aware relevance filter, and augmented training with targeted synthetic and real MIQA data. These innovations enable MIRAGE to handle larger image contexts efficiently and improve accuracy in answering MIQA tasks. This approach represents a significant contribution to the field, offering up to an 11% accuracy improvement over closed-source models like GPT-4o on the Visual Haystacks (VHs) benchmark, and demonstrating up to 3.4x improvements in efficiency over traditional text-focused multi-stage approaches.
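
To make this retrieve-then-generate flow concrete, here is a minimal sketch of the pipeline the paragraph above describes. The function name and the three callables are hypothetical placeholders standing in for MIRAGE's compressive encoder, relevance filter, and LLaVA-based generator, not the authors' actual API; only the control flow is the point.

```python
# A minimal sketch of a MIRAGE-style retrieval-augmented MIQA pipeline.
# All names here are illustrative placeholders: compress every image,
# score each against the query, generate from the top-k survivors.

from typing import Any, Callable, List, Sequence


def answer_miqa(
    query: str,
    images: Sequence[Any],
    encode: Callable[[Any], Any],               # compressive image encoder
    relevance: Callable[[str, Any], float],     # query-aware relevance filter
    generate: Callable[[str, List[Any]], str],  # LLaVA-style generator
    top_k: int = 8,
) -> str:
    # 1. Compress each image into a short token sequence (e.g., 32 tokens).
    encoded = [encode(img) for img in images]
    # 2. Score every image against the query and keep only the top-k.
    ranked = sorted(range(len(encoded)),
                    key=lambda i: relevance(query, encoded[i]),
                    reverse=True)
    keep = ranked[:top_k]
    # 3. Generate a grounded answer conditioned on the retained images only.
    return generate(query, [encoded[i] for i in keep])
```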

MIRAGE employs a compressive image encoding mechanism using a Q-former to reduce the per-image token count from 576 to 32. This allows the model to handle more images within the same context budget. The query-aware relevance filter is a single-layer MLP that predicts the relevance of images to the query and is used to select relevant images for detailed analysis. The training process involves both existing MIQA datasets and synthetic data derived from single-image QA datasets, enhancing the model's robustness and performance across varied MIQA scenarios. The VHs dataset used for benchmarking contains 880 single-needle and 1,000 multi-needle question-answer pairs, providing a rigorous evaluation framework for MIQA models.
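
The two efficiency components lend themselves to a short illustration. The sketch below is an assumption-laden approximation, not the paper's implementation: a single cross-attention layer stands in for the full Q-former, and the hidden dimension, mean-pooling, and concatenation-based fusion are illustrative choices only.

```python
# Illustrative PyTorch sketch of the two components described above:
# a Q-former-style compressor that cross-attends 32 learned query tokens
# over 576 patch tokens, and a single-layer MLP relevance filter.
# Dimensions and pooling are assumptions, not the paper's configuration.

import torch
import torch.nn as nn


class QFormerCompressor(nn.Module):
    def __init__(self, dim: int = 1024, n_queries: int = 32, n_heads: int = 8):
        super().__init__()
        # 32 learned query tokens that will absorb the image content.
        self.queries = nn.Parameter(torch.randn(n_queries, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)

    def forward(self, patch_tokens: torch.Tensor) -> torch.Tensor:
        # patch_tokens: (batch, 576, dim) -> compressed: (batch, 32, dim)
        q = self.queries.unsqueeze(0).expand(patch_tokens.size(0), -1, -1)
        compressed, _ = self.attn(q, patch_tokens, patch_tokens)
        return compressed


class RelevanceFilter(nn.Module):
    """Single-layer MLP predicting P(image is relevant to the query)."""

    def __init__(self, dim: int = 1024):
        super().__init__()
        self.mlp = nn.Linear(2 * dim, 1)

    def forward(self, image_tokens: torch.Tensor,
                query_emb: torch.Tensor) -> torch.Tensor:
        # Mean-pool the 32 image tokens, concatenate the query embedding,
        # and map the pair to a single relevance probability per image.
        pooled = image_tokens.mean(dim=1)                       # (batch, dim)
        logit = self.mlp(torch.cat([pooled, query_emb], dim=-1))
        return torch.sigmoid(logit).squeeze(-1)                 # (batch,)


# Usage on a toy batch of 4 images:
compressor = QFormerCompressor()
filt = RelevanceFilter()
patches = torch.randn(4, 576, 1024)   # ViT patch tokens for 4 images
query = torch.randn(4, 1024)          # query embedding, repeated per image
tokens = compressor(patches)          # (4, 32, 1024)
scores = filt(tokens, query)          # (4,) relevance scores in [0, 1]
```

Compressing 576 patch tokens down to 32 is an 18x reduction, which is what lets roughly an order of magnitude more images fit into the same context window before generation.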

Evaluation results show that MIRAGE significantly outperforms existing models on the Visual Haystacks benchmark, surpassing closed-source models like GPT-4o by up to 11% in accuracy for single-needle questions and demonstrating notable improvements in efficiency. MIRAGE maintains higher performance levels as the size of the image sets increases, showcasing its robustness in handling extensive visual contexts. It achieved substantial improvements in both accuracy and processing efficiency compared to traditional text-focused multi-stage approaches.

In conclusion, the researchers present a significant advancement in MIQA with the MIRAGE framework, which addresses the critical challenge of efficiently retrieving and integrating relevant images from large datasets to answer complex visual queries. MIRAGE's innovative components and robust training methods lead to superior performance and efficiency compared to existing models, paving the way for more effective AI applications in real-world scenarios involving extensive visual data.


Check out the Paper, Project, GitHub, and Details. All credit for this research goes to the researchers of this project.

