MarkTechPost@AI · October 14, 2024
Researchers from UCLA and Stanford Introduce MRAG-Bench: An AI Benchmark Specifically Designed for Vision-Centric Evaluation for Retrieval-Augmented Multimodal Models

Researchers from UCLA and Stanford have introduced MRAG-Bench, a vision-centric benchmark for evaluating how effectively large vision-language models handle scenarios where visual information offers a clearer advantage than textual knowledge. The benchmark comprises a large set of images and multiple-choice questions, systematically categorizes the scenarios, and evaluates a range of models. The results show that visually augmented knowledge significantly improves model performance, but current models still trail humans in exploiting visual knowledge.

🧐 MRAG-Bench is a vision-centric benchmark designed to evaluate large vision-language models in scenarios where visual information offers a clear advantage; it contains 16,130 images and 1,353 human-annotated multiple-choice questions spanning nine distinct scenarios.

🎯 The benchmark systematically groups the scenarios into two main aspects: perspective changes, covering different angles or occlusions of visual entities, and transformative changes, covering temporal or physical transformations of objects. It also provides 9,673 human-curated ground-truth images.

📊 MRAG-Bench evaluates 10 open-source and 4 proprietary large vision-language models. The results show that visually augmented knowledge significantly improves model performance, but current models remain far less effective than humans at using visual knowledge, and proprietary models are better than open-source models at distinguishing high-quality from noisy visual information.

Current multimodal retrieval-augmented generation (RAG) benchmarks primarily focus on textual knowledge retrieval for question answering, which presents significant limitations. In many scenarios, retrieving visual information is more beneficial or easier than accessing textual data. Existing benchmarks fail to adequately account for these situations, hindering the development of large vision-language models (LVLMs) that need to utilize diverse types of information effectively.

Researchers from UCLA and Stanford introduced MRAG-Bench, a vision-centric benchmark designed to evaluate the effectiveness of LVLMs in scenarios where visual information provides a clear advantage over textual knowledge. MRAG-Bench consists of 16,130 images and 1,353 human-annotated multiple-choice questions across nine distinct scenarios, focusing on when visual knowledge is more beneficial. The benchmark systematically categorizes scenarios into two main aspects: perspective changes, which involve different angles or occlusions of visual entities, and transformative changes, which include temporal or physical transformations of objects. MRAG-Bench evaluates 10 open-source and four proprietary LVLMs, providing insights into their ability to utilize visually augmented knowledge.

The structure of MRAG-Bench is centered around nine distinct scenarios divided into perspective understanding and transformative understanding aspects. The perspective aspect comprises four categories: Angle, Partial, Scope, and Occlusion. These categories challenge models to reason about entities when the visual input varies in viewpoints, levels of visibility, or resolution. The transformative aspect focuses on temporal, biological, and physical changes, requiring models to interpret visual entities undergoing significant transformations. Additionally, MRAG-Bench provides a clean, human-curated set of 9,673 ground-truth images, ensuring that the benchmark aligns with real-world visual understanding scenarios.
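
To make the setup concrete, here is a minimal sketch of how one MRAG-Bench-style example could be represented and turned into a vision-augmented query. The field names (`gt_images`, `scenario`, `aspect`) and the prompt layout are illustrative assumptions, not the benchmark's actual schema or API.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class MRAGExample:
    """Hypothetical record layout for one vision-centric multiple-choice question."""
    question: str                  # e.g. "Which landmark is shown from this angle?"
    choices: List[str]             # answer options (A-D)
    answer: str                    # ground-truth option letter
    query_image: str               # path to the image the question asks about
    gt_images: List[str] = field(default_factory=list)  # human-curated ground-truth retrievals
    scenario: str = "Angle"        # one of the nine scenarios, e.g. Angle, Occlusion, Temporal
    aspect: str = "perspective"    # "perspective" or "transformative"


def build_multimodal_prompt(ex: MRAGExample, retrieved_images: List[str]) -> dict:
    """Assemble a vision-augmented query: query image + retrieved images + MCQ text."""
    options = "\n".join(f"({chr(65 + i)}) {c}" for i, c in enumerate(ex.choices))
    return {
        # Visual knowledge stands in for the textual passages a text-RAG pipeline would retrieve.
        "images": [ex.query_image, *retrieved_images],
        "text": f"{ex.question}\nOptions:\n{options}\nAnswer with a single letter.",
    }
```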

The evaluation results reveal that visually augmented knowledge significantly enhances model performance compared to textual augmentation. All evaluated LVLMs showed greater improvements when augmented with images, confirming the vision-centric nature of MRAG-Bench. Notably, the best-performing proprietary model, GPT-4o, achieved only a 5.82% improvement in performance with ground-truth visual augmentation compared to a 33.16% improvement demonstrated by human participants, indicating that current models are far from effectively leveraging visual knowledge as humans do. Furthermore, the results indicate that proprietary models are better at distinguishing between high-quality and noisy visual information compared to open-source models, which often struggle with utilizing retrieved knowledge effectively.
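
The reported gains can be read as the difference in multiple-choice accuracy between a model answering without retrieval and the same model given ground-truth images. A minimal sketch of that comparison is below; it assumes absolute percentage-point gains and hypothetical `predict_baseline` / `predict_augmented` callables, since the paper's exact scoring code is not reproduced here.

```python
from typing import Callable, Dict, List


def accuracy(predictions: List[str], answers: List[str]) -> float:
    """Fraction of multiple-choice questions answered correctly."""
    return sum(p == a for p, a in zip(predictions, answers)) / len(answers)


def augmentation_gain(
    predict_baseline: Callable[[Dict], str],   # model prompted without retrieved images
    predict_augmented: Callable[[Dict], str],  # same model prompted with ground-truth images
    examples: List[Dict],
    answers: List[str],
) -> float:
    """Accuracy gain, in percentage points, attributable to visual augmentation."""
    base = accuracy([predict_baseline(ex) for ex in examples], answers)
    aug = accuracy([predict_augmented(ex) for ex in examples], answers)
    return 100.0 * (aug - base)
```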

In conclusion, MRAG-Bench provides a novel vision-centric evaluation framework for assessing LVLMs, focusing on scenarios where visual retrieval surpasses textual knowledge. The findings highlight the critical gap between human performance and current models’ capabilities in effectively using retrieved visual information. The introduction of MRAG-Bench is an important step towards encouraging the development of LVLMs that can better leverage visual knowledge, with the ultimate goal of creating models that understand and utilize multimodal information as effectively as humans.


Check out the Paper, Dataset, GitHub, and Project. All credit for this research goes to the researchers of this project.



