MarkTechPost@AI September 11, 2024
CMU Researchers Introduce MMMU-Pro: An Advanced Version of the Massive Multi-discipline Multimodal Understanding and Reasoning (MMMU) Benchmark for Evaluating Multimodal Understanding in AI Models

CMU researchers introduce MMMU-Pro for evaluating multimodal AI understanding, pointing out shortcomings of current models and presenting results on the new benchmark.

🎯 Multimodal large language models are widely applied, but ensuring that AI genuinely understands multimodal tasks remains a key challenge, and current testing tools fall short.

🌟 Researchers from CMU and other institutions introduce the new MMMU-Pro benchmark, designed to address the weaknesses of earlier tests, developed in collaboration with leading companies, and built around several challenging features.

📊 MMMU-Pro testing reveals the performance limits of many advanced models: accuracy for models such as GPT-4o drops significantly, highlighting their shortcomings in multimodal reasoning.

🧠 Chain of Thought reasoning prompts are introduced to evaluate models; their effect varies by model, underscoring MMMU-Pro's complexity and the challenge it poses to current multimodal models.

💡 MMMU-Pro provides critical insights for evaluating multimodal AI systems, drives future research, and underscores the need to improve AI systems to handle multimodal challenges.

Multimodal large language models (MLLMs) are increasingly applied in diverse fields such as medical image analysis, engineering diagnostics, and even education, where understanding diagrams, charts, and other visual data is essential. The complexity of these tasks requires MLLMs to seamlessly switch between different types of information while performing advanced reasoning.

The primary challenge researchers face in this area has been ensuring that AI models genuinely comprehend multimodal tasks rather than relying on simple statistical patterns to derive answers. Previous benchmarks for evaluating MLLMs allowed models to take shortcuts, sometimes arriving at correct answers by exploiting predictable question structures or correlations without understanding the visual content. This has raised concerns about the actual capabilities of these models in handling real-world multimodal problems effectively.

Existing tools for testing AI models have proven insufficient to address this issue. Current benchmarks fail to differentiate between models that demonstrate true multimodal understanding and those that rely on text-based patterns. The research team therefore highlighted the need for a more robust evaluation system that probes the depth of reasoning and understanding in multimodal contexts, arguing that a more challenging and rigorous approach to assessing MLLMs was necessary.

Researchers from Carnegie Mellon University and other institutions introduced a new benchmark called MMMU-Pro, specifically designed to push the limits of AI systems’ multimodal understanding. This improved benchmark targets the weaknesses in previous tests by filtering out questions solvable by text-only models and increasing the difficulty of multimodal questions. The benchmark was developed with leading companies, including OpenAI, Google, and Anthropic. It introduces features like vision-only input scenarios and multiple-choice questions with augmented options, making it significantly more challenging for models to exploit simple patterns for answers.

The methodology behind MMMU-Pro is thorough and multilayered. The benchmark’s construction involved three primary steps: first, researchers filtered out questions answerable by text-only models by utilizing multiple language models to test each question. Any question that could be consistently answered without visual input was removed. Second, they increased the number of answer options from four to ten in many questions, reducing the effectiveness of random guessing. Finally, they introduced a vision-only input setting, where models were presented with images or screenshots containing the question-and-answer options. This step is crucial as it mimics real-world situations where text and visual information are intertwined, challenging models to understand both modalities simultaneously.
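
As an illustration of that first filtering step, here is a minimal sketch in Python. It assumes a hypothetical data layout (questions as dictionaries with "text", "options", and "answer" keys) and treats each text-only model as a callable; none of this reproduces the authors' actual implementation.

```python
from typing import Callable, Dict, List

# Assumed layout: each question carries its text, answer options, and correct letter.
Question = Dict[str, object]
# A text-only model is modeled as a callable: (question text, options) -> chosen letter.
TextOnlyModel = Callable[[str, List[str]], str]


def is_text_only_solvable(
    question: Question,
    models: List[TextOnlyModel],
    threshold: float = 1.0,
) -> bool:
    """Return True if the text-only models answer correctly without seeing the image."""
    correct = sum(
        1 for ask in models
        if ask(question["text"], question["options"]) == question["answer"]
    )
    return correct / len(models) >= threshold


def filter_benchmark(questions: List[Question], models: List[TextOnlyModel]) -> List[Question]:
    """Keep only the questions that text-only models cannot reliably solve."""
    return [q for q in questions if not is_text_only_solvable(q, models)]
```

The second step works in tandem with this filter: expanding the answer set from four to ten options lowers the random-guess baseline from 25% to 10%, so lucky guesses contribute far less to a model's score.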

In terms of performance, MMMU-Pro revealed the limitations of many state-of-the-art models. The average accuracy for models like GPT-4o, Claude 3.5 Sonnet, and Gemini 1.5 Pro dropped significantly on the new benchmark. For example, GPT-4o fell from 69.1% on the original MMMU benchmark to 54.0% on MMMU-Pro when evaluated with ten candidate options. Claude 3.5 Sonnet, developed by Anthropic, lost 16.8 percentage points, while Google's Gemini 1.5 Pro dropped by 18.9 points. The most drastic decline was observed in VILA-1.5-40B, which fell by 26.9 points. These numbers underscore the benchmark's ability to expose the models' deficiencies in true multimodal reasoning.
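
For a concrete sense of scale, the short snippet below recomputes GPT-4o's gap from the scores quoted above and lists the drops reported for the other models; all values are treated as percentage points.

```python
# GPT-4o scores quoted above, in percent.
gpt4o_mmmu, gpt4o_mmmu_pro = 69.1, 54.0
print(f"GPT-4o: -{gpt4o_mmmu - gpt4o_mmmu_pro:.1f} points")  # -15.1

# Drops reported for the other models, in percentage points.
reported_drops = {"Claude 3.5 Sonnet": 16.8, "Gemini 1.5 Pro": 18.9, "VILA-1.5-40B": 26.9}
for model, drop in sorted(reported_drops.items(), key=lambda kv: kv[1]):
    print(f"{model}: -{drop} points")
```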

Chain of Thought (CoT) reasoning prompts were introduced as part of the evaluation to improve model performance by encouraging step-by-step reasoning. While this strategy showed some improvements, the extent of success varied across models. For instance, Claude 3.5 Sonnet's accuracy increased to 55.0% with CoT, but models like LLaVA-OneVision-72B showed minimal gains, and some models even saw their performance drop. This highlights the complexity of MMMU-Pro and the challenge it poses to current multimodal models.
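
To make the setup concrete, the sketch below contrasts a direct prompt with a CoT prompt for a multiple-choice question. The templates, data layout, and `query_model` callable are assumptions for illustration; they do not reproduce the benchmark's exact prompts or scoring script.

```python
from typing import Callable, Dict, List

DIRECT_TEMPLATE = (
    "Answer the multiple-choice question below with only the letter of the "
    "correct option.\n\n{question}\n\nOptions:\n{options}"
)
COT_TEMPLATE = (
    "Answer the multiple-choice question below. Reason step by step, then finish "
    "with 'Answer: <letter>'.\n\n{question}\n\nOptions:\n{options}"
)


def build_prompt(question: str, options: List[str], use_cot: bool) -> str:
    """Format a direct or Chain-of-Thought prompt for one question."""
    labeled = "\n".join(f"{chr(ord('A') + i)}. {opt}" for i, opt in enumerate(options))
    template = COT_TEMPLATE if use_cot else DIRECT_TEMPLATE
    return template.format(question=question, options=labeled)


def evaluate(
    questions: List[Dict[str, object]],
    query_model: Callable[[str], str],
    use_cot: bool,
) -> float:
    """Return accuracy under one prompting style.

    `query_model` is assumed to return the model's final answer letter,
    with any intermediate reasoning already stripped.
    """
    correct = sum(
        1 for q in questions
        if query_model(build_prompt(q["text"], q["options"], use_cot)) == q["answer"]
    )
    return correct / len(questions)
```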

The MMMU-Pro benchmark provides critical insights into multimodal AI model performance gaps. Despite advances in OCR (Optical Character Recognition) and CoT reasoning, the models still struggled with integrating text and visual elements meaningfully, particularly in vision-only settings where no explicit text was provided. This further emphasizes the need for improved AI systems to handle the full spectrum of multimodal challenges.

In conclusion, MMMU-Pro marks a significant advancement in evaluating multimodal AI systems. It successfully identifies the limitations of existing models, such as their reliance on statistical patterns, and presents a more realistic challenge for assessing true multimodal understanding. This benchmark opens new directions for future research, pushing the development of models better equipped to integrate complex visual and textual data. The research team's work represents an important step forward in the quest for AI systems capable of performing sophisticated reasoning in real-world applications.


Check out the Paper and Leaderboard. All credit for this research goes to the researchers of this project.
