MarkTechPost@AI · July 24, 14:15
GPT-4o Understands Text, But Does It See Clearly? A Benchmarking Study of MFMs on Vision Tasks

Large multimodal foundation models (MFMs) such as GPT-4o and Gemini have recently shown impressive language abilities, but their visual understanding still lacks thorough evaluation. Existing benchmarks focus mostly on text-output tasks, which makes it hard to measure MFMs' purely visual skills fairly and limits comparison with vision-specialist models. Researchers at EPFL developed a new evaluation framework that translates vision tasks into text-compatible formats and found that MFMs trail specialist models on geometric tasks, even though GPT-4o performed best among them. The study highlights MFMs' strengths on semantic tasks while pointing out their limits in precise visual reasoning and their inference cost, offering a new way to evaluate the visual capabilities of future MFMs.

🔑 **MFMs still face challenges in visual understanding**: Although models such as GPT-4o display strong language abilities in public demos, their true level of visual understanding, especially on core vision tasks such as 3D perception, segmentation, and grouping, has not been adequately evaluated. Existing benchmarks are mostly text-output oriented, which limits deeper insight into the models' visual capabilities.

📊 **Current evaluation methods prevent fair comparison of MFMs**: Existing vision benchmarks such as VQA and classification tend to reflect language ability rather than purely visual ability. Converting vision tasks into text outputs lets MFMs participate, but the conversion narrows what can be evaluated and makes direct comparison between MFMs and specialist vision models difficult.

🔬 **EPFL researchers propose a novel evaluation framework**: The EPFL team developed a "prompt chaining" framework that decomposes complex vision tasks (such as segmentation and object detection) into smaller, text-compatible subtasks that MFMs can handle. For example, recursive cropping and superpixel segmentation are used to access finer-grained image information, sidestepping the models' inability to output pixel-level predictions directly.

⚖️ **MFMs are capable generalists but trail vision specialists**: The evaluation shows that MFMs generally underperform models purpose-built for vision on core computer vision tasks. Although GPT-4o led on most tasks, its accuracy on image classification and object detection still lags behind specialist models such as ViT-G and Co-DETR.

💡 **The study lays a foundation for evaluating MFMs' visual capabilities**: Beyond revealing the strengths and weaknesses of MFMs on vision tasks, the work provides a unified evaluation framework for measuring, and eventually improving, MFMs' visual understanding more accurately. It emphasizes MFMs' advantage on semantic tasks while noting their shortcomings on geometric tasks and in cost efficiency, pointing the way for future model improvements.

Multimodal foundation models (MFMs) like GPT-4o, Gemini, and Claude have shown rapid progress recently, especially in public demos. While their language skills are well studied, their true ability to understand visual information remains unclear. Most benchmarks used today focus heavily on text-based tasks, such as VQA or classification, which often reflect language strengths more than visual capabilities. These tests also require text outputs, making it difficult to fairly assess visual skills or compare MFMs with vision-specific models. Moreover, critical aspects such as 3D perception, segmentation, and grouping, which are core to visual understanding, are still largely overlooked in current evaluations. 

MFMs have demonstrated strong performance in tasks that combine visual and language understanding, such as captioning and visual question answering. However, their effectiveness in tasks that require detailed visual comprehension remains unclear. Most current benchmarks rely on text-based outputs, making it difficult to compare MFMs fairly with vision-only models. Some studies adapt vision datasets for MFMs by converting annotations into text, but this conversion restricts evaluation to language outputs. Prompting strategies have also been explored to help MFMs tackle visual tasks by breaking them into manageable subtasks, though reproducibility remains a challenge in some cases.

Researchers at EPFL evaluated several popular multimodal foundation models, including GPT-4o, Gemini 2.0 Flash, and Claude 3.5 Sonnet, on core computer vision tasks such as segmentation, object detection, and depth prediction, using datasets like COCO and ImageNet. Since most MFMs are designed to output text and are accessible only via APIs, the team developed a prompt-chaining framework to translate these visual tasks into text-compatible formats. Their findings show that while MFMs are competent generalists, they fall short of specialized vision models, especially on geometric tasks. GPT-4o stood out, performing best in 4 of the 6 tasks. The evaluation toolkit will be open-sourced.
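
Because the models only return text through their APIs, any such framework has to wrap each vision task in a loop of textual queries. The sketch below is a minimal illustration of that idea for image classification, not the authors' toolkit: the `query_mfm` callable, the label batching, and the prompt wording are all assumptions made for the example.

```python
from typing import Callable, Sequence

def classify_via_prompt_chain(
    image_b64: str,
    labels: Sequence[str],
    query_mfm: Callable[[str, str], str],  # hypothetical: (image, prompt) -> text answer
    batch_size: int = 20,
) -> str:
    """Narrow a large label set down to one class using only text-in/text-out queries.

    Each round shows the model a batch of candidate labels and keeps the ones it
    says could match, until a single label (or a small set) remains.
    """
    candidates = list(labels)
    while len(candidates) > 1:
        survivors = []
        for i in range(0, len(candidates), batch_size):
            batch = candidates[i:i + batch_size]
            prompt = (
                "Which of the following labels could describe the main object in "
                f"the image? Answer with the labels only, comma-separated: {', '.join(batch)}"
            )
            answer = query_mfm(image_b64, prompt)
            survivors.extend(lbl for lbl in batch if lbl.lower() in answer.lower())
        if not survivors or len(survivors) == len(candidates):
            break  # no progress; fall back to one final single-choice question
        candidates = survivors
    final = query_mfm(image_b64, f"Pick exactly one label for the image: {', '.join(candidates)}")
    return next((lbl for lbl in candidates if lbl.lower() in final.lower()), candidates[0])
```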

To evaluate MFMs on vision tasks, the study designed a prompt chaining strategy, breaking complex tasks into simpler, language-friendly subtasks. For example, instead of predicting bounding boxes directly, the model first identifies present objects, then locates them through recursive image cropping. For segmentation and grouping, images are divided into superpixels, which are easier to label and compare. Depth and surface normals are estimated using pairwise rankings of superpixel regions. This modular design leverages MFMs’ strength in classification and similarity, while calibration controls ensure fair comparisons. The method is flexible, and performance improves with finer-grained prompting. 
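
As a concrete illustration of the recursive-cropping idea, the sketch below localizes a single object by repeatedly asking the model which quadrant of the current crop contains it. This is a simplified reconstruction under stated assumptions, not the paper's implementation: `ask_quadrant` stands in for an MFM query expected to answer "top-left", "top-right", "bottom-left", "bottom-right", or "none", and the stopping size is arbitrary.

```python
from typing import Callable, Optional, Tuple
from PIL import Image

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in original-image pixels

def locate_by_recursive_cropping(
    image: Image.Image,
    object_name: str,
    ask_quadrant: Callable[[Image.Image, str], str],  # hypothetical MFM query
    min_size: int = 64,
) -> Optional[Box]:
    """Approximate a bounding box by zooming into the quadrant the model picks."""
    left, top, right, bottom = 0, 0, image.width, image.height
    while min(right - left, bottom - top) > min_size:
        crop = image.crop((left, top, right, bottom))
        answer = ask_quadrant(crop, object_name).strip().lower()
        mid_x, mid_y = (left + right) // 2, (top + bottom) // 2
        if answer == "top-left":
            right, bottom = mid_x, mid_y
        elif answer == "top-right":
            left, bottom = mid_x, mid_y
        elif answer == "bottom-left":
            right, top = mid_x, mid_y
        elif answer == "bottom-right":
            left, top = mid_x, mid_y
        elif answer == "none":
            return None   # object not present in this region
        else:
            break         # unparseable answer: stop refining, return current box
    return (left, top, right, bottom)
```

The actual pipeline is richer (it first lists the objects present and applies calibration controls), but the loop above captures why finer-grained prompting trades extra API calls for better localization.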

The study evaluates the MFMs, including GPT-4o, Gemini 2.0 Flash, and Claude 3.5 Sonnet, across multiple tasks such as image classification, object detection, and semantic segmentation, using datasets like ImageNet, COCO, and Hypersim. GPT-4o reaches 77.2% top-1 accuracy on ImageNet and 60.62 AP50 for object detection, well below specialist models such as ViT-G (90.94% accuracy) and Co-DETR (91.30 AP50). On semantic segmentation, GPT-4o scores 44.89 mIoU, while OneFormer leads with 65.52. MFMs handle distribution shifts reasonably well but lag on precise visual reasoning. The study also introduces prompt chaining and oracle baselines to estimate upper-bound performance.
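
For reference, the mIoU figures quoted above are the standard class-averaged intersection-over-union. A minimal way to compute it, assuming the predictions and ground truth are available as per-pixel class-id arrays (an assumption about the setup, not the paper's released code), is:

```python
import numpy as np

def mean_iou(pred: np.ndarray, gt: np.ndarray, num_classes: int, ignore_index: int = 255) -> float:
    """Class-averaged IoU between two integer label maps of the same shape."""
    valid = gt != ignore_index          # mask out unlabeled pixels
    ious = []
    for c in range(num_classes):
        pred_c = (pred == c) & valid
        gt_c = (gt == c) & valid
        union = np.logical_or(pred_c, gt_c).sum()
        if union == 0:
            continue                    # class absent from both maps: skip it
        inter = np.logical_and(pred_c, gt_c).sum()
        ious.append(inter / union)
    return float(np.mean(ious)) if ious else 0.0
```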

In conclusion, the study introduces a benchmarking framework to assess the visual capabilities of MFMs such as GPT-4o, Gemini, and Claude by converting standard vision tasks into prompt-based formats. Findings show MFMs perform better on semantic tasks than on geometric ones, with GPT-4o leading overall, yet all MFMs lag significantly behind task-specific vision models. Despite being generalists trained primarily on image-text data, they show promising progress, with newer reasoning models such as o3 improving on 3D tasks. Limitations include high inference cost and prompt sensitivity. Still, the framework provides a unified approach to evaluating MFMs' visual understanding, laying the groundwork for future advancements.


Check out the Paper, GitHub Page and Project. All credit for this research goes to the researchers of this project.


The post GPT-4o Understands Text, But Does It See Clearly? A Benchmarking Study of MFMs on Vision Tasks appeared first on MarkTechPost.
