MarkTechPost@AI, December 10, 2024
VisOnlyQA: A New Dataset for Evaluating the Visual Perception of LVLMs (Large Vision Language Models)

VisOnlyQA is a new dataset proposed by researchers at Penn State University for directly evaluating the visual perception capabilities of LVLMs. The study finds that current LVLMs still fall short on visual perception tasks, leaving clear room for future improvement.

🧐 LVLMs have made progress on multimodal tasks, but visual perception errors still cause problems.

📚 VisOnlyQA focuses on evaluating LVLMs' visual perception, addressing a gap in existing datasets.

🔍 An evaluation of 20 LVLMs shows that their visual perception capabilities remain weak.

💡 The work highlights improving training data and model architectures to strengthen visual perception.

Large Vision Language Models (LVLMs) have demonstrated significant advances across a range of challenging multimodal tasks over the past few years. Their ability to interpret visual information in figures, known as visual perception, relies on visual encoders and multimodal training. Even with these advances, visual perception errors still cause many mistakes in LVLMs and limit their ability to understand the image details needed for downstream tasks.

Recent popular datasets for evaluating LVLMs, such as MMMU and MathVista, don't focus on visual perception; they target tasks that require expert-level reasoning and knowledge. In addition, performance on these datasets depends on several capabilities at once, which makes visual perception difficult to evaluate directly. Most benchmarks on multimodal reasoning over scientific figures are restricted to application domains that demand expert knowledge and neglect visual perception. General visual perception datasets target tasks like OCR, depth estimation, and counting, but lack fine-grained detail because designing sufficiently specific questions is challenging. Datasets that do evaluate the visual perception of LVLMs often target less fine-grained visual information, such as scene understanding.

To address this, a group of researchers from Penn State University proposed VisOnlyQA, a new dataset designed to directly evaluate the visual perception capabilities of LVLMs on questions about geometric and numerical information in scientific figures. Bridging this gap, VisOnlyQA focuses on fine-grained visual details and an objective assessment of LVLMs' visual perception, addressing the limitations of prior datasets, and it generates synthetic figures from scripts to ensure diversity and precision. Questions were annotated manually or generated automatically from metadata and templates, so answering them requires no reasoning or domain knowledge. The dataset has three splits, Eval-Real, Eval-Synthetic, and Train, with balanced labels and high annotation quality confirmed by human performance (93.5% to 95% accuracy).
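
To make the structure concrete, here is a minimal Python sketch of how such a dataset might be inspected with the Hugging Face datasets library. The repository ID and field names below are assumptions for illustration only and are not confirmed by the paper; consult the project's GitHub page for the actual layout.

# Minimal sketch: inspect a VisOnlyQA-style split with the `datasets` library.
# The repository ID and field names are hypothetical placeholders.
from datasets import load_dataset

REPO_ID = "VisOnlyQA/Eval_Real"  # hypothetical repository ID
dataset = load_dataset(REPO_ID, split="test")

example = dataset[0]
# A multiple-choice perception question might carry fields like these
# (illustrative names, not taken from the paper):
print(example.get("question"))  # e.g. "Are lines AB and CD parallel?"
print(example.get("options"))   # e.g. ["True", "False"]
print(example.get("answer"))    # gold label used for accuracy scoring
image = example.get("image")    # the scientific figure as a PIL image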

The study evaluates the performance of 20 open-source and proprietary LVLMs on the VisOnlyQA dataset, focusing on their visual perception capabilities. The models were assessed on geometry, chemistry, and chart analysis tasks, with and without chain-of-thought reasoning. Results showed that the models performed significantly worse than humans, with average accuracies around 54.2% for the real-world dataset and 42.4% for synthetic data, far below human performance (93.5% and 95.0%, respectively). 
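
The accuracy numbers above amount to comparing each model's answer with the gold label and averaging per task. The sketch below illustrates that computation in Python; the record structure is an assumption for illustration, not the authors' evaluation code, and the toy records are placeholders rather than real results.

from collections import defaultdict

def accuracy_by_task(records):
    """records: iterable of dicts with 'task', 'gold', and 'pred' keys."""
    correct, total = defaultdict(int), defaultdict(int)
    for r in records:
        total[r["task"]] += 1
        correct[r["task"]] += int(r["pred"].strip().lower() == r["gold"].strip().lower())
    per_task = {t: correct[t] / total[t] for t in total}
    macro_average = sum(per_task.values()) / len(per_task)
    return per_task, macro_average

# Toy placeholder records (not real predictions):
records = [
    {"task": "Geometry-Triangle", "gold": "True", "pred": "False"},
    {"task": "Charts-Intersection", "gold": "2", "pred": "2"},
]
per_task, average = accuracy_by_task(records)
print(per_task, average)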

Despite their size, many models, including Phi-3.5-Vision and LLaVA-Next, performed near-randomly on tasks like Geometry-Triangle and Charts Intersection, indicating that current LVLMs still struggle with visual perception tasks involving geometric and numerical information. The analysis also showed that chain-of-thought reasoning did not consistently improve performance and in some cases even hurt it, reinforcing that VisOnlyQA evaluates visual perception directly rather than requiring reasoning. Error analysis revealed that most mistakes were related to visual perception, supporting the dataset's suitability for assessing this capability.

In conclusion, VisOnlyQA was introduced as a dataset to assess the visual perception abilities of LVLMs through questions on geometric and numerical details in scientific figures. The evaluation revealed that LVLMs still lack strong visual perception capabilities, and there is considerable room to improve training data and model architectures to enhance it. The dataset opens new directions in the field of LVLMs and can serve as a baseline for upcoming research.


Check out the details here and the GitHub page. All credit for this research goes to the researchers of this project.
