MarkTechPost@AI October 24, 2024
Salesforce AI Research Proposes Programmatic VLM Evaluation (PROVE): A New Benchmarking Paradigm for Evaluating VLM Responses to Open-Ended Queries

Salesforce AI Research has proposed Programmatic VLM Evaluation (PROVE) for evaluating how vision-language models respond to open-ended visual queries. The benchmark addresses shortcomings of existing evaluation methods by constructing high-fidelity scene graphs and using large language models to generate question-answer pairs along with verification programs, allowing model performance to be assessed in a more reliable and interpretable way. Evaluation results show that current models struggle to balance helpfulness and truthfulness.

🎯 PROVE is a new benchmarking paradigm that builds high-fidelity scene graph representations from hyper-detailed image captions and uses a large language model to generate diverse question-answer pairs together with executable programs that verify each pair, producing a dataset of 10.5k challenging QA pairs.

📊 PROVE's evaluation strategy measures the helpfulness and truthfulness of VLM responses within a unified framework based on scene graph comparison: scene graph representations are extracted from the model response and the ground-truth answer, and scores are computed from the recall and precision of these representations.

🔍 The evaluation shows that current vision-language models struggle to strike a good balance between helpfulness and truthfulness: models such as GPT-4o score high on helpfulness but not necessarily on truthfulness, while the LLaVA-1.5 model series performs better on truthfulness.

Vision-Language Models (VLMs) are increasingly used for generating responses to queries about visual content. Despite their progress, they often suffer from a major issue: generating plausible but incorrect responses, also known as hallucinations. These hallucinations can lead to a lack of trust in these systems, especially in real-world, high-stakes applications. Evaluating the helpfulness and truthfulness of VLM-generated responses is challenging because it requires not only understanding visual content but also verifying each claim made in the response. Traditional benchmarks have not been adequate for addressing this challenge, either because they limit evaluations to simplistic, binary questions or because they rely on incomplete context to judge open-ended responses.

Researchers from Salesforce AI Research have proposed Programmatic VLM Evaluation (PROVE), a new benchmarking paradigm that evaluates VLM responses to open-ended visual queries. In PROVE, researchers use a high-fidelity scene graph representation constructed from hyper-detailed image captions and employ a large language model (LLM) to generate diverse question-answer (QA) pairs along with executable programs to verify each QA pair. This approach allows the creation of a benchmark dataset of 10.5k visually grounded and challenging QA pairs. The evaluation strategy involves measuring both the helpfulness and truthfulness of VLM responses using a unified framework based on scene graph comparisons. This programmatic evaluation provides a more reliable and interpretable assessment of VLM performance compared to previous benchmarks.
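To make the pipeline concrete, here is a minimal, hypothetical sketch of the kind of artifacts PROVE builds on: a scene graph derived from a detailed caption, a generated QA pair, and an executable program that verifies the answer against the graph. The data structures and function names are illustrative assumptions, not the benchmark's released code.

```python
# Hypothetical sketch of a PROVE-style QA pair and its verification program.
# All names and structures here are illustrative, not the released benchmark code.
from dataclasses import dataclass, field


@dataclass
class SceneGraph:
    # Entities, per-entity attributes, and (subject, relation, object) triples
    # extracted from a hyper-detailed image caption.
    entities: set[str] = field(default_factory=set)
    attributes: dict[str, set[str]] = field(default_factory=dict)
    relations: set[tuple[str, str, str]] = field(default_factory=set)

    def has_attribute(self, entity: str, attribute: str) -> bool:
        return attribute in self.attributes.get(entity, set())


# Scene graph for a caption like "A brown dog sits on a red couch next to a sleeping cat."
graph = SceneGraph(
    entities={"dog", "couch", "cat"},
    attributes={"dog": {"brown"}, "couch": {"red"}, "cat": {"sleeping"}},
    relations={("dog", "sits on", "couch"), ("cat", "next to", "couch")},
)

# An LLM-generated open-ended QA pair...
question = "What color is the couch the dog is sitting on?"
answer = "red"


def verify(graph: SceneGraph) -> bool:
    # ...and its executable verification program: the answer "red" holds only if
    # the dog sits on a couch and that couch carries the attribute "red".
    return ("dog", "sits on", "couch") in graph.relations and graph.has_attribute("couch", "red")


# Only QA pairs whose verification program passes against the scene graph are kept.
assert verify(graph)
```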

The PROVE benchmark uses detailed scene graph representations and executable programs to verify the correctness of VLM responses. Scene graphs, constructed from detailed image captions, contain entities, attributes, and relationships that represent the visual scene. By prompting an LLM, researchers generate open-ended QA pairs and corresponding verification programs that ensure the questions are challenging yet verifiable. Only QA pairs that can be programmatically verified are retained in the benchmark, resulting in a high-quality dataset. The evaluation involves extracting scene graph representations from both the model responses and ground truth answers, and then calculating scores based on the recall and precision of these representations, measuring how helpful and truthful the responses are.
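As a rough illustration of the scoring idea, and under the simplifying assumption of a plain tuple-level comparison (the paper's metrics operate on richer scene-graph representations), helpfulness can be thought of as recall of the ground-truth answer content and truthfulness as precision against the image's scene graph:

```python
# Toy illustration: helpfulness as recall of ground-truth answer tuples,
# truthfulness as precision of response tuples against the image scene graph.
# The actual PROVE metrics are computed over full scene-graph representations.

def score(response_tuples: set, answer_tuples: set, image_tuples: set) -> dict:
    if not response_tuples or not answer_tuples:
        return {"helpfulness": 0.0, "truthfulness": 0.0}
    # Helpfulness: how much of the ground-truth answer content the response covers.
    helpfulness = len(response_tuples & answer_tuples) / len(answer_tuples)
    # Truthfulness: how much of the response is actually supported by the image.
    truthfulness = len(response_tuples & image_tuples) / len(response_tuples)
    return {"helpfulness": helpfulness, "truthfulness": truthfulness}


# (subject, relation, object) tuples extracted from text.
answer_tuples = {("couch", "is", "red")}
image_tuples = {("couch", "is", "red"), ("dog", "sits on", "couch")}
response_tuples = {("couch", "is", "red"), ("couch", "is", "leather")}  # one hallucinated claim

print(score(response_tuples, answer_tuples, image_tuples))
# -> {'helpfulness': 1.0, 'truthfulness': 0.5}
```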

The results of the evaluation show that current VLMs struggle to achieve a good balance between helpfulness and truthfulness. Models such as GPT-4o, Phi-3.5-Vision, and Pixtral demonstrated higher helpfulness scores but not necessarily higher truthfulness. The study also found that increasing model size tends to improve helpfulness but does not always enhance truthfulness. The evaluation of various models revealed that recent improvements in training better VLMs have led to enhanced helpfulness but have not consistently translated into truthful outputs. Notably, the LLaVA-1.5 model series achieved the best truthfulness scores, indicating that smaller, more focused models might outperform larger ones in maintaining accuracy.

In conclusion, PROVE presents a significant advancement in evaluating the helpfulness and truthfulness of VLM-generated responses. By leveraging detailed scene graph representations and programmatic verification, this benchmark provides a more reliable and interpretable evaluation framework. The findings underscore the need for VLMs that strike a balance between generating informative and accurate responses, especially as their use in real-world applications continues to grow. Future research is expected to focus on improving both the helpfulness and truthfulness of these models through advanced training techniques and new evaluation strategies.


Check out the Paper and Dataset Card. All credit for this research goes to the researchers of this project.



Related tags

Programmatic VLM Evaluation, Vision-Language Models, Model Evaluation, Scene Graphs