MarkTechPost@AI October 8, 2024
LLaVA-Critic: An Open-Source Large Multimodal Model Designed to Assess Model Performance Across Diverse Multimodal Tasks

LLaVA-Critic is the first large multimodal model designed specifically for evaluation tasks. As pre-training approaches its limits, learning to evaluate becomes increasingly important. The model aims to address the challenge of AI evaluation, delivering reliable assessments, performing well across diverse scenarios, and serving as an open-source alternative.

LLaVA-Critic is built on instruction-following data carefully curated for evaluation purposes, filling a gap in the field: open models capable of evaluating other multimodal models.

Existing attempts at AI evaluation have relied mainly on proprietary LMMs such as GPT-4V as generalist evaluators; LLaVA-Critic serves as an open-source alternative that evaluates against user-defined scoring criteria and applies well across a variety of scenarios.

LLaVA-Critic is developed by fine-tuning a pre-trained LMM that can handle a range of high-quality vision tasks. Training uses evaluation prompts combining multimodal instruction input, model responses, and an optional reference response, and the model predicts quantitative pointwise scores or pairwise rankings.

Compared with baseline models, LLaVA-Critic shows clear gains in both pointwise scoring and pairwise ranking, performs well across a variety of evaluation scenarios, and is an effective open-source choice for multimodal model evaluation.

Learning to evaluate is taking on an increasingly pivotal role in the development of modern large multimodal models (LMMs). As pre-training on existing web data reaches its limits, researchers are shifting toward post-training with AI-enhanced synthetic data, a transition that makes learning to evaluate all the more important. Reliable AI evaluation is essential for reducing human labor in complex task assessments, generating effective reward signals in reinforcement learning, and guiding inference-time search. Despite progress in single-image, multi-image, and video scenarios, open LMMs capable of evaluating the performance of other multimodal models remain a gap in the field.

Existing attempts to address the challenge of AI evaluation have primarily used proprietary LMMs such as GPT-4V as generalist evaluators for vision-language tasks. These models appear in evaluation benchmarks for complex scenarios such as visual chat and detailed captioning. Open-source alternatives such as Prometheus-Vision have also emerged as evaluators for specific user-designed scoring criteria. In preference learning for LMMs, techniques such as Reinforcement Learning from Human Feedback (RLHF) and Direct Preference Optimization (DPO) have been applied to align models with human intentions, and recent research has extended these ideas to the multimodal space, exploring strategies to improve visual chat abilities and reduce hallucinations in vision-language models.
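
For readers unfamiliar with DPO, here is a minimal sketch of the standard DPO objective, assuming the summed log-probabilities of the chosen and rejected responses under the policy and a frozen reference model have already been computed; the variable names are illustrative and not taken from any particular codebase.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """DPO: push the policy to prefer the chosen response over the rejected
    one, measured relative to the frozen reference model."""
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Negative log-sigmoid of the reward margin, averaged over the batch.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```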

Researchers from ByteDance and the University of Maryland, College Park have proposed LLaVA-Critic, the first LMM specifically designed for evaluation tasks. The approach centers on curating instruction-following data tailored for evaluation purposes and targets two primary scenarios: serving as an LMM-as-a-Judge and facilitating preference learning. In the first scenario, LLaVA-Critic aims to provide reliable evaluation scores comparable to proprietary models like GPT-4V, offering a free alternative for various evaluation benchmarks. In the second, it offers a scalable way to generate effective reward signals, reducing dependence on costly human feedback collection. LLaVA-Critic shows a high correlation with commercial GPT models in evaluation tasks and superior performance in preference learning.
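
To make the second scenario concrete, the following is a rough sketch of how a critic's pairwise judgments could be turned into (chosen, rejected) preference pairs for DPO-style training. `generate_candidates` and `critic_prefers_first` are hypothetical helpers standing in for the policy LMM and the critic model, not functions from the paper's code.

```python
from typing import Callable, List, Tuple

def build_preference_pairs(
    prompts: List[str],
    generate_candidates: Callable[[str], List[str]],        # policy LMM sampling (hypothetical)
    critic_prefers_first: Callable[[str, str, str], bool],  # critic's pairwise judgment (hypothetical)
) -> List[Tuple[str, str, str]]:
    """Sample two candidate responses per prompt and let the critic decide
    which one becomes the 'chosen' response."""
    pairs = []
    for prompt in prompts:
        a, b = generate_candidates(prompt)[:2]   # two sampled responses
        if critic_prefers_first(prompt, a, b):
            pairs.append((prompt, a, b))         # (prompt, chosen, rejected)
        else:
            pairs.append((prompt, b, a))
    return pairs
```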

LLaVA-Critic is developed by fine-tuning a pre-trained LMM that can follow diverse instructions, ensuring the model can handle a range of high-quality vision tasks. The training process uses an evaluation prompt that combines the multimodal instruction input, one or more model responses, and an optional reference response. LLaVA-Critic is trained to predict quantitative pointwise scores or pairwise rankings based on the specified criteria and to provide detailed justifications for its judgments, using the standard cross-entropy loss over both judgments and justifications. The researchers start from the LLaVA-OneVision (OV) 7B/72B pre-trained checkpoints and fine-tune on the LLaVA-Critic-113k dataset for one epoch.
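
As an illustration of how such an evaluator might be queried at inference time, the sketch below builds a pointwise evaluation prompt and parses a numeric score from the critic's free-form judgment. The prompt wording and the `critic_generate` call are assumptions for illustration only; the paper's actual templates and interfaces may differ.

```python
import re
from typing import Optional

# Illustrative pointwise template; not the paper's exact prompt.
POINTWISE_TEMPLATE = (
    "You are an impartial judge. Given the image, the instruction, and the "
    "model response below, rate the response on a scale of 1-10 and explain "
    "your reasoning.\n\n"
    "Instruction: {instruction}\n"
    "Response: {response}\n"
    "Reference (optional): {reference}\n"
)

def parse_pointwise_score(critic_output: str) -> Optional[int]:
    """Pull the first integer rating (1-10) out of the critic's judgment text."""
    match = re.search(r"\b(10|[1-9])\b", critic_output)
    return int(match.group(1)) if match else None

# Usage (pseudo): feed the filled template plus the image to the critic LMM,
# then parse the returned judgment.
# judgment = critic_generate(image, POINTWISE_TEMPLATE.format(
#     instruction=..., response=..., reference=...))
# score = parse_pointwise_score(judgment)
```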

The results show significant improvements in both the pointwise scoring and pairwise ranking capabilities of LLaVA-Critic relative to baseline models. LLaVA-Critic-72B achieves the highest average Pearson-r (0.754) and Kendall's Tau (0.933) in pointwise scoring, outperforming the baseline LLaVA-OV-72B. In pairwise ranking, LLaVA-Critic-72B outperforms GPT-4o and GPT-4V on comparisons without ties, reaching 73.6% accuracy. LLaVA-Critic-7B outperforms most baselines, including commercial models and other open-source LMMs, in the MLLM-as-a-Judge scenario. These results highlight the effectiveness of LLaVA-Critic as an open-source alternative for multimodal model evaluation.
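
For reference, agreement metrics of this kind can be computed from a critic's scores and a reference judge's scores with SciPy, as in the small sketch below; the score lists here are placeholder data, not values from the paper.

```python
from scipy.stats import pearsonr, kendalltau

critic_scores = [7, 5, 9, 6, 8]      # scores assigned by the critic model (placeholder)
reference_scores = [8, 5, 9, 5, 8]   # scores from the reference judge (placeholder)

r, _ = pearsonr(critic_scores, reference_scores)      # linear correlation
tau, _ = kendalltau(critic_scores, reference_scores)  # rank agreement
print(f"Pearson-r: {r:.3f}, Kendall's Tau: {tau:.3f}")
```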

In conclusion, the researchers have proposed LLaVA-Critic, the first LMM specifically designed for evaluation tasks, developed from a high-quality, diverse instruction-following dataset, and it excels in two critical areas. First, as a generalist evaluator, LLaVA-Critic shows remarkable alignment with human and GPT-4o preferences across various evaluation tasks, offering a viable open-source alternative to commercial models. Second, in preference learning scenarios, LLaVA-Critic functions as a reliable reward model, outperforming human feedback-based approaches in enhancing the visual chat capabilities of LMMs. This research is a key step toward building self-critiquing capabilities into open-source LMMs, enabling future advances in scalable, superhuman AI alignment feedback.


Check out the Paper and Project. All credit for this research goes to the researchers of this project.
