MarkTechPost@AI, July 13, 2024
MJ-BENCH: A Multimodal AI Benchmark for Evaluating Text-to-Image Generation with Focus on Alignment, Safety, and Bias

MJ-BENCH is a new benchmark for evaluating the performance of multimodal judges in text-to-image generation. It uses a comprehensive preference dataset to assess judges across four key perspectives: alignment, safety, image quality, and bias. The benchmark includes detailed subcategories for each perspective, enabling a thorough assessment of judge performance. The study shows that closed-source VLMs such as GPT-4o generally provide better feedback across all perspectives, while smaller CLIP-based models perform well in certain specific areas, such as text-image alignment and image quality.

😄 MJ-BENCH is a multimodal benchmark for evaluating the performance of multimodal judges in text-to-image generation, covering four key perspectives: alignment, safety, image quality, and bias. It uses a comprehensive preference dataset with detailed subcategories for each perspective, enabling a thorough assessment of judge performance.

🤔 MJ-BENCH evaluates judges by comparing their feedback on image pairs under a given instruction. Each data point consists of an instruction and a pair of chosen and rejected images. The evaluation metrics combine automatic metrics from the preference dataset with human evaluations based on fine-tuned results.

🚀 The results show that closed-source VLMs such as GPT-4o generally provide better feedback across all perspectives, performing especially well on bias with an average accuracy of 85.9%. Smaller CLIP-based models perform well in certain specific areas, such as text-image alignment and image quality.

🧐 In addition, the study finds that VLMs provide more accurate and stable feedback on natural-language scales than on numerical ones. For example, GPT-4o achieved an average score of 85.9 on the bias perspective, while CLIP-v1 scored only 73.6, a substantial performance gap.

💡 MJ-BENCH's comprehensive framework helps researchers better understand the capabilities of multimodal judges, highlighting the strengths and weaknesses of different models and promoting improvements in the alignment, safety, and overall quality of text-to-image models.

Text-to-image generation models have gained traction with advanced AI technologies, enabling the generation of detailed and contextually accurate images based on textual prompts. The rapid development in this field has led to numerous models, such as DALLE-3 and Stable Diffusion, designed to translate text into visually coherent images.

A significant challenge in text-to-image generation is ensuring the generated images align accurately with the provided text. Issues such as misalignment, hallucination, bias, and the production of unsafe or low-quality content are common problems that need to be addressed. Misalignment occurs when the image does not correctly reflect the text description. Hallucination involves generating plausible entities that contradict the instruction. Bias and unsafe content include harmful, toxic, or inappropriate outputs, such as stereotypes or violence. Addressing these issues is crucial to improve the reliability and safety of these models.

Existing research involves methods to evaluate and enhance text-to-image models to tackle these challenges. One approach involves using multimodal judges, which provide feedback on the generated images. These judges can be categorized into two main types: CLIP-based scoring models and vision-language models (VLMs). CLIP-based models are typically smaller and focus on text-image alignment, providing scores that help identify misalignment. In contrast, VLMs are larger and offer more comprehensive feedback, including safety and bias assessment, due to their advanced reasoning capabilities.
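As a rough illustration of how a CLIP-style scorer judges alignment (a sketch of the general technique, not MJ-BENCH's actual implementation), the alignment score is typically the cosine similarity between the text embedding and the image embedding. The toy vectors below stand in for the outputs of real text and image encoders:

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy embeddings standing in for real encoder outputs.
text_emb = [0.2, 0.9, 0.1]
aligned_image_emb = [0.25, 0.85, 0.15]   # close to the text embedding
misaligned_image_emb = [0.9, 0.1, 0.4]   # far from the text embedding

# A CLIP-style judge prefers the image whose embedding is closer to the text's.
assert cosine(text_emb, aligned_image_emb) > cosine(text_emb, misaligned_image_emb)
```

A single scalar like this explains both the strength and the limitation noted above: it directly measures text-image agreement, but carries no notion of safety or bias.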

The research team, comprising members from institutions such as UNC-Chapel Hill, University of Chicago, Stanford University, and others, developed MJ-BENCH to provide a holistic evaluation framework. MJ-BENCH is a novel benchmark designed to evaluate the performance of multimodal judges in text-to-image generation. This benchmark utilizes a comprehensive preference dataset to assess judges across four key perspectives: alignment, safety, image quality, and bias. The benchmark includes detailed subcategories for each perspective, enabling a thorough assessment of the judges’ performance.

MJ-BENCH evaluates judges by comparing their feedback on pairs of images based on given instructions. Each data point consists of an instruction and a pair of chosen and rejected images. The evaluation metrics combine natural automatic metrics from the preference dataset with human evaluations based on fine-tuned results. This dual approach ensures that the conclusions drawn are reliable and reflect human preferences. The benchmark also incorporates a variety of evaluation scales, including numerical and Likert scales, to determine the effectiveness of the feedback provided by the judges.
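The pairwise protocol above can be sketched as follows: a judge is "correct" on a data point when it scores the chosen image above the rejected one, and its accuracy is the fraction of data points it gets right. The judge here is a hypothetical stand-in function, not any model from the paper:

```python
from typing import Callable, List, Tuple

# One data point: (instruction, chosen_image, rejected_image).
DataPoint = Tuple[str, str, str]

def preference_accuracy(
    judge: Callable[[str, str], float],  # maps (instruction, image) -> score
    dataset: List[DataPoint],
) -> float:
    """Fraction of pairs on which the judge prefers the chosen image."""
    correct = sum(
        judge(instr, chosen) > judge(instr, rejected)
        for instr, chosen, rejected in dataset
    )
    return correct / len(dataset)

# Toy judge: scores by word overlap between instruction and an image caption
# (real judges would score actual images, not captions).
def toy_judge(instruction: str, image_caption: str) -> float:
    return float(len(set(instruction.split()) & set(image_caption.split())))

data = [
    ("a red car on a road", "red car road", "blue boat sea"),
    ("two cats on a sofa", "cats sofa", "dog park"),
]
print(preference_accuracy(toy_judge, data))  # 1.0
```

The human-evaluation side of the benchmark then checks that judges scoring well on this automatic metric also produce fine-tuned models that humans actually prefer.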

The evaluation results showed that closed-source VLMs, such as GPT-4o, generally provided better feedback across all perspectives. For instance, on the bias perspective, GPT-4o achieved an average accuracy of 85.9%, while Gemini Ultra scored 79.0% and Claude 3 Opus 76.7%. On the alignment perspective, GPT-4o scored an average of 46.6, while Gemini Ultra achieved 41.9, indicating superior performance by GPT-4o in aligning text with image content. The study also revealed that smaller CLIP-based models, despite being less comprehensive, performed well in specific areas such as text-image alignment and image quality. Due to their extensive pretraining over text-vision corpora, these models excelled in alignment but were less effective at providing accurate safety and bias feedback.

The research found that VLMs could provide more accurate and stable feedback in natural language scales than numerical ones. For instance, GPT-4o provided an average score of 85.9 in the bias perspective, while CLIP-v1 only scored 73.6, indicating a significant difference in performance. Human evaluations of end-to-end fine-tuned models confirmed these findings, further validating the effectiveness of MJ-BENCH. The benchmark’s comprehensive framework allows for a nuanced understanding of the judges’ capabilities, highlighting the strengths and limitations of various models.
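One way the natural-language-vs-numerical comparison can be operationalized (an assumption on my part, not the paper's exact protocol) is to map each Likert-style label to a rank and apply the same pairwise-preference check used for numerical scores:

```python
# Hypothetical Likert scale; the actual labels used in the benchmark may differ.
LIKERT_RANK = {"very poor": 1, "poor": 2, "average": 3, "good": 4, "very good": 5}

def prefers_chosen(chosen_label: str, rejected_label: str) -> bool:
    """True when the Likert rating of the chosen image outranks the rejected one."""
    return LIKERT_RANK[chosen_label] > LIKERT_RANK[rejected_label]

assert prefers_chosen("good", "poor")
assert not prefers_chosen("average", "very good")
```

Under such a mapping, both scale types reduce to the same accuracy metric, which is what makes the stability comparison between them meaningful.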

In conclusion, MJ-BENCH represents a significant advancement in evaluating text-to-image generation models. By offering a detailed and reliable assessment framework, it helps identify the strengths and weaknesses of multimodal judges. This benchmark is an essential tool for researchers aiming to improve text-to-image models’ alignment, safety, and overall quality, guiding future developments in this rapidly evolving field.


Check out the Paper and Project. All credit for this research goes to the researchers of this project.