MarkTechPost@AI · May 3, 04:05
Subject-Driven Image Evaluation Gets Simpler: Google Researchers Introduce REFVNLI to Jointly Score Textual Alignment and Subject Consistency Without Costly APIs

This article introduces REFVNLI, a new, cost-efficient metric proposed by Google researchers for evaluating subject-driven text-to-image (T2I) generation. By jointly assessing textual alignment and subject consistency, the metric addresses the cost and accuracy shortcomings of existing evaluation approaches. Trained on a large, automatically generated dataset, REFVNLI performs strongly across multiple benchmarks, standing out in particular in the Object category. It measures both how well a generated image matches its text prompt and how faithfully the referenced subject is preserved, providing an important evaluation tool for advancing T2I generation.

🖼️ REFVNLI is a new evaluation metric designed for subject-driven T2I generation. It jointly assesses textual alignment and subject consistency, addressing the limitations of existing evaluation methods.

💡 REFVNLI is built by fine-tuning a PaliGemma model on a large dataset generated from video-reasoning benchmarks and image perturbations. The model takes a <reference image, prompt, target image> triplet as input and predicts textual alignment and subject consistency scores.

🏆 REFVNLI performs strongly across multiple benchmarks, with its best results in the Object category, surpassing the GPT-4o-based DreamBench++. On ImagenHub, it ranks near the top for textual alignment in the Animals category and achieves the highest score in the Objects category.

🔬 REFVNLI handles lesser-known concepts well, agreeing with human preferences over 87% of the time. However, its identity-sensitive training limits its subject-consistency performance, most notably on the KITTEN benchmark.

🚀 Future work includes strengthening REFVNLI's ability to evaluate artistic styles, handling textual modifications that explicitly alter identity-defining attributes, and improving support for multiple reference images covering both single and distinct subjects.

Text-to-image (T2I) generation has evolved to include subject-driven approaches, which enhance standard T2I models by incorporating reference images alongside text prompts. This advancement allows for more precise subject representation in generated images. Despite its promising applications, subject-driven T2I generation faces a significant challenge: the lack of reliable automatic evaluation methods. Current metrics focus on either text-prompt alignment or subject consistency, even though both are essential for effective subject-driven generation. Evaluation methods that correlate better with human judgment do exist, but they rely on costly API calls to models like GPT-4, limiting their practicality for large-scale research.

Evaluation approaches for vision-language models (VLMs) span various frameworks, with text-to-image (T2I) assessment focusing on image quality, diversity, and text alignment. For subject-driven generation, researchers rely on embedding-based metrics such as CLIP and DINO similarity to measure subject preservation. More elaborate metrics such as VIEScore and DreamBench++ use GPT-4o to evaluate textual alignment and subject consistency, but at a much higher computational cost. Subject-driven T2I methods themselves have developed along two main paths: fine-tuning general models into specialized versions that capture specific subjects and styles, or enabling broader applicability through one-shot examples. These one-shot approaches include adapter-based and adapter-free techniques.
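To make this concrete, the embedding-based baselines work roughly as in the following minimal sketch, which scores subject preservation as the cosine similarity between CLIP image embeddings of the reference and generated images (a CLIP-I-style score). The checkpoint name and helper function are illustrative assumptions; a DINO-based score swaps in a DINO backbone the same way.

```python
# Minimal sketch of a CLIP-I-style subject-preservation score:
# cosine similarity between CLIP image embeddings of the reference
# and the generated image. Checkpoint and helper names are illustrative.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def subject_similarity(ref_path: str, gen_path: str) -> float:
    """Cosine similarity of CLIP image embeddings for two images."""
    images = [Image.open(ref_path), Image.open(gen_path)]
    inputs = processor(images=images, return_tensors="pt")
    with torch.no_grad():
        emb = model.get_image_features(**inputs)
    emb = emb / emb.norm(dim=-1, keepdim=True)  # L2-normalize embeddings
    return float(emb[0] @ emb[1])

# Usage: score = subject_similarity("reference.png", "generated.png")
```

Such scores are cheap but, as the article notes, they capture only subject preservation, not whether the generated image also follows the text prompt.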

Researchers from Google Research and Ben-Gurion University have proposed REFVNLI, a cost-efficient metric that simultaneously evaluates textual alignment and subject preservation in subject-driven T2I generation. It predicts two scores, textual alignment and subject consistency, in a single classification based on a triplet <image_ref, prompt, image_tgt>. It is trained on an extensive dataset derived from video-reasoning benchmarks and image perturbations, outperforming or matching existing baselines across multiple benchmarks and subject categories. REFVNLI shows improvements of up to 6.4 points in textual alignment and 8.5 points in subject consistency. It is effective with lesser-known concepts, where it aligns with human preferences at over 87% accuracy.

For training REFVNLI, a large-scale dataset of triplets <image_ref, prompt, image_tgt>, labeled with <textual alignment, subject preservation>, is curated automatically. REFVNLI is evaluated on multiple human-labeled test sets for subject-driven generation, including DreamBench++, ImagenHub, and KITTEN. The evaluation spans diverse categories such as Humans, Animals, Objects, Landmarks, and multi-subject settings. The training process involves fine-tuning PaliGemma, a 3B vision-language model, focusing on a variant adapted for multi-image inputs. During inference, the model takes two images and a prompt with special markups around the referenced subject, performing sequential binary classifications for textual alignment and subject preservation.
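To illustrate the inference setup, here is a hedged sketch built on the public PaliGemma checkpoint via Hugging Face transformers. The checkpoint name, the <subject> markup, the question wording, and the side-by-side tiling workaround for single-image input are all assumptions for illustration; the paper's fine-tuned weights and exact prompt format differ.

```python
# Hedged sketch of REFVNLI-style inference. The public base checkpoint
# stands in for the (not publicly named here) fine-tuned REFVNLI weights;
# markup tokens and question phrasing are illustrative assumptions.
import torch
from PIL import Image
from transformers import AutoProcessor, PaliGemmaForConditionalGeneration

MODEL_ID = "google/paligemma-3b-mix-448"  # stand-in, not the REFVNLI weights
processor = AutoProcessor.from_pretrained(MODEL_ID)
model = PaliGemmaForConditionalGeneration.from_pretrained(MODEL_ID)

def tile(ref: Image.Image, tgt: Image.Image) -> Image.Image:
    """Place reference and target side by side, since the base checkpoint
    takes one image (REFVNLI itself uses a multi-image-adapted variant)."""
    canvas = Image.new("RGB", (ref.width + tgt.width, max(ref.height, tgt.height)))
    canvas.paste(ref, (0, 0))
    canvas.paste(tgt, (ref.width, 0))
    return canvas

def ask_yes_no(image: Image.Image, question: str) -> bool:
    """Run one binary classification as a yes/no VQA query."""
    inputs = processor(text=question, images=image, return_tensors="pt")
    with torch.no_grad():
        out = model.generate(**inputs, max_new_tokens=5, do_sample=False)
    answer = processor.decode(out[0][inputs["input_ids"].shape[-1]:],
                              skip_special_tokens=True)
    return answer.strip().lower().startswith("yes")

def refvnli_style_scores(ref_path, tgt_path, prompt, subject):
    # Mark the referenced subject in the prompt, as the article describes,
    # then run the two sequential binary classifications.
    marked = prompt.replace(subject, f"<subject>{subject}</subject>")
    image = tile(Image.open(ref_path), Image.open(tgt_path))
    return {
        "textual_alignment": ask_yes_no(
            image, f"answer en Does the right image match the prompt: {marked}?"),
        "subject_preservation": ask_yes_no(
            image, "answer en Does the subject on the right match the reference on the left?"),
    }
```

In the actual metric these two decisions come from the fine-tuned model itself rather than from zero-shot prompting, which is what makes it cheap relative to GPT-4o-based scoring.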

For subject consistency, REFVNLI ranks among the top two metrics across all categories and performs best in the Object category, exceeding the GPT-4o-based DreamBench++ by 6.3 points. On ImagenHub, REFVNLI achieves top-two rankings for textual alignment in the Animals category and the highest score for Objects, outperforming the best non-finetuned model by 4 points. It also performs well in multi-subject settings, ranking in the top three. REFVNLI achieves the highest textual alignment score on KITTEN, but has limitations in subject consistency due to its identity-sensitive training, which penalizes even minor mismatches in identity-defining traits. Ablation studies reveal that joint training provides complementary benefits, with single-task training resulting in performance drops.

In this paper, researchers introduced REFVNLI, a reliable, cost-effective metric for subject-driven T2I generation that addresses both textual alignment and subject preservation challenges. Trained on an extensive auto-generated dataset, REFVNLI effectively balances robustness to identity-agnostic variations such as pose, lighting, and background with sensitivity to identity-specific traits, including facial features, object shape, and unique details. Future research directions include enhancing REFVNLI’s evaluation capabilities across artistic styles, handling textual modifications that explicitly alter identity-defining attributes, and improving the processing of multiple reference images for single and distinct subjects.

