MarkTechPost@AI January 20
Researchers from MIT, Google DeepMind, and Oxford Unveil Why Vision-Language Models Do Not Understand Negation and Propose a Groundbreaking Solution

 

Vision-language models are essential for multimodal tasks, but understanding negation remains a major challenge for them. Existing models perform poorly on negated statements, a problem rooted in the bias toward affirmative examples in their training data. To address this, researchers propose the NegBench framework, which evaluates and improves models using synthetic datasets rich in negation scenarios as well as adapted versions of existing datasets. The framework yields notable gains on both retrieval and multiple-choice tasks, sharpening the models' grasp of subtle distinctions. These improvements pave the way for more robust AI systems, especially in critical areas such as medical diagnosis.

🔍 Vision-language models (VLMs) play a key role in tasks such as image retrieval, captioning, and medical diagnosis, but their ability to understand negation remains a challenge.

🤔 Because of the bias toward affirmative examples in their training data, existing VLMs cannot reliably distinguish affirmation from negation when processing negated statements, severely limiting their use in high-stakes domains.

🧪 The NegBench framework introduces large synthetic datasets of negation scenarios along with adapted existing datasets to evaluate and improve VLMs' negation understanding, significantly boosting performance on retrieval and multiple-choice tasks.

📈 Models fine-tuned with NegBench show a 10% increase in recall on negated retrieval queries and accuracy gains of up to 40% on multiple-choice questions, indicating a better ability to distinguish affirmative from negated captions.

💡 By incorporating diverse negation examples, NegBench overcomes existing limitations and lays the groundwork for more robust AI systems capable of nuanced language understanding, with significant implications for fields such as medical diagnosis and semantic content retrieval.

Vision-language models (VLMs) play a crucial role in multimodal tasks like image retrieval, captioning, and medical diagnostics by aligning visual and linguistic data. However, understanding negation in these models remains one of the main challenges. Negation is critical for nuanced applications, such as distinguishing “a room without windows” from “a room with windows.” Despite their advancements, current VLMs fail to interpret negation reliably, severely limiting their effectiveness in high-stakes domains like safety monitoring and healthcare. Addressing this challenge is essential to expand their applicability in real-world scenarios.

Current VLMs, such as CLIP, use shared embedding spaces to align visual and textual representations. Though these models excel at tasks such as cross-modal retrieval and image captioning, their performance falls sharply when dealing with negated statements. This limitation stems from pretraining data bias: the training datasets contain mostly affirmative examples, producing an affirmation bias in which models treat negated and affirmative statements as equivalent. Existing benchmarks such as CREPE and CC-Neg rely on simplistic templated examples that do not represent the richness and depth of negation in natural language. VLMs also tend to collapse the embeddings of negated and affirmative captions, making it extremely difficult to tease apart fine-grained differences between the concepts. This is a problem for applications that require precise language understanding, such as querying a medical imaging database with complex inclusion and exclusion criteria.
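To make the embedding-collapse point concrete, here is a minimal sketch (not from the paper) that probes how close a CLIP text encoder places an affirmative caption and its negated counterpart. The checkpoint name and captions are illustrative assumptions.

```python
# Minimal probe of CLIP's treatment of negation: if the affirmative caption
# and its negated counterpart embed almost identically, the model cannot
# reliably separate them. Checkpoint and captions are illustrative.
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

captions = ["a room with windows", "a room without windows"]
inputs = processor(text=captions, return_tensors="pt", padding=True)
with torch.no_grad():
    text_emb = model.get_text_features(**inputs)
text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)

# Cosine similarity between the affirmative and negated captions.
# A value near 1.0 illustrates the "collapse" described above.
similarity = (text_emb[0] @ text_emb[1]).item()
print(f"caption-caption similarity: {similarity:.3f}")
```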

To address these limitations, researchers from MIT, Google DeepMind, and the University of Oxford proposed the NegBench framework for evaluating and improving negation comprehension in VLMs. The framework assesses two fundamental tasks: Retrieval with Negation (Retrieval-Neg), which examines a model's capacity to retrieve images according to both affirmative and negated specifications, such as "a beach without people," and Multiple Choice Questions with Negation (MCQ-Neg), which evaluates nuanced comprehension by requiring models to select the correct caption from among subtly different alternatives. NegBench uses large synthetic datasets, such as CC12M-NegCap and CC12M-NegMCQ, augmented with millions of captions covering a wide range of negation scenarios, exposing VLMs to challenging negatives and paraphrased captions and improving both training and evaluation. Standard datasets, such as COCO and MSR-VTT, were also adapted to include negated captions and paraphrases, further expanding linguistic diversity and testing robustness. By incorporating varied and complex negation examples, NegBench effectively overcomes existing limitations, significantly enhancing model performance and generalization.
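The MCQ-Neg protocol can be pictured as a simple scoring loop: embed the image and all candidate captions, and count the model as correct when the highest image-text similarity falls on the right caption. The sketch below assumes a Hugging Face CLIP checkpoint and hypothetical example items; it illustrates the general idea rather than NegBench's actual evaluation code.

```python
# Sketch of an MCQ-Neg-style evaluation: the model must pick the correct
# caption among close affirmative/negated variants. Function and variable
# names are illustrative, not NegBench's API.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def mcq_accuracy(examples):
    """examples: list of (image_path, candidate_captions, correct_index)."""
    correct = 0
    for image_path, captions, answer in examples:
        image = Image.open(image_path).convert("RGB")
        inputs = processor(text=captions, images=image,
                           return_tensors="pt", padding=True)
        with torch.no_grad():
            # logits_per_image has shape (1, num_captions)
            logits = model(**inputs).logits_per_image
        correct += int(logits.argmax(dim=-1).item() == answer)
    return correct / len(examples)

# Hypothetical item in the spirit of the templates described below:
# ("beach.jpg", ["a beach with people", "a beach without people"], 1)
```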

NegBench leverages both real and synthetic datasets to test negation comprehension. Datasets like COCO, VOC2007, and CheXpert were adapted to include negation scenarios, such as "This image includes trees but not buildings." For MCQs, templates like "This image includes A but not B" were used alongside paraphrased variations for diversity. NegBench is further augmented with the HardNeg-Syn dataset, in which images are synthesized in pairs that differ only in the presence or absence of specific objects, creating hard cases for negation understanding. Model fine-tuning relied on two training objectives: a contrastive loss that aligns image-caption pairs and improves retrieval, and a multiple-choice loss that encourages fine-grained negation judgments by favoring the correct caption in the MCQ setting.
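As a rough illustration of how these two objectives could look in code, the sketch below assumes precomputed, L2-normalized image and text embeddings; the function names and the loss weighting are illustrative assumptions, not the authors' implementation.

```python
# Illustration of the two fine-tuning objectives described above, assuming
# precomputed, L2-normalized embeddings. Names and the weighting factor
# (lambda_mcq) are hypothetical.
import torch
import torch.nn.functional as F

def contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric CLIP-style loss: matched image-caption pairs sit on the diagonal."""
    logits = image_emb @ text_emb.t() / temperature          # (B, B)
    targets = torch.arange(image_emb.size(0), device=image_emb.device)
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

def mcq_loss(image_emb, candidate_embs, correct_index, temperature=0.07):
    """Cross-entropy over one image's candidate captions (one correct, rest negated variants)."""
    logits = (image_emb @ candidate_embs.t()) / temperature  # (num_candidates,)
    target = torch.tensor([correct_index], device=logits.device)
    return F.cross_entropy(logits.unsqueeze(0), target)

# Hypothetical combined objective for a training step:
# total = contrastive_loss(img_batch, txt_batch) + lambda_mcq * mcq_loss(img_i, cands_i, idx_i)
```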

The fine-tuned models showed considerable improvements in retrieval and comprehension tasks on the negation-enriched datasets. For retrieval, recall on negated queries increased by 10%, bringing performance nearly on par with standard retrieval tasks. On multiple-choice tasks, accuracy improved by up to 40%, showing a better ability to differentiate subtly different affirmative and negated captions. Gains were consistent across datasets, including COCO and MSR-VTT, and on synthetic datasets like HardNeg-Syn, where models handled negation and complex linguistic constructions appropriately. This suggests that representing diverse kinds of negation in training and testing reduces affirmation bias and improves generalization.

NegBench addresses a critical gap in VLMs as the first work to tackle their inability to understand negation. By incorporating diverse negation examples into training and evaluation, it brings significant improvements in retrieval and comprehension tasks. These improvements open up avenues for much more robust AI systems capable of nuanced language understanding, with important implications for critical domains like medical diagnostics and semantic content retrieval.


Check out the Paper and Code. All credit for this research goes to the researchers of this project.




Related tags

Vision-language models, negation understanding, NegBench, multimodal tasks, artificial intelligence