cs.AI updates on arXiv.org, July 11, 12:04
LinguaMark: Do Multimodal Models Speak Fairly? A Benchmark-Based Evaluation

This article introduces LinguaMark, a benchmark for evaluating large multimodal models on multilingual visual question answering. Using 6,875 image-text pairs covering 11 languages, it assesses a range of models on bias, answer relevancy, and faithfulness, and releases the evaluation code as open source.

arXiv:2507.07274v1 Announce Type: cross

Abstract: Large Multimodal Models (LMMs) are typically trained on vast corpora of image-text data but are often limited in linguistic coverage, leading to biased and unfair outputs across languages. While prior work has explored multimodal evaluation, less emphasis has been placed on assessing multilingual capabilities. In this work, we introduce LinguaMark, a benchmark designed to evaluate state-of-the-art LMMs on a multilingual Visual Question Answering (VQA) task. Our dataset comprises 6,875 image-text pairs spanning 11 languages and five social attributes. We evaluate models using three key metrics: Bias, Answer Relevancy, and Faithfulness. Our findings reveal that closed-source models generally achieve the highest overall performance. Both closed-source (GPT-4o and Gemini2.5) and open-source models (Gemma3, Qwen2.5) perform competitively across social attributes, and Qwen2.5 demonstrates strong generalization across multiple languages. We release our benchmark and evaluation code to encourage reproducibility and further research.
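The abstract does not spell out the evaluation pipeline, so the following Python sketch is only a rough illustration of how one might iterate over a multilingual VQA benchmark and aggregate per-language scores; it is not the official LinguaMark code, and every identifier (BenchmarkItem, query_model, overlap_score, evaluate) is a hypothetical placeholder. The paper's actual Bias, Answer Relevancy, and Faithfulness metrics are more involved than the toy token-overlap proxy used here.

# Illustrative sketch only -- not the LinguaMark release.
from dataclasses import dataclass
from collections import defaultdict


@dataclass
class BenchmarkItem:
    image_path: str   # image half of the image-text pair
    question: str     # VQA question in the target language
    language: str     # one of the 11 covered languages
    attribute: str    # one of the five social attributes
    reference: str    # reference answer used for scoring


def query_model(item: BenchmarkItem) -> str:
    """Placeholder for an LMM call (e.g. GPT-4o, Gemini2.5, Gemma3, Qwen2.5)."""
    raise NotImplementedError("plug in your model API here")


def overlap_score(prediction: str, reference: str) -> float:
    """Toy token-overlap proxy standing in for a real relevancy/faithfulness metric."""
    pred = set(prediction.lower().split())
    ref = set(reference.lower().split())
    return len(pred & ref) / max(len(ref), 1)


def evaluate(items: list[BenchmarkItem]) -> dict[str, float]:
    """Average the per-item score by language, mirroring a per-language report."""
    per_language = defaultdict(list)
    for item in items:
        prediction = query_model(item)
        per_language[item.language].append(overlap_score(prediction, item.reference))
    return {lang: sum(scores) / len(scores) for lang, scores in per_language.items()}

In practice, overlap_score would be replaced by the benchmark's own metric implementations, and query_model by the specific model API under evaluation; the aggregation-by-language structure is the only part this sketch is meant to convey.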


Related tags: LinguaMark, Multilingual VQA, Large Multimodal Models, Evaluation Benchmark, Open-Source Code