MarkTechPost@AI December 8, 2024
Global-MMLU: A World-class Benchmark Redefining Multilingual AI by Bridging Cultural and Linguistic Gaps for Equitable Evaluation Across 42 Languages and Diverse Contexts

Global-MMLU is a new multilingual benchmark designed to address the limitations of traditional multilingual datasets such as MMLU, in particular their Western-centric cultural bias and heavy reliance on machine translation. The dataset covers 42 languages, spanning both high- and low-resource languages, and introduces culturally sensitive (CS) and culturally agnostic (CA) subsets for a more fine-grained evaluation of multilingual capability. Global-MMLU was created by combining professional translation, community contributions, and improved machine translation techniques, with a rigorous data curation process that ensures accuracy and cultural relevance. The release of this benchmark should drive progress in cross-cultural understanding for multilingual AI models.

🌐 Global-MMLU is a multilingual benchmark spanning 42 languages, designed to address the limitations of traditional datasets, in particular their Western-centric cultural bias and reliance on machine translation.

🌍 The dataset introduces culturally sensitive (CS) and culturally agnostic (CA) subsets, which respectively evaluate a model's grasp of cultural, geographic, or dialect-specific knowledge and its handling of universal, context-free tasks.

📖 Global-MMLU was created by combining professional translation, community contributions, and improved machine translation techniques, with validation by professional annotators and native speakers to ensure accuracy and cultural relevance.

📊 An evaluation of 14 state-of-the-art models shows that closed-source models generally outperform open-source models on culturally sensitive tasks, but exhibit greater variability on low-resource language evaluations.

🔬 The findings underscore the need to analyze model performance by language resource availability: high-resource languages generally perform better, while low-resource languages show larger performance drops and higher variability.

Global-MMLU, developed by researchers from Cohere For AI, EPFL, Hugging Face, Mila, McGill University & Canada CIFAR AI Chair, AI Singapore, National University of Singapore, Cohere, MIT, KAIST, Instituto de Telecomunicações, Instituto Superior Técnico, Universidade de Lisboa, MIT-IBM Watson AI Lab, Carnegie Mellon University, and CONICET & Universidad de Buenos Aires, emerges as a transformative benchmark designed to overcome the limitations of traditional multilingual datasets, particularly the Massive Multitask Language Understanding (MMLU) dataset.

The motivation for Global-MMLU stems from critical observations about the shortcomings of existing datasets. These datasets often reflect Western-centric cultural paradigms and depend heavily on machine translation, which can distort meaning and introduce biases. For example, the original MMLU dataset is predominantly aligned with Western knowledge systems: 28% of its questions require culturally sensitive insights, and 86.5% of those are rooted in Western cultural contexts. In addition, 84.9% of geographic knowledge questions are North America- or Europe-centric, underscoring the need for greater global inclusivity.

Global-MMLU seeks to correct these imbalances by introducing a dataset spanning 42 languages, encompassing both high- and low-resource languages. The inclusion of culturally sensitive (CS) and culturally agnostic (CA) subsets allows for a more granular evaluation of multilingual capabilities: CS subsets demand cultural, geographic, or dialect-specific knowledge, while CA subsets focus on universal, non-contextual tasks. The creation of Global-MMLU involved a rigorous data curation process that combined professional translations, community contributions, and improved machine translation techniques. Notably, professional annotators produced high-accuracy translations for key languages such as Arabic, French, Hindi, and Spanish, while community-driven efforts further enriched the dataset by addressing linguistic nuances in less-resourced languages.
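For readers who want to explore the data directly, the sketch below shows how this CS/CA split might be used in practice. It assumes the dataset is published on the Hugging Face Hub as CohereForAI/Global-MMLU with one configuration per language and a per-question cultural-sensitivity label; the repository name and field name are assumptions here, so check the dataset card for the actual schema.

```python
# Minimal sketch; assumes the Hub repo name and the "cultural_sensitivity_label"
# field, which may differ from the actual schema on the dataset card.
from datasets import load_dataset

LANG = "hi"  # one of the 42 language configs, e.g. "ar", "fr", "hi", "es"

ds = load_dataset("CohereForAI/Global-MMLU", LANG, split="test")

# Partition into culturally sensitive (CS) and culturally agnostic (CA) questions.
cs_subset = ds.filter(lambda row: row["cultural_sensitivity_label"] == "CS")
ca_subset = ds.filter(lambda row: row["cultural_sensitivity_label"] == "CA")

print(f"{LANG}: {len(cs_subset)} CS questions, {len(ca_subset)} CA questions")
```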

A critical innovation of Global-MMLU lies in its evaluation methodology. By separately analyzing CS and CA subsets, researchers can assess the true multilingual capabilities of LLMs. For instance, cultural sensitivity significantly impacts model rankings, with average shifts of 5.7 ranks and 7.3 positions on CS datasets, compared to 3.4 ranks and 3.7 positions on CA datasets. These findings highlight the variability in model performance when handling culturally nuanced versus universal knowledge tasks. The evaluation of 14 state-of-the-art models, including proprietary systems like GPT-4o and Claude Sonnet 3.5, revealed critical insights. Closed-source models generally outperformed open-weight counterparts, particularly in culturally sensitive tasks. However, they also exhibited greater variability in low-resource language evaluations, underscoring the challenges of creating robust multilingual systems.
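To make the rank-shift idea concrete, here is a minimal, purely illustrative sketch of how a model's ranking can move between the CS and CA subsets; the model names and accuracy values are hypothetical placeholders, not results from the paper.

```python
# Illustrative only: model names and accuracies are hypothetical placeholders.
def rank(scores: dict) -> dict:
    """Map model name -> rank (1 = best), ordered by descending accuracy."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {model: position + 1 for position, model in enumerate(ordered)}

cs_accuracy = {"model_a": 0.71, "model_b": 0.64, "model_c": 0.69}
ca_accuracy = {"model_a": 0.74, "model_b": 0.72, "model_c": 0.70}

cs_rank, ca_rank = rank(cs_accuracy), rank(ca_accuracy)
shifts = {m: abs(cs_rank[m] - ca_rank[m]) for m in cs_accuracy}

print("rank shift per model (CS vs. CA):", shifts)
print("average rank shift:", sum(shifts.values()) / len(shifts))
```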

The Global-MMLU dataset builds on professional translations, community contributions, and state-of-the-art machine translation techniques, with an emphasis on addressing translation artifacts and cultural biases. Unlike traditional approaches that rely heavily on automated translation, Global-MMLU incorporates human-verified translations for improved accuracy and cultural relevance. These efforts focused on four "gold-standard" languages (Arabic, French, Hindi, and Spanish), for which professional annotators ensured the translations adhered to both linguistic fluency and cultural appropriateness. Community contributions enriched the dataset for eleven other languages, with at least fifty samples per language verified by native speakers to ensure quality.

A key challenge addressed in Global-MMLU is the inherent variability in culturally sensitive tasks. The annotation process involved categorizing questions based on their reliance on cultural knowledge, regional specificity, and dialectal understanding. For instance, questions requiring cultural knowledge often reflected Western-centric paradigms, which dominate 86.5% of the culturally sensitive subset. In contrast, regions like South Asia and Africa were significantly underrepresented, accounting for a mere 4% and 1%, respectively. Geographic biases were also apparent, with 64.5% of questions requiring regional knowledge focused on North America and 20.4% on Europe. Such imbalances highlighted the necessity of re-evaluating model capabilities on more inclusive datasets.

Closed-source models like GPT-4o and Claude Sonnet 3.5 demonstrated strong performance across both subsets, yet their rankings showed greater variability when handling culturally nuanced tasks. This variability was pronounced in low-resource languages such as Amharic and Igbo, where limited training data often exacerbates the challenges of multilingual evaluations. Models trained predominantly on high-resource language datasets displayed clear biases, often underperforming in culturally diverse or less-represented contexts.

The findings also underscored the need to disaggregate model performance by resource availability of languages. For instance, high-resource languages like English and French achieved the highest accuracy levels, while low-resource languages exhibited significant drops in performance accompanied by higher variability. In culturally sensitive subsets, this variability was amplified due to the nuanced understanding required to interpret cultural, regional, and vernacular references. This trend was not limited to low-resource languages; even high-resource languages experienced variability in rankings when cultural sensitivity was a factor. For example, Hindi and Chinese emerged as the most sensitive languages to culturally specific tasks, showing significant rank changes across evaluated models.
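A rough sketch of what this disaggregation looks like in practice is shown below: per-language accuracies are grouped by an assumed resource tier, and the mean and spread are reported per tier. The tier assignments and accuracy values are hypothetical placeholders, not measurements from the Global-MMLU evaluation.

```python
# Hypothetical example: tiers and accuracies are placeholders for illustration.
from statistics import mean, pstdev

resource_tier = {"en": "high", "fr": "high", "hi": "mid", "am": "low", "ig": "low"}
accuracy = {"en": 0.78, "fr": 0.75, "hi": 0.62, "am": 0.48, "ig": 0.41}

# Group per-language accuracies by resource tier.
by_tier = {}
for lang, acc in accuracy.items():
    by_tier.setdefault(resource_tier[lang], []).append(acc)

# Report mean accuracy and spread (population std. dev.) per tier.
for tier, scores in sorted(by_tier.items()):
    print(f"{tier}-resource: mean accuracy {mean(scores):.2f}, spread {pstdev(scores):.2f}")
```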

Global-MMLU introduced separate analyses of the culturally sensitive and culturally agnostic subsets to ensure the robustness of both the dataset and its evaluations. This approach revealed that models demonstrate varying cultural adaptability even within high-resource languages. Closed-source models generally outperformed open-weight systems, yet both categories struggled with tasks requiring a deep contextual understanding of culturally nuanced material. The distinct categorization of culturally sensitive and agnostic tasks allowed researchers to pinpoint areas where language models excel or falter.

In conclusion, Global-MMLU stands as a data-rich benchmark redefining multilingual AI evaluation by addressing critical cultural and linguistic representation gaps. The dataset encompasses 42 languages, including low-resource languages like Amharic and Igbo, and integrates 14,000 samples with over 589,000 translations. Of these, 28% require culturally sensitive knowledge, with 86.5% rooted in Western cultural paradigms. Evaluations revealed that culturally sensitive tasks induce an average rank variability of 5.7 ranks and 7.3 positions across models. High-resource languages achieved superior performance, while low-resource languages showed significant variability, with accuracy fluctuations of up to 6.78%.


