MarkTechPost@AI December 7, 2024
Cohere AI Introduces INCLUDE: A Comprehensive Multilingual Language Understanding Benchmark

With the rapid advancement of AI technology, the effectiveness of Large Language Models (LLMs) across different languages and cultural contexts has become critically important. However, the lack of evaluation benchmarks for non-English languages limits the application of LLMs in underserved regions. Researchers from Cohere AI, EPFL, and other institutions have proposed a comprehensive multilingual understanding benchmark called INCLUDE. By collecting 197,243 multiple-choice question-answer pairs covering 44 languages and 15 unique scripts from local sources in 52 countries and regions, it fills a gap in existing evaluation methods and provides a robust framework for assessing AI systems' understanding of regional and cultural knowledge.

🌐 Existing evaluation frameworks are predominantly English-centric, which hinders the development of equitable AI technologies and widens the digital divide between language communities. The lack of evaluation benchmarks for non-English languages limits the application of large language models in underserved regions.

📚 Current research has improved evaluation benchmarks for LLMs, such as GLUE, SuperGLUE, and MMLU, but these benchmarks focus mainly on English-based data, imposing substantial limitations on the development of multilingual models.

🌍 Researchers from EPFL, Cohere For AI, ETH Zurich, and the Swiss AI Initiative have proposed a comprehensive multilingual language understanding benchmark called INCLUDE, which addresses critical gaps in existing evaluation methods by collecting regional resources directly from native-language sources.

📝 The INCLUDE benchmark contains 197,243 multiple-choice question-answer pairs from 1,926 examinations in 52 countries and regions, covering 44 languages and 15 unique scripts. The exam materials were all collected from local sources to capture authentic linguistic and cultural nuances.

📊 Evaluation on the INCLUDE benchmark gives a detailed picture of multilingual LLM performance across 44 languages. GPT-4o performed best, reaching approximately 77.1% accuracy across all domains. Chain-of-Thought (CoT) prompting showed moderate gains on professional and STEM-related examinations.

The rapid advancement of AI technologies highlights the critical need for Large Language Models (LLMs) that can perform effectively across diverse linguistic and cultural contexts. A key challenge is the lack of evaluation benchmarks for non-English languages, which limits the potential of LLMs in underserved regions. Most existing evaluation frameworks are English-centric, creating barriers to developing equitable AI technologies. This evaluation gap discourages practitioners from training multilingual models and widens digital divides across different language communities. Technical challenges further compound these issues, including limited dataset diversity and a reliance on translation-based data collection methods.

Existing research efforts have made significant progress in developing evaluation benchmarks for LLMs. Pioneering frameworks like GLUE and SuperGLUE advanced language understanding tasks, while subsequent benchmarks such as MMLU, HellaSwag, ARC, GSM8K, and BigBench enhanced knowledge comprehension and reasoning. However, these benchmarks predominantly focus on English-based data, creating substantial limitations for multilingual model development. Datasets like Exams and Aya attempt broader language coverage, but they are limited in scope, either focusing on specific educational curricula or lacking region-specific evaluation depth. Cultural understanding benchmarks explore language and societal nuances but do not provide holistic approaches to multilingual model assessment.

Researchers from EPFL, Cohere For AI, ETH Zurich, and the Swiss AI Initiative have proposed a comprehensive multilingual language understanding benchmark called INCLUDE. The benchmark addresses the critical gaps in existing evaluation methodologies by collecting regional resources directly from native language sources. Researchers designed an innovative pipeline to capture authentic linguistic and cultural nuances using educational, professional, and practical tests specific to different countries. The benchmark consists of 197,243 multiple-choice question-answer pairs from 1,926 examinations across 44 languages and 15 unique scripts. These examinations are collected from local sources in 52 countries.
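To make the dataset's structure concrete, here is a minimal Python sketch of how such a multiple-choice item could be represented and scored with exact-match accuracy. The field names and sample items are illustrative assumptions, not the benchmark's actual schema or content.

```python
# A minimal sketch of an INCLUDE-style multiple-choice item and its scoring.
# Field names and example content are illustrative assumptions.
from dataclasses import dataclass
from typing import List

@dataclass
class MCQItem:
    question: str
    options: List[str]      # the answer choices for the question
    answer_index: int       # index of the correct option
    language: str           # language of the source exam
    country: str            # country/region where the exam was administered

def accuracy(items: List[MCQItem], predictions: List[int]) -> float:
    """Fraction of items where the predicted option index matches the key."""
    correct = sum(1 for item, pred in zip(items, predictions)
                  if pred == item.answer_index)
    return correct / len(items) if items else 0.0

# Toy usage with made-up items and predictions.
items = [
    MCQItem("2 + 2 = ?", ["3", "4", "5", "6"], 1, "el", "Greece"),
    MCQItem("Capital of Kenya?", ["Nairobi", "Mombasa", "Kisumu", "Eldoret"], 0, "sw", "Kenya"),
]
print(f"accuracy = {accuracy(items, [1, 0]):.2f}")  # -> accuracy = 1.00
```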

The INCLUDE benchmark utilizes a complex annotation methodology to investigate factors driving multilingual performance. The researchers developed a comprehensive categorization approach that addresses the challenges of sample-level annotation by labeling exam sources instead of individual questions. This strategy allows for a nuanced understanding of the dataset’s composition while managing the prohibitive costs of detailed annotation. The annotation framework consists of two primary categorization schemes. Region-agnostic questions, comprising 34.4% of the dataset, cover universal topics like mathematics and physics. Region-specific questions are further subdivided into explicit, cultural, and implicit regional knowledge categories.
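The exam-level labeling strategy can be illustrated with a short sketch: each source exam carries one annotation label, every question inherits that label, and dataset-level proportions then fall out of a simple tally. The exam names, labels, and question counts below are invented for illustration only.

```python
# Sketch of exam-level (rather than question-level) annotation.
# Exam names, labels, and counts are hypothetical.
from collections import Counter

# Hypothetical mapping: exam source -> annotation label
exam_labels = {
    "national_math_olympiad": "region_agnostic",
    "physics_entrance_exam":  "region_agnostic",
    "civics_license_test":    "region_specific/explicit",
    "local_history_exam":     "region_specific/cultural",
    "driving_theory_test":    "region_specific/implicit",
}

# Hypothetical number of questions drawn from each exam
questions_per_exam = {
    "national_math_olympiad": 120,
    "physics_entrance_exam":  150,
    "civics_license_test":    200,
    "local_history_exam":      90,
    "driving_theory_test":    140,
}

# Every question inherits its exam's label, so dataset-level proportions
# can be computed without annotating individual questions.
label_counts = Counter()
for exam, label in exam_labels.items():
    label_counts[label] += questions_per_exam[exam]

total = sum(label_counts.values())
for label, count in sorted(label_counts.items()):
    print(f"{label:28s} {count / total:6.1%}")
```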

The evaluation of the INCLUDE benchmark reveals detailed insights into multilingual LLM performance across 44 languages. GPT-4o emerges as the top performer, achieving an impressive accuracy of approximately 77.1% across all domains. Chain-of-Thought (CoT) prompting shows moderate performance enhancements in Professional and STEM-related examinations, with minimal gains in the Licenses and Humanities domains. Larger models like Aya-expanse-32B and Qwen2.5-14B show substantial improvements over their smaller counterparts, with 12% and 7% performance gains respectively. Gemma-7B shows the best performance among smaller models, excelling in the Humanities and Licenses categories, while Qwen models show superiority in STEM and Professional domains.
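The contrast between direct-answer and Chain-of-Thought prompting on a multiple-choice item can be sketched as two prompt templates. The wording below is an assumption made for illustration and does not reproduce the exact templates used in the paper's evaluation.

```python
# Sketch of two prompting styles for a multiple-choice question.
# Template wording is an illustrative assumption.
def direct_prompt(question: str, options: list[str]) -> str:
    """Ask the model to answer with the option letter only."""
    lettered = "\n".join(f"{chr(65 + i)}. {opt}" for i, opt in enumerate(options))
    return (f"Question: {question}\n{lettered}\n"
            "Answer with the letter of the correct option only.")

def cot_prompt(question: str, options: list[str]) -> str:
    """Ask the model to reason step by step before giving the option letter."""
    lettered = "\n".join(f"{chr(65 + i)}. {opt}" for i, opt in enumerate(options))
    return (f"Question: {question}\n{lettered}\n"
            "Think step by step, then give the letter of the correct option.")

print(direct_prompt("Which planet is largest?", ["Mars", "Jupiter", "Venus", "Earth"]))
```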

In conclusion, the researchers introduced the INCLUDE benchmark, which represents an advancement in multilingual LLM evaluation. By compiling 197,243 multiple-choice question-answer pairs from 1,926 examinations across 44 languages and 15 scripts, they provide a framework for evaluating regional and cultural knowledge understanding in AI systems. The evaluation of 15 different models reveals significant variability in multilingual performance and highlights opportunities for improvement in regional knowledge comprehension. This benchmark sets a new standard for multilingual AI assessment and underscores the need for continued innovation in creating more equitable, culturally aware artificial intelligence technologies.


Check out the Paper and Dataset. All credit for this research goes to the researchers of this project.


