MarkTechPost@AI · December 13, 2024
Meet Maya: An 8B Open-Source Multilingual Multimodal Model with Toxicity-Free Datasets and Cultural Intelligence Across Eight Languages

Maya is an open-source multilingual multimodal vision-language model with 8 billion parameters. It overcomes the quality and toxicity limitations of existing datasets by using a new pretraining dataset of 558,000 image-text pairs, distributed evenly across eight languages and rigorously filtered for toxicity. Maya performs strongly on multilingual benchmarks, outperforming similar-size models in five languages and demonstrating robust cross-lingual understanding and generation.

🌐 Maya uses a new pretraining dataset of 558,000 image-text pairs, distributed evenly across eight languages (English, Chinese, French, Spanish, Russian, Hindi, Japanese, and Arabic) and rigorously filtered for toxicity, with 7,531 toxic elements removed to ensure clean data.

⚖️ Through optimized translation and pretraining strategies, the model achieves a balanced data distribution and cultural inclusivity across its eight supported languages, effectively avoiding data bias.

🖼️ Maya uses SigLIP for vision encoding and Aya-23 for multilingual modeling, achieving high-quality image-text alignment and cross-lingual understanding.

🏆 On multilingual benchmarks, Maya outperforms similar-size models in five languages and matches much larger models on several benchmarks.

🛡️ By tackling toxicity and bias, Maya sets a precedent for ethical and fair AI practice and advances multilingual AI.

Vision-Language Models (VLMs) allow machines to understand and reason about the visual world through natural language. These models have applications in image captioning, visual question answering, and multimodal reasoning. However, most models are designed and trained predominantly for high-resource languages, leaving substantial gaps in accessibility and usability for speakers of low-resource languages. This gap highlights the importance of developing multilingual systems that cater to a global audience while maintaining high performance across diverse linguistic and cultural contexts. A central obstacle to building such multilingual VLMs is the availability and quality of multilingual datasets.

Even where multilingual datasets exist, they share common limitations: limited representation of diverse languages, combined with the presence of culturally insensitive material, hampers the performance of VLMs in underrepresented regions and raises concerns about fairness and inclusivity.

Researchers have turned to various methods of dataset expansion and quality improvement to address these limitations. Datasets such as Multi30k and Crossmodal-3600 provide multilingual support but remain limited in scale and diversity. Semi-automated translation of image-text datasets has been used to extend language coverage in models such as PALO and X-LLaVA. However, these efforts often produce uneven distributions across languages and fail to address toxicity present in the original data, and the lack of systematic approaches to filtering harmful content compounds the problem.

A team of researchers from Cisco Meraki, Cohere For AI Community, Indiana University Bloomington, Imperial College London, Georgia Institute of Technology, The Alan Turing Institute, Bangladesh University of Engineering and Technology, University of Pennsylvania, IIT Bombay, TU Darmstadt, Articul8 AI, Capital One, IIT Dhanbad, and MBZUAI introduced Maya, an 8B-parameter open-source multilingual multimodal vision-language model that aims to overcome existing dataset quality and toxicity limitations. The model leverages a new pretraining dataset containing 558,000 image-text pairs distributed equally across eight languages: English, Chinese, French, Spanish, Russian, Hindi, Japanese, and Arabic. This dataset underwent rigorous toxicity filtering, with 7,531 toxic images and captions removed using tools like LLaVAGuard and Toxic-BERT. Maya’s development also focused on balancing data distribution to prevent biases.
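The paper pairs LLaVAGuard for image-level filtering with Toxic-BERT for caption-level filtering. As a rough illustration of the caption side only, here is a minimal sketch built on the publicly available unitary/toxic-bert checkpoint; the 0.5 threshold and the record format are assumptions for illustration, not the authors' actual settings.

```python
# Hedged sketch: caption-level toxicity filtering with Toxic-BERT.
# Threshold and record format are illustrative assumptions, not the paper's settings.
from transformers import pipeline

# unitary/toxic-bert is a public multi-label classifier
# (labels: toxic, severe_toxic, obscene, threat, insult, identity_hate).
toxicity = pipeline("text-classification", model="unitary/toxic-bert")

def filter_pairs(pairs, threshold=0.5):
    """Keep pairs whose caption stays below the threshold on every toxicity label."""
    kept = []
    for pair in pairs:
        # top_k=None returns scores for all labels of this caption
        scores = toxicity([pair["caption"]], top_k=None, truncation=True)[0]
        if max(s["score"] for s in scores) < threshold:
            kept.append(pair)
    return kept

pairs = [{"image": "img_001.jpg", "caption": "A child flying a red kite on a beach."}]
print(len(filter_pairs(pairs)))  # 1 if the caption is judged non-toxic
```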

Maya’s architecture is built on the LLaVA framework and incorporates advanced techniques for image-text alignment and multilingual adaptation. The model employs SigLIP, a vision encoder capable of handling variable input dimensions, and Aya-23, a multilingual language model trained across 23 languages. A two-layer projection maps image features into the language model's feature space, optimizing performance while maintaining computational efficiency. Pretraining was conducted on 8xH100 GPUs with a global batch size of 256; instruction fine-tuning used the PALO 150K dataset. Pretraining took approximately 20 hours and fine-tuning about 48 hours.
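To make the connector concrete, below is a minimal PyTorch sketch of a LLaVA-style two-layer projection that maps vision-encoder patch features into the language model's embedding space. The dimensions (1152 for SigLIP features, 4096 for Aya-23's hidden size), the patch count, and the GELU activation are assumptions for illustration, not Maya's published configuration.

```python
# Hedged sketch of a two-layer vision-to-language projector (LLaVA convention).
# All dimensions are illustrative assumptions, not Maya's actual configuration.
import torch
import torch.nn as nn

class VisionToTextProjector(nn.Module):
    def __init__(self, vision_dim=1152, hidden_dim=4096, text_dim=4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, hidden_dim),  # first projection layer
            nn.GELU(),
            nn.Linear(hidden_dim, text_dim),    # second projection layer
        )

    def forward(self, image_features: torch.Tensor) -> torch.Tensor:
        # image_features: (batch, num_patches, vision_dim) from the vision encoder
        # returns:        (batch, num_patches, text_dim) tokens for the language model
        return self.proj(image_features)

# Example: a batch of 2 images, 729 patch tokens each
features = torch.randn(2, 729, 1152)
tokens = VisionToTextProjector()(features)
print(tokens.shape)  # torch.Size([2, 729, 4096])
```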

On multilingual benchmarks such as LLaVA-Bench-In-The-Wild, Maya outperformed similar-size models like LLaVA-7B and PALO-7B in five of eight languages, with notable success in Arabic owing to its robust translation and dataset design. On English-only benchmarks, Maya maintained competitive accuracy, with marginal gains in tasks like text translation and numerical calculation for the toxicity-free variant. Some complex reasoning tasks, however, showed slight performance declines, suggesting that removing diverse, potentially toxic content may affect certain capabilities.

Key takeaways from the Maya research:

- A new pretraining dataset of 558,000 image-text pairs, distributed equally across English, Chinese, French, Spanish, Russian, Hindi, Japanese, and Arabic.
- Rigorous toxicity filtering with LLaVAGuard and Toxic-BERT, removing 7,531 toxic images and captions.
- A LLaVA-based architecture pairing the SigLIP vision encoder with the Aya-23 multilingual language model through a two-layer projection.
- Performance that surpasses similar-size models such as LLaVA-7B and PALO-7B in five of eight languages on LLaVA-Bench-In-The-Wild, while remaining competitive on English-only benchmarks.

In conclusion, Maya addresses the scarcity of multilingual, culturally sensitive datasets in VLMs. The model combines a dataset of 558,000 image-text pairs across eight languages with rigorous toxicity filtering and balanced representation to ensure inclusivity and ethical deployment. Leveraging an advanced architecture and multilingual adaptation techniques, Maya outperforms similar-size models in multiple languages, setting a new standard for multilingual AI.


Check out the Paper and Model on Hugging Face. All credit for this research goes to the researchers of this project.


