MarkTechPost@AI, February 14
Google DeepMind Research Introduces WebLI-100B: Scaling Vision-Language Pretraining to 100 Billion Examples for Cultural Diversity and Multilinguality

Researchers at Google DeepMind have introduced WebLI-100B, a dataset of 100 billion image-text pairs intended to improve the cultural diversity and multilingual performance of vision-language models. By scaling up the data rather than filtering it aggressively, the dataset preserves a broader range of cultural detail and linguistic representation. Experiments show that models trained on the full dataset outperform those trained on smaller subsets on cultural-diversity tasks and low-resource-language retrieval. Although gains on traditional benchmarks are limited, WebLI-100B improves inclusivity by narrowing performance gaps across subgroups, providing a reference point for future research.

🌍 The WebLI-100B dataset contains 100 billion image-text pairs, ten times the size of earlier datasets, and aims to improve the cultural diversity and multilinguality of vision-language models, with particular attention to low-resource languages and diverse representation.

🖼️ Unlike previous datasets, WebLI-100B focuses on scaling up the data rather than relying on heavy filtering, preserving important cultural details that filtering would otherwise discard.

📊 The researchers analyzed the effect of data scaling by pre-training models on WebLI-100B subsets of different sizes (1B, 10B, and 100B). Models trained on the full dataset performed better on cultural and multilingual tasks, even under the same compute budget.

⚖️ The researchers also built a quality-filtered 5B dataset and a language-rebalanced version that upsamples low-resource languages (a rebalancing sketch follows this list). Evaluations show that growing the dataset from 10B to 100B has little effect on Western-centric benchmarks but improves cultural-diversity tasks and low-resource-language retrieval.

⚠️ Although performance gaps narrow as diversity improves, the bias analysis still finds gender-related representation and association biases, showing that bias needs continued attention even as model performance improves.
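As a rough illustration of what language rebalancing can look like, the sketch below upsamples low-resource languages with temperature-scaled sampling weights. The exact rebalancing recipe used for WebLI-100B is not detailed here, so the function name, temperature value, and language counts are illustrative assumptions rather than the authors' settings.

```python
# Illustrative sketch (not the authors' pipeline): temperature-scaled sampling
# weights that upsample low-resource languages relative to their raw frequency.
from collections import Counter

def rebalanced_weights(language_counts: dict[str, int], temperature: float = 0.5) -> dict[str, float]:
    """Return per-language sampling probabilities.

    temperature < 1.0 flattens the raw frequency distribution, so rare
    languages are sampled more often than their share of the web crawl.
    """
    total = sum(language_counts.values())
    scaled = {lang: (count / total) ** temperature for lang, count in language_counts.items()}
    norm = sum(scaled.values())
    return {lang: weight / norm for lang, weight in scaled.items()}

# Toy counts: English dominates the crawl, Swahili is comparatively rare.
counts = Counter({"en": 60_000_000_000, "zh": 8_000_000_000, "sw": 50_000_000})
print(rebalanced_weights(counts, temperature=0.5))
```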

Machines learn to connect images and text by training on large datasets, where more data helps models recognize patterns and improve accuracy. Vision-language models (VLMs) rely on these datasets to perform tasks such as image captioning and visual question answering. The open question is whether scaling datasets to 100 billion examples meaningfully improves accuracy, cultural diversity, and support for low-resource languages. Progress beyond roughly 10 billion examples has slowed, raising doubts about whether further scaling delivers additional benefits, and data at this scale brings quality-control problems, bias, and computational constraints.

Current vision-language models depend on massive datasets such as Conceptual Captions and LAION, which contain millions to billions of image-text pairs. These datasets enable zero-shot classification and image captioning, but their growth has plateaued at around 10 billion pairs, limiting the room left for further gains in model accuracy, inclusivity, and multilingual understanding. Existing approaches rely on web-crawled data, which suffers from low-quality samples, linguistic biases, and the underrepresentation of many cultures, so it is unclear whether increasing dataset size alone can still produce substantial improvements.

To mitigate limitations in cultural diversity and multilinguality in vision-language models, researchers from Google DeepMind proposed WebLI-100B, a dataset of 100 billion image-text pairs, ten times larger than previous datasets. The dataset captures rare cultural concepts and improves model performance in less-explored areas such as low-resource languages and diverse representations. Unlike prior datasets, WebLI-100B focuses on scaling data instead of relying on heavy filtering, which often removes important cultural details, so it retains a broad representation of languages and cultural elements and is more inclusive. The framework involves pre-training models on subsets of WebLI-100B of different sizes (1B, 10B, and 100B) to analyze the impact of data scaling. Models trained on the full dataset performed better on cultural and multilingual tasks than those trained on smaller subsets, even when using the same computational resources.
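To make the compute-matched comparison concrete, here is a minimal sketch of how training runs on differently sized subsets can be held to the same compute budget by fixing the total number of examples seen; the batch size and example budget are illustrative assumptions, not the paper's settings.

```python
# Illustrative sketch (not the authors' training code): under a fixed budget of
# seen examples, a model trained on the 1B subset simply makes more passes over
# its data than one trained on the 100B dataset, so compute stays comparable.
def training_schedule(dataset_size: int,
                      seen_examples: int = 100_000_000_000,  # assumed budget
                      batch_size: int = 32_768) -> dict:      # assumed batch size
    steps = seen_examples // batch_size
    epochs = seen_examples / dataset_size
    return {"steps": steps, "epochs": round(epochs, 2)}

for size in (1_000_000_000, 10_000_000_000, 100_000_000_000):
    print(f"{size:>15,d} examples -> {training_schedule(size)}")
```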

The researchers also created a quality-filtered 5B dataset and a language-rebalanced version that upsamples low-resource languages. Using the SigLIP model, they trained on the different dataset sizes (1B, 10B, and 100B) with ViT architectures and a contrastive learning approach. The evaluation covered zero-shot and few-shot classification tasks such as ImageNet, CIFAR-100, and COCO Captions, cultural diversity benchmarks such as Dollar Street and GeoDE, and multilingual retrieval on Crossmodal-3600. Results indicated that increasing the dataset size from 10B to 100B had minimal impact on Western-centric benchmarks but improved cultural diversity tasks and low-resource-language retrieval. Bias analysis revealed persistent gender-related representation and association biases, even though performance disparities narrowed as diversity improved. Finally, the researchers assessed transferability to generative tasks using PaliGemma, testing frozen and unfrozen settings for downstream vision-language applications.
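The core training signal in SigLIP-style pre-training is a pairwise sigmoid loss rather than a batch-wide softmax over similarities. The sketch below shows that loss on already-computed, L2-normalized embeddings; it is a minimal PyTorch illustration, and the temperature and bias values are placeholders (in SigLIP they are learned scalars).

```python
# Minimal sketch of a SigLIP-style pairwise sigmoid loss, assuming normalized
# image/text embeddings; temperature t and bias b are illustrative constants.
import torch
import torch.nn.functional as F

def siglip_loss(img_emb: torch.Tensor, txt_emb: torch.Tensor,
                t: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    # Pairwise similarities between every image and every text in the batch.
    logits = img_emb @ txt_emb.T * t + b                     # shape (N, N)
    # +1 on the diagonal (matched pairs), -1 elsewhere (mismatched pairs).
    labels = 2 * torch.eye(logits.shape[0], device=logits.device) - 1
    # Sum of per-pair sigmoid losses, averaged over the batch; no softmax needed.
    return -F.logsigmoid(labels * logits).sum() / logits.shape[0]

# Toy usage with random, normalized embeddings.
img = F.normalize(torch.randn(8, 512), dim=-1)
txt = F.normalize(torch.randn(8, 512), dim=-1)
loss = siglip_loss(img, txt, t=torch.tensor(10.0), b=torch.tensor(-10.0))
print(loss.item())
```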

In conclusion, scaling vision-language pre-training datasets to 100 billion image-text pairs improved inclusivity by enhancing cultural diversity and multilinguality and by reducing performance disparities across subgroups, even though traditional benchmarks showed limited gains. While quality filters such as CLIP improved performance on standard tasks, they often reduced data diversity. The work can serve as a reference for future research, motivating filtering algorithms that preserve diversity, training strategies that promote inclusivity without requiring extra data, and better trade-offs between performance, diversity, and fairness in vision-language models.


Check out the Paper. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 75k+ ML SubReddit.


The post Google DeepMind Research Introduces WebLI-100B: Scaling Vision-Language Pretraining to 100 Billion Examples for Cultural Diversity and Multilinguality appeared first on MarkTechPost.

