MarkTechPost@AI · July 26, 2024
MINT-1T Dataset Released: A Multimodal Dataset with One Trillion Tokens to Build Large Multimodal Models

MINT-1T is a multimodal dataset with one trillion text tokens and 3.4 billion images, built for training large multimodal models. It combines text and images from HTML documents, PDFs, and ArXiv papers, and is larger and richer than earlier datasets such as OBELICS, making it suitable for training more capable multimodal models.

😁 **Dataset scale and diversity:** MINT-1T contains one trillion text tokens and 3.4 billion images drawn from HTML documents, PDFs, and ArXiv papers, a tenfold increase over the earlier OBELICS dataset, giving multimodal models far more data to train on.

🤔 **Construction process:** Building MINT-1T involved sourcing, filtering, and deduplication. The researchers extracted text and images from HTML documents, PDFs, and ArXiv papers, filtered out low-quality, non-English, and inappropriate content with language-identification and NSFW-detection tools, and removed duplicate paragraphs and images using Bloom filters and hashing.

🚀 **Applications:** MINT-1T can be used to train a wide range of multimodal models, including those for visual question answering and multimodal reasoning. The researchers found that models trained on MINT-1T performed strongly on a variety of benchmarks, often surpassing models trained on other datasets.

💡 **Significance:** The release of MINT-1T gives multimodal AI research a valuable resource for building stronger multimodal models. Its scale and diversity help researchers overcome data limitations and push multimodal AI forward.

📊 **Dataset details:** MINT-1T contains 922 billion HTML tokens, 106 billion PDF tokens, and 9 billion ArXiv tokens, along with images from a range of sources, such as web pages, pictures embedded in PDF documents, and figures from ArXiv papers.

Artificial intelligence, particularly in training large multimodal models (LMMs), relies heavily on vast datasets that include sequences of images and text. These datasets enable the development of sophisticated models capable of understanding and generating multimodal content. As AI models’ capabilities advance, the need for extensive, high-quality datasets becomes even more critical, driving researchers to explore new data collection and curation methods.

A significant challenge in AI research is the scarcity of large-scale, open-source, multimodal interleaved datasets. These datasets are essential for training models that seamlessly integrate text and image data. The limited availability of such datasets hampers the development of robust, high-performing open-source models, leaving a performance gap between open-source and proprietary models. Closing this gap requires new approaches to dataset creation that can provide the necessary scale and diversity.

Existing methods for creating multimodal datasets typically collect and curate data from HTML documents. Notable datasets like OBELICS have been instrumental but are limited in scale and diversity because they source data primarily from HTML. This restriction limits the variety and richness of the data, which in turn affects the performance and applicability of the resulting AI models. Researchers have found that datasets sourced solely from HTML documents fail to capture the full spectrum of multimodal content needed for comprehensive model training.

Researchers from the University of Washington, Salesforce Research, Stanford University, the University of Texas at Austin, and the University of California, Berkeley introduced MINT-1T, the most extensive and diverse open-source multimodal interleaved dataset to date, addressing the need for larger and more varied datasets. MINT-1T comprises one trillion text tokens and 3.4 billion images drawn from HTML documents, PDFs, and ArXiv papers, a tenfold increase over previous datasets that significantly expands the data available for training multimodal models. The collaboration across these institutions reflects a concerted effort to close the gap in dataset availability.

Creating the MINT-1T dataset involved an intricate process of sourcing, filtering, and deduplicating data. HTML documents were expanded to include data from earlier years, and PDFs were processed to extract readable text and images. ArXiv papers were parsed for figures and text, ensuring a comprehensive collection of multimodal content. Advanced filtering methods were employed to remove low-quality, non-English, and inappropriate content. Deduplication processes were also implemented to eliminate repetitive data, ensuring the dataset’s quality and diversity.
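To make the notion of an interleaved document concrete, the sketch below shows one common way such documents are represented after extraction, with text blocks and image references kept in reading order (the convention used by OBELICS-style datasets). The class and field names are illustrative assumptions, not the released MINT-1T schema.

```python
# A minimal sketch of one possible representation for an interleaved multimodal
# document after extraction from HTML, PDF, or ArXiv sources. Field names are
# illustrative assumptions, not the released MINT-1T schema.
from dataclasses import dataclass
from typing import Optional


@dataclass
class InterleavedDocument:
    source: str                      # "html", "pdf", or "arxiv"
    texts: list[Optional[str]]       # text block at position i, or None if an image sits there
    images: list[Optional[str]]      # image reference at position i, or None if text sits there

    def iter_items(self):
        """Yield ("text", block) or ("image", ref) in the original reading order."""
        for text, image in zip(self.texts, self.images):
            yield ("text", text) if text is not None else ("image", image)


# Example: a PDF page with a paragraph, then a figure, then its caption.
doc = InterleavedDocument(
    source="pdf",
    texts=["Introductory paragraph ...", None, "Figure 1 caption ..."],
    images=[None, "page3_figure1.png", None],
)
for kind, item in doc.iter_items():
    print(kind, item)
```

Keeping text and image slots aligned in parallel lists preserves the original interleaving, which is what separates datasets like this from plain image-caption pairs.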

Experiments demonstrated that LMMs trained on the MINT-1T dataset matched and often surpassed the performance of models trained on previous leading datasets like OBELICS. Including more diverse sources in MINT-1T resulted in better generalization and performance across various benchmarks. Notably, the dataset significantly improved performance on tasks involving visual question answering and multimodal reasoning. The researchers also found that models trained on MINT-1T performed better across multiple in-context demonstrations, highlighting the dataset's effectiveness.

The MINT-1T dataset’s construction included detailed steps to ensure data quality and diversity. The dataset consists of 922 billion HTML tokens, 106 billion PDF tokens, and 9 billion ArXiv tokens. The filtering process eliminated documents with inappropriate content or non-English text, using tools like fastText for language identification and NSFW detectors for image content. Deduplication was also crucial, with Bloom filters removing duplicate paragraphs and documents and hashing techniques eliminating repetitive images.
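As a rough illustration of these filtering and deduplication steps, the following sketch combines fastText language identification, a Bloom filter for paragraph deduplication, and SHA-256 hashing for image deduplication. It uses the pybloom_live package as one possible Bloom filter implementation; the thresholds, model file, and function names are assumptions for illustration, not the authors' actual pipeline code.

```python
# Hypothetical sketch of language filtering and deduplication, loosely modeled
# on the steps described above. Thresholds and file names are assumptions.
import hashlib

import fasttext                        # pip install fasttext
from pybloom_live import BloomFilter   # pip install pybloom-live

# Assumed local copy of the fastText language-identification model.
lang_model = fasttext.load_model("lid.176.bin")

# Probabilistic set of paragraphs seen so far; capacity chosen for illustration only.
seen_paragraphs = BloomFilter(capacity=10_000_000, error_rate=0.001)
seen_image_hashes = set()              # exact-match image dedup via SHA-256 digests


def filter_document(paragraphs, images):
    """Return the English, non-duplicate paragraphs and the non-duplicate images."""
    kept_text, kept_images = [], []

    for para in paragraphs:
        # fastText's predict() rejects newlines, so flatten the paragraph first.
        labels, probs = lang_model.predict(para.replace("\n", " "))
        if labels[0] != "__label__en" or probs[0] < 0.65:   # illustrative threshold
            continue                                        # drop non-English / low-confidence text
        if para in seen_paragraphs:                         # probable duplicate (Bloom filter)
            continue
        seen_paragraphs.add(para)
        kept_text.append(para)

    for img_bytes in images:
        digest = hashlib.sha256(img_bytes).hexdigest()      # hash-based image dedup
        if digest in seen_image_hashes:
            continue
        seen_image_hashes.add(digest)
        # An NSFW image classifier would also run here; omitted from this sketch.
        kept_images.append(img_bytes)

    return kept_text, kept_images
```

A Bloom filter is used instead of an exact set because it keeps memory usage roughly constant at trillion-token scale, at the cost of a small false-positive rate, meaning an occasional paragraph may be dropped even though it was not actually a duplicate.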

In conclusion, MINT-1T addresses the scarcity and limited diversity of open multimodal datasets. By introducing a larger and more varied dataset, the researchers have enabled the development of more robust and high-performing open-source multimodal models. This work highlights the importance of data diversity and scale in AI research and paves the way for future improvements and applications in multimodal AI. The dataset’s extensive scale, with one trillion text tokens and 3.4 billion images, provides a solid foundation for advancing AI capabilities.


Check out the Paper, Details, and GitHub. All credit for this research goes to the researchers of this project.

