MarkTechPost@AI September 22, 2024
ByteDance Researchers Release InfiMM-WebMath-40B: An Open Multimodal Dataset Designed for Complex Mathematical Reasoning

Researchers from ByteDance and the Chinese Academy of Sciences have released InfiMM-WebMath-40B, a large-scale multimodal dataset built specifically for mathematical reasoning. The dataset comprises 24 million web pages extracted and filtered from the CommonCrawl repository, 85 million associated image URLs, and approximately 40 billion text tokens. It targets a gap in open-source model training: the lack of comprehensive multimodal datasets that integrate text with visual data such as diagrams, equations, and geometric figures. By combining textual and visual mathematical data, InfiMM-WebMath-40B provides an unprecedented resource for training Multimodal Large Language Models (MLLMs), enabling them to handle more complex mathematical concepts than ever before.

🤔 InfiMM-WebMath-40B is a large-scale multimodal dataset for mathematical reasoning, containing 24 million web pages, 85 million associated image URLs, and roughly 40 billion text tokens. It addresses the shortage, in open-source model training, of comprehensive multimodal datasets that integrate text with visual data such as diagrams, equations, and geometric figures.

📊 By integrating textual and visual mathematical data, the dataset offers an unprecedented resource for training Multimodal Large Language Models (MLLMs), enabling them to handle more complex mathematical concepts than before. It was built with a rigorous data processing pipeline that began with 122 billion web pages and filtered them down to 24 million web documents focused on mathematics and science.

📈 On benchmarks such as MathVerse and We-Math, models trained on this dataset outperform others at processing both textual and visual information. For example, despite using only 40 billion tokens, the researchers' model InfiMM-Math performs on par with proprietary models trained on 120 billion tokens.

🚀 InfiMM-WebMath-40B gives open-source models far more data for training on complex reasoning tasks that involve both text and visual inputs. Its meticulous construction, pairing 40 billion text tokens with 85 million image URLs, provides a solid foundation for the next generation of Multimodal Large Language Models.

💡 The performance of models trained on InfiMM-WebMath-40B underscores the importance of combining visual elements with textual data to improve mathematical reasoning. The dataset narrows the gap between proprietary and open-source models and paves the way for future research into strengthening AI's ability to solve complex mathematical problems.

Artificial intelligence has significantly enhanced complex reasoning tasks, particularly in specialized domains such as mathematics. Large Language Models (LLMs) have gained attention for their ability to process large datasets and solve intricate problems. The mathematical reasoning capabilities of these models have vastly improved over the years. This progress has been driven by advancements in training techniques, such as Chain-of-Thought (CoT) prompting, and diverse datasets, allowing these models to solve various mathematical problems, from simple arithmetic to complex high-school competition-level tasks. The growing sophistication of LLMs has made them indispensable tools in fields where advanced reasoning is required. Still, the quality and scale of available pre-training datasets have limited their full potential, especially for open-source projects.

A key issue that hinders the development of mathematical reasoning in LLMs is the lack of comprehensive multimodal datasets that integrate text and visual data, such as diagrams, equations, and geometric figures. Most mathematical knowledge is expressed through textual explanations and visual elements. While proprietary models like GPT-4 and Claude 3.5 Sonnet have leveraged extensive private datasets for pre-training, the open-source community has struggled to keep up due to the scarcity of high-quality, publicly available datasets. Without these resources, it is difficult for open-source models to advance in handling the complex reasoning tasks that proprietary models tackle. This gap in multimodal datasets has made it challenging for researchers to train models that can handle text-based and visual reasoning tasks.

Several approaches have been used to train LLMs for mathematical reasoning, but most focus on text-only datasets. For instance, proprietary datasets like WebMath and MathMix have provided billions of text tokens for training models like GPT-4, but they do not address the visual elements of mathematics. Open-source datasets like OpenWebMath and DeepSeekMath have also been introduced, but they are primarily focused on mathematical text rather than integrating visual and textual data. While these datasets have advanced LLMs in specific areas of math, such as arithmetic and algebra, they fall short on complex, multimodal reasoning tasks that require combining visual elements with text. This limitation has led to models that perform well on text-based tasks but struggle with multimodal problems that pair written explanations with diagrams or equations.

Researchers from ByteDance and the Chinese Academy of Sciences introduced InfiMM-WebMath-40B, a comprehensive dataset that offers a large-scale multimodal resource specifically designed for mathematical reasoning. This dataset includes 24 million web pages, 85 million associated image URLs, and approximately 40 billion text tokens extracted and filtered from the CommonCrawl repository. The research team meticulously filtered the data to ensure the inclusion of high-quality, relevant content, making it the first of its kind in the open-source community. By combining textual and visual mathematical data, InfiMM-WebMath-40B offers an unprecedented resource for training Multimodal Large Language Models (MLLMs), enabling them to process and reason with more complex mathematical concepts than ever.
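Conceptually, each record in such an interleaved corpus pairs text chunks with the image URLs that appeared between them, preserving reading order. A minimal sketch of what one such document might look like (the field names and structure here are illustrative assumptions, not the dataset's actual schema):

```python
from dataclasses import dataclass, field

@dataclass
class InterleavedDoc:
    """One web document from an interleaved text-image math corpus.

    `segments` preserves reading order: each entry is either a text
    chunk or an image URL, so a multimodal model can be trained on
    text and visuals in their original positions.
    (Illustrative structure, not the real InfiMM-WebMath-40B schema.)
    """
    url: str
    segments: list = field(default_factory=list)  # [("text", str) or ("image", url)]

    def text_only(self) -> str:
        """Concatenate just the textual segments."""
        return " ".join(s for kind, s in self.segments if kind == "text")

    def image_urls(self) -> list:
        """Collect the image URLs in document order."""
        return [s for kind, s in self.segments if kind == "image"]

doc = InterleavedDoc(
    url="https://example.com/geometry-proof",
    segments=[
        ("text", "Consider the right triangle below."),
        ("image", "https://example.com/fig1.png"),
        ("text", "By the Pythagorean theorem, c^2 = a^2 + b^2."),
    ],
)
print(doc.image_urls())  # the figure referenced between the text chunks
```

Keeping images in position, rather than appending them at the end, is what lets an MLLM learn how a diagram relates to the sentences around it.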

The dataset was constructed using a rigorous data processing pipeline. Researchers began with 122 billion web pages, filtered to 24 million web documents, ensuring the content focused on mathematics and science. FastText, a language identification tool, filtered out non-English and non-Chinese content. The dataset’s multimodal nature required special attention to image extraction and the alignment of images with their corresponding text. In total, 85 million image URLs were extracted, filtered, and paired with relevant mathematical content, creating a dataset that integrates visual and textual elements to enhance the mathematical reasoning capabilities of LLMs.
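The shape of such a pipeline, going from raw HTML to math-relevant text paired with image URLs, can be sketched as follows. The keyword heuristic below is a stand-in assumption for the paper's actual FastText-based language and domain classifiers, and the helper names are invented for illustration:

```python
import re

# Crude math-relevance markers standing in for the paper's trained
# classifiers (an assumption for illustration, not their method).
MATH_HINTS = re.compile(
    r"\\frac|\\begin\{equation\}|\btheorem\b|\bintegral\b|\bpolynomial\b|=",
    re.IGNORECASE,
)
IMG_SRC = re.compile(r'<img[^>]+src="([^"]+)"', re.IGNORECASE)

def looks_mathematical(text: str, min_hits: int = 2) -> bool:
    """Keep only pages with several math markers (hypothetical filter)."""
    return len(MATH_HINTS.findall(text)) >= min_hits

def extract_image_urls(html: str) -> list:
    """Pull candidate <img src> URLs to pair with the page's text."""
    return IMG_SRC.findall(html)

page = '''<p>Theorem: for a right triangle, c^2 = a^2 + b^2.</p>
<img src="https://example.com/triangle.png">'''

# Filter first, then align images with the surviving text, mirroring
# the filter-then-pair order described above.
if looks_mathematical(page):
    print(extract_image_urls(page))
```

A production pipeline would replace the keyword check with trained language and domain classifiers and would additionally filter the image URLs themselves (deduplication, dead links, decorative icons), but the filter-then-pair structure is the same.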

The performance of models trained on InfiMM-WebMath-40B has significantly improved compared to previous open-source datasets. In evaluations conducted on benchmarks such as MathVerse and We-Math, models trained using this dataset outperformed others in their ability to process both text and visual information. For instance, despite utilizing only 40 billion tokens, the researchers’ model, InfiMM-Math, performed comparably to proprietary models that used 120 billion tokens. On the MathVerse benchmark, InfiMM-Math demonstrated superior performance in text-dominant, text-lite, and vision-intensive categories, outperforming many open-source models with much larger datasets. Similarly, on the We-Math benchmark, the model achieved remarkable results, demonstrating its capability to handle multimodal tasks and setting a new standard for open-source LLMs.

In conclusion, InfiMM-WebMath-40B offers a large-scale multimodal dataset that gives open-source models substantially more data for training on complex reasoning tasks involving text and visual inputs. The dataset's meticulous construction and combination of 40 billion text tokens with 85 million image URLs provide a robust foundation for the next generation of Multimodal Large Language Models. The performance of models trained on InfiMM-WebMath-40B highlights the importance of integrating visual elements with textual data to improve mathematical reasoning capabilities. This dataset bridges the gap between proprietary and open-source models and paves the way for future research to enhance AI's ability to solve complex mathematical problems.


