MarkTechPost@AI 2024年07月20日
From Diagrams to Solutions: MAVIS’s Three-Stage Framework for Mathematical AI
index_new5.html
../../../zaker_core/zaker_tpl_static/wap/tpl_guoji1.html

 

MAVIS 是一种针对视觉数学问题解决能力的强大框架,旨在解决视觉编码器对数学图表的不足、视觉编码器和 LLM 之间的图表语言错位以及视觉元素的数学推理不准确等问题。该框架引入了两个大型数据集:MAVIS-Caption 和 MAVIS-Instruct,涵盖了各种数学领域,并采用了渐进式三阶段训练流程来增强图表视觉编码和推理能力。最终,MAVIS-7B 成为一个专门针对视觉数学任务优化的 MLLM,在评估基准测试中表现出优于现有开源 MLLM 的性能,突出了这种针对性方法在提高视觉数学问题解决能力方面的有效性。

🤔 MAVIS 框架旨在解决视觉数学问题解决能力的不足,该框架采用了三阶段训练流程,涵盖了图表视觉编码、图表语言错位以及视觉元素的数学推理等关键问题。

🖼️ MAVIS 引入了两个大型数据集:MAVIS-Caption 和 MAVIS-Instruct,前者包含 588,000 个图表-标题对,涵盖了平面几何、解析几何和函数等领域,而后者包含 834,000 个视觉数学问题,旨在增强 MLLM 的视觉数学推理能力。

🚀 MAVIS-7B 是一种专门针对视觉数学任务优化的 MLLM,在评估基准测试中表现出优于现有开源 MLLM 的性能,证明了 MAVIS 在提高视觉数学问题解决能力方面的有效性。

📦 MAVIS 数据引擎能够高效地生成高质量的数学图表,涵盖了平面几何、解析几何和函数等三种主要类型,通过 Matplotlib 渲染,并提供顶点标注和关键点绘图等功能。

💡 MAVIS-Caption 数据集包含详细的标题描述,平均长度为 61.48 个单词,词汇量为 149 个。标题生成策略根据图表类型而异,利用 GPT-4 创建的模板和每个领域的特定规则。

🧠 MAVIS-Instruct 数据集包含来自四个来源的视觉数学问题:手动收集的问题由 GPT-4 增强、现有数据集由 GPT-4 扩展、数据引擎标题由 GPT-4 标注以及数据引擎直接生成的问题。

Large Language Models (LLMs) and their multi-modal counterparts (MLLMs) have made significant strides in advancing artificial general intelligence (AGI) across various domains. However, these models face a significant challenge in the realm of visual mathematical problem-solving. While MLLMs have demonstrated impressive capabilities in diverse tasks, they struggle to fully utilize their potential when confronted with mathematical problems presented in visual contexts. This limitation is particularly evident in scenarios where models must interpret geometric figures, understand spatial relationships, and integrate complex mathematical concepts with visual information.

The difficulty lies in the unique demands of visual mathematical problem-solving, which requires a seamless integration of analytical reasoning from textual questions with the contextual information provided by visual diagrams. Unlike text-only mathematical problems, where LLMs have shown considerable progress due to abundant training data and their inherent language proficiency, visual mathematics introduces an additional layer of complexity. Models must not only comprehend the mathematical concepts but also accurately interpret visual elements such as geometric shapes, angles, measurements, and spatial relationships represented in diagrams.

Visual instruction tuning for MLLMs has seen significant advancements through approaches like LLaMA-Adapter, LLaVA, Flamingo, SPHINX, and InternVL, each introducing efficient techniques for vision-language integration. Simultaneously, text-based mathematical problem-solving has progressed with projects like MAmmoTH, MetaMATH, and MathCoder. However, in the multi-modal mathematical domain, efforts remain limited. Datasets such as Geometry3K and UniMath have emerged, but their scope and scale are insufficient. G-LLaVA shows promise in graphical geometry but struggles in other mathematical areas, highlighting the need for more robust, comprehensive approaches to visual mathematical problem-solving.

Researchers from CUHK, Peking University, Shanghai AI Laboratory, and Oracle introduce MAVIS (MAthematical VISual instruction tuning) which presents a robust approach addressing  the limitations of MLLMs in visual mathematical problem-solving. This framework tackles three critical issues: unsatisfactory math diagram embeddings by vision encoders, diagram-language misalignment between vision encoders and LLMs, and inaccurate mathematical reasoning with visual elements. MAVIS introduces two extensive datasets, MAVIS-Caption and MAVIS-Instruct, covering various mathematical domains. It employs a progressive three-stage training pipeline to enhance diagram visual encoding and reasoning capabilities. The result is MAVIS-7B, a specialized MLLM optimized for visual mathematical tasks, which demonstrates superior performance on evaluation benchmarks compared to existing open-source MLLMs, highlighting the effectiveness of this targeted approach in advancing visual mathematical problem-solving capabilities.

MAVIS introduces an innovative data engine to generate high-quality mathematical diagrams efficiently, addressing the scarcity of visual mathematics datasets. The engine covers three main diagram types: plane geometry, analytic geometry, and function. For plane geometry, it employs multi-hop data curation principles, iteratively combining basic shapes to create diverse configurations. Analytic geometry diagrams are constructed on a Cartesian coordinate system, incorporating various geometric elements without overlap. Function diagrams focus on seven fundamental types, using parameterized equations to generate diverse graphs. All diagrams are rendered using Matplotlib, with additional features like vertex labeling and key point plotting to enhance mathematical understanding and reasoning capabilities.

MAVIS-Caption, a crucial component of the MAVIS framework, is a large-scale dataset comprising 588,000 diagram-caption pairs. It covers three mathematical domains: plane geometry (299K pairs), analytic geometry (77K pairs), and function (212K pairs). The dataset’s captions are detailed, with an average length of 61.48 words and a vocabulary size of 149. Caption generation strategies vary by diagram type, utilizing GPT-4-created templates and specific rules for each domain. Plane geometry captions are built iteratively, analytic geometry captions use coordinate-based descriptions, and function captions detail various properties of the graphed functions. All captions are refined by ChatGPT for natural language expression, ensuring high-quality, diverse, and mathematically accurate descriptions of visual mathematical content.

MAVIS-Instruct is a comprehensive dataset of 834,000 visual math problems designed to enhance MLLMs’ visual mathematical reasoning capabilities. It covers plane geometry and function problems, each accompanied by a Chain-of-Thought (CoT) rationale averaging 150 words. The dataset’s questions are streamlined to minimize textual redundancy, encouraging MLLMs to extract critical information from visual inputs. MAVIS-Instruct is compiled from four sources: manually collected problems augmented by GPT-4 (84K), existing datasets expanded by GPT-4 (80K), data engine captions annotated by GPT-4 (51K), and problems directly generated by the data engine. This diverse approach ensures broad coverage of mathematical concepts and problem types, while maintaining high-quality, detailed solutions and rationales for each problem.

MAVIS-7B demonstrates superior performance across multiple mathematical benchmarks, showcasing its effectiveness in visual mathematical problem-solving. On the comprehensive MathVerse benchmark, MAVIS-7B achieves the highest overall accuracy among open-source models, surpassing larger models and specialized mathematical MLLMs. It outperforms InternLM-XComposer2 (7B) by 11.0% and ShareGPT4V (13B) by 10.1%. In specific domains, MAVIS-7B excels on GeoQA for plane geometry, achieving 66.7% accuracy, and on FunctionQA, reaching 40.3% accuracy, outperforming both traditional methods and other MLLMs. Qualitative analysis reveals MAVIS-7B’s superior understanding of geometric elements, function curves, and coordinate axes, leading to higher-quality Chain-of-Thought reasoning compared to GPT-4V.

This study introduces MAVIS, an efficient approach to mathematical visual instruction tuning for MLLMs. The framework comprises two key components: high-quality datasets (MAVIS-Caption and MAVIS-Instruct) generated by a sophisticated data engine, and a three-stage training pipeline. This process sequentially enhances the math-specific vision encoder, improves diagram-language alignment, and develops mathematical reasoning capabilities. The resulting specialist model, MAVIS-7B, demonstrates exceptional performance across various mathematical visual benchmarks. MAVIS’s innovative approach sets a new standard in visual mathematical problem-solving, paving the way for future advancements in this critical area of artificial intelligence and education technology.


Check out the Paper and GitHub. All credit for this research goes to the researchers of this project. Also, don’t forget to follow us on Twitter

Join our Telegram Channel and LinkedIn Group.

If you like our work, you will love our newsletter..

Don’t Forget to join our 46k+ ML SubReddit

The post From Diagrams to Solutions: MAVIS’s Three-Stage Framework for Mathematical AI appeared first on MarkTechPost.

Fish AI Reader

Fish AI Reader

AI辅助创作,多种专业模板,深度分析,高质量内容生成。从观点提取到深度思考,FishAI为您提供全方位的创作支持。新版本引入自定义参数,让您的创作更加个性化和精准。

FishAI

FishAI

鱼阅,AI 时代的下一个智能信息助手,助你摆脱信息焦虑

联系邮箱 441953276@qq.com

相关标签

视觉数学 数学AI MAVIS MLLM 图表理解
相关文章