MarkTechPost@AI, July 25, 2024
LEAN-GitHub: A Large-Scale Dataset for Advancing Automated Theorem Proving

Theorem proving in mathematics faces growing challenges as proofs become increasingly complex. Formal systems such as Lean, Isabelle, and Coq provide computer-verifiable proofs, but writing them demands substantial human effort. Large language models (LLMs) show promise in solving high-school-level math problems with proof assistants, yet their performance remains limited by data scarcity: formal languages require significant expertise to write, so corpora are small. Moreover, unlike conventional programming languages, formal proof languages hide intermediate information (the evolving proof state) that is only visible at runtime, making raw source text unsuitable for training. This scarcity persists despite the existence of valuable human-written corpora, and auto-formalization efforts, while helpful, cannot fully substitute for human-crafted data in quality and diversity.

Attempts to address these theorem-proving challenges have evolved significantly. Modern proof assistants such as Coq, Isabelle, and Lean extend formal systems beyond first-order logic, which has increased interest in automated theorem proving (ATP), and the recent integration of large language models has advanced the field further. Early ATP approaches relied on traditional methods such as k-NN or GNNs, some employing reinforcement learning. Recent efforts use deep transformer-based methods that treat theorems as plain text. Many learning-based systems (e.g., GPT-f, PACT, Llemma) train language models on (proof state, next-tactic) pairs and prove theorems via tree search, as sketched below. Alternative approaches have LLMs generate entire proofs independently or from human-provided proofs. Data-extraction tools are crucial for ATP: they capture intermediate states that are invisible in source code but visible during runtime. Such tools exist for various proof assistants, but Lean 4 tools, designed around single projects, face challenges in massive extraction across multiple projects. Some methods also explore incorporating informal proofs into formal proofs, broadening the scope of ATP research.
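To make the (proof state, next-tactic) setup concrete, here is a minimal best-first search sketch. It is not the exact algorithm of any system named above: `propose_tactics` (a trained policy model), `apply_tactic` (a Lean interaction layer such as LeanDojo provides), and the state methods `is_solved` and `fingerprint` are hypothetical stand-ins.

```python
import heapq
import itertools

def best_first_search(initial_state, propose_tactics, apply_tactic,
                      max_expansions=1000):
    """Expand the most promising open proof state first.

    `propose_tactics(state)` yields (tactic, log_prob) pairs from the model;
    `apply_tactic(state, tactic)` returns the next state, or None on failure.
    Both are hypothetical stand-ins for a policy model and a Lean backend.
    """
    tie = itertools.count()  # tie-breaker so heap never compares state objects
    queue = [(0.0, next(tie), initial_state, [])]
    seen = set()
    expansions = 0
    while queue and expansions < max_expansions:
        cost, _, state, trace = heapq.heappop(queue)
        if state.is_solved():                # no goals remain: proof found
            return trace
        if state.fingerprint() in seen:      # prune duplicated states
            continue
        seen.add(state.fingerprint())
        expansions += 1
        for tactic, log_prob in propose_tactics(state):
            nxt = apply_tactic(state, tactic)
            if nxt is not None:              # tactic applied cleanly
                heapq.heappush(queue, (cost - log_prob, next(tie), nxt,
                                       trace + [tactic]))
    return None  # budget exhausted without closing all goals
```

The `fingerprint`-based `seen` set is where the state-duplication problem discussed below comes in: without it, distinct tactic sequences that reach the same goal get expanded repeatedly.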

Researchers from The Chinese University of Hong Kong propose LEAN-GitHub, a large-scale Lean dataset that complements the widely used Mathlib dataset. The dataset is built from open-source Lean repositories on GitHub, significantly expanding the data available for training theorem-proving models. The researchers developed a scalable pipeline that improves extraction efficiency and parallelism, making it possible to exploit valuable data from previously uncompiled and unextracted Lean corpora. They also propose a solution to the state-duplication problem common in tree-search proof methods.
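The article does not spell out the deduplication scheme, so the following sketch only illustrates the general idea: normalize the pretty-printed goal so that superficially different but equivalent states collide under one hash, letting the search merge them. The normalization rules here (whitespace collapsing, renaming hypotheses in order of appearance) are illustrative assumptions, not the paper's actual scheme.

```python
import hashlib
import re

def state_fingerprint(pretty_goal: str) -> str:
    """Hash a pretty-printed Lean goal so duplicated search states collide.

    Simplified illustration only: collapse whitespace and canonicalize
    hypothesis names, since auto-generated names (e.g. `h✝`) can differ
    between otherwise identical proof states.
    """
    text = re.sub(r"\s+", " ", pretty_goal.strip())
    names = {}
    def canon(match):
        # Rename hypotheses in order of appearance: h0, h1, ...
        name = match.group(0)
        names.setdefault(name, f"h{len(names)}")
        return names[name]
    # Match identifiers that precede a colon, i.e. hypothesis binders.
    text = re.sub(r"[A-Za-z_][A-Za-z0-9_']*✝?(?=\s*:)", canon, text)
    return hashlib.sha256(text.encode()).hexdigest()
```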

The LEAN-GitHub dataset construction process involved several key steps and innovations:

- Repository Selection: The researchers identified 237 Lean 4 repositories on GitHub (GitHub does not distinguish Lean 3 from Lean 4), containing an estimated 48,091 theorems. After discarding 90 repositories using deprecated Lean 4 versions, 147 remained; only 61 of these could be compiled without modifications.
- Compilation Challenges: The team developed automated scripts to find the closest official release for projects using non-official Lean 4 versions. They also addressed the issue of isolated files sitting outside any Lean project.
- Source Code Compilation: Instead of using the Lake build tool, they called the Leanc compiler directly, which allowed them to compile non-compliant Lean projects and isolated files that Lake could not handle. They extended Lake's import graph and created a custom compilation script with increased parallelism (see the sketch after this list).
- Extraction Process: Building upon LeanDojo, the team implemented data extraction for isolated files and restructured the implementation to increase parallelism, overcoming bottlenecks in network connections and redundant computation.
- Results: Out of 8,639 Lean source files, 6,352 files and 42,000 theorems were successfully extracted. The final dataset includes 2,133 files and 28,000 theorems with valid tactic information.
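No pipeline code accompanies the article; the sketch below only illustrates the fan-out structure of compiling many source files in parallel with a process pool. The bare `lean <file>` invocation stands in for the pipeline's direct compiler calls; the real script's flags, import paths, and toolchain handling are not described in the article.

```python
import subprocess
from concurrent.futures import ProcessPoolExecutor, as_completed
from pathlib import Path

def compile_file(src: Path) -> tuple[Path, bool]:
    # Elaborate one file; a nonzero exit code means it failed to compile
    # (e.g., a deprecated toolchain or a missing dependency).
    result = subprocess.run(["lean", str(src)], capture_output=True)
    return src, result.returncode == 0

def compile_corpus(root: Path, workers: int = 16) -> list[Path]:
    """Compile every .lean file under `root`, fanning out across processes,
    and return the files that compiled successfully."""
    sources = list(root.rglob("*.lean"))
    compiled = []
    with ProcessPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(compile_file, s) for s in sources]
        for fut in as_completed(futures):
            src, ok = fut.result()
            if ok:
                compiled.append(src)
    return compiled
```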

The resulting LEAN-GitHub dataset is diverse, covering various mathematical fields including logic, first-order logic, matroid theory, and arithmetic. It contains cutting-edge mathematical topics, data structures, and Olympiad-level problems. Compared to existing datasets, LEAN-GitHub offers a unique combination of human-written content, intermediate states, and diverse complexity levels, making it a valuable resource for advancing automated theorem proving and formal mathematics.
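As a toy illustration of those intermediate states (this example is not taken from the dataset), consider how an extraction tool pairs each tactic with the pretty-printed goal it was applied to; these (state, next-tactic) pairs are the actual training signal:

```lean
-- Hypothetical extracted record: each tactic is paired with the goal
-- state it transforms, information invisible in the raw source text.
example (a b : Nat) (h : a = b) : a + 1 = b + 1 := by
  -- state: h : a = b ⊢ a + 1 = b + 1
  rw [h]
  -- rw closes the goal: both sides are now b + 1
```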

InternLM2-StepProver, trained on the diverse LEAN-GitHub dataset, demonstrates exceptional formal reasoning abilities across various benchmarks. It achieves state-of-the-art performance on miniF2F (63.9% on Valid, 54.5% on Test), surpassing previous models. On ProofNet, it attains an 18.1% Pass@1 rate, outperforming the previous leader. For PutnamBench, it solves 5 problems in a single pass, including the previously unsolved Putnam 1988 B2. These results span high-school to advanced undergraduate-level mathematics, showcasing InternLM2-StepProver’s versatility and the effectiveness of the LEAN-GitHub dataset in training advanced theorem-proving models.
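For readers unfamiliar with the metric, Pass@1 is the standard pass@k estimate with k = 1: the probability that at least one of k sampled proof attempts succeeds. A common unbiased estimator from the code-generation literature (not code from this paper) is:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: given n sampled attempts per theorem,
    c of which succeed, estimate P(at least one of k attempts succeeds).
    For Pass@1 with n = 1 this reduces to the fraction of theorems solved."""
    if n - c < k:
        return 1.0  # every size-k subset of attempts contains a success
    return 1.0 - comb(n - c, k) / comb(n, k)
```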

LEAN-GitHub, a large-scale dataset extracted from open Lean 4 repositories, contains 28,597 theorems and 218,866 tactics. This diverse dataset was used to train InternLM2-StepProver, achieving state-of-the-art performance in Lean 4 formal reasoning. Models trained on LEAN-GitHub demonstrate improved performance across various mathematical fields and difficulty levels, highlighting the dataset’s effectiveness in enhancing reasoning capabilities. By open-sourcing LEAN-GitHub, the researchers aim to help the community better utilize under-exploited information in raw corpora and advance mathematical reasoning. This contribution could significantly accelerate progress in automated theorem proving and formal mathematics.


Check out the Paper and Dataset. All credit for this research goes to the researchers of this project.

