Communications of the ACM - Artificial Intelligence, February 11
Feedback Loops Guide AI to Proof Checking

AI’s application to mathematics is approaching a breakthrough. Early AI research treated mathematics as a key target, yet an automated proof assistant failed to appear for decades. Advances in both mathematics and computer science now offer new hope: experts have enlisted volunteer communities to crowdsource the verification of complex proofs, creating the conditions for AI to step in. Large language models (LLMs) struggle to generate proof sketches, but by combining feedback with a division of responsibilities, tools such as Draft, Sketch, and Prove (DSP) have raised accuracy significantly. Looking ahead, AI is expected to play a larger role in formal verification and in refining definitions, to help produce more readable mathematical texts, and ultimately to make automated mathematical proof a reality.

💡 AI’s use in mathematical proof is making fresh progress: early AI research saw mathematics as a field ripe for rapid breakthroughs, yet for decades no automated proof assistant emerged.

🧑‍💻 Crowdsourced verification opens the door for AI: experts such as Kevin Buzzard work with volunteer communities to break complex proofs into modular tasks verified in languages such as Lean, giving AI data and patterns to learn from.

🤖 Large language models (LLMs) struggle to generate proof sketches: although they can pull together the elements of a proof, they stumble over syntax and reasoning, and their output needs extensive correction.

🔄 Hybrid schemes boost AI’s proving ability: tools such as Draft, Sketch, and Prove (DSP) combine an LLM, an automated theorem prover, and a formal engine in a feedback loop, markedly improving the accuracy of the resulting proofs.

📚 Future directions for AI: beyond automated formal verification, AI could help refine mathematical definitions, produce more readable mathematical texts, and even underpin interactive mathematics textbooks.

Some of the earliest work on artificial intelligence (AI) saw mathematics as a major target and key to making breakthroughs quickly. In 1961, leading computer scientist and AI pioneer John McCarthy argued at the Fifth Symposium in Pure Mathematics that the job of checking mathematical proofs would likely be “one of the most interesting and useful applications of automatic computers.”

McCarthy saw the possibility for mathematicians to try out different ideas for proofs quickly that the computers then tested for correctness. More than 60 years later, such a proof assistant has yet to appear. But recent developments in both mathematics and computer science may see a breakthrough sooner rather than later.

Much of the work of verifying proofs formally using a computer still relies on a lot of manual effort by specialists such as Kevin Buzzard, professor of pure mathematics at the U.K.’s Imperial College London. Last year, he kicked off a project, funded for five years by the U.K.’s Engineering and Physical Sciences Research Council, to formalize the proof of Fermat’s Last Theorem developed by Andrew Wiles 30 years ago. Buzzard estimates it will take some 100 person-years of work to complete the process. Much of the help is coming from a community of volunteers who have, in recent years, shown how well crowdsourcing can work in this area. That, in turn, may provide an easier route for AI to finally make an entrance.

A key characteristic of projects like Buzzard’s is that the work readily separates into modules with clearly defined interfaces. This attribute helped one of the first major crowdsourced proof verification projects to be completed in just a couple of years. The Liquid Tensor Experiment (LTE) took shape in 2020, when Peter Scholze, professor of mathematics at Germany’s University of Bonn, asked the community for help in checking the 600-page proof he and Dustin Clausen had painstakingly threaded together by hand. Scholze wrote in his appeal how he had lost confidence in his ability to comprehend all the subtleties of the proof because of its sheer size and intricacy.

Languages such as Lean, used in both the LTE and Fermat projects, use keywords like “sorry” to mark unfinished components. This makes it possible to sketch out a skeleton of the overall proof that team members fill in gradually, until they are ready to have the proof engine used by these languages check the result and mark that section as complete. When working on LTE, Johan Commelin, assistant professor at the Netherlands’ Utrecht University, said he would wake up in the morning and find new parts of the proof had appeared overnight.
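
A minimal illustration of the pattern in Lean 4 (a hypothetical lemma assuming the Mathlib library, not an excerpt from the LTE or Fermat projects): the statement and the overall structure compile straight away, while each “sorry” marks a sub-proof a contributor can supply later.

```lean
import Mathlib

-- Skeleton proof: the statement and structure are fixed up front,
-- while each `sorry` is a placeholder to be replaced by a contributor.
theorem demo_sq_add_sq_nonneg (a b : ℤ) : 0 ≤ a ^ 2 + b ^ 2 := by
  have ha : (0 : ℤ) ≤ a ^ 2 := by
    sorry  -- to be filled in later, e.g. with `exact sq_nonneg a`
  have hb : (0 : ℤ) ≤ b ^ 2 := by
    sorry  -- the analogous argument for `b`
  exact add_nonneg ha hb
```

Lean accepts such a file, flagging a warning for each remaining “sorry,” so the skeleton can be shared immediately and the gaps closed independently, which is how new pieces of a proof can “appear overnight.”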

Researchers see AI benefiting from the same approach. Instead of expecting the software to work on complete proofs, tools could work on much smaller and more manageable pieces. The current generation of AI also benefits from the data spun off by these crowdsourcing efforts, which can be used to train the models. Lean’s library now holds the equivalent of an undergraduate course in mathematics and is rapidly catching up to the size of the Mathematical Components library developed over a longer period for the older language Coq.

In principle, a large language model (LLM) is a good choice for pulling together the elements of a proof. However, the technology has problems, as Michail Karatarakis reported to colleagues at the Conference on Artificial Intelligence and Theorem Proving (AITP) in Aussois, France in 2023. The Radboud University Ph.D. student used the Mistral LLM to generate sketches of proofs that, with some editing, could be checked by the Lean proof engine.

For the test, Karatarakis used two lemmas from a textbook on number theory, one picked because the components needed were already in the library. The other included a definition not yet in the library, “offering a way to see how Mistral handles unfamiliar definitions,” he explained.

Despite the LLM having a large body of existing material to draw upon and only being asked to produce “sketches,” or small components of a larger proof, its output needed many corrections, particularly to the syntax. Among other issues, the model’s training set appeared to include many examples written for older versions of the Lean language, whereas Karatarakis was using the latest version, which differs in important ways.
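
The kind of mismatch involved is easy to show with a toy lemma (a hypothetical illustration, not one of the lemmas from Karatarakis’s study): the older Lean 3 idioms that dominate much web-scraped training data no longer parse under Lean 4.

```lean
import Mathlib

-- Lean 3 style, common in older training data: lowercase namespaces and
-- `begin ... end` tactic blocks. This no longer compiles under Lean 4.
--
--   lemma add_comm_ex (a b : ℕ) : a + b = b + a :=
--   begin
--     rw nat.add_comm
--   end

-- The same lemma in current Lean 4 / Mathlib style: a `by` block,
-- bracketed rewrite arguments, and the capitalized `Nat` namespace.
theorem add_comm_ex (a b : ℕ) : a + b = b + a := by
  rw [Nat.add_comm]
```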

That LLMs can struggle to build even proof sketches is not unusual. A 2024 review of activity in automated theorem proving by an international group led by Xujie Si, assistant professor of computer science at Canada’s University of Toronto, found advanced LLMs at best could completely formalize just a quarter of high-school and undergraduate-level problems.

One key problem that LLMs face with full proofs is their limited capacity for reasoning.

“Since the task in our case was to provide proof sketches rather than full proofs, syntax issues had a larger impact than reasoning,” Karatarakis said. “For tasks involving complete proofs, reasoning remains the primary challenge, even with improved syntax handling.”

Injecting more feedback into the training process could address some of the issues faced by LLMs. This intuition drove the creation of Draft, Sketch, and Prove (DSP) by researchers working at the U.K. universities of Cambridge and Edinburgh. The tool uses an LLM to create an initial sketch of an idea, which then goes to an automated theorem prover able to work at a more informal level; several such provers have been developed over the past couple of decades to assist mathematicians working in languages such as Coq and Isabelle. The last part of DSP is a separate formal engine that checks the work. The three engines form a loop in which the LLM retrains on the proofs that the symbolic engine accepts.
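
At a high level, the loop can be pictured as follows. This is a minimal Python sketch under the assumption that each stage is supplied as a callable; the names and structure are illustrative placeholders, not the DSP authors’ implementation.

```python
from typing import Callable, Optional

def dsp_loop(
    problem: str,
    draft: Callable[[str], str],                 # LLM: write an informal proof draft
    sketch: Callable[[str], str],                # LLM: map the draft to a formal skeleton with gaps
    close_gaps: Callable[[str], Optional[str]],  # automated prover: try to fill the gaps
    check: Callable[[str], bool],                # formal engine (e.g., Lean): verify the result
    n_samples: int = 100,
) -> list[str]:
    """Collect candidate proofs that survive every stage of the pipeline."""
    verified = []
    for _ in range(n_samples):
        candidate = close_gaps(sketch(draft(problem)))
        if candidate is not None and check(candidate):
            verified.append(candidate)
    # Accepted proofs become fresh training data, closing the feedback loop.
    return verified
```

The design bets that generating drafts is cheap while formal checking is reliable, so the proof engine acts as the filter over many sampled attempts.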

Said Wenda Li, lecturer in hybrid AI at the U.K.’s University of Edinburgh, “Sometimes the gap is too large for the automated theorem prover to bridge. But generating new drafts is relatively cheap. We can sample millions of them to get just the right gap for the prover to bridge.”

Accuracy improved to over 50% with DSP’s combination of feedback and division of responsibilities. In a further step presented last summer, the team added a fourth module in a version called Sketch, Prove, Add Details & Repeat (SPADeR). The extra module called on GPT-4o to fill in blanks that would otherwise block a full proof. In one test set, this raised the number of successfully verified problems from 85, using the earlier DSP tool, to 93.
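
Continuing the hypothetical sketch above, the extra stage amounts to one more callable that is tried when the prover leaves blanks; again, the names are illustrative, not the SPADeR code.

```python
from typing import Callable, Optional

def repair_with_details(
    candidate_with_gaps: str,
    add_details: Callable[[str], str],  # e.g., a general-purpose LLM fills the remaining blanks
    check: Callable[[str], bool],       # formal engine re-verifies the patched proof
) -> Optional[str]:
    """One repair pass: ask for more detail, then re-check formally."""
    patched = add_details(candidate_with_gaps)
    return patched if check(patched) else None
```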

For practical mathematical work, AI’s output may still need to be refined even after a proof has been completed successfully. One common thread in the manual formalization efforts is the importance of finding good definitions and proof steps to support the process. At one stage of the LTE project, it looked as though just a few lines in one key part of the proof would require formalizing an immense body of knowledge as a prerequisite. Commelin led work to avoid the problem by building a new underpinning that was far easier to implement.

In Li’s work with colleagues on a formalized version of another textbook on number theory, the group found that a formal definition of a key type of function of complex numbers would be better expressed by the use of an arithmetic expansion, rather than the more-intuitive approach used in the informal source. That approach would help underpin a larger set of dependent proofs, he explained. In some projects, these considerations have led to changes to underlying definitions, sometimes repeatedly.

“Over time, it becomes increasingly difficult to refactor definitions or lemmas due to growing dependencies,” Li said. “AI could potentially assist with this, much like other code-assistance tools, such as Copilot.”

Other goals may provide targets that AI can serve more easily than autoformalization itself. Patrick Massot, professor of mathematics at France’s University of Paris-Saclay, argues that one lingering issue with current formal languages such as Coq and Lean is that they are too opaque to non-computer scientists. An “informalizer” would help scholars read the verified proofs. As a byproduct, it could provide the foundation for interactive mathematics textbooks, in which students are able to drill down into the background of any proof or lemma they see.

A couple of teams have used LLMs to attempt this work. Though LLMs make fewer mistakes here than in formalization, the task demands much higher accuracy than they can yet deliver. In his work on informalization with Massot, Kyle Miller, assistant professor at the University of California at Santa Cruz, has been exploring more traditional symbolic AI techniques. This involves far more manual engineering than training an LLM: simply mapping the grammar of a language like Lean into English is not enough, and the output needs further reshaping to read naturally. For example, the code created to check a proof formally can contain a lot of repetition that the tool would ideally strip out of the human-readable version.

If successful, the work on informalizers in turn may help close the loop for theorem-proving engines based on LLMs and similar technologies. The output from these tools would provide a rich resource of synthetic data that can be used to retrain the AI engines as they create new proofs.

The multiple feedback loops that are now appearing may mean the logjam that has held up one of computer science’s major applications is finally breaking. But it may take a lot more research into hybrid schemes to strengthen AI’s reasoning skills and, perhaps, finally make autoformalization work for mathematicians.

Further Reading

  • Avigad, J.
    Mathematics and the Formal Turn, Bulletin of the American Mathematical Society 61 (2024), 225-240
  • Li, Zhaoyu, Sun, J., Murphy, L., Su, Q., Li, Zenan, Zhang, X., Yang, K., and Si, Xujie.
    A Survey on Deep Learning for Theorem Proving,
    arXiv:2404.09939 (2024)
  • Karatarakis, M.
    Leveraging Large Language Models for Autoformalizing Theorems: A Case Study, Ninth Conference on Artificial Intelligence and Theorem Proving (AITP 2024)
  • Tarrach, G., Jiang, A.Q., Raggi, D., Li, W., and Jamnik, M.
    More Details, Please: Improving Autoformalization with More Detailed Proofs, AI for MATH Workshop at the International Conference on Machine Learning (ICML 2024)
  • McCarthy, J.
    Computer programs for checking mathematical proofs, Proceedings of the Fifth Symposium in Pure Mathematics of the American Mathematical Society, pages 219–227 (1961)
