Distillation Of DeepSeek Prover V1.5

 

DeepSeek-Prover-V1.5 is an improved open-source language model for theorem proving in Lean 4. It is built by further pre-training DeepSeekMath-Base, then supervised fine-tuning on a dataset of incomplete Lean 4 proofs, and finally reinforcement learning from Lean 4 feedback. DeepSeek-Prover-V1.5 searches for proofs with a truncate-and-resume method, which combines existing proof-generation techniques and uses a novel Monte Carlo tree search (MCTS) algorithm to improve performance.

🚀 **Truncate-and-resume proof generation**: DeepSeek-Prover-V1.5 combines proof-step generation and whole-proof generation into a method called truncate-and-resume. It first attempts to generate a complete proof; if Lean detects an error, all code after the error is removed (truncate) and the valid proof steps are used as the prompt for the next proof segment (resume).

🌳 **Monte Carlo tree search (MCTS)**: DeepSeek-Prover-V1.5 uses MCTS to search over possible proof states. To address reward sparsity, a reward-free exploration algorithm is introduced that uses intrinsic motivation to explore the tactic space.

📚 **Reinforcement learning**: DeepSeek-Prover-V1.5 uses reinforcement learning from proof assistant feedback (RLPAF) with the Group Relative Policy Optimization (GRPO) algorithm, which samples a group of candidate proofs for each theorem prompt and optimizes the model based on the rewards of the outputs (proofs) within the group.
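A minimal sketch of the group-relative scoring at the heart of GRPO, as an illustration of the general idea rather than the paper's exact implementation; the `rewards` list is a hypothetical input holding the Lean verification outcome for each candidate proof in one group:

```python
from statistics import mean, stdev

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """Normalize each candidate's reward against its own group.

    GRPO scores every sampled proof relative to the other proofs drawn
    for the same theorem prompt, so no separate critic model is needed.
    """
    mu = mean(rewards)
    sigma = stdev(rewards) if len(rewards) > 1 else 0.0
    return [(r - mu) / (sigma + 1e-8) for r in rewards]

# Example: four candidate proofs for one theorem; only the second one verifies.
print(group_relative_advantages([0.0, 1.0, 0.0, 0.0]))
```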

🧠 **Thought-augmented proof generation**: A natural-language chain of thought (CoT) is produced before the theorem-proving code is generated. This encourages correct mathematical reasoning and step planning before each tactic is executed.
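As a toy illustration only (not the paper's actual prompt format), the natural-language reasoning can be carried in a Lean comment directly above the tactic code it plans:

```lean
-- Thought (natural-language CoT, written before the tactic code):
-- The goal `a + b = b + a` over the natural numbers is exactly
-- commutativity of addition, so rewriting with `Nat.add_comm` closes it.
example (a b : Nat) : a + b = b + a := by
  rw [Nat.add_comm]
```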

🏆 **Experimental results**: DeepSeek-Prover-V1.5 performs strongly on the miniF2F and ProofNet benchmarks, reaching a 63.5% pass rate on the miniF2F test set.

📈 **RMaxTS algorithm**: To address reward sparsity in MCTS, the RMaxTS algorithm is introduced. It incorporates intrinsic rewards that encourage the search agent to explore and gather information about the environment, balancing the optimization of extrinsic rewards against the acquisition of general knowledge of the interactive proof environment.

Published on October 15, 2024 6:53 PM GMT

TL;DR

DeepSeek-Prover-V1.5 is an improved open-source language model for theorem proving in Lean 4.

The paper continues pre-training DeepSeekMath-Base, a math foundation model, and then does supervised fine-tuning on a dataset of incomplete proofs in Lean 4, followed by RL with feedback sourced from Lean 4. DeepSeek-Prover-V1.5 finds proofs using truncate-and-resume, which combines existing proof-generation techniques, together with a novel Monte Carlo tree search (MCTS) algorithm for superior performance.

Overview

Definitions:

Language models trained for theorem proving typically do one of two things: proof-step generation, where the model produces one tactic at a time given the current proof state, or whole-proof generation, where the model produces an entire proof from the theorem statement in a single pass.

DeepSeek-Prover-V1.5 combines proof-step and whole-proof generation, naming the combination truncate-and-resume. Truncate-and-resume starts with whole-proof generation. If Lean detects an error in the proof, all code succeeding the error is removed (truncate), and the valid proof steps are used as a prompt for the next proof segment (resume). This truncate-and-resume mechanism is used in conjunction with MCTS to search over possible proof states. A reward-free exploration algorithm that uses intrinsic motivation to explore the tactic space is introduced to address reward sparsity.
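A minimal sketch of the truncate-and-resume loop, assuming hypothetical helpers `generate_proof` (the language model sampling a completion) and `verify` (a Lean 4 checker reporting the offset of the first error); this is an illustration of the mechanism, not the paper's implementation:

```python
# generate_proof(prompt) -> str: samples a proof completion from the model.
# verify(theorem, proof) -> (ok, error_offset): runs Lean 4 and reports whether
#   the proof checks, plus the character offset of the first error (or None).

def truncate_and_resume(theorem: str, generate_proof, verify, max_attempts: int = 16):
    prefix = ""  # verified proof steps accumulated so far
    for _ in range(max_attempts):
        completion = generate_proof(theorem + prefix)
        candidate = prefix + completion
        ok, error_offset = verify(theorem, candidate)
        if ok:
            return candidate  # complete, Lean-verified proof
        # Truncate: drop the erroneous line and everything after it, keeping
        # only the verified tactics as the prompt for the next segment.
        prefix = candidate[:error_offset].rsplit("\n", 1)[0] + "\n"
    return None  # no proof found within the attempt budget
```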

DeepSeek Prover V1.5 Framework:

Pre-training:

Supervised fine-tuning: 

RLPAF (Reinforcement Learning from Proof Assistant Feedback):

Thought-Augmented Proof Generation:

MCTS:

The paper uses a proof-tree abstraction that defines a proof state and possible tactics to implement MCTS on top of whole-proof generation. The proof search tree is constructed at the tactic level, with each edge representing a single tactic state transition.

The process then proceeds as in standard MCTS: a tree policy selects a promising node, the model generates a whole-proof continuation from that node's tactic state, Lean verifies the continuation and truncates it at the first error, the surviving tactics are added to the tree as new nodes, and the outcome is backpropagated along the path to the root.
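A minimal sketch of selection and backpropagation over such a tactic-level tree, with a plain UCB1 rule standing in for whatever tree policy the paper actually uses; `Node`, `select_path`, and `backpropagate` are illustrative names, and the expansion step would reuse the hypothetical truncate-and-resume helpers sketched earlier:

```python
import math
from dataclasses import dataclass, field

@dataclass
class Node:
    tactic: str                                   # tactic on the incoming edge
    children: dict = field(default_factory=dict)  # next tactic -> child Node
    visits: int = 0
    value: float = 0.0                            # accumulated rollout reward

def select_path(root: Node, c: float = 1.4) -> list:
    """Walk from the root with a UCB1 rule, returning the visited path."""
    path = [root]
    while path[-1].children:
        parent = path[-1]
        best = max(
            parent.children.values(),
            key=lambda ch: ch.value / (ch.visits + 1e-9)
            + c * math.sqrt(math.log(parent.visits + 1) / (ch.visits + 1e-9)),
        )
        path.append(best)
    return path

def backpropagate(path: list, reward: float) -> None:
    """Credit the rollout outcome (e.g. 1 if a full proof was found) to the path."""
    for node in path:
        node.visits += 1
        node.value += reward
```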

Intrinsic rewards: Extrinsic rewards in this MCTS are sparse, occurring only when a proof is completely solved. To address this, the RMaxTS algorithm is introduced[1], incorporating intrinsic rewards that encourage the search agent to explore and gather information about the environment, balancing the optimization of extrinsic rewards with the acquisition of general knowledge about the interactive proof environment.
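One common way to instantiate such an exploration bonus, shown here only as an illustrative assumption rather than the paper's exact formulation, is to pay an intrinsic reward whenever an iteration grows the tree:

```python
def rollout_reward(proof_found: bool, new_nodes_added: int) -> float:
    """Illustrative reward shaping for exploration-driven proof search.

    extrinsic: 1 only when Lean accepts a complete proof (sparse).
    intrinsic: 1 whenever the iteration expanded the tree with at least one
    previously unseen tactic state, so the search is still rewarded for
    discovering new parts of the proof space even without a finished proof.
    """
    extrinsic = 1.0 if proof_found else 0.0
    intrinsic = 1.0 if new_nodes_added > 0 else 0.0
    return extrinsic + intrinsic
```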

Results and Evaluations

The model is evaluated on the following benchmarks: miniF2F and ProofNet.

Metrics

They compare DeepSeek-Prover-V1.5 to prior SOTA models.

 

  1. ^

    RMax applied to Tree Search



