Tao He, Hao Li, et al. 2025-04-22 16:30 Beijing
Future AI systems will need to interact with the real world.
This article is reposted from "集智俱乐部".
The release of OpenAI's O1 and its subsequent competitors (such as DeepSeek R1) has significantly advanced research on complex reasoning with large language models (LLMs), drawing attention from both academia and industry. This progress has spurred efforts to reproduce the underlying techniques and to innovate on top of them. To build a systematic framework for research in this area, this paper classifies existing techniques from the perspective of self-evolution. Our survey organizes its findings into three interrelated parts: data evolution, model evolution, and self-evolution.
Abstract
1. Introduction
2. Preliminaries
3. Data Evolution
4. Model Evolution
… comparison, assigning a reward to each token. Rafailov et al. [2024] and Zhong et al. [2024] derive an implicit reward from the DPO framework, of the form

$$r_t = \beta \log \frac{\pi_\theta(y_t \mid x, y_{<t})}{\pi_{\mathrm{ref}}(y_t \mid x, y_{<t})},$$

where $\pi_\theta$ is the trained policy and $\pi_{\mathrm{ref}}$ is the frozen reference policy.
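To make this token-level labeling concrete, here is a minimal sketch (not from the original article; the function and variable names are illustrative) of how such implicit per-token rewards can be computed from the log-probabilities of a trained policy and a frozen reference model, assuming both have scored the same response:

```python
import torch

def dpo_token_rewards(policy_logits, ref_logits, response_ids, beta=0.1):
    """Per-token implicit rewards r_t = beta * (log pi_theta(y_t|x, y_<t) - log pi_ref(y_t|x, y_<t)).

    policy_logits, ref_logits: [seq_len, vocab_size] logits, where position t scores token y_t.
    response_ids: [seq_len] token ids of the sampled response y.
    Returns: [seq_len] tensor of token-level rewards.
    """
    policy_logps = torch.log_softmax(policy_logits, dim=-1)
    ref_logps = torch.log_softmax(ref_logits, dim=-1)
    idx = response_ids.unsqueeze(-1)
    logp_policy = policy_logps.gather(-1, idx).squeeze(-1)  # log pi_theta(y_t | x, y_<t)
    logp_ref = ref_logps.gather(-1, idx).squeeze(-1)        # log pi_ref(y_t | x, y_<t)
    return beta * (logp_policy - logp_ref)

# Toy usage: 5 response tokens over a 100-token vocabulary.
policy_logits = torch.randn(5, 100)
ref_logits = torch.randn(5, 100)
response_ids = torch.randint(0, 100, (5,))
print(dpo_token_rewards(policy_logits, ref_logits, response_ids))
```

Tokens whose probability the policy has raised relative to the reference receive positive rewards, which is what lets these implicit rewards act as fine-grained, per-token credit assignment.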
5. Self-Evolution
6. Interpreting O1-like Research Through the Self-Evolution Framework
7. Future Challenges and Directions
8. Conclusion
For the full reference list, please see the original article.