MarkTechPost@AI · April 1
Meet ReSearch: A Novel AI Framework that Trains LLMs to Reason with Search via Reinforcement Learning without Using Any Supervised Data on Reasoning Steps

ReSearch is a novel AI framework designed to train large language models (LLMs) via reinforcement learning to seamlessly combine their reasoning process with external search operations, particularly for multi-hop questions that require intricate reasoning chains and multiple retrievals. The core of the framework is to integrate search operations directly into the reasoning chain, using Group Relative Policy Optimization (GRPO) to guide LLMs in autonomously identifying the best moments and strategies for performing searches. ReSearch requires no supervised reasoning steps, which allows models to progressively refine their reasoning and naturally develop reflection and self-correction abilities. Experimental results show that ReSearch performs strongly on multi-hop question-answering benchmarks, significantly surpassing existing methods and highlighting its robust generalization ability.

🔍 ReSearch trains LLMs via reinforcement learning to combine reasoning with search, without relying on supervised reasoning data, addressing the scalability and adaptability problems of multi-hop reasoning scenarios.

💡 The core method embeds search operations directly into the reasoning chain and uses Group Relative Policy Optimization (GRPO) to guide LLMs to search autonomously, thereby optimizing the reasoning process.

📝 ReSearch adopts a structured output format, embedding tags such as <think>, <search>, <result>, and <answer> in the reasoning chain to enable clear interaction between the model and the external retrieval environment.

📈 On multi-hop question-answering benchmarks including HotpotQA, 2WikiMultiHopQA, MuSiQue, and Bamboogle, ReSearch-Qwen-32B-Instruct improves performance by 8.9% to 22.4%, demonstrating strong generalization.

🔄 During training, ReSearch models gradually increase their reliance on iterative search operations, and a case study demonstrates their ability to identify suboptimal searches, reflect on their reasoning steps, and correct themselves autonomously.

Large language models (LLMs) have demonstrated significant progress across various tasks, particularly in reasoning capabilities. However, effectively integrating reasoning processes with external search operations remains challenging, especially for multi-hop questions requiring intricate reasoning chains and multiple retrieval steps. Current methods primarily depend on manually designed prompts or heuristics, posing limitations in scalability and flexibility. Additionally, generating supervised data for multi-step reasoning scenarios is often prohibitively expensive and practically infeasible.

Researchers from Baichuan Inc., Tongji University, The University of Edinburgh, and Zhejiang University introduce ReSearch, a novel AI framework designed to train LLMs to integrate reasoning with search via reinforcement learning, notably without relying on supervised reasoning steps. The core methodology of ReSearch incorporates search operations directly into the reasoning chain. Utilizing Group Relative Policy Optimization (GRPO), a reinforcement learning technique, ReSearch guides LLMs to autonomously identify optimal moments and strategies for performing search operations, which subsequently influence ongoing reasoning. This approach enables models to progressively refine their reasoning and naturally facilitates advanced capabilities such as reflection and self-correction.
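
To make the objective concrete, the following is a minimal sketch of the group-relative advantages and clipped policy-gradient loss that GRPO builds on. The function names, tensor shapes, and reward values are illustrative assumptions, not the authors' implementation:

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Turn a group of rollout rewards into relative advantages.

    rewards: shape (group_size,), one scalar reward per response sampled
    for the same question. Each response is scored against its siblings,
    so no learned value function (critic) is needed.
    """
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def grpo_loss(logp_new: torch.Tensor, logp_old: torch.Tensor,
              advantages: torch.Tensor, clip: float = 0.2) -> torch.Tensor:
    """PPO-style clipped surrogate loss over per-token log-probs.

    logp_new / logp_old: shape (group_size, seq_len); each response's
    scalar advantage is broadcast across its tokens.
    """
    ratio = (logp_new - logp_old).exp()
    adv = advantages.unsqueeze(-1)  # (group_size, 1), broadcasts over tokens
    surrogate = torch.min(ratio * adv, ratio.clamp(1 - clip, 1 + clip) * adv)
    return -surrogate.mean()

# Example: four rollouts for one question; the two that answered
# correctly receive positive advantages, the others negative.
adv = grpo_advantages(torch.tensor([1.0, 0.2, 1.0, 0.0]))
```

Because each response is scored against its sibling rollouts for the same question, GRPO needs no separate value model, which keeps training lightweight for long, tool-augmented reasoning traces.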

From a technical perspective, ReSearch employs structured output formats by embedding specific tags—such as <think>, <search>, <result>, and <answer>—within the reasoning chain. These tags facilitate clear communication between the model and the external retrieval environment, systematically organizing generated outputs. During training, ReSearch intentionally excludes retrieval results from loss computations to prevent model bias. Reward signals guiding the reinforcement learning process are based on straightforward criteria: accuracy assessment through F1 scores and adherence to the predefined structured output format. This design encourages the autonomous development of sophisticated reasoning patterns, circumventing the need for manually annotated reasoning datasets.
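
To illustrate, here is a minimal sketch of how a tagged rollout, loss masking, and the reward described above could fit together. The tag names follow the paper, but `generate_until`, `retriever`, and the token-level F1 below are hypothetical placeholders rather than the authors' actual APIs:

```python
import re
from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    """Token-level F1 between predicted and gold answers."""
    pred, ref = prediction.split(), reference.split()
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    p, r = overlap / len(pred), overlap / len(ref)
    return 2 * p * r / (p + r)

# A rollout must follow <think> ... (<search>/<result> pairs) ... <answer>.
FORMAT_RE = re.compile(
    r"<think>.*?</think>\s*"
    r"(?:<search>.*?</search>\s*<result>.*?</result>\s*)*"
    r"<answer>.*?</answer>",
    re.DOTALL,
)

def reward(rollout: str, gold_answer: str) -> float:
    """Reward = answer F1, gated on adherence to the tag format."""
    if not FORMAT_RE.search(rollout):
        return 0.0  # malformed structure earns no reward
    answer = re.findall(r"<answer>(.*?)</answer>", rollout, re.DOTALL)[-1]
    return token_f1(answer.strip(), gold_answer)

def run_rollout(generate_until, retriever, question, max_searches=4):
    """Interleave generation with retrieval, recording which character
    spans the model itself produced (in practice this is a token mask).
    Assumes generate_until returns the stop string as part of the chunk."""
    text, trainable_spans = f"Question: {question}\n<think>", []
    for _ in range(max_searches):
        chunk = generate_until(text, stop=["</search>", "</answer>"])
        trainable_spans.append((len(text), len(text) + len(chunk)))
        text += chunk
        if chunk.rstrip().endswith("</search>"):
            query = re.findall(r"<search>(.*?)</search>", text, re.DOTALL)[-1]
            docs = retriever(query)  # environment step, not trained
            text += f"<result>{docs}</result>"  # excluded from the loss
        else:
            break  # reached </answer>
    return text, trainable_spans
```

Tracking which spans the model actually generated makes it straightforward to mask the retrieved <result> text out of the loss, so the policy gradient is not biased by tokens the model never produced.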

Experimental evaluation confirms the robustness of ReSearch. When assessed on multi-hop question-answering benchmarks, including HotpotQA, 2WikiMultiHopQA, MuSiQue, and Bamboogle, ReSearch consistently outperformed baseline methods. Specifically, ReSearch-Qwen-32B-Instruct achieved performance improvements of 8.9% to 22.4% over established baselines. Notably, these gains were achieved despite the model being trained exclusively on a single dataset, underscoring its strong generalization capabilities. Further analyses demonstrated that models gradually increased their reliance on iterative search operations throughout training, indicative of enhanced reasoning proficiency. A detailed case study illustrated the model's capacity to identify suboptimal search queries, reflect on its reasoning steps, and implement corrective actions autonomously.

In summary, ReSearch presents a significant methodological advancement in training LLMs to seamlessly integrate reasoning with external search mechanisms via reinforcement learning. By eliminating dependency on supervised reasoning data, this framework effectively addresses critical scalability and adaptability issues inherent in multi-hop reasoning scenarios. Its capability for self-reflection and correction enhances its practical applicability in complex, realistic contexts. Future research directions may further extend this reinforcement learning-based framework to broader applications and incorporate additional external knowledge resources.


Check out the Paper and GitHub Page. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 85k+ ML SubReddit.

