MarkTechPost@AI · May 14, 04:30
Reinforcement Learning, Not Fine-Tuning: Nemotron-Tool-N1 Trains LLMs to Use Tools with Minimal Supervision and Maximum Generalization

The Nemotron-Research-Tool-N1 series substantially improves the tool-use ability of large language models (LLMs) through a novel reinforcement learning (RL) paradigm. The method moves away from traditional supervised fine-tuning (SFT) and reasoning-trace distillation, relying instead on lightweight supervision that evaluates only the structural validity and functional correctness of tool invocations. With a binary reward, the models develop their own reasoning strategies without explicitly annotated reasoning trajectories. On the BFCL and API-Bank benchmarks, the Nemotron-Tool-N1 models outperform closed-source models such as GPT-4o and fine-tuned models such as xLAM-2-70B, validating the approach's potential for improving LLM tool calling.

🛠️ Existing work relies on synthesizing large volumes of tool-use trajectories to strengthen LLM tool calling, but these synthetic datasets fail to capture explicit reasoning steps, so models learn to mimic surface-level patterns rather than genuinely understand the decision process.

💡 Nemotron-Research-Tool-N1 adopts a distinct reinforcement learning paradigm that departs from traditional SFT, focusing supervision on the structural validity and functional correctness of tool invocations.

🏆 On the BFCL benchmark, the Tool-N1-7B/14B models outperform GPT-4o and xLAM-2-70B; on API-Bank, Tool-N1-7B and Tool-N1-14B achieve 4.12% and 5.03% higher accuracy than GPT-4o, respectively.

🔗 The study unifies and preprocesses data from existing tool-calling datasets such as xLAM and ToolACE, and creates a lightweight prompt template to guide tool-call generation, minimizing rigid formatting constraints and reducing the risk of overfitting.

Equipping LLMs with external tools or functions has become popular and delivers strong performance across diverse domains. Existing research depends on synthesizing large volumes of tool-use trajectories through advanced language models and SFT to enhance LLMs’ tool-calling capability. The critical limitation is that these synthetic datasets fail to capture explicit reasoning steps, resulting in superficial tool-call training. In many cases, reasoning is either omitted entirely during training or deferred to inference through prompting techniques. The result is pseudo-reasoning: models merely learn to mimic surface-level patterns without truly understanding the underlying decision-making process.

Existing research explores multiple approaches to enhancing LLMs’ tool-use capabilities, focusing on two key strategies. The first concentrates on dataset curation and model refinement: creating large-scale supervised datasets and applying advanced training techniques such as SFT and DPO. LLMs are combined with various external tools, including search engines, calculators, vision tools, and Python interpreters, to expand their functional capabilities. The second targets reasoning improvement, shifting from traditional train-time scaling to more complex test-time scaling strategies. Earlier methods relied on step-level supervision and learned reward models to guide reasoning trajectories.

Researchers from NVIDIA, Pennsylvania State University, and the University of Washington have proposed the Nemotron-Research-Tool-N1 series to address the limitations of existing tool-use methods. It diverges from traditional SFT and reasoning-trace distillation techniques by implementing a distinct RL paradigm. Drawing inspiration from DeepSeek-R1’s success, the team developed a lightweight supervision scheme that evaluates only the structural validity and functional correctness of tool invocations. Nemotron-Research-Tool-N1 uses a binary reward, which lets the model develop reasoning strategies autonomously without relying on explicitly annotated reasoning trajectories.
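
As a rough illustration of what this lightweight, rule-based supervision could look like in code, the sketch below computes a binary reward from a model completion. The <think>/<tool_call> tag convention follows the description above, but the JSON parsing, the `tool_call_reward` name, and the exact-match criterion are illustrative assumptions rather than the authors' implementation.

```python
import json
import re

def tool_call_reward(completion: str, gold_calls: list[dict]) -> float:
    """Binary rule-based reward: 1.0 only if the completion is structurally
    valid (reasoning in <think> tags, invocation in <tool_call> tags) and the
    parsed tool calls match the ground-truth calls; otherwise 0.0."""
    # Structural validity: both tagged blocks must be present and non-empty.
    think = re.search(r"<think>(.*?)</think>", completion, re.DOTALL)
    call = re.search(r"<tool_call>(.*?)</tool_call>", completion, re.DOTALL)
    if not think or not call or not think.group(1).strip():
        return 0.0

    # Functional correctness: parse the predicted calls and compare them,
    # order-insensitively, against the annotated ground truth.
    try:
        predicted = json.loads(call.group(1))
    except json.JSONDecodeError:
        return 0.0
    if isinstance(predicted, dict):
        predicted = [predicted]

    def canon(calls):
        return sorted(json.dumps(c, sort_keys=True) for c in calls)

    return 1.0 if canon(predicted) == canon(gold_calls) else 0.0
```

Because the reward only checks format and final correctness, the content of the <think> block is left unconstrained, which is what allows the model to discover its own reasoning strategies during RL.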

The researchers unify and preprocess data from existing tool-calling datasets (xLAM and a subset of ToolACE), which provide single-turn and multi-turn synthetic tool-calling trajectories. A lightweight prompting template guides tool-call generation, with explicit instructions to place intermediate reasoning inside <think>…</think> tags and the tool invocation inside <tool_call>…</tool_call> tags. The template minimizes rigid formatting constraints and reduces the risk of overfitting to specific prompt patterns. The primary backbone models are Qwen2.5-7B/14B-Instruct, and to evaluate the generalization ability of the method, the authors also run evaluations on alternative backbones, including multiple variants from the LLaMA family.
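
The sketch below shows what such a lightweight prompting template might look like. The <think> and <tool_call> tags mirror the description above; the exact instruction wording, the `TOOL_PROMPT_TEMPLATE` string, and the `build_prompt` helper are hypothetical stand-ins, not the template released with the paper.

```python
import json

# Hypothetical lightweight template: the tag convention comes from the
# article, but the instruction wording is an illustrative assumption.
TOOL_PROMPT_TEMPLATE = """You are given a user request and a list of available tools.
First reason about which tool (if any) to call inside <think>...</think> tags.
Then emit the invocation as a JSON list inside <tool_call>...</tool_call> tags.

Available tools:
{tools}

User request:
{query}
"""

def build_prompt(query: str, tools: list[dict]) -> str:
    """Render the template for one single-turn tool-calling example."""
    return TOOL_PROMPT_TEMPLATE.format(
        tools=json.dumps(tools, indent=2),
        query=query,
    )

# Example usage with a toy weather tool.
if __name__ == "__main__":
    weather_tool = {
        "name": "get_weather",
        "parameters": {"city": {"type": "string"}},
    }
    print(build_prompt("What's the weather in Seattle?", [weather_tool]))
```

Keeping the template this sparse is consistent with the stated goal of minimizing formatting constraints so the model does not overfit to one prompt pattern.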

Results on the BFCL and API-Bank benchmarks show the Nemotron-Research-Tool-N1 models’ superior performance. On BFCL, the Tool-N1-7B/14B models outperform closed-source models like GPT-4o and specialized fine-tuned models such as xLAM-2-70B and ToolACE-8B. They also surpass SFT baselines trained on identical data sources, highlighting the effectiveness of the R1-style RL approach. The API-Bank benchmark corroborates these findings, with Tool-N1-7B and Tool-N1-14B achieving 4.12% and 5.03% higher accuracy than GPT-4o, respectively. These results demonstrate the potential of the proposed method for enhancing large language models’ tool-calling capabilities through a novel reinforcement learning paradigm.

In conclusion, researchers introduced Nemotron-Research-Tool-N1, a significant advancement in LLM tool-use capabilities. The research shows a paradigm shift from traditional SFT methodologies by introducing a novel rule-based RL approach. The proposed method enables models to develop sophisticated reasoning strategies without relying on explicitly annotated reasoning trajectories. Benchmark evaluations across BFCL and API-Bank consistently validate the approach’s effectiveness, showing substantial performance improvements over existing baselines. The findings open new avenues for developing more adaptable and intelligent language models that can autonomously generate reasoning strategies.


Check out the Paper and GitHub Page. All credit for this research goes to the researchers of this project.

