MarkTechPost@AI · January 12
ToolHop: A Novel Dataset Designed to Evaluate LLMs in Multi-Hop Tool Use Scenarios

This article introduces ToolHop, a new dataset built specifically to evaluate the multi-hop tool-use capabilities of large language models (LLMs). Multi-hop queries challenge LLM agents because they demand multi-step reasoning over information from multiple sources. With 995 carefully designed user queries and 3,912 associated tools, ToolHop addresses two weaknesses of existing evaluations: insufficient interdependence among tools and the lack of verifiable answers. The dataset is built with a query-driven construction method comprising three stages: tool creation, document refinement, and code generation. Experiments show that tool use improves LLM performance, yet considerable room for improvement remains, and models still hallucinate to some degree. ToolHop thus offers a more reliable way to evaluate the multi-hop tool-use capabilities of LLMs.

🛠️ The ToolHop dataset is designed specifically to evaluate LLMs in multi-hop tool-use scenarios; it contains 995 carefully designed user queries and 3,912 associated tools, and aims to remedy the shortcomings of existing evaluation methods.

📄 The dataset is built with a query-driven construction method in three key stages, tool creation, document refinement, and code generation, which guarantees interdependence among the tools and the reliability of the evaluation.

⚙️ The tool-creation stage drafts preliminary tool documents from a user-provided multi-hop query; the documents are designed to be interdependent and relevant, decomposing the query into atomic parts and handling each individually to ensure modularity and cohesion.

💡 The document-refinement stage comprehensively filters the drafted tool documents, introducing new features such as result filtering and customizable output formats to extend functionality while preserving original behavior, and in parallel increasing the number of parameters and optimizing their types.

💻 The code-generation stage produces locally executable functions from the refined documents, enabling seamless multi-turn interaction between the model and the tools and improving both the tools' practicality and the accuracy of the evaluation.

Multi-hop queries have always been hard for LLM agents: answering them requires multiple reasoning steps and information drawn from different sources, which makes them a demanding probe of a model's comprehension, reasoning, and function-calling capabilities. At a time when new large models appear almost daily with claims of unparalleled capability, multi-hop tool use offers a realistic test: the model is given a complex query that it must decompose into atomic parts and solve iteratively by invoking the appropriate tools. Multi-hop tool evaluation has consequently emerged as pivotal for advancing models toward generalized intelligence.

Existing work in this field falls short of a reliable evaluation method. Approaches proposed so far rely on tool-driven data construction, in which queries are simulated for a given collection of tools; this makes it hard to guarantee that the collected tools are genuinely interdependent, and therefore to assess multi-hop reasoning at all. In addition, the absence of verifiable answers introduces model bias and evaluation error. This article discusses recent research that offers a reliable way to honestly assess the multi-hop capabilities of a large language model.

Researchers from Fudan University and ByteDance present ToolHop, a dataset designed explicitly for multi-hop tool evaluation, with 995 rigorously designed user queries and 3,912 associated tools. ToolHop addresses the problems above through diverse queries, locally executable tools, meaningful interdependencies, detailed feedback, and verifiable answers. The authors propose a novel query-driven data construction approach that expands a single multi-hop query into a comprehensive multi-hop tool-use test case.

The proposed novel scheme comprises three key stages: tool creation, document refinement, and code generation.

Tool Creation: A preliminary set of tool documents is created from the user-provided multi-hop query. The documents are kept interdependent and relevant by resolving the query into atomic parts and handling each individually; each document thus captures the essence of one sub-query while remaining structured enough to generate similar queries, ensuring modularity and cohesion.
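
To make the stage concrete, here is a minimal sketch of what an interdependent pair of tool documents for one multi-hop query might look like. The query, tool names, and schema fields are illustrative assumptions, not ToolHop's actual format:

```python
# Illustrative tool documents for one multi-hop query; the query, names,
# and schema are hypothetical, not ToolHop's actual format.
query = "In which year was the director of the film 'Inception' born?"

tool_docs = [
    {
        "name": "get_film_director",            # resolves the first atomic sub-query
        "description": "Return the director of a given film.",
        "parameters": {"film_title": {"type": "string", "required": True}},
        "returns": "string",                    # output feeds the next tool's input
    },
    {
        "name": "get_person_birth_year",        # depends on get_film_director's output
        "description": "Return the birth year of a given person.",
        "parameters": {"person_name": {"type": "string", "required": True}},
        "returns": "integer",
    },
]
```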

Document Refinement: The drafted tool documents undergo comprehensive filtering so that they can support the evaluation of models in complex multi-hop scenarios. New features such as result filtering and customizable output formats are introduced to expand functionality while preserving the original behavior; in parallel, the number of parameters is increased and their types are optimized.
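
A hedged illustration of what refinement might do to the second document above: an output-format switch (customizable formats) and a disambiguation flag (result filtering) extend the tool, and the parameter set grows with explicit types. Again, the fields are hypothetical, not the dataset's real schema:

```python
# Hypothetical refinement of get_person_birth_year: new optional,
# typed parameters broaden functionality without changing the
# original single-argument behavior.
refined_doc = {
    "name": "get_person_birth_year",
    "description": "Return the birth year, or full birth date, of a person.",
    "parameters": {
        "person_name":   {"type": "string", "required": True},
        "output_format": {"type": "string", "enum": ["year", "full_date"],
                          "default": "year"},        # customizable format
        "disambiguate":  {"type": "boolean", "default": False},  # result filtering
    },
    "returns": "string",
}
```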

Code Generation: In this stage, locally executable functions are generated from the refined tool documents. Through these functions, the tools can be invoked during inference, enabling seamless multi-turn interaction between the model and the tools.
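
As a sketch, a generated, locally executable tool for the refined document above might look like the following; the backing lookup table and error message are invented for illustration and stand in for whatever source the real generated code consults:

```python
# Sketch of a locally executable function generated from the refined
# document; the lookup table is invented so the answer stays verifiable.
_BIRTH_DATES = {"Christopher Nolan": "1970-07-30"}  # hypothetical backing data

def get_person_birth_year(person_name: str, output_format: str = "year",
                          disambiguate: bool = False) -> str:
    # disambiguate is accepted for schema parity; a real tool would use it.
    date = _BIRTH_DATES.get(person_name)
    if date is None:
        # Detailed feedback lets the model recover from a bad invocation.
        return f"Error: no record found for '{person_name}'"
    return date if output_format == "full_date" else date[:4]
```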

The research team implemented the approach on queries drawn from the MoreHopQA dataset and subjected the result to a rigorous five-dimensional analysis to ensure evaluation quality. ToolHop was then used to evaluate fourteen LLMs from five families, spanning open- and closed-source models, under a protocol designed to verify answer correctness and to minimize invocation errors. The authors observed that tool use increased model performance by up to 12% on average, and by up to 23% for GPT-family models; even so, the best-performing model reached only 49.04% answer correctness. Moreover, despite having tools available for multi-hop queries, models hallucinated roughly 10% of the time.
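
For intuition, here is a minimal sketch of how such a multi-turn evaluation loop might be driven. It assumes a hypothetical `model.chat` API that returns either a tool call or a final answer; none of these names come from the ToolHop release:

```python
# Hedged sketch of one multi-hop test case; `tools` maps names to the
# locally executable functions above. The model/chat API is assumed.
def evaluate(model, query, tools, gold_answer, max_turns=8):
    messages = [{"role": "user", "content": query}]
    for _ in range(max_turns):
        reply = model.chat(messages, tools=tools)   # model: call a tool or answer
        if reply.tool_call is None:                 # final answer is verifiable
            return reply.content.strip() == gold_answer
        messages.append({"role": "assistant", "content": reply.content or ""})
        result = tools[reply.tool_call.name](**reply.tool_call.arguments)
        messages.append({"role": "tool", "content": str(result)})  # detailed feedback
    return False  # turn budget exhausted: counted as a failed invocation chain
```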

Conclusion: 

This paper presents a comprehensive dataset for evaluating multi-hop query solving with specially designed queries and tools. The main experimental finding is that while tools significantly enhance LLMs' ability to solve complex multi-hop queries, their multi-hop tool-use capabilities still leave considerable room for improvement.


Check out the Paper. All credit for this research goes to the researchers of this project.
