MarkTechPost@AI, July 22, 2024
The GTA Benchmark: A New Standard for General Tool Agent AI Evaluation

The GTA benchmark aims to evaluate the tool-use capabilities of large language models (LLMs) in real-world scenarios more accurately. It uses human-written queries, real deployed tools covering various categories (perception, operation, logic, creativity), and multimodal inputs that resemble real-world environments, offering a more comprehensive and realistic way to assess how well LLMs can plan and execute complex tasks with a variety of tools.

😊 **The challenge of real-world problems:** Existing evaluation methods often fail to measure LLMs' tool-use capabilities effectively because they rely on AI-generated queries, single-step tasks, dummy tools, and text-only interactions, which do not accurately reflect the complexity and demands of real-world problem-solving.

🤔 **The GTA benchmark's innovation:** The researchers propose the GTA benchmark to bridge this gap. It uses human-written queries with implicit tool-use requirements, real deployed tools covering various categories (perception, operation, logic, creativity), and multimodal inputs that resemble real-world environments.

💪 **Evaluation method and results:** The GTA benchmark contains 229 real-world tasks that require a variety of tools. Each task involves multiple steps and requires the LLM to reason and plan which tools to use and in what order. Evaluation is carried out in two modes: step-by-step and end-to-end. The results show that real-world tasks remain a significant challenge for current LLMs.

🚀 **Future directions:** The GTA benchmark effectively exposes the shortcomings of current LLMs in handling real-world tool-use tasks. By leveraging human-written queries, real deployed tools, and multimodal inputs, it delivers a more accurate and comprehensive evaluation of LLM capabilities. These findings underscore the need for further progress in building general-purpose tool agents. The benchmark sets a new standard for evaluating LLMs and will serve as a valuable guide for future research aimed at improving their tool-use capabilities.

The paper addresses the significant challenge of evaluating the tool-use capabilities of large language models (LLMs) in real-world scenarios. Existing benchmarks often fail to effectively measure these capabilities because they rely on AI-generated queries, single-step tasks, dummy tools, and text-only interactions, which do not accurately represent the complexities and requirements of real-world problem-solving.

Current methodologies for evaluating LLMs typically involve synthetic benchmarks that do not reflect the intricacies of real-world tasks. These methods use AI-generated queries and single-step tasks, which are simpler and more predictable than the multifaceted problems encountered in everyday scenarios. Moreover, the tools used in these evaluations are often dummy tools that do not provide a realistic measure of an LLM’s ability to interact with actual software and services.

A team of researchers from Shanghai Jiao Tong University and Shanghai AI Laboratory proposes the General Tool Agents (GTA) benchmark to bridge this gap. This new benchmark is designed to assess LLMs’ tool-use capabilities in real-world situations more accurately. The GTA benchmark features human-written queries with implicit tool-use requirements, real deployed tools spanning various categories (perception, operation, logic, creativity), and multimodal inputs that closely mimic real-world contexts. This setup provides a more comprehensive and realistic evaluation of an LLM’s ability to plan and execute complex tasks using various tools.
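To make this setup concrete, the sketch below shows what a single benchmark task record might look like. The field names and values are illustrative assumptions for this article, not the benchmark's actual schema or data.

```python
# Hypothetical GTA-style task record; field names and values are illustrative
# assumptions, not the benchmark's actual schema.
task = {
    "query": "How much would it cost to buy three of the items shown in the image?",
    "files": ["image_1.jpg"],                    # multimodal input alongside the text query
    "tool_categories": ["perception", "logic"],  # from: perception, operation, logic, creativity
    "reference_toolchain": [
        {"tool": "OCR", "args": {"image": "image_1.jpg"}},
        {"tool": "Calculator", "args": {"expression": "3 * 12.5"}},
    ],
    "reference_answer": "37.5",
}
```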

The GTA benchmark is composed of 229 real-world tasks that require the use of various tools. Each task involves multiple steps and necessitates reasoning and planning by the LLM to determine which tools to use and in what order. The evaluation is carried out using two modes: step-by-step and end-to-end. In the step-by-step mode, the LLM is given the initial steps of a reference toolchain and is expected to predict the next action. This mode evaluates the model’s fine-grained tool-use capabilities without actual tool use, allowing for a detailed comparison of the model’s output against the reference steps.
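As a rough illustration of how step-by-step scoring could work on such a record, the sketch below replays the reference toolchain as a ground-truth prefix and compares the model's predicted next step to the reference at each position. The `predict_next_step` callable and the exact-match scoring are assumptions for illustration, not the benchmark's implementation.

```python
def score_step_by_step(task, predict_next_step):
    """Score a model in step-by-step mode against a reference toolchain.

    `predict_next_step(query, files, history)` is a placeholder for the model
    under test and should return a dict like {"tool": ..., "args": {...}}.
    Exact-match scoring here is a simplifying assumption.
    """
    reference = task["reference_toolchain"]
    tool_hits = arg_hits = 0
    for i, ref_step in enumerate(reference):
        history = reference[:i]  # ground-truth prefix of the toolchain so far
        pred = predict_next_step(task["query"], task["files"], history)
        tool_hits += int(pred.get("tool") == ref_step["tool"])
        arg_hits += int(pred.get("args") == ref_step["args"])
    n = len(reference)
    return {"ToolAcc": tool_hits / n, "ArgAcc": arg_hits / n}
```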

In the end-to-end mode, the LLM calls the tools and attempts to solve the problem by itself, with each step depending on the previous ones. This mode reflects the actual task execution performance of the LLM. The researchers use several metrics to evaluate performance, including instruction-following accuracy (InstAcc), tool selection accuracy (ToolAcc), argument accuracy (ArgAcc), summary accuracy (SummAcc) in the step-by-step mode, and answer accuracy (AnsAcc) in the end-to-end mode.
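Under the same assumptions, a minimal sketch of the end-to-end loop might look like the following: the agent drives real tool calls, each step sees the results of earlier ones, and only the final answer is checked. The `agent.step` interface, the `tools` mapping, and the exact-match answer check are all assumptions for illustration, not the benchmark's harness.

```python
def run_end_to_end(task, agent, tools, max_steps=10):
    """Let the agent call real tools until it returns a final answer.

    `agent.step(query, files, history)` and the `tools` mapping are assumed
    interfaces; the benchmark's own harness may differ.
    """
    history = []
    for _ in range(max_steps):
        action = agent.step(task["query"], task["files"], history)
        if action["type"] == "final_answer":
            correct = action["answer"].strip() == task["reference_answer"]
            return {"AnsAcc": float(correct)}  # exact match is a simplification
        result = tools[action["tool"]](**action["args"])  # execute the real tool
        history.append({"action": action, "result": result})
    return {"AnsAcc": 0.0}  # ran out of steps without a final answer
```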

The results reveal that real-world tasks pose a significant challenge for current LLMs. The best-performing models, GPT-4 and GPT-4o, correctly solved fewer than 50% of the tasks, while other models achieved less than 25% accuracy. These results nevertheless highlight clear room for improvement in LLMs’ tool-use capabilities: among open-source models, Qwen-72B achieved the highest accuracy, demonstrating that with further advancements, LLMs can better meet the demands of real-world scenarios.

The GTA benchmark effectively exposes the shortcomings of current LLMs in handling real-world tool-use tasks. By utilizing human-written queries, real deployed tools, and multimodal inputs, the benchmark provides a more accurate and comprehensive evaluation of LLMs’ capabilities. The findings underscore the pressing need for further advancements in the development of general-purpose tool agents. This benchmark sets a new standard for evaluating LLMs and will serve as a crucial guide for future research aimed at enhancing their tool-use proficiency.


Check out the Paper. All credit for this research goes to the researchers of this project.

