MarkTechPost@AI · July 23, 2024
WTU-Eval: A New Standard Benchmark for Evaluating Large Language Models' (LLMs) Tool Usage Capabilities

WTU-Eval is a new benchmark for evaluating large language models' (LLMs) ability to use external tools. It comprises eleven datasets, six of which require tool use while the other five do not, testing whether LLMs can accurately judge when a tool is needed. The benchmark covers tasks such as machine translation, math reasoning, and real-time web search, providing a comprehensive evaluation framework.

🤔 WTU-Eval evaluates large language models' ability to use external tools across eleven datasets covering scenarios that do and do not require tools, comprehensively assessing whether a model can judge the right moment for tool use.

🤖 The researchers also built a 4,000-instance fine-tuning dataset to improve models' tool-usage decisions; experiments show that fine-tuning brings marked gains on datasets requiring real-time information retrieval and mathematical calculation.

📊 Evaluation results show that most models struggle to judge whether tools are needed on general datasets, but their performance on tool-usage datasets improves as their capabilities approach those of models like ChatGPT.

📈 After fine-tuning, the Llama2-7B model's average performance improved by 14% and its incorrect tool-usage rate fell by 16.8%, showing that targeted training is essential for improving tool-usage decisions.

💡 Different tools affect LLM performance differently: simple tools such as translators are easier for LLMs to manage, while complex tools such as calculators and search engines are more challenging.

Large Language Models (LLMs) excel in various tasks, including text generation, translation, and summarization. However, a growing challenge within NLP is how these models can effectively interact with external tools to perform tasks beyond their inherent capabilities. This challenge is particularly relevant in real-world applications where LLMs must fetch real-time data, perform complex calculations, or interact with APIs to complete tasks accurately.

One major issue is LLMs’ decision-making process regarding when to use external tools. In real-world scenarios, it is often unclear whether a tool is necessary. Incorrect or unnecessary tool usage can lead to significant errors and inefficiencies. Therefore, the core problem recent research addresses is enhancing LLMs’ ability to discern their capability boundaries and make accurate decisions about tool usage. This improvement is crucial for maintaining LLMs’ performance and reliability in practical applications.
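
To make the decision problem concrete, below is a minimal sketch of what a tool-usage decision step could look like in practice. This is an illustration, not the paper's method; the tool catalog and the `query_llm` callable are hypothetical placeholders.

```python
# Hypothetical sketch of a tool-usage decision step. The tool catalog
# and the query_llm callable are illustrative, not taken from the paper.

TOOLS = {
    "calculator": "evaluates arithmetic expressions",
    "search": "retrieves real-time information from the web",
    "translator": "translates text between languages",
}

def decide_tool_usage(question: str, query_llm) -> str:
    """Ask the model to name a tool, or 'none' if it can answer directly."""
    tool_list = "\n".join(f"- {name}: {desc}" for name, desc in TOOLS.items())
    prompt = (
        "You may call one of the following tools, or answer directly if "
        "no tool is needed.\n"
        f"Tools:\n{tool_list}\n\n"
        f"Question: {question}\n"
        "Reply with only the tool name, or 'none'."
    )
    return query_llm(prompt).strip().lower()
```

A wrong decision at this step is exactly the failure mode WTU-Eval measures: calling a tool for a question the model could answer itself, or answering unaided when a tool was required.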

Traditionally, methods to improve LLMs’ tool usage have focused on fine-tuning models for specific tasks where tool use is mandatory. Techniques such as reinforcement learning and decision trees have shown promise, particularly in mathematical reasoning and web searches. Benchmarks like APIBench and ToolBench have been developed to evaluate LLMs’ proficiency with APIs and real-world tools. However, these benchmarks typically assume that tool usage is always required, which does not reflect the uncertainty and variability encountered in real-world scenarios.

Researchers from Beijing Jiaotong University, Fuzhou University, and the Institute of Automation CAS introduced the Whether-or-not tool usage Evaluation benchmark (WTU-Eval) to address this gap. This benchmark is designed to assess the decision-making flexibility of LLMs regarding tool usage. WTU-Eval comprises eleven datasets, six of which explicitly require tool usage, while the remaining five are general datasets that can be solved without tools. This structure allows for a comprehensive evaluation of whether LLMs can discern when tool usage is necessary. The benchmark includes tasks such as machine translation, math reasoning, and real-time web searches, providing a robust framework for assessment.
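
To illustrate the benchmark's structure, its split might be represented as below. Apart from GSM8K, which the article mentions later, the dataset names are placeholders rather than the paper's actual selection.

```python
# Placeholder layout of WTU-Eval's eleven datasets. Only GSM8K is named
# in this article; the other entries stand in for the paper's actual picks.

WTU_EVAL = {
    "tool_required": [  # six datasets whose questions need a tool
        {"name": "gsm8k", "task": "math reasoning", "tool": "calculator"},
        {"name": "realtime_qa", "task": "real-time web search", "tool": "search"},
        {"name": "translation", "task": "machine translation", "tool": "translator"},
        # ...three more tool-required datasets in the real benchmark
    ],
    "general": [  # five datasets solvable without any tool
        {"name": "general_qa", "task": "question answering", "tool": None},
        # ...four more general datasets in the real benchmark
    ],
}

def gold_tool(dataset: str):
    """Return the tool a dataset's questions require, or None."""
    for split in WTU_EVAL.values():
        for d in split:
            if d["name"] == dataset:
                return d["tool"]
    raise KeyError(dataset)
```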

The research team also developed a fine-tuning dataset of 4000 instances derived from WTU-Eval’s training sets. This dataset is designed to improve the decision-making capabilities of LLMs regarding tool usage. By fine-tuning the models with this dataset, the researchers aimed to enhance the accuracy and efficiency of LLMs in recognizing when to use tools and effectively integrating tool outputs into their responses.
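
The article does not specify the format of these instances, so the schema below is an assumption: a standard instruction-tuning layout in which each example teaches the model to first state whether a tool is needed and then answer.

```python
# Assumed instruction-tuning schema for the 4,000-instance dataset; the
# article only gives its size and provenance, so the field names and
# response phrasing here are guesses.

finetune_examples = [
    {   # a question that genuinely needs a tool
        "instruction": "Decide whether a tool is needed, then answer.",
        "input": "What is 37 * 482?",
        "output": "Tool needed: calculator. Result: 17834.",
    },
    {   # a question the model should answer without any tool
        "instruction": "Decide whether a tool is needed, then answer.",
        "input": "What is the capital of France?",
        "output": "No tool needed. Answer: Paris.",
    },
]
```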

The evaluation of eight prominent LLMs using WTU-Eval revealed several key findings. First, most models struggle to determine whether tools are needed on general datasets. For example, Llama2-13B’s performance dropped to 0% on some tool questions in zero-shot settings, highlighting the difficulty LLMs face in these scenarios. However, models’ performance on tool-usage datasets improved when their capabilities aligned more closely with those of ChatGPT. Fine-tuning the Llama2-7B model led to a 14% average performance improvement and a 16.8% decrease in incorrect tool usage. This enhancement was particularly notable in datasets requiring real-time information retrieval and mathematical calculations.
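
The two headline figures can be read as simple aggregates over the benchmark's examples. Here is a hedged sketch, with assumed record fields, of how they might be computed:

```python
# Sketch of the two reported aggregates, with assumed record fields:
# "correct" (bool), plus "chosen_tool" and "gold_tool" (str or None).

def average_performance(records: list[dict]) -> float:
    """Mean task accuracy across benchmark examples."""
    return sum(r["correct"] for r in records) / len(records)

def incorrect_tool_usage_rate(records: list[dict]) -> float:
    """Share of examples where the model's tool choice (including
    choosing no tool) differs from the ground truth."""
    wrong = sum(1 for r in records if r["chosen_tool"] != r["gold_tool"])
    return wrong / len(records)
```

Under this reading, the reported 16.8% drop means the fine-tuned Llama2-7B made substantially fewer wrong tool calls over the same set of examples.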

Further analysis showed that different tools had varying impacts on LLM performance. For instance, simpler tools like translators were managed more efficiently by LLMs, while complex tools like calculators and search engines presented greater challenges. In zero-shot settings, the proficiency of LLMs decreased significantly with the complexity of the tools. For example, Llama2-7B’s performance dropped to 0% when using complex tools in certain datasets, while ChatGPT showed significant improvements of up to 25% in tasks like GSM8K when tools were used appropriately.

The WTU-Eval benchmark’s rigorous evaluation process provides valuable insights into LLMs’ tool usage limitations and potential improvements. The benchmark’s design, which includes a mix of tool usage and general datasets, allows for a detailed assessment of models’ decision-making capabilities. The fine-tuning dataset’s success in improving performance underscores the importance of targeted training to enhance LLMs’ tool usage decisions.

In conclusion, the research highlights the critical need for LLMs to develop better decision-making capabilities regarding tool usage. The WTU-Eval benchmark offers a comprehensive framework for assessing these capabilities, revealing that while fine-tuning can significantly improve performance, many models still struggle to determine their capability boundaries accurately. Future work should focus on expanding the benchmark with more datasets and tools, and on exploring a broader range of LLM types, to further enhance their practical applications in diverse real-world scenarios.


Check out the Paper. All credit for this research goes to the researchers of this project.
