DICE-BENCH: Evaluating the Tool-Use Capabilities of Large Language Models in Multi-Round, Multi-Party Dialogues

cs.AI updates on arXiv.org 07月03日 12:07

DICE-BENCH: Evaluating the Tool-Use Capabilities of Large Language Models in Multi-Round, Multi-Party Dialogues

本文提出DICE-SCORE和DICE-BENCH框架，评估函数调用在对话中的信息分散度，通过合成对话构建实际场景下的函数调用数据集，以提升模型在现实应用中的有效性。

arXiv:2506.22853v2 Announce Type: replace-cross Abstract: Existing function-calling benchmarks focus on single-turn interactions. However, they overlook the complexity of real-world scenarios. To quantify how existing benchmarks address practical applications, we introduce DICE-SCORE, a metric that evaluates the dispersion of tool-related information such as function name and parameter values throughout the dialogue. Analyzing existing benchmarks through DICE-SCORE reveals notably low scores, highlighting the need for more realistic scenarios. To address this gap, we present DICE-BENCH, a framework that constructs practical function-calling datasets by synthesizing conversations through a tool graph that maintains dependencies across rounds and a multi-agent system with distinct personas to enhance dialogue naturalness. The final dataset comprises 1,607 high-DICE-SCORE instances. Our experiments on 19 LLMs with DICE-BENCH show that significant advances are still required before such models can be deployed effectively in real-world settings. Our code and data are all publicly available: https://snuhcc.github.io/DICE-Bench/.

Fish AI Reader

AI辅助创作，多种专业模板，深度分析，高质量内容生成。从观点提取到深度思考，FishAI为您提供全方位的创作支持。新版本引入自定义参数，让您的创作更加个性化和精准。

FishAI

鱼阅，AI 时代的下一个智能信息助手，助你摆脱信息焦虑

联系邮箱 441953276@qq.com

相关标签

DICE-SCORE DICE-BENCH 函数调用基准测试实际应用

相关文章

Cross-Device AI Acceleration, Compilation & Execution with Jeff Gehlhaar - #500

TIGER-Lab Introduces MMLU-Pro Dataset for Comprehensive Benchmarking of Large Language Models’ Capabilities and Performance

Researchers at the University of Freiburg and Bosch AI Propose HW-GPT-Bench: A Hardware-Aware Language Model Surrogate Benchmark

Ask HN: 函数调用在引擎盖下是如何工作的？

MMLU-Pro: An Enhanced Benchmark Designed to Evaluate Language Understanding Models Across Broader and More Challenging Tasks

benchexec： BenchExec：可靠的基准测试和资源测量框架

A Comprehensive Study by BentoML on Benchmarking LLM Inference Backends: Performance Analysis of vLLM, LMDeploy, MLC-LLM, TensorRT-LLM, and TGI

Benchmarking Federated Learning for Large Language Models with FedLLM-Bench

Unlocking the Language of Proteins: How Large Language Models Are Revolutionizing Protein Sequence Understanding

BiGGen Bench: A Benchmark Designed to Evaluate Nine Core Capabilities of Language Models