MarkTechPost@AI October 26, 2024
WorFBench: A Benchmark for Evaluating Complex Workflow Generation in Large Language Model Agents

WORFBENCH is a benchmark for evaluating the workflow-generation capabilities of LLM agents. It addresses the limitations of existing evaluation methods, validating its design through multi-faceted scenarios and complex structures. Experiments reveal significant gaps between models' sequence- and graph-planning abilities, and the authors also point out some remaining limitations of the approach.

🌐WORFBENCH aims to evaluate the workflow-generation capability of LLM agents, using multi-faceted scenarios and complex structures validated through data filtering and human evaluation. It addresses the limitations of existing evaluation methods, such as limited scenario coverage, an exclusive focus on linear relationships, and reliance on particular evaluation methods.

📋WORFBENCH's architecture integrates tasks and action lists from established datasets, constructing node chains before building workflow graphs. It covers function-calling tasks and embodied tasks: the former use GPT-4 to generate subtask nodes in the ReAct format, while the latter require distinct handling because of the dynamic nature of their environments.

📊Performance analysis shows significant gaps between linear and graph planning capabilities across models: GLM-4-9B exhibits a gap of 20.05%, and even the best-performing Llama-3.1-70B shows a 15.01% difference. Benchmark scores of models such as GPT-4 reflect the same pattern.

⚠️Despite its strengths, WORFBENCH has some limitations: quality control over node chains and workflow graphs may be imperfect, and workflows follow a one-pass generation paradigm that assumes all nodes must be traversed to complete the task.

Large Language Models (LLMs) have shown remarkable potential in solving complex real-world problems, from function calls to embodied planning and code generation. A critical capability for LLM agents is decomposing complex problems into executable subtasks through workflows, which serve as intermediate states to improve debugging and interpretability. While workflows provide prior knowledge that helps prevent hallucinations, current evaluation benchmarks for workflow generation face significant challenges: (a) a limited scope of scenarios, focusing only on function-call tasks; (b) a sole emphasis on linear relationships between subtasks, whereas real-world scenarios often involve more complex graph structures, including parallelism; and (c) heavy reliance on GPT-3.5/4 for evaluation.
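To make the chain-versus-graph distinction concrete, here is a minimal Python sketch (not from the paper; the WorkflowGraph class and the subtask names are purely illustrative) of a workflow as a directed acyclic graph, where the absence of an edge between two subtasks allows them to run in parallel:

```python
# Hypothetical sketch: a workflow as a DAG, where edges encode ordering
# constraints and nodes with no path between them may execute in parallel.
from dataclasses import dataclass, field

@dataclass
class WorkflowGraph:
    nodes: list[str]                    # subtask descriptions
    edges: list[tuple[int, int]] = field(default_factory=list)  # (prerequisite, dependent)

    def ready(self, done: set[int]) -> list[int]:
        """Subtasks whose prerequisites are all finished; several may be ready at once."""
        return [i for i in range(len(self.nodes))
                if i not in done
                and all(src in done for src, dst in self.edges if dst == i)]

# Example: subtasks 1 and 2 both depend only on 0, so they can run in parallel.
wf = WorkflowGraph(
    nodes=["search flights", "book outbound", "book return", "send itinerary"],
    edges=[(0, 1), (0, 2), (1, 3), (2, 3)],
)
print(wf.ready(done={0}))  # -> [1, 2]
```

A purely linear benchmark would force subtasks 1 and 2 into an arbitrary order, which is exactly the structural limitation described above.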

Existing work on workflow generation has primarily focused on three areas: large language agents, workflow and agent planning, and workflow generation and evaluation. While LLM agents have been deployed across various domains, including web interfaces, medical applications, and coding tasks, their planning abilities involve either reasoning or environmental interaction. Existing evaluation frameworks attempt to assess workflow generation through semantic similarity matching and GPT-4 scoring in tool-learning scenarios. However, these methods are limited by their focus on linear function calling, their inability to handle complex task structures, and their heavy dependence on potentially biased evaluation methods.

Researchers from Zhejiang University and Alibaba Group have proposed WORFBENCH, a benchmark for evaluating workflow-generation capabilities in LLM agents. It addresses previous limitations by utilizing multi-faceted scenarios and complex workflow structures, validated through rigorous data filtering and human evaluation. Further, the researchers presented WORFEVAL, a systematic evaluation protocol that uses advanced subsequence and subgraph matching algorithms to evaluate chain- and graph-structured workflow generation. Experiments reveal significant performance gaps between sequence and graph planning capabilities, with even advanced models like GPT-4 showing roughly a 15% difference in performance.
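The paper's exact matching algorithms are not reproduced here, but a hedged sketch of the subsequence side of the idea, assuming an LCS-based match between predicted and gold subtask chains and an F1 over matched nodes (both assumptions for illustration, not the authors' published code), shows how a chain score of this kind can be computed:

```python
# Illustrative chain-level scoring in the spirit of WORFEVAL's subsequence
# matching; the paper's exact matching and normalization may differ.

def lcs_length(pred: list[str], gold: list[str]) -> int:
    """Longest common subsequence between predicted and gold subtask chains."""
    m, n = len(pred), len(gold)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            dp[i + 1][j + 1] = (dp[i][j] + 1 if pred[i] == gold[j]
                                else max(dp[i][j + 1], dp[i + 1][j]))
    return dp[m][n]

def f1_chain(pred: list[str], gold: list[str]) -> float:
    """F1 over matched nodes: precision = match/|pred|, recall = match/|gold|."""
    match = lcs_length(pred, gold)
    if match == 0:
        return 0.0
    p, r = match / len(pred), match / len(gold)
    return 2 * p * r / (p + r)

print(f1_chain(["search", "book", "pay"],
               ["search", "compare", "book", "pay"]))  # ~0.857
```

The graph-structured case is considerably harder to score, since subgraph matching is NP-hard in general, which helps explain why a dedicated evaluation protocol is needed rather than simple similarity scoring.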

WORFBENCH’s architecture integrates tasks and action lists from established datasets, using a systematic approach of constructing node chains before building workflow graphs. The framework handles two main task categories: function-calling tasks, where GPT-4 generates subtask nodes in the ReAct format, and embodied tasks, which require distinct handling due to the dynamic nature of their environments. A sketch of this chain-then-graph construction follows.
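Below is a minimal sketch of the chain-then-graph idea, assuming a hypothetical chain_to_graph helper and hand-written dependency annotations; the actual benchmark derives these structures with model assistance and careful filtering:

```python
# Hypothetical sketch: start from an ordered node chain and keep only the
# annotated dependencies as edges, so independent subtasks become parallel branches.

def chain_to_graph(chain: list[str],
                   deps: dict[int, list[int]]) -> list[tuple[int, int]]:
    """Turn an ordered subtask chain plus per-node prerequisites into an edge list."""
    edges = []
    for node, prereqs in deps.items():
        for p in prereqs:
            assert p < node < len(chain), "prerequisites must precede the node"
            edges.append((p, node))
    return edges

chain = ["find recipe", "buy ingredients", "preheat oven", "bake"]
# "buy ingredients" and "preheat oven" both depend only on "find recipe",
# so they are parallel; "bake" waits for both.
print(chain_to_graph(chain, {1: [0], 2: [0], 3: [1, 2]}))
# -> [(0, 1), (0, 2), (1, 3), (2, 3)]
```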

Performance analysis reveals significant disparities between linear and graph planning capabilities across all models. While GLM-4-9B showed the largest gap of 20.05%, even the best-performing Llama-3.1-70B demonstrated a 15.01% difference. In benchmark testing, GPT-4 achieved only 67.32% and 52.47% on the f1_chain and f1_graph scores respectively, while Claude-3.5 topped open-grounded planning tasks with 61.73% and 41.49%. Performance consistently declines as workflow complexity increases with more nodes and edges. Analysis of low-performing samples identified four primary error types: inadequate task granularity, vague subtask descriptions, incorrect graph structures, and format non-compliance.

In conclusion, the researchers introduced WORFBENCH, a benchmark for evaluating workflow-generation capabilities in LLM agents. Through WORFEVAL's quantitative algorithms, they revealed substantial performance gaps between linear and graph-structured workflow generation across various LLM architectures. The paper highlights the current limitations of LLM agents in complex workflow planning and provides a foundation for future improvements in agent architecture development. The proposed method does have some limitations: although strict quality control is enforced on node chains and workflow graphs, some queries may still have quality issues, and the workflow currently follows a one-pass generation paradigm that assumes all nodes must be traversed to complete the task.


Check out the Paper. All credit for this research goes to the researchers of this project. Also, don't forget to follow us on Twitter and join our Telegram Channel and LinkedIn Group. If you like our work, you will love our newsletter. Don't forget to join our 55k+ ML SubReddit.


