Auto-Eval Judge: Towards a General Agentic Framework for Task Completion Evaluation

cs.AI updates on arXiv.org 12小时前

Auto-Eval Judge: Towards a General Agentic Framework for Task Completion Evaluation

本文提出一种通用的智能体任务完成评估框架，通过模拟人类评估过程，将任务分解为子任务，并利用智能体的输出和推理信息进行验证，提高评估准确性。

arXiv:2508.05508v1 Announce Type: new Abstract: The increasing adoption of foundation models as agents across diverse domains necessitates a robust evaluation framework. Current methods, such as LLM-as-a-Judge, focus only on final outputs, overlooking the step-by-step reasoning that drives agentic decision-making. Meanwhile, existing Agent-as-a-Judge systems, where one agent evaluates another's task completion, are typically designed for narrow, domain-specific settings. To address this gap, we propose a generalizable, modular framework for evaluating agent task completion independent of the task domain. The framework emulates human-like evaluation by decomposing tasks into sub-tasks and validating each step using available information, such as the agent's output and reasoning. Each module contributes to a specific aspect of the evaluation process, and their outputs are aggregated to produce a final verdict on task completion. We validate our framework by evaluating the Magentic-One Actor Agent on two benchmarks, GAIA and BigCodeBench. Our Judge Agent predicts task success with closer agreement to human evaluations, achieving 4.76% and 10.52% higher alignment accuracy, respectively, compared to the GPT-4o based LLM-as-a-Judge baseline. This demonstrates the potential of our proposed general-purpose evaluation framework.

Fish AI Reader

AI辅助创作，多种专业模板，深度分析，高质量内容生成。从观点提取到深度思考，FishAI为您提供全方位的创作支持。新版本引入自定义参数，让您的创作更加个性化和精准。

FishAI

鱼阅，AI 时代的下一个智能信息助手，助你摆脱信息焦虑

联系邮箱 441953276@qq.com

相关标签

智能体评估任务分解评估框架

相关文章

This AI Paper from UC Berkeley Research Highlights How Task Decomposition Breaks the Safety of Artificial Intelligence (AI) Systems, Leading to Misuse

一篇OpenAI、微软等系统性Prompt技术报告

Enhancing Vision-Language Models: Addressing Multi-Object Hallucination and Cultural Inclusivity for Improved Visual Assistance in Diverse Contexts

A New AI Study from MIT Shows Someone’s Beliefs about an LLM Play a Significant Role in the Model’s Performance and are Important for How It is Deployed

Large language models don’t behave like people, even though we may expect them to

中科大/华为诺亚出手！芯片性能≠布局评分，EDA物理设计框架全面开源

RAGChecker: A Fine-Grained Evaluation Framework for Diagnosing Retrieval and Generation Modules in RAG

This AI Paper from MIT Explores the Complexities of Teaching Language Models to Forget: Insights from Randomized Fine-Tuning

LLM-CI: A New Machine Learning Framework to Assess Privacy Norms Encoded in LLMs

AI科学家太多，谁靠谱一试便知！普林斯顿新基准CORE-Bench：最强模型仅有21%准确率