cs.AI updates on arXiv.org, July 25, 12:28
CLEAR: Error Analysis via LLM-as-a-Judge Made Easy

This article introduces CLEAR, an open-source tool for LLM-based error analysis. It aims to provide deeper insight into LLM performance evaluation, including per-instance feedback, system-level error issues, and quantified issue frequencies, all of which can be explored visually through an interactive dashboard.

arXiv:2507.18392v1 Announce Type: cross Abstract: The evaluation of Large Language Models (LLMs) increasingly relies on other LLMs acting as judges. However, current evaluation paradigms typically yield a single score or ranking, answering which model is better but not why. While essential for benchmarking, these top-level scores obscure the specific, actionable reasons behind a model's performance. To bridge this gap, we introduce CLEAR, an interactive, open-source package for LLM-based error analysis. CLEAR first generates per-instance textual feedback, then derives a set of system-level error issues and quantifies the prevalence of each identified issue. Our package also provides users with an interactive dashboard that supports comprehensive error analysis through aggregate visualizations, interactive filters to isolate specific issues or score ranges, and drill-down views of the individual instances that exemplify a particular behavioral pattern. We demonstrate CLEAR analysis for RAG and Math benchmarks, and showcase its utility through a user case study.
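The three-stage workflow the abstract describes (per-instance judge feedback, distillation into system-level issues, prevalence counting) can be illustrated with a short sketch. The snippet below is not the actual CLEAR API; `call_llm` is a hypothetical stand-in for any chat-completion client, and the prompts and the eight-issue cap are assumptions made purely for illustration.

```python
# Minimal sketch of an LLM-as-a-Judge error-analysis pipeline in the spirit of CLEAR.
# NOT the CLEAR package API: `call_llm` is a hypothetical helper standing in for any
# chat-completion client; prompts and thresholds are illustrative assumptions.
from collections import Counter
from typing import Callable


def analyze_errors(records: list[dict], call_llm: Callable[[str], str]) -> Counter:
    """records: [{"question": ..., "reference": ..., "answer": ...}, ...]"""
    # Stage 1: per-instance textual feedback from a judge model.
    feedback = []
    for r in records:
        prompt = (
            "You are a strict evaluator. Given the question, reference answer and "
            "model answer, describe the main error, or say 'no error'.\n"
            f"Question: {r['question']}\nReference: {r['reference']}\n"
            f"Answer: {r['answer']}"
        )
        feedback.append(call_llm(prompt))

    # Stage 2: distill free-text feedback into a small set of system-level issues.
    issues_prompt = (
        "Cluster the following feedback notes into at most 8 recurring error issues, "
        "one short issue name per line:\n" + "\n".join(feedback)
    )
    issue_names = [ln.strip() for ln in call_llm(issues_prompt).splitlines() if ln.strip()]

    # Stage 3: quantify prevalence by labeling each instance with its best-matching issue.
    counts: Counter = Counter()
    for fb in feedback:
        label_prompt = (
            "Pick the single best-matching issue for this feedback, or answer 'none':\n"
            f"Issues: {issue_names}\nFeedback: {fb}"
        )
        counts[call_llm(label_prompt).strip()] += 1
    return counts
```

A caller would wrap their judge model's completion call in `call_llm` and pass the benchmark records; the returned counter corresponds to the kind of issue-frequency view that CLEAR surfaces through its dashboard.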


Related tags

LLM, error analysis, open-source tools, visualization, performance evaluation