CompassJudger-2: Towards Generalist Judge Model via Verifiable Rewards

cs.AI updates on arXiv.org 07月15日 12:24

CompassJudger-2: Towards Generalist Judge Model via Verifiable Rewards

本文介绍了CompassJudger-2，一个基于多领域数据策略的通用评估模型，通过改进学习目标和奖励机制，在多个基准测试中取得优异表现，并提出JudgerBenchV2作为新的评估标准。

arXiv:2507.09104v1 Announce Type: cross Abstract: Recently, the role of LLM-as-judge in evaluating large language models has gained prominence. However, current judge models suffer from narrow specialization and limited robustness, undermining their capacity for comprehensive evaluations. In this work, we present CompassJudger-2, a novel generalist judge model that overcomes these limitations via a task-driven, multi-domain data curation strategy. Central to our approach is supervising judgment tasks with verifiable rewards, guiding intrinsic critical reasoning through rejection sampling to foster robust, generalizable judgment capabilities. We introduce a refined learning objective with margin policy gradient loss to enhance performance. Empirically, CompassJudger-2 achieves superior results across multiple judge and reward benchmarks, and our 7B model demonstrates competitive judgment accuracy with significantly larger models like DeepSeek-V3 and Qwen3-235B-A22B. Additionally, we propose JudgerBenchV2, a comprehensive benchmark evaluating cross-domain judgment accuracy and rank consistency to standardize judge model evaluation. These contributions advance robust, scalable LLM judgment and establish new performance and evaluation standards.

Fish AI Reader

AI辅助创作，多种专业模板，深度分析，高质量内容生成。从观点提取到深度思考，FishAI为您提供全方位的创作支持。新版本引入自定义参数，让您的创作更加个性化和精准。

FishAI

鱼阅，AI 时代的下一个智能信息助手，助你摆脱信息焦虑

联系邮箱 441953276@qq.com

相关标签

CompassJudger-2 LLM评估多领域数据 JudgerBenchV2

相关文章

Prometheus-Eval and Prometheus 2: Setting New Standards in LLM Evaluation and Open-Source Innovation with State-of-the-art Evaluator Language Model

GraCoRe: A New AI Benchmark for Unveiling Strengths and Weaknesses in LLM Graph Comprehension and Reasoning

This AI Paper from Cornell Introduces UCB-E and UCB-E-LRF: Multi-Armed Bandit Algorithms for Efficient and Cost-Effective LLM Evaluation

tinyBenchmarks: Revolutionizing LLM Evaluation with 100-Example Curated Sets, Reducing Costs by Over 98% While Maintaining High Accuracy

Comet Launches Opik: A Comprehensive Open-Source Tool for End-to-End LLM Evaluation, Prompt Tracking, and Pre-Deployment Testing with Seamless Integration

Salesforce AI Introduces SFR-Judge: A Family of Three Judge Models of 8-Billion Parameters 8B, 12B, and 70B Size, Built with Meta Llama 3 and Mistral NeMO

LMSYS Chatbot Arena: Live and Community-Driven LLM Evaluation

Chatbot Arena Conversation Dataset Release

Chatbot Arena Leaderboard Week 8: Introducing MT-Bench and Vicuna-33B

Vinoground: A Temporal Counterfactual Large Multimodal Models LMM Evaluation Benchmark Encompassing 1000 Short and Natural Video-Caption Pairs