Trending
Articles related to "evaluation benchmarks"
ElectriQ: A Benchmark for Assessing the Response Capability of Large Language Models in Power Marketing
cs.AI updates on arXiv.org 2025-08-01T04:08:20.000000Z
Survey of NLU Benchmarks Diagnosing Linguistic Phenomena: Why not Standardize Diagnostics Benchmarks?
cs.AI updates on arXiv.org 2025-07-29T04:22:23.000000Z
LRR-Bench: Left, Right or Rotate? Vision-Language models Still Struggle With Spatial Understanding Tasks
cs.AI updates on arXiv.org 2025-07-29T04:22:18.000000Z
MMCircuitEval: A Comprehensive Multimodal Circuit-Focused Benchmark for Evaluating LLMs
cs.AI updates on arXiv.org 2025-07-29T04:21:48.000000Z
ACL 2025 | Process Reward Models Mired in a "Trust Quagmire": PRMBench Tears Off the Mask of False High Accuracy
PaperWeekly 2025-07-27T09:01:22.000000Z
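2.5k questions! HLE builds a breakthrough precision evaluation system for large language models; Jan-Nano, a lightweight 4-billion-parameter large language model designed for deep research tasks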
智源社区 2025-07-21T06:11:45.000000Z
BEARCUBS: A benchmark for computer-using web agents
cs.AI updates on arXiv.org 2025-07-18T04:13:47.000000Z
Absher: A Benchmark for Evaluating Large Language Models Understanding of Saudi Dialects
cs.AI updates on arXiv.org 2025-07-15T04:24:23.000000Z
Traceable Evidence Enhanced Visual Grounded Reasoning: Evaluation and Methodology
cs.AI updates on arXiv.org 2025-07-11T04:04:21.000000Z
LinguaMark: Do Multimodal Models Speak Fairly? A Benchmark-Based Evaluation
cs.AI updates on arXiv.org 2025-07-11T04:04:05.000000Z
PlanQA: A Benchmark for Spatial Reasoning in LLMs using Structured Representations
cs.AI updates on arXiv.org 2025-07-11T04:03:58.000000Z
The bitter lesson of misuse detection
cs.AI updates on arXiv.org 2025-07-10T04:05:40.000000Z
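A new image generation benchmark is here! 57 tasks probe models' generative ability from every angle: which model will deliver the most satisfying images?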
我爱计算机视觉 2025-07-08T12:11:05.000000Z
MoralBench: Moral Evaluation of LLMs
cs.AI updates on arXiv.org 2025-07-08T05:53:45.000000Z
CREW-WILDFIRE: Benchmarking Agentic Multi-Agent Collaborations at Scale
cs.AI updates on arXiv.org 2025-07-08T04:34:03.000000Z
Evaluating Memory in LLM Agents via Incremental Multi-Turn Interactions
cs.AI updates on arXiv.org 2025-07-08T04:34:03.000000Z
HV-MMBench: Benchmarking MLLMs for Human-Centric Video Understanding
cs.AI updates on arXiv.org 2025-07-08T04:33:50.000000Z
LitBench: A Benchmark and Dataset for Reliable Evaluation of Creative Writing
cs.AI updates on arXiv.org 2025-07-02T04:03:47.000000Z
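Realistic evaluation: Beijing Institute of Technology releases the world's first "all-scenario education" benchmark, supporting 4,000+ contexts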
36Kr - Tech Channel 2025-06-03T08:18:58.000000Z
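Realistic evaluation! Beijing Institute of Technology releases the world's first "all-scenario education" benchmark, supporting 4,000+ contexts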
新智元 2025-06-03T05:23:02.000000Z