Evaluation Benchmarks_Fishai

Articles related to "Evaluation Benchmarks"

ElectriQ: A Benchmark for Assessing the Response Capability of Large Language Models in Power Marketing

cs.AI updates on arXiv.org 2025-08-01T04:08:20.000000Z

Survey of NLU Benchmarks Diagnosing Linguistic Phenomena: Why not Standardize Diagnostics Benchmarks?

cs.AI updates on arXiv.org 2025-07-29T04:22:23.000000Z

LRR-Bench: Left, Right or Rotate? Vision-Language models Still Struggle With Spatial Understanding Tasks

cs.AI updates on arXiv.org 2025-07-29T04:22:18.000000Z

MMCircuitEval: A Comprehensive Multimodal Circuit-Focused Benchmark for Evaluating LLMs

cs.AI updates on arXiv.org 2025-07-29T04:21:48.000000Z

ACL 2025 | Process Reward Models Are Mired in a "Trust Quagmire": PRMBench Tears Off the Mask of False High Accuracy

PaperWeekly 2025-07-27T09:01:22.000000Z

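2.5k Questions! HLE Builds a Breakthrough System for Precise Evaluation of Large Language Models; Jan-Nano, a Lightweight 4-Billion-Parameter Large Language Model Designed for Deep Research Tasks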

智源社区 2025-07-21T06:11:45.000000Z

BEARCUBS: A benchmark for computer-using web agents

cs.AI updates on arXiv.org 2025-07-18T04:13:47.000000Z

Absher: A Benchmark for Evaluating Large Language Models Understanding of Saudi Dialects

cs.AI updates on arXiv.org 2025-07-15T04:24:23.000000Z

Traceable Evidence Enhanced Visual Grounded Reasoning: Evaluation and Methodology

cs.AI updates on arXiv.org 2025-07-11T04:04:21.000000Z

LinguaMark: Do Multimodal Models Speak Fairly? A Benchmark-Based Evaluation

cs.AI updates on arXiv.org 2025-07-11T04:04:05.000000Z

PlanQA: A Benchmark for Spatial Reasoning in LLMs using Structured Representations

cs.AI updates on arXiv.org 2025-07-11T04:03:58.000000Z

The bitter lesson of misuse detection

cs.AI updates on arXiv.org 2025-07-10T04:05:40.000000Z

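A New Image Generation Benchmark Is Here! 57 Tasks Probe Models' Generative Ability from Every Angle: Which Model Delivers the Most Satisfying Images?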

我爱计算机视觉 2025-07-08T12:11:05.000000Z

MoralBench: Moral Evaluation of LLMs

cs.AI updates on arXiv.org 2025-07-08T05:53:45.000000Z

CREW-WILDFIRE: Benchmarking Agentic Multi-Agent Collaborations at Scale

cs.AI updates on arXiv.org 2025-07-08T04:34:03.000000Z

Evaluating Memory in LLM Agents via Incremental Multi-Turn Interactions

cs.AI updates on arXiv.org 2025-07-08T04:34:03.000000Z

HV-MMBench: Benchmarking MLLMs for Human-Centric Video Understanding

cs.AI updates on arXiv.org 2025-07-08T04:33:50.000000Z

LitBench: A Benchmark and Dataset for Reliable Evaluation of Creative Writing

cs.AI updates on arXiv.org 2025-07-02T04:03:47.000000Z

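Realistic Evaluation: Beijing Institute of Technology Releases the World's First "All-Scenario Education" Benchmark, Supporting 4,000+ Contexts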

36氪 - Tech Channel 2025-06-03T08:18:58.000000Z

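Realistic Evaluation! Beijing Institute of Technology Releases the World's First "All-Scenario Education" Benchmark, Supporting 4,000+ Contexts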

新智元 2025-06-03T05:23:02.000000Z
