Trending
Articles related to "Evaluation Framework"
On LLM-Assisted Generation of Smart Contracts from Business Processes
cs.AI updates on arXiv.org 2025-08-01T04:08:28.000000Z
Evaluation and Benchmarking of LLM Agents: A Survey
cs.AI updates on arXiv.org 2025-07-30T04:12:09.000000Z
Can You Share Your Story? Modeling Clients' Metacognition and Openness for LLM Therapist Evaluation
cs.AI updates on arXiv.org 2025-07-29T04:21:54.000000Z
Mitigating Geospatial Knowledge Hallucination in Large Language Models: Benchmarking and Dynamic Factuality Aligning
cs.AI updates on arXiv.org 2025-07-29T04:21:52.000000Z
Is Jailbreak Facing Its "Final Round"? HKUST Reshapes the LLM Jailbreak Evaluation Paradigm with "Content Scoring"
PaperWeekly 2025-07-27T09:01:21.000000Z
Is Jailbreak Facing Its "Final Round"? HKUST Reshapes the LLM Jailbreak Evaluation Paradigm with "Content Scoring"
PaperWeekly 2025-07-26T10:21:00.000000Z
SCOPE: Stochastic and Counterbiased Option Placement for Evaluating Large Language Models
cs.AI updates on arXiv.org 2025-07-25T04:28:48.000000Z
RAVine: Reality-Aligned Evaluation for Agentic Search
cs.AI updates on arXiv.org 2025-07-23T04:03:32.000000Z
AIvaluateXR: An Evaluation Framework for on-Device AI in XR with Benchmarking Results
cs.AI updates on arXiv.org 2025-07-21T04:06:41.000000Z
Assessing adaptive world models in machines with novel games
cs.AI updates on arXiv.org 2025-07-18T04:13:41.000000Z
VAR-MATH: Probing True Mathematical Reasoning in Large Language Models via Symbolic Multi-Instance Benchmarks
cs.AI updates on arXiv.org 2025-07-18T04:13:41.000000Z
A Unified Framework for Evaluating the Effectiveness and Enhancing the Transparency of Explainable AI Methods in Real-World Applications
cs.AI updates on arXiv.org 2025-07-16T04:28:42.000000Z
Task Priors: Enhancing Model Evaluation by Considering the Entire Space of Downstream Tasks
cs.AI updates on arXiv.org 2025-07-15T04:26:56.000000Z
Towards Evaluating Robustness of Prompt Adherence in Text to Image Models
cs.AI updates on arXiv.org 2025-07-14T04:08:25.000000Z
ICML 2025 | Drilling Problems ≠ Understanding Math! CogMath Builds a "Cognitive Microscope" to Probe the Mathematical Abilities of LLMs
PaperWeekly 2025-07-14T00:19:01.000000Z
Multigranular Evaluation for Brain Visual Decoding
cs.AI updates on arXiv.org 2025-07-11T04:04:20.000000Z
Composable Strategy Framework with Integrated Video-Text based Large Language Models for Heart Failure Assessment
cs.AI updates on arXiv.org 2025-07-09T04:02:08.000000Z
What Would You Ask When You First Saw $a^2+b^2=c^2$? Evaluating LLM on Curiosity-Driven Questioning
cs.AI updates on arXiv.org 2025-07-09T04:02:04.000000Z
DRAGON: Dynamic RAG Benchmark On News
cs.AI updates on arXiv.org 2025-07-09T04:01:48.000000Z
End-to-End Evaluation for Low-Latency Simultaneous Speech Translation
cs.AI updates on arXiv.org 2025-07-08T06:58:33.000000Z