Evaluation Framework_Fishai

Articles related to "Evaluation Framework"

On LLM-Assisted Generation of Smart Contracts from Business Processes

cs.AI updates on arXiv.org 2025-08-01T04:08:28.000000Z

Evaluation and Benchmarking of LLM Agents: A Survey

cs.AI updates on arXiv.org 2025-07-30T04:12:09.000000Z

Can You Share Your Story? Modeling Clients' Metacognition and Openness for LLM Therapist Evaluation

cs.AI updates on arXiv.org 2025-07-29T04:21:54.000000Z

Mitigating Geospatial Knowledge Hallucination in Large Language Models: Benchmarking and Dynamic Factuality Aligning

cs.AI updates on arXiv.org 2025-07-29T04:21:52.000000Z

Is Jailbreak Entering Its "Final Round"? HKUST Uses "Content Scoring" to Reshape the Jailbreak Evaluation Paradigm for Large Models

PaperWeekly 2025-07-27T09:01:21.000000Z

Is Jailbreak Entering Its "Final Round"? HKUST Uses "Content Scoring" to Reshape the Jailbreak Evaluation Paradigm for Large Models

PaperWeekly 2025-07-26T10:21:00.000000Z

SCOPE: Stochastic and Counterbiased Option Placement for Evaluating Large Language Models

cs.AI updates on arXiv.org 2025-07-25T04:28:48.000000Z

RAVine: Reality-Aligned Evaluation for Agentic Search

cs.AI updates on arXiv.org 2025-07-23T04:03:32.000000Z

AIvaluateXR: An Evaluation Framework for on-Device AI in XR with Benchmarking Results

cs.AI updates on arXiv.org 2025-07-21T04:06:41.000000Z

Assessing adaptive world models in machines with novel games

cs.AI updates on arXiv.org 2025-07-18T04:13:41.000000Z

VAR-MATH: Probing True Mathematical Reasoning in Large Language Models via Symbolic Multi-Instance Benchmarks

cs.AI updates on arXiv.org 2025-07-18T04:13:41.000000Z

A Unified Framework for Evaluating the Effectiveness and Enhancing the Transparency of Explainable AI Methods in Real-World Applications

cs.AI updates on arXiv.org 2025-07-16T04:28:42.000000Z

Task Priors: Enhancing Model Evaluation by Considering the Entire Space of Downstream Tasks

cs.AI updates on arXiv.org 2025-07-15T04:26:56.000000Z

Towards Evaluating Robustness of Prompt Adherence in Text to Image Models

cs.AI updates on arXiv.org 2025-07-14T04:08:25.000000Z

ICML 2025 | Drilling Problems ≠ Understanding Math! CogMath Builds a "Cognitive Microscope" to Probe the Mathematical Abilities of Large Models

PaperWeekly 2025-07-14T00:19:01.000000Z

Multigranular Evaluation for Brain Visual Decoding

cs.AI updates on arXiv.org 2025-07-11T04:04:20.000000Z

Composable Strategy Framework with Integrated Video-Text based Large Language Models for Heart Failure Assessment

cs.AI updates on arXiv.org 2025-07-09T04:02:08.000000Z

What Would You Ask When You First Saw $a^2+b^2=c^2$? Evaluating LLM on Curiosity-Driven Questioning

cs.AI updates on arXiv.org 2025-07-09T04:02:04.000000Z

DRAGON: Dynamic RAG Benchmark On News

cs.AI updates on arXiv.org 2025-07-09T04:01:48.000000Z

End-to-End Evaluation for Low-Latency Simultaneous Speech Translation

cs.AI updates on arXiv.org 2025-07-08T06:58:33.000000Z
