Trending
Articles related to "Performance Evaluation"
Can LLM-Reasoning Models Replace Classical Planning? A Benchmark Study
cs.AI updates on arXiv.org 2025-08-01T04:08:18.000000Z
The Ultimate 2025 Guide to Coding LLM Benchmarks and Performance Metrics
MarkTechPost@AI 2025-07-31T08:54:46.000000Z
Systematic Evaluation of Knowledge Graph Repair with Large Language Models
cs.AI updates on arXiv.org 2025-07-31T04:48:06.000000Z
When Prompts Go Wrong: Evaluating Code Model Robustness to Ambiguous, Contradictory, and Incomplete Task Descriptions
cs.AI updates on arXiv.org 2025-07-29T04:22:24.000000Z
MMCircuitEval: A Comprehensive Multimodal Circuit-Focused Benchmark for Evaluating LLMs
cs.AI updates on arXiv.org 2025-07-29T04:21:48.000000Z
MazeEval: A Benchmark for Testing Sequential Decision-Making in Language Models
cs.AI updates on arXiv.org 2025-07-29T04:21:37.000000Z
Can LLMs Solve ASP Problems? Insights from a Benchmarking Study (Extended Version)
cs.AI updates on arXiv.org 2025-07-29T04:21:32.000000Z
The Right to be Forgotten in Pruning: Unveil Machine Unlearning on Sparse Models
cs.AI updates on arXiv.org 2025-07-28T04:42:46.000000Z
CLEAR: Error Analysis via LLM-as-a-Judge Made Easy
cs.AI updates on arXiv.org 2025-07-25T04:28:53.000000Z
Decoupling Knowledge and Reasoning in LLMs: An Exploration Using Cognitive Dual-System Theory
cs.AI updates on arXiv.org 2025-07-25T04:28:31.000000Z
Probing Vision-Language Understanding through the Visual Entailment Task: promises and pitfalls
cs.AI updates on arXiv.org 2025-07-24T05:31:21.000000Z
On the transferability of Sparse Autoencoders for interpreting compressed models
cs.AI updates on arXiv.org 2025-07-23T04:03:15.000000Z
Benchmarking Foundation Models with Multimodal Public Electronic Health Records
cs.AI updates on arXiv.org 2025-07-22T04:44:45.000000Z
[Q&A] Calling all PC-building experts! Can an all-in-one mini PC handle three instances on PVE?
V2EX 2025-07-21T11:20:39.000000Z
[Q&A] Calling all PC-building experts! Can an all-in-one mini PC handle three instances on PVE?
V2EX 2025-07-21T09:16:59.000000Z
[Q&A] Calling all PC-building experts! Can an all-in-one mini PC handle three instances on PVE?
V2EX 2025-07-21T08:17:21.000000Z
[Q&A] I want to use Claude Code. Any recommendations for a suitable relay service?
V2EX 2025-07-21T06:20:41.000000Z
Kolmogorov Arnold Networks (KANs) for Imbalanced Data -- An Empirical Perspective
cs.AI updates on arXiv.org 2025-07-21T04:06:44.000000Z
ES vs Milvus vs PG vector: A Guide to Choosing a Vector Database in the LLM Era
Zilliz 2025-07-18T11:40:41.000000Z
Benchmarking Deception Probes via Black-to-White Performance Boosts
cs.AI updates on arXiv.org 2025-07-18T04:13:40.000000Z