Trending
Articles related to "Model Evaluation"
Can one size fit all?: Measuring Failure in Multi-Document Summarization Domain Transfer
cs.AI updates on arXiv.org 2025-08-01T04:08:22.000000Z
Iterative Repair with Weak Verifiers for Few-shot Transfer in KBQA with Unanswerability
cs.AI updates on arXiv.org 2025-08-01T04:08:16.000000Z
Measuring Time-Series Dataset Similarity using Wasserstein Distance
cs.AI updates on arXiv.org 2025-07-31T04:48:01.000000Z
Machine Learning Experiences: A story of learning AI for use in enterprise software testing that can be used by anyone
cs.AI updates on arXiv.org 2025-07-31T04:47:56.000000Z
Explainability Through Systematicity: The Hard Systematicity Challenge for Artificial Intelligence
cs.AI updates on arXiv.org 2025-07-31T04:47:52.000000Z
What Is the pass@5 Metric for Evaluating Large Language Model Performance?
Juejin AI 2025-07-31T03:43:42.000000Z
Evaluating Deep Learning Models for African Wildlife Image Classification: From DenseNet to Vision Transformers
cs.AI updates on arXiv.org 2025-07-30T04:46:13.000000Z
Automatic Classification of User Requirements from Online Feedback -- A Replication Study
cs.AI updates on arXiv.org 2025-07-30T04:12:10.000000Z
Long-Term Fairness Inquiries and Pursuits in Machine Learning: A Survey of Notions, Methods, and Challenges
cs.AI updates on arXiv.org 2025-07-30T04:12:02.000000Z
GovRelBench: A Benchmark for Government Domain Relevance
cs.AI updates on arXiv.org 2025-07-30T04:11:53.000000Z
Part 7: Model Evaluation and Tuning: Making Models Run Better
Juejin AI 2025-07-30T02:27:49.000000Z
Gen-AI Police Sketches with Stable Diffusion
cs.AI updates on arXiv.org 2025-07-28T04:42:45.000000Z
An Introduction to the GAIA Benchmark
Juejin AI 2025-07-27T01:37:16.000000Z
REST: A Stress-Testing Framework for Evaluating Multi-Problem Reasoning in Large Reasoning Models
MarkTechPost@AI 2025-07-26T21:51:13.000000Z
HIVMedQA: Benchmarking large language models for HIV medical decision support
cs.AI updates on arXiv.org 2025-07-25T04:28:47.000000Z
ExpliCa: Evaluating Explicit Causal Reasoning in Large Language Models
cs.AI updates on arXiv.org 2025-07-25T04:28:39.000000Z
Adaptive Repetition for Mitigating Position Bias in LLM-Based Ranking
cs.AI updates on arXiv.org 2025-07-25T04:28:37.000000Z
Optimising Call Centre Operations using Reinforcement Learning: Value Iteration versus Proximal Policy Optimisation
cs.AI updates on arXiv.org 2025-07-25T04:28:33.000000Z
Language model developers should report train-test overlap
cs.AI updates on arXiv.org 2025-07-24T05:31:31.000000Z
Can External Validation Tools Improve Annotation Quality for LLM-as-a-Judge?
cs.AI updates on arXiv.org 2025-07-24T05:31:09.000000Z