Articles related to "LLM evaluation"
Persona-Augmented Benchmarking: Evaluating LLMs Across Diverse Writing Styles
cs.AI updates on arXiv.org
2025-07-31T04:48:00.000000Z
LLM-Crowdsourced: A Benchmark-Free Paradigm for Mutual Evaluation of Large Language Models
cs.AI updates on arXiv.org
2025-07-31T04:47:53.000000Z
Building Black-box Scheming Monitors
LessWrong
2025-07-29T17:53:38.000000Z
SpeechIQ: Speech Intelligence Quotient Across Cognitive Levels in Voice Understanding Large Language Models
cs.AI updates on arXiv.org
2025-07-28T04:42:59.000000Z
SCOPE: Stochastic and Counterbiased Option Placement for Evaluating Large Language Models
cs.AI updates on arXiv.org
2025-07-25T04:28:48.000000Z
Benchmarking Amazon Nova: A comprehensive analysis through MT-Bench and Arena-Hard-Auto
AWS Machine Learning Blog
2025-07-24T18:40:33.000000Z
ICML 2025 | Can large models ask the right questions when information is incomplete?
机器之心
2025-07-24T09:36:47.000000Z
ICML 2025 | Can large models ask the right questions when information is incomplete?
机器之心
2025-07-24T09:01:18.000000Z
AI Guardrails and Trustworthy LLM Evaluation: Building Responsible AI Systems
MarkTechPost@AI
2025-07-23T09:15:50.000000Z
Detecting Benchmark Contamination Through Watermarking
cs.AI updates on arXiv.org
2025-07-22T04:44:47.000000Z
BEnchmarking LLMs for Ophthalmology (BELO) for Ophthalmological Knowledge and Reasoning
cs.AI updates on arXiv.org
2025-07-22T04:34:30.000000Z
Evaluating generative AI models with Amazon Nova LLM-as-a-Judge on Amazon SageMaker AI
AWS Machine Learning Blog
2025-07-17T22:16:00.000000Z
CompassJudger-2: Towards Generalist Judge Model via Verifiable Rewards
cs.AI updates on arXiv.org
2025-07-15T04:24:34.000000Z
Are LLMs Prescient? A Continuous Evaluation using Daily News as the Oracle
cs.AI updates on arXiv.org
2025-07-09T04:02:05.000000Z
Farewell to the leaderboard-chasing rat race! Tsinghua × Baidu propose Feedbacker, opening a new era of LLM evaluation with deeper insight
PaperWeekly
2025-05-26T06:17:31.000000Z
It's really hard to make scheming evals look realistic
LessWrong
2025-05-24T19:27:31.000000Z
Letting an LLM be the judge | Tips and tricks
Hugging Face
2025-05-13T16:51:55.000000Z
Copilot Arena: A platform for code
AIhub
2025-04-28T08:40:05.000000Z
Atla AI Introduces the Atla MCP Server: A Local Interface of Purpose-Built LLM Judges via Model Context Protocol (MCP)
MarkTechPost@AI
2025-04-22T15:20:41.000000Z
A Hands-On Tutorial: Build a Modular LLM Evaluation Pipeline with Google Generative AI and LangChain
MarkTechPost@AI
2025-04-18T05:10:41.000000Z