LLM Evaluation_Fishai

Articles related to "LLM Evaluation"

Persona-Augmented Benchmarking: Evaluating LLMs Across Diverse Writing Styles

cs.AI updates on arXiv.org 2025-07-31T04:48:00.000000Z

LLM-Crowdsourced: A Benchmark-Free Paradigm for Mutual Evaluation of Large Language Models

cs.AI updates on arXiv.org 2025-07-31T04:47:53.000000Z

Building Black-box Scheming Monitors

LessWrong 2025-07-29T17:53:38.000000Z

SpeechIQ: Speech Intelligence Quotient Across Cognitive Levels in Voice Understanding Large Language Models

cs.AI updates on arXiv.org 2025-07-28T04:42:59.000000Z

SCOPE: Stochastic and Counterbiased Option Placement for Evaluating Large Language Models

cs.AI updates on arXiv.org 2025-07-25T04:28:48.000000Z

Benchmarking Amazon Nova: A comprehensive analysis through MT-Bench and Arena-Hard-Auto

AWS Machine Learning Blog 2025-07-24T18:40:33.000000Z

ICML 2025 | Can Large Models Ask the Right Questions When Information Is Incomplete?

Synced (机器之心) 2025-07-24T09:36:47.000000Z

AI Guardrails and Trustworthy LLM Evaluation: Building Responsible AI Systems

MarkTechPost@AI 2025-07-23T09:15:50.000000Z

Detecting Benchmark Contamination Through Watermarking

cs.AI updates on arXiv.org 2025-07-22T04:44:47.000000Z

BEnchmarking LLMs for Ophthalmology (BELO) for Ophthalmological Knowledge and Reasoning

cs.AI updates on arXiv.org 2025-07-22T04:34:30.000000Z

Evaluating generative AI models with Amazon Nova LLM-as-a-Judge on Amazon SageMaker AI

AWS Machine Learning Blog 2025-07-17T22:16:00.000000Z

CompassJudger-2: Towards Generalist Judge Model via Verifiable Rewards

cs.AI updates on arXiv.org 2025-07-15T04:24:34.000000Z

Are LLMs Prescient? A Continuous Evaluation using Daily News as the Oracle

cs.AI updates on arXiv.org 2025-07-09T04:02:05.000000Z

Goodbye to the Leaderboard Rat Race! Tsinghua × Baidu Propose Feedbacker, Ushering in a New Era of Deep-Insight LLM Evaluation

PaperWeekly 2025-05-26T06:17:31.000000Z

It's really hard to make scheming evals look realistic

LessWrong 2025-05-24T19:27:31.000000Z

Let the LLM Be the Judge | Tips and Tricks

Hugging Face 2025-05-13T16:51:55.000000Z

Copilot Arena: A platform for code

AIhub 2025-04-28T08:40:05.000000Z

Atla AI Introduces the Atla MCP Server: A Local Interface of Purpose-Built LLM Judges via Model Context Protocol (MCP)

MarkTechPost@AI 2025-04-22T15:20:41.000000Z

A Hands-On Tutorial: Build a Modular LLM Evaluation Pipeline with Google Generative AI and LangChain

MarkTechPost@AI 2025-04-18T05:10:41.000000Z
