Related articles for "Evaluation Framework"
Salesforce AI Researchers Introduce UAEval4RAG: A New Benchmark to Evaluate RAG Systems’ Ability to Reject Unanswerable Queries
MarkTechPost@AI
2025-05-20T06:15:47.000000Z
Is Recursive Viability a Missing Piece in How We Evaluate LLM Agents?
LessWrong
2025-04-26T10:57:28.000000Z
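In Depth | Tsinghua Yao Class standout and OpenAI's Shunyu Yao: AI's second half shifts from "algorithm competition" to "defining utility", rebuilding evaluation frameworks to turn technical capability into real-world value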
Z Potentials
2025-04-25T06:36:28.000000Z
Accuracy evaluation framework for Amazon Q Business – Part 2
AWS Machine Learning Blog
2025-04-22T17:25:56.000000Z
This AI Paper Introduces Inference-Time Scaling Techniques: Microsoft’s Deep Evaluation of Reasoning Models on Complex Tasks
MarkTechPost@AI
2025-04-08T04:25:59.000000Z
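Cracking the century-old challenge of differential equations: how can machine learning break through the bottlenecks of traditional numerical methods?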
集智俱乐部
2025-03-21T13:34:31.000000Z
#340 - AMA #69: Scrutinizing supplements: creatine, fish oil, vitamin D, and more—a framework for understanding effectiveness, quality, and individual need
The Peter Attia Drive
2025-03-20T05:14:33.000000Z
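CVPR 2025: Long-prompt alignment can now be evaluated too! The largest AIGC evaluation dataset to date, with model scoring that surpasses the current SOTA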
硅星人Pro
2025-03-15T04:14:28.000000Z
How AI Chatbots Mimic Human Behavior: Insights from Multi-Turn Evaluations of LLMs
MarkTechPost@AI
2025-02-16T06:20:08.000000Z
From concept to reality: Navigating the Journey of RAG from proof of concept to production
AWS Machine Learning Blog
2025-02-12T17:32:16.000000Z
The weakness of DeepSeek/o3 has been found! They are indecisive: they get the answer right, then change it to a wrong one
快科技资讯
2025-02-04T11:31:20.000000Z
SoftBank-backed billionaire to invest $230M in Indian AI startup Krutrim
TechCrunch News
2025-02-04T09:21:38.000000Z
Top AI agents can't socialize and fall far short of humans at running a startup! CMU et al.: they complete at most 24% of tasks
新智元
2025-01-28T16:16:40.000000Z
Quantifying Knowledge Transfer: Evaluating Distillation in Large Language Models
MarkTechPost@AI
2025-01-28T06:19:59.000000Z
Plurai Introduces IntellAgent: An Open-Source Multi-Agent Framework to Evaluate Complex Conversational AI System
MarkTechPost@AI
2025-01-23T18:35:01.000000Z
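Latest research from Huawei, Harbin Institute of Technology (Shenzhen), and others: SPA-Bench, a new evaluation standard for smartphone-control agents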
AI科技评论
2025-01-06T07:09:28.000000Z
Evaluation Agent: A Multi-Agent AI Framework for Efficient, Dynamic, Multi-Round Evaluation, While Offering Detailed, User-Tailored Analyses
MarkTechPost@AI
2024-12-23T19:20:35.000000Z
DeBaTeR: A New AI Method that Leverages Time Information in Neural Graph Collaborative Filtering to Enhance both Denoising and Prediction Performance
MarkTechPost@AI
2024-11-18T12:05:05.000000Z
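New work from the Fei-Fei Li and Jiajun Wu team: a benchmark for evaluating embodied AI decision-making, with o1-preview taking the top spot | NeurIPS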
智源社区
2024-11-15T09:52:21.000000Z
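New work from the Fei-Fei Li and Jiajun Wu team: a benchmark for evaluating embodied AI decision-making, with o1-preview taking the top spot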
36氪 - 科技频道
2024-11-14T09:58:17.000000Z