Trending
Articles related to "AI model evaluation"
Music Arena: Live Evaluation for Text-to-Music
cs.AI updates on arXiv.org 2025-07-29T04:22:00.000000Z
Why Context Matters: Transforming AI Model Evaluation with Contextualized Queries
MarkTechPost@AI 2025-07-27T05:26:36.000000Z
Tencent improves testing creative AI models with new benchmark
AI News 2025-07-09T14:19:51.000000Z
Yupp, a human-based AI evaluation system, launches with support for testing over 500 large language models
AI & Big Data 2025-06-23T04:50:36.000000Z
How do two large-model practitioners in our group chat rate Xiaomi's MiMo model?
理想 TOP2 2025-05-08T07:51:29.000000Z
Crowdsourced AI benchmarks have serious flaws, some experts say
TechCrunch News 2025-04-22T12:36:37.000000Z
OpenAI acquires the Context.ai team, further upgrading its AI evaluation capabilities
IT之家 2025-04-15T23:28:40.000000Z
OpenAI hires team behind GV-backed AI eval platform Context.ai
TechCrunch News 2025-04-15T18:21:22.000000Z
OpenAI publicly accuses Grok3 of cheating: answering each question 64 times to gain a leg up in the comparison with o3-mini
量子位 2025-02-24T01:13:50.000000Z
Unbelievable! AI guru reviews Grok 3: performance rivals OpenAI's strongest model and slightly beats DeepSeek-R1
华尔街见闻 - 资讯 2025-02-18T06:38:39.000000Z
Singapore University of Technology and Design (SUTD) Explores Advancements and Challenges in Multimodal Reasoning for AI Models Through Puzzle-Based Evaluations and Algorithmic Problem-Solving Analysis
MarkTechPost@AI 2025-02-08T04:20:02.000000Z
How we evaluate AI models and LLMs for GitHub Copilot
The GitHub Blog 2025-01-17T18:00:52.000000Z
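Leaving "hallucinations" nowhere to hide! Google DeepMind's new benchmark, with three generations of Gemini topping the leaderboard together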
新智元 2025-01-13T16:54:44.000000Z
Google reportedly uses Anthropic's Claude to test its Gemini models
AI & Big Data 2024-12-25T05:02:35.000000Z
Google is using Anthropic’s Claude to improve its Gemini AI
TechCrunch News 2024-12-24T16:22:11.000000Z
A safe harbor for AI evaluation and red teaming
AI Snake Oil 2024-12-13T05:08:43.000000Z
Integrating 500+ real-world multimodal tasks! The new MEGA-Bench evaluation suite: is CoT actually harmful to open-source models?
新智元 2024-11-16T14:16:08.000000Z
Integrating 500+ real-world multimodal tasks, the new MEGA-Bench evaluation suite: is CoT actually harmful to open-source models?
36kr-科技 2024-11-15T07:36:44.000000Z
Scoring AI models: Endor Labs unveils evaluation tool
AI News 2024-10-16T13:19:32.000000Z
OpenAI launches the SWE-bench Verified benchmark to more accurately evaluate AI models' code generation performance
IT之家 2024-08-15T06:52:31.000000Z