Trending
Articles related to "AI capability evaluation"
Interpreting the METR Time Horizons Post
LessWrong 2025-04-30T03:12:28.000000Z
Recent AI model progress feels mostly like bullshit
LessWrong 2025-03-24T19:32:10.000000Z
The Elicitation Game: Evaluating capability elicitation techniques
LessWrong 2025-02-27T20:36:59.000000Z
These researchers used NPR Sunday Puzzle questions to benchmark AI ‘reasoning’ models
TechCrunch News 2025-02-06T06:12:36.000000Z
Understanding Benchmarks and motivating Evaluations
LessWrong 2025-02-06T01:51:47.000000Z
"Humanity's Last Exam" benchmark released: top AI systems perform poorly, with none exceeding 10% answer accuracy
ITHome 2025-01-24T08:37:28.000000Z