Trending
Articles related to "benchmark"
OpenAI Open-Sources SimpleQA, a New Benchmark Targeting Large Models' "Nonsense" Output
IT之家 2024-10-30T23:37:55.000000Z
OpenAI Researchers Introduce MLE-bench: A New Benchmark for Measuring How Well AI Agents Perform at Machine Learning Engineering
MarkTechPost@AI 2024-10-12T18:36:04.000000Z
Yandex Introduces TabReD: A New Benchmark for Tabular Machine Learning
MarkTechPost@AI 2024-07-23T16:19:23.000000Z
WTU-Eval: A New Standard Benchmark Tool for Evaluating Large Language Models (LLMs) Usage Capabilities
MarkTechPost@AI 2024-07-23T11:48:51.000000Z
UT Austin Researchers Introduce PUTNAMBENCH: A Comprehensive AI Benchmark for Evaluating the Capabilities of Neural Theorem-Provers with Putnam Mathematical Problems
MarkTechPost@AI 2024-07-20T11:48:45.000000Z
Planetarium: A New Benchmark to Evaluate LLMs on Translating Natural Language Descriptions of Planning Problems into Planning Domain Definition Language (PDDL)
MarkTechPost@AI 2024-07-15T21:01:19.000000Z
Trafigura Fined $55 Million in the US over Alleged Market Abuse
界面快报 2024-06-18T08:00:59.000000Z
CinePile: A Novel Dataset and Benchmark Specifically Designed for Authentic Long-Form Video Understanding
MarkTechPost@AI 2024-05-20T01:00:57.000000Z