MarkTechPost@AI April 3, 05:10
OpenAI Releases PaperBench: A Challenging Benchmark for Assessing AI Agents’ Abilities to Replicate Cutting-Edge Machine Learning Research

OpenAI has released PaperBench, a new benchmark designed to evaluate AI agents' ability to autonomously replicate cutting-edge machine learning research. PaperBench requires AI systems to accurately interpret research papers, independently develop codebases, and run experiments to reproduce empirical results. The benchmark includes 20 papers from ICML 2024, spanning areas such as reinforcement learning, robustness, and probabilistic methods. Using detailed rubrics co-developed with the papers' authors, PaperBench provides a precise assessment of AI capabilities. Experiments show wide variation in performance across AI models, with Claude 3.5 Sonnet performing best, while human researchers remain substantially more accurate than AI models.

🧠 PaperBench is a benchmark for evaluating AI agents' ability to autonomously replicate machine learning research, measuring how well AI systems can interpret research papers, develop codebases, and run experiments.

🔬 The benchmark comprises 20 papers from ICML 2024, covering areas including reinforcement learning, robustness, and probabilistic methods, with detailed rubrics that enable precise evaluation of AI capabilities.

🛠️ PaperBench requires AI agents to build codebases from scratch, including complete experimental setups and execution scripts; use of the original authors' code is prohibited, and grading is automated with SimpleJudge to streamline the process.

📊 Experimental results show large performance differences across AI models, with Claude 3.5 Sonnet performing best; human researchers remain markedly more accurate, pointing to AI limitations in sustained task execution, adaptive problem-solving, and strategic planning.

💡 PaperBench Code-Dev is a lightweight variant of PaperBench that focuses on code correctness, reducing compute and evaluation costs, broadening community access, and advancing understanding of AI research capabilities.

The rapid progress in artificial intelligence (AI) and machine learning (ML) research underscores the importance of accurately evaluating AI agents’ capabilities in replicating complex, empirical research tasks traditionally performed by human researchers. Currently, systematic evaluation tools that precisely measure the ability of AI agents to autonomously reproduce ML research findings remain limited, posing challenges in fully understanding the potential and limitations of such systems.

OpenAI has introduced PaperBench, a benchmark designed to evaluate the competence of AI agents in autonomously replicating state-of-the-art machine learning research. PaperBench specifically measures whether AI systems can accurately interpret research papers, independently develop the necessary codebases, and execute experiments to replicate empirical outcomes. The benchmark comprises 20 papers selected from ICML 2024, covering areas including reinforcement learning, robustness, and probabilistic methods. Detailed rubrics, co-developed with original paper authors, specify 8,316 individually gradable tasks to facilitate precise evaluation of AI capabilities.
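
To make the rubric structure concrete, the sketch below shows one way such a hierarchical, weighted rubric could be represented and scored in Python. The class and field names are illustrative assumptions rather than the schema of the released rubrics, and the example tree fragment is invented for demonstration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class RubricNode:
    """One node in a hierarchical replication rubric (illustrative sketch).

    Leaf nodes carry a concrete pass/fail requirement; internal nodes
    aggregate the weighted scores of their children.
    """
    requirement: str                      # plain-language criterion to grade
    weight: float = 1.0                   # relative importance among siblings
    children: List["RubricNode"] = field(default_factory=list)
    passed: Optional[bool] = None         # set by a judge on leaf nodes

    def score(self) -> float:
        """Weighted fraction of satisfied requirements in this subtree."""
        if not self.children:                        # leaf: 1.0 if passed, else 0.0
            return 1.0 if self.passed else 0.0
        total = sum(c.weight for c in self.children)
        return sum(c.weight * c.score() for c in self.children) / total


# Invented fragment of a rubric tree for a single paper.
rubric = RubricNode("Replicate the paper's main contributions", children=[
    RubricNode("Reimplement the proposed training algorithm", weight=2.0, children=[
        RubricNode("Loss function matches the paper's stated objective"),
        RubricNode("Training script runs end to end without errors"),
    ]),
    RubricNode("Reproduce the headline result within reported tolerance"),
])
```

In the benchmark itself, leaf requirements are graded pass/fail and rolled up through the tree to produce a single replication score per paper.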

From a technical perspective, PaperBench requires AI agents to process provided research papers and supplementary clarifications to develop comprehensive code repositories from scratch. These repositories must include complete experimental setups and execution scripts, notably the reproduce.sh file. To ensure genuine independent replication, agents are prohibited from referencing or reusing code from the original authors’ repositories. Rubrics are structured hierarchically to detail explicit pass-fail criteria at various levels, allowing systematic and objective assessment. Evaluation is conducted using SimpleJudge, an automated large language model (LLM)-based judge, which simplifies the grading process. SimpleJudge achieved an F1 score of 0.83 on JudgeEval, an auxiliary evaluation dataset specifically designed to validate automated grading accuracy.
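
As an illustration of the general pattern rather than the released SimpleJudge implementation, the sketch below grades a single rubric leaf by showing a grader model the requirement together with excerpts of the submitted files and asking for a binary verdict. Here `call_llm` is a hypothetical placeholder for whatever model client is used; the real prompts, file selection, and output handling differ.

```python
import json
from typing import Dict


def call_llm(prompt: str) -> str:
    """Hypothetical placeholder for a call to a grader LLM; returns its reply text."""
    raise NotImplementedError("plug in a real model client here")


def grade_leaf(requirement: str, paper_summary: str,
               submission_files: Dict[str, str]) -> bool:
    """Ask the grader model whether one pass/fail requirement is satisfied."""
    files_blob = "\n\n".join(
        f"### {path}\n{content[:4000]}"               # truncate large files
        for path, content in submission_files.items()
    )
    prompt = (
        "You are grading one requirement of a research-replication rubric.\n\n"
        f"Paper summary:\n{paper_summary}\n\n"
        f"Requirement:\n{requirement}\n\n"
        f"Submission files:\n{files_blob}\n\n"
        'Answer with JSON only: {"passed": true or false, "reason": "..."}'
    )
    verdict = json.loads(call_llm(prompt))
    return bool(verdict["passed"])
```

Verdicts from graded leaves would then be aggregated up the rubric hierarchy, as in the earlier sketch, to yield a paper-level replication score.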

Empirical evaluations of several advanced AI models indicate varying performance levels on PaperBench. Claude 3.5 Sonnet exhibited the highest capability with an average replication score of 21.0%. Other models such as OpenAI’s GPT-4o and Gemini 2.0 Flash attained significantly lower scores of 4.1% and 3.2%, respectively. Comparatively, expert human ML researchers achieved considerably higher accuracy, reaching up to 41.4% after 48 hours of dedicated effort. Analysis of model performance revealed strengths in initial rapid code generation and early experimental setup but highlighted substantial weaknesses in managing prolonged tasks, troubleshooting, and adapting strategic approaches over time.

These results provide critical technical insights into current AI system capabilities. While AI models demonstrate competence in certain coding tasks and initial experiment implementation, significant gaps persist, particularly regarding sustained task execution, adaptive problem-solving, and strategic planning. Additionally, the introduction of PaperBench Code-Dev, a streamlined variant emphasizing code correctness without experimental execution, offers a practical alternative for broader and resource-limited community use due to reduced computational and evaluation costs.

In summary, PaperBench represents an important step toward methodically evaluating AI research capabilities. It provides a structured and detailed assessment environment that highlights specific strengths and limitations of contemporary AI models relative to human performance. The collaborative development of rubrics ensures precise and realistic evaluations. OpenAI’s open-sourcing of PaperBench supports further exploration and development in the field, enhancing understanding of autonomous AI research capabilities and informing responsible progression in this area.


Check out the Paper and GitHub page. All credit for this research goes to the researchers of this project.

Tags: PaperBench, Artificial Intelligence, Machine Learning, AI Evaluation