MarkTechPost@AI September 22, 2024
CORE-Bench: A Benchmark Consisting of 270 Tasks based on 90 Scientific Papers Across Computer Science, Social Science, and Medicine with Python or R Codebases

CORE-Bench is a comprehensive benchmark developed by researchers at Princeton University to evaluate the ability of AI agents to automate the computational reproducibility of scientific research. The benchmark comprises 270 tasks drawn from 90 papers in computer science, social science, and medicine, covering skills such as coding, shell interaction, retrieval, and tool use. CORE-Bench offers three difficulty levels that simulate the situations researchers may encounter in the real world. The researchers also evaluated two baseline agents, AutoGPT and CORE-Agent, and the results show that adapting a generalist agent to a specific task can significantly improve performance.

😊 **CORE-Bench Overview:** CORE-Bench is a comprehensive benchmark comprising 270 tasks drawn from 90 papers in computer science, social science, and medicine. It is designed to evaluate the ability of AI agents to automate the reproducibility of scientific research, covering skills such as coding, shell interaction, retrieval, and tool use.

🧐 **Three Difficulty Levels:** CORE-Bench offers three difficulty levels, CORE-Bench-Easy, CORE-Bench-Medium, and CORE-Bench-Hard, that simulate situations researchers may encounter in the real world. Difficulty increases from level to level, with each level providing the agent with a different amount of information.

🚀 **Evaluation Results:** The researchers evaluated two baseline agents, AutoGPT and CORE-Agent, and the results show that adapting a generalist agent to a specific task can significantly improve performance. CORE-Bench also ships an evaluation harness for efficient and reproducible agent evaluation, dramatically reducing evaluation time and ensuring standardized access to hardware.

📊 **Task Types:** CORE-Bench tasks cover both text and image outputs, reflecting the diverse nature of scientific results. Vision-based tasks require agents to extract information from figures, graphs, plots, and PDF tables. Text-based tasks require agents to extract results from command-line text, PDF content, and other formats such as HTML, Markdown, or LaTeX.

💡 **Significance:** CORE-Bench is designed to evaluate AI agents' abilities to understand instructions, debug code, retrieve information, and interpret results, skills that closely mirror those needed to reproduce new research in the real world. The benchmark provides a reliable tool for assessing AI's ability to automate the reproducibility of scientific research.

Computational reproducibility poses a significant challenge in scientific research across various fields, including psychology, economics, medicine, and computer science. Despite the fundamental importance of reproducing results using provided data and code, recent studies have exposed severe shortcomings in this area. Researchers face numerous obstacles when replicating studies, even when code and data are available. These challenges include unspecified software library versions, differences in machine architectures and operating systems, compatibility issues between old libraries and new hardware, and inherent result variances. The problem is so pervasive that papers are often found to be irreproducible despite the availability of reproduction materials. This lack of reproducibility undermines the credibility of scientific research and hinders progress in computationally intensive fields.

Recent advancements in AI have led to ambitious claims about agents’ ability to conduct autonomous research. However, reproducing existing research is a crucial prerequisite to conducting novel studies, especially when new research requires replicating earlier baselines for comparison. Several benchmarks have been introduced to evaluate language models and agents on tasks related to computer programming and scientific research. These include assessments for conducting machine learning experiments, research programming, scientific discovery, scientific reasoning, citation tasks, and solving real-world programming problems. Despite these advances, the critical aspect of automating research reproduction has received little attention. 

Researchers from Princeton University have addressed the challenge of automating computational reproducibility in scientific research using AI agents. They introduce CORE-Bench, a comprehensive benchmark comprising 270 tasks from 90 papers across computer science, social science, and medicine. CORE-Bench evaluates diverse skills, including coding, shell interaction, retrieval, and tool use, with tasks in both Python and R. The benchmark offers three difficulty levels based on the amount of reproduction information available, simulating real-world scenarios researchers might encounter. The study also presents evaluation results for two baseline agents: AutoGPT, a generalist agent, and CORE-Agent, a task-specific version built on AutoGPT. These evaluations demonstrate the potential of adapting generalist agents to specific tasks, yielding significant performance improvements. The researchers also provide an evaluation harness designed for efficient and reproducible testing of agents on CORE-Bench, dramatically reducing evaluation time and ensuring standardized access to hardware.

CORE-Bench addresses the challenge of building a reproducibility benchmark by utilizing CodeOcean capsules, which are known to be easily reproducible. The researchers selected 90 reproducible papers from CodeOcean, splitting them into 45 for training and 45 for testing. For each paper, they manually created task questions about the outputs generated from successful reproduction, assessing an agent’s ability to execute code and retrieve results correctly.
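
As a purely illustrative sketch of what such a task question involves (the field names below are assumptions for exposition, not the benchmark's actual schema), each question ties a paper's codebase to a result the agent must report after running it:

```python
# Hypothetical sketch of a CORE-Bench-style task question record.
# Field names are illustrative assumptions, not the benchmark's actual schema.
from dataclasses import dataclass

@dataclass
class TaskQuestion:
    paper_id: str           # identifier of the reproducible capsule / paper
    language: str           # "python" or "r"
    question: str           # what the agent must report after reproducing the code
    answer_type: str        # e.g. "numeric" or "string"
    tolerance: float = 0.0  # allowed numeric deviation when grading (assumed)

example = TaskQuestion(
    paper_id="capsule-001",
    language="python",
    question="What is the test accuracy reported after the final epoch?",
    answer_type="numeric",
    tolerance=0.01,
)
```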

The benchmark introduces a ladder of difficulty with three distinct levels for each paper, resulting in 270 tasks and 181 task questions. These levels differ in the amount of information provided to the agent:

1. CORE-Bench-Easy: Agents are given complete code output from a successful run, testing information extraction skills.

2. CORE-Bench-Medium: Agents receive the Dockerfile and README instructions, requiring them to run Docker commands and extract information.

3. CORE-Bench-Hard: Agents only receive README instructions, necessitating library installation, dependency management, code reproduction, and information extraction.

This tiered approach allows for evaluating specific agent abilities and understanding their capabilities even when they cannot complete the most difficult tasks. The benchmark ensures task validity by including at least one question per task that cannot be solved by guessing, marking a task as correct only when all questions are answered correctly.
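
A minimal sketch of this all-or-nothing scoring rule follows; the function and variable names are assumptions for illustration and do not come from the benchmark's actual harness:

```python
def task_is_correct(submitted: dict[str, str], expected: dict[str, str]) -> bool:
    """A task counts as solved only if every one of its questions is answered correctly."""
    # Hypothetical grading sketch: exact match per question, all questions must pass.
    # At least one question per task is designed to be unguessable, so chance success is unlikely.
    return all(
        submitted.get(qid, "").strip() == answer.strip()
        for qid, answer in expected.items()
    )

# Example: a single wrong answer fails the whole task.
expected = {"q1": "0.87", "q2": "ResNet-50"}
print(task_is_correct({"q1": "0.87", "q2": "ResNet-50"}, expected))  # True
print(task_is_correct({"q1": "0.87", "q2": "VGG-16"}, expected))     # False
```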

CORE-Bench is designed to evaluate a wide range of skills crucial for reproducing scientific research. The benchmark tasks require agents to demonstrate proficiency in understanding instructions, debugging code, information retrieval, and interpreting results across various disciplines. These skills closely mirror those needed to reproduce new research in real-world scenarios.

The tasks in CORE-Bench encompass both text and image-based outputs, reflecting the diverse nature of scientific results. Vision-based questions challenge agents to extract information from figures, graphs, plots, and PDF tables. For instance, an agent might need to report the correlation between specific variables in a plotted graph. Text-based questions, on the other hand, require agents to extract results from command line text, PDF content, and various formats such as HTML, markdown, or LaTeX. An example of a text-based question could be reporting the test accuracy of a neural network after a specific epoch.
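
For the text-based case, the agent ultimately has to pull a specific value out of raw command-line output. The following is a hedged illustration of that step; the log format, helper name, and regular expression are assumptions, not taken from any actual CORE-Bench task:

```python
import re

# Hypothetical training log as it might appear on stdout after a reproduction run.
stdout = """
Epoch 9/10 - loss: 0.412 - test_acc: 0.861
Epoch 10/10 - loss: 0.398 - test_acc: 0.874
"""

def test_accuracy_after_epoch(log: str, epoch: int) -> float:
    """Extract the test accuracy reported for a given epoch from command-line text."""
    pattern = rf"Epoch {epoch}/\d+ .*?test_acc:\s*([0-9.]+)"
    match = re.search(pattern, log)
    if match is None:
        raise ValueError(f"No test accuracy found for epoch {epoch}")
    return float(match.group(1))

print(test_accuracy_after_epoch(stdout, 10))  # 0.874
```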

This multifaceted approach ensures that CORE-Bench comprehensively assesses an agent’s ability to handle the complex and varied outputs typical of scientific research. By incorporating both vision- and text-based tasks, the benchmark provides a robust evaluation of an agent’s capacity to reproduce and interpret diverse scientific findings.

The evaluation results demonstrate the effectiveness of task-specific adaptations to generalist AI agents for computational reproducibility tasks. CORE-Agent, powered by GPT-4o, emerged as the top-performing agent across all three difficulty levels of the CORE-Bench benchmark. On CORE-Bench-Easy, it successfully solved 60.00% of tasks, while on CORE-Bench-Medium, it achieved a 57.78% success rate. However, performance dropped significantly to 21.48% on CORE-Bench-Hard, indicating the increasing complexity of tasks at this level.

This study introduces CORE-Bench to address the critical need for automating computational reproducibility in scientific research. While ambitious claims about AI agents revolutionizing research abound, the ability to reproduce existing studies remains a fundamental prerequisite. The benchmark’s baseline results reveal that task-specific modifications to general-purpose agents can significantly enhance accuracy in reproducing scientific work. However, with the best baseline agent achieving only 21% test-set accuracy, substantial room for improvement exists. CORE-Bench aims to catalyze research in enhancing agents’ capabilities for automating computational reproducibility, potentially reducing the human labor required for this essential yet time-consuming scientific activity. This benchmark represents a crucial step towards more efficient and reliable scientific research processes.


Check out the Paper. All credit for this research goes to the researchers of this project.

