MarkTechPost@AI 2024年10月13日
OpenAI Researchers Introduce MLE-bench: A New Benchmark for Measuring How Well AI Agents Perform at Machine Learning Engineering

OpenAI researchers have released MLE-bench, a new benchmark for evaluating the machine learning engineering capabilities of AI agents. It comprises 75 ML engineering competitions sourced from Kaggle, spanning domains such as natural language processing, computer vision, and signal processing. Each competition tests key ML skills, including model training, data preprocessing, running experiments, and submitting results for evaluation. MLE-bench gives researchers a platform for assessing AI agents' ML engineering abilities in real-world scenarios and offers guidance for developing more capable AI systems.

😊 **Background of MLE-bench:** Existing coding benchmarks mainly evaluate isolated coding skills without holistically measuring an AI agent's ability to carry out complex machine learning tasks such as data preparation, model training, and debugging. To close this gap, OpenAI researchers developed MLE-bench, a comprehensive benchmark that evaluates AI agents on a wide range of ML engineering challenges drawn from real-world scenarios.

😄 **Structure and details of MLE-bench:** MLE-bench contains 75 ML engineering competitions from Kaggle, covering domains such as natural language processing, computer vision, and signal processing. Each competition tests key ML skills such as model training, data preprocessing, running experiments, and submitting results for evaluation, and includes a problem description, a dataset, local evaluation tools, and grading code used to assess the agent's performance. To ensure comparability, each competition's dataset is split into training and test sets, often redesigned to avoid overlap or contamination. Submissions are graded against human attempts using the competition leaderboards, and agents earn medals (bronze, silver, gold) based on their performance relative to human benchmarks.

😉 **Experimental results and performance analysis:** Evaluating different scaffolds and AI models on MLE-bench yields several interesting findings. OpenAI's o1-preview model combined with the AIDE scaffold was the best-performing setup, earning medals in 16.9% of the competitions, with performance improving significantly over multiple attempts. Agents generally did better when they could iterate on their solutions, underlining the value of multiple passes for tackling challenges and refining solutions. Agents also achieved better results when given more resources, such as additional compute time and hardware, highlighting the impact of resource allocation: GPT-4o's performance, for example, rose from 8.7% with 24 hours per competition to 11.8% with 100 hours. In addition, the experiments show that increasing the number of attempts (pass@k) has a marked effect on the success rate, with pass@6 nearly doubling the performance of pass@1.

🥳 **Conclusion and future directions:** MLE-bench is an important step toward evaluating the ML engineering capabilities of AI agents, focusing on holistic, end-to-end performance rather than isolated coding skills. The benchmark provides a robust framework for assessing the facets of ML engineering that matter in real-world applications, including data preprocessing, model training, hyperparameter tuning, and debugging, and aims to drive further research into the potential and limitations of AI agents in autonomously performing practical ML engineering tasks. By open-sourcing MLE-bench, OpenAI hopes to encourage collaboration, enabling researchers and developers to contribute new tasks, improve the existing benchmark, and explore innovative scaffolding techniques. This collaborative effort is expected to accelerate progress in the field and ultimately contribute to the safer, more reliable deployment of advanced AI systems. MLE-bench is also a valuable tool for identifying the key areas where AI agents need further development, offering a clear direction for future work on strengthening AI-driven ML engineering.

Machine Learning (ML) models have shown promising results in various coding tasks, but there remains a gap in effectively benchmarking AI agents’ capabilities in ML engineering. Existing coding benchmarks primarily evaluate isolated coding skills without holistically measuring the ability to perform complex ML tasks, such as data preparation, model training, and debugging.

OpenAI Researchers Introduce MLE-bench

To address this gap, OpenAI researchers have developed MLE-bench, a comprehensive benchmark that evaluates AI agents on a wide array of ML engineering challenges inspired by real-world scenarios. MLE-bench is a novel benchmark aimed at evaluating how well AI agents can perform end-to-end machine learning engineering. It is constructed using a collection of 75 ML engineering competitions sourced from Kaggle. These competitions encompass diverse domains such as natural language processing, computer vision, and signal processing. The competitions are carefully curated to assess key ML skills, including training models, data preprocessing, running experiments, and submitting results for evaluation. To provide an accurate baseline, human performance metrics are gathered from publicly available Kaggle leaderboards, enabling comparisons between the capabilities of AI agents and expert human participants.

Structure and Details of MLE-bench

MLE-bench features several design aspects to assess ML engineering effectively. Each of the 75 Kaggle competition tasks is representative of practical engineering challenges, making the benchmark both rigorous and realistic. Each Kaggle competition in MLE-bench consists of a problem description, dataset, local evaluation tools, and grading code used to assess the agent’s performance. To ensure comparability, each competition’s dataset is split into training and testing sets, often redesigned to avoid any overlap or contamination issues. Submissions are graded against human attempts using competition leaderboards, and agents receive medals (bronze, silver, gold) based on their performance relative to human benchmarks. The grading mechanism relies on standard evaluation metrics, such as the area under the receiver operating characteristic (AUROC), mean squared error, and other domain-specific loss functions, providing a fair comparison to Kaggle participants. AI agents, such as OpenAI’s o1-preview model combined with AIDE scaffolding, have been tested on these tasks, achieving results comparable to a Kaggle bronze medal in 16.9% of competitions. Performance significantly improved with repeated attempts, indicating that while agents can follow well-known approaches, they struggle to recover from initial mistakes or optimize effectively without multiple iterations. This highlights both the potential and the limitations of current AI systems in performing complex ML engineering tasks.
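To make the leaderboard-based grading concrete, here is a minimal sketch of how a submission score might be turned into a medal. The function names and the percentile cutoffs (top 10% gold, 20% silver, 40% bronze) are illustrative assumptions, not the official MLE-bench grading code; Kaggle's real medal thresholds also vary with the number of competing teams.

```python
# Illustrative sketch (not the official MLE-bench grader): rank an agent's
# score against a human leaderboard and award a medal using simplified
# percentile cutoffs.

def leaderboard_rank(human_scores, agent_score, higher_is_better=True):
    """1-based rank of the agent among human scores (ties count in the agent's favour)."""
    if higher_is_better:
        better = sum(s > agent_score for s in human_scores)
    else:
        better = sum(s < agent_score for s in human_scores)
    return better + 1

def assign_medal(rank, num_teams):
    """Map a leaderboard rank to a medal using simplified percentile cutoffs."""
    percentile = rank / num_teams
    if percentile <= 0.10:
        return "gold"
    if percentile <= 0.20:
        return "silver"
    if percentile <= 0.40:
        return "bronze"
    return None

# Example: an agent scoring 0.91 AUROC against a 50-team leaderboard.
humans = [0.95, 0.93, 0.92, 0.90, 0.88] + [0.80] * 45
rank = leaderboard_rank(humans, 0.91)
print(rank, assign_medal(rank, len(humans)))  # rank 4 -> "gold"
```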

Experimental Results and Performance Analysis

The evaluation of different scaffolds and AI models on MLE-bench reveals interesting findings. OpenAI’s o1-preview model with AIDE scaffolding emerged as the best-performing setup, achieving medals in 16.9% of the competitions, and performance significantly improved with multiple attempts. Agents often performed better when they could iterate on their solutions, highlighting the importance of multiple passes in addressing challenges and optimizing solutions. When given additional resources, such as increased compute time and hardware, agents showed better results, emphasizing the impact of resource allocation. For example, the performance of GPT-4o improved from 8.7% when given 24 hours per competition to 11.8% when given 100 hours. Furthermore, the experiments revealed that scaling up the number of attempts (pass@k) had a significant impact on the success rate, with pass@6 achieving nearly double the performance of pass@1. Additionally, experiments on scaling resources and agent scaffolding demonstrate the variability in performance based on resource availability and optimization strategies. Specifically, agents like o1-preview exhibited notable improvements in competitions requiring extensive model training and hyperparameter tuning when given longer runtimes or better hardware configurations. This evaluation provides valuable insights into the strengths and weaknesses of current AI agents, particularly in debugging, handling complex datasets, and effectively utilizing available resources.
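For readers unfamiliar with the pass@k numbers quoted above, the sketch below shows one common way to estimate them: an attempt "passes" if it earns any medal, and pass@k is the chance that at least one of k attempts passes. The unbiased estimator used here follows Chen et al. (2021); whether MLE-bench computes its pass@k in exactly this way is an assumption of this example.

```python
# Empirical pass@k estimate from n repeated runs, c of which earned a medal.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k attempts, drawn without replacement
    from n total attempts with c successes, is a success."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 6 seeded runs on one competition, 2 of which earned a medal.
print(pass_at_k(n=6, c=2, k=1))  # ~0.33
print(pass_at_k(n=6, c=2, k=6))  # 1.0 -> repeated attempts help substantially
```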

Conclusion and Future Directions

MLE-bench represents a significant step forward in evaluating the ML engineering capabilities of AI agents, focusing on holistic, end-to-end performance metrics rather than isolated coding skills. The benchmark provides a robust framework for assessing various facets of ML engineering, including data preprocessing, model training, hyperparameter tuning, and debugging, which are essential for real-world ML applications. It aims to facilitate further research into understanding the potential and limitations of AI agents in performing practical ML engineering tasks autonomously. By open-sourcing MLE-bench, OpenAI hopes to encourage collaboration, allowing researchers and developers to contribute new tasks, improve existing benchmarks, and explore innovative scaffolding techniques. This collaborative effort is expected to accelerate progress in the field, ultimately contributing to safer and more reliable deployment of advanced AI systems. Additionally, MLE-bench serves as a valuable tool for identifying key areas where AI agents require further development, providing a clear direction for future research efforts in enhancing the capabilities of AI-driven ML engineering.

Setup

Some MLE-bench competition data is stored using Git-LFS. Once you have downloaded and installed LFS, run:

git lfs fetch --all
git lfs pull

You can install mlebench with pip:

pip install -e .
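After installation, a typical workflow is to produce a submission CSV for a competition and grade it locally with the provided grading code. The snippet below is a minimal, hypothetical sanity check run before grading; the file paths and column layout are assumptions for illustration and are not part of the mlebench API.

```python
# Hypothetical pre-grading sanity check; paths and column names are assumed
# for illustration and are not part of the mlebench API.
import pandas as pd

sample = pd.read_csv("data/sample_submission.csv")
submission = pd.read_csv("submission/submission.csv")

# Kaggle-style graders expect the same columns and one row per test example.
assert list(submission.columns) == list(sample.columns), "column mismatch"
assert len(submission) == len(sample), "row-count mismatch"
assert not submission.isna().any().any(), "missing predictions"
print("submission looks structurally valid")
```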

Check out the Paper and GitHub. All credit for this research goes to the researchers of this project.


