MarkTechPost@AI · May 15, 15:25
Georgia Tech and Stanford Researchers Introduce MLE-Dojo: A Gym-Style Framework Designed for Training, Evaluating, and Benchmarking Autonomous Machine Learning Engineering (MLE) Agents

Researchers from Georgia Tech and Stanford University have introduced MLE-Dojo, a framework designed for training, evaluating, and benchmarking autonomous machine learning engineering (MLE) agents. By simulating real-world MLE scenarios, the framework connects LLM agents with machine learning tasks drawn from more than 200 Kaggle competitions. MLE-Dojo supports challenges spanning tabular data analysis, computer vision, natural language processing, and time-series forecasting, and lets agents write, execute, and revise code in a sandboxed environment, enabling structured learning. By providing real-time feedback and supporting iterative experimentation, the interactive environment addresses the limitations of traditional evaluation tools and lays the groundwork for developing versatile, high-performing MLE agents.

🧑‍💻 MLE-Dojo tackles the inherently iterative, feedback-driven nature of machine learning engineering: tasks such as hyperparameter tuning, model debugging, and data preprocessing require repeated modification and evaluation, which traditional evaluation tools cannot support with real-time feedback.

🛠️ MLE-Dojo is built from modular components that support a wide range of MLE challenges; each task runs in its own Docker container for safety and reproducibility. Agents interact with the environment as a Partially Observable Markov Decision Process, receiving observations, performing actions, and earning rewards based on performance.

📊 The environment supports five primary action types: requesting task information, validating code, executing code, retrieving interaction history, and resetting the environment. It also provides a detailed observation space covering datasets, execution results, and error messages, with structured feedback after every interaction to drive step-wise improvement.

🏆 In an evaluation of eight frontier LLMs across four core machine learning domains, Gemini-2.5-Pro achieved the highest Elo rating and HumanRank, outperforming the human benchmark. Models differed in code-execution frequency and failure rates, and computer vision proved the most challenging domain.

Machine learning engineering (MLE) involves developing, tuning, and deploying machine learning systems that require iterative experimentation, model optimization, and robust handling of data pipelines. As model complexity increases, so do the challenges associated with orchestrating end-to-end workflows efficiently. Researchers have explored the automation of MLE tasks using AI agents to handle these demands. Large Language Models (LLMs), particularly those with strong coding and problem-solving abilities, have shown potential to enhance this process significantly. Their role in automating structured workflows is now being tested through rigorous benchmarks and environments tailored to emulate real-world MLE scenarios.

A primary hurdle in automating machine learning engineering lies in the work’s inherently iterative and feedback-driven nature. Tasks such as hyperparameter tuning, model debugging, and data preprocessing cannot be resolved in one step; they require repeated modifications and evaluations. Traditional evaluation tools for AI models often rely on static datasets and do not allow for real-time error feedback or interactive problem-solving. This limitation prevents LLM agents from learning through trial and error, an essential component for mastering engineering tasks that evolve or require multiple attempts for success.

Earlier tools for evaluating LLMs on engineering or coding tasks have mostly focused on individual subtasks or isolated challenges. These include tools like MLAgentBench and DSBench, which rely on narrow test cases sourced from Kaggle competitions or synthetic datasets. While these benchmarks go beyond basic tasks, they do not let agents execute code, debug, or interpret results in a live setting. Other environments, like SWE-Gym, focus exclusively on software engineering and lack support for machine learning-specific workflows. These limitations have slowed the creation of versatile, high-performing MLE agents that can handle real-time project complexities.

Researchers from Georgia Institute of Technology and Stanford University have introduced MLE-Dojo, a framework with an interactive environment that connects LLM agents with real-world machine learning tasks derived from over 200 Kaggle competitions. The framework supports tabular data analysis, computer vision, natural language processing, and time-series forecasting challenges. MLE-Dojo allows agents to write, execute, and revise code in a sandboxed, feedback-rich setting, replicating the interactive cycles that human engineers follow and enabling structured learning for agents. The environment ships with pre-installed dependencies and evaluation metrics, and supports both supervised fine-tuning and reinforcement learning strategies.
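To make the sandboxing idea concrete, here is a minimal sketch that runs a candidate training script inside an isolated Docker container via the Docker CLI. The image name, mount point, and resource limits are illustrative assumptions for this sketch, not MLE-Dojo's actual configuration.

```python
import subprocess
from pathlib import Path

def run_in_sandbox(workdir: Path, script: str = "solution.py", timeout_s: int = 900) -> str:
    """Execute an agent-written script in an isolated container and return its output.

    Hypothetical setup: the image name, memory limit, and mount point below are
    placeholders, not MLE-Dojo's real configuration.
    """
    cmd = [
        "docker", "run", "--rm",
        "--network", "none",           # no network access from inside the sandbox
        "--memory", "8g",              # cap memory so a runaway job cannot exhaust the host
        "-v", f"{workdir.resolve()}:/workspace",
        "mle-dojo-task:latest",        # hypothetical per-task image with dependencies pre-installed
        "python", f"/workspace/{script}",
    ]
    result = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout_s)
    # Both stdout and stderr go back to the agent as feedback for its next revision.
    return result.stdout + result.stderr
```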

MLE-Dojo’s structure consists of modular components that support a wide range of MLE challenges. Each task runs within its own Docker container, isolating it for safety and reproducibility. Agents interact with the environment through a Partially Observable Markov Decision Process, receiving observations, performing actions, and gaining rewards based on performance. The environment supports five primary action types: requesting task information, validating code, executing code, retrieving interaction history, and resetting the environment. It also provides a detailed observation space that includes datasets, execution results, and error messages. The agent receives structured feedback after every interaction, allowing for step-wise improvement. This modular setup helps maintain interoperability and simplifies adding new tasks to the system.
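The setup described above maps naturally onto a gym-style loop. The toy sketch below mimics that interface with the five action types and an observation carrying execution results and errors; the class and method names (`MLEDojoEnvStub`, `step`, and so on) are stand-ins invented for illustration, not the framework's actual API.

```python
from dataclasses import dataclass, field
from enum import Enum, auto

# The five primary action types described in the article.
class Action(Enum):
    REQUEST_INFO = auto()
    VALIDATE_CODE = auto()
    EXECUTE_CODE = auto()
    GET_HISTORY = auto()
    RESET = auto()

@dataclass
class Observation:
    task_info: str = ""
    execution_result: str = ""
    error_message: str = ""

@dataclass
class MLEDojoEnvStub:
    """Toy stand-in for a gym-style MLE environment (not the real API)."""
    history: list = field(default_factory=list)

    def reset(self) -> Observation:
        self.history.clear()
        return Observation(task_info="Predict house prices from tabular features.")

    def step(self, action: Action, payload: str = "") -> tuple[Observation, float, bool]:
        # A real environment would run `payload` inside a Docker sandbox and score
        # the resulting submission; here we just echo a canned response.
        obs = Observation(execution_result=f"ran {len(payload)} chars of code")
        reward = 0.5 if action is Action.EXECUTE_CODE else 0.0
        done = False
        self.history.append((action, payload))
        return obs, reward, done

# Interaction loop: observe -> act -> receive structured feedback, repeated per revision.
env = MLEDojoEnvStub()
obs = env.reset()
for candidate_code in ["train_v1.py contents", "train_v2.py contents"]:
    obs, reward, done = env.step(Action.VALIDATE_CODE, candidate_code)
    obs, reward, done = env.step(Action.EXECUTE_CODE, candidate_code)
    print(reward, obs.execution_result)
```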

The evaluation included eight frontier LLMs—Gemini-2.5-Pro, DeepSeek-r1, o3-mini, GPT-4o, GPT-4o-mini, Gemini-2.0-Pro, Gemini-2.0-Flash, and DeepSeek-v3—across four core machine learning domains. Gemini-2.5-Pro achieved the highest Elo rating of 1257, followed by DeepSeek-r1 at 1137 and o3-mini at 1108. Regarding HumanRank, Gemini-2.5-Pro led with 61.95%, indicating its superior performance over human benchmarks. Models like GPT-4o-mini executed code only 20% of the time, adopting conservative strategies, while o3-mini performed executions in over 90% of the cases. The average failure rate for Gemini-2.5-Pro remained the lowest across validation and execution phases, reinforcing its robustness. Among domains, computer vision posed the greatest challenge, with most models scoring under 60 in HumanRank. Reasoning models generally produced longer outputs and maintained stronger performance consistency across iterations.
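As context for the Elo figures above, the snippet below applies the standard Elo update rule to a single pairwise comparison between two models. The K-factor and starting ratings are illustrative, and the paper's exact rating procedure may differ.

```python
def elo_update(rating_a: float, rating_b: float, score_a: float, k: float = 32.0) -> tuple[float, float]:
    """Standard Elo update: score_a is 1.0 if A wins, 0.5 for a draw, 0.0 if A loses."""
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
    new_a = rating_a + k * (score_a - expected_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return new_a, new_b

# Example: a 1200-rated model beats a 1100-rated model on one head-to-head task.
print(elo_update(1200.0, 1100.0, score_a=1.0))  # A's rating rises, B's falls
```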

The research highlights the difficulty of applying LLMs to full machine learning workflows. It outlines a comprehensive solution in MLE-Dojo that enables learning through interaction, not just completion. MLE-Dojo sets a new standard for training and evaluating autonomous MLE agents by simulating engineering environments more accurately.


Check out the Paper, Project Page and GitHub Page. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 90k+ ML SubReddit.

