MarkTechPost@AI · February 24
Meta AI Introduces MLGym: A New AI Framework and Benchmark for Advancing AI Research Agents

Meta AI has introduced MLGym, a new framework and benchmark for advancing AI research agents. The system is the first Gym environment for machine learning tasks, designed to facilitate the application of reinforcement learning techniques to training AI agents. The MLGym-Bench benchmark comprises 13 open-ended tasks spanning computer vision, natural language processing, reinforcement learning, and game theory, all requiring real-world research skills. The framework consists of four key components (Agents, Environment, Datasets, and Tasks) and is designed for evaluating and developing LLM agents for machine learning research tasks.

🧪MLGym is a framework for evaluating and developing LLM agents for machine learning research tasks; agents interact with a shell environment through sequential commands. It comprises four key components: Agents, Environment, Datasets, and Tasks.

🖥️MLGym-Bench contains 13 open-ended tasks spanning computer vision, natural language processing, reinforcement learning, and game theory, all requiring real-world research skills. The benchmark focuses on Level 1: Baseline Improvement, where LLMs optimize a given model but make no scientific contribution of their own.

🥇The study evaluated five state-of-the-art models (OpenAI O1-preview, Gemini 1.5 Pro, Claude-3.5-Sonnet, Llama-3-405b-Instruct, and GPT-4o) under standardized settings. OpenAI O1-preview achieved the highest overall performance, with Gemini 1.5 Pro and Claude-3.5-Sonnet close behind.

🤝The study stresses the importance of data openness for strengthening collaboration and discovery. As AI research advances, progress in reasoning, agent architectures, and evaluation methods will be crucial. Strengthening interdisciplinary collaboration can ensure that AI-driven agents accelerate scientific discovery while preserving reproducibility, verifiability, and integrity.

The ambition to accelerate scientific discovery through AI has been longstanding, with early efforts such as the Oak Ridge Applied AI Project dating back to 1979. More recent advancements in foundation models have demonstrated the feasibility of fully automated research pipelines, enabling AI systems to autonomously conduct literature reviews, formulate hypotheses, design experiments, analyze results, and even generate scientific papers. Additionally, they can streamline scientific workflows by automating repetitive tasks, allowing researchers to focus on higher-level conceptual work. However, despite these promising developments, the evaluation of AI-driven research remains challenging due to the lack of standardized benchmarks that can comprehensively assess their capabilities across different scientific domains.

Recent studies have addressed this gap by introducing benchmarks that evaluate AI agents on various software engineering and machine learning tasks. While frameworks exist to test AI agents on well-defined problems like code generation and model optimization, most current benchmarks do not fully support open-ended research challenges, where multiple solutions could emerge. Furthermore, these frameworks often lack flexibility in assessing diverse research outputs, such as novel algorithms, model architectures, or predictions. To advance AI-driven research, there is a need for evaluation systems that incorporate broader scientific tasks, facilitate experimentation with different learning algorithms, and accommodate various forms of research contributions. By establishing such comprehensive frameworks, the field can move closer to realizing AI systems capable of independently driving meaningful scientific progress.

Researchers from University College London, the University of Wisconsin–Madison, the University of Oxford, Meta, and other institutions have introduced a new framework and benchmark for evaluating and developing LLM agents in AI research. This system, the first Gym environment for ML tasks, facilitates the study of RL techniques for training AI agents. The benchmark, MLGym-Bench, includes 13 open-ended tasks spanning computer vision, NLP, RL, and game theory, requiring real-world research skills. A six-level framework categorizes AI research agent capabilities, with MLGym-Bench focusing on Level 1: Baseline Improvement, where LLMs optimize models but lack scientific contributions.

MLGym is a framework designed to evaluate and develop LLM agents for ML research tasks by enabling interaction with a shell environment through sequential commands. It comprises four key components: Agents, Environment, Datasets, and Tasks. Agents execute bash commands, manage history, and integrate external models. The environment provides a secure Docker-based workspace with controlled access. Datasets are defined separately from tasks, allowing reuse across experiments. Tasks include evaluation scripts and configurations for diverse ML challenges. Additionally, MLGym offers tools for literature search, memory storage, and iterative validation, ensuring efficient experimentation and adaptability in long-term AI research workflows.
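To make this interaction model concrete, here is a minimal Python sketch of a Gym-style command loop. The ShellEnv class, run_episode helper, and fixed command policy are illustrative assumptions rather than MLGym's actual API; in the real framework the commands would execute inside the Docker-based workspace described above.

```python
class ShellEnv:
    """Toy stand-in for a Docker-backed shell environment (illustrative only)."""

    def reset(self):
        # Return the initial task description as the first observation.
        return "Task: improve the baseline accuracy on the given dataset."

    def step(self, command: str):
        # In MLGym the command would run inside a sandboxed container;
        # here we only echo it and signal completion on "submit".
        observation = f"$ {command}\n(command output would appear here)"
        done = command.startswith("submit")
        reward = 1.0 if done else 0.0  # reward granted only on final submission
        return observation, reward, done


def run_episode(env, policy, max_steps=10):
    """Run one agent-environment episode of sequential shell commands."""
    obs = env.reset()
    for _ in range(max_steps):
        command = policy(obs)                  # the LLM proposes the next command
        obs, reward, done = env.step(command)  # environment executes and responds
        if done:
            return reward
    return 0.0


# A fixed command sequence standing in for an LLM policy.
steps = iter(["ls", "python train.py", "submit final_model"])
print(run_episode(ShellEnv(), lambda obs: next(steps)))  # -> 1.0
```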

The study employs a SWE-Agent-style model designed for the MLGym environment, following a ReAct-style decision-making loop. Five state-of-the-art models (OpenAI O1-preview, Gemini 1.5 Pro, Claude-3.5-Sonnet, Llama-3-405b-Instruct, and GPT-4o) are evaluated under standardized settings. Performance is assessed using AUP scores and performance profiles, comparing models on Best Attempt and Best Submission metrics. OpenAI O1-preview achieves the highest overall performance, with Gemini 1.5 Pro and Claude-3.5-Sonnet close behind. The study highlights performance profiles as an effective evaluation method, showing that OpenAI O1-preview consistently ranks among the top models across tasks.
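For readers unfamiliar with these metrics, the sketch below computes Dolan-Moré-style performance profiles and a trapezoidal area-under-profile score in plain Python. The scores, model names, and tau grid are illustrative assumptions; the exact AUP definition used in the paper may differ.

```python
scores = {  # scores[model][task], higher is better (illustrative data)
    "model_a": [0.92, 0.75, 0.60],
    "model_b": [0.88, 0.80, 0.65],
}

def perf_ratios(scores):
    """Ratio of the per-task best score to each model's score (>= 1.0)."""
    n_tasks = len(next(iter(scores.values())))
    best = [max(s[t] for s in scores.values()) for t in range(n_tasks)]
    return {m: [best[t] / s[t] for t in range(n_tasks)]
            for m, s in scores.items()}

def profile(ratios, tau):
    # Fraction of tasks on which the model is within a factor tau of the best.
    return sum(r <= tau for r in ratios) / len(ratios)

def aup(ratios, taus):
    # Trapezoidal area under the profile curve over the tau grid.
    ys = [profile(ratios, t) for t in taus]
    return sum((ys[i] + ys[i + 1]) / 2 * (taus[i + 1] - taus[i])
               for i in range(len(taus) - 1))

taus = [1.0 + 0.01 * i for i in range(101)]  # tau from 1.0 to 2.0
for model, r in perf_ratios(scores).items():
    print(model, round(aup(r, taus), 3))
```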

In conclusion, the study highlights both the potential and the challenges of using LLMs as scientific workflow agents. MLGym and MLGym-Bench demonstrate adaptability across diverse quantitative tasks but also reveal clear room for improvement. Expanding beyond ML, testing interdisciplinary generalization, and assessing scientific novelty are key areas for growth. The study emphasizes the importance of data openness to enhance collaboration and discovery. As AI research progresses, advancements in reasoning, agent architectures, and evaluation methods will be crucial. Strengthening interdisciplinary collaboration can ensure that AI-driven agents accelerate scientific discovery while maintaining reproducibility, verifiability, and integrity.


Check out the Paper and GitHub Page. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 80k+ ML SubReddit.

Recommended Read: LG AI Research Releases NEXUS: An Advanced System Integrating Agent AI System and Data Compliance Standards to Address Legal Concerns in AI Datasets

