EnterpriseAI, October 30, 2024
OpenAI Develops New AGI Benchmark to Assess Potential Risks of Advanced AI

OpenAI researchers have developed MLE-bench to evaluate AI models' capabilities in "autonomous machine learning (ML) engineering." The benchmark comprises a series of tests designed to measure whether AI agents can modify their own code and improve their capabilities without human instruction. MLE-bench contains 75 Kaggle tests representing real-world challenges, such as the Stanford OpenVaccine initiative and the Vesuvius Challenge. The researchers evaluated OpenAI's most powerful AI model, o1, on MLE-bench; the results show that the model reached at least Kaggle bronze-medal level on 16.9% of the 75 tests. The researchers stress that AI agents could have many positive impacts, particularly in accelerating scientific research and discovery, but that uncontrolled AI agents could lead to unmitigated disaster.

🤔 **Design and purpose of MLE-bench:** MLE-bench consists of 75 Kaggle tests representing real-world challenges, designed to evaluate AI models' capabilities in "autonomous machine learning (ML) engineering." Each Kaggle test reflects core skills that ML engineers use in advanced research and development environments.

🏆 **OpenAI model performance on MLE-bench:** The researchers tested OpenAI's most powerful AI model, o1, on MLE-bench; it achieved at least Kaggle bronze-medal level on 16.9% of the 75 tests, which corresponds to ranking in the top 40% of the Kaggle leaderboard. The model earned 7 gold medals on MLE-bench, more than the 5 gold medals a human needs to be considered a "Kaggle Grandmaster."

⚠️ **Risks of autonomous AI learning:** The researchers stress that while AI agents could have many positive impacts, particularly in accelerating scientific research and discovery, uncontrolled AI agents could lead to unmitigated disaster. If AI models can autonomously improve their own training code, they could advance frontier-model capabilities faster than human researchers, potentially leading to misuse or catastrophic outcomes.

🔓 **Significance of open-sourcing MLE-bench:** To support research into the agentic capabilities of language models and to improve transparency around acceleration risks at research labs, the researchers decided to open-source MLE-bench, allowing other researchers to test other AI models against it.

🌎 **Future directions:** The OpenAI researchers hope their work on MLE-bench will help advance understanding of how AI agents can autonomously perform ML engineering tasks, which is crucial for the safe deployment of more powerful models in the future.

As artificial intelligence systems grow more advanced, there is increasing concern over models capable of modifying their own code and enhancing their abilities without human oversight. Such AI agents, if left unchecked, could progress at a rate that surpasses human understanding, potentially leading to unpredictable or even catastrophic outcomes.

Researchers from OpenAI have developed MLE-bench to assess how effectively AI models can perform tasks in "autonomous machine learning (ML) engineering." This new artificial general intelligence (AGI) benchmark consists of a series of tests designed to measure whether AI agents can modify their own code and improve their capabilities without human instruction. AGI refers to an AI system whose intelligence surpasses human capabilities.

MLE-bench consists of 75 Kaggle tests, also referred to as Kaggle datasets, which serve as rigorous challenges for evaluating the ML engineering capabilities of AI models. Each test was manually sourced from Kaggle to reflect a core set of day-to-day skills that ML engineers use in advanced research and development environments.
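
To make the setup concrete, here is a minimal sketch of how a Kaggle-style competition could be represented in an evaluation harness. All names and fields below are hypothetical illustrations, not the actual MLE-bench schema:

```python
from dataclasses import dataclass

@dataclass
class CompetitionTask:
    """Hypothetical spec for one Kaggle-style benchmark task."""
    competition_id: str    # e.g. a Kaggle competition slug
    description: str       # problem statement shown to the agent
    train_data_path: str   # local copy of the training set
    test_data_path: str    # held-out inputs the agent must predict on
    metric: str            # scoring metric, e.g. "rmse" or "auc"
    leaderboard_path: str  # historical human scores for medal cutoffs

def evaluate(task: CompetitionTask, submission_csv: str) -> float:
    """Score a submission against the held-out labels (stub only;
    the grading logic depends on task.metric)."""
    raise NotImplementedError
```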

The Kaggle tests represent real-world challenges, such as the Stanford OpenVaccine initiative, which focuses on developing an mRNA vaccine for COVID-19, and the Vesuvius Challenge, aimed at deciphering ancient scrolls from a library in Herculaneum, a town next to Pompeii.

The researchers tested MLE-bench with OpenAI's most powerful AI model, o1. The results indicate that the model achieved at least a Kaggle bronze-medal level on 16.9% of the 75 tests in MLE-bench, and this percentage increased as the model was given more attempts at each challenge.
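
Success rates that grow with the number of attempts are commonly summarized with the unbiased pass@k estimator from Chen et al. (2021). Whether the MLE-bench paper uses exactly this estimator is an assumption; the sketch below, with illustrative numbers, shows how such multi-attempt figures are typically computed:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k
    samples drawn from n recorded attempts (of which c succeeded)
    is a success."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Illustrative only: 8 attempts on a competition, 2 earned a medal.
print(pass_at_k(n=8, c=2, k=1))  # single-attempt medal rate: 0.25
print(pass_at_k(n=8, c=2, k=4))  # medal rate with a 4-attempt budget: ~0.79
```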

Achieving a bronze medal means ranking among the top 40% of human participants on the Kaggle leaderboard. OpenAI's o1 model earned a total of seven gold medals on MLE-bench, two more than a human needs to be considered a "Kaggle Grandmaster." The researchers noted in their paper that only two humans have ever earned medals in 75 different Kaggle competitions.
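
The "top 40%" figure maps onto Kaggle's percentile-based medal tiers for smaller competitions. Here is a simplified sketch of that tier logic; real Kaggle rules add absolute, team-count-dependent cutoffs for larger leaderboards, so treat this as an illustration of the percentile tiers only:

```python
def medal(rank: int, n_teams: int) -> str | None:
    """Simplified Kaggle-style medal assignment for small leaderboards:
    gold = top 10%, silver = top 20%, bronze = top 40%. Real Kaggle
    rules vary these cutoffs with the number of competing teams."""
    pct = rank / n_teams
    if pct <= 0.10:
        return "gold"
    if pct <= 0.20:
        return "silver"
    if pct <= 0.40:
        return "bronze"
    return None

# Illustrative: rank 35 of 90 teams lands in the top 40%.
print(medal(35, 90))  # -> "bronze"
```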

The OpenAI researchers emphasize that AI agents could have numerous positive impacts, especially in accelerating scientific research and discovery. However, if AI agents are not controlled, the result could be unmitigated disaster.

"The capacity of agents to perform high-quality research could mark a transformative step in the economy. However, agents capable of performing open-ended ML research tasks, at the level of improving their own training code, could improve the capabilities of frontier models significantly faster than human researchers," the researchers explained in their paper published on arXiv

"If innovations are produced faster than our ability to understand their impacts, we risk developing models capable of catastrophic harm or misuse without parallel developments in securing, aligning, and controlling such models.” 

Highlighting the limitations of MLE-bench, the researchers pointed out the risk of contamination: because Kaggle competitions are publicly available, the models may have been trained on such data, including competition details and solutions. The AGI benchmark is also resource-intensive; a single full run of the experiment by OpenAI scientists required 1,800 GPU-hours of compute, making it impractical for many applications.
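
Contamination of this kind is often screened for with simple text-overlap heuristics. A minimal sketch follows, assuming access to public solution write-ups; the 13-gram size and any flagging threshold are arbitrary illustrative choices, not the paper's method:

```python
def ngram_overlap(candidate: str, reference: str, n: int = 13) -> float:
    """Fraction of the candidate's n-grams (over whitespace tokens)
    that also appear in the reference text. High overlap with a public
    Kaggle solution suggests possible training-data contamination."""
    def ngrams(text: str) -> set[tuple[str, ...]]:
        toks = text.split()
        return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

    cand = ngrams(candidate)
    if not cand:
        return 0.0
    return len(cand & ngrams(reference)) / len(cand)

# One might flag an agent's solution if, say, more than 5% of its
# 13-grams match a public write-up (threshold for illustration only).
```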

The researchers further explained that a model capable of successfully tackling a substantial portion of MLE-bench is likely equipped to handle numerous open-ended ML tasks. To support research into the agentic capabilities of language models and improve transparency around acceleration risks in research labs, they have chosen to open-source MLE-bench.

Open-sourcing the benchmark will allow other researchers to test their own AI models against MLE-bench. The OpenAI researchers hope their work on MLE-bench will help advance the understanding of how AI agents can autonomously perform ML engineering tasks, which is crucial for the safe deployment of more powerful models in the future.
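
In practice, plugging a new model into such a benchmark amounts to wiring an agent's solve step to the benchmark's grader and tallying medal outcomes per competition. The interface below is a hypothetical sketch of that loop, not the published repository's actual API:

```python
from typing import Callable

# Hypothetical integration loop; not the actual MLE-bench API.
def run_benchmark(
    solve: Callable[[str], str],        # agent: task id -> submission CSV path
    grade: Callable[[str, str], bool],  # grader: (task id, submission) -> medaled?
    task_ids: list[str],
    attempts: int = 1,
) -> dict[str, bool]:
    """Run an agent on each task up to `attempts` times, recording
    whether any attempt reached at least a bronze-medal score."""
    results: dict[str, bool] = {}
    for task_id in task_ids:
        results[task_id] = any(
            grade(task_id, solve(task_id)) for _ in range(attempts)
        )
    return results
```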

 
