TechCrunch News · March 4
People are using Super Mario to benchmark AI now

Researchers at UC San Diego's Hao AI Lab put AI models into live Super Mario Bros. games. Anthropic's Claude 3.7 performed best, followed by Claude 3.5, while Google's Gemini 1.5 Pro and OpenAI's GPT-4o struggled. Using the GamingAgent framework, the researchers fed each AI basic instructions and in-game screenshots, and the AI generated Python code to control Mario. They found that although reasoning models are stronger on most benchmarks, they performed worse in this real-time game because they take longer to decide on actions. Games have long been used to evaluate AI, but some experts question the value of tying AI gaming skill to broader technological progress.

🕹️ Hao AI Lab put AI into live Super Mario Bros. games to test planning and strategy in a complex environment. The study used the GamingAgent framework, which feeds the AI game screenshots and basic instructions and has it control Mario's actions by generating Python code.

🧠 Anthropic's Claude 3.7 and Claude 3.5 performed well in the tests, while Google's Gemini 1.5 Pro and OpenAI's GPT-4o struggled, showing significant differences in how models handle real-time gaming tasks.

⏱️ The study found that reasoning models (such as OpenAI's o1) perform worse in real-time games than non-reasoning models, mainly because they take longer to decide on actions, and Super Mario Bros. is extremely time-sensitive. This highlights the challenge AI faces in real-time decision-making.

🎮 Games have long been used to benchmark AI, but experts question directly linking AI gaming skill to technological progress. Game environments are relatively abstract and simple, and they provide effectively unlimited training data, which differs from the complexity of the real world.

Thought Pokémon was a tough benchmark for AI? One group of researchers argues that Super Mario Bros. is even tougher.

Hao AI Lab, a research org at the University of California San Diego, on Friday threw AI into live Super Mario Bros. games. Anthropic’s Claude 3.7 performed the best, followed by Claude 3.5. Google’s Gemini 1.5 Pro and OpenAI’s GPT-4o struggled.

It wasn’t quite the same version of Super Mario Bros. as the original 1985 release, to be clear. The game ran in an emulator and integrated with a framework, GamingAgent, to give the AIs control over Mario.

Image Credits: Hao Lab

GamingAgent, which Hao developed in-house, fed the AI basic instructions, like "If an obstacle or enemy is near, move/jump left to dodge," along with in-game screenshots. The AI then generated inputs in the form of Python code to control Mario.
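
As a rough illustration, here is a minimal Python sketch of a screenshot-in, code-out loop like the one described above. The emulator and model-client objects are hypothetical stand-ins for whatever GamingAgent actually uses, not the framework's real API.

import base64
import time

PROMPT = (
    "You are playing Super Mario Bros. "
    "If an obstacle or enemy is near, move/jump left to dodge. "
    "Reply only with Python code that calls press(button, frames)."
)

def run_agent(emulator, client, steps=100):
    for _ in range(steps):
        # 1. Capture the current frame (hypothetical emulator method).
        png = emulator.screenshot()
        image_b64 = base64.b64encode(png).decode()

        # 2. Ask the model for its next move; it replies with Python code.
        t0 = time.time()
        code = client.complete(prompt=PROMPT, image=image_b64)  # hypothetical client call
        latency = time.time() - t0  # deliberation time costs real game time

        # 3. Run the generated code in a namespace exposing only press().
        exec(code, {"press": emulator.press})
        print(f"decided in {latency:.2f}s")

Executing model-written code against a narrow, whitelisted namespace is the key design choice here: the model never touches the emulator directly, only the single input primitive it was told about.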

Still, Hao says that the game forced each model to “learn” to plan complex maneuvers and develop gameplay strategies. Interestingly, the lab found that so-called reasoning models like OpenAI’s o1, which “think” through problems step by step to arrive at solutions, performed worse than “non-reasoning” models, despite being generally stronger on most benchmarks.

One of the main reasons reasoning models have trouble playing real-time games like this is that they take a while — seconds, usually — to decide on actions, according to the researchers. In Super Mario Bros., timing is everything. A second can mean the difference between a jump safely cleared and a plummet to the death.
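
To see why that latency bites, a quick back-of-the-envelope calculation: the NES runs at roughly 60 frames per second, so every second a model spends deliberating is about 60 frames of Mario running on autopilot. The latencies below are illustrative, not measured figures from the study.

# How many game frames elapse while a model "thinks"?
FPS = 60  # approximate NES frame rate

def frames_elapsed(decision_latency_s: float) -> int:
    return round(decision_latency_s * FPS)

for latency_s in (0.2, 1.0, 3.0):  # illustrative latencies
    print(f"{latency_s:4.1f}s of deliberation = {frames_elapsed(latency_s)} frames")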

Games have been used to benchmark AI for decades. But some experts have questioned the wisdom of drawing connections between AI’s gaming skills and technological advancement. Unlike the real world, games tend to be abstract and relatively simple, and they provide a theoretically infinite amount of data to train AI.

The recent flashy gaming benchmarks point to what Andrej Karpathy, a research scientist and founding member at OpenAI, called an “evaluation crisis.”

“I don’t really know what [AI] metrics to look at right now,” he wrote in a post on X. “TLDR my reaction is I don’t really know how good these models are right now.”

At least we can watch AI play Mario.
