TechCrunch News · March 4
People are using Super Mario to benchmark AI now

Researchers at UC San Diego's Hao AI Lab put AI models into live Super Mario Bros. games. Anthropic's Claude 3.7 performed best, followed by Claude 3.5, while Google's Gemini 1.5 Pro and OpenAI's GPT-4o struggled. Using the GamingAgent framework, the researchers fed each AI basic instructions and in-game screenshots, and the AI generated Python code to control Mario. They found that although reasoning models are stronger on most benchmarks, they performed worse in this real-time game because they take longer to decide on actions. Games have long been used to evaluate AI, but some experts question the value of tying AI gaming skill to broader technological progress.

🕹️ Hao AI Lab put AI into live Super Mario Bros. games to test planning and strategy in a complex environment. The study used the GamingAgent framework, which feeds the AI game screenshots and basic instructions and has it control Mario's actions by generating Python code.

🧠 Anthropic's Claude 3.7 and Claude 3.5 performed well in the tests, while Google's Gemini 1.5 Pro and OpenAI's GPT-4o struggled, showing significant differences in how models handle real-time gaming tasks.

⏱️ The study found that reasoning models (such as OpenAI's o1) perform worse in real-time games than non-reasoning models, mainly because they take longer to decide on actions, and Super Mario Bros. is extremely time-sensitive. This highlights the challenge AI faces in real-time decision-making.

🎮 Games have long been used to benchmark AI, but experts question directly linking AI gaming skill to technological progress. Game environments are relatively abstract and simple, and they provide effectively unlimited training data, which differs from the complexity of the real world.

Thought Pokémon was a tough benchmark for AI? One group of researchers argues that Super Mario Bros. is even tougher.

Hao AI Lab, a research org at the University of California San Diego, on Friday threw AI into live Super Mario Bros. games. Anthropic’s Claude 3.7 performed the best, followed by Claude 3.5. Google’s Gemini 1.5 Pro and OpenAI’s GPT-4o struggled.

It wasn’t quite the same version of Super Mario Bros. as the original 1985 release, to be clear. The game ran in an emulator and integrated with a framework, GamingAgent, to give the AIs control over Mario.

Image Credits: Hao Lab

GamingAgent, which Hao developed in-house, fed the AI basic instructions, like "If an obstacle or enemy is near, move/jump left to dodge," along with in-game screenshots. The AI then generated inputs in the form of Python code to control Mario.
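
As a rough illustration, here is a minimal Python sketch of a screenshot-in, code-out loop like the one described above. The emulator and model-client objects are hypothetical stand-ins for whatever GamingAgent actually uses, not the framework's real API.

import base64
import time

PROMPT = (
    "You are playing Super Mario Bros. "
    "If an obstacle or enemy is near, move/jump left to dodge. "
    "Reply only with Python code that calls press(button, frames)."
)

def run_agent(emulator, client, steps=100):
    for _ in range(steps):
        # 1. Capture the current frame (hypothetical emulator method).
        png = emulator.screenshot()
        image_b64 = base64.b64encode(png).decode()

        # 2. Ask the model for its next move; it replies with Python code.
        t0 = time.time()
        code = client.complete(prompt=PROMPT, image=image_b64)  # hypothetical client call
        latency = time.time() - t0  # deliberation time costs real game time

        # 3. Run the generated code in a namespace exposing only press().
        exec(code, {"press": emulator.press})
        print(f"decided in {latency:.2f}s")

Executing model-written code against a narrow, whitelisted namespace is the key design choice here: the model never touches the emulator directly, only the single input primitive it was told about.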

Still, Hao says that the game forced each model to “learn” to plan complex maneuvers and develop gameplay strategies. Interestingly, the lab found that so-called reasoning models like OpenAI’s o1, which “think” through problems step by step to arrive at solutions, performed worse than “non-reasoning” models, despite being generally stronger on most benchmarks.

One of the main reasons reasoning models have trouble playing real-time games like this is that they take a while — seconds, usually — to decide on actions, according to the researchers. In Super Mario Bros., timing is everything. A second can mean the difference between a jump safely cleared and a plummet to the death.
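
To see why that latency bites, a quick back-of-the-envelope calculation: the NES runs at roughly 60 frames per second, so every second a model spends deliberating is about 60 frames of Mario running on autopilot. The latencies below are illustrative, not measured figures from the study.

# How many game frames elapse while a model "thinks"?
FPS = 60  # approximate NES frame rate

def frames_elapsed(decision_latency_s: float) -> int:
    return round(decision_latency_s * FPS)

for latency_s in (0.2, 1.0, 3.0):  # illustrative latencies
    print(f"{latency_s:4.1f}s of deliberation = {frames_elapsed(latency_s)} frames")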

Games have been used to benchmark AI for decades. But some experts have questioned the wisdom of drawing connections between AI’s gaming skills and technological advancement. Unlike the real world, games tend to be abstract and relatively simple, and they provide a theoretically infinite amount of data to train AI.

The recent flashy gaming benchmarks point to what Andrej Karpathy, a research scientist and founding member at OpenAI, called an “evaluation crisis.”

“I don’t really know what [AI] metrics to look at right now,” he wrote in a post on X. “TLDR my reaction is I don’t really know how good these models are right now.”

At least we can watch AI play Mario.
