TechCrunch News · April 15, 06:44
Debates over AI benchmarking have reached Pokémon

A news article on the controversy over AI models' performance in the Pokémon games. It notes that Google's Gemini model is ahead of Anthropic's Claude model in Pokémon, but that the lead may not be entirely fair, because Gemini relies on a custom assistive tool. The article also points to similar issues in other AI benchmarks, such as the differing scores Anthropic's Claude model posts on SWE-bench and Meta's fine-tuning of its Llama 4 Maverick model. Together, these examples show that custom and non-standard implementations can influence benchmark results, making model comparisons even harder.

🕹️ Google's Gemini model appears to be ahead of Anthropic's Claude model in Pokémon, drawing fresh attention to AI benchmarking.

🗺️ Gemini's edge comes from a custom minimap its developer built, which helps the model identify in-game "tiles" such as cuttable trees, reducing the need to analyze screenshots and speeding up gameplay decisions.

📊 Anthropic's Claude model posted significantly higher accuracy on SWE-bench when run with a "custom scaffold" than without one, showing that different implementations can shift benchmark results.

⚙️ Meta fine-tuned its Llama 4 Maverick model to perform well on the LM Arena benchmark, while the vanilla version scores far lower on the same test, underscoring how customization shapes results.

Not even Pokémon is safe from AI benchmarking controversy.

Last week, a post on X went viral, claiming that Google’s latest Gemini model surpassed Anthropic’s flagship Claude model in the original Pokémon video game trilogy. Reportedly, Gemini had reached Lavender Town in a developer’s Twitch stream; Claude was stuck at Mount Moon as of late February.

But what the post failed to mention is that Gemini had an advantage.

As users on Reddit pointed out, the developer who maintains the Gemini stream built a custom minimap that helps the model identify “tiles” in the game like cuttable trees. This reduces the need for Gemini to analyze screenshots before it makes gameplay decisions.
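The article doesn’t detail the stream’s actual tooling, but the idea is straightforward: if the game’s tile grid is already decoded and handed to the model as text, the model can plan a move without parsing pixels. A minimal, purely hypothetical sketch in Python (the tile legend, grid format, and prompt wording are illustrative assumptions, not the developer’s minimap):

```python
# Hypothetical sketch: passing a tile-annotated minimap to a model as text,
# so it can decide a move without analyzing a raw screenshot.
# Tile IDs, legend, and prompt format are illustrative assumptions only.

TILE_LEGEND = {
    0: ".",   # walkable ground
    1: "#",   # wall / impassable
    2: "T",   # cuttable tree (requires Cut)
    3: "D",   # door / warp
    4: "P",   # player position
}

def render_minimap(tile_grid: list[list[int]]) -> str:
    """Convert a 2D grid of tile IDs into a compact ASCII minimap."""
    return "\n".join("".join(TILE_LEGEND[t] for t in row) for row in tile_grid)

def build_prompt(tile_grid: list[list[int]], goal: str) -> str:
    """Compose the text prompt an agent could send to the model each turn."""
    return (
        "Minimap legend: . ground, # wall, T cuttable tree, D door, P you\n"
        f"{render_minimap(tile_grid)}\n"
        f"Goal: {goal}\n"
        "Reply with one move: UP, DOWN, LEFT, RIGHT, or CUT."
    )

if __name__ == "__main__":
    grid = [
        [1, 1, 1, 1, 1],
        [1, 0, 0, 2, 1],
        [1, 4, 0, 0, 3],
        [1, 1, 1, 1, 1],
    ]
    print(build_prompt(grid, "Reach the door on the right"))
```

However the real overlay works, the effect is the same: the model spends fewer steps interpreting the screen and more steps acting, which inflates apparent progress relative to a model that must work from screenshots alone.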

Now, Pokémon is a semi-serious AI benchmark at best — few would argue it’s a very informative test of a model’s capabilities. But it is an instructive example of how different implementations of a benchmark can influence the results.

For example, Anthropic reported two scores for its recent Claude 3.7 Sonnet model on the benchmark SWE-bench Verified, which is designed to evaluate a model’s coding abilities. Claude 3.7 Sonnet achieved 62.3% accuracy on SWE-bench Verified, but 70.3% with a “custom scaffold” that Anthropic developed.
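The article doesn’t describe the scaffold itself, but in agentic coding benchmarks a scaffold is typically a harness that wraps the same model in a loop with tools, such as running the repository’s tests and applying patches, and feeds the results back. A rough, hypothetical sketch (the function names and loop structure are assumptions, not Anthropic’s harness):

```python
# Hypothetical sketch of a benchmark "scaffold": the same model, wrapped in an
# agent loop that can apply patches and run tests before submitting an answer.
# query_model, run_tests, and apply_patch are illustrative stand-ins.

from typing import Callable

def solve_issue(
    issue: str,
    query_model: Callable[[str], str],   # calls the LLM with a prompt, returns a diff
    run_tests: Callable[[], str],        # runs the test suite, returns its output
    apply_patch: Callable[[str], None],  # applies a unified diff to the checkout
    max_turns: int = 5,
) -> bool:
    """Iterate: propose a patch, apply it, run tests, feed failures back."""
    feedback = ""
    for _ in range(max_turns):
        patch = query_model(
            f"Issue:\n{issue}\n\nPrevious test output:\n{feedback}\n"
            "Respond with a unified diff that fixes the issue."
        )
        apply_patch(patch)
        feedback = run_tests()
        if "FAILED" not in feedback:
            return True  # tests pass; submit this patch
    return False
```

The point is not the specific loop but that the harness, not just the model, contributes to the reported number.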

More recently, Meta fine-tuned a version of one of its newer models, Llama 4 Maverick, to perform well on a particular benchmark, LM Arena. The vanilla version of the model scores significantly worse on the same evaluation.

Given that AI benchmarks — Pokémon included — are imperfect measures to begin with, custom and non-standard implementations threaten to muddy the waters even further. That is to say, it doesn’t seem likely that it’ll get any easier to compare models as they’re released.
