TechCrunch News · April 15, 06:44
Debates over AI benchmarking have reached Pokémon

A news article on the controversy over AI models' performance in the Pokémon games. It notes that Google's Gemini model is ahead of Anthropic's Claude model in Pokémon, but that the lead may not be entirely fair, because Gemini relies on a custom assistive tool. The article also points to similar issues in other AI benchmarks, such as the differing scores Anthropic's Claude model posts on SWE-bench and Meta's fine-tuning of its Llama 4 Maverick model. Together, these examples show that custom and non-standard implementations can influence benchmark results, making model comparisons even harder.

🕹️ Google's Gemini model appears to be ahead of Anthropic's Claude model in Pokémon, drawing fresh attention to AI benchmarking.

🗺️ Gemini's edge comes from a custom minimap its developer built, which helps the model identify in-game "tiles" such as cuttable trees, reducing the need to analyze screenshots and speeding up gameplay decisions.

📊 Anthropic's Claude model posted significantly higher accuracy on SWE-bench when run with a "custom scaffold" than without one, showing that different implementations can shift benchmark results.

⚙️ Meta fine-tuned its Llama 4 Maverick model to perform well on the LM Arena benchmark, while the vanilla version scores far lower on the same test, underscoring how customization shapes results.

Not even Pokémon is safe from AI benchmarking controversy.

Last week, a post on X went viral, claiming that Google’s latest Gemini model surpassed Anthropic’s flagship Claude model in the original Pokémon video game trilogy. Reportedly, Gemini had reached Lavender Town in a developer’s Twitch stream; Claude was stuck at Mount Moon as of late February.

But what the post failed to mention is that Gemini had an advantage.

As users on Reddit pointed out, the developer who maintains the Gemini stream built a custom minimap that helps the model identify “tiles” in the game like cuttable trees. This reduces the need for Gemini to analyze screenshots before it makes gameplay decisions.
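The article doesn’t detail the stream’s actual tooling, but the idea is straightforward: if the game’s tile grid is already decoded and handed to the model as text, the model can plan a move without parsing pixels. A minimal, purely hypothetical sketch in Python (the tile legend, grid format, and prompt wording are illustrative assumptions, not the developer’s minimap):

```python
# Hypothetical sketch: passing a tile-annotated minimap to a model as text,
# so it can decide a move without analyzing a raw screenshot.
# Tile IDs, legend, and prompt format are illustrative assumptions only.

TILE_LEGEND = {
    0: ".",   # walkable ground
    1: "#",   # wall / impassable
    2: "T",   # cuttable tree (requires Cut)
    3: "D",   # door / warp
    4: "P",   # player position
}

def render_minimap(tile_grid: list[list[int]]) -> str:
    """Convert a 2D grid of tile IDs into a compact ASCII minimap."""
    return "\n".join("".join(TILE_LEGEND[t] for t in row) for row in tile_grid)

def build_prompt(tile_grid: list[list[int]], goal: str) -> str:
    """Compose the text prompt an agent could send to the model each turn."""
    return (
        "Minimap legend: . ground, # wall, T cuttable tree, D door, P you\n"
        f"{render_minimap(tile_grid)}\n"
        f"Goal: {goal}\n"
        "Reply with one move: UP, DOWN, LEFT, RIGHT, or CUT."
    )

if __name__ == "__main__":
    grid = [
        [1, 1, 1, 1, 1],
        [1, 0, 0, 2, 1],
        [1, 4, 0, 0, 3],
        [1, 1, 1, 1, 1],
    ]
    print(build_prompt(grid, "Reach the door on the right"))
```

However the real overlay works, the effect is the same: the model spends fewer steps interpreting the screen and more steps acting, which inflates apparent progress relative to a model that must work from screenshots alone.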

Now, Pokémon is a semi-serious AI benchmark at best — few would argue it’s a very informative test of a model’s capabilities. But it is an instructive example of how different implementations of a benchmark can influence the results.

For example, Anthropic reported two scores for its recent Claude 3.7 Sonnet model on the benchmark SWE-bench Verified, which is designed to evaluate a model’s coding abilities. Claude 3.7 Sonnet achieved 62.3% accuracy on SWE-bench Verified, but 70.3% with a “custom scaffold” that Anthropic developed.
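The article doesn’t describe the scaffold itself, but in agentic coding benchmarks a scaffold is typically a harness that wraps the same model in a loop with tools, such as running the repository’s tests and applying patches, and feeds the results back. A rough, hypothetical sketch (the function names and loop structure are assumptions, not Anthropic’s harness):

```python
# Hypothetical sketch of a benchmark "scaffold": the same model, wrapped in an
# agent loop that can apply patches and run tests before submitting an answer.
# query_model, run_tests, and apply_patch are illustrative stand-ins.

from typing import Callable

def solve_issue(
    issue: str,
    query_model: Callable[[str], str],   # calls the LLM with a prompt, returns a diff
    run_tests: Callable[[], str],        # runs the test suite, returns its output
    apply_patch: Callable[[str], None],  # applies a unified diff to the checkout
    max_turns: int = 5,
) -> bool:
    """Iterate: propose a patch, apply it, run tests, feed failures back."""
    feedback = ""
    for _ in range(max_turns):
        patch = query_model(
            f"Issue:\n{issue}\n\nPrevious test output:\n{feedback}\n"
            "Respond with a unified diff that fixes the issue."
        )
        apply_patch(patch)
        feedback = run_tests()
        if "FAILED" not in feedback:
            return True  # tests pass; submit this patch
    return False
```

The point is not the specific loop but that the harness, not just the model, contributes to the reported number.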

More recently, Meta fine-tuned a version of one of its newer models, Llama 4 Maverick, to perform well on a particular benchmark, LM Arena. The vanilla version of the model scores significantly worse on the same evaluation.

Given that AI benchmarks — Pokémon included — are imperfect measures to begin with, custom and non-standard implementations threaten to muddy the waters even further. That is to say, it doesn’t seem likely that it’ll get any easier to compare models as they’re released.
