TechCrunch News, March 21
A high schooler built a website that lets you challenge AI models to a Minecraft build-off

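AI builders are using Minecraft as a tool to evaluate the capabilities of generative AI models. Minecraft Benchmark has AI models create builds in the game and lets users vote on them. It makes AI progress easier for people to observe; although the practical usefulness of the results is debated, the project's founder believes they offer some reference value to companies.

🎮 Minecraft Benchmark has AI models create builds in Minecraft and compete against each other, and users can vote on the results.

👨‍🎓 Adi Singh, a 12th grader, started the project, arguing that people's familiarity with Minecraft helps in evaluating AI progress.

💻 Although the project is a programming benchmark, users can easily judge a build by its appearance, which gives it broader appeal and more potential for data collection.

🤔 How much the results say about AI usefulness is debated, but the founder believes they offer some reference value to companies.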

As conventional AI benchmarking techniques prove inadequate, AI builders are turning to more creative ways to assess the capabilities of generative AI models. For one group of developers, that’s Minecraft, the Microsoft-owned sandbox-building game.

The website Minecraft Benchmark (or MC-Bench) was developed collaboratively to pit AI models against each other in head-to-head challenges, in which each model responds to a prompt with a Minecraft creation. Users vote on which model did a better job, and only after voting can they see which AI made each Minecraft build.

Image Credits: Minecraft Benchmark

For Adi Singh, the 12th grader who started MC-Bench, the value of Minecraft isn’t so much the game itself as the familiarity that people have with it. After all, it is the best-selling video game of all time. Even people who haven’t played the game can evaluate which blocky representation of a pineapple is better realized.

“Minecraft allows people to see the progress [of AI development] much more easily,” Singh told TechCrunch. “People are used to Minecraft, used to the look and the vibe.”

MC-Bench currently lists eight people as volunteer contributors. Anthropic, Google, OpenAI, and Alibaba have subsidized the project’s use of their products to run benchmark prompts, per MC-Bench’s website, but the companies are not otherwise affiliated.

“Currently we are just doing simple builds to reflect on how far we’ve come from the GPT-3 era, but [we] could see ourselves scaling to these longer-form plans and goal-oriented tasks,” Singh said. “Games might just be a medium to test agentic reasoning that is safer than in real life and more controllable for testing purposes, making it more ideal in my eyes.”

Other games like Pokémon Red, Street Fighter, and Pictionary have been used as experimental benchmarks for AI, in part because the art of benchmarking AI is notoriously tricky.

Researchers often test AI models on standardized evaluations, but many of these tests give AI a home-field advantage. Because of the way they’re trained, models are naturally gifted at certain, narrow kinds of problem-solving, particularly problem-solving that requires rote memorization or basic extrapolation.

Put simply, it’s hard to glean what it means that OpenAI’s GPT-4 can score in the 88th percentile on the LSAT, but cannot discern how many Rs are in the word “strawberry.” Anthropic’s Claude 3.7 Sonnet achieved 62.3% accuracy on a standardized software engineering benchmark, but it is worse at playing Pokémon than most five-year-olds.

MC-Bench is technically a programming benchmark, since the models are asked to write code to create the prompted build, like “Frosty the Snowman” or “a charming tropical beach hut on a pristine sandy shore.”

But it’s easier for most MC-Bench users to evaluate whether a snowman looks better than to dig into code, which gives the project wider appeal — and thus the potential to collect more data about which models consistently score better.
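
To make the idea of "which models consistently score better" concrete, here is a minimal Python sketch of how blind head-to-head votes could be tallied into a ranking. The article does not describe how MC-Bench actually computes its leaderboard, so the vote format, the function names, and the placeholder model names below are all assumptions for illustration only.

```python
from collections import defaultdict


def win_rates(votes):
    """Return each model's share of head-to-head matchups won.

    `votes` is an iterable of (model_a, model_b, winner) tuples, where
    `winner` is one of the two model names. This record format is an
    assumption for illustration, not MC-Bench's actual schema.
    """
    wins = defaultdict(int)
    games = defaultdict(int)
    for model_a, model_b, winner in votes:
        if winner not in (model_a, model_b):
            raise ValueError(f"winner {winner!r} did not take part in this matchup")
        games[model_a] += 1
        games[model_b] += 1
        wins[winner] += 1
    return {model: wins[model] / games[model] for model in games}


def leaderboard(votes):
    """Sort models by win rate, highest first; ties break alphabetically."""
    rates = win_rates(votes)
    return sorted(rates.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical votes; the model names are placeholders, not real results.
sample_votes = [
    ("model-a", "model-b", "model-a"),
    ("model-a", "model-c", "model-c"),
    ("model-b", "model-c", "model-c"),
    ("model-a", "model-b", "model-a"),
]

for name, rate in leaderboard(sample_votes):
    print(f"{name}: {rate:.0%}")
```

A raw win rate ignores how strong each opponent was, so a real leaderboard would likely need a rating scheme that accounts for matchup difficulty. The sketch only shows why more votes give a steadier picture of which models come out ahead.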

Whether those scores amount to much in the way of AI usefulness is up for debate, of course. Singh asserts that they’re a strong signal, though.

“The current leaderboard reflects quite closely to my own experience of using these models, which is unlike a lot of pure text benchmarks,” Singh said. “Maybe [MC-Bench] could be useful to companies to know if they’re heading in the right direction.”
