TechCrunch News, November 5, 2024
Can Pictionary and Minecraft test AI models’ ingenuity?

As AI technology develops, some AI enthusiasts have begun using games to test AI problem-solving ability. For example, Paul Calcraft built a Pictionary-like game in which two AI models play against each other, while 16-year-old Adonis Singh created a tool called Mcbench that has AI models design structures in Minecraft. These game-based tests aim to evaluate AI reasoning and help developers understand how models perform in different scenarios. Compared with traditional AI benchmarks, games are more challenging and can force models to "think" beyond their training data. Game-based testing is not perfect, but it offers a fresh approach to evaluating AI's logical and multimodal capabilities.

🤔 Paul Calcraft built a Pictionary-like game in which two AI models play together: one model draws and the other guesses, testing the models' grasp of concepts such as shapes, colors, and prepositions.

🎮 16-year-old Adonis Singh built Mcbench, a tool that lets AI models control a Minecraft character and design structures, testing resourcefulness and agency; he considers Minecraft more challenging than other benchmarks.

💡 The advantage of game-based testing is that it offers an intuitive, visual way to compare how different AI models perform, for example in logical reasoning and decision-making.

🎲 Game-based testing is not perfect: some researchers argue that although games like Minecraft look closer to the real world, from a problem-solving perspective they are not fundamentally different from other games.

⚠️ AI models generally struggle to adapt to new environments and to solve problems they have not seen before; a model that excels at Minecraft will not necessarily play Doom well.

Most AI benchmarks don’t tell us much. They ask questions that can be solved with rote memorization, or cover topics that aren’t relevant to the majority of users.

So some AI enthusiasts are turning to games as a way to test AIs’ problem-solving skills.

Paul Calcraft, a freelance AI developer, has built an app where two AI models can play a Pictionary-like game with each other. One model doodles, while the other model tries to guess what the doodle represents.
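Calcraft hasn't published implementation details, but the structure of the game — two models, one drawing and one guessing — can be sketched with hypothetical stand-ins (a real version would replace `drawer` and `guesser` with calls to two separate LLM APIs and pass real image data):

```python
import random

PROMPTS = ["cat", "bicycle", "pelican"]

def drawer(secret_word: str) -> str:
    """Drawer role: returns a 'doodle' (here, a crude textual sketch).
    A real implementation would return SVG or canvas-drawing commands."""
    strokes = {"cat": "circle head, triangle ears, whisker lines",
               "bicycle": "two circles, connecting frame lines",
               "pelican": "oval body, long beak wedge, stick legs"}
    return strokes[secret_word]

def guesser(doodle: str, choices: list[str]) -> str:
    """Guesser role: picks the candidate whose signature stroke best
    matches the doodle (a keyword heuristic, not a real LLM)."""
    signatures = {"cat": "ears", "bicycle": "frame", "pelican": "beak"}
    for word, sig in signatures.items():
        if word in choices and sig in doodle:
            return word
    return random.choice(choices)

def play_round(secret_word: str, choices: list[str]) -> bool:
    """One round: drawer doodles the secret word, guesser names it.
    Returns True if the guesser got it right."""
    doodle = drawer(secret_word)
    return guesser(doodle, choices) == secret_word

if __name__ == "__main__":
    wins = sum(play_round(w, PROMPTS) for w in PROMPTS)
    print(f"{wins}/{len(PROMPTS)} rounds won")  # → 3/3 rounds won
```

The point of the structure, as in Calcraft's app, is that neither role can succeed alone: a win requires the drawer to convey a concept and the guesser to decode it.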

“I thought this sounded super fun and potentially interesting from a model capabilities point of view,” Calcraft told TechCrunch in an interview. “So I sat indoors on a cloudy Saturday and got it done.”

Calcraft was inspired by a similar project by British programmer Simon Willison that tasked models with rendering a vector drawing of a pelican riding a bicycle. Willison, like Calcraft, chose a challenge he believed would force models to “think” beyond the contents of their training data.
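Part of what makes Willison's pelican test appealing is that the output is machine-checkable. One coarse first check — an assumption for illustration, not part of Willison's actual harness — is simply whether a model's response parses as well-formed SVG at all:

```python
import xml.etree.ElementTree as ET

def is_valid_svg(text: str) -> bool:
    """Check that a model's output parses as XML with an <svg> root.
    This verifies only well-formedness, not whether the drawing
    actually resembles a pelican riding a bicycle."""
    try:
        root = ET.fromstring(text)
    except ET.ParseError:
        return False
    # SVG elements are namespaced ('{http://www.w3.org/2000/svg}svg'),
    # so strip the namespace prefix before comparing the tag name.
    return root.tag.rsplit("}", 1)[-1] == "svg"

sample = (
    '<svg xmlns="http://www.w3.org/2000/svg" width="100" height="100">'
    '<ellipse cx="50" cy="40" rx="20" ry="12"/>'   # body
    '<circle cx="35" cy="80" r="10"/>'             # front wheel
    '<circle cx="65" cy="80" r="10"/>'             # rear wheel
    '</svg>'
)
print(is_valid_svg(sample))  # → True
```

Judging whether the picture actually looks like a pelican on a bicycle still requires a human (or another model) in the loop, which is exactly the gap the guessing role in Calcraft's game fills.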

Image Credits: Paul Calcraft

“The idea is to have a benchmark that’s un-gameable,” Calcraft said. “A benchmark that can’t be beaten by memorizing specific answers or simple patterns that have been seen before during training.”

Minecraft is in this “un-gameable” category as well, or so believes 16-year-old Adonis Singh. He’s created a tool, Mcbench, that gives a model control over a Minecraft character and tests its ability to design structures, along the lines of Microsoft’s Project Malmo.

“I believe Minecraft tests the models on resourcefulness and gives them more agency,” he told TechCrunch. “It’s not nearly as restricted and saturated as [other] benchmarks.”

Using games to benchmark AI is nothing new. The idea dates back decades: Mathematician Claude Shannon argued in 1949 that games like chess were a worthy challenge for “intelligent” software. More recently, Alphabet’s DeepMind developed a model that could play Pong and Breakout; OpenAI trained AI to compete in Dota 2 matches; and Meta designed an algorithm that could hold its own against professional Texas hold ’em players.

But what’s different now is that enthusiasts are hooking up large language models (LLMs) — models with the ability to analyze text, images and more — to games to probe how good they are at logic.

There’s an abundance of LLMs out there, from Gemini and Claude to GPT-4o, and they all have different “vibes,” so to speak. They “feel” different from one interaction to the next — a phenomenon that can be difficult to quantify.

Note the typo; there’s no such model as Claude 3.6 Sonnet. Image Credits: Adonis Singh

“LLMs are known to be sensitive to particular ways questions are asked, and just generally unreliable and hard to predict,” Calcraft said.

In contrast to text-based benchmarks, games provide a visual, intuitive way to compare how a model performs and behaves, said Matthew Guzdial, an AI researcher and professor at the University of Alberta.

“We can think of every benchmark as giving us a different simplification of reality focused on particular types of problems, like reasoning or communication,” he said. “Games are just other ways you can do decision-making with AI, so folks are using them like any other approach.”

Those familiar with the history of generative AI will note how similar Pictionary is to generative adversarial networks (GANs), in which a creator model sends images to a discriminator model that then evaluates them.
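The analogy is structural rather than literal: in a GAN the two roles are trained jointly via gradients, whereas in LLM Pictionary both models are frozen. A heavily simplified caricature of the adversarial setup — with a fixed threshold standing in for a learned discriminator and a crude nudge standing in for a gradient step, both assumptions for illustration only — looks like this:

```python
import random

random.seed(0)  # make the toy run repeatable

def generator(z: float, theta: float) -> float:
    """Creator: maps noise z to a sample; theta is its single parameter."""
    return z + theta

def discriminator(x: float, boundary: float = 2.5) -> bool:
    """Evaluator: True means the sample 'looks real' (a fixed rule here,
    whereas a real GAN discriminator is itself a trained model)."""
    return x > boundary

# 'Real' data lives around 5.0; the generator starts out producing
# samples around 0.0 and is nudged whenever the evaluator rejects one.
theta = 0.0
for _ in range(500):
    sample = generator(random.gauss(0.0, 1.0), theta)
    if not discriminator(sample):
        theta += 0.1  # crude update in place of a gradient step

print(round(theta, 1))  # theta has drifted toward the 'real' region
```

The creator improves only because the evaluator keeps rejecting its output — the same pressure the guessing model puts on the drawing model in Pictionary.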

Calcraft believes that Pictionary can capture an LLM’s ability to understand concepts like shapes, colors and prepositions (e.g., the meaning of “in” versus “on”). He wouldn’t go so far as to say that the game is a reliable test of reasoning, but he argued that winning requires strategy and the ability to understand clues — neither of which models find easy.

“I also really like the almost adversarial nature of the Pictionary game, similar to GANs, where you have the two different roles: one draws and the other guesses,” he said. “The best one to draw is not the most artistic, but the one that can most clearly convey the idea to the audience of other LLMs (including to the faster, much less capable models!).”

“Pictionary is a toy problem that’s not immediately practical or realistic,” Calcraft cautioned. “That said, I do think spatial understanding and multimodality are critical elements for AI advancement, so LLM Pictionary could be a small, early step on that journey.”

Image Credits: Adonis Singh

Singh believes that Minecraft is a useful benchmark, too, and can measure reasoning in LLMs. “From the models I’ve tested so far, the results literally perfectly align with how much I trust the model for something reasoning-related,” he said.

Others aren’t so sure.

Mike Cook, a research fellow at Queen Mary University of London specializing in AI, doesn’t think Minecraft is particularly special as an AI testbed.

“I think some of the fascination with Minecraft comes from people outside of the games sphere who maybe think that, because it looks like ‘the real world,’ it has a closer connection to real-world reasoning or action,” Cook told TechCrunch. “From a problem-solving perspective, it’s not so dissimilar to a video game like Fortnite, Stardew Valley or World of Warcraft. It’s just got a different dressing on top that makes it look more like an everyday set of tasks like building things or exploring.”

To Cook’s point, even the best game-playing AI systems generally don’t adapt well to new environments, and can’t easily solve problems they haven’t seen before. For example, it’s unlikely a model that excels at Minecraft will play Doom with any real skill.

“I think the good qualities Minecraft does have from an AI perspective are extremely weak reward signals and a procedural world, which means unpredictable challenges,” Cook continued. “But it’s not really that much more representative of the real world than any other video game.”

Even so, there sure is something fascinating about watching LLMs build castles.
