TechCrunch News, January 1
Will Smith eating spaghetti and other weird AI benchmarks that took off in 2024

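The article examines why a number of unofficial, quirky benchmarks became popular in AI in 2024, such as having AI generate a video of Will Smith eating spaghetti, build structures in Minecraft, and play games. These tests are not rigorous, but they caught on because they are entertaining and easy to understand. The article notes that industry-standard AI benchmarks are often hard for ordinary people to interpret, while crowdsourced tests like Chatbot Arena suffer from unrepresentative raters. One expert argues that the AI community should focus on AI's downstream impacts rather than its abilities in narrow domains, but these weird benchmarks, being entertaining and easy to understand, will likely remain popular.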

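🍝 Breakout benchmark: AI-generated video of Will Smith eating spaghetti has become an unofficial standard for gauging AI video generation. Will Smith himself has parodied the trend, which reflects public interest in AI's visual output.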

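🧱 Game-based tests: An app built by a 16-year-old developer that gives AI control over Minecraft, along with AI playing games such as Pictionary and Connect 4, highlights AI's decision-making and creative abilities in complex environments while also being more entertaining.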

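📊 Limits of industry benchmarks: Industry-standard AI benchmarks, such as Math Olympiad exams or Ph.D.-level problems, are hard for ordinary users to interpret, while crowdsourced tests like Chatbot Arena suffer from rater bias and do not truly reflect how AI performs in real-world use.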

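💡 Focus on downstream impacts: An expert recommends that the AI community focus on AI's real-world applications and downstream impacts rather than only its abilities in narrow domains, reflecting thinking about the direction of AI development and a concern for social responsibility.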

When a company releases a new AI video generator, it’s not long before someone uses it to make a video of actor Will Smith eating spaghetti.

It’s become something of a meme as well as a benchmark: seeing whether a new video generator can realistically render Smith slurping down a bowl of noodles. Smith himself parodied the trend in an Instagram post in February.

Will Smith and pasta is but one of several bizarre “unofficial” benchmarks to take the AI community by storm in 2024. A 16-year-old developer built an app that gives AI control over Minecraft and tests its ability to design structures. Elsewhere, a British programmer created a platform where AI plays games like Pictionary and Connect 4 against each other.

It’s not like there aren’t more academic tests of an AI’s performance. So why did the weirder ones blow up?

Image Credits: Paul Calcraft

For one, many of the industry-standard AI benchmarks don’t tell the average person very much. Companies often cite their AI’s ability to answer questions on Math Olympiad exams, or figure out plausible solutions to Ph.D.-level problems. Yet most people — yours truly included — use chatbots for things like responding to emails and basic research.

Crowdsourced industry measures aren’t necessarily better or more informative.

Take, for example, Chatbot Arena, a public benchmark many AI enthusiasts and developers follow obsessively. Chatbot Arena lets anyone on the web rate how well AI performs on particular tasks, like creating a web app or generating an image. But raters tend not to be representative — most come from AI and tech industry circles — and cast their votes based on personal, hard-to-pin-down preferences.

The Chatbot Arena interface. Image Credits: LMSYS

Ethan Mollick, a professor of management at Wharton, recently pointed out in a post on X another problem with many AI industry benchmarks: they don’t compare a system’s performance to that of the average person.

“The fact that there are not 30 different benchmarks from different organizations in medicine, in law, in advice quality, and so on is a real shame, as people are using systems for these things, regardless,” Mollick wrote.

Weird AI benchmarks like Connect 4, Minecraft, and Will Smith eating spaghetti are most certainly not empirical — or even all that generalizable. Just because an AI nails the Will Smith test doesn’t mean it’ll generate, say, a burger well.

Note the typo; there’s no such model as Claude 3.6 Sonnet. Image Credits: Adonis Singh

One expert I spoke to about AI benchmarks suggested that the AI community focus on the downstream impacts of AI instead of its ability in narrow domains. That’s sensible. But I have a feeling that weird benchmarks aren’t going away anytime soon. Not only are they entertaining — who doesn’t like watching AI build Minecraft castles? — but they’re easy to understand. And as my colleague Max Zeff wrote about recently, the industry continues to grapple with distilling a technology as complex as AI into digestible marketing.

The only question in my mind is, which odd new benchmarks will go viral in 2025?
