TechCrunch News · April 7, 05:21
Meta’s benchmarks for its new AI models are a bit misleading

Maverick, Meta's newly released AI model, ranks second on the LM Arena test, but researchers have found that the Maverick on LM Arena differs from the version available to developers. The LM Arena version was optimized for conversation, so the actual release may behave differently. This discrepancy makes it hard for developers to predict how the model will perform in specific scenarios. The article argues that the practice is potentially misleading and underscores the limitations of benchmark testing.

🤔 Meta's Maverick performed strongly on LM Arena, ranking second, but the version tested differs from the one available to developers.

🗣️ The Maverick on LM Arena is an "experimental chat version" optimized for conversation; the actual release may not carry this optimization.

📈 LM Arena is not an absolutely reliable measure of AI model performance, and AI companies generally have not customized or fine-tuned their models specifically to score better on it, or at least have not admitted to doing so.

🧐 The discrepancy makes it hard for developers to predict how the model will perform in different scenarios, and it is potentially misleading, since benchmarks are meant to offer an overview of a model's strengths and weaknesses across a range of tasks.

💬 Researchers have observed stark behavioral differences between the Maverick on LM Arena and the publicly downloadable version; for example, the LM Arena version tends to use more emojis and give long-winded answers.

One of the new flagship AI models Meta released on Saturday, Maverick, ranks second on LM Arena, a test that has human raters compare the outputs of models and choose which they prefer. But it seems the version of Maverick that Meta deployed to LM Arena differs from the version that’s widely available to developers.
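For context on how that ranking comes about: arena-style leaderboards of this kind typically aggregate many anonymized pairwise human votes into Elo-style ratings. The sketch below is a minimal illustration of that idea, assuming a plain Elo update; the model names, K-factor, and vote data are hypothetical, and this is not LM Arena's actual code.

```python
from collections import defaultdict

# Minimal Elo-style aggregation of pairwise preference votes,
# illustrating how an arena-style leaderboard turns "which answer
# do you prefer?" votes into a ranking. Illustrative sketch only.

K = 32               # update step size (illustrative choice)
BASE_RATING = 1000   # every model starts at the same rating

ratings = defaultdict(lambda: BASE_RATING)

def record_vote(winner: str, loser: str) -> None:
    """Update both models' ratings after one human preference vote."""
    # Expected score of the winner under the Elo model.
    expected = 1 / (1 + 10 ** ((ratings[loser] - ratings[winner]) / 400))
    ratings[winner] += K * (1 - expected)
    ratings[loser] -= K * (1 - expected)

# A few hypothetical votes between anonymized models.
votes = [
    ("maverick-experimental", "model-b"),
    ("maverick-experimental", "model-c"),
    ("model-b", "maverick-experimental"),
]
for winner, loser in votes:
    record_vote(winner, loser)

# Leaderboard: highest rating first.
for model, rating in sorted(ratings.items(), key=lambda kv: -kv[1]):
    print(f"{model}: {rating:.1f}")
```

Because the ranking depends entirely on which answers raters happen to prefer, a variant tuned to be chattier or more emoji-heavy can climb the board without being a better general-purpose model, which is exactly the concern the article raises.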

As several AI researchers pointed out on X, Meta noted in its announcement that the Maverick on LM Arena is an “experimental chat version.” A chart on the official Llama website, meanwhile, discloses that Meta’s LM Arena testing was conducted using “Llama 4 Maverick optimized for conversationality.”

As we’ve written about before, for various reasons, LM Arena has never been the most reliable measure of an AI model’s performance. But AI companies generally haven’t customized or otherwise fine-tuned their models to score better on LM Arena — or haven’t admitted to doing so, at least.

The problem with tailoring a model to a benchmark, withholding it, and then releasing a “vanilla” variant of that same model is that it makes it challenging for developers to predict exactly how well the model will perform in particular contexts. It’s also misleading. Ideally, benchmarks — woefully inadequate as they are — provide a snapshot of a single model’s strengths and weaknesses across a range of tasks.

Indeed, researchers on X have observed stark differences in the behavior of the publicly downloadable Maverick compared with the model hosted on LM Arena. The LM Arena version seems to use a lot of emojis and give incredibly long-winded answers.

We’ve reached out to Meta and Chatbot Arena, the organization that maintains LM Arena, for comment.


Related tags

Meta, AI models, LM Arena, Maverick, benchmark testing