TechCrunch News, April 10, 21:43
The rise of AI ‘reasoning’ models is making benchmarking more expensive

This article examines the rapidly rising cost of benchmarking AI reasoning models. While reasoning models outperform non-reasoning models on certain tasks, their high evaluation costs make independent verification difficult. By comparing the testing bills for different models, the article shows that reasoning models cost far more to benchmark than non-reasoning ones. It also analyzes why the costs are so high, pointing to the large number of tokens reasoning models generate and the demands of more complex benchmarks. Finally, it notes the potential problems created by AI labs offering free or subsidized testing access and underscores the importance of independent evaluation.

🧐 AI reasoning models show stronger capabilities on certain tasks, but benchmarking them costs significantly more than benchmarking non-reasoning models. For example, evaluating OpenAI’s o1 reasoning model cost $2,767.05, while evaluating the non-reasoning GPT-4o cost just $108.85.

💡 The high cost of reasoning models stems mainly from the volume of tokens they generate. Tokens represent bits of raw text, and during testing reasoning models produced roughly eight times as many tokens as GPT-4o, driving up the evaluation bill.

📚 The complexity of modern benchmarks also adds to the cost. These tests typically include questions involving multi-step tasks, designed to evaluate a model’s ability to carry out real-world work such as writing and executing code or browsing the internet.

💸 The practice of AI labs giving benchmarking organizations free or subsidized access to their models may affect the objectivity of results. Even without manipulation, a lab’s involvement threatens the integrity of the evaluation.

AI labs like OpenAI claim that their so-called “reasoning” AI models, which can “think” through problems step by step, are more capable than their non-reasoning counterparts in specific domains, such as physics. But while this generally appears to be the case, reasoning models are also much more expensive to benchmark, making it difficult to independently verify these claims.

According to data from Artificial Analysis, a third-party AI testing outfit, it costs $2,767.05 to evaluate OpenAI’s o1 reasoning model across a suite of seven popular AI benchmarks: MMLU-Pro, GPQA Diamond, Humanity’s Last Exam, LiveCodeBench, SciCode, AIME 2024, and MATH-500.

Benchmarking Anthropic’s recent Claude 3.7 Sonnet, a “hybrid” reasoning model, on the same set of tests cost $1,485.35, while testing OpenAI’s o3-mini-high cost $344.59, per Artificial Analysis.

Some reasoning models are cheaper to benchmark than others. Artificial Analysis spent $141.22 evaluating OpenAI’s o1-mini, for example. But on average, they tend to be pricey. All told, Artificial Analysis has spent roughly $5,200 evaluating around a dozen reasoning models, close to twice the amount the firm spent analyzing over 80 non-reasoning models ($2,400).

OpenAI’s non-reasoning GPT-4o model, released in May 2024, cost Artificial Analysis just $108.85 to evaluate, while Claude 3.6 Sonnet — Claude 3.7 Sonnet’s non-reasoning predecessor — cost $81.41.

Artificial Analysis co-founder George Cameron told TechCrunch that the organization plans to increase its benchmarking spend as more AI labs develop reasoning models.

“At Artificial Analysis, we run hundreds of evaluations monthly and devote a significant budget to these,” Cameron said. “We are planning for this spend to increase as models are more frequently released.”

Artificial Analysis isn’t the only outfit of its kind that’s dealing with rising AI benchmarking costs.

Ross Taylor, the CEO of AI startup General Reasoning, said he recently spent $580 evaluating Claude 3.7 Sonnet on around 3,700 unique prompts. Taylor estimates a single run-through of MMLU Pro, a question set designed to benchmark a model’s language comprehension skills, would have cost more than $1,800.

“We’re moving to a world where a lab reports x% on a benchmark where they spend y amount of compute, but where resources for academics are [far less than] y,” said Taylor in a recent post on X. “[N]o one is going to be able to reproduce the results.”

Why are reasoning models so expensive to test? Mainly because they generate a lot of tokens. Tokens represent bits of raw text, such as the word “fantastic” split into the syllables “fan,” “tas,” and “tic.” According to Artificial Analysis, OpenAI’s o1 generated over 44 million tokens during the firm’s benchmarking tests, around eight times the amount GPT-4o generated.
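For readers who want to see what a token is in practice, here is a minimal sketch added for illustration (not part of the original reporting) that uses OpenAI’s open source tiktoken library to split a string into tokens and count them. The exact split depends on the tokenizer, so a given word may become one token or several; the o200k_base encoding used below is an assumption about which encoding applies to GPT-4o-era models.

```python
# Minimal tokenization sketch using OpenAI's open source tiktoken library.
# Assumes tiktoken >= 0.7, which ships the o200k_base encoding.
import tiktoken

enc = tiktoken.get_encoding("o200k_base")

text = "Reasoning models think through problems step by step."
token_ids = enc.encode(text)

# Show each token id next to the text fragment it stands for.
for tid in token_ids:
    piece = enc.decode_single_token_bytes(tid).decode("utf-8", errors="replace")
    print(tid, repr(piece))

print(f"{len(token_ids)} tokens for a {len(text)}-character string")
```

Because providers bill per token, a reasoning model that emits long chains of intermediate “thinking” text is charged for every one of those tokens, which is how a 44-million-token benchmarking run turns into a large invoice.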

The vast majority of AI companies charge for model usage by the token, so you can see how this cost can add up.
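To make the arithmetic concrete, here is a rough, hedged estimate added for illustration: multiplying the article’s 44-million-token figure by an assumed list price of $60 per million output tokens for o1 lands in the same ballpark as the $2,767.05 bill Artificial Analysis reports, with input tokens and prompt overhead accounting for much of the remainder.

```python
# Back-of-the-envelope benchmarking cost under per-token billing.
# The token count comes from the article; the per-million-token price is
# an assumed list rate for o1 and may differ from what a tester actually pays.
OUTPUT_TOKENS = 44_000_000           # tokens o1 generated across the benchmark suite
PRICE_PER_MILLION_OUTPUT_USD = 60.0  # assumed price per 1M output tokens

output_cost = OUTPUT_TOKENS / 1_000_000 * PRICE_PER_MILLION_OUTPUT_USD
print(f"Estimated output-token cost: ${output_cost:,.2f}")  # ~ $2,640.00
```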

Modern benchmarks also tend to elicit a lot of tokens from models because they contain questions involving complex, multi-step tasks, according to Jean-Stanislas Denain, a senior researcher at Epoch AI, which develops its own model benchmarks.

“[Today’s] benchmarks are more complex [even though] the number of questions per benchmark has overall decreased,” Denain told TechCrunch. “They often attempt to evaluate models’ ability to do real-world tasks, such as write and execute code, browse the internet, and use computers.”

Denain added that the most expensive models have gotten more expensive per token over time. For example, Anthropic’s Claude 3 Opus was the priciest model when it was released in March 2024, costing $70 per million output tokens. OpenAI’s GPT-4.5 and o1-pro, both of which launched earlier this year, cost $150 per million output tokens and $600 per million output tokens, respectively.

“[S]ince models have gotten better over time, it’s still true that the cost to reach a given level of performance has greatly decreased over time,” Denain said. “But if you want to evaluate the best largest models at any point in time, you’re still paying more.”

Many AI labs, including OpenAI, give benchmarking organizations free or subsidized access to their models for testing purposes. But this colors the results, some experts say — even if there’s no evidence of manipulation, the mere suggestion of an AI lab’s involvement threatens to harm the integrity of the evaluation scores.

“From [a] scientific point of view, if you publish a result that no one can replicate with the same model, is it even science anymore?” wrote Taylor in a follow-up post on X. “Was it ever science?”
