TechCrunch News — March 25, 08:36
A new, challenging AGI test stumps most AI models

The Arc Prize Foundation has released ARC-AGI-2, a new test of general AI intelligence designed to measure how efficiently AI models can acquire new skills. The test consists of visual pattern-recognition puzzles that require models to adapt to problems they have never seen before. So far, leading AI models including GPT-4.5 and Claude 3.7 Sonnet score only around 1% on ARC-AGI-2, far below the 60% average achieved by human test-takers. ARC-AGI-2 introduces an efficiency metric and prevents models from solving problems through "brute force" computation; compared with the earlier test, it puts more weight on how efficiently and adaptively a model solves problems. The foundation also announced a 2025 contest challenging developers to reach 85% accuracy at low cost.

🧠 ARC-AGI-2, introduced by the Arc Prize Foundation, evaluates an AI model's general intelligence and its ability to adapt when solving novel problems; the test consists of visual pattern-recognition puzzles.

📊 Leading AI models, including OpenAI's o1-pro, DeepSeek's R1, GPT-4.5, Claude 3.7 Sonnet, and Gemini 2.0 Flash, currently score only around 1% on ARC-AGI-2, far below the human average of 60%.

💡 ARC-AGI-2 introduces an efficiency metric and requires models to interpret patterns on the fly rather than rely on memorization, preventing them from solving problems through "brute force" computation.

💰 The Arc Prize Foundation also announced the Arc Prize 2025 contest, challenging developers to reach 85% accuracy on ARC-AGI-2 at a cost of $0.42 per task.

🆚 ARC-AGI-2 was created to address a shortcoming of its predecessor, ARC-AGI-1, on which AI models could earn high scores simply by applying large amounts of computing resources.

The Arc Prize Foundation, a nonprofit co-founded by prominent AI researcher François Chollet, announced in a blog post on Tuesday that it has created a new, challenging test to measure the general intelligence of leading AI models.

So far, the new test, called ARC-AGI-2, has stumped most models.

“Reasoning” AI models like OpenAI’s o1-pro and DeepSeek’s R1 score between 1% and 1.3% on ARC-AGI-2, according to the Arc Prize leaderboard. Powerful non-reasoning models including GPT-4.5, Claude 3.7 Sonnet, and Gemini 2.0 Flash score around 1%.

The ARC-AGI tests consist of puzzle-like problems where an AI has to identify visual patterns from a collection of different-colored squares, and generate the correct “answer” grid. The problems were designed to force an AI to adapt to new problems it hasn’t seen before.
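To make the task format concrete, here is a minimal sketch of how an ARC-style puzzle can be represented, loosely following the public ARC data format (input/output grid pairs, where a grid is a 2D array of integers and each integer encodes a color). The specific puzzle and its "rule" below — mirror the grid horizontally — are an invented toy example, not an actual ARC-AGI-2 task:

```python
def solve(grid):
    """Toy solver for the invented mirror puzzle: flip each row left-to-right."""
    return [row[::-1] for row in grid]

# One task = a few demonstration pairs plus a held-out test input.
# Integers 0-9 stand for colors (0 is typically the background).
task = {
    "train": [
        {"input": [[1, 0], [2, 0]], "output": [[0, 1], [0, 2]]},
        {"input": [[3, 4, 0]],      "output": [[0, 4, 3]]},
    ],
    "test": [
        {"input": [[5, 0, 0], [0, 6, 0]]},
    ],
}

# A solver "passes" only if its answer grid matches the hidden output exactly.
for pair in task["train"]:
    assert solve(pair["input"]) == pair["output"]

print(solve(task["test"][0]["input"]))  # predicted answer grid
```

The difficulty for AI models is that each task uses a different, previously unseen rule, so the transformation must be inferred from just a handful of demonstration pairs rather than recalled from training data.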

The Arc Prize Foundation had over 400 people take ARC-AGI-2 to establish a human baseline. On average, “panels” of these people got 60% of the test’s questions right — much better than any of the models’ scores.

A sample question from ARC-AGI-2 (credit: Arc Prize).

In a post on X, Chollet claimed ARC-AGI-2 is a better measure of an AI model’s actual intelligence than the first iteration of the test, ARC-AGI-1. The Arc Prize Foundation’s tests are aimed at evaluating whether an AI system can efficiently acquire new skills outside the data it was trained on.

Chollet said that unlike ARC-AGI-1, the new test prevents AI models from relying on “brute force” — extensive computing power — to find solutions. Chollet previously acknowledged this was a major flaw of ARC-AGI-1.

To address the first test’s flaws, ARC-AGI-2 introduces a new metric: efficiency. It also requires models to interpret patterns on the fly instead of relying on memorization.

“Intelligence is not solely defined by the ability to solve problems or achieve high scores,” Arc Prize Foundation co-founder Greg Kamradt wrote in a blog post. “The efficiency with which those capabilities are acquired and deployed is a crucial, defining component. The core question being asked is not just, ‘Can AI acquire [the] skill to solve a task?’ but also, ‘At what efficiency or cost?’”

ARC-AGI-1 was unbeaten for roughly five years until December 2024, when OpenAI released its advanced reasoning model, o3, which outperformed all other AI models and matched human performance on the evaluation. However, as we noted at the time, o3’s performance gains on ARC-AGI-1 came with a hefty price tag.

The version of OpenAI's o3 model that was first to reach new heights on ARC-AGI-1, o3 (low), scored 75.7% on that test but got a measly 4% on ARC-AGI-2, while spending $200 worth of computing power per task.

Comparison of Frontier AI model performance on ARC-AGI-1 and ARC-AGI-2 (credit: Arc Prize).

The arrival of ARC-AGI-2 comes as many in the tech industry are calling for new, unsaturated benchmarks to measure AI progress. Hugging Face’s co-founder, Thomas Wolf, recently told TechCrunch that the AI industry lacks sufficient tests to measure the key traits of so-called artificial general intelligence, including creativity.

Alongside the new benchmark, the Arc Prize Foundation announced a new Arc Prize 2025 contest, challenging developers to reach 85% accuracy on the ARC-AGI-2 test while only spending $0.42 per task.
