TechCrunch News, December 10, 2024
A test for AGI is closer to being solved — but it may be flawed


A well-known test for artificial general intelligence (AGI) is closer to being solved. But the test’s creators say this points to flaws in the test’s design, rather than a bona fide research breakthrough.

In 2019, Francois Chollet, a leading figure in the AI world, introduced the ARC-AGI benchmark, short for “Abstraction and Reasoning Corpus for Artificial General Intelligence.” Designed to evaluate whether an AI system can efficiently acquire new skills outside the data it was trained on, ARC-AGI, Chollet claims, remains the only AI test to measure progress towards general intelligence (although others have been proposed).

Until this year, the best-performing AI could only solve just under a third of the tasks in ARC-AGI. Chollet blamed the industry’s focus on large language models (LLMs), which he believes aren’t capable of actual “reasoning.”

“LLMs struggle with generalization, due to being entirely reliant on memorization,” he said in a series of posts on X in February. “They break down on anything that wasn’t in their training data.”

To Chollet’s point, LLMs are statistical machines. Trained on a lot of examples, they learn patterns in those examples to make predictions, like that “to whom” in an email typically precedes “it may concern.”
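That statistical idea can be shown at toy scale with bigram counts; the corpus and function below are invented purely for illustration, not a description of how any real LLM is built:

```python
from collections import Counter, defaultdict

# Toy next-word predictor: count which word follows which in a corpus,
# then predict the most frequent follower. This is the same
# pattern-from-examples idea as LLM token prediction, at a minuscule scale.
corpus = (
    "to whom it may concern "
    "to whom it may concern "
    "to whom do I address this"
).split()

followers = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    followers[prev][nxt] += 1

def predict(word):
    """Return the word most often seen right after `word` in the corpus."""
    return followers[word].most_common(1)[0][0]

print(predict("whom"))  # -> "it"
```

Having seen “whom it” twice and “whom do” once, the model predicts “it” — it has memorized a frequency, not understood an email salutation.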

Chollet asserts that while LLMs might be capable of memorizing “reasoning patterns,” it’s unlikely that they can generate “new reasoning” based on novel situations. “If you need to be trained on many examples of a pattern, even if it’s implicit, in order to learn a reusable representation for it, you’re memorizing,” Chollet argued in another post.

To incentivize research beyond LLMs, in June, Chollet and Zapier co-founder Mike Knoop launched a $1 million competition to build open source AI capable of beating ARC-AGI. Out of 17,789 submissions, the best scored 55.5% — ~20% higher than 2023’s top scorer, albeit short of the 85% “human-level” threshold required to win.

This doesn’t mean we’re ~20% closer to AGI, though, Knoop says.

In a blog post, Knoop said that many of the submissions to ARC-AGI have been able to “brute force” their way to a solution, suggesting that a “large fraction” of ARC-AGI tasks “[don’t] carry much useful signal towards general intelligence.”
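A hypothetical sketch of what “brute forcing” can look like here (the transformation library and function names are invented for illustration): enumerate canned grid transformations and keep whichever one happens to reproduce every training pair, with no learning or reasoning involved.

```python
# Brute-force "solver" sketch: try each transformation in a fixed library
# against an ARC-style task's training pairs. If one matches, that says
# more about the task's simplicity than about general intelligence.
def flip_lr(grid):   # mirror each row left-to-right
    return [row[::-1] for row in grid]

def flip_ud(grid):   # mirror the rows top-to-bottom
    return grid[::-1]

def rot90(grid):     # rotate the grid 90 degrees clockwise
    return [list(col) for col in zip(*grid[::-1])]

CANDIDATES = {
    "identity": lambda g: g,
    "flip_lr": flip_lr,
    "flip_ud": flip_ud,
    "rot90": rot90,
}

def brute_force_solve(train_pairs):
    """Return the name of the first candidate matching every training pair."""
    for name, fn in CANDIDATES.items():
        if all(fn(x) == y for x, y in train_pairs):
            return name
    return None

# One training pair whose output is the input mirrored left-to-right.
pairs = [([[1, 2], [3, 4]], [[2, 1], [4, 3]])]
print(brute_force_solve(pairs))  # -> "flip_lr"
```

Real brute-force submissions search far larger program spaces, but the principle is the same: exhaustive trial can crack a task without exhibiting anything resembling skill acquisition.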

ARC-AGI consists of puzzle-like problems where an AI, given a grid of different-colored squares, has to generate the correct “answer” grid. The problems were designed to force an AI to adapt to new problems it hasn’t seen before. But it’s not clear they’re achieving this.
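Concretely, tasks in the public ARC dataset are distributed as JSON: a few “train” input/output grid pairs demonstrating a rule, plus “test” inputs whose outputs must be produced. A minimal example (the grids here are made up):

```python
import json

# An ARC-style task in the JSON layout used by the public ARC dataset.
# Each cell is an integer 0-9 denoting a color; the solver must infer the
# rule from the train pairs and apply it to the test input.
task = json.loads("""
{
  "train": [
    {"input": [[0, 1], [1, 0]], "output": [[1, 0], [0, 1]]},
    {"input": [[2, 0], [0, 2]], "output": [[0, 2], [2, 0]]}
  ],
  "test": [
    {"input": [[3, 0], [0, 3]]}
  ]
}
""")

for pair in task["train"]:
    print(pair["input"], "->", pair["output"])
```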

Tasks in the ARC-AGI benchmark. Models must solve ‘problems’ in the top row; the bottom row shows solutions. Image Credits: ARC-AGI

“[ARC-AGI] has been unchanged since 2019 and is not perfect,” Knoop acknowledged in his post.

Chollet and Knoop have also faced criticism for overselling ARC-AGI as a benchmark toward AGI — at a time when the very definition of AGI is being hotly contested. One OpenAI staff member recently claimed that AGI has “already” been achieved if one defines AGI as AI “better than most humans at most tasks.”

Knoop and Chollet say that they plan to release a second-gen ARC-AGI benchmark to address these issues, alongside a 2025 competition. “We will continue to direct the efforts of the research community towards what we see as the most important unsolved problems in AI, and accelerate the timeline to AGI,” Chollet wrote in an X post.

Fixes likely won’t come easy. If the first ARC-AGI test’s shortcomings are any indication, defining intelligence for AI will be as intractable — and inflammatory — as it has been for human beings.
