Mashable · March 26, 01:54
A new AI test is outwitting OpenAI, Google models, among others

 

A new benchmark called ARC-AGI-2 shows that today's leading AI models still face major hurdles on the path to artificial general intelligence (AGI). The test uses visual puzzles to probe models' pattern recognition, contextual understanding, and reasoning. Mainstream models from OpenAI, Google, DeepSeek, and others score poorly, indicating their limits on problems that demand generalization and the flexible application of knowledge. Experts disagree on the timeline for AGI, and warn of the risk that AI companies hype the concept while chasing major investment. The test highlights the gap between AI and human intelligence in efficiently acquiring new skills.

🧠 ARC-AGI-2 is a new benchmark for measuring AI progress toward general intelligence. It challenges models' problem-solving ability with visual puzzles that require pattern recognition, contextual understanding, and reasoning.

📊 OpenAI's o3-low model scored 4% on ARC-AGI-2, while Google's Gemini 2.0 Flash and DeepSeek R1 each scored 1.3%. Anthropic's Claude 3.7 scored 0.9%. This indicates that current AI models struggle with problems that require generalization.

🤔 ARC-AGI-2 is designed to avoid the trap of models scoring well through rote memorization, focusing instead on puzzles that are relatively easy for humans to solve. This exposes AI's gap in generalizing from limited experience and applying knowledge in new situations.

🗣️ Experts disagree on the timeline for AGI: some believe it could arrive within a few years, while others argue the technology is not yet mature. The article also highlights the risk that AI companies hype AGI to attract investment.

Google, OpenAI, DeepSeek, et al. are nowhere near achieving AGI (Artificial General Intelligence), according to a new benchmark.

The Arc Prize Foundation, a nonprofit that measures AGI progress, has a new benchmark that is stumping the leading AI models. The test, called ARC-AGI-2, is the second edition of the ARC-AGI benchmark; it tests models' general intelligence by challenging them to solve visual puzzles using pattern recognition, context clues, and reasoning.

According to the ARC-AGI leaderboard, OpenAI's most advanced model, o3-low, scored 4 percent. Google's Gemini 2.0 Flash and DeepSeek R1 both scored 1.3 percent. Anthropic's most advanced model, Claude 3.7 with an 8K token limit (the number of tokens used to process an answer), scored 0.9 percent.

The question of how and when AGI will be achieved remains as heated as ever, with various factions bickering about the timeline or whether it's even possible. Anthropic CEO Dario Amodei has said it could take as little as two to three years, and OpenAI CEO Sam Altman has said "it's achievable with current hardware." But experts like Gary Marcus and Yann LeCun say the technology isn't there yet, and it doesn't take an expert to see how fueling AGI hype is advantageous to AI companies seeking major investments.

The ARC-AGI benchmark is designed to challenge AI models beyond specialized intelligence by avoiding the memorization trap: spewing out PhD-level responses without understanding what they mean. Instead, it focuses on puzzles that are relatively easy for humans to solve because of our innate ability to take in new information and make inferences, thus revealing gaps that can't be closed by simply feeding AI models more data.

"Intelligence requires the ability to generalize from limited experience and apply knowledge in new, unexpected situations. AI systems are already superhuman in many specific domains (e.g., playing Go and image recognition)" read the announcement.

"However, these are narrow, specialized capabilities. The 'human-ai gap' reveals what's missing for general intelligence - highly efficiently acquiring new skills."

To get a sense of AI models' current limitations, you can take the ARC-AGI test for yourself. And you might be surprised by its simplicity. There's some critical thinking involved, but the ARC-AGI test wouldn't be out of place next to the New York Times crossword puzzle, Wordle, or any of the other popular brain teasers. It's challenging but not impossible and the answer is there in the puzzle's logic, which is something the human brain has evolved to interpret.
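To make the puzzle format concrete, here is a hypothetical, heavily simplified sketch of what an ARC-style task looks like in code: a few input-to-output grid pairs demonstrate a hidden rule, and the solver must infer that rule and apply it to a fresh input. Real ARC-AGI-2 tasks are far harder and use a much larger rule space; the candidate rules and grids below are invented for illustration only.

```python
# Miniature ARC-style task: infer a grid transformation from demo
# pairs, then apply it to a new input. Purely illustrative.

def flip_horizontal(grid):
    """Candidate rule: mirror each row left-to-right."""
    return [row[::-1] for row in grid]

def transpose(grid):
    """Candidate rule: swap rows and columns."""
    return [list(col) for col in zip(*grid)]

CANDIDATE_RULES = [flip_horizontal, transpose]

def infer_rule(examples):
    """Return the first candidate rule consistent with every demo pair."""
    for rule in CANDIDATE_RULES:
        if all(rule(inp) == out for inp, out in examples):
            return rule
    return None

# Two demonstration pairs encoding a left-right mirror.
examples = [
    ([[1, 0], [2, 3]], [[0, 1], [3, 2]]),
    ([[5, 6, 7]], [[7, 6, 5]]),
]

rule = infer_rule(examples)
print(rule([[4, 0, 9]]))  # → [[9, 0, 4]]
```

A human glances at the two demo pairs and "sees" the mirror immediately; the benchmark's point is that models which only memorize cannot do this kind of inference from a handful of examples.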

OpenAI's o3-low model scored 75.7 percent on the first edition of ARC-AGI. By comparison, its 4 percent score on the second edition shows not only how difficult the new test is, but also how much work remains before AI reaches human-level intelligence.
