Mashable · June 10
The illusion of thinking: Apple research finds AI models collapse and give up with hard puzzles

Apple's latest AI research shows that large reasoning models (LRMs) collapse when faced with complex problems. The study finds that these models outperform conventional LLMs on medium-difficulty problems but do worse on simple ones. More worryingly, when confronted with hard problems, LRMs fail entirely and give up prematurely. The findings echo reasoning flaws previously identified in LLMs and pour cold water on optimism about artificial general intelligence (AGI). The researchers evaluated LRMs' reasoning ability with classic logic puzzles such as the Tower of Hanoi; as problem complexity increased, model accuracy gradually declined until it collapsed completely. Even additional computing power did not improve LRM performance on complex problems. The study also found that, as they approach the collapse point, LRMs actually reduce their reasoning "effort."

🤔 Apple's research finds that large reasoning models (LRMs) show clear limitations on complex problems: as problems grow more complex, their problem-solving ability drops off rapidly until it fails entirely.

🧩 The researchers tested LRMs' reasoning ability with classic logic puzzles such as the Tower of Hanoi. As problem complexity increases, model accuracy gradually declines until it reaches zero.

📉 On complex problems, LRMs not only lose accuracy but also reduce their reasoning "effort", i.e. the number of tokens spent thinking, as they approach the collapse point. Even when given the answer, the models' performance does not improve.

💡 The results suggest that while LRMs excel at tasks like math and coding, on more complex reasoning tasks they offer only an "illusion of thinking" and are no substitute for good, well-specified conventional algorithms.

New artificial intelligence research from Apple shows AI reasoning models may not be "thinking" so well after all.

According to a paper published just days before Apple's WWDC event, large reasoning models (LRMs) — like OpenAI o1 and o3, DeepSeek R1, Claude 3.7 Sonnet Thinking, and Google Gemini Flash Thinking — completely collapse when they're faced with increasingly complex problems. The paper comes from the same researchers who found other reasoning flaws in LLMs last year.

The news was a bucket of cold water for artificial general intelligence (AGI) optimists (and welcome news for AI and AGI skeptics), as Apple's research seemed to show damning evidence about the limitations of reasoning model intelligence. While the much-hyped LRMs performed better than LLMs on medium-difficulty puzzles, they performed worse on simple puzzles. And according to Apple's research, when they faced hard puzzles, they collapsed completely, giving up on the problem prematurely.

Or, as the Apple researchers put it, while AI models perform extremely well at math and coding, when it comes to more complex problems, they only provide "The Illusion of Thinking."

Apple was slow to develop large language models and implement AI in its devices, largely staying out of the conversation. The company has added Apple Intelligence AI features, though they have generally been considered underwhelming. With that in mind, this research might explain some of Apple's reticence to go all-in on AI, unlike Google and Samsung, which have frontloaded their devices with AI capabilities.

How Apple researchers tested reasoning skills

The problems researchers used to evaluate the reasoning models, which they call LRMs or Large Reasoning Models, are classic logic puzzles like the Tower of Hanoi. The puzzle consists of discs, stacked largest to smallest on one of three pegs, and the goal is to move the discs to the third peg without ever placing a larger disc on top of a smaller disc. Other puzzles included jumping checker pieces into empty spaces, the river-crossing problem (the one usually involving a fox, a chicken, and a bag of grain), and stacking blocks in a specific configuration.
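For context, the Tower of Hanoi has a well-known, well-specified solution: exactly the kind of conventional algorithm the article returns to later. Below is a minimal recursive sketch in Python; the function name, peg labels, and driver code are illustrative choices for this article, not the researchers' evaluation harness.

```python
def hanoi(n, source="A", spare="B", target="C"):
    """Yield the moves that transfer n discs from source to target.

    Classic recursive strategy: move the top n-1 discs out of the way,
    move the largest disc, then move the n-1 discs back on top of it.
    """
    if n == 0:
        return
    # Park the n-1 smaller discs on the spare peg.
    yield from hanoi(n - 1, source=source, spare=target, target=spare)
    # Move the largest remaining disc straight to the target.
    yield (source, target)
    # Bring the n-1 smaller discs over to the target.
    yield from hanoi(n - 1, source=spare, spare=source, target=target)


if __name__ == "__main__":
    moves = list(hanoi(5))
    print(len(moves), "moves for 5 discs")  # 31, i.e. 2**5 - 1
    print(moves[:4])                        # first few (from, to) steps
```

Following this recipe mechanically always solves the puzzle, which is what makes the models' failures on larger instances notable.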

You probably recognize these logic puzzles from math class or online games; they're a simple way of testing humans' ability to reason and problem-solve. Once you figure out the trick, it's just a matter of following the logic even as the complexity increases, which in this case means more discs, checkers, animals, or blocks. However, researchers found that LRMs start to fail after a certain point.

"Results show that all reasoning models exhibit a similar pattern with respect to complexity: accuracy progressively declines as problem complexity increases until reaching complete collapse (zero accuracy) beyond a model specific complexity threshold," researchers wrote. In the results shown, Claude 3.7 Sonnet + thinking and DeepSeek R1 start to fail when a fifth disc is added to the Tower of Hanoi problem. Even when more computing power is applied to the LRMs, they still fail at the more complex puzzles.

What's more, researchers found that reasoning models initially apply more thinking tokens as complexity increases, but they actually give up at a certain point. "Upon approaching a critical threshold — which closely corresponds to their accuracy collapse point — models counterintuitively begin to reduce their reasoning effort despite increasing problem difficulty," the paper read. So when the problems get harder, they spend fewer tokens, or "think" less.

But what about when the LRMs are given the answers? Nope, accuracy doesn't improve. Even when researchers included the algorithm in the prompt, so that all the models needed to do was follow the steps, they continued to fail.

But before you fire up the grill because LLM reasoning is so cooked, season these findings with a grain of salt. The research doesn't mean LRMs don't reason at all; it just means they may not currently be much smarter than humans. As AI expert Gary Marcus pointed out on his blog, "(ordinary) humans actually have a bunch of (well-known) limits that parallel what the Apple team discovered. Many (not all) humans screw up on versions of the Tower of Hanoi with 8 discs." As others have pointed out online, the research does not compare results from human attempts at these puzzles.

Essentially, LLMs have their uses for tasks like coding and writing, but they also have weaknesses. "What the Apple paper shows, most fundamentally, regardless of how you define AGI, is that LLMs are no substitute for good well-specified conventional algorithms," wrote Marcus, who has been very vocal about the reasoning limitations of AI models.

That's to say, take the findings from Apple researchers for what they are: important data to be considered within the context of other LLM research. It's tempting to categorize AI's overall advancements as overhyped when new research like this comes out. Or, on the flip side, for AGI boosters to claim victory when research has discovered new advancements. But the reality is usually somewhere in the boring middle.
