Mashable · June 10
The illusion of thinking: Apple research finds AI models collapse and give up with hard puzzles

Apple's latest AI research shows that large reasoning models (LRMs) collapse when faced with complex problems. The study finds that these models outperform conventional LLMs on medium-difficulty problems but do worse on simple ones. More worryingly, when confronted with hard problems, LRMs fail entirely and give up prematurely. The findings echo reasoning flaws previously identified in LLMs and pour cold water on optimism about artificial general intelligence (AGI). The researchers evaluated LRMs' reasoning ability with classic logic puzzles such as the Tower of Hanoi; as problem complexity increased, model accuracy gradually declined until it collapsed completely. Even additional computing power did not improve LRM performance on complex problems. The study also found that, as they approach the collapse point, LRMs actually reduce their reasoning "effort."

🤔 Apple's research finds that large reasoning models (LRMs) show clear limitations on complex problems: as problems grow more complex, their problem-solving ability drops off rapidly until it fails entirely.

🧩 The researchers tested LRMs' reasoning ability with classic logic puzzles such as the Tower of Hanoi. As problem complexity increases, model accuracy gradually declines until it reaches zero.

📉 On complex problems, LRMs not only lose accuracy but also reduce their reasoning "effort", i.e. the number of tokens spent thinking, as they approach the collapse point. Even when given the answer, the models' performance does not improve.

💡 The results suggest that while LRMs excel at tasks like math and coding, on more complex reasoning tasks they offer only an "illusion of thinking" and are no substitute for good, well-specified conventional algorithms.

New artificial intelligence research from Apple shows AI reasoning models may not be "thinking" so well after all.

According to a paper published just days before Apple's WWDC event, large reasoning models (LRMs) — like OpenAI o1 and o3, DeepSeek R1, Claude 3.7 Sonnet Thinking, and Google Gemini Flash Thinking — completely collapse when they're faced with increasingly complex problems. The paper comes from the same researchers who found other reasoning flaws in LLMs last year.

The news was a bucket of cold water for artificial general intelligence (AGI) optimists (and welcome news for AI and AGI skeptics), as Apple's research seemed to show damning evidence about the limitations of reasoning model intelligence. While the much-hyped LRMs performed better than LLMs on medium-difficulty puzzles, they performed worse on simple puzzles. And according to Apple's research, when they faced hard puzzles, they collapsed completely, giving up on the problem prematurely.

Or, as the Apple researchers put it, while AI models perform extremely well at math and coding, when it comes to more complex problems, they only provide "The Illusion of Thinking."

Apple was slow to develop large language models and implement AI in its devices, largely staying out of the conversation. The company has added Apple Intelligence AI features, though they have generally been considered underwhelming. With that in mind, this research might explain some of Apple's reticence to go all-in on AI, unlike Google and Samsung, which have frontloaded their devices with AI capabilities.

How Apple researchers tested reasoning skills

The problems researchers used to evaluate the reasoning models, which they call LRMs or Large Reasoning Models, are classic logic puzzles like the Tower of Hanoi. The puzzle consists of discs, stacked largest to smallest on one of three pegs, and the goal is to move the discs to the third peg without ever placing a larger disc on top of a smaller disc. Other puzzles included jumping checker pieces into empty spaces, the river-crossing problem (the one usually involving a fox, a chicken, and a bag of grain), and stacking blocks in a specific configuration.
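For context, the Tower of Hanoi has a well-known, well-specified solution: exactly the kind of conventional algorithm the article returns to later. Below is a minimal recursive sketch in Python; the function name, peg labels, and driver code are illustrative choices for this article, not the researchers' evaluation harness.

```python
def hanoi(n, source="A", spare="B", target="C"):
    """Yield the moves that transfer n discs from source to target.

    Classic recursive strategy: move the top n-1 discs out of the way,
    move the largest disc, then move the n-1 discs back on top of it.
    """
    if n == 0:
        return
    # Park the n-1 smaller discs on the spare peg.
    yield from hanoi(n - 1, source=source, spare=target, target=spare)
    # Move the largest remaining disc straight to the target.
    yield (source, target)
    # Bring the n-1 smaller discs over to the target.
    yield from hanoi(n - 1, source=spare, spare=source, target=target)


if __name__ == "__main__":
    moves = list(hanoi(5))
    print(len(moves), "moves for 5 discs")  # 31, i.e. 2**5 - 1
    print(moves[:4])                        # first few (from, to) steps
```

Following this recipe mechanically always solves the puzzle, which is what makes the models' failures on larger instances notable.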

You probably recognize these logic puzzles from math class or online games; they're a simple way of testing humans' ability to reason and problem-solve. Once you figure out the trick, it's just a matter of following the logic even as the complexity increases, which in this case means more discs, checkers, animals, or blocks. However, researchers found that LRMs start to fail after a certain point.

"Results show that all reasoning models exhibit a similar pattern with respect to complexity: accuracy progressively declines as problem complexity increases until reaching complete collapse (zero accuracy) beyond a model specific complexity threshold," researchers wrote. In the results shown, Claude 3.7 Sonnet + thinking and DeepSeek R1 start to fail when a fifth disc is added to the Tower of Hanoi problem. Even when more computing power is applied to the LRMs, they still fail at the more complex puzzles.

What's more, researchers found that reasoning models initially apply more thinking tokens as complexity increases, but they actually give up at a certain point. "Upon approaching a critical threshold — which closely corresponds to their accuracy collapse point — models counterintuitively begin to reduce their reasoning effort despite increasing problem difficulty," the paper read. So when the problems get harder, they spend fewer tokens, or "think" less.

But what about when the LRMs are given the answers? Nope, accuracy doesn't improve. Even when researchers included the algorithm in the prompt, so that all the models needed to do was follow the steps, they continued to fail.

But before you fire up the grill because LLM reasoning is so cooked, season these findings with a grain of salt. The research doesn't mean LRMs don't reason at all; it just means they may not currently be much smarter than humans. As AI expert Gary Marcus pointed out on his blog, "(ordinary) humans actually have a bunch of (well-known) limits that parallel what the Apple team discovered. Many (not all) humans screw up on versions of the Tower of Hanoi with 8 discs." As others have pointed out online, the research does not compare results from human attempts at these puzzles.

Essentially, LLMs have their uses for tasks like coding and writing, but they also have weaknesses. "What the Apple paper shows, most fundamentally, regardless of how you define AGI, is that LLMs are no substitute for good well-specified conventional algorithms," wrote Marcus, who has been very vocal about the reasoning limitations of AI models.

That's to say, take the findings from Apple researchers for what they are: important data to be considered within the context of other LLM research. It's tempting to categorize AI's overall advancements as overhyped when new research like this comes out. Or, on the flip side, for AGI boosters to claim victory when research has discovered new advancements. But the reality is usually somewhere in the boring middle.
