MarkTechPost@AI June 22, 14:20
Why Apple’s Critique of AI Reasoning Is Premature

The debate over the reasoning abilities of Large Reasoning Models (LRMs) has recently heated up with Apple's "Illusion of Thinking" paper and Anthropic's rebuttal, "The Illusion of the Illusion of Thinking." Apple's study claims that LRMs' reasoning abilities are fundamentally limited, while Anthropic argues those claims stem from the evaluation methodology rather than flaws in the models themselves. Through changes to the experimental design and evaluation method, Anthropic challenges Apple's conclusions, underscores how testing environments and evaluation frameworks shape results, and proposes that future work should focus on separating reasoning ability from practical constraints, verifying problem solvability, and refining complexity metrics.

🤔 Apple's study observed an "accuracy collapse" on puzzles such as Tower of Hanoi and River Crossing as complexity increased, and concluded that LRMs cannot apply exact computation and consistent algorithmic reasoning.

🗣️ Anthropic argues that Apple's experimental design is flawed: output-token limits produced the apparent "reasoning collapse," which in fact reflected models deliberately truncating their outputs to stay within the token budget.

🧐 Anthropic points out that Apple's automated evaluation framework misread this truncation behavior as reasoning failure, unjustly penalizing LRMs, and found that some of Apple's River Crossing instances were unsolvable.

💡 Anthropic tested a concise solution representation (such as Lua functions) and obtained high accuracy even on complex puzzles, indicating that the problem lies in the evaluation method rather than in reasoning ability.

🔍 Anthropic argues that Apple's complexity metric conflates mechanical execution with genuine cognitive difficulty: Tower of Hanoi requires many more moves, but each decision step is trivial, whereas River Crossing involves far higher cognitive complexity.

The debate around the reasoning capabilities of Large Reasoning Models (LRMs) has been recently invigorated by two prominent yet conflicting papers: Apple’s “Illusion of Thinking” and Anthropic’s rebuttal titled “The Illusion of the Illusion of Thinking”. Apple’s paper claims fundamental limits in LRMs’ reasoning abilities, while Anthropic argues these claims stem from evaluation shortcomings rather than model failures.

Apple’s study systematically tested LRMs on controlled puzzle environments, observing an “accuracy collapse” beyond specific complexity thresholds. These models, such as Claude-3.7 Sonnet and DeepSeek-R1, reportedly failed to solve puzzles like Tower of Hanoi and River Crossing as complexity increased, even exhibiting reduced reasoning effort (token usage) at higher complexities. Apple identified three distinct complexity regimes: standard LLMs outperform LRMs at low complexity, LRMs excel at medium complexity, and both collapse at high complexity. Critically, Apple’s evaluations concluded that LRMs’ limitations were due to their inability to apply exact computation and consistent algorithmic reasoning across puzzles.
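A quick way to see why the high-complexity regime also strains output length: the shortest Tower of Hanoi solution has 2^n - 1 moves, so the cost of writing out every move grows exponentially with the number of disks. The sketch below illustrates this; the tokens-per-move and budget figures are illustrative assumptions, not numbers taken from either paper.

```python
# Minimal sketch: the optimal Tower of Hanoi solution has 2**n - 1 moves.
# The figures below are illustrative assumptions (not from either paper),
# showing how quickly a fully enumerated solution outgrows a fixed output budget.
TOKENS_PER_MOVE = 10        # assumed cost of printing one move, e.g. "move disk 3 from A to C"
OUTPUT_BUDGET = 64_000      # assumed output-token limit for a hypothetical model

for n in (5, 10, 15, 20):
    moves = 2**n - 1
    est_tokens = moves * TOKENS_PER_MOVE
    verdict = "fits" if est_tokens <= OUTPUT_BUDGET else "exceeds budget"
    print(f"{n:2d} disks: {moves:>9,} moves, ~{est_tokens:>11,} tokens ({verdict})")
```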

Anthropic, however, sharply challenges Apple’s conclusions, identifying critical flaws in the experimental design rather than the models themselves. They highlight three major issues:

1. Token Limitations vs. Logical Failures: Anthropic emphasizes that the failures observed in Apple's Tower of Hanoi experiments were primarily due to output-token limits rather than reasoning deficits. Models explicitly noted their token constraints and deliberately truncated their outputs. What appeared to be a "reasoning collapse" was therefore a practical limitation, not a cognitive failure.

2. Misclassification of Reasoning Breakdown: Anthropic identifies that Apple's automated evaluation framework misinterpreted these intentional truncations as reasoning failures. The rigid scoring method did not accommodate the models' awareness of, and decisions about, output length, unjustly penalizing LRMs.

3. Unsolvable Problems Misinterpreted: Perhaps most significantly, Anthropic demonstrates that some of Apple's River Crossing benchmarks were mathematically impossible to solve (e.g., cases with six or more individuals and a boat capacity of three); a solvability check for this class of instance is sketched below. Scoring these unsolvable instances as failures drastically skewed the results, making models appear incapable of solving fundamentally unsolvable puzzles.
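To make the third point concrete, the sketch below runs a brute-force breadth-first search over the states of the jealous-couples style River Crossing puzzle (the actor/agent formulation; this is a generic encoding of that puzzle class, not Apple's exact benchmark harness). It reproduces the classical result Anthropic relies on: with a boat that holds three, instances with six or more pairs have no solution at all.

```python
from collections import deque
from itertools import combinations

def river_crossing_solvable(n_pairs: int, boat_capacity: int) -> bool:
    """Breadth-first search over the jealous-couples / actor-agent river crossing.

    People are ('A', i) for actor i and ('G', i) for agent i. Safety rule:
    an actor may never be with another pair's agent unless their own agent
    is also present. Returns True iff everyone can reach the right bank.
    """
    people = frozenset(('A', i) for i in range(n_pairs)) | \
             frozenset(('G', i) for i in range(n_pairs))

    def safe(group):
        agents = {i for kind, i in group if kind == 'G'}
        # An actor is only at risk if some agent is present but not their own.
        return all(not (kind == 'A' and agents and i not in agents)
                   for kind, i in group)

    start = (people, 'L')                 # everyone on the left bank, boat on the left
    seen, queue = {start}, deque([start])
    while queue:
        left, boat = queue.popleft()
        if not left:                      # left bank empty: everyone has crossed
            return True
        bank = left if boat == 'L' else people - left
        for size in range(1, boat_capacity + 1):
            for movers in combinations(bank, size):
                movers = frozenset(movers)
                if not safe(movers):      # the rule also applies inside the boat
                    continue
                new_left = left - movers if boat == 'L' else left | movers
                if not (safe(new_left) and safe(people - new_left)):
                    continue
                state = (new_left, 'R' if boat == 'L' else 'L')
                if state not in seen:
                    seen.add(state)
                    queue.append(state)
    return False

for n in range(2, 7):
    status = "solvable" if river_crossing_solvable(n, 3) else "unsolvable"
    print(f"{n} pairs, boat capacity 3: {status}")
```

Under this encoding the search confirms solvability up to five pairs and failure for six, matching the impossibility Anthropic points to.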

Anthropic further tested an alternative representation method—asking models to provide concise solutions (like Lua functions)—and found high accuracy even on complex puzzles previously labeled as failures. This outcome clearly indicates the issue was with evaluation methods rather than reasoning capabilities.
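Anthropic's alternative prompt asked for the answer as a compact program (Lua functions in their tests). The Python sketch below is only an analogue of that representational shift, not the exact prompt or function from the rebuttal: a few lines of recursive code fully specify the exponential-length move sequence without ever printing it.

```python
from itertools import islice

def hanoi_moves(n: int, source="A", target="C", spare="B"):
    """Lazily yield every move (disk, from_peg, to_peg) of an n-disk solution.

    The function itself is the complete answer: it encodes all 2**n - 1 moves,
    even though materializing them for large n would be enormous.
    """
    if n == 0:
        return
    yield from hanoi_moves(n - 1, source, spare, target)   # park the n-1 smaller disks
    yield (n, source, target)                              # move the largest disk
    yield from hanoi_moves(n - 1, spare, target, source)   # restack the smaller disks

assert sum(1 for _ in hanoi_moves(10)) == 2**10 - 1        # 1,023 moves, as expected
print(list(islice(hanoi_moves(15), 3)))                    # first 3 of 32,767 moves
```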

Another key point raised by Anthropic pertains to the complexity metric used by Apple—compositional depth (number of required moves). They argue this metric conflates mechanical execution with genuine cognitive difficulty. For example, while Tower of Hanoi puzzles require exponentially more moves, each decision step is trivial, whereas puzzles like River Crossing involve fewer steps but significantly higher cognitive complexity due to constraint satisfaction and search requirements.
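One way to see the asymmetry Anthropic describes: every individual move of an optimal Tower of Hanoi solution can be read directly off the binary representation of the move index, with no search or lookahead, whereas River Crossing has no such closed form and must be solved by searching a constrained state space (as in the solvability sketch above). The snippet below uses the well-known bit-trick closed form, with pegs numbered 0, 1, 2 and all disks starting on peg 0; under that numbering the tower ends on peg 2 when n is odd and peg 1 when n is even.

```python
def hanoi_kth_move(k: int):
    """Return (disk, from_peg, to_peg) for move k (1-indexed) of the optimal solution.

    Closed form for pegs numbered 0, 1, 2 with all disks starting on peg 0:
    the disk moved is 1 + (trailing zeros of k), and the pegs follow a fixed
    bit pattern. Each decision is mechanically determined by k alone.
    """
    disk = (k & -k).bit_length()                  # 1 + number of trailing zeros of k
    return disk, (k & (k - 1)) % 3, ((k | (k - 1)) + 1) % 3

# The full 2-disk solution (ends on peg 1, since n is even):
print([hanoi_kth_move(k) for k in range(1, 4)])   # [(1, 0, 2), (2, 0, 1), (1, 2, 1)]
```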

Both papers significantly contribute to understanding LRMs, but the tension between their findings underscores a critical gap in current AI evaluation practices. Apple’s conclusion—that LRMs inherently lack robust, generalizable reasoning—is substantially weakened by Anthropic’s critique. Instead, Anthropic’s findings suggest LRMs are constrained by their testing environments and evaluation frameworks rather than their intrinsic reasoning capacities.

Given these insights, future research and practical evaluations of LRMs should:

1. Distinguish genuine reasoning limits from practical constraints such as output-token budgets.
2. Verify that benchmark problems are actually solvable before scoring model failures against them.
3. Use complexity metrics that reflect genuine cognitive difficulty rather than mere mechanical solution length.

Ultimately, Apple’s claim that LRMs “can’t really reason” appears premature. Anthropic’s rebuttal demonstrates that LRMs indeed possess sophisticated reasoning capabilities that can handle substantial cognitive tasks when evaluated correctly. However, it also stresses the importance of careful, nuanced evaluation methods to truly understand the capabilities—and limitations—of emerging AI models.


Check out the Apple Paper and Anthropic Paper. All credit for this research goes to the researchers of this project.

