LessWrong · November 26, 2024
The Problem with Reasoners by Aidan McLaughlin

This article is a critical look at reasoning models such as OpenAI's o1 and Deepseek's r1. The author argues that o1-style reasoners do not generalize effectively beyond the domains they were trained on. Based on benchmark results, the author finds that o1 models excel at math and coding but do poorly elsewhere, for example on personal writing and text editing. Moreover, o1's reasoning gains have not delivered the expected improvement in inference-compute efficiency. The author closes by worrying about the direction of the field: if the entire industry pivots to reasoning models, future AI progress may be less exciting than hoped.

🤔 **o1-style reasoners generalize poorly beyond their training domains:** OpenAI admits that o1 was trained on domains with easy verification, and whether it generalizes to all domains remains an open question. Benchmarks show that o1 models excel at math and coding but do worse in other areas, such as personal writing and text editing, where they can even fall behind gpt-4o.

📊 **o1's showing on emotional understanding is middling:** On EQBench (a benchmark for emotional understanding), o1-preview performs on par with Gemma-27B and o1-mini on par with GPT-3.5-Turbo, indicating that its reasoning gains do not translate into a clear advantage over existing language models here.

💰 **Scaling inference compute has not paid off as expected:** The author notes that the point of inference-compute scaling is to give today's models tomorrow's capabilities and thereby speed up AI progress. o1, however, does not deliver a meaningful inference-compute unlock, which calls the value of that scaling into question.

⚠️ **An industry-wide pivot to reasoners could make AI's future boring:** If the entire AI industry moves toward reasoning models, future AI progress may be less exciting than expected, because reasoners are limited in both generalization and compute efficiency.

🚀 **The author hopes future work will overcome these limits:** The author hopes AI labs can iron out the problems of scaling model size and find inference-compute approaches that genuinely substitute for training, enabling faster AI progress.

Published on November 25, 2024 8:24 PM GMT

A critique of reasoning models like o1 (by OpenAI) and r1 (by Deepseek).

OpenAI admits that they trained o1 on domains with easy verification but hope reasoners generalize to all domains. Whether or not they generalize beyond their RL training is a trillion-dollar question. Right off the bat, I’ll tell you my take:

o1-style reasoners do not meaningfully generalize beyond their training.
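
To make the "easy verification" framing concrete, here is a minimal sketch of the asymmetry. The functions and setup are hypothetical illustrations of mine, not anything from OpenAI's actual training pipeline:

```python
# A minimal sketch of the verification asymmetry, assuming a toy setup:
# math answers can be checked against a known solution; personal writing cannot.

def math_reward(model_answer: str, reference_answer: str) -> float:
    """Cheap oracle: exact-match the final answer against a known solution.

    Because this check is nearly free, you can mass-produce practice
    problems and run RL against the outcome."""
    return 1.0 if model_answer.strip() == reference_answer.strip() else 0.0


def personal_writing_reward(draft: str) -> float:
    """No oracle: there is no reference string for good personal writing,
    so any reward signal must come from an expensive, noisy judge."""
    raise NotImplementedError("no cheap, reliable verifier exists")
```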


A straightforward way to check how reasoners perform on domains without easy verification is to look at benchmarks. On math and coding, OpenAI's o1 models do exceptionally well. On everything else, the answer is less clear.

Results that jump out:

- o1-preview does worse on personal writing than gpt-4o and no better on editing text, despite costing 6× more.
- OpenAI didn't release scores for o1-mini, which suggests they may be worse than o1-preview's. o1-mini also costs more than gpt-4o.
- On eqbench (which tests emotional understanding), o1-preview performs as well as gemma-27b.
- On eqbench, o1-mini performs as well as gpt-3.5-turbo. No, you didn't misread that: it performs as well as gpt-3.5-turbo.


Throughout this essay, I’ve doomsayed o1-like reasoners because they’re locked into domains with easy verification. You won't see inference performance scale if you can’t gather near-unlimited practice examples for o1.

...

I expect transformative AI to come remarkably soon. I hope labs iron out the wrinkles in scaling model size. But if we do end up scaling model size to address these challenges, what was the point of inference compute scaling again?

Remember, inference scaling endows today’s models with tomorrow’s capabilities. It allows you to skip the wait. If you want faster AI progress, you want inference to be a 1:1 replacement for training.
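
As a toy illustration of that 1:1 ideal, here is a back-of-the-envelope model with entirely made-up numbers and an assumed power-law exchange rate; the function and its parameters are mine, not the essay's:

```python
# Toy model: how much training compute an inference-compute multiplier is
# "worth" under an assumed power-law exchange rate. Made-up functional form.

def effective_training_compute(train_flops: float,
                               inference_multiplier: float,
                               exchange_rate: float) -> float:
    """Capability-equivalent training FLOPs.

    exchange_rate = 1.0 is the ideal 1:1 swap: 100x inference spend buys
    what 100x pretraining compute would. A reasoner that only transfers
    inside easily verified domains behaves as if the rate is far below 1
    everywhere else."""
    return train_flops * inference_multiplier ** exchange_rate


base = 1e25  # illustrative pretraining budget in FLOPs
print(effective_training_compute(base, 100, exchange_rate=1.0))  # 1e27 -- skips the wait
print(effective_training_compute(base, 100, exchange_rate=0.1))  # ~1.6e25 -- barely moves
```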

o1 is not the inference-time compute unlock we deserve.

If the entire AI industry moves toward reasoners, our future might be more boring than I thought.


