LLMs Look Increasingly Like General Reasoners

This post examines recent progress in the general reasoning capabilities of large language models (LLMs), in particular OpenAI's o1-preview model achieving notable breakthroughs on tasks previously considered LLM weak points, such as Blocksworld and ARC-AGI. These advances weaken earlier skepticism about LLMs' general reasoning ability and suggest that LLMs may scale directly to artificial general intelligence (AGI), shortening AGI timelines. The post also updates the author's earlier probability estimates for LLM development timelines, concluding that LLMs are more likely to reach AGI within the next few years and, with tools and scaffolding, to carry out original AI research.

🤔 **Dramatic progress on Blocksworld:** o1-preview achieved striking success on Blocksworld tasks, most notably on obfuscated Blocksworld, previously considered nearly unsolvable for LLMs, where the success rate jumped from 0.8% to 52.8% — suggesting the LLM architecture may be capable of general reasoning.

💡 **Improved ARC-AGI performance:** o1-preview also outperforms GPT-4o on ARC-AGI, raising the success rate from 9% to 21.2%, further supporting the possibility of general reasoning in LLMs.

🗓️ **Updated AGI timeline estimates:** Based on o1-preview's strong results, the author revises earlier probability estimates for LLM development timelines, judging it more likely that LLMs scale directly to AGI and that timelines are relatively short.

🛠️ **Original AI research with tool assistance:** The author argues that LLMs combined with tools and scaffolding may eventually perform original AI research, and that o1-class models may already be capable of some of it.

🤔 **Further research on LLM general reasoning:** The author plans a research project to test more rigorously whether LLMs possess the specific kinds of general reasoning needed for novel scientific research.

Published on November 8, 2024 11:47 PM GMT

Summary

Four months after my post 'LLM Generality is a Timeline Crux', new research on o1-preview should update us significantly toward LLMs being capable of general reasoning, and hence of scaling straight to AGI, and shorten our timeline estimates.

Summary of previous post

In June of 2024, I wrote a post, 'LLM Generality is a Timeline Crux', in which I argued that:

  - LLMs seem on their face to be improving rapidly at reasoning.
  - But there are some interesting exceptions where they still fail much more badly than one would expect given the rest of their capabilities, having to do with general reasoning. In particular, some argue that much of their apparent reasoning capability is much shallower than it appears, and that we're being fooled by having trouble internalizing just how vast their training data is.
  - If in fact this is the case, we should be much more skeptical of the sort of scale-straight-to-AGI argument made by authors like Leopold Aschenbrenner, and of the short timeline it implies, because substantial additional breakthroughs will be needed first.

Reasons to update

In the original post, I gave the three main pieces of evidence against LLMs doing general reasoning that I found most compelling: blocksworld, planning/scheduling, and ARC-AGI (see original for details). All three of those seem importantly weakened in light of recent research.

Most dramatically, a new paper on blocksworld has recently been published by some of the same highly LLM-skeptical researchers (Valmeekam et al., led by Subbarao Kambhampati[1]): 'LLMs Still Can’t Plan; Can LRMs? A Preliminary Evaluation of OpenAI’s o1 on PlanBench'. Where the best previous success rate on non-obfuscated blocksworld was 57.6%, o1-preview essentially saturates the benchmark at 97.8%. On obfuscated blocksworld, where previous LLMs had proved almost entirely incapable (0.8% zero-shot, 4.3% one-shot), o1-preview jumps all the way to a 52.8% success rate. In my view, this jump in particular should update us significantly toward the LLM architecture being capable of general reasoning[2].
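
For readers who haven't seen it, Blocksworld asks for a sequence of legal moves that rearranges stacks of blocks into a goal configuration. The sketch below is my own minimal encoding with a brute-force breadth-first solver, just to make concrete what the benchmark demands; PlanBench itself poses the problems in PDDL and natural language, and the obfuscated variant swaps the familiar vocabulary ('block', 'stack', 'table') for meaningless tokens so a model can't pattern-match on memorized examples.

```python
# Toy Blocksworld: a state is a sorted tuple of stacks, each stack a tuple of
# blocks listed bottom-first. Only the top block of a stack may move, either
# onto the table (a new stack) or onto the top of another stack.
from collections import deque

def successors(state):
    """Yield (action, next_state) pairs reachable in one legal move."""
    stacks = [list(s) for s in state if s]
    n = len(stacks)
    for i in range(n):
        block = stacks[i][-1]
        # Move the top of stack i onto the table (skip if already alone there).
        if len(stacks[i]) > 1:
            rest = [tuple(s) for k, s in enumerate(stacks) if k != i]
            rest.append(tuple(stacks[i][:-1]))
            rest.append((block,))
            yield (f"put {block} on the table", tuple(sorted(rest)))
        # ...or onto the top of another stack j.
        for j in range(n):
            if i == j:
                continue
            rest = []
            for k, s in enumerate(stacks):
                if k == i:
                    if len(s) > 1:      # stack i shrinks (or vanishes)
                        rest.append(tuple(s[:-1]))
                elif k == j:
                    rest.append(tuple(s) + (block,))
                else:
                    rest.append(tuple(s))
            yield (f"stack {block} on {stacks[j][-1]}", tuple(sorted(rest)))

def solve(start, goal):
    """Breadth-first search for a shortest plan from start to goal."""
    start, goal = tuple(sorted(start)), tuple(sorted(goal))
    frontier, seen = deque([(start, [])]), {start}
    while frontier:
        state, plan = frontier.popleft()
        if state == goal:
            return plan
        for action, nxt in successors(state):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, plan + [action]))
    return None

# B sits on A, C is alone; goal: a single stack C-B-A (A on top).
print(solve(start=(("A", "B"), ("C",)), goal=(("C", "B", "A"),)))
# -> ['stack B on C', 'stack A on B']
```

Classical search dispatches toy instances like this instantly; the benchmark's interest is whether an LLM can produce valid plans from the textual description alone, especially once the surface vocabulary is scrambled.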

o1-preview also does much better on ARC-AGI than GPT-4o, jumping from 9% to 21.2% on the public eval ('OpenAI o1 Results on ARC-AGI-Pub'). Note that since my original post, Claude-3.5-Sonnet has also reached 21% on the public eval.
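
For context on the second benchmark: an ARC-AGI task supplies a few input/output grid pairs demonstrating an unknown transformation, and the solver must infer the rule and apply it to a held-out test grid. Tasks are distributed as JSON in roughly the shape below; the mirror rule here is a made-up, deliberately easy example (real ARC rules are far harder to induce).

```python
# A toy task in the shape of an ARC-AGI problem. Grids are small lists of
# lists of color indices (0-9); "train" pairs demonstrate the rule, and the
# solver must produce the output for the "test" input.
task = {
    "train": [
        {"input": [[1, 0], [2, 3]], "output": [[0, 1], [3, 2]]},
        {"input": [[5, 5, 0]],      "output": [[0, 5, 5]]},
    ],
    "test": [{"input": [[7, 0, 0], [0, 7, 0]]}],
}

def mirror(grid):
    """Hypothesized rule: reflect each row left-to-right."""
    return [list(reversed(row)) for row in grid]

# A solver must induce the rule from the train pairs alone, then apply it.
assert all(mirror(p["input"]) == p["output"] for p in task["train"])
print(mirror(task["test"][0]["input"]))  # -> [[0, 0, 7], [0, 7, 0]]
```

Because every task uses a novel rule, there is little for a model to retrieve from training data, which is why ARC-AGI has been a popular operationalization of 'general reasoning'.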

The planning/scheduling evidence, on the other hand, seemed weaker almost immediately after the post: a commenter quickly pointed out that the paper was full of errors. Nonetheless, another recent paper looks at a broader range of planning problems and also finds substantial improvements from o1-preview, although arguably not the same level of 0-to-1 improvement that Valmeekam et al. find with obfuscated blocksworld ('On The Planning Abilities of OpenAI’s o1 Models: Feasibility, Optimality, and Generalizability').

I would be grateful to hear about other recent research that helps answer these questions (and thanks to @Archimedes for calling my attention to these papers).

Discussion

My overall conclusion, and the reason I think it's worth posting this follow-up, is that I believe the new evidence should update all of us toward LLMs scaling straight to AGI, and therefore toward timelines being relatively short. Time will continue to tell, of course, and I have a research project planned for early spring that attempts to more rigorously investigate whether LLMs are capable of the particular sorts of general reasoning that will allow them to perform novel scientific research end-to-end.

My own numeric updates follow.

Updated probability estimates

(text copied from my previous post is italicized for clarity on what changed)

 

Citations

'LLMs Still Can’t Plan; Can LRMs? A Preliminary Evaluation of OpenAI’s o1 on PlanBench', Valmeekam et al. (includes Kambhampati), 09/24

'OpenAI o1 Results on ARC-AGI-Pub', Mike Knoop, 09/24

'On The Planning Abilities of OpenAI’s o1 Models: Feasibility, Optimality, and Generalizability', Wang et al., 09/24

 

  1. ^

    I am restraining myself with some difficulty from jumping up and down and yelling about the level of goalpost-moving in this new paper.

  2. ^

    There's a sense in which comparing results from previous LLMs with o1-preview isn't entirely an apples-to-apples comparison, since o1-preview is throwing a lot more inference-time compute at the problem. In that way it's similar to Ryan's hybrid approach to ARC-AGI, as discussed in the original post. But since the key question here is whether LLMs are capable of general reasoning at all, that doesn't really change my thinking here; certainly there are many problems (like capabilities research) where companies will be perfectly happy to spend a lot on compute to get a better answer.
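
    As a toy illustration of that trade-off (and emphatically not o1's actual mechanism, which OpenAI hasn't published): even naive best-of-k sampling with majority voting converts extra inference-time compute into accuracy. The 60% base rate below is an arbitrary stand-in, and collapsing all wrong answers into one string makes voting easier than it is in practice.

```python
# Best-of-k majority voting: spend k model calls per question, keep the
# most common answer. Illustrative only; numbers are made up.
from collections import Counter
import random

def sample_answer(rng):
    # Stand-in for one stochastic model call that is right 60% of the time.
    return "correct" if rng.random() < 0.6 else "wrong"

def best_of_k(k, rng):
    votes = Counter(sample_answer(rng) for _ in range(k))
    return votes.most_common(1)[0][0]

rng = random.Random(0)
for k in (1, 5, 25):
    acc = sum(best_of_k(k, rng) == "correct" for _ in range(2000)) / 2000
    print(f"k={k:>2}: ~{acc:.0%} accuracy for {k}x the compute")
```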


