LLMs Look Increasingly Like General Reasoners

This post examines recent progress in the general reasoning capabilities of large language models (LLMs), in particular OpenAI's o1-preview model achieving notable breakthroughs on tasks previously considered LLM weak points, such as Blocksworld and ARC-AGI. These advances weaken earlier skepticism about LLMs' general reasoning ability and suggest that LLMs may scale directly to artificial general intelligence (AGI), shortening AGI timelines. The post also updates the author's earlier probability estimates for LLM development timelines, concluding that LLMs are more likely to reach AGI within the next few years and, with tools and scaffolding, to carry out original AI research.

🤔 **Dramatic progress on Blocksworld:** o1-preview achieved striking success on Blocksworld tasks, most notably on obfuscated Blocksworld, previously considered nearly unsolvable for LLMs, where the success rate jumped from 0.8% to 52.8% — suggesting the LLM architecture may be capable of general reasoning.

💡 **Improved ARC-AGI performance:** o1-preview also outperforms GPT-4o on ARC-AGI, raising the success rate from 9% to 21.2%, further supporting the possibility of general reasoning in LLMs.

🗓️ **Updated AGI timeline estimates:** Based on o1-preview's strong results, the author revises earlier probability estimates for LLM development timelines, judging it more likely that LLMs scale directly to AGI and that timelines are relatively short.

🛠️ **Original AI research with tool assistance:** The author argues that LLMs combined with tools and scaffolding may eventually perform original AI research, and that o1-class models may already be capable of some of it.

🤔 **Further research on LLM general reasoning:** The author plans a research project to test more rigorously whether LLMs possess the specific kinds of general reasoning needed for novel scientific research.

Published on November 8, 2024 11:47 PM GMT

Summary

Four months after my post 'LLM Generality is a Timeline Crux', new research on o1-preview should update us significantly toward LLMs being capable of general reasoning, and hence of scaling straight to AGI, and shorten our timeline estimates.

Summary of previous post

In June of 2024, I wrote a post, 'LLM Generality is a Timeline Crux', in which I argued that:

  - LLMs seem on their face to be improving rapidly at reasoning.
  - But there are some interesting exceptions where they still fail much more badly than one would expect given the rest of their capabilities, having to do with general reasoning. In particular, some argue that much of their apparent reasoning capability is much shallower than it appears, and that we're being fooled by having trouble internalizing just how vast their training data is.
  - If in fact this is the case, we should be much more skeptical of the sort of scale-straight-to-AGI argument made by authors like Leopold Aschenbrenner, and of the short timeline it implies, because substantial additional breakthroughs will be needed first.

Reasons to update

In the original post, I gave the three main pieces of evidence against LLMs doing general reasoning that I found most compelling: blocksworld, planning/scheduling, and ARC-AGI (see original for details). All three of those seem importantly weakened in light of recent research.

Most dramatically, a new paper on blocksworld has recently been published by some of the same highly LLM-skeptical researchers (Valmeekam et al., led by Subbarao Kambhampati[1]): 'LLMs Still Can’t Plan; Can LRMs? A Preliminary Evaluation of OpenAI’s o1 on PlanBench'. Where the best previous success rate on non-obfuscated blocksworld was 57.6%, o1-preview essentially saturates the benchmark at 97.8%. On obfuscated blocksworld, where previous LLMs had proved almost entirely incapable (0.8% zero-shot, 4.3% one-shot), o1-preview jumps all the way to a 52.8% success rate. In my view, this jump in particular should update us significantly toward the LLM architecture being capable of general reasoning[2].
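
For readers who haven't seen it, Blocksworld asks for a sequence of legal moves that rearranges stacks of blocks into a goal configuration. The sketch below is my own minimal encoding with a brute-force breadth-first solver, just to make concrete what the benchmark demands; PlanBench itself poses the problems in PDDL and natural language, and the obfuscated variant swaps the familiar vocabulary ('block', 'stack', 'table') for meaningless tokens so a model can't pattern-match on memorized examples.

```python
# Toy Blocksworld: a state is a sorted tuple of stacks, each stack a tuple of
# blocks listed bottom-first. Only the top block of a stack may move, either
# onto the table (a new stack) or onto the top of another stack.
from collections import deque

def successors(state):
    """Yield (action, next_state) pairs reachable in one legal move."""
    stacks = [list(s) for s in state if s]
    n = len(stacks)
    for i in range(n):
        block = stacks[i][-1]
        # Move the top of stack i onto the table (skip if already alone there).
        if len(stacks[i]) > 1:
            rest = [tuple(s) for k, s in enumerate(stacks) if k != i]
            rest.append(tuple(stacks[i][:-1]))
            rest.append((block,))
            yield (f"put {block} on the table", tuple(sorted(rest)))
        # ...or onto the top of another stack j.
        for j in range(n):
            if i == j:
                continue
            rest = []
            for k, s in enumerate(stacks):
                if k == i:
                    if len(s) > 1:      # stack i shrinks (or vanishes)
                        rest.append(tuple(s[:-1]))
                elif k == j:
                    rest.append(tuple(s) + (block,))
                else:
                    rest.append(tuple(s))
            yield (f"stack {block} on {stacks[j][-1]}", tuple(sorted(rest)))

def solve(start, goal):
    """Breadth-first search for a shortest plan from start to goal."""
    start, goal = tuple(sorted(start)), tuple(sorted(goal))
    frontier, seen = deque([(start, [])]), {start}
    while frontier:
        state, plan = frontier.popleft()
        if state == goal:
            return plan
        for action, nxt in successors(state):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, plan + [action]))
    return None

# B sits on A, C is alone; goal: a single stack C-B-A (A on top).
print(solve(start=(("A", "B"), ("C",)), goal=(("C", "B", "A"),)))
# -> ['stack B on C', 'stack A on B']
```

Classical search dispatches toy instances like this instantly; the benchmark's interest is whether an LLM can produce valid plans from the textual description alone, especially once the surface vocabulary is scrambled.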

o1-preview also does much better on ARC-AGI than GPT-4o, jumping from 9% to 21.2% on the public eval ('OpenAI o1 Results on ARC-AGI-Pub'). Note that since my original post, Claude-3.5-Sonnet has also reached 21% on the public eval.
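
For context on the second benchmark: an ARC-AGI task supplies a few input/output grid pairs demonstrating an unknown transformation, and the solver must infer the rule and apply it to a held-out test grid. Tasks are distributed as JSON in roughly the shape below; the mirror rule here is a made-up, deliberately easy example (real ARC rules are far harder to induce).

```python
# A toy task in the shape of an ARC-AGI problem. Grids are small lists of
# lists of color indices (0-9); "train" pairs demonstrate the rule, and the
# solver must produce the output for the "test" input.
task = {
    "train": [
        {"input": [[1, 0], [2, 3]], "output": [[0, 1], [3, 2]]},
        {"input": [[5, 5, 0]],      "output": [[0, 5, 5]]},
    ],
    "test": [{"input": [[7, 0, 0], [0, 7, 0]]}],
}

def mirror(grid):
    """Hypothesized rule: reflect each row left-to-right."""
    return [list(reversed(row)) for row in grid]

# A solver must induce the rule from the train pairs alone, then apply it.
assert all(mirror(p["input"]) == p["output"] for p in task["train"])
print(mirror(task["test"][0]["input"]))  # -> [[0, 0, 7], [0, 7, 0]]
```

Because every task uses a novel rule, there is little for a model to retrieve from training data, which is why ARC-AGI has been a popular operationalization of 'general reasoning'.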

The planning/scheduling evidence, on the other hand, seemed weaker almost immediately after the post: a commenter quickly pointed out that the paper was full of errors. Nonetheless, another recent paper looks at a broader range of planning problems and also finds substantial improvements from o1-preview, although arguably not the same level of 0-to-1 improvement that Valmeekam et al. find with obfuscated blocksworld ('On The Planning Abilities of OpenAI’s o1 Models: Feasibility, Optimality, and Generalizability').

I would be grateful to hear about other recent research that helps answer these questions (and thanks to @Archimedes for calling my attention to these papers).

Discussion

My overall conclusion, and the reason I think it's worth posting this follow-up, is that I believe the new evidence should update all of us toward LLMs scaling straight to AGI, and therefore toward timelines being relatively short. Time will continue to tell, of course, and I have a research project planned for early spring that attempts to more rigorously investigate whether LLMs are capable of the particular sorts of general reasoning that will allow them to perform novel scientific research end-to-end.

My own numeric updates follow.

Updated probability estimates

(text copied from my previous post is italicized for clarity on what changed)

 

Citations

'LLMs Still Can’t Plan; Can LRMs? A Preliminary Evaluation of OpenAI’s o1 on PlanBench', Valmeekam et al. (includes Kambhampati), 09/24

'OpenAI o1 Results on ARC-AGI-Pub', Mike Knoop, 09/24

'On The Planning Abilities of OpenAI’s o1 Models: Feasibility, Optimality, and Generalizability', Wang et al., 09/24

 

  1. ^

    I am restraining myself with some difficulty from jumping up and down and yelling about the level of goalpost-moving in this new paper.

  2. ^

    There's a sense in which comparing results from previous LLMs with o1-preview isn't entirely an apples-to-apples comparison, since o1-preview is throwing a lot more inference-time compute at the problem. In that way it's similar to Ryan's hybrid approach to ARC-AGI, as discussed in the original post. But since the key question here is whether LLMs are capable of general reasoning at all, that doesn't really change my thinking here; certainly there are many problems (like capabilities research) where companies will be perfectly happy to spend a lot on compute to get a better answer.
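
    As a toy illustration of that trade-off (and emphatically not o1's actual mechanism, which OpenAI hasn't published): even naive best-of-k sampling with majority voting converts extra inference-time compute into accuracy. The 60% base rate below is an arbitrary stand-in, and collapsing all wrong answers into one string makes voting easier than it is in practice.

```python
# Best-of-k majority voting: spend k model calls per question, keep the
# most common answer. Illustrative only; numbers are made up.
from collections import Counter
import random

def sample_answer(rng):
    # Stand-in for one stochastic model call that is right 60% of the time.
    return "correct" if rng.random() < 0.6 else "wrong"

def best_of_k(k, rng):
    votes = Counter(sample_answer(rng) for _ in range(k))
    return votes.most_common(1)[0][0]

rng = random.Random(0)
for k in (1, 5, 25):
    acc = sum(best_of_k(k, rng) == "correct" for _ in range(2000)) / 2000
    print(f"k={k:>2}: ~{acc:.0%} accuracy for {k}x the compute")
```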


