State of play of AI progress (and related brakes on an intelligence explosion) [Linkpost]

This article examines, and pushes back on, the view that AI could make breakthrough progress around 2027. The author argues that current ways of assessing AI progress are often misread: labs' gains on evaluations largely reflect optimization toward specific targets rather than rapid improvement in general intelligence. The article analyzes the roles of data research, RL training, and compute in AI development, and highlights the challenges AI faces in real-world use, such as multi-domain control and adapting to changing environments. The author concludes that AI development is more likely to be a gradual process than a sudden, sweeping transformation.

🤔 Evaluation benchmarks vs. actual progress: The author notes that the pace of progress on AI evaluations is easily overestimated, because current benchmarks are targets that labs optimize for rather than measures of rapidly improving general intelligence. Success on these evaluations reflects training goals more than a genuine leap in AI capability.

💡 Limits of current AI: The article argues that although AI will improve markedly on many specific tasks, no disruptive transformation is coming in the short term. AI is more likely to deliver value through broad automation than by fundamentally reshaping existing domains; its progress will be incremental rather than revolutionary.

💻 Data research and algorithmic progress: The author stresses that data research is the foundation of algorithmic progress in AI. For AI research itself to accelerate dramatically, it would have to be bottlenecked mainly by compute efficiency and implementation difficulty. In practice, ML research still depends heavily on exploring data, building intuition, and trial and error, which makes progress look more like slow accumulation.

🚀 Challenges of RL training: Although reinforcement learning (RL) plays an important role in AI development, applying RL to multi-domain control and real-world use still faces many obstacles, such as continually training language models in production and controlling multiple real-world systems. The complexity of the real world, including constant environmental drift, also makes it harder for models to stay adapted.

💰 Compute and future trends: The author points out that how compute is split between research and inference will become a key indicator as AI develops. If extremely rapid AI progress were truly available, serving inference would become a much lower priority relative to research. Given how scarce compute is, however, the share devoted to open-ended research may be smaller than people imagine, and AI development is more likely to be a hand-in-hand process than an autonomous takeoff.

Published on April 30, 2025 7:58 PM GMT

This time around, I'm sharing a post from Interconnects on why its author doesn't believe that the AI 2027 scenario by @Daniel Kokotajlo and many others will come true. The post has four sections plus a bonus section:

1. How labs make progress on evaluations

2. Current AI is broad, not narrow intelligence

3. Data research is the foundation of algorithmic AI progress

4. Over-optimism of RL training

Bonus: A world with compute shifting to inference

I'll quote a couple of paragraphs from each section below, but you will want to go to the article to get the full experience:

(1st section) One of the hardest things to communicate in AI is talking down the various interpretations of evaluation progress looking vertical over time. If the evals are going from 0 to 1 in one year, doesn’t that indicate the AI models are getting better at everything super fast? No, this is all about how evaluations are scoped as “reasonable” in AI development over time.

None of the popular evaluations, such as MMLU, GPQA, MATH, SWE-Bench, etc., that are getting released in a paper and then solved 18 months later are truly held out by the laboratories. They’re training goals. If these evaluations were unseen tests and going vertical, you should be much more optimistic about AI progress, but they aren’t.

Instead of reading rapid evaluation progress as stacking up into one cumulative, rapid improvement in intelligence, the plots (shown in the original post) should make one think that AI is getting better at many tasks, rather than becoming superhuman at narrow tasks.
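To make the scoping point concrete, here is a toy illustration (my addition, not a figure from the post): each benchmark's curve looks vertical on its own because it goes from roughly zero to solved within about 18 months of release, but the curves measure different, successively harder tasks rather than one cumulative line of intelligence. The release years are approximate and the logistic curves are invented.

```python
import numpy as np
import matplotlib.pyplot as plt

# Toy illustration (not data): each benchmark starts near 0 and is largely
# "solved" within roughly 18 months of release, so each curve looks vertical,
# but the curves are separate tasks, not one cumulative measure of intelligence.
benchmarks = {  # approximate release years; the curves themselves are invented
    "MMLU": 2020.7,
    "MATH": 2021.2,
    "GPQA": 2023.8,
    "SWE-Bench": 2023.8,
}

t = np.linspace(2020, 2026, 400)
for name, release in benchmarks.items():
    score = 1.0 / (1.0 + np.exp(-4.0 * (t - (release + 0.75))))  # saturates ~18 months out
    score[t < release] = np.nan  # the benchmark does not exist before release
    plt.plot(t, score, label=name)

plt.xlabel("year")
plt.ylabel("score (toy)")
plt.legend()
plt.show()
```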

(2nd section) In a few years, we’ll look back and see that AI is now 95% robust on a lot of things that only worked 1-5% of the time today. A bunch of new use cases will surprise us as well. We won’t see AI systems that are so intelligent that they cause seismic shifts in the nature of certain domains. Software will still be software. AI will be way better than us at completing a code task and finding a bug, but the stacks we are working on will be largely subject to the same constraints.

Epoch AI had a very complementary post to this view.

(Link below is to a Gradient Updates post titled Most AI value will come from broad automation, not from R&D)

https://epoch.ai/gradient-updates/most-ai-value-will-come-from-broad-automation-not-from-r-d

(3rd section) One of the main points of the AI 2027 forecast is that AI research is going to get 2X, then 4X, then 100X, and finally 1,000X as productive as it is today. This is based on end-to-end time for integrating new ideas into models and misinterprets the reality of what machine learning research is bottlenecked on. Scaling is getting more expensive. We don’t know what paradigm will come after reasoning for inference-time compute.

For machine learning research to accelerate at these rates, it needs to be entirely bottlenecked by compute efficiency and implementation difficulty. Problems like getting the maximum theoretical FLOPs out of Nvidia GPUs and making the loss go as low as possible. These are things that people are currently doing and represent an important area of marginal gains in AI progress in recent years.
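As a concrete, entirely illustrative example of that kind of compute-efficiency work, here is a back-of-the-envelope model-FLOPs-utilization (MFU) estimate. The model size, batch size, step time, and per-GPU peak below are assumptions I picked for the sketch, not figures from the post.

```python
# Rough MFU estimate for a hypothetical dense-transformer training run.
# All numbers below are illustrative assumptions, not figures from the post.

def estimate_mfu(params: float, tokens_per_step: float, step_time_s: float,
                 num_gpus: int, peak_flops_per_gpu: float) -> float:
    """MFU = achieved training FLOP/s divided by theoretical peak FLOP/s.

    Uses the common ~6 * N FLOPs-per-token approximation for a dense
    transformer (forward plus backward pass).
    """
    achieved_flops_per_s = 6 * params * tokens_per_step / step_time_s
    peak_flops_per_s = num_gpus * peak_flops_per_gpu
    return achieved_flops_per_s / peak_flops_per_s

if __name__ == "__main__":
    mfu = estimate_mfu(
        params=70e9,                # 70B-parameter dense model (assumed)
        tokens_per_step=4e6,        # ~4M-token global batch (assumed)
        step_time_s=1.0,            # seconds per optimizer step (assumed)
        num_gpus=4096,              # cluster size (assumed)
        peak_flops_per_gpu=989e12,  # ~989 TFLOP/s dense BF16 peak for an H100
    )
    print(f"Estimated MFU: {mfu:.1%}")  # ~41% with these assumptions
```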

ML research is far messier. It is far more reliant on poking around the data, building intuitions, and launching yolo runs based on lingering feelings. AI models in the near future could easily launch yolo runs if we give them the compute, but they’re not using the same motivation for them. AI systems are going towards rapid cycles of trial and error to optimize very narrow signals. These narrow signals, like loss or evaluation scores, mirror very closely to the RL scores that current models are trained on.

These types of improvements are crucial for making the model a bit better, but they are not the type of idea that gets someone to try to train GPT-3 in the first place or scale up RL to get something like o1.
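To illustrate what "rapid cycles of trial and error to optimize very narrow signals" can look like in the simplest possible form, here is a hypothetical hill-climbing loop over one hyperparameter against one score. `train_and_evaluate` is a stand-in for an expensive training-plus-benchmark run, and everything in it is made up.

```python
import random

# Hypothetical hill-climbing over training tweaks, keeping whatever nudges a
# narrow signal (an eval score) upward. `train_and_evaluate` is a placeholder
# for an expensive training-plus-benchmark run; the search space is invented.

def train_and_evaluate(config: dict) -> float:
    # Stand-in objective: pretend learning rates near 3e-4 score best.
    return 1.0 - abs(config["lr"] - 3e-4) * 1e3 + random.gauss(0, 0.01)

def hill_climb(n_trials: int = 20) -> tuple[dict, float]:
    best_config = {"lr": 1e-4, "warmup": 2000}
    best_score = train_and_evaluate(best_config)
    for _ in range(n_trials):
        candidate = dict(best_config)
        candidate["lr"] *= random.choice([0.5, 0.8, 1.25, 2.0])  # small tweak
        score = train_and_evaluate(candidate)
        if score > best_score:  # keep only what moves the metric
            best_config, best_score = candidate, score
    return best_config, best_score

if __name__ == "__main__":
    config, score = hill_climb()
    print(config, round(score, 3))
```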

A very popular question in the AI discourse today is “Why doesn’t AI make any discoveries despite having all of human knowledge?” (more here). Quoting Dwarkesh Patel’s interview with Dario Amodei:

One question I had for you while we were talking about the intelligence stuff was, as a scientist yourself, what do you make of the fact that these things have basically the entire corpus of human knowledge memorized and they haven't been able to make a single new connection that has led to a discovery?

(4th section) A lot of people are really excited about RL training scaling up further right now, which will inevitably involve extending it to more domains. Some of the most repeated ideas are adding RL training to continually fine-tune the model in real-world scenarios, including everything from web tasks to robotics and scientific experiments. There are two separate problems here:

1. Continually training language models to add new capabilities “in flight” in production is not a solved problem.

2. Training models to take actions in many domains.

The first problem is something that I’m confident we’ll solve. It’s likely technically feasible now that RL is the final stage of post-training and is becoming far more stable. The challenge with it is more of a release and control problem, where a model being trained in-flight doesn’t have time for the usual safety training. This is something the industry can easily adapt to, and we will as traditional pretraining scaling saturates completely.
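For a sense of the shape of the first problem, here is a toy simulation of an "in-flight" loop, with the release-and-control issue showing up as an explicit safety gate before updated weights are promoted. This is my sketch of the general idea, not an implementation described in the post.

```python
import random
from dataclasses import dataclass

# Toy simulation of "in-flight" RL fine-tuning: serve the current policy,
# score its outputs, apply an update, and gate the new weights behind safety
# evals before promoting them. Everything here is an invented stand-in.

@dataclass
class Interaction:
    prompt: str
    response: str
    reward: float  # e.g. verifier score, user feedback, task success

def collect_production_batch(policy: dict, batch_size: int) -> list[Interaction]:
    """Stand-in for serving live traffic and logging scored interactions."""
    return [
        Interaction(prompt=f"task-{i}", response="...",
                    reward=random.random() * policy["quality"])
        for i in range(batch_size)
    ]

def policy_update(policy: dict, batch: list[Interaction]) -> dict:
    """Stand-in for an RL step: nudge the policy toward higher average reward."""
    avg_reward = sum(x.reward for x in batch) / len(batch)
    return {"quality": policy["quality"] + 0.01 * avg_reward}

def safety_gate(policy: dict) -> bool:
    """Stand-in for regression/safety evals run before promoting new weights."""
    return policy["quality"] < 2.0  # arbitrary threshold for the toy example

def in_flight_loop(steps: int = 50) -> dict:
    policy = {"quality": 1.0}  # toy "weights"
    for _ in range(steps):
        batch = collect_production_batch(policy, batch_size=64)
        candidate = policy_update(policy, batch)
        if not safety_gate(candidate):
            break  # hold back the update rather than ship an unvetted model
        policy = candidate  # promote the updated weights to serving
    return policy

if __name__ == "__main__":
    print(in_flight_loop())
```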

The second issue is putting us right back into the territory of why projects on scaling robotics or RL agents to multiple domains are hard. Even the most breakthrough works from DeepMind, like GATO (multi-domain RL control) or RT-X (multi-robot control policies), have major caveats alongside their obvious successes.

Building AI models that control multiple real-world systems is incredibly hard for many reasons, some of which involve:

1. Different action spaces across domains mandate either modifying the domain to suit the underlying policy, which in this case means converting all control tasks to language, or modifying the model to be able to output more types of tokens (a toy sketch of this point follows the list).

2. The real world is subject to constant drift, so the continual fine-tuning of the model will need to do as much work just to maintain performance on systems that are genuinely degrading as it does to learn to use them in the first place.
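Here is the toy sketch of the action-space point referenced above: each domain's native actions have to be squeezed into one shared token interface (or the model has to be modified to emit new kinds of tokens). The domains, action formats, and discretization are invented for illustration.

```python
# Hypothetical sketch of the "different action spaces" problem: map each
# domain's native actions into one shared token vocabulary so a single policy
# can emit them. Domains and discretization are invented for illustration.

def web_action_to_tokens(action: dict) -> list[str]:
    # e.g. {"kind": "click", "x": 0.42, "y": 0.17} on a normalized page
    return ["<web>", action["kind"],
            f"x={round(action['x'], 2)}", f"y={round(action['y'], 2)}"]

def robot_action_to_tokens(joint_deltas: list[float], bins: int = 256) -> list[str]:
    # Discretize continuous joint deltas in [-1, 1] into `bins` buckets.
    tokens = ["<robot>"]
    for d in joint_deltas:
        bucket = min(bins - 1, int((d + 1.0) / 2.0 * bins))
        tokens.append(f"j{bucket}")
    return tokens

def encode(domain: str, action) -> list[str]:
    if domain == "web":
        return web_action_to_tokens(action)
    if domain == "robot":
        return robot_action_to_tokens(action)
    raise ValueError(f"unknown domain: {domain}")  # new domains need new mappings

print(encode("web", {"kind": "click", "x": 0.42, "y": 0.17}))
print(encode("robot", [0.05, -0.30, 0.92]))
```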

This sort of scaling of RL to new types of domains is going to look much more like recent progress in robotics research than like the takeoff pace of reasoning language models. Robotics progress is a slow grind and feels so different that it is hard to describe concisely. Robotics faces far more problems due to the nature of the environment rather than just the learning.

(Bonus section) While the AI research world is dynamic, engaging, and rapidly moving forward, some signs of the above being correct could already be emerging. A basic sign of this future coming true will be the share of compute spent on research decreasing relative to inference amid the rapid buildout. If extremely rapid AI progress were available to organizations that put in marginally more compute, serving inference would be a far lower priority. If investing in research had a positive feedback loop on potential business revenue, every lab would need to do it.

For example, consider our discussion of Meta’s compute allocation during Dylan’s and my appearance on the Lex Podcast:

(01:03:56) And forever, training will always be a portion of the total compute. We mentioned Meta’s 400,000 GPUs. Only 16,000 made Llama 3.
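A trivial calculation of what those quoted figures imply for the training share of Meta's fleet (the remainder covers inference and everything else; how that remainder splits is not given in the post):

```python
# Back-of-the-envelope share of Meta's GPUs used to train Llama 3,
# using only the figures quoted above (400,000 total, 16,000 for training).
total_gpus = 400_000
training_gpus = 16_000

training_share = training_gpus / total_gpus
print(f"Training share: {training_share:.1%}")                 # 4.0%
print(f"Everything else (inference, other work): {1 - training_share:.1%}")
```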

OpenAI is already making allocation trade-offs on their products, regularly complaining about GPUs melting. Part of the reason they, or anyone, could release an open-weights model is to reduce their inference demand. Make the user(s) pay for the compute.

Part of the U.S.’s economic strength is a strong services sector. AI is enabling that, and the more it succeeds there, the more companies will need to continue to enable it with compute.

With the changing world economic order, cases like Microsoft freezing datacenter buildouts are correlated indicators. Microsoft’s buildout is correlated with many factors, only one of which is potential training progress, so it’s far from a sure thing.

In reality, with the large sums of capital at play, it is unlikely that labs give so-called “AI researchers in the datacenter” free rein over billions of dollars of compute, because of how constrained compute is at all of the top labs. Most of that compute goes to hill-climbing on fairly known gains for the next model! AI research with AI aid will be a hand-in-hand process and not an autonomous take-off, at least for the next few years.

AI will make a ton of press, but it will not be an obvious acceleration. With traditional pretraining saturating, it could even be argued that, after the initial gains from inference-time compute, research is actually decelerating, but it will take years to know for sure.

Comments on this post are open to everyone; please discuss!

@ryan_greenblatt has already responded on both Substack and LessWrong, so he is aware of the post. I'm sharing this linkpost so that other LWers can discuss it as well; his comment on LessWrong is below:

https://www.lesswrong.com/posts/FG54euEAesRkSZuJN/ryan_greenblatt-s-shortform?commentId=gdtiyybyPyJts47fh
