Andrej Karpathy on Software 3.0: Software in the Age of AI

This post summarizes the key points of Andrej Karpathy's talk on AI at YC AI Startup School. He explores the concept of Software 3.0, arguing that prompts are now programs, and discusses the nature of LLMs - their potential as utilities, fabs, and operating systems - as well as their psychological challenges, including “jagged intelligence” and “anterograde amnesia”. The talk also covers partial autonomy, the human-AI generation-verification loop, and building tools for AI agents. Karpathy stresses the importance of partial autonomy and of closing the gap between demos and real products, calling on developers to treat AI agents as a new class of users.

💡**The rise of Software 3.0:** Andrej Karpathy argues that prompts are becoming the new programs, and that Software 3.0 is eating Software 1.0 and 2.0 - a major shift in how software gets built.

⚙️**Properties and challenges of LLMs:** LLMs resemble utilities, fabs, and operating systems, but they also exhibit psychological quirks such as “jagged intelligence” and “anterograde amnesia” that limit their real-world performance.

🤖**The importance of partial autonomy:** AI products need partial autonomy, with an “autonomy slider” that controls how much the AI does on its own, enabling smoother human-machine collaboration.

🔄**The generation-verification loop:** The talk advocates a faster generation-verification loop to make AI products more reliable, with an effective division of labor between AI generation and human verification.

Lots of people were excited about Andrej’s talk at YC AI Startup School today. Sadly, I wasn’t invited. Talks will be published “over the next few weeks”, by which time the talk might be deprecated. Nobody seems to have recorded fancams.

But… it’s not over. You can just do things!

Using PeepResearch I collated all available tweets about the talk and ordered them using hints from good notetakers (credited in the last slide). I'll go through most of the important takeaways here; subscribers can get the full slides at the bottom.

we’ll update this to annotate the full talk video when it’s up in a few weeks

Part 1a: Software 3.0 - Prompts are now Programs

We first discussed Software 3.0 in Rise of The AI Engineer, but it’s an obvious consequence of the Software 2.0 essay + “the hottest new programming language is English”.

He originally wrote the Software 2.0 essay while observing that it was eating Software 1.0 at Tesla. And he’s back now to update it for Software 3.0.

Instead of modifying the Software 2.0 chart like I did, Andrej debuts a new diagram showing the patchwork/coexistence of Software 1.0/2.0/3.0, noting that “Software 3.0 is eating 1.0/2.0” and that “a huge amount of software will be rewritten”:

Andrej is still focused on prompts as the programs, and we slightly disagreed back in 2023 and still do: the “1+2=3” variant of Software 3.0 - using all three paradigms together - is the entire reason why AI Engineers have far outperformed Prompt Engineers in the last few years and continue to do so.
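To make the three paradigms concrete, here is a minimal sketch (ours, not a slide from the talk) of the same sentiment task written in each; the `call_llm` wrapper in the 3.0 version is a hypothetical chat-completion client:

```python
# The same task, written three ways.

# Software 1.0: explicit rules, written by a human in code.
def sentiment_v1(text: str) -> str:
    positive = {"great", "love", "excellent"}
    negative = {"bad", "hate", "terrible"}
    words = text.lower().split()
    score = sum(w in positive for w in words) - sum(w in negative for w in words)
    return "positive" if score >= 0 else "negative"

# Software 2.0: the "program" is learned weights; code just invokes the model.
def sentiment_v2(text: str, classifier) -> str:
    return classifier.predict([text])[0]  # e.g. a trained sklearn pipeline

# Software 3.0: the program is an English prompt to an LLM.
def sentiment_v3(text: str, call_llm) -> str:
    prompt = f"Classify the sentiment of this review as positive or negative:\n{text}"
    return call_llm(prompt)  # hypothetical chat-completion wrapper
```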

Part 1b: LLMs are like….

LLMs are like Utilities

LLMs are like Fabs

LLMs are like OSes

LLMs are like Timeshare Mainframes…

although as he argues in Power to the People, LLMs also exhibit an unusual reversal of the normal diffusion of expensive frontier tech:

As we leave the cloud for Personal/Private AI, some signs of Personal Computing v2 are being born in Exolabs + Apple MLX work.

Part 1a + 1b summary:

Part 2: LLM Psychology

LLMs are “people spirits”: stochastic simulations of people, with a kind of emergent “psychology”

Andrej highlights two problems with how current LLMs simulate people:

Jagged Intelligence (https://x.com/karpathy/status/1816531576228053133):

The word I came up with to describe the (strange, unintuitive) fact that state of the art LLMs can both perform extremely impressive tasks (e.g. solve complex math problems) while simultaneously struggling with some very dumb problems. E.g. an example from two days ago - which number is bigger, 9.11 or 9.9? Wrong.

Some things work extremely well (by human standards) while some things fail catastrophically (again by human standards), and it's not always obvious which is which, though you can develop a bit of intuition over time. Different from humans, where a lot of knowledge and problem solving capabilities are all highly correlated and improve linearly all together, from birth to adulthood.

Personally I think these are not fundamental issues. They demand more work across the stack, not just scaling. The big one I think is the present lack of "cognitive self-knowledge", which requires more sophisticated approaches in model post-training instead of the naive "imitate human labelers and make it big" solutions that have mostly gotten us this far. For an example of what I'm talking about, see the Llama 3.1 paper's section on mitigating hallucinations: https://x.com/karpathy/status/1816171241809797335

For now, this is something to be aware of, especially in production settings. Use LLMs for the tasks they are good at, be on the lookout for jagged edges, and keep a human in the loop.
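One lightweight way to act on that advice (our illustration, not from the talk) is to run a few canary prompts with known answers before trusting the model on a real task, escalating to a human when the model misses; `call_llm` and `escalate_to_human` are hypothetical callbacks:

```python
# Canary prompts with known answers, checked before trusting the model on a
# real task. call_llm and escalate_to_human are hypothetical callbacks.
CANARIES = [
    ("Which number is bigger, 9.11 or 9.9? Answer with the number only.", "9.9"),
    ("How many times does the letter r appear in 'strawberry'? Digit only.", "3"),
]

def llm_with_guardrail(task_prompt: str, call_llm, escalate_to_human) -> str:
    for prompt, expected in CANARIES:
        if expected not in call_llm(prompt):
            # A jagged edge showed up on a known-easy case: hand off to a human.
            return escalate_to_human(task_prompt)
    return call_llm(task_prompt)
```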

Anterograde Amnesia (https://x.com/karpathy/status/1930003172246073412):

I like to explain it this way: LLMs are a bit like a coworker with anterograde amnesia - they don't consolidate or build long-running knowledge or expertise once training is over, and all they have is short-term memory (the context window). It's hard to build relationships (see: 50 First Dates) or do work (see: Memento) with this condition.

The first mitigation of this deficit that I saw is the Memory feature in ChatGPT, which feels like a primordial crappy implementation of what could be, and which led me to suggest this as a possible new paradigm of learning here: https://x.com/karpathy/status/1921368644069765486

We're missing (at least one) major paradigm for LLM learning. Not sure what to call it, possibly it has a name - system prompt learning?

Pretraining is for knowledge.

Finetuning (SL/RL) is for habitual behavior.

Both of these involve a change in parameters but a lot of human learning feels more like a change in system prompt. You encounter a problem, figure something out, then "remember" something in fairly explicit terms for the next time. E.g. "It seems when I encounter this and that kind of a problem, I should try this and that kind of an approach/solution". It feels more like taking notes for yourself, i.e. something like the "Memory" feature but not to store per-user random facts, but general/global problem solving knowledge and strategies. LLMs are quite literally like the guy in Memento, except we haven't given them their scratchpad yet. Note that this paradigm is also significantly more powerful and data efficient because a knowledge-guided "review" stage is a significantly higher dimensional feedback channel than a reward scalar.

Imo this is not the kind of problem solving knowledge that should be baked into weights via Reinforcement Learning, or at least not immediately/exclusively. And it certainly shouldn't come from human engineers writing system prompts by hand. It should come from system prompt learning, which resembles RL in the setup, with the exception of the learning algorithm (edits vs gradient descent). A large section of the LLM system prompt could be written via system prompt learning; it would look a bit like the LLM writing a book for itself on how to solve problems. If this works, it would be a new/powerful learning paradigm, with a lot of details left to figure out (how do the edits work? can/should you learn the edit system? how do you gradually move knowledge from the explicit system text to habitual weights, as humans seem to do? etc.).
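Since Karpathy explicitly leaves the mechanics open, here is a toy sketch of what a system-prompt-learning loop might look like; the `call_llm` wrapper, the lesson format, and the `lessons.md` file are all our assumptions:

```python
# A toy system-prompt-learning loop: lessons live in an explicit, editable
# scratchpad that is prepended to every future task - learning by text edits,
# not by gradient descent. call_llm and lessons.md are our assumptions.
LESSONS_FILE = "lessons.md"

def load_lessons() -> str:
    try:
        return open(LESSONS_FILE).read()
    except FileNotFoundError:
        return "(no lessons yet)"

def solve(task: str, call_llm) -> str:
    system = "You are a careful problem solver.\n\nLessons so far:\n" + load_lessons()
    return call_llm(system=system, user=task)

def review_and_learn(task: str, answer: str, feedback: str, call_llm) -> None:
    # The "review" stage distills rich feedback into one reusable lesson.
    prompt = (f"Task: {task}\nAnswer: {answer}\nFeedback: {feedback}\n"
              "Write one general, reusable lesson for next time, as a single bullet.")
    with open(LESSONS_FILE, "a") as f:
        f.write(call_llm(user=prompt) + "\n")
```

The point of the sketch is the learning algorithm: an edit to explicit text, rather than a gradient step on weights.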

Part 2 Summary:

Part 3: Partial Autonomy

We like the Iron Man Suit analogy — the suit extends us in two useful ways:

How can we design AI products that follow these patterns?

Part 3a: Autonomy Sliders

The Autonomy Slider is an important concept: it lets us choose the level of autonomy appropriate to the context, e.g. from suggestion-only autocomplete up to full agent mode, as in the sketch below:
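Here is a minimal sketch (our framing, not Karpathy's slide) of an autonomy slider as an explicit product setting, where each level trades human oversight for speed; the `change` and `approve` objects are hypothetical:

```python
# An "autonomy slider" as an explicit product setting: each level trades
# human oversight for speed. change and approve are hypothetical objects
# (change has .diff()/.apply(); approve() is a human-review callback).
from enum import IntEnum

class Autonomy(IntEnum):
    SUGGEST = 0      # AI proposes; human applies changes manually
    EDIT_CHUNK = 1   # AI edits a selected region; human reviews the diff
    EDIT_REPO = 2    # AI edits across files; human approves before commit
    AGENT = 3        # AI plans and executes; human audits after the fact

def apply_change(change, level: Autonomy, approve) -> bool:
    if level == Autonomy.SUGGEST:
        return False  # surface the suggestion only
    if level < Autonomy.AGENT and not approve(change.diff()):
        return False  # human rejected the diff
    change.apply()
    return True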

Part 3b: Human-AI Generation-Verification Loop

In the generation <-> verification cycle, we need a full workflow of partial autonomy - the faster the loop, the better. A minimal sketch follows:
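In this sketch (our assumptions), the AI generates, a human or cheap automated check verifies, and rejected work loops back with feedback; `generate` and `verify` are hypothetical callbacks:

```python
# The AI generates, the human verifies, and rejected work loops back with
# feedback. generate(task, feedback) and verify(candidate) -> (ok, feedback)
# are hypothetical callbacks; the point is to make each round-trip cheap.
def generation_verification_loop(task, generate, verify, max_rounds=5):
    feedback = None
    for _ in range(max_rounds):
        candidate = generate(task, feedback)  # AI does the generating
        ok, feedback = verify(candidate)      # human does the verifying
        if ok:
            return candidate
    raise RuntimeError("verification never passed; fall back to manual work")
```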

Part 3c: The Demo-Product Gap

The reason we need PARTIAL autonomy is the significant gap that still exists between a working demo and a reliable product.

He recounts riding a Waymo prototype with zero interventions in 2014 and thinking that self-driving is “here”… but there was still a lot to work out.

"Demo is works.any(), product is works.all()"

Part 4: Vibe Coding & Building for Agents

The tweet that launched a thousand startups:

now has its own Wikipedia page!

However, plenty of issues remain. While vibe coding MenuGen, he found that the AI speedups vanished shortly after getting local code running:

The reality of building web apps in 2025 is a disjoint mess of services that are very much designed for webdev experts to keep their jobs, and not accessible to AI.

Poor old Clerk got a NEGATIVE mention, and Vercel’s @leerob a positive one, for how their docs approaches are respectively tuned for humans vs. agents.

He also shouted out “Context builders” like Cognition’s DeepWiki, which we profiled for a lightning pod:

The bottom line is that toolmakers must realize that “there is a new category of consumer/manipulator of digital information”:

1. Humans (GUIs)

2. Computers (APIs)

3. NEW: Agents <- computers... but human-like
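One concrete pattern this suggests (our illustration, using Flask for brevity) is content negotiation for the new consumer: browsers get the rendered HTML docs page, agents get clean markdown they can actually ingest. The route, docs path, and user-agent markers below are all illustrative assumptions:

```python
# Content negotiation for the new consumer: browsers get the HTML docs page,
# agents get clean markdown. The route, docs path, and user-agent markers
# below are all illustrative assumptions.
from flask import Flask, request

app = Flask(__name__)
AGENT_MARKERS = ("python-requests", "curl", "langchain", "openai")

@app.route("/docs/quickstart")
def quickstart():
    ua = request.headers.get("User-Agent", "").lower()
    wants_md = any(m in ua for m in AGENT_MARKERS) or request.args.get("format") == "md"
    if wants_md:
        with open("docs/quickstart.md") as f:  # hypothetical markdown source
            return f.read(), 200, {"Content-Type": "text/markdown"}
    return "<html><!-- rendered quickstart page --></html>"
```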

Closing / Recap

Less AGI 2027 and flashy demos that don’t work.

More partial autonomy, custom GUIs and autonomy sliders.

Remember that Software 3.0 is eating Software 1.0/2.0 and that LLMs' utility/fab/OS characteristics will dictate their destiny; improve the generation-verification loop, and BUILD FOR AGENTS 🤖.

Full Slides for LS Subscribers

here :)
