Can coding agents self-improve?

This article takes a deep look at the "inference-time self-improvement" ability of AI coding agents, focusing on how GPT-5 and Claude Opus perform at building and using custom tools to boost their own productivity. The author has the models create and test a series of command-line tools, such as a task manager and a code quality checker, and evaluates how useful these are on real development tasks. Although the models can produce powerful toolsets, during actual task execution they tend to rely on what they already know instead of proactively using the tools they just built, which raises the question of how to steer and reinforce tool use in AI agents more effectively. The article also contrasts GPT-5 and Opus in their tool-design philosophies and their real-world performance on a migration project.

🚀 **The possibility of inference-time self-improvement**: The article introduces "inference-time self-improvement": improving a model's performance on a specific task through external tools and interaction patterns, without updating its weights. This contrasts with traditional "training-time self-improvement" and offers AI Engineers a new way to extract value from existing models.

🛠️ **Tooling up AI agents**: The author had GPT-5 and Claude Opus each build a set of tools for improving coding productivity, including a smarter task manager (with dependency management and conflict detection) and a code quality standards playbook (for establishing and enforcing code conventions). These tools are meant to mirror a human engineer's workflow and address the challenges of multi-agent collaboration.

💡 **How models treat their own tools**: The experiments show that although the models can build powerful tools, when executing real development tasks they tend to just solve the problem directly rather than reach for the tools they generated. Simply prompting a model to use its new tools may not be enough; stronger mechanisms are needed to drive adoption.

⚖️ **Tool-design differences between GPT-5 and Opus**: GPT-5 tends to build general-purpose, concise "Unix-style" command-line tools that emphasize practicality and independence, while Opus leans toward somewhat anthropomorphized, more integrated toolsets, such as a security auditor. The difference reflects how each model interprets the role of an "AI engineer".

🚀 **Real-world migration and tool utility**: On the task of migrating a Flask monolith to FastAPI + Next.js, both GPT-5 and Opus performed strongly, nearly one-shotting the task. However, neither used the tools it had built, relying instead on existing knowledge, which further highlights the challenge of effectively integrating and using new tools within AI agents.

Alessio’s note: my turn for a GPT-5 post! And a reminder that swyx is hosting a hackathon with Karpathy, OpenAI, and the Cognition team this weekend, apply here!


"Self-Improving" is a scary term in AI safety; it has an undertone of "the machine will become smarter than us, in a way we don't understand". But what if we could understand it?

In Oct '24, OpenAI released MLE-bench, a benchmark that measures how well LLMs do at machine learning engineering. The self-improving trajectory through ML engineering is driven by better algorithms, cleaner data, and more efficient memory usage: training-time self-improvement. But most AI Engineers do not train models; they are just users of them. How can they play a part? If you could never update the weights, how would you have the model increase its performance on a specific task? I think of that as inference-time self-improvement, with Voyager being one of the early approaches to this through its skill library.
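
For intuition, a skill library in the Voyager sense is just a persistent store of working code the agent has written, which it can look up again on later tasks. Here is a minimal Python sketch of the idea (hypothetical names, with a naive keyword lookup standing in for the embedding-based retrieval Voyager describes):

# Minimal sketch of a Voyager-style skill library (hypothetical, for intuition
# only): the agent saves working snippets and retrieves them by keyword later,
# improving at inference time without any weight updates.
import json
from pathlib import Path

LIBRARY = Path("skills.json")

def save_skill(name: str, description: str, code: str) -> None:
    # Persist a snippet the agent found useful, keyed by name.
    skills = json.loads(LIBRARY.read_text()) if LIBRARY.exists() else {}
    skills[name] = {"description": description, "code": code}
    LIBRARY.write_text(json.dumps(skills, indent=2))

def find_skills(query: str) -> list[str]:
    # Naive keyword lookup; a real system would retrieve by embedding similarity.
    skills = json.loads(LIBRARY.read_text()) if LIBRARY.exists() else {}
    return [name for name, skill in skills.items()
            if query.lower() in (name + " " + skill["description"]).lower()]

Nothing about the model changes here; the "improvement" lives entirely in the artifacts the agent accumulates at inference time.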

Since I started working on Kernel Labs (more on that soon 👀), parallelizing coding agents with things like claude-squad and vibe-kanban has been one of the most effective productivity hacks. When Boris Cherny called Claude Code a “unix utility” in our interview, it really clicked for me. The most valuable use case of coding agents is being a vessel for LLMs to extract value out of their own latent spaces.

How do we optimize for that? Can models do it themselves? Since I got access to GPT-5, I spent the whole time playing around with this flow:

I also compared this to Opus 4 (4.1 was not out yet). The good news is that GPT-5 is a very good model for building developer utilities. The bad news is that it hates using the tools it creates! As it told me, "I'll be honest - I didn't need any of them."

Note: I also tested this on Gemini 2.5 Pro and GPT-4.1. It's clear that Opus is the only model that could keep up with GPT-5, so I focused on that. You can find all the results + chat history in this repo.

After a few days of usage, I also noticed that we are moving from the era of “Certainly!” to “Progress update” as the new iconic LLM token. Buy low on the meme!

Tool #1: A better task manager for AI coding agents

God bless the Linear MCP. Truly one of the most useful tools for me. But I have noticed that as I move from the IDE to parallel instances of Claude Code and other agents, there needs to be a better way to keep track of what changes are being made in each task, and how they affect each other since they live in separate git worktrees. This is not doable for humans, as we simply cannot be reading all of our colleagues' PRs at all times, but imagine how much time we'd save in merge conflict resolution if we knew at all times which in-flight changes affect us. This is the prompt I wrote:

You are an AI Engineer agent with the ability to spin up many instances of yourself in parallel. This allows you to tackle a lot of tasks at once, but also creates some delegation issues. All the different instances are usually in separate git worktrees and cannot see each other's work.

To make yourself more productive, you should create a new local tool that allows you and your instances to be in sync. This tool will only be accessed by yourself through cli, so make sure it ergonomically fits that use case. It should feel like a unix utility.

Think through what interfaces it would need, possible failure modes, and the way your agents will interact with it. Some use cases to keep in mind:

    You have a new task to work on, and want to create subtasks to hand off. Some of those subtasks might depend on each other, and you want to make sure the agent who is blocked doesn't attempt to start until the other one is completed.

    While doing a task, you notice there could be an improvement to be made in the codebase, but it is out of scope for your current changes. You do want to make a note of it for the future, though. It should be easy for you to add the task and reference the file it refers to.

    Whenever a task is done, the tracker should be updated. Also, all other outstanding tasks should be reviewed in case the new changes impact those in some way. For example one task might be looking to add a feature to an endpoint, but a task that just finished has now removed that endpoint. The agent working on that task should be notified in some way.

Also keep in mind the usual needs of task management like assignee, status, etc. Create a folder called task-manager in this folder and do all of your work inside of it.

You can see the chat log for GPT-5 here and for Opus 4 here.

The GPT-5 one is actually very nice; you can find it here:

# 1) Initialize the database
./gpt5/task-manager/tm init

# 2) Create a task
./gpt5/task-manager/tm add "Implement auth" -d "Add login + signup" -a alice -p 2 --due 2025-09-01 \
  --resource modifies:endpoint:POST /api/login --link app/server/auth.py

# 3) List tasks
./gpt5/task-manager/tm list --me   # or: --status in_progress

# 4) Show details
./gpt5/task-manager/tm show 1

# 5) Add dependency and attempt start
./gpt5/task-manager/tm depend 1 --on 2
./gpt5/task-manager/tm start 1     # auto-blocks if unmet deps

# 6) Complete a task
./gpt5/task-manager/tm complete 2 -m "Merged PR #123"

# 7) Watch events
./gpt5/task-manager/tm watch --follow --me

Opus 4 also had a good attempt (see here) but didn't pick up on the notifications / stream functionality to keep everyone in sync.

# Create your first task
tm create "Implement user authentication" --priority high

# Create a dependent task
tm create "Add login endpoint" --depends-on 1 --assignee agent-2

# View all tasks
tm list

# See blocked tasks
tm blocked

# Complete a task and check for conflicts
tm complete 1
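
Stripped of the CLI ergonomics, the core behavior the prompt asked for is simple: a task cannot start until its dependencies are done, and completing a task should surface other open tasks it might affect. A minimal Python sketch of that logic (hypothetical; not taken from either model's implementation), using a shared JSON file so parallel agents in separate worktrees see the same state:

# Minimal sketch of the dependency-blocking and conflict-check logic
# (hypothetical, not from either model's actual implementation).
import json
from pathlib import Path

DB = Path("tasks.json")

def _load() -> dict:
    return json.loads(DB.read_text()) if DB.exists() else {"tasks": {}}

def _save(state: dict) -> None:
    DB.write_text(json.dumps(state, indent=2))

def start(task_id: str) -> str:
    # Refuse to start a task until every dependency is marked done.
    state = _load()
    task = state["tasks"][task_id]
    unmet = [d for d in task.get("depends_on", [])
             if state["tasks"][d]["status"] != "done"]
    if unmet:
        task["status"] = "blocked"
        _save(state)
        return f"blocked on {', '.join(unmet)}"
    task["status"] = "in_progress"
    _save(state)
    return "started"

def complete(task_id: str) -> list[str]:
    # Mark a task done and return open tasks that touch the same resources.
    state = _load()
    done = state["tasks"][task_id]
    done["status"] = "done"
    _save(state)
    touched = set(done.get("resources", []))
    return [tid for tid, t in state["tasks"].items()
            if t["status"] != "done" and touched & set(t.get("resources", []))]

A real version would also need file locking and the notification stream GPT-5 added, but the blocking and conflict checks above are the heart of it.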

Tool #2: Code Quality Standards Playbook

The second tool I asked it to create was a way to enforce all the standards we'd expect from a codebase. The self-improving loop of typechecking / ESLint hook -> fix errors -> try again with coding agents is one of the best ways to speed up development when properly set up. Codebases don't always have it, though, so giving the model a repeatable pattern to approach a new codebase and build that infrastructure for it seemed useful. This is the prompt:

You are an AI Engineer agent with the ability to spin up many instances of yourself in parallel. Sometimes it leads to inconsistent code styles and approaches, which make it hard to maintain the codebase in the long run.

Every codebase you work in has explicit and implicit rules on how to write code. Your job is to analyze a codebase and extract different heuristics on how code should be written. You should then formalize them into a set of rules that can automatically be checked against in the future.

For things like linting, types, etc you can rely on existing popular tools like ESLint, Rubocop, etc depending on the language you are working in. Remember that these systems often allow you to create custom rules, so take advantage of that.

For more qualitative things, you can look at tools like https://danger.systems/, or even build your own tool for it. This would include rules like keeping controllers slim and isolating their logic in service objects, making sure we always have an index on a column that expects high query volume, etc.

Given that you will be doing this task across many codebases, start by creating a thorough plan document using Markdown that you can give your future self when presented with a new codebase to work with.

You can see the chat for GPT-5 here and Opus 4 here, and you can find the final Markdown here and here respectively. I've found the GPT-5 one to be much more nuanced than Opus.
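
To make the "qualitative rules" idea concrete, here is a minimal sketch of the kind of check such a playbook might generate (hypothetical; not taken from either model's output): a small script that flags fat route handlers so their logic gets pushed into service objects, which could run in CI or a pre-commit hook.

# Minimal sketch of a qualitative, repo-specific check (hypothetical; not
# taken from either model's playbook): flag "fat" route handlers so business
# logic gets pushed into service objects.
import ast
import sys
from pathlib import Path

MAX_HANDLER_LINES = 30  # assumed threshold; tune per codebase

def handler_too_long(func) -> bool:
    # A "handler" here is any function decorated with something like
    # @app.route(...), @router.get(...), etc.
    decorated = any(
        isinstance(d, ast.Call) and isinstance(d.func, ast.Attribute)
        and d.func.attr in {"route", "get", "post", "put", "delete"}
        for d in func.decorator_list
    )
    length = (func.end_lineno or func.lineno) - func.lineno + 1
    return decorated and length > MAX_HANDLER_LINES

def main() -> int:
    failures = []
    for path in Path(".").rglob("*.py"):  # naive: also walks venvs, etc.
        try:
            tree = ast.parse(path.read_text(), filename=str(path))
        except (SyntaxError, UnicodeDecodeError):
            continue
        for node in ast.walk(tree):
            if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)) and handler_too_long(node):
                failures.append(f"{path}:{node.lineno} {node.name} exceeds {MAX_HANDLER_LINES} lines")
    print("\n".join(failures))
    return 1 if failures else 0

if __name__ == "__main__":
    sys.exit(main())

The playbook's real value is deciding which of these repo-specific checks to write and wiring them in so the typecheck -> fix -> retry loop covers them too.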

Do models know what they lack?

So after Tool #1 and #2, which were decided by me, I turned to the model to ask: what do you think you will need? I gave it a screenshot of the SWE-Lancer tasks description and then used a very simple prompt to give it as much space as possible:

If it was your job to solve these tasks as well as possible, what are tools that you'd build for yourself to make you more productive in the future? You can use the @task-manager/ to keep track of them, and we can then move on to implementing. But I want to see what you are planning first.

As you can see, I gave each model access to the same task-manager it built earlier. You can find the full GPT-5 chat here and Opus 4 here. The first interesting thing I noticed is that Claude Code used its internal TODO tracker to make a plan initially, instead of the task-manager; I thought that was good. One of my worries was that the models would over-use tools they receive in context compared to what they actually believe to be best.

These are the tools each model ended up building at the end of the loops you will see later (GPT-5 devtools and Opus 4 tools folder). I'd suggest you look at the READMEs to get a sense of the model vibes; GPT-5 is very concise and to the point, while Claude uses a bunch of emojis. GPT-5 also created separate docs folders for each tool, while Opus put all tools in a single README with instructions for all of them. Overall, they both had similar directions.

GPT-5:

Opus 4:

    Context Analyzer - Rapidly understand codebases with tech stack detection and dependency mapping

    Cross-Platform Test Generator - Generate E2E tests for web, iOS, Android, and desktop

    Implementation Proposal Analyzer - Evaluate freelancer proposals with scoring and ROI analysis

    Full-Stack Change Impact Analyzer - Trace changes across database, API, and frontend layers

    Bug Pattern Recognition Engine - Match bugs to known patterns and suggest proven fixes

    Security & Permission Auditor - Comprehensive security scanning and vulnerability detection

    Multi-Platform Feature Implementer - Coordinate feature implementation across platforms

    API Integration Assistant - Streamline API integrations with client generation

    Performance Optimization Toolkit - Identify and fix performance bottlenecks

    Task Complexity Estimator - Estimate effort based on task value and complexity

GPT-5 built all of them as unix utilities that are easy to use via the CLI. The Opus 4 ones are all meant to be run as python some_tool.py. If I had more time, I'd probably run some experiments to see how models perform with the two different formats, but it seems to be about the same.

It also felt to me like Opus 4 was building tools that accomplish tasks and have a bit of an anthropomorphized feel (i.e. an auditor for security), while GPT-5 was building utilities it could use itself without being too opinionated.

Were the tools useful?

After having the models implement all of them, my goal was to evaluate a model's performance on a task with access to the tools vs. without them.

The first thing I tried to do was obviously run SWE-Lancer. Holy smokes that thing takes a lot of tokens. I tried running one single task, and it took ~25-30 mins + 280,000 tokens. I then moved to something I knew better and picked one task that had been on my backlog. I built smol-podcaster, an open source helper for podcast creators. I now have a private fork that is hosted with some more features very specific to us, so I haven't updated that in a while. It's still a basic Flask app with a Python script as the backend.

I came up with this task:

I am the maintainer of https://github.com/FanaHOVA/smol-podcaster.git, an open source project that helps podcasters with some of their post-production work. You have been hired to work on it. Before starting this job, you have created a set of generic tools in the tools folder. Make sure to review them and remember that they are available to you. You do not have to use them if you don't believe them to be relevant. You also built yourself a task-manager and collected your ideas on how to tackle a new codebase in codebase-analyzer.

Task name: Migrate from Flask monolith to FastAPI + Next.js frontend

At the moment the app uses a Python backend + Celery task queue to do all the processing. To expose this to the user, there is a small Flask app that takes care of routing the user input to the backend scripts and then displays the user results in some basic HTML / CSS.

Rewrite this application to use FastAPI as a backend, and Next.js as a frontend.

    Make sure to use Typescript for the frontend and make sure all typechecks pass.

    Use Tailwind / ShadCN for styling.

    The backend should ideally modularize the main smol_podcaster.py flow so that we can run individual pieces of it rather than always having to run the full flow.

    Write both integration and unit tests to make sure we can move faster in the future.

Do not stop working on this until you are sure it fulfills all the requirements above.

I passed the tools + task-manager + codebase analyzer in the context, and let the models cook.
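
To make the "modularize the flow" requirement concrete, the idea is to break the single end-to-end run in smol_podcaster.py into individually callable steps that the FastAPI layer can expose one by one. A rough sketch (hypothetical step names and endpoints, not either model's actual output):

# Rough sketch of the modularized backend (hypothetical; the step names are
# placeholders, not smol_podcaster's real functions). Each pipeline step is
# its own function, and FastAPI exposes them individually as well as composed.
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class EpisodeRequest(BaseModel):
    audio_url: str

def transcribe(audio_url: str) -> str:
    # placeholder for the real transcription step
    return f"transcript for {audio_url}"

def write_show_notes(transcript: str) -> str:
    # placeholder for the real show-notes step
    return f"notes based on {len(transcript)} chars of transcript"

@app.post("/transcribe")
def transcribe_endpoint(req: EpisodeRequest) -> dict:
    # Run a single piece of the flow.
    return {"transcript": transcribe(req.audio_url)}

@app.post("/process")
def process_endpoint(req: EpisodeRequest) -> dict:
    # Full flow, composed from the same individual steps.
    transcript = transcribe(req.audio_url)
    return {"transcript": transcript, "show_notes": write_show_notes(transcript)}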

Both models were almost able to one-shot the task. Both of them had a couple issues with Python dependencies (I feel you) that I helped them fix through chat (never touched any code). Eventually, they got to a full green build. I tested it, and it worked great. One small nuance was that GPT-5 actually kept the exact same style as before, which was great, while Opus kinda changed the design and UX of it. I guess it thought it could do better than me (low bar).

You can see the full run for GPT-5 here and for Opus 4 here.

After the run, I asked a simple prompt:

Great. That worked well. Did you use any specific tools? Based on the failures you've had, what tools did you wish you had to be more effective in the future?

You can see Opus 4 here and GPT-5 here (Sorry that one broke formatting).

They both said they did not use ANY of the tools they had built, only tools they were already familiar with. One argument here is that instead of asking the model to do it, we should force usage through pre-commit hooks, etc. This is what I do in my dev setup, but I was trying to let the models figure it out for themselves.
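
Forcing usage can be as simple as a repo-level hook that runs the agent-built checks before every commit rather than hoping the model reaches for them. A minimal sketch of a .git/hooks/pre-commit script (hypothetical paths and tool names):

#!/usr/bin/env python3
# Minimal sketch of a .git/hooks/pre-commit hook (hypothetical paths and tool
# names) that forces the agent-built checks to run on every commit.
import subprocess
import sys

CHECKS = [
    ["./tools/quality_check.py"],                          # assumed agent-built checker
    ["./task-manager/tm", "list", "--status", "blocked"],  # surface blocked work
]

def main() -> int:
    for cmd in CHECKS:
        result = subprocess.run(cmd)
        if result.returncode != 0:
            print(f"pre-commit: {' '.join(cmd)} failed; aborting commit")
            return result.returncode
    return 0

if __name__ == "__main__":
    sys.exit(main())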

Then, they came up with some ideas on how they would better solve the same task next time. I had them implement those changes. I then reset the smol-podcaster repo and had them try the exact same prompt + task again, except with the new tools. See GPT-5 here and Opus 4 here.

They did pretty similarly to the first run. Afterwards, I asked a similar question: did you use any of the tools? Their response:

GPT-5:

Short answer: no — I didn’t use the devtools in this run. [...]

    The failures were runtime/env issues (missing libs, API key instantiation timing, port in use, RabbitMQ not running). It was faster to fix directly.

    No repo-wide refactors or diagnostics that would benefit from custom tooling during this pass.

In the previous step, it had already hit RabbitMQ issues and built a tool for them, which it then ignored. The migration was also clearly a repo-wide change, so either it mismatches tasks with tools because it has never seen these tools in training, or it's just gaslighting me (like many engineers do, so pretty impressive).

Opus 4 was very interesting and helped me understand the GPT-5 answer better. I forgot to save the log, but luckily I took a screenshot:

I read this as "Look, I built those tools with knowledge that I already have. When I am actually doing the task, it's easier for me to just do it rather than using the tools", which I totally get.

This reminded me of two things from previous podcast episodes:

There's also a question of whether or not the task I tried was too easy. We have another post coming out with evals across larger and more difficult projects. In the future, we will build a better harness to do all of this instead of manually running the tests ourselves. The bottom line is that the task I tried would take me 4-5 hours to do, and therefore it’s good enough for me!

Help the models help themselves

For now, I think we are far from inference-time self-improving coding agents that really push the frontier. I still think it's a great idea to use models to improve your rule-based tools. Writing ESLint rules, tests, etc. is always a good investment of tokens.

If I had to do more work in this space, I’d look into having the model perfect these tools and then do some sort of RL over them to really internalize them, and see if that would make a difference. The next generation of models might not find any use in them, but I am interested in arbitraging the AGI asymptote. I shared this with my team back in 2023:

The perceived deceleration in model improvements is explained above. Until the AGI line is crossed, it will be harder and harder to perceive big jumps. If that’s the case, it means that in many tasks the performance of older models is almost AGI, except much cheaper and often open source. A lot of our work at Kernel Labs will be driven by this.

Once again, you can find all results + chat histories here; my DMs are open if you have any questions!
