This seems like an important piece of work: an RCT on the effect of AI tools on developer productivity. The TL;DR of the paper is that experienced open-source developers took roughly 19% longer to complete tasks when allowed to use AI tools, even though they predicted beforehand, and still believed afterwards, that AI had sped them up.
The devil is in the details, of course. This post by Steve Newman does a good job of working through some of them.
I have highlighted some considerations from it:
The methodology was as follows:
1. METR recruited 16 developers from major open-source projects.
2. Each developer selected a list of coding tasks from their todo list, breaking up large projects into tasks that they could complete in an hour or two. In all, 246 tasks were included in the study.
3. The developers estimated how long it would take them to complete each task (a) under normal conditions, and (b) without using any AI tools. The percentage difference between these figures yields the predicted speedup, the degree to which the developer expected that AI tools would boost their productivity (see the sketch after this list).
4. Each task was randomly assigned to one of two categories: "AI Allowed" (the developer can use any tools they like) or "AI Disallowed" (the developer cannot use AI coding tools or features).
5. The developers went about their work, while recording their screens for later analysis. After each task, they reported the time spent. For AI Allowed tasks, they also estimated how much time AI tools had saved them, the retrodicted speedup.
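To pin down the arithmetic, here is a minimal Python sketch of that speedup calculation. It assumes the convention that speedup is the fractional time saved relative to the no-AI time; the function name and the example numbers are mine, not the paper's:

```python
def predicted_speedup(est_normal_min: float, est_no_ai_min: float) -> float:
    """Fractional time saved relative to the no-AI estimate.

    Convention assumed here: speedup = 1 - (time with AI / time without AI),
    so +0.20 means the developer expects to be 20% faster with AI allowed.
    A negative value would mean AI is expected to slow them down.
    """
    return 1.0 - est_normal_min / est_no_ai_min

# Hypothetical example (not data from the study): a developer estimates a
# task at 45 minutes under normal conditions and 60 minutes without AI.
print(f"predicted speedup: {predicted_speedup(45, 60):.0%}")  # -> 25%
```

The retrodicted speedup works the same way, except the inputs are the actual time spent on an AI Allowed task and the developer's after-the-fact estimate of how long it would have taken without AI.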
Several aspects of the study play to the weaknesses of current tools. First, it was conducted on mature projects with extensive codebases. The average project in the study is over 10 years old and contains over 1 million lines of code – the opposite of “greenfield”. Carrying out a task may require understanding large portions of the codebase, something that current AI tools struggle with. (This may be less a fundamental weakness of AI models, and more a design choice in some coding tools to limit the amount of “context” sent to the model, in order to control costs and get quicker responses.) It also involved editing large files, which may be “out of distribution” for most AI models (i.e. they may not get much training on large files).
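As a rough illustration of that design choice (entirely hypothetical, not a description of any particular tool): a coding tool might rank candidate files by relevance and stop adding them once a token budget is hit, so the model never sees most of a million-line codebase.

```python
def select_context(files: dict[str, str], query_terms: set[str],
                   token_budget: int = 8_000) -> list[str]:
    """Greedy context selection under a token budget (hypothetical sketch).

    Scores each file by crude keyword overlap with the query, then adds
    files in score order until the budget runs out. Real tools use
    embeddings and smarter chunking; the point is only that most of a
    large codebase is never sent to the model.
    """
    def score(text: str) -> int:
        return sum(text.count(term) for term in query_terms)

    def tokens(text: str) -> int:
        return len(text) // 4  # rough heuristic: ~4 characters per token

    chosen, used = [], 0
    for path, text in sorted(files.items(), key=lambda kv: -score(kv[1])):
        cost = tokens(text)
        if used + cost > token_budget:
            continue
        chosen.append(path)
        used += cost
    return chosen
```

Under a budget like this, a task whose answer depends on files the ranker misses will produce plausible-looking but wrong output, which fits the failure mode the study describes.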
Second, most of these open-source projects have strict style guidelines. The experienced developers in the study were accustomed to coding according to their project’s guidelines, but the AI tools are not – thus requiring developers to review and fix the AI’s output.
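For concreteness (my illustration, not something the study describes): many mature projects gate contributions on automated style checks, so AI output that ignores the project's conventions fails before a human even reviews it. A minimal sketch of such a gate, with the formatter and path chosen purely for illustration:

```python
import subprocess

# Hypothetical pre-review gate: reject a patch that violates the project's
# formatter. `black --check --diff` exits nonzero if files need reformatting.
result = subprocess.run(["black", "--check", "--diff", "src/"])
if result.returncode != 0:
    print("Patch violates style guidelines; fix before requesting review.")
```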
Third, the developers in the study had years of experience working on their projects, meaning that they were able to work very efficiently, which sets a high bar for AI to compete with.
Note that the model used here was Claude Sonnet 3.5/3.7, which was SOTA at the time the study was conducted. I expect things to be slightly different now; anecdotally, Opus 4 seems to have been a big jump among my SWE friends.
Tweets I found interesting: