Benchmarks for AI in Software Engineering

This article examines the current state of, and shortcomings in, benchmarks for AI applied to software engineering. The author notes that while benchmarks such as HumanEval and SWE-bench have driven progress in AI code generation and issue resolution, they suffer from problems of representativeness, diversity, complexity, and contamination by model training data. The ML community favors large-scale, easily auto-scored benchmarks, whereas the software engineering community needs tests that better reflect real workflows. The article calls for cross-community collaboration to develop benchmarks that are more realistic, more diverse, and better able to evaluate AI across the software development life cycle, and it proposes directions for future work, including partnering with industry, continuous data curation, improved automated scoring, and filling the gaps left by existing benchmarks.

📏 **Why benchmarks matter, and where they fall short today**: Benchmarks are a key tool for measuring AI progress in software engineering, especially for AI-driven code generation and issue resolution. However, today's mainstream benchmarks, such as HumanEval and SWE-bench, fall short in representing the real complexity and diversity of software engineering, avoiding training-data contamination, and assessing how AI performs in realistic working scenarios, so their results may not accurately reflect AI's true capabilities.

📊 **Limitations of HumanEval and SWE-bench**: HumanEval consists mainly of "coding puzzle"-style Python problems that bear little relation to a software engineer's day-to-day work; it is close to saturated and may be contaminated. SWE-bench is closer to real GitHub issue resolution, but its data comes from a narrow source (only 12 Python projects), some issue descriptions effectively contain the solution, and the test cases do not always verify the fix. Results from SWE-bench-Live show that LLMs perform far worse on the newly curated data than on the original, hinting at contamination.

⚖️ **Differing perspectives of the ML and software engineering communities**: The ML community prefers large-scale, automatically scored benchmarks that drive rapid model improvement, sometimes at the cost of the broader user experience. The software engineering community cares more about whether a benchmark closely captures a specific product experience, and will accept smaller scale, complex setup, or non-automated scoring as long as it provides a reasonable offline evaluation signal. This difference in perspective makes it hard to unify the two communities' benchmark requirements and standards.

🚀 **Directions for future benchmarks**: Bridging the gap will require collaboration to build benchmarks with greater representativeness, diversity, and complexity, backed by continuous data curation and refresh. It also calls for automated scoring methods that come closer to human judgment, and for filling benchmark gaps in important but under-covered parts of the software development life cycle, such as code transformation, code review, debugging, and code reasoning.

Benchmarks drive many areas of research forward, and this is indeed the case for two areas of research that I engage with: software engineering and machine learning. With increasing emphasis on AI (especially LLMs) for coding, it is no surprise that benchmarks have played an important role at the intersection of the two areas. Researchers working on AI applied to code have recently been trying to improve the performance of large language models on SWE-bench, a dataset of GitHub issues, and, just a couple of years ago, on HumanEval, a dataset of programming problems meant to challenge models' code generation capabilities.

Benchmarks are important for those of us who build software development products that incorporate AI. While the actual product experience in the hands of real users is the ultimate measure of success, such a signal comes with a time lag and with some inconvenience to some of the users. Therefore, a reasonable offline proxy evaluation of the product is important, and most companies invest in such evals, which we will treat as a synonym of “benchmarks.” The offline evaluation judges whether our products that incorporate AI are getting better. Sometimes there also is a choice of models or agent frameworks to be made, and evals are needed for these.
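To make this concrete, here is a minimal sketch of such an offline eval used to compare two candidates (say, two models or two agent frameworks) before committing to an online A/B test. The `Task` structure, the per-task automated `check`, and the candidate callables are illustrative assumptions for this sketch, not anything from a specific product.

```python
# Minimal sketch of an offline eval used to compare two candidates
# (e.g., two models or two agent frameworks) before an online A/B test.
# The task format and candidate interface are hypothetical placeholders.

from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Task:
    task_id: str
    prompt: str                     # input given to the candidate
    check: Callable[[str], bool]    # automated pass/fail check on the output


def evaluate(candidate: Callable[[str], str], tasks: List[Task]) -> float:
    """Return the fraction of tasks whose output passes the automated check."""
    passed = sum(1 for t in tasks if t.check(candidate(t.prompt)))
    return passed / len(tasks)


def compare(cand_a: Callable[[str], str], cand_b: Callable[[str], str], tasks: List[Task]) -> None:
    score_a, score_b = evaluate(cand_a, tasks), evaluate(cand_b, tasks)
    print(f"candidate A: {score_a:.1%}  candidate B: {score_b:.1%}")
    # A sufficiently large offline gap informs which candidate proceeds to A/B testing.
```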

Despite this importance, something is not quite right with the present state of benchmarks for AI applied to software engineering, and indeed this is the opinion I wish to express in this note.

When assessing a benchmark suite, we should ask the following:

- Does the suite represent the software engineering work, including its complexity and diversity, that we want AI to help with in the first place?
- Is there headroom for improvement (else the current AI “aces it”)?
- Is the benchmark contaminated, in that the model training already has seen the answer and might have memorized it?
- Is it simple and inexpensive to run the benchmark suite repeatedly?
- Do we have robust scoring techniques to judge whether the answers produced are good?

Unfortunately, most eval sets fail one or more of these criteria.
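To illustrate the contamination criterion, here is a minimal sketch that filters benchmark tasks by creation date against a model's training cutoff, which is the spirit of the SWE-bench-Live re-curation discussed below. The field names and the cutoff date are assumptions of this sketch, not any benchmark's actual schema.

```python
# Illustrative contamination guard: keep only tasks created after the model's
# training cutoff, so the model cannot have seen the reference solution.
# Field names and the cutoff date are assumptions for this sketch.

from datetime import date

TRAINING_CUTOFF = date(2024, 10, 1)  # hypothetical cutoff of the model under test

def uncontaminated(tasks):
    """Yield tasks whose source issue was created after the training cutoff."""
    for task in tasks:
        if task["created_at"] > TRAINING_CUTOFF:
            yield task

# Example usage with toy data:
tasks = [
    {"id": "repo__issue-101", "created_at": date(2023, 5, 2)},
    {"id": "repo__issue-417", "created_at": date(2025, 1, 14)},
]
print([t["id"] for t in uncontaminated(tasks)])  # -> ['repo__issue-417']
```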

HumanEval, which was popular until recently, was used for a variety of AI-for-code evaluations, but its problems are primarily “coding puzzle”-like Python questions that do not represent the reality of any software engineer’s day-to-day work, except perhaps that of engineers preparing for coding interviews. The benchmark is now saturated, perhaps both due to contamination and because models have become a lot more competent in the past three years.
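For readers who have not looked at the benchmark, a HumanEval-style task pairs a function signature and docstring with hidden unit tests, and a model's completion passes if those tests pass. The example below is a paraphrased illustration in that style, not an actual item from the dataset.

```python
# A paraphrased HumanEval-style task: the model receives the prompt (signature
# plus docstring) and must produce the function body; scoring simply runs
# hidden tests against the completed function. Illustration only.

PROMPT = '''
def running_max(xs: list[int]) -> list[int]:
    """Return a list where element i is the maximum of xs[:i+1]."""
'''

CANDIDATE_COMPLETION = '''
    out, cur = [], None
    for x in xs:
        cur = x if cur is None else max(cur, x)
        out.append(cur)
    return out
'''

def check(prompt: str, completion: str) -> bool:
    namespace: dict = {}
    exec(prompt + completion, namespace)       # define the completed function
    f = namespace["running_max"]
    return f([3, 1, 4, 1, 5]) == [3, 3, 4, 4, 5] and f([2]) == [2]

print(check(PROMPT, CANDIDATE_COMPLETION))     # -> True
```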

SWE-bench is a newer, exciting benchmark about solving issues on GitHub using AI. This benchmark has surged in popularity and has essentially spurred the move towards agentic techniques, where an LLM is not invoked just once, but repeatedly, incrementally gaining information in its context using external tools (such as Devin, OpenHands, or Jules); a sketch of such a loop follows the list below. SWE-bench, however, is also not perfect:

- Its tasks are drawn from a narrow slice of open source: issues from only 12 Python projects.
- In some tasks, the issue description effectively contains the solution.
- The accompanying test cases do not always adequately verify that the underlying problem was fixed.
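As promised above, here is the general shape of such an agentic loop: the model proposes a tool call, the harness executes it and feeds the observation back into the context, and the loop ends when the model declares the issue resolved or the step budget runs out. The `call_llm` function and the tool set are hypothetical placeholders, not the interface of any particular agent.

```python
# Skeleton of an agentic loop for issue resolution, in the spirit of agents
# such as Devin, OpenHands, or Jules. `call_llm` and the tool set are
# hypothetical placeholders; real agents differ in tools, prompts, and stopping rules.

from typing import Callable, Dict, List

def run_agent(
    issue: str,
    tools: Dict[str, Callable[[str], str]],   # e.g. {"search": ..., "edit": ..., "run_tests": ...}
    call_llm: Callable[[List[dict]], dict],   # returns {"tool": name, "arg": str} or {"done": patch}
    max_steps: int = 20,
) -> str | None:
    """Repeatedly invoke the LLM, executing tool calls and accumulating context."""
    context: List[dict] = [{"role": "user", "content": f"Resolve this issue:\n{issue}"}]
    for _ in range(max_steps):
        action = call_llm(context)
        if "done" in action:                  # the model believes the issue is resolved
            return action["done"]             # the final patch
        observation = tools[action["tool"]](action["arg"])
        context.append({"role": "assistant", "content": str(action)})
        context.append({"role": "user", "content": f"Observation:\n{observation}"})
    return None                               # gave up within the step budget
```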

Based on the recently launched SWE-bench-Live, a re-curation of the benchmark from issues created after the training cutoff of the latest frontier LLMs, the performance of LLMs on the new dataset falls far behind their performance on the original, pointing to the possibility of contamination. Given all these observations, it is not so clear whether SWE-bench properly assesses the coding ability of LLMs more broadly.

Moreover, work by Rondon et al. showed that various complexity characteristics of bugs drawn from a bug database at Google were notably different from those of SWE-bench.

HumanEval and SWE-bench have taken hold in the ML community, and yet, as indicated above, neither is necessarily reflective of LLMs’ competence in everyday software engineering tasks. I conjecture that one reason is the difference in the two communities’ points of view. The ML community prefers large-scale, automatically scored benchmarks, as long as there is a “hill climbing” signal to improve LLMs. The business imperative for LLM makers to compete on popular leaderboards can relegate the broader user experience to a secondary concern.

On the other hand, the software engineering community needs benchmarks that closely capture specific product experiences. Because curation is expensive, the scale of these benchmarks is only sufficient to get a reasonable offline signal for the decision at hand (A/B testing is always carried out before a launch). Such benchmarks may also require a complex setup to run, and scoring is sometimes not automated; but these shortcomings can be acceptable at a smaller scale. For exactly these reasons, such benchmarks are not useful to the ML community.

Much is lost due to these different points of view. It is an interesting question as to how these communities could collaborate to bridge the gap between scale and meaningfulness and create evals that work well for both communities.

What’s the path forward? I offer the following challenges (a sketch of rubric-based automated scoring follows the list):

- Work with industry partners to curate benchmarks that represent the complexity and diversity of real software engineering work.
- Establish continuous data curation, so that benchmarks are refreshed as models’ training cutoffs advance.
- Improve automated scoring so that it comes closer to human judgment.
- Fill the gaps in coverage of the software development life cycle, such as code transformation, code review, debugging, and code reasoning.
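On the automated-scoring challenge, one direction is rubric-based scoring by a judge model, which can come closer to human judgment than exact pass/fail checks for tasks such as code review or code reasoning. The sketch below assumes a hypothetical `call_judge_model` callable and an illustrative rubric; it is not the scoring method of any existing benchmark.

```python
# Sketch of rubric-based automated scoring intended to approximate human judgment,
# e.g. for code-review or code-reasoning tasks where unit tests are a poor fit.
# `call_judge_model` is a hypothetical placeholder for any strong LLM endpoint.

from typing import Callable

RUBRIC = """Score the candidate answer from 1 to 5 on each criterion:
1. Correctness: does it address the stated problem?
2. Faithfulness: are all claims supported by the given code/context?
3. Actionability: could a developer act on it without further clarification?
Reply with three integers separated by spaces."""

def rubric_score(task: str, answer: str, call_judge_model: Callable[[str], str]) -> float:
    """Return the mean rubric score (1-5) assigned by the judge model."""
    reply = call_judge_model(f"{RUBRIC}\n\nTask:\n{task}\n\nCandidate answer:\n{answer}")
    scores = [int(tok) for tok in reply.split()[:3]]
    return sum(scores) / len(scores)

# In practice such a judge would be calibrated against a sample of human ratings
# and audited for systematic biases before being trusted at scale.
```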

All of the above have costs associated with them, which is why a community effort might be more approachable than one driven by a single entity.

Satish Chandra is a software engineer at Google. Opinions expressed here are his own.
