MarkTechPost@AI April 24, 06:30
AWS Introduces SWE-PolyBench: A New Open-Source Multilingual Benchmark for Evaluating AI Coding Agents


AWS AI Labs has introduced SWE-PolyBench, a multilingual, repository-level benchmark designed to evaluate AI coding agents. The benchmark spans 21 GitHub repositories across four widely used programming languages (Java, JavaScript, TypeScript, and Python) and comprises 2,110 tasks covering bug fixes, feature implementations, and code refactoring. Unlike previous benchmarks, SWE-PolyBench incorporates real pull requests (PRs) together with their associated test cases, enabling verifiable evaluation. A smaller subset, SWE-PolyBench500, has also been released to support faster experimentation while preserving task and language diversity.

💡 SWE-PolyBench is a multilingual, repository-level benchmark for evaluating the performance of AI coding agents. It covers four programming languages (Java, JavaScript, TypeScript, and Python) and includes 2,110 tasks such as bug fixes, feature implementations, and code refactoring.

✅ SWE-PolyBench adopts an execution-based evaluation pipeline: each task includes a repository snapshot and a problem statement derived from a GitHub issue. The benchmark measures outcomes using two types of unit tests, fail-to-pass (F2P) and pass-to-pass (P2P).

📊 SWE-PolyBench introduces Concrete Syntax Tree (CST)-based metrics, such as file-level and node-level retrieval scores, for finer-grained evaluation of coding agents. Experiments show that agents perform best on Python tasks but struggle on TypeScript, and that task complexity also affects success rates.

Recent advancements in large language models (LLMs) have enabled the development of AI-based coding agents that can generate, modify, and understand software code. However, the evaluation of these systems remains limited, often constrained to synthetic or narrowly scoped benchmarks, primarily in Python. These benchmarks seldom reflect the structural and semantic diversity of real-world codebases, and as a result, many agents overfit to benchmark-specific patterns rather than demonstrating robust, transferable capabilities.

AWS Introduces SWE-PolyBench: A More Comprehensive Evaluation Framework

To address these challenges, AWS AI Labs has introduced SWE-PolyBench, a multilingual, repository-level benchmark designed for execution-based evaluation of AI coding agents. The benchmark spans 21 GitHub repositories across four widely-used programming languages—Java, JavaScript, TypeScript, and Python—comprising 2,110 tasks that include bug fixes, feature implementations, and code refactorings.

Unlike prior benchmarks, SWE-PolyBench incorporates real pull requests (PRs) that close actual issues and include associated test cases, allowing for verifiable evaluation. A smaller, stratified subset—SWE-PolyBench500—has also been released to support quicker experimentation while preserving task and language diversity.
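
For readers who want to experiment with the benchmark, a minimal loading sketch is shown below. It assumes the tasks are published as a Hugging Face dataset; the dataset identifier and field names are illustrative placeholders rather than the official schema (see the Hugging Face and GitHub links at the end of this post for the actual names).

```python
# Minimal sketch: loading SWE-PolyBench tasks for quick experimentation.
# The dataset identifier and field names below are assumptions for illustration only.
from datasets import load_dataset

# Hypothetical identifier for the stratified 500-task subset.
dataset = load_dataset("AmazonScience/SWE-PolyBench_500", split="test")

for task in dataset.select(range(3)):
    # Each task is expected to carry a repository snapshot reference,
    # an issue-derived problem statement, and the associated test cases.
    print(task["instance_id"], task["language"])
```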

Technical Structure and Evaluation Metrics

SWE-PolyBench adopts an execution-based evaluation pipeline. Each task includes a repository snapshot and a problem statement derived from a GitHub issue. The system applies the associated ground-truth patch in a containerized test environment configured for the respective language ecosystem (e.g., Maven for Java, npm for JavaScript/TypeScript). The benchmark then measures outcomes using two types of unit tests: fail-to-pass (F2P) and pass-to-pass (P2P).
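
The pass criterion can be summarized in a few lines of Python. The sketch below is not the SWE-PolyBench harness itself; the function and field names are hypothetical, and the real benchmark executes the tests inside language-specific containers as described above.

```python
# Illustrative sketch of the fail-to-pass / pass-to-pass criterion described above.
def task_resolved(test_results: dict[str, bool],
                  f2p_tests: list[str],
                  p2p_tests: list[str]) -> bool:
    """A task counts as resolved only if every fail-to-pass test now passes
    and no previously passing (pass-to-pass) test regresses."""
    f2p_ok = all(test_results.get(t, False) for t in f2p_tests)
    p2p_ok = all(test_results.get(t, False) for t in p2p_tests)
    return f2p_ok and p2p_ok

# Example: one F2P test still fails after the agent's patch, so the task is not resolved.
results = {"test_fix_issue_123": False, "test_existing_behavior": True}
print(task_resolved(results, ["test_fix_issue_123"], ["test_existing_behavior"]))  # False
```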

To provide a more granular assessment of coding agents, SWE-PolyBench introduces Concrete Syntax Tree (CST)-based metrics. These include both file-level and node-level retrieval scores, assessing the agent’s ability to locate and modify relevant sections of the codebase. These metrics offer insights beyond binary pass/fail outcomes, especially for complex, multi-file modifications.
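
To make the retrieval metrics concrete, the following simplified sketch computes file-level precision and recall by comparing the files an agent modified against those changed by the ground-truth patch. The actual node-level CST metrics are finer-grained, operating on syntax-tree nodes such as classes and functions; this example only illustrates the general idea.

```python
# Rough sketch of a file-level retrieval score: compare the set of files an agent
# touched against the files changed by the ground-truth patch.
def file_retrieval_scores(predicted_files: set[str], gold_files: set[str]) -> tuple[float, float]:
    hits = len(predicted_files & gold_files)
    precision = hits / len(predicted_files) if predicted_files else 0.0
    recall = hits / len(gold_files) if gold_files else 0.0
    return precision, recall

pred = {"src/app.ts", "src/utils.ts"}
gold = {"src/app.ts", "src/routes/user.ts"}
print(file_retrieval_scores(pred, gold))  # (0.5, 0.5)
```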

Empirical Evaluation and Observations

Three open-source coding agents—Aider, SWE-Agent, and Agentless—were adapted for SWE-PolyBench. All used Anthropic’s Claude 3.5 as the underlying model and were modified to handle the multilingual, repository-level requirements of the benchmark.

The evaluation revealed notable differences in performance across languages and task types. For instance, agents performed best on Python tasks (up to 24.1% pass rate) but struggled with TypeScript (as low as 4.7%). Java, despite its higher complexity in terms of average node changes, achieved higher success rates than TypeScript, suggesting that pretraining exposure and syntax familiarity play a critical role in model performance.

Performance also varied with task complexity. Tasks limited to single-function or single-class changes yielded higher success rates (up to 40%), while those requiring mixed or multi-file changes saw a significant drop. Interestingly, high retrieval precision and recall—particularly for file and CST node identification—did not always translate to higher pass rates, indicating that code localization is necessary but insufficient for problem resolution.

Conclusion: Toward Robust Evaluation of AI Coding Agents

SWE-PolyBench presents a robust and nuanced evaluation framework for coding agents, addressing key limitations in existing benchmarks. By supporting multiple programming languages, covering a wider range of task types, and incorporating syntax-aware metrics, it offers a more representative assessment of an agent’s real-world applicability.

The benchmark reveals that while AI agents exhibit promising capabilities, their performance remains inconsistent across languages and tasks. SWE-PolyBench provides a foundation for future research aimed at improving the generalizability, robustness, and reasoning capabilities of AI coding assistants.


Check out the AWS DevOps Blog, Hugging Face – SWE-PolyBench, and GitHub – SWE-PolyBench.


