MarkTechPost@AI · 11 hours ago
TikTok Researchers Introduce SWE-Perf: The First Benchmark for Repository-Level Code Performance Optimization

Researchers from TikTok and collaborating institutions have released SWE-Perf, the first benchmark built specifically to evaluate how well large language models (LLMs) can optimize code performance in real-world repositories. The benchmark targets the shortfall of existing models in repository-level performance optimization: it contains 140 carefully curated instances that deliver measurable, stable performance improvements, each shipped with the full codebase, unit tests, and a Docker environment to ensure reproducibility. SWE-Perf is evaluated under two settings, Oracle (target functions only) and Realistic (the entire codebase), measuring a model's ability to apply patches, preserve functional correctness, and improve runtime performance. Experiments show that although state-of-the-art LLMs perform reasonably in some scenarios, they still fall well short of human-expert optimizations, with agent-based frameworks holding an advantage on complex optimization tasks.

⭐ SWE-Perf is the first benchmark focused on evaluating LLMs' ability to optimize performance at the level of real code repositories, aiming to close the capability gap of existing models in this area and providing a new yardstick for applying LLMs to software engineering.

📦 The SWE-Perf dataset is built from more than 100,000 pull requests across high-impact GitHub repositories and contains 140 verified instances that yield stable performance gains; each instance ships the full pre- and post-optimization codebase, unit tests, and a Docker environment, ensuring rigorous and reproducible evaluation.

⚙️ The benchmark is evaluated in two modes, "Oracle" and "Realistic". The Oracle mode tests a model's ability to optimize localized code, while the Realistic mode simulates a real engineer's workflow, requiring the model to identify and optimize performance bottlenecks across the entire codebase, bringing the task closer to practical use.

📊 Experimental results show that today's top LLMs still trail human experts considerably at code performance optimization; even agent-based frameworks (such as OpenHands) do not reach human-level optimizations, exposing LLMs' limitations in handling cross-file interactions, complex dependencies, and large-scale code optimization.

💡 When optimizing, LLMs tend to focus on low-level code constructs (such as imports and environment setup), whereas human experts target high-level semantic abstractions. Moreover, LLM performance degrades noticeably as the number of target functions grows or in long-runtime scenarios, indicating that scalability and complex optimization strategies remain open challenges.

Introduction

As large language models (LLMs) advance in software engineering tasks—ranging from code generation to bug fixing—performance optimization remains an elusive frontier, especially at the repository level. To bridge this gap, researchers from TikTok and collaborating institutions have introduced SWE-Perf—the first benchmark specifically designed to evaluate the ability of LLMs to optimize code performance in real-world repositories.

Unlike prior benchmarks focused on correctness or function-level efficiency (e.g., SWE-Bench, Mercury, EFFIBench), SWE-Perf captures the complexity and contextual depth of repository-scale performance tuning. It provides a reproducible, quantitative foundation to study and improve the performance optimization capabilities of modern LLMs.

Image source: https://arxiv.org/abs/2507.12415

Why SWE-Perf Is Needed

Real-world codebases are often large, modular, and intricately interdependent. Optimizing them for performance requires understanding of cross-file interactions, execution paths, and computational bottlenecks—challenges beyond the scope of isolated function-level datasets.

LLMs today are largely evaluated on tasks like syntax correction or small function transformations. But in production environments, performance tuning across repositories can yield more substantial system-wide benefits. SWE-Perf is explicitly built to measure LLM capabilities in such settings.

Image source: https://arxiv.org/abs/2507.12415

Dataset Construction

SWE-Perf is constructed from over 100,000 pull requests across high-profile GitHub repositories; the final dataset covers 9 repositories.

To ensure validity, each unit test must:

- Pass before and after the patch.
- Show statistically significant runtime gains over 20 repetitions (Mann-Whitney U test, p < 0.1).

Performance is measured using minimum performance gain (δ), isolating statistical improvements attributable to the patch while filtering noise.
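The sketch below illustrates this validation step under stated assumptions: it times a unit-test command 20 times before and after a patch, applies a one-sided Mann-Whitney U test at p < 0.1, and reports a conservative relative gain. The helper names and the exact formula for the gain are illustrative, not the authors' implementation.

```python
# Illustrative sketch of SWE-Perf's runtime-validation criterion (not the authors' code):
# a test must show a statistically significant speedup over 20 repetitions, and the
# reported gain is a conservative (minimum) estimate that filters out timing noise.

import subprocess
import time

from scipy.stats import mannwhitneyu


def time_test(test_cmd: str, repetitions: int = 20) -> list[float]:
    """Run a unit-test command repeatedly and record wall-clock runtimes."""
    runtimes = []
    for _ in range(repetitions):
        start = time.perf_counter()
        subprocess.run(test_cmd, shell=True, check=True, capture_output=True)
        runtimes.append(time.perf_counter() - start)
    return runtimes


def conservative_gain(before: list[float], after: list[float], alpha: float = 0.1):
    """Return a conservative relative gain, or None if the speedup is not significant."""
    # One-sided test: are pre-patch runtimes stochastically greater than post-patch ones?
    p = mannwhitneyu(before, after, alternative="greater").pvalue
    if p >= alpha:
        return None  # improvement is not statistically significant
    # Conservative estimate: fastest pre-patch run vs. slowest post-patch run.
    delta = (min(before) - max(after)) / min(before)
    return max(delta, 0.0)
```

Here δ is illustrated as the gap between the fastest pre-patch run and the slowest post-patch run; the paper's exact definition of minimum performance gain may differ, but the intent is the same: only improvements that survive a worst-case comparison are credited to the patch.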

Benchmark Settings: Oracle vs. Realistic

SWE-Perf evaluates models under two settings. In the Oracle setting, the model is given the specific target functions to optimize, isolating local optimization ability. In the Realistic setting, the model receives the entire codebase and must identify and optimize performance bottlenecks on its own, mirroring a real engineer's workflow.
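To make the contrast concrete, the snippet below shows a hypothetical task instance in each setting. The field names are assumptions chosen for readability, not the dataset's actual schema.

```python
# Hypothetical illustration of the two settings; field names are assumptions,
# not the dataset's actual schema.

oracle_instance = {
    "repo": "example-org/example-repo",        # full pre-patch codebase (run inside Docker)
    "base_commit": "abc123",
    "target_functions": ["pkg/module.py::slow_fn"],  # Oracle: functions to optimize are given
    "performance_tests": ["tests/test_module.py::test_slow_fn"],
}

realistic_instance = {
    "repo": "example-org/example-repo",
    "base_commit": "abc123",
    # Realistic: no target functions; the model must locate bottlenecks
    # anywhere in the repository before optimizing them.
    "performance_tests": ["tests/test_module.py::test_slow_fn"],
}
```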

Evaluation Metrics

SWE-Perf defines a three-tier evaluation framework, reporting each metric independently:

- Apply: Can the model-generated patch be applied cleanly?
- Correctness: Does the patch preserve functional integrity (all unit tests pass)?
- Performance: Does the patch yield measurable runtime improvement?

The metrics are not aggregated into a single score, allowing more nuanced evaluation of tradeoffs between syntactic correctness and performance gains.
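A minimal sketch of how such an independent, per-tier report might be produced is shown below. The helper names (`run_tests_timed`, `evaluate_patch`) and report fields are illustrative assumptions, not the benchmark's actual harness.

```python
# Sketch of the three-tier evaluation: Apply, Correctness, and Performance are
# reported independently rather than folded into one score.

import subprocess
import time


def run_tests_timed(test_cmd: str, repo_dir: str, repetitions: int = 20):
    """Run the performance tests repeatedly; return (all_passed, runtimes)."""
    runtimes, all_passed = [], True
    for _ in range(repetitions):
        start = time.perf_counter()
        result = subprocess.run(test_cmd, shell=True, cwd=repo_dir, capture_output=True)
        runtimes.append(time.perf_counter() - start)
        all_passed &= (result.returncode == 0)
    return all_passed, runtimes


def evaluate_patch(repo_dir: str, patch_file: str, test_cmd: str,
                   baseline_runtimes: list[float]) -> dict:
    """Report Apply / Correctness / Performance separately (no aggregate score)."""
    report = {"apply": False, "correctness": False, "runtime_gain": None}

    # 1. Apply: the model-generated patch must apply cleanly (patch_file is an absolute path).
    applied = subprocess.run(["git", "-C", repo_dir, "apply", patch_file],
                             capture_output=True)
    if applied.returncode != 0:
        return report
    report["apply"] = True

    # 2. Correctness: all unit tests must still pass after the patch.
    passed, patched_runtimes = run_tests_timed(test_cmd, repo_dir)
    report["correctness"] = passed

    # 3. Performance: relative runtime improvement over the pre-patch baseline,
    #    only meaningful when the patch is correct.
    if passed:
        report["runtime_gain"] = (min(baseline_runtimes) - min(patched_runtimes)) / min(baseline_runtimes)
    return report
```

Keeping the three results separate makes it possible to see, for example, a model that applies cleanly and stays correct but gains nothing, versus one that speeds code up while breaking tests.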

Experimental Results

The benchmark evaluates several top-tier LLMs under both oracle and realistic settings:

| Model | Setting | Performance (%) |
|---|---|---|
| Claude-4-opus | Oracle | 1.28 |
| GPT-4o | Oracle | 0.60 |
| Gemini-2.5-Pro | Oracle | 1.48 |
| Claude-3.7 (Agentless) | Realistic | 0.41 |
| Claude-3.7 (OpenHands) | Realistic | 2.26 |
| Expert (Human Patch) | — | 10.85 |

Notably, even the best-performing LLM configurations fall significantly short of human-level performance. The agent-based method OpenHands, built on Claude-3.7-sonnet, outperforms other configurations in the realistic setting but still lags behind expert-crafted optimizations.

Key Observations

LLM-generated optimizations tend to target low-level code constructs (such as imports and environment setup), whereas human experts operate on higher-level semantic abstractions. Model performance also degrades markedly as the number of target functions grows or when test runtimes are long, pointing to limits in scalability and in composing complex optimization strategies.

Conclusion

SWE-Perf represents a pivotal step toward measuring and improving the performance optimization capabilities of LLMs in realistic software engineering workflows. It uncovers a significant capability gap between existing models and human experts, offering a strong foundation for future research in repository-scale performance tuning. As LLMs evolve, SWE-Perf can serve as a north star guiding them toward practical, production-ready software enhancement at scale.


Check out the Paper, GitHub Page and Project. All credit for this research goes to the researchers of this project.


