AIhub, April 28, 16:40
Copilot Arena: A platform for code LLM evaluation in the wild

Copilot Arena is a VS Code extension designed to evaluate large language models (LLMs) in real-world settings by collecting human preferences directly from developer workflows. It provides developers with code completions, similar to GitHub Copilot. To date, more than 11,000 users have downloaded it, accumulating over 100K completions and 25,000 code completion battles. By having users choose between code completions from different LLMs, Copilot Arena builds a clearer picture of developer preferences and has been used to evaluate new models such as Mistral AI's Codestral and InceptionAI's Mercury Coder. The tool presents paired code completions through a novel interface and uses a sampling strategy that reduces latency while preserving coverage of model comparisons.

🧑‍💻 **User interface and voting**: Copilot Arena shows paired code completions from different LLMs and lets users select the one they prefer, revealing developer preferences between LLMs. Voting is designed to be seamless: users accept a completion with a keyboard shortcut, without interrupting their workflow.

⚡️ **Model pair sampling strategy**: To minimize latency, Copilot Arena uses a sampling strategy that models each LLM's latency as a log-normal distribution and tunes a temperature parameter to interpolate between a latency-optimized distribution and a uniform distribution, cutting the median experienced latency by 33% (from 1.61 s to 1.07 s).

🧩 **Prompting for code completions**: Copilot Arena uses a simple prompting trick that lets chat models perform code completion. The model is instructed to first re-output part of the prefix and to end with part of the suffix; the repeated code is then matched against the input and removed, allowing chat models to complete code successfully.

📊 **Leaderboard and model evaluation**: Model strength is estimated with a Bradley-Terry model. Copilot Arena ranks models by their performance in code completion battles and builds a leaderboard with 95% confidence intervals, giving a clear view of each model's relative strength. The leaderboard is continuously updated and hosted at lmarena.ai.

Figure 1. Copilot Arena is a VSCode extension that collects human preferences of code directly from developers. 

By Wayne Chi, Valerie Chen, Chris Donahue, Ameet Talwalkar

As model capabilities improve, large language models (LLMs) are increasingly integrated into user environments and workflows. In particular, software developers code with LLM-powered tools in integrated development environments such as VS Code, IntelliJ, or Eclipse. While these tools are increasingly used in practice, current LLM evaluations struggle to capture how users interact with these tools in real environments, as they are often limited to short user studies, only consider simple programming tasks as opposed to real-world systems, or rely on web-based platforms removed from development environments.

To address these limitations, we introduce Copilot Arena, an app designed to evaluate LLMs in real-world settings by collecting preferences directly in a developer’s actual workflow. Copilot Arena is a Visual Studio Code extension that provides developers with code completions, akin to the type of support provided by GitHub Copilot. Thus far, over 11,000 users have downloaded Copilot Arena; the tool has served over 100K completions and accumulated over 25,000 code completion battles. The battles form a live leaderboard on the LMArena website. Since its launch, Copilot Arena has also been used to evaluate two new code completion models prior to their release: a new Codestral model from Mistral AI and Mercury Coder from InceptionAI.

In this blog post, we discuss how we designed and deployed Copilot Arena. We also highlight how Copilot Arena provides new insights into developer code preferences.

Copilot Arena system design

To collect user preferences, Copilot Arena presents a novel interface that shows users paired code completions from two different LLMs, which are determined based on a sampling strategy that mitigates latency while preserving coverage across model comparisons. Additionally, we devise a prompting scheme that allows a diverse set of models to perform code completions with high fidelity. Figure 1 gives an overview of this workflow. We describe each component below:

User Interface: Copilot Arena allows users to select between pairs of code completions from different LLMs. User selections allow us to better understand developer preferences between LLMs. To avoid interrupting user workflows, voting is designed to be seamless—users use keyboard shortcuts to quickly accept code completions.   

Sampling model pairs: We explore a sampling strategy to minimize the experienced latency. Since our interface shows two code completions together, the slowest completion determines the latency. We capture each model’s latency as a log-normal distribution and tune a temperature parameter to interpolate between a latency-optimized distribution and a uniform distribution, observing a decrease in median experienced latency by 33% (from 1.61 to 1.07 seconds) compared to a uniform distribution.
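
To make the interpolation concrete, here is a minimal sketch of one way such latency-aware pair sampling could look, assuming per-model latency logs are available and that a pair's latency is determined by the slower model. The function names and the softmax-with-temperature form are illustrative assumptions, not the exact scheme used by Copilot Arena.

```python
"""Minimal sketch of latency-aware pair sampling (illustrative; not the
Copilot Arena implementation). Assumes per-model latency samples are
available and that pair latency is dominated by the slower model."""
import math
from itertools import combinations

import numpy as np


def lognormal_median(latencies_s):
    """Fit a log-normal by log-space moments; its median is exp(mu)."""
    logs = np.log(latencies_s)
    return math.exp(logs.mean())


def pair_weights(model_latencies, temperature):
    """Softmax over negative pair latency; pair latency = slower model's
    median (the interface waits for both completions). A large temperature
    approaches uniform sampling; a small one favors low-latency pairs."""
    medians = {m: lognormal_median(np.array(ls)) for m, ls in model_latencies.items()}
    pairs = list(combinations(medians, 2))
    scores = np.array([-max(medians[a], medians[b]) for a, b in pairs])
    logits = scores / temperature
    logits -= logits.max()  # numerical stability
    probs = np.exp(logits)
    probs /= probs.sum()
    return pairs, probs


def sample_pair(model_latencies, temperature=1.0):
    pairs, probs = pair_weights(model_latencies, temperature)
    return pairs[np.random.choice(len(pairs), p=probs)]


# Example: hypothetical latency logs (seconds) for three models.
observed = {
    "model-a": [0.8, 1.1, 0.9, 1.3],
    "model-b": [2.0, 2.4, 1.9, 2.2],
    "model-c": [1.2, 1.0, 1.4, 1.1],
}
print(sample_pair(observed, temperature=0.5))
```

With a small temperature the sampler concentrates on fast pairs; as the temperature grows, the distribution approaches uniform, recovering full coverage across model comparisons.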

Figure 2: Our simple prompting scheme enables LLMs to perform infilling tasks with much higher success than vanilla prompting.

Prompting for code completions: During development, models need to “fill in the middle”, where code needs to be generated based on both the current prefix and suffix. While some models, such as DeepSeek and Codestral, are designed to fill in the middle, many chat models are not and require additional prompting. To accomplish this, we allow the model to generate code snippets, which is a more natural format, and then post-process them into a FiM completion. Our approach is as follows: in addition to the same prompt templates above, the models are provided with instructions to begin by re-outputting a portion of the prefix and similarly end with a portion of the suffix. We then match portions of the output code in the input and delete the repeated code. This simple prompting trick allows chat models to perform code completions with high success (Figure 2).
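
As an illustration of the post-processing step, the sketch below trims the re-output prefix and suffix from a chat model's response. The overlap-matching heuristic and the names used here are assumptions for illustration, not the exact matching logic in Copilot Arena.

```python
"""Minimal sketch of the post-processing step for chat-model infilling
(illustrative; the exact matching logic in Copilot Arena may differ).
The chat model is asked to re-output part of the prefix, generate the
missing middle, and finish with part of the suffix; we then trim the
overlapping text so only the new middle is inserted at the cursor."""


def longest_overlap(a: str, b: str) -> int:
    """Length of the longest suffix of `a` that is also a prefix of `b`."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def extract_middle(prefix: str, suffix: str, model_output: str) -> str:
    # Trim the re-output portion of the prefix from the start of the output.
    start = longest_overlap(prefix, model_output)
    completion = model_output[start:]
    # Trim the re-output portion of the suffix from the end of the output.
    end = longest_overlap(completion, suffix)
    if end:
        completion = completion[:-end]
    return completion


# Hypothetical example: the cursor sits between `prefix` and `suffix`.
prefix = "def add(a, b):\n    "
suffix = "\n\nprint(add(1, 2))"
model_output = "def add(a, b):\n    return a + b\n\nprint(add(1, 2))"
print(repr(extract_middle(prefix, suffix, model_output)))  # 'return a + b'
```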

Deployment

Figure 3. The Copilot Arena leaderboard is live on lmarena.ai.

We deploy Copilot Arena as a free extension available on the VSCode extension store. During deployment, we log user judgments and latency for model responses, along with the user’s input and completion. Given the sensitive nature of programming, users can restrict our access to their data. Depending on privacy settings, we also collect the user’s code context and model responses.
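
The sketch below shows one way such a privacy-gated log record could be structured; the field names and the single share_code flag are hypothetical simplifications, not the actual Copilot Arena schema or privacy settings.

```python
"""Minimal sketch of a per-battle log record with privacy gating
(illustrative field names; not the actual Copilot Arena schema)."""
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class BattleRecord:
    model_a: str
    model_b: str
    winner: str                      # "a" or "b", from the user's selection
    latency_a_s: float
    latency_b_s: float
    completion_a: Optional[str] = None   # logged only if the user opts in
    completion_b: Optional[str] = None
    code_context: Optional[str] = None   # surrounding code, opt-in only


def to_log_entry(record: BattleRecord, share_code: bool) -> dict:
    """Drop code-bearing fields unless the user's privacy setting allows them."""
    entry = asdict(record)
    if not share_code:
        entry.pop("completion_a")
        entry.pop("completion_b")
        entry.pop("code_context")
    return entry
```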

As is standard in other work on pairwise preference evaluation (e.g., Chatbot Arena), we apply a Bradley-Terry (BT) model to estimate the relative strengths of each model. We bootstrap the battles in the BT calculation to construct a 95% confidence interval for the rankings, which we use to create a leaderboard that ranks all models; each model's rank is determined by which other models' lower bounds fall below its upper bound. We host a live leaderboard of model rankings at lmarena.ai (Figure 3).
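
For readers who want to reproduce this style of ranking, here is a minimal sketch of a Bradley-Terry fit with bootstrapped 95% intervals and the rank rule described above. The MM-style fit, constants, and example battles are illustrative choices, not the exact implementation behind the leaderboard.

```python
"""Minimal sketch of a Bradley-Terry leaderboard with bootstrapped 95%
confidence intervals (illustrative; not the exact leaderboard code)."""
import numpy as np


def fit_bt(battles, models, iters=200):
    """Battles are (winner, loser) pairs; returns BT strengths via the
    classic minorization-maximization updates."""
    idx = {m: i for i, m in enumerate(models)}
    n = len(models)
    wins = np.zeros(n)
    games = np.zeros((n, n))
    for w, l in battles:
        wins[idx[w]] += 1
        games[idx[w], idx[l]] += 1
        games[idx[l], idx[w]] += 1
    p = np.ones(n)
    for _ in range(iters):
        denom = (games / (p[:, None] + p[None, :] + 1e-12)).sum(axis=1)
        p = wins / np.maximum(denom, 1e-12)
        p /= p.sum()
    return dict(zip(models, p))


def bootstrap_intervals(battles, models, rounds=200, seed=0):
    """Resample battles with replacement and report 2.5/97.5 percentiles."""
    rng = np.random.default_rng(seed)
    battles = list(battles)
    samples = []
    for _ in range(rounds):
        draws = rng.integers(0, len(battles), len(battles))
        resampled = [battles[i] for i in draws]
        strengths = fit_bt(resampled, models)
        samples.append([strengths[m] for m in models])
    lo, hi = np.percentile(np.array(samples), [2.5, 97.5], axis=0)
    return {m: (lo[i], hi[i]) for i, m in enumerate(models)}


def ranks(intervals):
    """A model's rank is 1 + the number of models whose lower bound
    exceeds its upper bound (overlapping intervals share a rank)."""
    return {
        m: 1 + sum(lo_o > hi for o, (lo_o, _) in intervals.items() if o != m)
        for m, (_, hi) in intervals.items()
    }


# Example with hypothetical battles (winner, loser):
battles = [("model-a", "model-b"), ("model-a", "model-c"),
           ("model-b", "model-c"), ("model-a", "model-b"),
           ("model-c", "model-b")]
models = ["model-a", "model-b", "model-c"]
print(ranks(bootstrap_intervals(battles, models)))
```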

Findings

Figure 4. Model rankings in Copilot Arena (1st column) differ from existing evaluations, both for static benchmarks (2nd-4th column) and live preference evaluations (last two columns). We also report Spearman’s rank correlation (r) between Copilot Arena and other benchmarks. 

Comparison to prior datasets

We compare our leaderboard to existing evaluations, which encompass both live preference leaderboards with human feedback and static benchmarks (Figure 4). The static benchmarks we compare against are LiveBench, BigCodeBench, and LiveCodeBench, which evaluate models’ code generation abilities on a variety of Python tasks and continue to be maintained with new model releases. We also compare to Chatbot Arena and their coding-specific subset, which are human preferences of chat responses collected through a web platform.

We find a low correlation (r ≤ 0.1) with most static benchmarks, but a relatively higher correlation (Spearman’s rank correlation (r) of 0.62) with Chatbot Arena (coding) and a similar correlation (r = 0.48) with Chatbot Arena (general). The stronger correlation with human preference evaluations compared to static benchmarks likely indicates that human feedback captures distinct aspects of model performance that static benchmarks fail to measure. We notice that smaller models tend to overperform (e.g., GPT-4o mini and Qwen-2.5-Coder 32B), particularly in static benchmarks. We attribute these differences to the unique distribution of data and tasks that Copilot Arena evaluates over, which we explore in more detail next.
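
The comparison itself boils down to a Spearman rank correlation between two leaderboards; the snippet below shows the computation on hypothetical ranks (the numbers are placeholders, not the benchmark results reported above).

```python
"""Tiny sketch of the rank-correlation comparison between two leaderboards
(hypothetical ranks; not the actual benchmark numbers)."""
from scipy.stats import spearmanr

models = ["model-a", "model-b", "model-c", "model-d"]
copilot_arena_rank = [1, 2, 3, 4]        # ranks on Copilot Arena
other_benchmark_rank = [2, 1, 4, 3]      # ranks on a static benchmark

r, p_value = spearmanr(copilot_arena_rank, other_benchmark_rank)
print(f"Spearman r = {r:.2f} (p = {p_value:.2f})")
```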

Figure 5. Copilot Arena data is diverse in programming and natural languages, downstream tasks, and code structures (e.g., context lengths, last-line contexts, and completion structures).

In comparison to prior approaches, evaluating models in real user workflows leads to a diverse data distribution in terms of programming and natural languages, tasks, and code structures (Figure 5).

Insights into user preferences

Conclusion

While Copilot Arena represents a shift in the right direction for LLM evaluation, providing more grounded and realistic evaluations, there is still significant work to be done to fully represent all developer workflows. For example, Copilot Arena could be extended to account for interface differences from production tools like GitHub Copilot, and to tackle privacy considerations that limit data sharing. Despite these constraints, our platform reveals that evaluating coding LLMs in realistic environments yields rankings significantly different from static benchmarks or chat-based evaluations, and highlights the importance of testing AI assistants with real users on real tasks. We’ve open-sourced Copilot Arena to encourage the open-source community to include more nuanced feedback mechanisms, code trajectory metrics, and additional interaction modes.

If you think this blog post is useful for your work, please consider citing it.

@misc{chi2025copilotarenaplatformcode,
title={Copilot Arena: A Platform for Code LLM Evaluation in the Wild},
author={Wayne Chi and Valerie Chen and Anastasios Nikolas Angelopoulos and Wei-Lin Chiang and Aditya Mittal and Naman Jain and Tianjun Zhang and Ion Stoica and Chris Donahue and Ameet Talwalkar},
year={2025},
eprint={2502.09328},
archivePrefix={arXiv},
primaryClass={cs.SE},
url={https://arxiv.org/abs/2502.09328},
}

This article was initially published on the ML@CMU blog and appears here with the author’s permission.
