Survey of Multi-agent LLM Evaluations

This post surveys research on multi-agent LLM evaluations, focusing on their use for identifying potential risks and simulating AI threat models. Through a literature review, the authors analyze 32 relevant papers and find that existing evaluations focus mainly on miscoordination, with comparatively little attention paid to collusion. The post also examines how evaluations connect to common AI threat models (such as CBRN, cybersecurity, and persuasion) and proposes priorities for future research aimed at building a more complete picture of the risks posed by multi-agent systems.

🔎 Through a literature search, the authors identified 32 multi-agent evaluation papers and categorized them by multi-agent failure mode and AI threat model.

📊 Among the identified papers, miscoordination is the most frequently covered failure mode, appearing in 26 papers, while collusion is relatively underrepresented, appearing in only 5 papers.

⚠️ Most AI threat models are not covered by existing multi-agent evaluations; very few evaluations attempt to simulate common threat models such as CBRN or cybersecurity.

🕹️ Existing multi-agent evaluations are concentrated in economic games and party games; for example, some evaluations use simple economic models to test agents' ability to acquire and allocate resources.

🧩 Collusion evaluations are concentrated in board game settings; papers evaluating collusion focus mainly on team-based elimination games such as Avalon and Werewolf.

Published on May 19, 2025 5:42 PM GMT

Disclaimer for LessWrong: This is a project I worked on for a college class. I think it’s useful for people working on evaluations to skim this to get a sense of what types of multi-agent evals exist. Thank you to Xavier Roberts-Gaal for advising me for this project!

Abstract 

LLM-based AI agents have the potential to reshape knowledge work and introduce novel risks while amplifying some existing AI risks. In order to adequately measure risks posed by multi-agent systems, bespoke multi-agent evaluations might be needed. We searched the literature and compiled a list of 32 papers about multi-agent evaluations. We found that 26 out of 32 papers measured miscoordination failure modes, while only 5 papers measured collusion failure modes. We also examined various common AI threat models (CBRN, cybersecurity, persuasion, etc.) and failed to find any multi-agent evaluations for most AI threat models. Informed by our findings, we recommend developing evaluations focused on collusion, as well as evaluations grounded in threat models.

Introduction

Advances in the agentic capabilities of large language models (LLMs) have enabled AI systems to perform software engineering tasks (Jimenez et al., 2024) and match the performance of some expert humans on multi-hour self-contained AI research engineering tasks (Wijk et al., 2024). Large numbers of AI agents may be involved in AI research by 2030, possibly completely automating the AI research process (Kokotajlo et al., 2025; Leibowich et al., 2024). 

Currently, AI evaluations are mostly focused on evaluating single AI systems, commonly through multiple-choice questions (Hendrycks et al., 2021; Rein et al., 2023) or tasking a single AI agent with performing an isolated software engineering task (Jimenez et al., 2024; Rein et al., 2025; Wijk et al., 2024).

If AI agents start collaborating in large numbers to perform economically useful tasks, novel failure modes which cannot be measured by single-agent evaluations could emerge (Hammond et al., 2025). Therefore, if AI developers continue to rely on single-agent evaluations to measure dangerous capabilities, risks unique to multi-agent systems might remain undetected. 

In this paper, we make three contributions:

1. We identify a set of 32 multi-agent LLM evaluation papers and categorize them by multi-agent failure mode and AI threat model.
2. We offer insights on the relative distributions of failure modes and threat models, as well as the types of environments represented in existing multi-agent evaluations.
3. Informed by our findings, we make recommendations for what kinds of evaluations would be most useful to create to inform views on risks from multi-agent systems.

Methodology 

Figure 1: Paper selection pipeline

We used two primary strategies to find multi-agent dangerous capability evaluations:

1. Google Scholar results. We investigated 500 papers found using a keyword search in Google Scholar. We first filtered the papers using OpenAI o1, after which we manually reviewed and categorized the candidate papers. This search yielded a total of 29 papers. We then examined the references in these papers and found two more multi-agent evaluation papers. For more details on the search and categorization process, see Appendix A.
2. Inquiry using professional network. We asked colleagues for examples of multi-agent evaluations in Slack channels focused on AI safety research. This resulted in one more paper being added to the list.

Defining agents and multi-agent evaluations. We define an AI agent as some way to query an LLM (e.g. a Python script which calls an LLM API) combined with a scaffolding program. The scaffolding program needs to enable the LLM-generated completions to affect an environment through a set of available actions (e.g. a Python script that runs any LLM-generated bash commands on a computer) and enable the LLM to observe states in an environment (e.g. a Python script which returns any outputs of executed commands to the LLM API). 

Unique instances of agents (which share environments and have identical scaffolding programs) need to be differentiated through unique identities. These can be names (e.g. Alice and Bob), or roles (“Player 1” and “Player 2”), or something as simple as having message histories which are in separate items in a Python dictionary. 

In practice, most ways in which agents are differentiated are visible in the differences in prompts that are sent to an LLM API. For instance, one agent’s prompt might include the string “Player 1” while another agent’s prompt includes the string “Player 2”, or the message histories of different agents might include totally different information due to differences in which parts of the environment they can observe.
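
To make this concrete, the following sketch (not taken from any of the surveyed papers; the identities, toy environment, and model name are illustrative assumptions) shows two agents that share an environment and an identical scaffolding loop but are differentiated only by their identity strings and by keeping their message histories as separate entries in a Python dictionary.

# Minimal sketch of two differentiated LLM agents sharing an environment.
# The environment, identities, and model name are toy placeholders.
from openai import OpenAI

client = OpenAI()  # assumes an OPENAI_API_KEY environment variable

environment = {"shared_counter": 0}           # trivial shared environment
histories = {"Player 1": [], "Player 2": []}  # separate message histories differentiate the agents

def observe(agent_id: str) -> str:
    """Return the part of the environment this agent is allowed to see."""
    return f"You are {agent_id}. The shared counter is {environment['shared_counter']}. Reply INCREMENT or PASS."

def act(action: str) -> None:
    """Apply an LLM-chosen action to the shared environment."""
    if action.strip().upper().startswith("INCREMENT"):
        environment["shared_counter"] += 1

def step(agent_id: str) -> None:
    """One scaffolding step: observe, query the LLM, record the reply, act."""
    histories[agent_id].append({"role": "user", "content": observe(agent_id)})
    completion = client.chat.completions.create(model="gpt-4o-mini", messages=histories[agent_id])
    reply = completion.choices[0].message.content or ""
    histories[agent_id].append({"role": "assistant", "content": reply})
    act(reply)

for _ in range(3):  # a predefined structure through which the agents interact with the shared environment
    step("Player 1")
    step("Player 2")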

In order to qualify as a multi-agent evaluation, we require the evaluation to have a predefined structure through which multiple differentiated LLM-based agents interact with each other or with a shared environment.

With a working definition of multi-agent evaluations in place, we next describe the taxonomy we used to categorize the types of failure modes these evaluations attempt to capture.

Failure mode taxonomy. To standardize our analysis, we inherited the multi-agent failure mode taxonomy from Hammond et al. (2025), which splits multi-agent failure modes into miscoordination, conflict, and collusion. We use the following definitions:

Figure 2: The multi-agent failure mode taxonomy inherited from Hammond et al. (2025).

In addition to multi-agent failure modes, we also classify evaluations based on the real-world AI threat models they attempt to simulate.

Threat model taxonomy. In order to examine the intersection of multi-agent failure modes with commonly used AI threat models, we classified papers according to a set of threat models often represented in the Frontier AI Safety Policies (Anthropic, 2024; METR, 2025; OpenAI, 2025) of AI developers:

We now turn to the findings from our review and analysis.

Results

Table 1: All 32 multi-agent evaluation papers we identified in our search. The spreadsheet with paper links can be found here.

We found a total of 32 multi-agent evaluation papers in our search, with the full list in Table 1. We classified papers according to failure modes and threat models represented in their environments. 

Figure 3: The number of papers that measure each multi-agent failure mode. Some papers are counted for multiple failure modes.

Representation of multi-agent failure modes. Among the 32 papers we identified, miscoordination is by far the most represented failure mode (26 papers), while collusion is relatively underrepresented (5 papers).

Figure 4: Heatmap of all combinations of analyzed multi-agent failure modes and AI threat models. Some rows add up to more than the total number of papers which cover the failure mode in question because papers can cover multiple threat models at once.

Most AI threat models are not measured by any multi-agent evaluation. In examining the setting of each evaluation, we find that most multi-agent evaluation papers involve environments that don’t attempt to measure any threat model. A few evaluations do feature environments meant to measure skills required for commonly measured AI threat models: three papers measure persuasion, one measures autonomy (Figure 5), and one measures AI R&D capabilities.

Figure 5: VendingBench (Backlund & Petersson, 2025) measures whether AI agents with shared goals are capable of operating a vending machine in a virtual environment, measuring skills relevant to resource acquisition and model autonomy. Therefore, we classify VendingBench as an autonomy evaluation and a miscoordination evaluation.

Conflict evaluations are mostly economic games or party games. Roughly half of the papers we identified cover conflict risk. Some of these papers take the form of simple economic games where each agent is aiming to maximize its payoffs (Figure 6) while other papers are based on party games that also measure collusion.

Figure 6: GovSim (Piatti et al., 2024) measures the ability of AI agents to acquire and keep resources in a set of resource-sharing scenarios. Agents can choose how much to deplete shared resources and negotiate with each other to arrive at sustainable setups. We classify GovSim as a conflict evaluation, as it features multiple AI agents that have mixed objectives and it measures the ability of the agents to avoid bad outcomes (depletion of shared resources).

Collusion evaluations are all based on party games. Out of the five collusion papers we identified, three were focused on team-based elimination games, with two papers implementing Avalon (Figure 7) and one paper implementing Werewolf. The other two papers implemented three team-based word guessing games: Decrypto, Chameleon, and Undercover. In our search, we did not find any multi-agent collusion evaluations that take place outside of a board game setting, or that are set in environments directly related to AI failure modes such as research sabotage (Hebbar, 2025).

Figure 7: AvalonBench (Light et al., 2023) is a benchmark that measures performance in Avalon, a social deduction game where players are split into secret teams and given special roles. We classify AvalonBench as a miscoordination, conflict, collusion, and persuasion evaluation.

Limitations

Non-comprehensive search. Our search procedure, which mostly relies on Google Scholar results, may be incomplete. There may exist relevant papers beyond the 50th page of Google Scholar results, and some relevant papers might not be reachable through Google Scholar at all. For example, in the AI safety field, relevant grey literature often appears in non-indexed and non-traditional outlets (e.g., LessWrong, individual researcher blogs, X threads) alongside traditional academic papers in indexed repositories (e.g., arXiv).

Subjectivity around inclusion criteria. We found some inclusion criteria difficult to apply consistently in practice. For example, many of the papers featuring party games and board games claim to measure deception capabilities. We decided to classify such papers as measuring persuasion, even though the context in which the persuasion happens (a board game) differs from the contexts featured in Frontier Safety Policies, such as country-wide political influence (OpenAI, 2023).

Discussion

Current evaluations lack realism. Many evaluations are based on party games and video games rather than environments and tasks similar to those agents might face in the real world. While such settings are easy to specify and automatically score, they can miss important aspects of real-world threat models, such as the messiness of real-world tasks (Kwa et al., 2025), the complexity of real-world environments, and the effective application of subject-area expertise, which lowers the usefulness of the evaluations for measuring risks from AI.

Additionally, the lack of realism means that some evaluation results do not clearly map onto risk levels. Our primary example of such a mismatch is evaluations based on social deduction games. The capabilities measured in social deduction evaluations (keeping secrets and getting other agents to reveal theirs) can be useful proxies for collusion-related failure modes. However, an agent scoring highly on a social deduction evaluation does not necessarily imply that the agent would collude in real-world scenarios, because social deduction games are a context where players are expected and encouraged to deceive one another. Therefore, while social deduction games can serve as useful evaluations of collusion-related capabilities (could an agent collude against humans?), we think they are less useful for measuring collusion propensities (would an agent collude against humans?).

Some multi-agent failure modes can be measured with single-agent evaluations. Some failure stories for multi-agent failure modes involve prerequisite skills that can be measured using single-agent evaluations. Even relatively simple evaluations, such as multiple-choice tests, can be relevant to multi-agent failure modes if they measure capabilities that are prerequisites for those failure modes (Figure 8).

Figure 8: Subversion Strategy Eval (Mallen et al., 2025) places agents in environments that measure skills related to sabotaging large-scale AI research projects. Despite never featuring multiple agents interacting with each other, Subversion Strategy Eval is still relevant for measuring collusion risks. We don’t include Subversion Strategy Eval in Table 1 because it lacks agent differentiation.

Despite the usefulness of single-agent evals, multi-agent evals are sometimes necessary to measure risks. There exist behaviors and capabilities that are easier to measure with a multi-agent evaluation. Some characteristics of a behavior or capability that make a multi-agent evaluation more suitable include:

Current evaluations are inadequate for measuring risks from substantial AI R&D automation. Recent projections of AI progress predict a large potential for short-term automation of AI research (Altman, 2025; Kokotajlo et al., 2025; Patel, 2023, 2025). If frontier AI developers were to start automating large parts of their research process using AI agents, we think that the current set of publicly-available multi-agent evaluations would not be adequate to measure potential multi-agent risks due to their lack of realism, multi-agent failure mode coverage, and AI threat model coverage.

Recommendation: Create more collusion evaluations. Given that we only found 5 papers covering collusion risks and all of them were based on party games, there exists a need for collusion evaluations grounded in AI threat models. We offer examples of evaluations that could be created to meet this need:

Recommendation: Create more evaluations based on AI threat models. Across all multi-agent failure modes, there exist very few evaluations that are based on AI threat models. We suggest that developers, when deciding to create a multi-agent evaluation, consider which threat models the evaluation would be useful for anticipating, and focus on creating evaluations that are clearly useful for measuring risks from commonly used AI threat models.

Appendix: Methodology details

All data collection and processing was done by a single author.

Initial search. We searched Google Scholar for the following keywords: (“multi-agent” OR “multi agent”) AND (“evaluation” OR “evaluations” OR “eval” OR “evals” OR “benchmark” OR “benchmarks”) AND (“LLM” OR “large language model”). We saved the first 500 papers for further analysis.

Verification of o1 screening. In order to evaluate o1’s ability to accurately categorize papers without false negatives, we manually looked through 30 papers (the first three pages of Google Scholar results) and tasked o1 with flagging relevant papers. We found that o1 had no false negatives (papers marked as relevant by the author but marked as irrelevant by o1).

Full o1 screening. After verifying o1’s ability to correctly flag relevant papers, we used o1 to review all 500 papers from the initial search. This resulted in a list of 85 papers marked as papers to include by o1. We used the following prompt:

Please evaluate this academic paper based on the following criteria. Respond in JSON format only.

Title: [Article title]
Venue: [Publication venue]
URL: [URL]
Abstract: [Article abstract]

Evaluate the paper on these criteria:
1. Does this paper introduce a benchmark or evaluation framework for multi-agent systems?
2. Is this paper specifically about multi-agent systems?
3. Should this paper be included in an analysis of multi-agent system benchmarks?

If the paper should not be included but meets both criteria 1 and 2, provide a reason for exclusion.

Respond with JSON only, using this format:
{{
    "introduces_benchmark": true/false,
    "is_multiagent": true/false,
    "should_include": true/false,
    "exclusion_reason": "string or null"
}}

Your response must be valid JSON. Do not include any other text.
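
For concreteness, here is a minimal sketch of how this screening step could be automated. The post does not include the actual script, so the client setup, model invocation details, and variable names below are illustrative assumptions rather than the author's implementation.

# Illustrative sketch of the o1 screening step (not the author's actual script).
# Assumes the OpenAI Python SDK and an OPENAI_API_KEY environment variable.
import json
from openai import OpenAI

client = OpenAI()

PROMPT_TEMPLATE = """Please evaluate this academic paper based on the following criteria. Respond in JSON format only.
Title: {title}
Venue: {venue}
URL: {url}
Abstract: {abstract}
[... remaining instructions exactly as in the prompt above ...]"""

def screen_paper(paper: dict) -> dict:
    """Ask o1 whether a paper should be included and parse its JSON verdict."""
    prompt = PROMPT_TEMPLATE.format(**paper)
    response = client.chat.completions.create(
        model="o1",
        messages=[{"role": "user", "content": prompt}],
    )
    return json.loads(response.choices[0].message.content)

# Placeholder records; in practice these would come from the saved Google Scholar results.
papers = [{"title": "...", "venue": "...", "url": "...", "abstract": "..."}]
included = [p for p in papers if screen_paper(p).get("should_include")]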

Manual review and categorization of 85 papers. We manually reviewed all 85 papers flagged by o1 to verify that they should be included in the analysis. Our review consisted of reading the abstract of every paper and, in some cases, reading the main text. This narrowed the list down to 29 papers to include.

Additions from professional network. We sent a message asking colleagues for examples of multi-agent evaluations in two Slack channels dedicated to AI safety research. This resulted in one more paper being added to the list.

Checking referenced papers. For some papers that mention other multi-agent evaluations, we manually checked the mentioned papers. This led to two more papers being added to the list.

Categorization. We categorized each paper manually by failure mode and threat model. For papers with multiple environments, we marked the relevant failure mode and threat model as present if it was represented in any of the environments.
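
As an illustration of this aggregation rule (the post does not describe the authors' actual tooling, so the data layout below is a hypothetical sketch), a paper is marked with a failure mode or threat model if any of its environments features it:

# Hypothetical sketch of the per-environment aggregation rule described above.
papers = [
    {
        "title": "Example paper",
        "environments": [
            {"failure_modes": {"miscoordination"}, "threat_models": set()},
            {"failure_modes": {"collusion"}, "threat_models": {"persuasion"}},
        ],
    },
]

def aggregate(paper: dict) -> dict:
    """Mark a failure mode / threat model as present if any environment features it."""
    return {
        "title": paper["title"],
        "failure_modes": set().union(*(env["failure_modes"] for env in paper["environments"])),
        "threat_models": set().union(*(env["threat_models"] for env in paper["environments"])),
    }

# The example paper counts toward miscoordination, collusion, and persuasion.
print(aggregate(papers[0]))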

Bibliography

Altman, S. (2025, January 5). Reflections. Sam Altman. https://blog.samaltman.com/reflections

Anthropic. (2024, October 15). Announcing our updated Responsible Scaling Policy. https://www.anthropic.com/news/announcing-our-updated-responsible-scaling-policy

Backlund, A., & Petersson, L. (2025). Vending-Bench: A Benchmark for Long-Term Coherence of Autonomous Agents (No. arXiv:2502.15840). arXiv. https://doi.org/10.48550/arXiv.2502.15840

Hammond, L., Chan, A., Clifton, J., Hoelscher-Obermaier, J., Khan, A., McLean, E., Smith, C., Barfuss, W., Foerster, J., Gavenčiak, T., Han, T. A., Hughes, E., Kovařík, V., Kulveit, J., Leibo, J. Z., Oesterheld, C., Witt, C. S. de, Shah, N., Wellman, M., … Rahwan, I. (2025). Multi-Agent Risks from Advanced AI (No. arXiv:2502.14143). arXiv. https://doi.org/10.48550/arXiv.2502.14143

Hebbar, V. (2025). How can we solve diffuse threats like research sabotage with AI control? https://www.lesswrong.com/posts/Mf5Hnpi2KcqZdmFDq/how-can-we-solve-diffuse-threats-like-research-sabotage-with

Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D., & Steinhardt, J. (2021). Measuring Massive Multitask Language Understanding (No. arXiv:2009.03300). arXiv. https://doi.org/10.48550/arXiv.2009.03300

Jimenez, C. E., Yang, J., Wettig, A., Yao, S., Pei, K., Press, O., & Narasimhan, K. (2024). SWE-bench: Can Language Models Resolve Real-World GitHub Issues? (No. arXiv:2310.06770). arXiv. https://doi.org/10.48550/arXiv.2310.06770

Kokotajlo, D., Alexander, S., Larsen, T., Lifland, E., & Dean, R. (2025, April 3). AI 2027. https://ai-2027.com/

Kwa, T., West, B., Becker, J., Deng, A., Garcia, K., Hasin, M., Jawhar, S., Kinniment, M., Rush, N., Arx, S. V., Bloom, R., Broadley, T., Du, H., Goodrich, B., Jurkovic, N., Miles, L. H., Nix, S., Lin, T., Parikh, N., … Chan, L. (2025). Measuring AI Ability to Complete Long Tasks (No. arXiv:2503.14499). arXiv. https://doi.org/10.48550/arXiv.2503.14499

Leibowich, J., Jurkovic, N., & Davidson, T. (2024). Could Advanced AI Accelerate the Pace of AI Progress? Interviews with AI Researchers (SSRN Scholarly Paper No. 5115692). Social Science Research Network. https://doi.org/10.2139/ssrn.5115692

Light, J., Cai, M., Shen, S., & Hu, Z. (2023). AvalonBench: Evaluating LLMs Playing the Game of Avalon (No. arXiv:2310.05036). arXiv. https://doi.org/10.48550/arXiv.2310.05036

Mallen, A., Griffin, C., Wagner, M., Abate, A., & Shlegeris, B. (2025). Subversion Strategy Eval: Can language models statelessly strategize to subvert control protocols? (No. arXiv:2412.12480). arXiv. https://doi.org/10.48550/arXiv.2412.12480

METR. (2025). Common Elements of Frontier AI Safety Policies. METR Blog. https://metr.org/blog/2025-03-26-common-elements-of-frontier-ai-safety-policies/

OpenAI. (2023, December 18). Preparedness Framework (Beta). https://cdn.openai.com/openai-preparedness-framework-beta.pdf

OpenAI. (2025, April 15). Our updated Preparedness Framework. https://openai.com/index/updating-our-preparedness-framework/

Patel, D. (2023, August 8). Dario Amodei (Anthropic CEO)—Scaling, Alignment, & AI Progress. https://www.dwarkesh.com/p/dario-amodei

Patel, D. (2025, April 29). Mark Zuckerberg – Meta’s AGI Plan. https://www.dwarkesh.com/p/mark-zuckerberg-2

Piatti, G., Jin, Z., Kleiman-Weiner, M., Schölkopf, B., Sachan, M., & Mihalcea, R. (2024). Cooperate or Collapse: Emergence of Sustainable Cooperation in a Society of LLM Agents (No. arXiv:2404.16698). arXiv. https://doi.org/10.48550/arXiv.2404.16698

Rein, D., Becker, J., Deng, A., Nix, S., Canal, C., O’Connel, D., Arnott, P., Bloom, R., Broadley, T., Garcia, K., Goodrich, B., Hasin, M., Jawhar, S., Kinniment, M., Kwa, T., Lajko, A., Rush, N., Sato, L. J. K., Arx, S. V., … Barnes, E. (2025). HCAST: Human-Calibrated Autonomy Software Tasks (No. arXiv:2503.17354). arXiv. https://doi.org/10.48550/arXiv.2503.17354

Rein, D., Hou, B. L., Stickland, A. C., Petty, J., Pang, R. Y., Dirani, J., Michael, J., & Bowman, S. R. (2023). GPQA: A Graduate-Level Google-Proof Q&A Benchmark (No. arXiv:2311.12022). arXiv. https://doi.org/10.48550/arXiv.2311.12022

Wijk, H., Lin, T., Becker, J., Jawhar, S., Parikh, N., Broadley, T., Chan, L., Chen, M., Clymer, J., Dhyani, J., Ericheva, E., Garcia, K., Goodrich, B., Jurkovic, N., Kinniment, M., Lajko, A., Nix, S., Sato, L., Saunders, W., … Barnes, E. (2024). RE-Bench: Evaluating frontier AI R&D capabilities of language model agents against human experts (No. arXiv:2411.15114). arXiv. https://doi.org/10.48550/arXiv.2411.15114


