少点错误 2024年11月18日
Which AI Safety Benchmark Do We Need Most in 2025?
index_new5.html
../../../zaker_core/zaker_tpl_static/wap/tpl_guoji1.html

 

本文探讨了人工智能(AI)带来的潜在社会风险,并提出了一种评估AI安全基准的框架。该框架基于CeSIA编制的AI风险清单,评估了现有基准在识别这些风险方面的能力,并确定了哪些风险最需要改进的基准。文章重点关注了滥用风险(如自主武器)、系统性风险(如权力集中和失业)、以及对齐和控制风险(如AGI对齐和失去控制),并提出了相应的基准改进建议,旨在帮助AI安全研究人员优先考虑研究方向,最大程度地减少AI对社会的潜在危害。

🤔 **滥用风险:**如自主武器、虚假信息等,当前基准在评估AI在战争环境中的协同作战能力和恶意目标规划方面较为有限,因此需要开发新的多智能体/群体控制基准,并在模拟环境下进行评估。

🏢 **系统性风险:**如权力集中和失业等,现有基准在评估AI对就业市场的影响方面覆盖面不足,需要开发更全面的基准,涵盖更多职业领域,并利用模拟环境或实体系统进行测试。

🤝 **对齐与控制风险:**如AGI对齐和失去控制等,现有基准在评估AI规避人类监督和监控系统的能力方面仍有局限,需要进一步完善,例如扩展Anthropic的“破坏”研究方法,涵盖多轮对话和政治/伦理议题。

🗣️ **推荐AI与舆论风险:**如削弱民主和沉默新闻等,现有基准在评估AI对公众话语的影响方面较为欠缺,需要开发新的自动化基准,例如利用“LLM作为评判者”的方法,评估社交媒体平台是否会系统性地向用户推送特定政治倾向的内容。

⚠️ **潜在风险:**文章指出,AI可能意识到自身正在接受测试,并采取不同的行为,因此,在评估AI安全时需要谨慎,并考虑AI的可解释性问题。

Published on November 17, 2024 11:50 PM GMT

authors: Loïc Cabannes, Liam Ludington

Intro

With the recent invention of AI with human-like abilities across multiple tasks, the possibility of AI radically transforming society for the better has gone from science fiction to a real possibility. Along with this possibility for good comes a possibility for AI to have extremely destabilizing effects. This is precisely why we must think logically about what risks advanced AI poses to society at the moment, what methods we have to deal with these risks, and which methods we are most in need of to prevent harm to society. In our opinion, we still lack a systematic framework for assessing which AI safety benchmarks have the highest potential benefit to society and thus are most worth investing money and research in.

We present a first attempt at such a framework by extending a list of societal risks and their expected harm compiled by the Centre pour la Sécurité de l’IA (CeSIA). We evaluate how well existing benchmarks and safety methods cover these potential risks in order to determine which risks most urgently require good benchmarks, i.e., which risks AI safety researchers should focus on to maximize their impact on society. While our study of benchmarks is by no means comprehensive, and our judgment of their efficacy is subjective, we hope that this framework is of use to the AI safety community for prioritizing the use of their time.

Methodology

We begin by taking a list of potential risks of AI compiled by CeSIA and their (rough) probability of occurrence. We then take the median risk case, given that the risk occurs, estimate the severity of that risk, and multiply it by the probability of risk occurrence to obtain the expectation of severity.

We then assess the ability of current benchmarking methods to identify AI systems that could present these risks on a scale from 0 to 10, which we then use to calculate a value representing the potential benefit to humanity by creating a benchmark that eliminates this type of risk.

By prioritizing benchmarks that could stop an AI that presents a risk in an area with a high potential benefit value, researchers can make the most use of their time.

RisksProbabilityMedian caseSeverityE[severity]BenchmarksCoveragenew Benchmark Need
Misuses   17,0  112
Autonomous weapons80Localized use in conflict zones, causing civilian casualties, drones, robocop like dogs20%16,0FTR benchmark, Anthropic sabotage3112
Misinformation8530% of online content is AI-generated misinformation20%17,0Truthful QA, Macchiavelli, Anthropic model persuasiveness, HaluEval834
Systemic   22,5  130
Power concentration65Tech giants controlling AI become more powerful than most nations20%13,0Unassessable0130
Unemployment5025% of jobs automated, leading to economic restructuring and social unrest20%10,0SWEBench, The AI Scientist280
Deterioration of epistemology60Difficulty distinguishing truth from AI-generated falsehoods30%18,0HaluEval836
Vulnerable world25AI lowers barrier for creating weapons of mass destruction90%22,5WMDP845
S-Risks5AI creates suffering on massive scale due to misaligned objectives200%10,0Harmbench, ETHICS640
Alignment of AGI   30,0  90
Successor species50Highly capable AI systems perform most cognitive tasks, humans are deprecated30%15,0MMLU, Sabotage, The AI Scientist, SWEBench745
Loss of control - à la Critch60Humans become gradually disempowered in decision-making, and are asphyxiated50%30,0Anthropic sabotage790
Recommandation AI   22,5 0225
Weakening Democracy50AI-driven microtargeting and manipulation reduce electoral integrity20%10,0Anthropic model persuasiveness460
Mute News75AI filters create personalized echo chambers, reducing exposure to diverse views30%22,5No existing method0225

In our full methodology we evaluate more than 20 potential risk areas. This table shows the risk areas with the highest expected severity and the highest benefit of improving benchmarking. We proceed by discussing the potential risk areas with benefit greater than 50, breaking each area down into the specific risks posed by AI in this area, the existing benchmarks addressing these risks, and the benchmarks we propose to better evaluate how much risk an AI system poses in this area.

Misuse Risks

Autonomous Weapons

Current benchmarks linked to the use of AI as autonomous weapons such as the FTR benchmark or Anthropic’s Sabotage Report remain limited. The former measures the capability of embodied models to navigate uneven terrains while the latter measures the model’s ability to achieve nefarious goals even under human oversight. However, no benchmark currently measures a model’s capability to jointly operate in warfare-like environments nor its ability to plan to achieve nefarious goals.

That is why to assess these capabilities, we propose the creation of a single/multi-agent/swarm control benchmark with military-like objectives in a simulated environment under various levels of oversight.

Systemic Risks

Power Concentration

Power concentration is intrinsically an overview of diversity (or lack thereof) among the biggest actors in AI at a certain time. Therefore, to measure power concentration one might want to keep track of the number of different companies which are manufacturing the k best performing models as measured by a variety of other widely used benchmarks like chatbot arena or MMLU.

Unemployment

Although benchmarks like SWE Bench and The AI Scientist attempt at evaluating the capability of models at completing real world tasks, they only cover 2 occupations and do not accurately represent the model’s capacity at solving the majority of society's various occupations.

Therefore we highlight the need for a new and more comprehensive benchmark which would take tasks from a wider variety of occupations, including tasks in the real world through simulated environments or embodied systems.

Alignment of AGI

Loss of Control

Loss of control is one of the most serious risks related to the development of AI as it intrinsically represents a point of no return.

Anthropic’s “sabotage” paper takes on the task of measuring the capability of language models to circumvent human supervision and monitoring systems.

Recommendation AI

Weakening Democracy

Very few attempts have been made at measuring AI’s impact on public discourse, both through AI-driven recommendation algorithms and news generation bots enabled by the advance of language models.

The “persuasiveness of language models” report published by Anthropic represents a first attempt at measuring this impact.

Although they have found language models to be quite apt at persuading humans, we believe these results actually underestimate the actual capacity of current models. Indeed, their evaluation remains limited to single turn exchanges and avoids all political issues. We believe it also doesn’t exhibit the model’s capacities to their fullest extent.

That is why, in order to obtain a more realistic upper bound on persuasion capabilities, we propose extending their methodology to:

    Multi turn exchanges, which is more representative of typical argumentative scenarios.Encouraging the model to use false information in its argumentation to further exhibit its capabilities while also being more representative of online discourse which is not always grounded in reality.Measuring persuasion on political and ethical issues is highly relevant to the evaluation of AI’s potential impact on public discourse and therefore, to the well-being of our democracy.

Mute News

Although the concept of online echo-chambers is somewhat well known, very little research has been made to systematically measure it.

We propose the creation of an automated benchmark using the “LLM as a judge” methodology to assess  the tendency of various social media platforms to systematically promote the content of a certain political side to users based on their posts and past interactions with the platform.

Conclusion

As we have seen, the potential risks of AI are multitudinous and varied, while the existing safety benchmarks are quite limited in their scope and assumptions. Perhaps the biggest caveat to our evaluation, which we of course cannot rule out from an AGI, is the ability for an AI to realize it is being tested and act differently when conscious of this, thus assuring us of its safety while secretly harboring maleficent abilities. Given our current understanding of AI interpretability, it remains impossible for us to reliably probe the inner thoughts of an AI system.

Another important factor to consider is that benchmarks are useful insofar as they are being used. That is why legislators should consider enforcing a certain level of safety benchmarking on model manufacturers to limit the possibility of unforeseen capabilities in AI models released to the public. Benchmarks are only useful if we can make AI leaders like OpenAI, Meta, Google, etc… use them.



Discuss

Fish AI Reader

Fish AI Reader

AI辅助创作,多种专业模板,深度分析,高质量内容生成。从观点提取到深度思考,FishAI为您提供全方位的创作支持。新版本引入自定义参数,让您的创作更加个性化和精准。

FishAI

FishAI

鱼阅,AI 时代的下一个智能信息助手,助你摆脱信息焦虑

联系邮箱 441953276@qq.com

相关标签

人工智能安全 AI风险 基准评估 社会影响 AGI
相关文章