How to evaluate control measures for LLM agents? A trajectory from today to superintelligence

This post discusses how AI developers should adapt control evaluation methodologies as the capabilities of large language models (LLMs) increase. It proposes an AI control framework built around five models at different capability levels, defines AI Control Levels (ACLs), and maps each model's threat-model-specific capabilities to an appropriate control evaluation methodology. The framework is intended to help developers systematically adapt safety measures as AI systems advance in capability. The post emphasizes that AI systems with capabilities far beyond those of humans will require deeper research and technical breakthroughs.

🧠 The post proposes an AI control framework that defines distinct AI Control Levels (ACLs) based on an AI system's threat models. The framework is built around five fictional models (M1-M5), whose ACLs range from incapable of competent scheming (ACL-0) to scheming capabilities vastly above those of humans (ACL-5).

🛡️ The authors argue that AI developers need to design control evaluations proportional to an AI system's actual capabilities. For example, because an ACL-4 model is capable of latent reasoning, the red team needs to plan attack strategies upfront and finetune the model to execute attacks in order to account for the associated risks.

💡 The post points out that for AI systems with capabilities vastly above those of humans (ACL-5), existing blackbox AI control methods may not be sufficient to ensure safety, and deeper research is needed. This includes exploring safety cases based on whitebox methods and alignment training to address these higher levels of risk.

🔬 The post highlights several key areas for future research, including mapping model capabilities to control evaluations, developing realistic control evaluations, and designing control protocols that balance the usefulness and safety of AI systems. It also mentions recent progress on AI control from UK AISI and Google DeepMind.

Published on April 14, 2025 4:45 PM GMT

This is a linkpost accompanying a new paper by UK AI Security Institute, Apollo Research and Redwood Research. Please see the full paper for more details.

TLDR: Our new paper outlines how AI developers should adapt the methodology used in control evaluations as capabilities of LLM agents increase.

Figure: We sketch a trajectory of how control evaluations might evolve through increasingly powerful capability profiles. The trajectory is based on a sequence of five fictional models (M1-M5) with progressively advanced capabilities. For each model, we define an AI Control Level (ACL) based primarily on its threat model-specific capabilities. By mapping these capabilities to appropriate control evaluation methodologies, we provide a systematic approach for adapting safety measures to AI systems as they advance in capabilities.

What are the advantages of AI control? 

The field of AI control attempts to design and evaluate control protocols: safety measures that are robust to intentional subversion by AI systems. Elements of promising protocols vary widely, from secure sandboxing to AI monitors and human auditing.
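To make the shape of such a protocol concrete, below is a minimal sketch, in Python, of a trusted-monitoring-style control step: an untrusted agent proposes an action, a weaker trusted model scores how suspicious it looks, and high-suspicion actions are escalated to a human auditor before being executed in a sandbox. The function names and threshold are illustrative assumptions, not APIs or parameters from the paper.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ProposedAction:
    """An action proposed by the untrusted agent, e.g. a shell command or code edit."""
    description: str
    suspicion: float = 0.0


def trusted_monitoring_step(
    untrusted_agent: Callable[[str], ProposedAction],
    trusted_monitor: Callable[[str, ProposedAction], float],
    human_audit: Callable[[str, ProposedAction], bool],
    execute_in_sandbox: Callable[[ProposedAction], str],
    task: str,
    audit_threshold: float = 0.8,  # illustrative value, not from the paper
) -> Optional[str]:
    """Run one agent step under a simple control protocol.

    The untrusted agent proposes an action, a weaker trusted model scores how
    suspicious it looks, and high-suspicion actions go to a human auditor
    before being executed inside a secure sandbox.
    """
    action = untrusted_agent(task)
    action.suspicion = trusted_monitor(task, action)
    if action.suspicion >= audit_threshold:
        if not human_audit(task, action):
            return None  # blocked: treated as attempted subversion
    return execute_in_sandbox(action)
```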

A particularly attractive feature of the AI control approach is that it offers the potential to mitigate alignment risks without fundamental research breakthroughs. A major focus of the field is the engineering challenge of designing protocols that remain effective as increasingly capable autonomous agents are deployed.

How can AI control techniques scale to increasingly capable systems? 

While the academic field of AI control is very young, with a key foundational paper published only in 2023, many components of control evaluations (e.g. red teaming) and control measures (e.g. human approval) are already used to address the risks posed by existing AI systems.

As AI systems become more capable of causing harm autonomously, AI developers will need increasingly sophisticated control measures and evaluations to justify that the models can be deployed securely. Our framework provides one possible roadmap for this scaling, proposing five distinct AI Control Levels (ACLs), ranging from models incapable of competent scheming (ACL-0) to models with scheming capabilities vastly above those of humans (ACL-5). Rather than assuming that agents will always execute the best subversion strategies known to humans, we demonstrate that knowledge of an agent's actual capability profile can inform proportional control evaluations, resulting in more practical and cost-effective control measures. For each ACL we provide example rules for control evaluations, control measures, and safety cases that could be appropriate. For instance, at ACL-4, models in our example become capable of latent reasoning, which means they no longer rely on chains-of-thought that we could monitor. To account for this, the red team should be allowed to reason about attack strategies upfront and to finetune their models to execute any attacks.

Figure: The defining capabilities of ACL-4, the example rules for control evaluations they imply, and an example of control measures that might pass such a control evaluation. Differences from ACL-3 are highlighted.
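To illustrate how capability profiles might map to evaluation rules in practice, the sketch below represents an ACL as a simple record linking a capability profile to red-team affordances and candidate control measures, using ACL-4 as the example. The field names and entries are an illustrative paraphrase of the discussion above, not the paper's exact rules.

```python
from dataclasses import dataclass


@dataclass
class ControlLevel:
    """One AI Control Level: a capability profile mapped to evaluation rules."""
    acl: int
    capability_profile: str              # threat-model-specific capabilities
    red_team_affordances: list[str]      # what the red team may do in the evaluation
    example_control_measures: list[str]  # measures that might pass that evaluation


# Illustrative entry for ACL-4, paraphrasing the discussion above.
ACL_4 = ControlLevel(
    acl=4,
    capability_profile="latent reasoning; no longer relies on monitorable chains-of-thought",
    red_team_affordances=[
        "reason about attack strategies upfront, before the evaluation",
        "finetune the model to execute the chosen attacks",
    ],
    example_control_measures=[
        "monitoring of actions rather than chains-of-thought",
        "secure sandboxing and human auditing of high-stakes actions",
    ],
)
```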

Finally, we argue that control measures effective for the deployment of models with scheming capabilities vastly above those of humans (ACL-5) will require fundamental research breakthroughs. This suggests the need to move beyond blackbox AI control and to explore safety cases based on whitebox methods and alignment training.

More research is needed 

Our paper provides an early sketch of how AI control measures can scale to increasingly powerful systems, but more research is needed, particularly in mapping model capabilities to control evaluations, developing realistic control evaluations, and designing control protocols that balance the usefulness and safety of AI systems.

The field of AI control is progressing quickly. Google DeepMind has already outlined a commitment to develop evaluations focused on intentional scheming capabilities. Recent contributions of UK AISI to this agenda include:

You can be part of this work: consider applying to our Control team! We’re hiring research scientists and research engineers.



Discuss
