An overview of control measures

 

This article explores how to control AI so that it cannot cause existential problems, even if the AI attempts to subvert our countermeasures. It focuses on control measures for reducing existential risk from misalignment, including software infrastructure, security measures, and other control measures. The author argues that establishing a process for determining which methods are worthwhile is more effective than simply implementing specific measures. The article also details important software infrastructure, such as input/output logging, permission management and sandboxing, scaffolding that supports resampling/pausing/blocking, and output tracking, all of which are essential for controlling AI.

💡**Software infrastructure:** The article stresses the importance of building software infrastructure, particularly for internally used AI. This includes comprehensive logging of inputs and outputs so that behavior is traceable, and strong permission management and sandboxing to limit the AI's access and prevent misuse of resources.

🔒**The importance of security measures:** The author notes that a high level of security is important for controlling AI, while acknowledging that achieving high security is expensive. The article suggests that in a company that has already reached a given security level, the additional security improvements needed specifically for control can be relatively small, whereas in lower-security environments, strengthening security may be crucial.

⚙️**Core control measures:** The core of the article emphasizes measures such as monitoring, auditing, and tracking AIs, and blocking or resampling suspicious outputs. These measures are key to addressing the potential risks posed by AI.

📊**Output tracking and job/experiment tracking:** The article emphasizes the need for a system that associates AI outputs (such as fine-tuned model weights or other digital artifacts) with the process that produced them. It is also necessary to track the jobs the AI runs, in order to get a full picture of resource usage, permissions, and associated trajectories.

Published on March 24, 2025 11:16 PM GMT

We often talk about ensuring control, which in the context of this doc refers to preventing AIs from being able to cause existential problems, even if the AIs attempt to subvert our countermeasures. To better contextualize control, I think it's useful to discuss the main countermeasures (building on my prior discussion of the main threats). I'll focus my discussion on countermeasures for reducing existential risk from misalignment, though some of these countermeasures will help with other problems such as human power grabs.

I'll overview control measures that seem important to me, including software infrastructure, security measures, and other control measures (which often involve usage of AI or human labor). In addition to the core control measures (which are focused on monitoring, auditing, and tracking AIs and on blocking and/or resampling suspicious outputs), there is a long tail of potentially helpful control measures, including things like better control of AI capabilities and countermeasures for specific issues (like steganography, exploration hacking, and systematic pro-AI bias).

Background and categorization

Control evaluation methodologies are probably more important to develop right now, but better understanding the space of available control measures is still helpful. Broadly speaking, I think the ideal approach would mostly operate procedurally: doing adversarial analysis of what problems our AIs could cause, and then using control evaluations and other methods to determine measures (and the overall control protocol). That is, rather than approaching control from the perspective of "we should implement XYZ measures", I think it would generally be more effective to establish a process for determining which methods are worthwhile, both via testbed/benchmark-driven development and via abstract threat modeling and analysis. As AIs get more powerful, developing methods by iterating within testbeds looks increasingly good, so I correspondingly think that technique development now should be substantially focused on preparing for a time when iteration is more valuable by: developing methodologies, making conceptual progress, and understanding which categories of techniques are important.
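To make the procedural approach concrete, here is a toy sketch of comparing candidate control protocols in a testbed. The interface (`protocol.run`, `.caused_catastrophe`, `.solved_task`) and the thresholds are assumptions for illustration, not an existing API:

```python
def control_evaluation(protocol, tasks, attack_policy, benign_policy, n_trials=100):
    """Estimate (safety, usefulness) for one candidate control protocol.

    `protocol.run(task, policy)` is a hypothetical testbed interface returning a
    result with `.caused_catastrophe` and `.solved_task` booleans.
    """
    trials = tasks[:n_trials]
    catastrophes = sum(protocol.run(t, attack_policy).caused_catastrophe for t in trials)
    solved = sum(protocol.run(t, benign_policy).solved_task for t in trials)
    return 1 - catastrophes / len(trials), solved / len(trials)


def choose_protocol(candidates, tasks, attack_policy, benign_policy, min_usefulness=0.8):
    """Pick the safest candidate protocol that stays acceptably useful."""
    best = None
    for protocol in candidates:
        safety, usefulness = control_evaluation(protocol, tasks, attack_policy, benign_policy)
        if usefulness >= min_usefulness and (best is None or safety > best[0]):
            best = (safety, protocol)
    return best[1] if best else None
```

The point of the sketch is that the protocol as a whole is the unit being selected, rather than any single measure.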

However, I still think it is somewhat useful to list the most promising and important seeming measures at a high level because:

I'll focus on what a reasonably good version of control would look like, but note that this would require organizational buy-in and implementation. In a lower effort regime with less political buy-in (more like ten people on the inside), it would make sense to focus on a smaller subset of measures.

I find it useful to break down control measures into a few categories:

- Non-ML software infrastructure
- AI and human components
- Security improvements

Within the category of software infrastructure, I'm just including "traditional software", so not things like AI-based monitoring systems or adversarial evaluation (which is done by humans or AIs). Correspondingly, many of the most important methods live within the "AI and human components" section.

While this isn't the only possible breakdown, I think this breakdown is useful because it roughly corresponds to different fields that already exist: general software infrastructure, AI/ML, and security. More generally, I think methods within these areas have commonalities in the skills needed to work on them and in what needs to happen for a method to be implemented effectively (e.g., for software infrastructure, most of the difficulty will be in implementing it at each AI company, while the difficulty for AI and human components may look more like traditional ML research focused on figuring out which of several simple methods is best).

Software infrastructure, AI or human components, and security improvements also have somewhat different properties in terms of how difficult they are to implement and how much they will happen by default.

Getting to a generally high level of security is very expensive; however, the expensive traditional security measures we most want are also well motivated by concerns about human actors (both insider threats and outside attackers). So, if the company has already reached (e.g.) SL5 security, then the marginal security improvements I'd want specifically for control are relatively small. That said, I currently think it is likely we'll be trying to ensure control in regimes where security is substantially weaker than SL5, and it seems plausible that getting (potentially misaligned) AIs to do huge amounts of work to improve security will be important.

The software infrastructure we want for control (of internal usage) includes things that are somewhat useful for other purposes, like making it easier to prevent AI agents used internally from causing annoying problems and helping humans have an easier time understanding what AI agents are doing so they can point out mistakes. This implies that some of the infrastructure might be developed by default, though I do worry that it won't be applied as universally as we'd want. It also implies that people working on control should plausibly be somewhat wary of capabilities externalities (to the extent they think advancing capabilities at their organization is bad), though I think the externalities don't seem that costly and small mitigations might alleviate this concern.

Non-ML software infrastructure

By non-ML software infrastructure, I mean components which are just relatively traditional software (though this software might interact with some ML components). Thus, this category does not include measures that rely on human auditors or machine learning. Correspondingly, most of the "core" control measures focused on in earlier research (e.g. trusted and untrusted monitoring, adversarial evaluation) don't fall under "software infrastructure" and instead are under "AI and human components".

Software infrastructure measures should likely focus on control for internal usage. Or at least on AI usage for particularly sensitive applications. It may be difficult to implement control for broad swaths of usage if AI is provided via a low-level API (rather than a higher level API which handles scaffolding and is more similar to employing a contractor).

The most important software infrastructure

I'll start with a list of the most important types of software infrastructure:

- Comprehensive logging of inputs and outputs, so AI behavior is traceable after the fact.
- Permission management and sandboxing, to limit what each AI agent can access and to prevent misuse of resources.
- Scaffolding that supports resampling, pausing, and blocking suspicious outputs.
- Output tracking, associating AI outputs (e.g., fine-tuned model weights or other digital artifacts) with the process that produced them.
- Job/experiment tracking, so we can see each job's resource usage, permissions, and associated trajectories.
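As a rough illustration of how the first three items fit together, here is a minimal sketch of a scaffolding layer; all class and function names here are hypothetical rather than an existing library. It writes every input/output pair to an append-only log, checks permissions before a tool is used, and can resample or block a step that a monitor flags as suspicious:

```python
import json
import time
import uuid
from dataclasses import dataclass, field


@dataclass
class ActionRecord:
    """One logged step: what the agent saw, what it produced, and what we decided."""
    agent_id: str
    step_input: str
    step_output: str
    decision: str  # "allow", "block", or "resample"
    timestamp: float = field(default_factory=time.time)
    record_id: str = field(default_factory=lambda: str(uuid.uuid4()))


class ControlledScaffold:
    """Wraps model calls with logging, permission checks, and block/resample hooks."""

    def __init__(self, model, monitor, permissions, log_path,
                 block_threshold=0.9, resample_threshold=0.5):
        self.model = model              # callable: prompt -> proposed action (string)
        self.monitor = monitor          # callable: (prompt, action) -> suspicion in [0, 1]
        self.permissions = permissions  # set of tool names this agent may use
        self.log_path = log_path
        self.block_threshold = block_threshold
        self.resample_threshold = resample_threshold

    def _log(self, record):
        # Append-only log so every input/output pair is traceable after the fact.
        with open(self.log_path, "a") as f:
            f.write(json.dumps(record.__dict__) + "\n")

    def step(self, agent_id, prompt, tool_name, max_resamples=3):
        # Permission/sandbox check: only whitelisted tools are reachable.
        if tool_name not in self.permissions:
            self._log(ActionRecord(agent_id, prompt, "", "block"))
            return None
        for _ in range(max_resamples):
            action = self.model(prompt)
            suspicion = self.monitor(prompt, action)
            if suspicion >= self.block_threshold:
                # Block/pause: hold the step for (human or AI) audit instead of executing.
                self._log(ActionRecord(agent_id, prompt, action, "block"))
                return None
            if suspicion >= self.resample_threshold:
                # Resample: discard this output and ask the model again.
                self._log(ActionRecord(agent_id, prompt, action, "resample"))
                continue
            self._log(ActionRecord(agent_id, prompt, action, "allow"))
            return action
        # All resamples looked suspicious; treat the step as blocked.
        self._log(ActionRecord(agent_id, prompt, "", "block"))
        return None
```

In practice, the thresholds (and whether to resample, pause, or block) would be chosen via control evaluations rather than set by hand.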

Less important software infrastructure

Beyond this, there are a variety of software infrastructure components which are less essential, but which could still be quite helpful:

AI and human components

Now I'll talk about other measures which mostly involve utilizing humans or AIs (rather than being software infrastructure or security measures). This section includes the "core" control measures that Redwood has studied in earlier work. I'll focus mostly on black box (non-internals) measures.

Many of the measures we're considering will be applicable to multiple threat models, but they will often specifically be grouped into a few approximate clusters, most centrally:

- Measures targeting concentrated issues, where a small number of actions could cause a catastrophe before anyone reviews them.
- Measures targeting diffuse issues, where harm accumulates across many actions and individual actions can typically be reverted.

For the diffuse issues cluster, it is typically fine for review to happen asynchronously, potentially even after the action has been executed (or the code has been added to the code base, etc.), as reverting the problematic action may be basically fine. For concentrated issues, blocking problematic-seeming actions until they are sufficiently reviewed is more likely to be important. Correspondingly, as far as human review goes, we'd want some type of asynchronous review process plus a lower-latency process for looking at suspicious actions that are being blocked.
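A small sketch of that routing logic (names and thresholds are illustrative assumptions): actions relevant to diffuse threat models execute immediately and get queued for asynchronous review, while suspicious actions relevant to concentrated threat models are held for a lower-latency blocking review.

```python
from queue import Queue

async_review_queue = Queue()  # drained later by human or AI reviewers


def handle_action(action, threat_cluster, suspicion, execute, revert,
                  blocking_review, async_threshold=0.3, block_threshold=0.3):
    """Route one action based on which threat cluster it is most relevant to.

    threat_cluster: "diffuse" or "concentrated".
    blocking_review: callable returning True iff a reviewer approves the action.
    execute / revert: callables that run or undo the action.
    """
    if threat_cluster == "diffuse":
        # Reverting later is usually acceptable, so avoid adding latency here.
        execute(action)
        if suspicion >= async_threshold:
            async_review_queue.put((action, revert))  # reviewed after the fact
        return "executed"

    # Concentrated issues: a single unreviewed action could be catastrophic,
    # so block suspicious actions until an explicit approval arrives.
    if suspicion >= block_threshold and not blocking_review(action):
        return "blocked"
    execute(action)
    return "executed"
```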

I'll loosely group these measures into a few categories:

- Core measures
- Other less-core measures
- More speculative measures
- Specialized countermeasures

This grouping isn't intended to be fully precise, but the list of measures would otherwise be a long unstructured list and I think adding some structure helps with comprehension.

Core measures

This list includes central measures that we very likely want. Or at least, we'll want some version of these measures.

- Trusted and untrusted monitoring of AI actions and outputs.
- Human auditing of the most suspicious outputs.
- Tracking AIs and their outputs over time.
- Blocking and/or resampling suspicious outputs.
- Adversarial evaluation.
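As one simplified example of how monitoring, auditing, and fallback measures can compose, here is a hedged sketch of a trusted-monitoring pipeline; the names and thresholds are illustrative rather than the exact protocols from prior control work. A trusted monitor scores each untrusted output, the most suspicious slice goes to human audit, and moderately suspicious outputs are replaced with a trusted model's own attempt.

```python
def run_with_trusted_monitoring(task, untrusted_model, trusted_model, trusted_monitor,
                                audit_queue, audit_threshold=0.98, defer_threshold=0.8):
    """Produce an output for `task` while applying trusted monitoring.

    untrusted_model / trusted_model: callables mapping a task to an output.
    trusted_monitor: callable mapping (task, output) to a suspicion score in [0, 1].
    audit_queue: list collecting items held for human review.
    """
    candidate = untrusted_model(task)
    suspicion = trusted_monitor(task, candidate)

    if suspicion >= audit_threshold:
        # Most suspicious slice: hold for human auditing before any use.
        audit_queue.append((task, candidate, suspicion))
        return None
    if suspicion >= defer_threshold:
        # Moderately suspicious: fall back to the trusted (weaker) model's output,
        # trading some usefulness for safety.
        return trusted_model(task)
    return candidate
```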

Other less-core measures

These are relatively general measures that seem useful, but aren't as central or as essential.

More speculative measures

There are a variety of measures which aren't clearly viable, but which might be quite useful if they can be made to work.

Specialized countermeasures

For some specific concerns, it makes sense to develop relatively specific countermeasures. While it isn't necessarily predictable in advance which specialized countermeasures will be most relevant, it seems relatively clear that some specialized countermeasures are worthwhile.

Other directions and what I'm not discussing

In addition to these object-level measures, testbeds for refining and/or training our measures will be important.

While I discuss specific measures, my sense is that better general understanding of the dynamics around collusion, sandbagging, AI takeover strategizing capabilities, and bypassing monitoring will suggest a variety of somewhat different measures that are promising. So, better understanding and evaluating these seems quite useful.

I'm not discussing more sophisticated model internals methods which could technically be considered control. This includes: interpretability used for runtime detection of misalignment (or technically even upfront detection) and relaxed adversarial training / latent adversarial training. That said, methods like weak neuralese decoding could synergize well with control. I'm also not including methods like making deals with AIs. All of these methods are potentially promising, but aren't best studied through the black box control methodology and can mostly operate independently of the control agenda for now.

Security improvements

What about security? I don't have as much to say here, but I'll list some general points and then discuss some particularly promising and less obvious specific measures. There are probably useful methods that I'm missing due to my limited background in security.


