An overview of control measures

 

This article explores how to control AI so that it cannot cause existential problems, even if the AI attempts to subvert our countermeasures. It focuses on control measures for reducing existential risk from misalignment, including software infrastructure, security measures, and other control measures. The author argues that establishing a process for determining which methods are worthwhile is more effective than simply implementing specific measures. The article also details important software infrastructure, such as input/output logging, permission management and sandboxing, scaffolding that supports resampling/pausing/blocking, and output tracking, all of which are essential for controlling AI.

💡**Software infrastructure:** The article stresses the importance of building software infrastructure, particularly for internally used AI. This includes comprehensive logging of inputs and outputs so that behavior is traceable, and strong permission management and sandboxing to limit the AI's access and prevent misuse of resources.

🔒**The importance of security measures:** The author notes that a high level of security is important for controlling AI, while acknowledging that achieving high security is expensive. The article suggests that in a company that has already reached a given security level, the additional security improvements needed specifically for control can be relatively small, whereas in lower-security environments, strengthening security may be crucial.

⚙️**Core control measures:** The core of the article emphasizes measures such as monitoring, auditing, and tracking AIs, and blocking or resampling suspicious outputs. These measures are key to addressing the potential risks posed by AI.

📊**Output tracking and job/experiment tracking:** The article emphasizes the need for a system that associates AI outputs (such as fine-tuned model weights or other digital artifacts) with the process that produced them. It is also necessary to track the jobs the AI runs, in order to get a full picture of resource usage, permissions, and associated trajectories.

Published on March 24, 2025 11:16 PM GMT

We often talk about ensuring control, which in the context of this doc refers to preventing AIs from being able to cause existential problems, even if the AIs attempt to subvert our countermeasures. To better contextualize control, I think it's useful to discuss the main countermeasures (building on my prior discussion of the main threats). I'll focus my discussion on countermeasures for reducing existential risk from misalignment, though some of these countermeasures will help with other problems such as human power grabs.

I'll overview control measures that seem important to me, including software infrastructure, security measures, and other control measures (which often involve usage of AI or human labor). In addition to the core control measures (which are focused on monitoring, auditing, and tracking AIs and on blocking and/or resampling suspicious outputs), there is a long tail of potentially helpful control measures, including things like better control of AI capabilities and countermeasures for specific issues (like steganography, exploration hacking, and systematic pro-AI bias).

Background and categorization

Control evaluation methodologies are probably more important to develop right now, but better understanding the space of available control measures is still helpful. Broadly speaking, I think the ideal approach would mostly operate procedurally: doing adversarial analysis of what problems our AIs could cause, and then using control evaluations and other methods to determine measures (and the overall control protocol). That is, rather than approaching control from the perspective of "we should implement XYZ measures", I think it would generally be more effective to establish a process for determining which methods are worthwhile, both via testbed/benchmark-driven development and via abstract threat modeling and analysis. As AIs get more powerful, developing methods by iterating within testbeds looks increasingly good, so I correspondingly think that technique development now should be substantially focused on preparing for a time when iteration is more valuable by: developing methodologies, making conceptual progress, and understanding which categories of techniques are important.
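To make the procedural approach concrete, here is a toy sketch of comparing candidate control protocols in a testbed. The interface (`protocol.run`, `.caused_catastrophe`, `.solved_task`) and the thresholds are assumptions for illustration, not an existing API:

```python
def control_evaluation(protocol, tasks, attack_policy, benign_policy, n_trials=100):
    """Estimate (safety, usefulness) for one candidate control protocol.

    `protocol.run(task, policy)` is a hypothetical testbed interface returning a
    result with `.caused_catastrophe` and `.solved_task` booleans.
    """
    trials = tasks[:n_trials]
    catastrophes = sum(protocol.run(t, attack_policy).caused_catastrophe for t in trials)
    solved = sum(protocol.run(t, benign_policy).solved_task for t in trials)
    return 1 - catastrophes / len(trials), solved / len(trials)


def choose_protocol(candidates, tasks, attack_policy, benign_policy, min_usefulness=0.8):
    """Pick the safest candidate protocol that stays acceptably useful."""
    best = None
    for protocol in candidates:
        safety, usefulness = control_evaluation(protocol, tasks, attack_policy, benign_policy)
        if usefulness >= min_usefulness and (best is None or safety > best[0]):
            best = (safety, protocol)
    return best[1] if best else None
```

The point of the sketch is that the protocol as a whole is the unit being selected, rather than any single measure.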

However, I still think it is somewhat useful to list the most promising and important seeming measures at a high level because:

I'll focus on what a reasonably good version of control would look like, but note that this would require organizational buy-in and implementation. In a lower effort regime with less political buy-in (more like ten people on the inside), it would make sense to focus on a smaller subset of measures.

I find it useful to break down control measures into a few categories:

- Non-ML software infrastructure
- AI and human components
- Security improvements

Within the category of software infrastructure, I'm just including "traditional software", so not things like AI-based monitoring systems or adversarial evaluation (which is done by humans or AIs). Correspondingly, many of the most important methods live within the "AI and human components" section.

While this isn't the only possible breakdown, I think this breakdown is useful because it roughly corresponds to different fields that already exist: general software infrastructure, AI/ML, and security. More generally, I think methods within these areas have commonalities in the skills needed to work on them and in what needs to happen for a method to be implemented effectively (e.g., for software infrastructure, most of the difficulty will be in implementing it at each AI company, while the difficulty for AI and human components may look more like traditional ML research focused on figuring out which of several simple methods is best).

Software infrastructure, AI or human components, and security improvements also have somewhat different properties in terms of how difficult they are to implement and how much they will happen by default.

Getting to a generally high level of security is very expensive; however, the expensive traditional security measures we most want are also well motivated by concerns about human actors (both insider threats and outside attackers). So, if the company has already reached (e.g.) SL5 security, then the marginal security improvements I'd want specifically for control are relatively small. That said, I currently think it is likely we'll be trying to ensure control in regimes where security is substantially weaker than SL5, and it seems plausible that getting (potentially misaligned) AIs to do huge amounts of work to improve security will be important.

The software infrastructure we want for control (of internal usage) includes things that are somewhat useful for other purposes, like making it easier to prevent AI agents used internally from causing annoying problems and helping humans have an easier time understanding what AI agents are doing so they can point out mistakes. This implies that some of the infrastructure might be developed by default, though I do worry that it won't be applied as universally as we'd want. It also implies that people working on control should plausibly be somewhat wary of capabilities externalities (to the extent they think advancing capabilities at their organization is bad), though I think the externalities don't seem that costly and small mitigations might alleviate this concern.

Non-ML software infrastructure

By non-ML software infrastructure, I mean components which are just relatively traditional software (though this software might interact with some ML components). Thus, this category does not include measures that rely on human auditors or machine learning. Correspondingly, most of the "core" control measures focused on in earlier research (e.g. trusted and untrusted monitoring, adversarial evaluation) don't fall under "software infrastructure" and instead are under "AI and human components".

Software infrastructure measures should likely focus on control for internal usage. Or at least on AI usage for particularly sensitive applications. It may be difficult to implement control for broad swaths of usage if AI is provided via a low-level API (rather than a higher level API which handles scaffolding and is more similar to employing a contractor).

The most important software infrastructure

I'll start with a list of the most important types of software infrastructure:

- Comprehensive logging of inputs and outputs, so AI behavior is traceable after the fact.
- Permission management and sandboxing, to limit what each AI agent can access and to prevent misuse of resources.
- Scaffolding that supports resampling, pausing, and blocking suspicious outputs.
- Output tracking, associating AI outputs (e.g., fine-tuned model weights or other digital artifacts) with the process that produced them.
- Job/experiment tracking, so we can see each job's resource usage, permissions, and associated trajectories.
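As a rough illustration of how the first three items fit together, here is a minimal sketch of a scaffolding layer; all class and function names here are hypothetical rather than an existing library. It writes every input/output pair to an append-only log, checks permissions before a tool is used, and can resample or block a step that a monitor flags as suspicious:

```python
import json
import time
import uuid
from dataclasses import dataclass, field


@dataclass
class ActionRecord:
    """One logged step: what the agent saw, what it produced, and what we decided."""
    agent_id: str
    step_input: str
    step_output: str
    decision: str  # "allow", "block", or "resample"
    timestamp: float = field(default_factory=time.time)
    record_id: str = field(default_factory=lambda: str(uuid.uuid4()))


class ControlledScaffold:
    """Wraps model calls with logging, permission checks, and block/resample hooks."""

    def __init__(self, model, monitor, permissions, log_path,
                 block_threshold=0.9, resample_threshold=0.5):
        self.model = model              # callable: prompt -> proposed action (string)
        self.monitor = monitor          # callable: (prompt, action) -> suspicion in [0, 1]
        self.permissions = permissions  # set of tool names this agent may use
        self.log_path = log_path
        self.block_threshold = block_threshold
        self.resample_threshold = resample_threshold

    def _log(self, record):
        # Append-only log so every input/output pair is traceable after the fact.
        with open(self.log_path, "a") as f:
            f.write(json.dumps(record.__dict__) + "\n")

    def step(self, agent_id, prompt, tool_name, max_resamples=3):
        # Permission/sandbox check: only whitelisted tools are reachable.
        if tool_name not in self.permissions:
            self._log(ActionRecord(agent_id, prompt, "", "block"))
            return None
        for _ in range(max_resamples):
            action = self.model(prompt)
            suspicion = self.monitor(prompt, action)
            if suspicion >= self.block_threshold:
                # Block/pause: hold the step for (human or AI) audit instead of executing.
                self._log(ActionRecord(agent_id, prompt, action, "block"))
                return None
            if suspicion >= self.resample_threshold:
                # Resample: discard this output and ask the model again.
                self._log(ActionRecord(agent_id, prompt, action, "resample"))
                continue
            self._log(ActionRecord(agent_id, prompt, action, "allow"))
            return action
        # All resamples looked suspicious; treat the step as blocked.
        self._log(ActionRecord(agent_id, prompt, "", "block"))
        return None
```

In practice, the thresholds (and whether to resample, pause, or block) would be chosen via control evaluations rather than set by hand.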

Less important software infrastructure

Beyond this, there are a variety of software infrastructure components which are less essential, but which could still be quite helpful:

AI and human components

Now I'll talk about other measures which mostly involve utilizing humans or AIs (rather than being software infrastructure or security measures). This section includes the "core" control measures that Redwood has studied in earlier work. I'll focus mostly on black box (non-internals) measures.

Many of the measures we're considering will be applicable to multiple threat models, but they will often specifically be grouped into a few approximate clusters, most centrally:

- Measures targeting concentrated issues, where a small number of actions could cause a catastrophe before anyone reviews them.
- Measures targeting diffuse issues, where harm accumulates across many actions and individual actions can typically be reverted.

For the diffuse issues cluster, it is typically fine for review to happen asynchronously, potentially even after the action has been executed (or the code has been added to the code base, etc.), as reverting the problematic action may be basically fine. For concentrated issues, blocking problematic-seeming actions until they are sufficiently reviewed is more likely to be important. Correspondingly, as far as human review goes, we'd want some type of asynchronous review process plus a lower-latency process for looking at suspicious actions that are being blocked.
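A small sketch of that routing logic (names and thresholds are illustrative assumptions): actions relevant to diffuse threat models execute immediately and get queued for asynchronous review, while suspicious actions relevant to concentrated threat models are held for a lower-latency blocking review.

```python
from queue import Queue

async_review_queue = Queue()  # drained later by human or AI reviewers


def handle_action(action, threat_cluster, suspicion, execute, revert,
                  blocking_review, async_threshold=0.3, block_threshold=0.3):
    """Route one action based on which threat cluster it is most relevant to.

    threat_cluster: "diffuse" or "concentrated".
    blocking_review: callable returning True iff a reviewer approves the action.
    execute / revert: callables that run or undo the action.
    """
    if threat_cluster == "diffuse":
        # Reverting later is usually acceptable, so avoid adding latency here.
        execute(action)
        if suspicion >= async_threshold:
            async_review_queue.put((action, revert))  # reviewed after the fact
        return "executed"

    # Concentrated issues: a single unreviewed action could be catastrophic,
    # so block suspicious actions until an explicit approval arrives.
    if suspicion >= block_threshold and not blocking_review(action):
        return "blocked"
    execute(action)
    return "executed"
```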

I'll loosely group these measures into a few categories:

- Core measures
- Other less-core measures
- More speculative measures
- Specialized countermeasures

This grouping isn't intended to be fully precise, but the list of measures would otherwise be a long unstructured list and I think adding some structure helps with comprehension.

Core measures

This list includes central measures that we very likely want. Or at least, we'll want some version of these measures.

- Trusted and untrusted monitoring of AI actions and outputs.
- Human auditing of the most suspicious outputs.
- Tracking AIs and their outputs over time.
- Blocking and/or resampling suspicious outputs.
- Adversarial evaluation.
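As one simplified example of how monitoring, auditing, and fallback measures can compose, here is a hedged sketch of a trusted-monitoring pipeline; the names and thresholds are illustrative rather than the exact protocols from prior control work. A trusted monitor scores each untrusted output, the most suspicious slice goes to human audit, and moderately suspicious outputs are replaced with a trusted model's own attempt.

```python
def run_with_trusted_monitoring(task, untrusted_model, trusted_model, trusted_monitor,
                                audit_queue, audit_threshold=0.98, defer_threshold=0.8):
    """Produce an output for `task` while applying trusted monitoring.

    untrusted_model / trusted_model: callables mapping a task to an output.
    trusted_monitor: callable mapping (task, output) to a suspicion score in [0, 1].
    audit_queue: list collecting items held for human review.
    """
    candidate = untrusted_model(task)
    suspicion = trusted_monitor(task, candidate)

    if suspicion >= audit_threshold:
        # Most suspicious slice: hold for human auditing before any use.
        audit_queue.append((task, candidate, suspicion))
        return None
    if suspicion >= defer_threshold:
        # Moderately suspicious: fall back to the trusted (weaker) model's output,
        # trading some usefulness for safety.
        return trusted_model(task)
    return candidate
```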

Other less-core measures

These are relatively general measures that seem useful, but aren't as central or as essential.

More speculative measures

There are a variety of measures which aren't clearly viable, but which might be quite useful if they can be made to work.

Specialized countermeasures

For some specific concerns, it makes sense to develop relatively specific countermeasures. While it isn't necessarily predictable in advance which specialized countermeasures will be most relevant, it seems relatively clear that some specialized countermeasures are worthwhile.

Other directions and what I'm not discussing

In addition to these object-level measures, testbeds for refining and/or training our measures will be important.

While I discuss specific measures, my sense is that better general understanding of the dynamics around collusion, sandbagging, AI takeover strategizing capabilities, and bypassing monitoring will suggest a variety of somewhat different measures that are promising. So, better understanding and evaluating these seems quite useful.

I'm not discussing more sophisticated model internals methods which could technically be considered control. This includes: interpretability used for runtime detection of misalignment (or technically even upfront detection) and relaxed adversarial training / latent adversarial training. That said, methods like weak neuralese decoding could synergize well with control. I'm also not including methods like making deals with AIs. All of these methods are potentially promising, but aren't best studied through the black box control methodology and can mostly operate independently of the control agenda for now.

Security improvements

What about security? I don't have as much to say here, but I'll list some general points and then discuss some particularly promising and less obvious specific measures. There are probably useful methods that I'm missing due to my limited background in security.


