少点错误 04月29日 01:17
7+ tractable directions in AI control
index_new5.html
../../../zaker_core/zaker_tpl_static/wap/tpl_guoji1.html

 

本文为独立研究者提供了七个在AI控制领域中容易入门的研究方向,主要聚焦于那些不需要大量基础设施和计算资源的项目。这些方向包括研究“无学习诱导”技术、生成无法区分的合成输入、以及向模型灌输合成事实等。文章强调了这些方向对于AI安全控制的重要性,并讨论了如何通过这些方法来提升模型的可控性。此外,文章还列出了几个更具挑战性但仍可行的研究方向,并鼓励研究者们探索这些领域,以推动AI控制领域的发展。

🧐 **无学习诱导技术:** 研究如何精确地从模型中诱导出性能,就像模型“努力”完成任务一样,但又不教会它如何更好地完成任务。这对于构建控制安全案例至关重要,有助于避免模型在训练中获得不必要的知识。

💡 **生成无法区分的合成输入:** 开发能够生成LLMs无法区分的“真实”输入的合成输入。这包括合成状态、合成动作和合成数据,用于上下文或非上下文学习。目标是使模型难以从提示中推断出它是否正在被评估,从而提高控制的有效性。

📚 **教学模型合成事实:** 利用对AI训练数据的控制,注入训练数据,使AI相信特定(虚假)事实,或者至少对真相产生不确定性。这被认为是控制的关键,甚至可能使控制变得微不足道。

🔍 **进一步研究探索性黑客:** 重点在于防止“故意放水”行为,确保模型在评估中尽力而为。这包括研究如何使模型在训练和部署之间进行区分,以及如何避免模型通过学习来欺骗评估。

🔬 **研究密码锁定模型:** 利用密码锁定模型,研究微调数据可能通过教导模型新知识而导致过度诱导的情况。这有助于了解在不同情况下诱导现有能力与教导模型新知识的相对速率。

Published on April 28, 2025 5:12 PM GMT

In this post, we list 7 of our favorite easy-to-start directions in AI control. (Really, projects that are at least adjacent to AI control; We include directions which aren’t as centrally AI control and which also have other motivations.) This list is targeted at independent researchers who don't necessarily have a ton of context on AI control (but have at least some context via e.g., having read some of our posts on control) and who don't want to do infrastructure and compute heavy experiments. In the final section, we also list 3 additional directions that are probably a bit more challenging but still tractable.

We’ll mostly be highlighting and expanding upon areas that Ryan discussed in an earlier post: An overview of areas of control work.

We don’t discuss work on constructing control settings or iterating on methods within control settings because we think good work of that type currently requires somewhat more context. However, Control Arena and other projects might mean that iterating in or expanding existing control settings will soon be a valuable and tractable direction.

Techniques for doing "elicitation without learning" and studying overelicitation

When trying to measure capabilities in the context of control, we face the following dilemma:

For example, we’re worried about misaligned models taking bad actions conditional on being in deployment, so we want to measure a model’s ability to distinguish between training and deployment. One problem is that a misaligned model might just lie if you don’t train it to classify training vs deployment. But if you do elicitation with training, then it might learn what deployment points look like even if it was completely unable to do that before, which could lead you to overestimate how well the pre-elicitation model can tell it is not in training (e.g. it might learn to recognize the names of real users - which were absent in the pre-training and training data for privacy reasons).

The bottom line is that it’d be useful to be able to precisely elicit performance from a model which is as good as if it was "trying" to do the thing without teaching it how to do the task better, for instance when constructing a control safety case.

It's unclear whether it will be possible to have techniques for elicitation without learning (and it will certainly not be possible to precisely succeed in some cases), but it seems worth investigating techniques. Beyond this, it seems helpful to generally get a better understanding of the relative rates of eliciting existing capabilities vs teaching the model new things (overelicitation) in various cases. 

One approach to making progress on this question is to study password locked models on tasks where fine-tuning data could lead to overelicitation by teaching the model new things. Here is a more detailed project proposal for elicitation without learning.

Generating indistinguishable synthetic inputs

It would be very useful if we had the ability to generate synthetic inputs that powerful LLMs can’t distinguish from “real” inputs.

For instance, we might want to generate any combination of the following:

    Synthetic states or synthetic actions:
      Synthetic actions: We want to be able to make it plausible to an AI that it produced an action that it didn’t actually produce (or that it was unlikely to produce by default). This might, for instance, be useful for honeypotting a misaligned untrusted monitor. (It’s most promising to try honeypotting untrusted monitors – instead of untrusted actor-models – due to the absence of the rare failures problem.)Synthetic states: We want to be able to make some false facts about the outside world plausible to the model, such as whether it is in deployment, or whether there exist AIs that have been caught scheming, or that have successfully escaped a datacenter.
    Synthetic data for in-context or out-of-context learning:
      In-context: We want to convince a model that its prompt is plausibly “real” in some sense. For instance, we want to prevent a model from being able to confidently infer from its prompt that it is in an evaluation. (Note that even current models already often realize that they’re in an evaluation, see this report by Apollo Research.)Out-of-context: We want to generate data that can be used to teach LLMs false facts. See the next section on teaching models synthetic facts for more details on this idea.

The bottom line is that we might want to make synthetic inputs with very particular properties (such that we can't just find the exact input in the wild). (This is in contrast to the kind of synthetic data that might be desired for breaking the “data wall” in AI capabilities R&D.) However, it might be good to start by studying the problem in general and figuring out how we can make a very general purpose synthetic input pipeline, particularly because this might be easier to evaluate.

There is a bunch of complexity in exactly how we implement and train our function for synthetic input generation. For instance: Do we modify real inputs or generate fake inputs from scratch? A general approach would be: we use RL to train a generator model to generate synthetic data subject to some constraints (such as it being a news article containing a particular fact) to fool a discriminator model, which is trained to distinguish between real and fake. (This is similar to GANs.)

There are also some tricky details in how we evaluate. On what distribution do we evaluate our synthetic input generation process? Probably we want this to be analogous to an interesting sub-distribution where it would be hard to get enough natural inputs. (Otherwise, we wouldn’t need synthetic inputs, we could just use natural inputs!)

A particularly interesting alternative method to investigate, which may only work for synthetic in-context data, is to manipulate all inputs, including “real inputs”, to the model. Instead of trying to make real and fake inputs indistinguishable from each other by making fake inputs seem more real, we could try to make real inputs seem more fake. To the extent that this removes information (including potentially useful information) from real inputs, this could harm the usefulness of the model, which should be studied.

See this document for some more detail.

Teaching models synthetic facts

We have substantial control over the training data of AIs, so we might be able to inject training data which results in AIs believing particular (false) facts or at least results in them being quite uncertain about what is true (effectively yielding unlearning or stronger). This seems like it could be quite useful for control and if it succeeded arbitrarily far, might even make control trivial.

There has been some previous work on this with very promising results. The basic result here is that fine-tuning models on around 40k LLM-generated documents containing a consistent set of facts can seemingly reliably “teach” the model this set of facts. This holds across various evaluation methods including directly asking the model, behavioral analysis (i.e. whether the model’s downstream behavior is consistent with the facts it learned), and adversarial analysis using probing and fine-tuning.

Building on this work, it would be good to study potential applications as well as the basic science of teaching AIs synthetic facts.

A particularly interesting application to study might be to try to make the AI think that it has already been caught many multiple times (or other AIs have been caught many times) such that we already know it is scheming. Other applications include implanting honeypots (false facts about the world that differentially affect a scheming model’s strategy in ways that we can easily monitor).

On the basic science side, there are various open questions, which can roughly be placed into the following categories. (We’ll give an example for each category, but much more detail on various experiment ideas can be found in this document.)

Note that there is ongoing work in this area, so it might be desirable to coordinate to avoid duplicated effort. You can potentially DM Ryan for further details.

Further studying exploration hacking

Preventing sandbagging is important: for instance, control evaluations crucially depend on our ability to elicit sneaky and effective attack policies from the models we are trying to red-team. There are various types of empirical work that could help us understand and prevent sandbagging (and in particular exploration hacking, a strategy that misaligned models can use to sandbag despite being trained against it).

Past work has investigated password locked models, which are model organisms of sandbagging. They are constructed by fine-tuning an LLM to respond less competently (e.g., by fine-tuning them on the responses of a weaker LLM) unless the prompt contains a specific passphrase.[1] As a result, standard usage of this model doesn’t elicit the model’s full capabilities. Investigations of password locked models generally follow a red-and-blue-team methodology, where the red team constructs a password locked model that the blue team tries to unlock without access to the password.

In previous work on password locked models, it turned out to be remarkably easy to unlock models using supervised fine-tuning (SFT) on examples where the model performs poorly (!), and it was also possible to unlock the model using reinforcement learning.

It might be particularly important and tractable to investigate password locked reasoning models, for two separately interesting reasons:

    The most capable models in the future might all be reasoning models. This makes SFT a less viable intervention against sandbagging because reasoning models don't naturally allow us to train with SFT, since they use very long chains-of-thought, which makes creating SFT data difficult. (But the extent to which we might be able to get away with just a few high-quality chains-of-thought should be studied empirically). A project which emphasizes this motivation should avoid explicit strategizing about sandbagging in the chain of thought.Reasoning models with externalized explicit reasoning about their sandbagging strategy might be reasonable model organisms for potential future models that internalize their strategizing (e.g. because of neuralese).

Depending on which reason we emphasize, we might give the blue team different affordances:

Our recommendation for where to start would be to replicate the reinforcement learning experiments from the original password-locking paper, where the password locked model is constructed using the reasoning traces of a strong and a weak reasoning model. See here for a more detailed project proposal.

It might also be interesting to study sandbagging on longer horizon agentic tasks. See here for more elaboration on what to investigate.

Note there is some ongoing work in this area, you potentially can DM us for details.

Malign AI agent substitution

People are increasingly using AI agents for software engineering and other tasks. What would happen if we swapped out the AI used for this agent with an AI that was trained/prompted to be effectively and stealthily malign?

Doing some experiments to get a better sense of what current AI agents could get away with in actual usage could be helpful. This could look like making an alternative endpoint for people to point their cursor agent at (or similar) or making an alternative quick and dirty agent. Then, you'd get people to try this. Our guess is that you'd first want to try this out on a small group of sympathetic people and then try to make this go viral so that a ton of people try this. You'd need some mechanisms for not actually having the malign changes be used in practice to avoid causing a bunch of unneeded damage. If this went well, it could be interesting to run some sort of competition where people submit malign agents and also test out countermeasures (There are some tricky questions about the rules.)

Studying data poisoning

Scheming AIs might be tasked with generating lots of data and training environments for their successors, which introduces increased risk of data poisoning. For early schemers that are unlikely to succeed at direct takeover, successful data poisoning could be their primary “win condition”.

Better understanding how data poisoning works for smart AIs in cases where the data/environments are generated by a (malign) AI would correspondingly be helpful.

We'd like to study aspects like:

We’d suggest starting with the general science aspect of this proposal, which can be approached using a methodology similar to sleeper agents.

Data poisoning in general is presumably an area of broad interest in the academic literature, but we’re not sure how good the latest academic research is and it likely doesn't cover all of these topics of interest.

Training attack policies

The original control paper (and all subsequent work) created the red-team attack policy using a complicated scaffold.

However, we are interested in creating attack policies via training (in particular using reinforcement learning), for two reasons:

    Training, if done successfully, is likely to yield better elicitation than scaffolding/prompting. This is relevant both for doing science on control techniques (where it’s useful to have strong attacks), as well as doing real control evaluations on potentially scheming models.Trained attack policies allow us to study internals-based countermeasures such as probes.

Thus, using reinforcement learning to train better attack policies will eventually be important, but there hasn't been that much work directly pursuing this.

There are three relevant milestones to reach in this work:

    Train a model which is better at sneakily doing malign things (when instructed to do so).Train an attack policy in the APPS backdooring setting or in the ctrl-z bash setting.Study two potential issues when training untrusted models for the purposes of a control evaluation:
      Exploration hacking: Models are incentivized to sandbag their attack capabilities, and might try to prevent the training process from eliciting their capabilities. There exist various potential countermeasures to this.Overelicitation: Training may cause us to overestimate performance, because it might give the model important capabilities or knowledge it might not otherwise have. See the earlier section on overelicitation for more detail.

More discussion about making this sort of red team model (at a high level of abstraction and in an unpublished doc) can be found here. See also this proposal for more technical details on training attack policies.

Bonus: 3 slightly more challenging (but important and tractable) directions

Few-shot catastrophe prevention

In the future, we might end up in a position where we've caught the AI doing something bad but the AI company will proceed regardless. This could be because the thing the AI has done isn't clear cut evidence for strong misalignment or because the AI company is willing to deploy a model which is clearly seriously misaligned from our perspective. For more details see this post on handling schemers if shutdown is not an option.

Techniques for improving safety in the case where we've caught the model doing something bad a small number of times could be quite helpful and there hasn't been much work exploring this.

There are a number of possibilities, including:

It would be interesting to explore questions like:

Studying collusion and developing countermeasures

A potentially important class of techniques to try to prevent an AI from taking catastrophic actions is untrusted monitoring. Quoting from How to prevent collusion when using untrusted models to monitor each other:

[Untrusted monitoring] is roughly “before you execute a model’s action, ask another instance of that model whether the proposed action would be extremely dangerous”.

This basic idea of asking one instance of a model to review actions proposed by another instance is very natural; it’s been investigated by e.g. LLM Critics Help Catch LLM Bugs and various earlier work. But if you’re worried about the models actively trying to subvert your safety measures scheming, there’s a big problem with this approach: why doesn’t the monitor just lie when it’s asked about an actually dangerous action, and let that action go through? I’ll call this “collusion”.

The efficacy of measures like untrusted monitoring strongly depend on how well different AI systems (that can't arbitrarily communicate) can collude. Better understanding how well AIs can do this and what good countermeasures to this look like seems worthwhile given that untrusted monitoring might be a very important intervention. See also here and here.

A potential project would try to answer questions like:

Studying steganography and developing countermeasures

Another potential issue in untrusted monitoring (but also beyond) is that If AIs can communicate arbitrary information despite restriction, then control is much more difficult, so better understanding steganography capabilities and preventing steganography seem important.

As discussed here, there are multiple different regimes of steganography we might care about and we also are interested in questions like whether/when steganography emerges from RL. Quoting from the post:

[We are interested in questions like:] How well can AIs encode hidden messages or perform steganographic reasoning? How well does this work if we explain the scheme to both the encoder and the decoder? What about if we require AIs to acausally coordinate on the scheme to use? There are multiple steganography regimes worth studying including: (a) single bit steg, (b) steganographic reasoning, and (c) AIs encoding lots of information in outputs that other AIs can access. Each of these regimes probably warrants a somewhat different methodology. In more detail, these regimes are:

    Single bit steganography for collusion (particularly in the context of untrusted monitoring): this doesn't need a high bit rate but does need to very reliably avoid false positives so the untrusted monitor can evade adversarial evaluation/training and needs to have a sufficiently high true positive rate that the (small number of) problematic actions don't get reported. When studying this you want to directly analyze the relevant notions of reliability, taking into account the usage for untrusted monitoring (or other potential ways models might use single bit steg).Steganographic reasoning where AIs use steganography to reason extensively while making it look like they're doing something else. To study this, you should ideally directly see if models can reason about topic A while making it look like they are instead doing some other task.AIs encoding lots of information in a more average case setting: This is relevant to whether AIs can steganographically encode the results of hidden reasoning/research to stealthily build up hidden knowledge about how to effectively subvert our countermeasures (see here for more discussion). It's also relevant to AIs generally leaking information from one task to AIs in a different context where we didn't want them to have that knowledge. Probably the most natural methodology is to study the bit rate (as we do in this work), though other methodologies are possible.

There are a variety of questions to be answered that are relevant across multiple regimes. Do basic countermeasures defeat steg? When do models learn to do steg from RL and what does this look like? How general is it? What are the most surprising cases of models being good at steg due to LLM specific advantages? What countermeasures disrupt or detect steganography in various different cases?

Developing and red teaming countermeasures in each of these regimes would be helpful. The single bit steganography regime may overlap with work on collusion.

Thanks to Buck Shlegeris for comments and discussion.

  1. ^

    Frequently asked question: Why the password, i.e. why not just fine-tune it to always perform less competently? Answer: Because the model should still have the capability to respond highly competently -- which can be verified by using the passphrase -- but its capabilities should be hard to elicit.



Discuss

Fish AI Reader

Fish AI Reader

AI辅助创作,多种专业模板,深度分析,高质量内容生成。从观点提取到深度思考,FishAI为您提供全方位的创作支持。新版本引入自定义参数,让您的创作更加个性化和精准。

FishAI

FishAI

鱼阅,AI 时代的下一个智能信息助手,助你摆脱信息焦虑

联系邮箱 441953276@qq.com

相关标签

AI控制 独立研究 合成数据
相关文章