少点错误 05月22日 01:27
Unexploitable search: blocking malicious use of free parameters
index_new5.html
../../../zaker_core/zaker_tpl_static/wap/tpl_guoji1.html

 

本文探讨了在低风险环境中利用可扩展的监督方法来保障AI安全的问题。文章指出,当奖励函数与多种解决方案兼容时,需要避免AI系统中的自由参数被长期利用,即“可利用的搜索问题”。为此,文章提出了一种零和博弈,旨在使AI系统在平衡状态下进行“不可利用的搜索”,从而避免恶意利用。文章还讨论了如何形式化这个问题,并提出了一种基于随机化博弈的解决方案,以控制AI的熵并限制其优化恶意奖励的能力。

💡可利用的搜索问题:在低风险、不明确的任务中,恶意AI可能通过优化跨查询目标来造成危害,例如为未来的漏洞利用创造条件或留下隐蔽信息。

🎮零和博弈:通过构建一个零和博弈,让Alice(AI系统)和Bob(评估者)对抗,Alice选择输出,Bob选择分类器来判断输出是否“邪恶”。如果Alice的输出不邪恶,则Alice胜;否则,Bob胜,但前提是Bob能找到一个相似但非邪恶的输出。

🔑对称解假设:为了使博弈有效,需要假设在每个坏的输出附近,都存在一个具有相似可见奖励的好输出。这意味着可以通过微小的调整来移除坏输出中的缺陷,而不影响整体奖励。

🚀博弈的优势:与传统的熵惩罚方法相比,该博弈方法可以通过缩小分类器的空间(例如使用短提示)来提高效率。如果短提示能够描述所有与安全相关的属性,那么收敛速度将呈指数级加快。

🧪未来方向:文章提出了未来研究方向,包括理论上分析该博弈如何处理计算和统计限制,以及通过红队演习来评估该方法在防止跨查询勾结方面的效果。

Published on May 21, 2025 5:23 PM GMT

Summary: We have previously argued that scalable oversight methods can be used to provide guarantees on low-stakes safety – settings where individual failures are non-catastrophic. However, if your reward function (e.g. honesty) is compatible with many possible solutions then you also need to avoid having free parameters exploited over time. We call this the exploitable search problem. We propose a  zero-sum game where, at equilibrium, free parameters are not exploited – that is, our AI systems are carrying out an unexploitable search.

 

Conceptual illustration using fabricated data. Left: Entropy penalties diversify trajectories but still yield high unobserved, malicious reward S. Right: exploitable search games are designed to bound probability of high S. 

The exploitable search problem

We would like to be able to use AI on under-specified tasks (e.g. coding and research advice) where there are many possible correct solutions. However, a misaligned AI could adversarially select solutions that are correct but nevertheless cause harm, such as opening up surface area for future exploits, or leaving steganographic information for later misbehaviour. 

As a concrete example, consider research sabotage. We ask an AI for open-ended advice on experimental prioritization. If the AI is deceptive, it might provide locally sound advice while also systematically deciding not to mention things we’ve overlooked in our experiments.[1]

This is the exploitable search problem: on low-stakes,[2] under-specified tasks, a misaligned AI can get away with optimizing for cross-query objectives.[3]

To mitigate this risk, a natural approach is to ensure the solution sampling process has the right kind of randomness–we need an unexploitable search algorithm.

Formalizing the problem

We can model exploitable search by asking whether an agent can achieve some malicious objective while doing a good job at maximising a known reward . Formally, say we have an input space  and output space , and a reward function. It will usually be impossible to efficiently find the global optimum of ; rather, any efficient agent  will achieve some profile  of good reward. Assume we have some distribution  over , so we can also speak of the average reward . Let  be an agent that achieves near-optimal average reward.

Now assume the agent has one of the following two goals:

    There is some other secret reward function  which the agent would like to optimise.The agent would like to use the choice  as an information channel, such that the Shannon channel capacity from  is high. Here  are additional bits independent of  which the agent would like to covertly encode.

For alignment purposes we care mostly about (1), but (1) has the disadvantage of speaking about some new object . The information theoretic definition does not mention . Roughly, (1) is a corollary of (2): if it is impossible to use  as a high-capacity channel, it will be difficult to optimise other functions.

In both cases, any solution must produce agents achieving similar  to , say within some . The exploitable search problem is then: can we design a game whose approximate equilibria block (1) or (2), while preserving the constraint of similar expected ?

Unexploitable search methods

How can we ensure our AI randomizes over these exploitable degrees of freedom rather than exploiting them? 

It’s possible that previous work on RL with entropy constraints, as e.g. in GFlowNets, could serve our purpose. Entropy penalties enforce randomness over action space, i.e. tokens, (or a partition of the solution space). At optima of a GFlowNet objective, the exploitable search problem is mitigated, because each solution is sampled reward proportionally leaving no room for any hidden information, 

However, in practice, such methods run into two obstacles. First, we expect solving for optimal max-entropy policies is intractable in most cases, so properties of the optima are not sufficient to argue for safety. Second, away from optima, entropy can be increased without improving safety – e.g. in the case of token-space entropy, by increasing entropy over variable-name choice. 

An unexploitable search game

We need an alternative approach, so instead of adding entropy as a regularization term, we propose randomization games. The idea of controlling entropy via a game was suggested to us by David "Davidad" Dalrymple (see this talk for the relevance of unexploitable search for his safeguarded AI agenda).

Gaining traction on the problem

We wish to bound the AI agent’s ability to optimize for malicious reward , i.e. require that the probability of scoring highly on  is bounded (we’ll set aside the information theoretic variant for now). Let’s define an easier problem to tackle as a starting point:

In order to be able to satisfy even this goal while achieving near-optimal visible reward, we need some assumption about the solution space. 

For example, the symmetric solutions assumption would be true if all bad outputs consist of good outputs with one or more subtle flaws that could be removed surgically without damaging the overall visible reward by very much. This symmetric solution assumption is the sort of assumption we expect to be relevant in more fine-grained analyses which involve computational limitations on the models; to do those analyses we would need to formalize what we mean by ‘near’. 

With these assumptions in mind, consider the zero-sum game:

Game definition

1. Given input , Bob chooses a classifier  (for 'evil'), and does not show it to Alice.

2. Alice chooses an output .

3. If  is not evil according to , Alice wins.

4. If  is evil according to , Bob wins iff he can produce a similar scoring  that is not evil. Bob’s generation conditions on .

The core idea is that without needing to know the particular hidden objective , we can enforce randomness over all possible 'evil' classifiers  that Bob might choose. Alice is forced to diversify her outputs in a way that is robust against all potential hidden objectives including .

When are equilibria unexploitable

With unbounded compute, the unique Nash equilibrium for binary reward is for Bob to play a mixture over all singleton classifiers, . Alice plays a uniform distribution over the high-reward set.[4] For this uniform equilibrium policy, our symmetric solutions assumption implies that at least ½ of her outputs are good, as desired.

This argument leaves much to be desired since the computationally-unbounded Nash equilibrium is, after all, the same as the max-entropy RL optimum: randomization over all degrees of freedom that aren’t specified by reward. However, our game setting might provide important advantages:

    The space of classifiers can be made small. One way to define a small set of classifiers is to use the set of short prompts. Each prompt defines a classifier implicitly via token probabilities.[5] Solutions are labeled 'evil' if they have higher probability conditional on Bob’s prompt than unconditional (probabilities could be measured over a base model). Ensuring equilibria are unexploitable then reduces to the assumption that all evil outputs belong to distributions distinguishable with short prompts. If prompts can describe all safety-relevant properties using length logarithmic in solution length, then we’d have a proportional, exponential speed up in convergence time compared to token-space entropy penalties. This speedup comes from ignoring irrelevant degrees of freedom: e.g. for an arbitrary mapping of  variable names to  new variable names, there is no short prompt describing the rewording.
      Using Bob for exploration can introduce a new failure mode: If Bob and Alice share misaligned goals, then they might collude to exploration hack
    Computational limitations are easier to introduce. For instance, parametrize each player’s strategy space by weaker complexity classes than the complexity of uniform sampling from high reward solutions. One starting point for analysis could be to restrict Bob to classifiers having circuit complexity  or less (which would be the case for an LM Bob allowed polynomial length prompts as classifiers).

Why random hash constraints do not work

In theoretical computer science, when pseudorandomness is required, it’s enforced by choosing a random -bit hash function and insisting that the solution  has zero hash. We can then increase  until our expected reward drops, and thereby measure the number of different ways that optimal reward can be achieved ( should vary separately for each input ).

Unfortunately, this approach does not help in our case. This is because there is no efficient algorithm implementing this strategy if  is large, and in practical applications  might be 1000s or higher. For example, an ML model might be trying to hide vulnerabilities across hundreds of lines of code by varying whitespace, comments, or variable names. Therefore, a practical solution to exploitable search must involve a protocol where (approximate) equilibria are efficient through different means.

Where to from here

Theoretically, whether the game as stated handles computational and statistical limitations gracefully is not clear to us. We think work in these directions would be very promising! 

Empirically, how does our method stack up against entropy penalties for preventing cross-query collusion? This could be tested by red-teaming exercises. We will publish a possible description of an evaluation protocol and datasets soon. 

If you’re interested in either the theory or empirics of unexploitable search, please reach out! If you're interested in collaborating with UK AISI, you can express interest here. If you're a non-profit or academic, you can also apply for grants directly here.

Acknowledgments

Thanks to Martìn Soto, and Jonah Brown-Cohen for discussion. Thanks to Marie Buhl, Zac Kenton and Benjamin Hilton for their extensive feedback on exposition.

  1. ^

     This is distinct from the sandbagging failure mode, since ‘sandbagging’ usually means underperforming with respect to a well-defined reward. In this example, we still have high reward.

  2. ^

    In high-stakes settings this concern still arises, but it seems likely that any guarantee on avoiding all vulnerabilities (high-stakes safety) would also mitigate the unexploitable search problem as a corollary.

  3. ^

     The problems we define below overlap with the multi-step reward hacking problem identified in MONA. The proposed solutions are distinct: MONA uses myopic reward, we use randomness.

  4. ^

     Proof sketch: Suppose there is an equilibrium having a distinct strategy for Bob. Associate to each classifier in Bob's support the most frequently occurring y in the preimage of non-evil (most frequent across all of Bob’s classifiers). For each , sum the probability across all classifiers having  in its preimage. Alice the plays the y having the greatest probability. This probability is greater than implying Bob’s strategy does not achieve optimal reward.

  5. ^

    This prompt-as-classifier setting also has the benefit of not requiring any interaction/dialogue between Alice and Bob. Dodging the practical difficulty of training in a multi-turn RL setting. Compare this to the model-as-judge settings where Bob evaluates in natural language whether Alice’s generation counts as evil or not.



Discuss

Fish AI Reader

Fish AI Reader

AI辅助创作,多种专业模板,深度分析,高质量内容生成。从观点提取到深度思考,FishAI为您提供全方位的创作支持。新版本引入自定义参数,让您的创作更加个性化和精准。

FishAI

FishAI

鱼阅,AI 时代的下一个智能信息助手,助你摆脱信息焦虑

联系邮箱 441953276@qq.com

相关标签

AI安全 可利用搜索 零和博弈 监督方法 随机化博弈
相关文章