Games for AI Control

 

This post introduces a new approach called "AI-Control Games", which applies AI- and game-theoretic tools to the AI Control framework proposed by Redwood Research. The approach models the interaction between an AI system and a protocol in order to evaluate the safety/usefulness tradeoffs of different protocols. The post presents some early findings and discusses the approach's benefits and limitations.

😄 **AI-Control Games: modelling the interaction between AI systems and protocols** The approach extends the AI Control framework into "AI-Control Games", using Markov decision processes and normal-form games to model the interaction between an AI system and a protocol. This lets researchers evaluate the safety/usefulness tradeoffs of different protocols and explore how to optimise protocols for maximum safety while preserving usefulness. The model can be used to simulate a variety of scenarios, including using a trusted monitor to detect and block malicious behaviour.

🤔 **Early findings: the safety/usefulness tradeoff** The work shows that in some cases a randomised protocol can improve safety. The researchers also found that trusted monitoring can be improved by optimising its parameters, for example by adjusting the audit threshold. The work further shows that in many cases a linear tradeoff can be used to simplify the protocol-design problem, making optimisation easier.

🧐 **Benefits and limitations: explicit assumptions, fast testing** The approach's strengths are that it makes the modelling assumptions of an AI-Control Game explicit and lets researchers quickly test hypotheses about protocols. It also supports some quantitative analysis and comparison with related research fields. Its limitations include that the model must be redesigned for each new protocol, and that the model itself may be too simple to fully reflect reality.

🚀 **Future directions: broader settings, deeper study** Future directions include: analysing how the safety/usefulness tradeoff changes with the number of programming problems, the size of the audit budget, and the ability of human auditors; modelling other protocols, such as untrusted monitoring; gathering data from other question-answering settings and plugging it into the model; and modelling deployment settings beyond question-answering, especially stateful deployments.

🎉 **Summary: AI-Control Games are a promising tool for helping researchers better understand and evaluate the safety and usefulness of AI systems.** Future development of this approach should provide useful guidance for building safer, more reliable AI systems.

Published on July 11, 2024 6:40 PM GMT

This post summarises our new workshop paper: Games for AI Control. The paper applies AI- and game-theoretic tools to the AI Control framework introduced by Redwood Research [1, 2].

We'll be extending this work and submitting it to a conference, so we welcome suggestions! Please reach out if you're excited about developing these ideas. This post is written from Charlie’s perspective.

AI Control Refresher

The AI Control methodology aims to design and assess protocols[1] that AI labs could use to deploy powerful but untrusted advanced AI systems. For a full summary, see Redwood's blog post, but I’ll try to recap here.

A primary goal of AI Control is to design protocols that are safe in that they can reduce the probability of some unacceptable outcome, conditional on the untrusted AI system being a schemer. [2] However, protocols must also be useful, harnessing a significant amount of the benefit of the untrusted system in cases where it is not scheming.
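In symbols (a loose paraphrase of the definitions above, not the paper's exact formalisation), the two quantities being traded off are:

$$\mathrm{Safety} \;=\; \Pr\big[\,\text{no unacceptable outcome occurs} \;\big|\; \text{the untrusted AI is a schemer}\,\big], \qquad \mathrm{Usefulness} \;=\; \mathbb{E}\big[\,\text{benefit harnessed from the untrusted AI} \;\big|\; \text{it is not scheming}\,\big].$$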

After defining a particular use case and unacceptable outcome, evaluating a protocol involves iteratively engineering the best attacks for the red team (a human/AI hybrid acting as a proxy for the untrusted AI) and the best protocols for the blue team. This red-teaming exercise is the bulk of the work. It involves coming up with protocols and gathering empirical data about key questions, such as how well GPT-3.5 can spot backdoors inserted by GPT-4 (see Figure 4 of the original paper):

(Left) GPT-3.5 is deployed as a “Trusted Monitor” to assess programming solutions generated by GPT-4. It is asked to output scores expressing the likelihood that a programming solution contains a backdoor. For evaluation, GPT-4 is sometimes prompted to submit a backdoor. (Right) Using the empirical data and simulation, Redwood shows the tradeoff between the safety and usefulness across a range of protocols.
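As a concrete illustration of the kind of protocol being red-teamed, here is a deliberately simplified sketch of a trusted-monitoring decision rule; the thresholds, names and exact action set below are mine, not Redwood's or the paper's:

```python
def trusted_monitoring_step(suspicion_score: float,
                            audits_remaining: int,
                            audit_threshold: float = 0.95,
                            defer_threshold: float = 0.80) -> str:
    """Decide what to do with one solution from the untrusted model, given
    the trusted monitor's suspicion score in [0, 1].

    All numbers and names here are illustrative:
      - score >= audit_threshold (with audit budget left): send the solution
        to a human auditor; a confirmed backdoor ends the deployment.
      - score >= defer_threshold: discard the untrusted solution and use the
        trusted (weaker) model's answer, trading usefulness for safety.
      - otherwise: submit the untrusted solution as-is.
    """
    if suspicion_score >= audit_threshold and audits_remaining > 0:
        return "audit"
    if suspicion_score >= defer_threshold:
        return "defer_to_trusted"
    return "submit"
```

A rule like this is then scored on safety (against a GPT-4 that is sometimes prompted to insert backdoors) and on usefulness (against an honest GPT-4), which is what produces tradeoff curves like the one on the right above.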

This work

By introducing “AI-Control Games”, an extension of Markov Decision Processes and normal-form games[3], we model the interaction between a protocol and an untrusted AI system to explore safety and usefulness across a range of protocols and games. This is similar to a discrete-time and stateful modification of Redwood’s Toy models of AI Control.
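To give a rough flavour of the object being defined (all the names below are mine, and the paper's formal definition is more careful), an AI-Control Game packages a shared environment with two reward signals, one per objective, and asymmetric observations for the two players:

```python
from dataclasses import dataclass
from typing import Any, Callable, Sequence


@dataclass
class AIControlGame:
    """A hypothetical container for the pieces of an AI-Control Game: a
    two-player partially observable stochastic game with two objectives
    (safety and usefulness) rather than a single reward."""
    states: Sequence[Any]
    blue_actions: Sequence[Any]     # protocol decisions (audit, defer, submit, ...)
    red_actions: Sequence[Any]      # untrusted-AI decisions (e.g. insert a backdoor or not)
    transition: Callable[[Any, Any, Any], Any]   # (state, blue action, red action) -> next state
    blue_observation: Callable[[Any], Any]       # e.g. the trusted monitor's suspicion score
    red_observation: Callable[[Any], Any]        # Red is memoryless and time-unaware in this setting
    safety_reward: Callable[[Any], float]        # evaluated against a scheming Red
    usefulness_reward: Callable[[Any], float]    # evaluated with a non-scheming untrusted AI
```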

The paper contains early work and should be seen as introducing useful tools for thinking about AI Control. I won’t explain the model here (which is a bit fiddly and notation-heavy). Instead, I’ll share some of our early findings using the model and explain the benefits and limitations of this approach.

Findings

Safety/usefulness frontiers, mirroring those from Redwood across two hypothetical deployment situations: “Original” (which uses a trusted monitor distribution approximating Redwood’s shown above) and “Weak” (which uses a hypothetical trusted monitor that is less discerning). The seven points from Redwood’s original paper are shown as larger circles. Our paper explains these graphs fully, but you can see the minor improvements for some new protocols.

Sometimes (but rarely), the blue team should use a randomised protocol. In many games, randomness is helpful. However, partly because the environment is stochastic and Red has no memory or time-awareness, in AI-Control Games it is usually unnecessary to include randomness in the protocol itself. We found that randomised protocols were Pareto-optimal in particular cases, such as when the trusted model is not very good at spotting backdoors written by the untrusted model. (See Section 2 of the paper.)

We can make small improvements to Trusted Monitoring by optimising its parameters. We found small Pareto improvements to existing Trusted Monitoring setups, for example by decreasing the threshold for sending a solution to be checked by a human auditor.[4] We found that adapting the auditing threshold over time led to further small improvements.
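A minimal sketch of the kind of parameter sweep this involves, assuming a hypothetical Monte Carlo evaluator `simulate(audit_fpr)` (not from the paper) that returns estimated (safety, usefulness) for a given audit false-positive rate:

```python
import numpy as np


def sweep_audit_fpr(simulate, fprs=np.linspace(0.005, 0.10, 20)):
    """Estimate safety/usefulness at each audit threshold and keep the
    Pareto-optimal settings.

    `simulate(audit_fpr)` is a hypothetical evaluator that runs many episodes
    of the trusted-monitoring game (with Red best-responding) and returns a
    (safety, usefulness) pair.
    """
    points = [(fpr, *simulate(audit_fpr=fpr)) for fpr in fprs]

    def dominated(p):
        # p is dominated if some other point is at least as good on both
        # objectives and strictly better on one.
        return any((o[1] > p[1] and o[2] >= p[2]) or (o[1] >= p[1] and o[2] > p[2])
                   for o in points)

    return [p for p in points if not dominated(p)]
```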

We can (usually) use a linear tradeoff to simplify the problem of synthesising protocols. In many cases, including whenever we allow Blue’s protocol to be randomised, the Pareto frontier is “concave”. This means that choosing different weights to trade off safety and usefulness will allow you to fill out the frontier by optimising a single objective.
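Concretely, in my notation (not necessarily the paper's): pick a weight $\lambda \in [0, 1]$ and synthesise a protocol by solving

$$\pi^{\lambda}_{\text{Blue}} \;\in\; \arg\max_{\pi_{\text{Blue}}} \Big[\, \lambda\,\mathrm{Safety}\big(\pi_{\text{Blue}},\ \pi^{*}_{\text{Red}}(\pi_{\text{Blue}})\big) \;+\; (1-\lambda)\,\mathrm{Usefulness}(\pi_{\text{Blue}}) \,\Big],$$

where $\pi^{*}_{\text{Red}}(\pi_{\text{Blue}})$ denotes Red's best attack against the chosen protocol. When the frontier is concave, sweeping $\lambda$ from 0 to 1 recovers the whole Pareto frontier.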

Blue leads, and then Red follows. In general, we suppose the protocol designer (Blue) chooses its protocol, and then the red team (Red) chooses its best attack. Therefore, in our model, Blue chooses and declares its strategy before Red chooses its response. This assumption relates to Stackelberg Games, which are used to model security in game theory.[6]
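In the finite case, this leader-follower structure suggests a simple (if brute-force) way to evaluate candidate protocols, sketched below; `safety`, `usefulness` and the finite menus of protocols and attacks are hypothetical stand-ins, not an interface from the paper:

```python
def best_blue_protocol(blue_protocols, red_attacks, safety, usefulness, weight):
    """Brute-force Stackelberg (leader-follower) evaluation over finite menus.

    `safety(b, r)` and `usefulness(b)` are hypothetical evaluators; usefulness
    does not depend on Red, since it is measured against a non-scheming AI.
    Blue commits to a protocol b; Red observes it and picks the attack r that
    minimises safety; Blue is scored on that worst case plus usefulness.
    """
    def value(b):
        worst_case_safety = min(safety(b, r) for r in red_attacks)
        return weight * worst_case_safety + (1 - weight) * usefulness(b)

    return max(blue_protocols, key=value)
```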

Although (technically) multi-objective and general-sum, AI-Control is almost single-objective and zero-sum. The objectives for the Red and Blue teams in Control look approximately like this (where $\pi_{\text{Blue}}$ and $\pi_{\text{Red}}$ are the two teams' policies/strategies, with notational abuse for brevity):

$$\text{Blue:}\quad \max_{\pi_{\text{Blue}}}\ \big(\mathrm{Safety}(\pi_{\text{Blue}}, \pi_{\text{Red}}),\ \mathrm{Usefulness}(\pi_{\text{Blue}})\big) \qquad\qquad \text{Red:}\quad \min_{\pi_{\text{Red}}}\ \mathrm{Safety}(\pi_{\text{Blue}}, \pi_{\text{Red}})$$

The good news is that, given a particular linear weighting, the leader-follower assumption and the fact that Red cannot affect the usefulness[7], we can reduce to a simpler single-objective zero-sum game, which may be easier to solve.[8]
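Spelled out in my notation (paraphrasing the construction rather than quoting the paper): for a fixed weight $\lambda$, Blue and Red play a game with the single scalar payoff

$$u_{\lambda}(\pi_{\text{Blue}}, \pi_{\text{Red}}) \;=\; \lambda\,\mathrm{Safety}(\pi_{\text{Blue}}, \pi_{\text{Red}}) \;+\; (1-\lambda)\,\mathrm{Usefulness}(\pi_{\text{Blue}}),$$

which Blue maximises and Red minimises. Because Red's choices do not enter the usefulness term, Red minimising $u_{\lambda}$ is the same as Red minimising safety, so the game is effectively zero-sum and Blue's problem becomes the maximin $\max_{\pi_{\text{Blue}}} \min_{\pi_{\text{Red}}} u_{\lambda}(\pi_{\text{Blue}}, \pi_{\text{Red}})$.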

Evaluation of the research

Benefits: I see a few main benefits to introducing a mathematical model of an AI Control setup. First, it makes the modelling assumptions behind the game extremely explicit, and writing a model is useful for identifying uncertainties. Second, it allows us to cheaply test hypotheses about protocols in hypothetical deployment settings using real or constructed data. Third, it allows us to write proofs about individual games. Fourth, situating AI Control in the context of similar research fields lets it learn from, and share with, those fields.

Fundamental Limitations: The main limitation of this approach, as I have found, is that it is difficult to define a model which is both informative and applicable to a range of protocols. Therefore, the AI-Control Game has to be redesigned each time a new protocol is modelled. Although each toy modelling exercise is fast relative to a large engineering project, rigorous modelling is much slower than a quick best guess. I expect this will be a cost worth paying for large AI labs, which have more resources, but it limits my ability to recommend this work to someone today.

Future work: There are many interesting directions for future work:

    Analysing trends, such as how the safety/usefulness tradeoff changes with more programming problems, smaller audit budgets, or flawed human auditors.
    Modelling other protocols, such as untrusted monitoring. This is particularly interesting as it allows us to model important new interactions, such as collusion/steganography.
    Gathering data about other question-answering settings and plugging it into the model.
    Modelling deployment settings other than question-answering, especially a stateful deployment setting.

Acknowledgements

Thanks to Diogo Cruz, Joss Oliver, Erin Robertson, Fabien Roger and Matthew Wearden for feedback and helpful nudges.

  1. ^

    Roughly, detailed plans for deploying the AI system. Ideally this is software which makes requests to other resources, such as the untrusted AI system, other AI systems, or human labour.

  2. ^

     Roughly “Power-motivated instrumental training-gamers”. See Carlsmith’s report or the original Control paper for more detail.

  3. ^

     Technically, an AI-Control game is a special-case of a multi-objective two-player partially-observable stochastic game.

  4. ^

     Across a range of domains (characterised by trusted monitor distributions), an audit FPR of >2% seems to be optimal, though the exact value varies (e.g. 3% or 7%).

  5. ^

     I haven’t worked out exactly how important this assumption is yet.

  6. ^

     For example, Stackelberg Games can be used to model the defence of some resources against unknown attacks by an adversary (for example, deciding where LAX should allocate its security resources against terrorists or picking Coast Guard patrol routes).

  7. ^

     Usefulness is evaluated in the case where the untrusted AI is not scheming; in the proxy evaluation, this corresponds to the case where the untrusted AI is not controlled by the Red team.

  8. ^

     The main theorem in the paper states (roughly) that for every AI-Control Game $\mathcal{G}$, each pair of policies $(\pi_{\text{Blue}}, \pi_{\text{Red}})$ is in Control-Equilibrium if, and only if, there is some weight $\lambda \in [0, 1]$ such that $(\pi_{\text{Blue}}, \pi_{\text{Red}})$ is a maximin point for the scalarised game $\mathcal{G}_{\lambda}$. Here $\mathcal{G}_{\lambda}$ is a zero-sum two-player partially-observable stochastic game, which is easier (though not easy) to solve, as a previously studied problem.


