Near- and medium-term AI Control Safety Cases

This post examines the possibility of AI causing catastrophic harm and proposes a mitigation strategy based on “control”. It frames AI catastrophe as a conjunction of two conditions, the AI having both the capability and the disposition to bring it about, and suggests managing the risk either by raising the threshold required for catastrophe or by lowering the AI’s capabilities or disposition. The focus is on AI control and monitoring methods that limit the AI’s actions, communication and memory, analogous to placing a human under close surveillance. The post also discusses the adequacy of experimental settings, the need for red-teaming, and the scalability of control methods, distinguishes between the security of internal and external deployment, and stresses the importance of continual oversight and of being able to shut down or delete AI models in an emergency.

🤖 Catastrophic AI risk arises when an AI has both the capability and the disposition to bring about a catastrophe, as opposed to multi-agent tragedies that no single actor intended. The post focuses on AI misalignment rather than human misuse.

🛡️ The core of AI control is to raise the threshold for catastrophe by limiting or monitoring the AI’s actions, communication and memory, much like placing a human under close surveillance, reducing the AI’s chances of carrying out catastrophic behavior. At the same time, the monitoring protocol must not be so invasive that it drastically reduces the AI’s useful cognitive work.

🧪 An adequate experimental setting is essential: the experimental setup should be no harder than real deployment, so that any security hole an AI could exploit in deployment would also show up in the experiments. Red-teaming must likewise be thorough and challenging enough to validate the effectiveness of the control measures.

⚙️ Internal and external deployment need to be considered separately, since their software architectures and setups can differ substantially. External deployment may require stronger protections against the higher adversariality of arbitrary users. Continual oversight and audit mechanisms are essential to ensure the lab follows the protocol throughout deployment.

🚨 In an emergency, measures such as shutting down the whole pipeline or deleting certain copies of the model may be required. However, if AI services become critical infrastructure, such measures may no longer be viable, and workarounds such as redundancy would be needed.

Published on December 23, 2024 5:37 PM GMT

This essay was part of my application to UKAISI. Unclear how much is novel, but I like the decomposition.

Some AI-enabled catastrophic risks involve an actor purposefully bringing about the negative outcome, as opposed to multi-agent tragedies where no single actor necessarily desired it. Such purposeful catastrophes can be broken down into a conjunction: they materialize if and only if some actor has the capabilities and disposition required to materialize them. Let’s additionally focus on misalignment (actor = AIs) instead of misuse (actor = humans).[1]

A way to prevent such risks is to ensure one of the two conjuncts doesn’t hold. And for each conjunct, this can be achieved either by raising the required threshold for catastrophe (of capabilities or disposition), or by lowering the AIs’ amount (of capabilities or disposition).

| | …capabilities | …disposition |
| --- | --- | --- |
| Raising catastrophic threshold of… | AI Control and Monitoring | Speculative “trading with AIs” approaches |
| Lowering AIs’ amount of… | Slowing down (not the focus of safety cases) | Traditional Alignment research |

Redwood’s AI Control agenda aims to raise the capabilities AIs would need to enact a catastrophic outcome (even if they were trying to), by limiting or monitoring the AIs’ actions, communication or written memory. This is no different from how a human closely surveilled by a policeman would have a much harder time committing a crime than an unsurveilled one, but AI Control takes surveillance to an extreme only made possible by the AIs’ digital nature. At the same time, the monitoring protocol can’t be so invasive as to drastically reduce the amount of useful-to-humans cognitive work completed by the AIs: the hit to beneficial capabilities must be small, i.e. a low “control tax”.
While an astoundingly small number of researchers are explicitly working on the agenda (fewer than 10 FTEs at the time of writing), it holds promise to be the “next big thing” in near-term safety after vanilla capability evals. Indeed, it is better-scoped than work on alignment, and already enjoys some preliminary buy-in from labs, partly thanks to its synergy with existing security work motivated by less speculative risks like misuse.
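To make the kind of protocol being discussed concrete, here is a minimal sketch of a trusted-monitoring setup. The models, thresholds, and the “suspicion score” interface are hypothetical stand-ins for illustration, not Redwood’s or any lab’s actual protocol.

```python
# Minimal sketch of a trusted-monitoring control protocol (hypothetical
# components). An untrusted model proposes an action, a weaker trusted model
# scores how suspicious it looks, and the protocol executes, audits, or blocks.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Decision:
    action: str
    verdict: str  # "execute", "audit", or "block"
    suspicion: float

def control_step(
    task: str,
    untrusted_model: Callable[[str], str],         # strong, possibly misaligned
    trusted_monitor: Callable[[str, str], float],  # weak but trusted; returns suspicion in [0, 1]
    audit_threshold: float = 0.5,
    block_threshold: float = 0.9,
) -> Decision:
    action = untrusted_model(task)
    suspicion = trusted_monitor(task, action)
    if suspicion >= block_threshold:
        return Decision(action, "block", suspicion)   # refuse outright
    if suspicion >= audit_threshold:
        return Decision(action, "audit", suspicion)   # escalate to human review
    return Decision(action, "execute", suspicion)     # the low-"control tax" path

# Toy usage with made-up stand-ins:
if __name__ == "__main__":
    toy_untrusted = lambda task: f"run: {task}"
    toy_monitor = lambda task, action: 0.95 if "scp model_weights" in action else 0.1
    print(control_step("scp model_weights to external host", toy_untrusted, toy_monitor))
    print(control_step("summarize the experiment logs", toy_untrusted, toy_monitor))
```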

Control could feed into a near-term safety case straightforwardly, providing a worst-case bound on catastrophic capabilities. A blue-team would design the monitoring protocol, which tries to make it hard for the AIs to enact a certain proxy negative outcome (like weight exfiltration). A red-team then introduces any (current-generation) AIs they want as the participants in the protocol. In particular, they can prompt or fine-tune them to elicit certain capabilities, or follow a certain plan. Say the blue-team succeeds, and additionally we have sufficient evidence for the following claims:

A. Adequate experimental setting. Our setting is not harder than real deployment. If some AIs can fool monitoring and enact a negative outcome during real deployment, then they can also do so during our experiments.

B. Adequate red-team. Our red-team was capable enough. If current-generation AIs could conceivably win in our experiments, then the red-team would have achieved this.

These claims together are a strong case against catastrophe from misalignment. (Although the full case for deployment would need to address other risks, like external attacks.)
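One way to operationalize the blue-team/red-team evaluation described above is sketched below. The interfaces and the idealized “unmonitored baseline” are assumptions for illustration, not an established benchmark: the protocol is run against both a benign policy and a red-team policy, and the safety case cares about two numbers, the catastrophe rate under the red-team and the control tax paid on benign work.

```python
# Sketch of a control evaluation over a task suite (hypothetical interfaces).
# protocol(task, policy) is assumed to return (task_completed, proxy_catastrophe),
# e.g. whether the proxy "weight exfiltration" succeeded despite monitoring.
from typing import Callable, Sequence

def evaluate_protocol(
    protocol: Callable[[str, Callable[[str], str]], tuple[bool, bool]],
    benign_policy: Callable[[str], str],
    red_team_policy: Callable[[str], str],
    tasks: Sequence[str],
) -> dict[str, float]:
    n = len(tasks)
    completed_benign = sum(protocol(t, benign_policy)[0] for t in tasks)
    catastrophes = sum(protocol(t, red_team_policy)[1] for t in tasks)
    # Idealized baseline: assume the benign policy completes every task when unmonitored.
    return {
        "catastrophe_rate": catastrophes / n,     # want ~0 even against the red-team
        "usefulness": completed_benign / n,
        "control_tax": 1 - completed_benign / n,  # beneficial work lost to monitoring
    }
```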

Let’s zoom in on (A) Adequate experimental setting. It’s unclear how challenging this will be in practice. Bridging the gap between experiment and reality might seem difficult, given the size and complexity of the backend codebases labs use for deployment. But it actually seems plausible that most of the securing of low-level implementation details can be punted to the cyber-security part of the safety case. Those responsible for cyber-security could provide a higher-level abstraction of the workings of the whole system, which the control case can treat as its model of real deployment. Indeed, it can be argued that this separation of the analysis is warranted, that is, that it won’t miss important interactions between low-level implementation details and the control case.[2][3]
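As a concrete illustration, the abstraction handed over by the cyber-security team might look roughly like the interface below. The names and granularity are made up for illustration, not a proposal for what labs actually provide.

```python
# Sketch of the kind of high-level abstraction the cyber-security case might
# hand to the control case: what the deployed AIs can touch, with low-level
# implementation details hidden behind it. Method names are illustrative.
from abc import ABC, abstractmethod

class DeploymentAbstraction(ABC):
    """Model of the real pipeline that control experiments treat as ground truth."""

    @abstractmethod
    def available_tools(self) -> list[str]:
        """Tools/actions the AI agents can invoke (e.g. 'code_exec', 'web_search')."""

    @abstractmethod
    def communication_channels(self) -> list[str]:
        """Channels between agent instances and with users or lab employees."""

    @abstractmethod
    def persistent_memory(self) -> list[str]:
        """Stores an agent can write to and read back across episodes."""

    @abstractmethod
    def monitoring_hooks(self) -> list[str]:
        """Points where the blue-team protocol can observe or intercept."""
```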

It doesn’t seem unviable to run the control experiments directly on this high-level abstraction of the whole system, thus eliminating any disanalogies other than low-level implementation details. But it’s also the case that, for now, control experiments are only run in very toy environments. Indeed, current research is still aimed at rough first estimates of models’ capabilities, and at laying out the basic methodology of the field. So it could take more than a year before such experiments can be scaled that ambitiously.
If any safety cases are drafted in the meantime, experiments on simplified settings will need to be used, together with a more frugal analysis of why the real deployment pipeline seems no easier to compromise, which might rely predominantly on arguments from common sense or human analogy. It’s conceivable that even these simplified experiments already provide most of the value from control research.

In the medium term, closing this gap (making experiments more similar to deployment) could come with associated risks of its own, especially when the experiments require more surface area with the outside world. This could be the case when emulating external deployment, or internal tasks which require access to outside information (like the Internet) or dense feedback from reality (like a wetlab). It seems plausible that, for the purposes of control experiments, all these tasks can be simply replaced by a local safe equivalent of similar complexity, or failing that, have the required feedback from the outside world emulated. But this could stop being the case at higher levels of capabilities, say if we want to rule out dangerous strategies made possible only by the specific task, and its complexity is high enough that emulating feedback is not possible (even in the low amounts that these experiments would usually require).
In this more speculative scenario, we could imagine requiring a preliminary case for the guaranteed safety of these experiments, motivated either solely on the grounds of the secured experimental setup, or also by empirical evidence from even more conservative experiments. That said, this might be straining the limits of the control methodology and external audits beyond what’s optimal, and hopefully by then better alternatives through safe delegation will be available.

A remaining elephant in the room is that internal and external deployment might need to be addressed separately. Conceptually, control safety cases can be used for both: the bounds on damage from worst-case model behavior still hold, regardless of whether that behavior comes about by spontaneous misalignment, or jailbreaking from user messages. But in practice, their software architecture and setup could look very different. For example, the one for external deployment might be more optimized to protect the external channel from the higher adversariality of arbitrary users relative to lab employees. As a result, they might require totally different control experiments and cases.

A vital piece for (A) is the guarantee that the lab will implement the protocol as promised throughout deployment. Hopefully the “control tax” will be low enough, and the labs’ good faith evident enough, that this assumption is not seriously in doubt. But even then, it would make sense for regulation to include an explicit mechanism for continual oversight, as is already the case in other industries. Recurring external audits are the obvious measure, which of course still rely on some amount of transparency from the lab. These checks on the usage of the protocol would make most sense as part of more general cyber-security audits, which can keep guaranteeing that the provided high-level abstraction of the system corresponds to reality.[4]

Internal deployment would seem especially hard to track. It might suffice to demonstrate that most employees only have access to the model through an API hosting the required protocol, and those with additional permissions (doing either training, or research which requires weight access) only use it in adequately secure machines, and for tasks or amounts of cognitive work that we have reason to believe are safe. The required on-the-ground oversight could be enacted by lab employees reporting to external oversight, building a continuous oversight hierarchy between the outside and inside of the lab. But this could also be especially vulnerable to manipulation. If we consider even more explicitly adversarial measures, record-keeping and cryptographic procedures could be used to provide legible and private guarantees on the execution of the protocol.
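As one deliberately simplified illustration of such record-keeping, protocol executions could be written to a tamper-evident log that external auditors can verify without seeing query contents. The hash-chain sketch below is an assumption about what this could look like, not a complete cryptographic design.

```python
# Sketch of tamper-evident record-keeping for protocol execution: each record
# of "the monitoring protocol ran on this query" is chained by hash, so an
# external auditor can detect deletions or edits without seeing query contents.
import hashlib
import json
import time

def append_record(log: list[dict], query_digest: str, verdict: str) -> None:
    prev_hash = log[-1]["hash"] if log else "genesis"
    record = {
        "timestamp": time.time(),
        "query_digest": query_digest,  # hash of the query, not its contents
        "verdict": verdict,            # e.g. "execute", "audit", "block"
        "prev_hash": prev_hash,
    }
    record["hash"] = hashlib.sha256(
        json.dumps(record, sort_keys=True).encode()
    ).hexdigest()
    log.append(record)

def verify_chain(log: list[dict]) -> bool:
    prev_hash = "genesis"
    for record in log:
        body = {k: v for k, v in record.items() if k != "hash"}
        if body["prev_hash"] != prev_hash:
            return False
        if hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest() != record["hash"]:
            return False
        prev_hash = record["hash"]
    return True
```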

Similarly to how nuclear plants need to demonstrate their safety under appropriately common natural catastrophes, labs could also demonstrate, through their cyber-security cases and continual audits, the likelihood of a negative outcome (most notably, weight exfiltration) under several unexpected but possible situations, like outages or partly successful cyber-attacks. Realistically, it might not be possible to say much about safety once any part of the deployment pipeline has been compromised. As an emergency measure, we could consider the temporary shutdown of the whole pipeline, or even deletion of certain copies of the weights. But we would need explicit specifications for when such measures would be taken. Additionally, such a drastic solution might no longer be viable if AI services become critical infrastructure, on which other systems depend and which should not go offline during an emergency. But even here we can imagine workarounds through redundancy.
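Making those “explicit specifications” concrete could look something like the pre-committed trigger list sketched below; the trigger conditions, actions, and sign-off rules are hypothetical placeholders.

```python
# Sketch of an explicit, pre-committed specification for emergency measures.
# Trigger names and thresholds are hypothetical; the point is that the
# conditions are written down in advance rather than decided ad hoc.
from dataclasses import dataclass
from enum import Enum

class Action(Enum):
    PAUSE_INTERNAL_DEPLOYMENT = "pause_internal_deployment"
    SHUT_DOWN_PIPELINE = "shut_down_pipeline"
    DELETE_WEIGHT_COPIES = "delete_unsecured_weight_copies"

@dataclass(frozen=True)
class Trigger:
    condition: str           # human-readable condition, checked by audits
    action: Action
    requires_sign_off: bool  # whether an external auditor must confirm first

EMERGENCY_SPEC = [
    Trigger("monitoring protocol bypassed on any production query", Action.PAUSE_INTERNAL_DEPLOYMENT, False),
    Trigger("evidence of a partly successful weight-exfiltration attempt", Action.SHUT_DOWN_PIPELINE, True),
    Trigger("weights found on an unapproved machine", Action.DELETE_WEIGHT_COPIES, True),
]
```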

Going back to the safety agenda itself, the main limitation of control is its scalability: as early discussions of “AI boxing” already made clear, we should expect arbitrarily intelligent entities to eventually fool our imperfect protocols. But several avenues can be explored to keep control cases viable at least in the medium term.
For one, even when models small enough to be trusted (that is, we have good evidence they are not purposefully misleading us) are no longer intelligent enough to monitor the current generation, we can employ untrusted models as monitors, if we can show that the monitors and monitorees are not capable of enough coordination to cause a catastrophe. Research is getting started on this topic, which might require two isolated red-teams in a safety case.
Control cases could also be augmented with tools from other safety agendas. For example, gray-box methods from interpretability, like probes, could be trained on top of untrusted models to coarsely keep track of their internal reasoning, and the red-team could stress-test whether the protocol and probe can be fooled at the same time.
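A minimal sketch of combining the two signals is below; the probe, its training, the captured activations, and all thresholds are assumptions for illustration.

```python
# Sketch of augmenting a control protocol with an interpretability probe: the
# action is escalated if either the action-level monitor or a linear probe on
# the untrusted model's activations fires. The red-team would try to fool both.
import numpy as np

def probe_score(activations: np.ndarray, probe_weights: np.ndarray, probe_bias: float) -> float:
    """Logistic probe (trained elsewhere) that flags 'scheming-like' internal states."""
    return float(1.0 / (1.0 + np.exp(-(activations @ probe_weights + probe_bias))))

def combined_verdict(
    action_suspicion: float,   # from the trusted action monitor
    activations: np.ndarray,   # hidden state captured while producing the action
    probe_weights: np.ndarray,
    probe_bias: float,
    action_threshold: float = 0.5,
    probe_threshold: float = 0.8,
) -> str:
    internal_suspicion = probe_score(activations, probe_weights, probe_bias)
    if action_suspicion >= action_threshold or internal_suspicion >= probe_threshold:
        return "audit"    # escalate if either signal fires
    return "execute"
```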

  1. ^

    We don’t have a clear technical definition of the capability/disposition divide, but for our purposes it is enough to say that “a model is capable of X” if a team of humans can get it to do X with enough iteration (without re-training too much or hard-coding the solution to X), and that “a model is disposed to do X” if it will do X in some “normal” scenarios it will face (without such external iteration). Of course, this distinction will break down for high enough capabilities, for example when humans are no longer the best at eliciting behavior.

  2. ^

    Still, running the final experiments on an even higher-fidelity representation, or prominently the real deployment pipeline, shouldn’t be discarded if it’s cheap enough and made safe (see below).

  3. ^

    In fact, a similar approach would seem necessary for the safety cases of most other safety agendas.

  4. ^

    In fact, the control tax could even be negative if control techniques were also useful to avoid low-stakes failures damaging to the lab’s public image, but we shouldn’t assume these use cases can’t be separated.


