Proactive 'If-Then' Safety Cases

 


Published on November 18, 2024 9:16 PM GMT

Holden proposed the idea of if-then planning for AI safety. I think this is potentially a very good idea, depending on the implementation details.

I've heard the criticism that if-then planning is inherently reactive rather than proactive. I think this is not necessarily true, and I want to make a case for the value of proactive if-then planning.

 

Reactive

First, let's look at reactive if-then planning. I'll lay out reactive triggers for three critical near-term questions about which we might want to do if-then planning.
 

Reactive triggers:
 

- AI Biorisk: If we see empirical proof of substantial harm resulting from bioweapons which provably required AI assistance to create, then we will put governance in place to address this risk going forward. We will seek to reduce the likelihood of this harm happening again.
 

- Rapid recursive self-improvement: If we see empirical proof that some group was able to implement an AI system which provably improved itself more rapidly than the group could have improved it by working on it directly, and that the result of this self-improvement was an advance on SotA frontier algorithms/models, then we will put governance in place to regulate such processes. We will seek to prevent any such processes from occurring outside of carefully secured sandbox environments with government oversight.
 

- Unauthorized autonomous replication: If we see empirical proof that an AI system not under the control of any human or group is self-sustainingly persisting in an uncontained manner (e.g. has access to the internet or robotic actuators), and that this AI system has caused significant harms in the world (e.g. breaking laws in pursuit of resources/survival), then we will put governance in place to empower enforcement agencies to halt this AI system and prevent the existence of such systems in the future.

 

On the plus side, I think these reactive if-then cases would be a really easy sell for most people and most governments. On the downside, the reactive nature of the chosen triggers means that we wait until harm has occurred before taking action. In all three of these cases, I argue that there is a substantial probability that the first case of such a harm would be an unacceptably large catastrophe.
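To make the structure concrete, here is a minimal sketch (all names and values are hypothetical, not from the post) of an if-then commitment written down as a trigger/response pair. The reactive version's trigger only fires once real-world harm has been confirmed; the proactive versions in the next section would swap in a trigger keyed off controlled eval results instead.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class IfThenCommitment:
    """A named trigger predicate over observed evidence, paired with a pre-agreed response."""
    name: str
    trigger: Callable[[dict], bool]   # evidence -> has the trigger fired?
    response: str                     # pre-committed governance action

# Reactive version: the trigger only fires after real-world harm is confirmed.
# A proactive version would instead key off results from controlled evals.
reactive_biorisk = IfThenCommitment(
    name="AI biorisk (reactive)",
    trigger=lambda evidence: evidence.get("confirmed_ai_enabled_bioweapon_harm", False),
    response="Put governance in place to reduce the likelihood of recurrence.",
)

if __name__ == "__main__":
    evidence = {"confirmed_ai_enabled_bioweapon_harm": False}
    if reactive_biorisk.trigger(evidence):
        print(reactive_biorisk.response)
    else:
        print("No trigger fired; no action taken (and no harm yet observed).")
```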

 

Proactive

Now let's consider what proactive if-then planning might look like for each of these three cases. The common factor for proactive triggers is that we are instead asking, "what sort of evidence would convince us that we are in a world such that X is a substantial enough problem that it is worth paying costs to address?"


Proactive triggers:
 

- AI Biorisk: If we see empirical proof, in a controlled evaluation context, of an AI system demonstrating that it can critically enable the creation of substantially dangerous bioweapons by test subjects whose control-group counterparts failed to create similar bioweapons, then we will take government action to address this risk. The first step may be to commission more studies, increasing realism, coverage, sample size, and so on. Once the scope of the problem has been quantified, governance actions will be taken to prevent this harm from occurring. (A sketch of how such an uplift comparison might be quantified follows this list.)
 

- Rapid recursive self-improvement: If we see empirical proof that some group was able to implement an AI system which demonstrates many of the key skills that would enable rapid automated AI R&D, then we will put governance in place to regulate such processes. This wouldn't necessarily be a ban, but it would definitely involve careful oversight and restrictions on access to the system.
 

- Unauthorized autonomous replication: If we see empirical proof that an AI system is able to demonstrate many of the key necessary skills for autonomous survival and replication, as outlined in this post by METR, then we should take action to reduce the risk of such an independent AI system getting loose on the internet.
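For the biorisk trigger above, "critically enable" would need to be operationalized as a measured uplift of AI-assisted subjects over controls. Below is a minimal sketch of what that comparison might look like, with entirely hypothetical numbers, a standard two-proportion test, and illustrative thresholds; none of this is a study design specified in the post or drawn from real biorisk evals.

```python
import math

# Hypothetical eval results (illustrative numbers only, not real data):
# how many subjects in each arm completed a proxy bioweapon-acquisition task.
ai_assisted_successes, ai_assisted_n = 9, 30    # arm with model access
control_successes, control_n = 2, 30            # arm with internet-only access

p_treat = ai_assisted_successes / ai_assisted_n
p_control = control_successes / control_n
uplift = p_treat - p_control

# Simple two-proportion z-test (one-sided: does AI access raise success rates?).
pooled = (ai_assisted_successes + control_successes) / (ai_assisted_n + control_n)
se = math.sqrt(pooled * (1 - pooled) * (1 / ai_assisted_n + 1 / control_n))
z = uplift / se
p_value = 0.5 * math.erfc(z / math.sqrt(2))  # upper-tail probability under the null

print(f"AI-assisted success rate: {p_treat:.0%}, control: {p_control:.0%}")
print(f"Estimated uplift: {uplift:.0%}, z = {z:.2f}, one-sided p = {p_value:.3f}")

# A pre-registered if-then trigger might require both a minimum effect size
# and statistical significance before any governance response fires.
TRIGGER_MIN_UPLIFT = 0.15   # hypothetical threshold
TRIGGER_MAX_P = 0.05
trigger_fired = uplift >= TRIGGER_MIN_UPLIFT and p_value <= TRIGGER_MAX_P
print("Proactive trigger fired:", trigger_fired)
```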

 

All of these proactive if-then cases come with a cost. You have to pay, ahead of evidence, for evals to check for the problems. Thus, the cost-benefit analysis must weigh factors such as:

 

Note that this is a very different set of questions than one would be asking if early detection weren't possible at all. A call for evals is different from a call for blindly attempting to ameliorate uncertain future risks through costly actions directly impinging on the process of concern (AI development and deployment).
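To illustrate the shape of this cost-benefit comparison, here is a toy expected-cost calculation. Every number is a hypothetical placeholder chosen only to show the structure of the tradeoff, not an estimate from the post, and it assumes, simplistically, that a successful eval detection lets the harm be fully averted.

```python
# A toy expected-cost comparison of the stances discussed above.
# All values are hypothetical placeholders, not estimates from the post.

P_RISK_REAL = 0.10                # chance the risk is real and would cause harm
HARM_COST = 1_000.0               # cost of the harm occurring (arbitrary units)
EVAL_PROGRAM_COST = 1.0           # cost of building and running evals up front
BLANKET_RESTRICTION_COST = 50.0   # cost of broad restrictions on AI development
P_EVAL_DETECTS = 0.8              # chance the evals catch the problem early, if real

policies = {
    # Wait for empirical harm: pay nothing up front, eat the harm if the risk is real.
    "reactive only": P_RISK_REAL * HARM_COST,
    # Fund evals: pay the eval cost; harm lands only if the risk is real AND missed.
    "fund proactive evals": EVAL_PROGRAM_COST
    + P_RISK_REAL * (1 - P_EVAL_DETECTS) * HARM_COST,
    # Restrict blindly: pay the restriction cost whether or not the risk is real.
    "blanket restrictions": BLANKET_RESTRICTION_COST,
}

for name, expected_cost in sorted(policies.items(), key=lambda kv: kv[1]):
    print(f"{name:>22}: expected cost ~ {expected_cost:.1f}")
```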

 

I'd be surprised if you think the case for having the government put resources into preemptively building and running evals for these AI risk cases is so weak that no funding should go towards it at all. Funding such evals feels like a pretty reasonable compromise between the "risks are small, don't impede AI development" camp and the "risks are large, let's be careful" camp.

 

A separate but related point

 

If you are with me so far in this framing of the discussion, and generally in agreement that social (i.e. government and charity) spending on AI risk related evals should probably be non-zero, then I'd like to pitch you on an additional risk category that I believe deserves non-zero evals work. 

The additional category I propose is perhaps less clear-cut than the three categories mentioned so far.

Tractability of algorithmic improvement: how quickly should we expect it to be possible for AI training costs to drop, and/or AI peak capabilities to increase, without hardware changes? (Improvement that synergizes with hardware changes is a separate question.)
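To see why the answer matters quantitatively, here is a toy calculation of how training costs would fall under an assumed exponential rate of algorithmic progress with hardware held fixed. The doubling time below is an illustrative assumption, not an estimate from the post.

```python
# Toy calculation: how fast training costs fall if algorithmic efficiency
# improves exponentially, with hardware held fixed.
DOUBLING_TIME_MONTHS = 12.0   # hypothetical: efficiency doubles once a year

def cost_multiplier(months: float, doubling_time: float = DOUBLING_TIME_MONTHS) -> float:
    """Fraction of today's compute needed for the same capability after `months`."""
    return 2.0 ** (-months / doubling_time)

for years in (1, 2, 3, 5):
    frac = cost_multiplier(12 * years)
    print(f"After {years} year(s): same capability for ~{frac:.1%} of today's compute")
```

Under a faster assumed doubling time, the same capability falls within reach of small compute budgets correspondingly sooner, which is what makes this question load-bearing for compute-centric governance plans.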

Why is this question of key importance? Because it matters a lot for what governance plans we make. For instance, the recent proposal "A Narrow Path" assumes that large datacenters are necessary for creating dangerously capable AI. The success of the proposal hinges on it being intractable for compute-limited groups to make substantial algorithmic improvements. The authors do acknowledge the risk that a large-scale automated search for algorithmic advances (e.g. a large-scale recursive self-improvement undertaking) could find substantial algorithmic advances. But if there were even a smaller, though still considerable, chance that humans (with relatively limited AI assistance) could make dangerous advances without large compute resources, then I argue that their plan wouldn't work.

Tricky aspects of evals for algorithmic improvement:

- The evals themselves would be dangerous, to the degree that they were realistic. (This is also true of biorisk evals, but biorisk differs in that it is more purely a weapon, and so doesn't come with the temptation to use the results in peaceful, economically useful ways.)

- It seems to me that it is harder to develop realistic evals for this case than for the other three. This means the case for being able to preemptively get firm evidence by building evals is weaker for this risk, so the overall predicted cost-benefit ratio is less favorable. I still think there's work worth doing here, and that it deserves non-zero funding, but I wouldn't prioritize it above the other cases.



