Ten people on the inside

The article examines how to effectively reduce AI alignment risk under different AI development regimes, particularly the “rushed unreasonable developer” scenario. It argues that even with limited resources and little political backing, a small team inside an AI company can still substantially reduce risk by building evidence of risk, implementing safety measures, and doing alignment research. It also notes that outside actors can strengthen these internal teams by providing portable research and assistance. The article calls for AI safety research to pay more attention to pessimistic scenarios and to prioritize cheap, easy-to-implement safety techniques.

⚠️ The “safety case” regime: the approach that, ideally, all AI developers would follow to minimize the risk posed by AI. This is typically operationalized as a less than 1% chance that the AIs escape in the first year of deployment, and less than 5% conditional on the model trying hard to subvert the safety measures.

🏃 The “rushed reasonable developer” regime: even relatively reasonable AI developers, under severe time pressure, struggle to implement interventions carefully or to err on the side of caution.

🤦 The “rushed unreasonable developer” regime: the scenario the article focuses on, in which a handful of people inside an AI company care about catastrophic risk but the company as a whole does not take misalignment risk seriously and safety measures are poorly executed; here, insiders need to reduce risk by building evidence of risk, implementing safety measures, and doing alignment research.

💰 Low-cost safety measures: with limited resources, safety measures need to be cheap and easy to execute, and must avoid adding to other teams’ workloads. The article stresses that control research should keep complexity internal to the safety team, and encourages outside actors to provide portable research and assistance.

Published on January 28, 2025 4:41 PM GMT

(Many of these ideas developed in conversation with Ryan Greenblatt)

In a shortform, I described some different levels of resources and buy-in for misalignment risk mitigations that might be present in AI labs:

The “safety case” regime. Sometimes people talk about wanting to have approaches to safety such that if all AI developers followed these approaches, the overall level of risk posed by AI would be minimal. (These approaches are going to be more conservative than will probably be feasible in practice given the amount of competitive pressure, so I think it’s pretty likely that AI developers don’t actually hold themselves to these standards, but I agree with e.g. Anthropic that this level of caution is at least a useful hypothetical to consider.) This is the level of caution people are usually talking about when they discuss making safety cases. I usually operationalize this as the AI developer wanting to have <1% chance that their AIs escape in the first year of deployment, and <5% conditional on the model trying pretty hard to subvert their safety measures. (I think some AI developers are hoping to get lower levels of risk than this.)

The rushed reasonable developer regime. The much riskier regimes I expect, where even relatively reasonable AI developers are in a huge rush and so are much less able to implement interventions carefully or to err on the side of caution.

The rushed unreasonable developer regime. You’re advising an unreasonable AI developer that is incompetent at implementing safety measures and unwilling to put much effort into implementing these measures.

The existential war regime. You’re in an existential war with an enemy and you’re indifferent to AI takeover vs the enemy defeating you. This might happen if you’re in a war with a nation you don’t like much, or if you’re at war with AIs. Obviously this is a regime where you should be much more aggressive.
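
To make the safety-case operationalization above concrete: the overall <1% escape target and the <5% conditional target are linked by how likely it is that the deployed model is in fact trying hard to subvert the safety measures. Here is a rough sketch of that link, assuming (purely for illustration; the post does not give this number) a 20% chance of such a schemer:

% Illustrative assumption: P(scheming) = 0.2; the conditional bound 0.05 is the safety-case target.
\begin{align*}
P(\text{escape}) &= P(\text{escape}\mid\text{scheming})\,P(\text{scheming}) + P(\text{escape}\mid\text{no scheming})\,P(\text{no scheming})\\
&\le 0.05 \times 0.2 + P(\text{escape}\mid\text{no scheming}) \times 0.8\\
&= 0.01 + 0.8\,P(\text{escape}\mid\text{no scheming}).
\end{align*}

Under that illustrative assumption, the scheming branch alone can use up the entire 1% budget, so hitting both targets at once requires either a lower probability of scheming, conditional risk well below 5%, or negligible escape risk from non-scheming failure modes.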

I want to flesh out one particular rushed unreasonable developer scenario that I’ve been thinking about lately: there are ten people inside the AI company who are really concerned about catastrophic risk from misalignment. The AI company as a whole pays lip service to AI risk broadly construed and talks occasionally about risk from AGI, but they don’t take misalignment risk in particular (perhaps especially risk from schemers) very seriously.

I think this scenario (and similarly pessimistic scenarios) seems important to target with technical research and planning: it seems pretty likely that we’ll only have this level of political will in short timelines (at least within a subset of competitive AI companies), and it seems possible to substantially improve the situation. I worry that a lot of AI safety thinking and planning focuses on overly optimistic scenarios where a responsible developer has a substantial lead, and I think more focus on pessimistic scenarios at the margin would be useful.

What should these people try to do? The possibilities are basically the same as what a responsible developer might do:

The main focus of my research is on safety measures, so I’ve thought particularly about what safety measures they should implement. To give some more flavor on what I imagine this scenario is like: the company, like many startups, is a do-ocracy, so these ten people have a reasonable amount of free rein to implement the safety measures they want. But they have to tread lightly: they don’t have much political capital, and all they can do is make it easier for the company to let them do their thing than to fire them. So the safety measures they institute need to be:

I think it’s scarily plausible that we’ll end up in a situation like this. There are two different versions of this:

What should we do based on this?


