LessWrong, October 21, 2024
What AI companies should do: Some rough ideas

 

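This article discusses several actions for improving the safety of AI systems, covering security for model weights and code, control of potential risks from AI, preventing misuse in external deployment, accountability for future deployment decisions, safety research, policy, and public statements, and argues that these actions are crucial for AI companies.

💻 Security for model weights and code: labs should achieve extreme security optionality before AIs can massively accelerate AI R&D, develop a roadmap toward that goal, and publicly explain and justify their high-level security goals; aiming for the weaker SL5 security level is reasonable for now.

🚧 Control: anticipate and prevent the risk that AIs scheme and escape during internal deployment, for example by using monitoring and auditing techniques, running control evaluations based on the deployment protocol, and trying to directly test whether the model is scheming.

🛡 Misuse in external deployment: do good model evals for dangerous capabilities and deploy according to the results, for example by deploying via API and monitoring queries, to prevent catastrophic misuse.

📋 Accountability for future deployment decisions: when AIs can triple the rate of R&D, do risk assessment and present the results to external auditors, and develop a plan for what comes next, such as restricting the building of superintelligent AI.

🔬 Safety research: labs should do and share safety research and boost external safety research, for example by giving safety researchers deeper model access.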

Published on October 21, 2024 2:00 PM GMT

This post is incomplete. I'm publishing it because it might be helpful for some readers anyway. A good version of this post would be more detailed: for each proposed action, explain motivation and high-level goal, explain problem to solve, respond to obvious questions/objections, and lay out costs involved and costs worth accepting.

These are the actions I think companies that will build very powerful AI systems should prioritize to improve safety. I think something like these actions is cost-effective,[1] but specific details may be suboptimal. Some actions are not fully specified; some are of the form "do good things in area X" or "avoid risk Y," without pinning down what to do. This is not certain to remain correct in the future; I just think it's a good plan for now. Some of these actions are costly, but I claim that they are cheap and important enough that a responsible frontier AI company should do them unilaterally: the gain in safety is worth the cost in competitiveness. I mostly don't justify these actions in this post.

(This post is about responsible labs — asks for non-responsible labs should be more focused on transparency and becoming responsible. This post is about unilateral action — in coordinated action you can accept more costs.)


Security for model weights and code: extremely strong security will be crucial when AIs can massively accelerate AI R&D. But operating under extreme security—security sufficient to defend against top-priority operations by top state actors—is costly and not yet necessary. So labs should focus on achieving extreme security optionality—the ability to implement extreme security quickly[2]—before it would be very bad for AI model weights to be stolen.[3] It seems such AI could plausibly appear as early as 2027; if so, labs should plan to achieve extreme security optionality before 2027.[4] The lab should develop a roadmap to achieve extreme security optionality, publicly explain and justify its high-level security goals, and maybe share some details and publish stuff to validate claims (e.g. a pentester's summary of their pentest results) with sensitive details redacted.[5] It is also reasonable to aim for a weaker level of security, SL5, for now, but that will probably be insufficient when AIs can massively accelerate research.


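Control: anticipate and prevent the risk that AIs will scheme (i.e., act safe and subtly subvert safety measures to gain power) and escape during internal deployment.[6] The lab should do evals for scheming-relevant capabilities, and by the time models can plausibly scheme coherently, be ready to ensure control, e.g. by using monitoring and auditing techniques, running control evaluations based on the deployment protocol, and trying to directly test whether the model is scheming.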


Misuse: anticipate and prevent harm from misuse during external deployment. Do good model evals for dangerous capabilities. Do evals pre-deployment (to inform deployment decisions), have good threat models and categories of evals, use high-quality evals, use good elicitation, publish evals and results, share good access with third-party evaluators. Once models might have dangerous capabilities, deploy such that users can't access those capabilities: deploy via API or similar rather than releasing weights, monitor queries to block suspicious messages and identify suspicious users, maybe do harmlessness/refusal training and adversarial robustness. If a particular deployment plan incurs a substantial risk of catastrophic misuse, deploy more narrowly or cautiously.
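As a rough illustration of "monitor queries to block suspicious messages and identify suspicious users," here is a minimal sketch. Everything in it is an assumption made for illustration: the `QueryMonitor` class, the thresholds, and the toy `keyword_score` function (a stand-in for whatever classifier a lab would actually use) are hypothetical and not described in the post.

```python
from collections import Counter
from typing import Callable


class QueryMonitor:
    """Blocks messages that a scorer rates as suspicious and tracks who sent them."""

    def __init__(self, score: Callable[[str], float],
                 block_at: float = 0.8, flag_user_after: int = 3):
        self.score = score                      # hypothetical suspicion scorer, 0.0 to 1.0
        self.block_at = block_at                # hypothetical blocking threshold
        self.flag_user_after = flag_user_after  # blocked messages before a user is flagged
        self.blocked_counts = Counter()         # user_id -> number of blocked messages

    def allow(self, user_id: str, message: str) -> bool:
        """Return True if the message may be passed to the model."""
        if self.score(message) >= self.block_at:
            self.blocked_counts[user_id] += 1
            return False
        return True

    def suspicious_users(self) -> list:
        """Users with enough blocked messages to warrant human review."""
        return sorted(user for user, count in self.blocked_counts.items()
                      if count >= self.flag_user_after)


def keyword_score(message: str) -> float:
    """Toy stand-in for a real classifier: flags a single placeholder phrase."""
    return 1.0 if "[flagged-topic]" in message.lower() else 0.0
```

A real deployment would presumably replace the keyword check with a trained classifier and route flagged users to human review rather than acting on counts alone.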


Accountability for future deployment decisions: plan that by the time AIs are capable of tripling the rate of R&D, the lab will do something like this:[7]

Do great risk assessment, especially via model evals for dangerous capabilities. Present risk assessment results and deployment plans to external auditors — perhaps make risk cases and safety cases for each planned deployment, based on threat modeling, model eval results, and planned safety measures. Give auditors more information upon request. Have auditors publish summaries of the situation, how good a job you're doing, the main issues, their uncertainties, and an overall risk estimate.


Labs should have a clear, good plan for what to do when they build AGI, outside of just making the AGI safe.[8] For example, one plan is "perfectly solve AI safety, then hand things off to some democratic process." This plan is bad because solving AI safety will likely take years after systems are transformative and superintelligence is attainable, and by default someone will build superintelligence without strong reason to believe they can do so safely or do something else unacceptable during that time.

The outline of the best plan I've heard is: build human-obsoleting AIs which are sufficiently aligned/trustworthy that we can safely defer[9] to them (before building wildly superintelligent AI). Assume it will take 5-10 years after AGI (i.e., AI that can obsolete top human scientists) to build such systems and give them sufficient time. To buy time (or: avoid being rushed by other AI projects[10]), inform the US government and convince it to enforce that nobody builds wildly superintelligent AI for a while (and likely limit AGI weights to allied projects with excellent security and control). Use AGI to increase the government's willingness to pay for nonproliferation and to decrease the cost of nonproliferation (build carrots to decrease the cost to others of accepting nonproliferation and build sticks to decrease the cost to the US of enforcing nonproliferation).[11]

For now, the lab should develop and prepare to implement such a plan. It should also publicly articulate the basic plan, unless doing so would be costly or undermine the plan.


Safety research: labs should do and share safety research as a public good, to help make powerful AI safer even if it's developed by another lab.[12] They should also boost external safety research, especially by giving safety researchers deeper access[13] to externally deployed models.


Policy: inform policymakers about AI capabilities; when there is better empirical evidence on misalignment and scheming risk, inform policymakers about that; don't get in the way of good policy.


Labs should encourage staff to discuss their views on AI progress, risks, and safety, both internally and publicly (except for confidential information), or at least avoid discouraging staff from doing so. (Facilitating external whistleblowing has benefits but some forms of it are costly, and the benefits are smaller for more responsible labs, so I don't confidently recommend unilateral action for responsible labs.)


Public statements: the lab and its leadership should understand and talk about extreme risks. The lab should clearly describe a worst-case plausible outcome from AI and state its credence in such outcomes (perhaps unconditionally, conditional on no major government action, and/or conditional on leading labs being unable or unwilling to pay a large alignment tax).


[meta] Plan safety interventions as a function of dangerous capabilities, warning signs, etc. Rather than acting based on guesses about which risks will appear, set practices in advance as a function of warning signs. (This should satisfy both risk-worried people who expect warning signs and risk-unworried people who don't.)
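One way to picture "set practices in advance as a function of warning signs" is a fixed table of if-then rules that is written down before any eval is run. The sketch below is only an illustration under stated assumptions: the eval names, the safeguard labels, and every threshold except the 3x R&D speedup (which echoes the tripling mentioned above) are hypothetical placeholders, not commitments proposed in the post.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Trigger:
    """One if-then rule: if an eval score reaches the threshold, require safeguards."""
    eval_name: str                 # hypothetical eval identifier
    threshold: float               # hypothetical score at which the rule fires
    safeguards: tuple[str, ...]    # responses committed to in advance


# Illustrative rules only; names and numbers are placeholders.
TRIGGERS = (
    Trigger("ai_rnd_speedup", 3.0, ("external_audit", "extreme_security")),
    Trigger("scheming_capability", 0.5, ("control_evaluation", "monitoring_and_auditing")),
    Trigger("misuse_uplift", 0.2, ("api_only_deployment", "query_monitoring")),
)


def required_safeguards(eval_results: dict[str, float]) -> list[str]:
    """Return the safeguards triggered by the results, in rule order, without duplicates.

    An eval that is missing from eval_results triggers nothing here; a real
    policy would likely treat a missing eval as a problem in its own right.
    """
    required: list[str] = []
    for trigger in TRIGGERS:
        score = eval_results.get(trigger.eval_name)
        if score is not None and score >= trigger.threshold:
            for safeguard in trigger.safeguards:
                if safeguard not in required:
                    required.append(safeguard)
    return required
```

For example, `required_safeguards({"ai_rnd_speedup": 3.0})` returns `["external_audit", "extreme_security"]`, while results below every threshold return an empty list. The point of the design is that the mapping is decided before the warning signs appear, so the response does not depend on in-the-moment judgment.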


I list some less important or more speculative actions (very nonexhaustive) in this footnote.[14] See The Checklist: What Succeeding at AI Safety Will Involve (Bowman 2024) for a collection of goals for a lab. See A Plan for Technical AI Safety with Current Science (Greenblatt 2023) for a more detailed but rough and out-of-date plan. See Towards best practices in AGI safety and governance: A survey of expert opinion (Schuett et al. 2023) for an old list.


Almost none of this post is original. Thanks to Ryan Greenblatt for inspiring much of it, and thanks to Drake Thomas, Rohin Shah, Alexandra Bates, and Gabriel Mukobi for suggestions. They don't necessarily endorse this post.


Crossposted from AI Lab Watch. Subscribe on Substack.

  1. ^

    With two exceptions, where I feel unconfident: misuse prevention (misuse has positive second-order effects) and accountability for risk assessment (its causal benefit may be small if done unilaterally).

  2. ^

    Extreme security optionality is the ability to implement extreme security quickly, e.g. within 3 months. Extreme security should also be implementable at a reasonable maximum cost, e.g. $10B to implement plus $10B per year and a 75% slowdown in research; the property "the lab could implement extreme security, but doing so would require basically stopping research or incurring other prohibitive costs" doesn't suffice.

    I focus on extreme security for model weights and code, not necessarily for secrets. Securing weights will be more important; it seems easier; and extreme security for secrets entails that staff are much less able to talk to people outside the lab, which could hurt their ability to think clearly.

    Labs will likely need substantially better security than "defend against top-priority operations by top state actors" soon after AIs can massively accelerate AI R&D: model weights will be among the most valuable and strategically important assets ever, so labs should be prepared for "unprecedentedly well-resourced attacks," not to mention that AI will boost offense.

  3. ^

    Proliferation of model weights would lead to others quickly getting more powerful AI, since the AI can massively boost AI R&D. This removes the option of coordination that requires nonproliferation, enables misuse and state conflict, makes it hard for US labs to gain a substantial lead, and makes it impossible for US labs to unilaterally slow progress while retaining a lead.

  4. ^

    By the time extreme security is necessary, AIs will be able to substantially boost security. So for now, labs should focus on the things that will be hard for AI to accelerate: things that require nonparallelizable physical work. For example, labs should probably focus on physically constructing or retrofitting datacenters such that they can be extremely secure.

    If it becomes clear that the lab will achieve extreme security optionality before it's necessary, the lab could choose to revise the plan and slow its progress toward extreme security.

  5. ^

    Would an if-then security commitment—don't create a very dangerous model until its weights would be secure—suffice? Sure, but I worry that extreme security takes substantial time to implement (in part because it likely requires physical changes to datacenters) and the labs are not on track to implement it before they need it, so a lab will either proceed unsafely or pause while the less scrupulous race ahead. Regardless of whether a lab has such a commitment, I think it should start sprinting for extreme security optionality now.

  6. ^

    Maybe a large fraction of AI risk "comes from the lab using AIs internally to do AI development (by which I mean both research and engineering). This is because the AIs doing AI development naturally require access to compute and model weights that they can potentially leverage into causing catastrophic outcomes—in particular, those resources can be abused to run AIs unmonitored."

    Escape means "either escaping from the datacenter or getting root access to it." A central story for AI escaping goes like this:

    At some point when the AI is quite powerful and useful but not totally intellectually self-sufficient yet, the AI makes a code change to some code used on the datacenter where it is running that adds a backdoor, allowing it to get root access to all the computers in the datacenter. Once it has root access, it modifies the SSH server so that whenever any humans log into the datacenter, the AI is able to intercept any of their commands or the data that they see. The AI gets increasingly powerful over the next month, as the human ML researchers with AI assistance continue to make progress (unaware that some things that they're seeing are faked) and various training runs/self-improvement loops continue. After a month the AI is powerful enough that it decides it's worth it to take overtly aggressive actions to ensure the physical security of its data centers and allow it to build lots of factories.

  7. ^

    This is partially a substitute for responsible scaling policies. Labs should still make at least high-level RSP-y (or "if-then") commitments too.

  8. ^

    A plan could include how to use AGI, what further AI development to do, and what the exit state is if all goes well. A plan should consider not just what the lab will do but also what will be happening with the US government, other labs, and other states.

    Really the goal is a playbook, not just a plan — it should address what happens if various assumptions aren't met or things are different than expected, should include ideas that are worth exploring/trying but not depending on, etc.

  9. ^

    We don't need these human-obsoleting AIs to be able to implement CEV. We want to be able to defer to them on tricky wisdom-loaded questions like "what should we do about the overall AI situation?" They can ask us questions as needed.

  10. ^

    To avoid being rushed by your own AI project, you also have to ensure that your AI can't be stolen and can't escape, so you have to implement excellent security and control.

  11. ^

    This plan, and desire for this kind of plan, is largely stolen from Ryan Greenblatt. He doesn't necessarily endorse my version.

    Dario Amodei recently said something related:

    it seems very important that democracies have the upper hand on the world stage when powerful AI is created. . . . My current guess at the best way to do this is via an “entente strategy”, in which a coalition of democracies seeks to gain a clear advantage (even just a temporary one) on powerful AI by securing its supply chain, scaling quickly, and blocking or delaying adversaries’ access to key resources like chips and semiconductor equipment. This coalition would on one hand use AI to achieve robust military superiority (the stick) while at the same time offering to distribute the benefits of powerful AI (the carrot) to a wider and wider group of countries in exchange for supporting the coalition’s strategy to promote democracy (this would be a bit analogous to “Atoms for Peace”). The coalition would aim to gain the support of more and more of the world, isolating our worst adversaries and eventually putting them in a position where they are better off taking the same bargain as the rest of the world: give up competing with democracies in order to receive all the benefits and not fight a superior foe.

    (Among other things, this does not address the likely possibility that the democracies should delay building wildly superintelligent AI.)

  12. ^

    This is slightly complicated because (1) some safety research boosts capabilities and (2) some safety practices work better if their existence (in the case of control) or details (in the case of adversarial robustness) are secret. In some cases you can share with other labs to get most of the benefits and little of the cost. Regardless, lots of safety research is good to publish. And publishing model evals is good for some of the same reasons plus transparency and accountability.

  13. ^

    E.g. fine-tuning, helpful-only, filters/moderation off, logprobs, activations.

  14. ^
    - RSP details (but it's not currently possible to do all of this very well)
        - See generally Key Components of an RSP
        - Do evals regularly
        - And have predefined high-level capability thresholds
        - And have predefined low-level operationalizations
        - And publish evals and thresholds, or have an auditor comment
        - And have thresholds trigger specific responses
        - And all these details should be good
        - And prepare for a pause (to make it less costly, in case it's necessary)
        - And accountability/transparency/meta stuff: publish updates or be transparent to an auditor, have a decent update process, elicit external review, perhaps adopt a mechanism for staff to flag false statements
    - Elicitation details (for model evals for dangerous capabilities): see the summary in my blogpost (or my table)
    - Don't publish capabilities research, modulo Thoughts on sharing information about language model capabilities (Christiano 2023): don't publish stuff that boosts black-box capabilities (especially stuff on training) unless it improves safety
    - Internal governance
        - Prepare for difficult decisions ahead
        - Have structure, mandate, and incentives set up such that the lab can prioritize safety in key decisions
        - Plan what the lab and its staff would do if it needed to pause for safety
        - Have a good process for staff to escalate concerns about safety internally
        - Don't discourage certain kinds of whistleblowing (details unclear)
        - Don't use nondisparagement provisions in contracts or severance agreements
    - Object-level security practices
    - Various specific safety techniques (potentially high-priority but I just didn't focus this post on them) (not necessarily currently existing) (very nonexhaustive)

