A Frontier AI Risk Management Framework: Bridging the Gap Between Current AI Practices and Established Risk Management

LessWrong, March 14


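SaferAI proposes a risk management framework intended to improve on existing Frontier Safety Frameworks. It borrows practices and concepts from other areas of risk management, introduces conceptual clarity, and generalizes some early intuitions that the AI safety field arrived at independently. The framework emphasizes doing most risk management before the final training run begins, so that this work can proceed in parallel with capabilities work without delaying the release of safe products. It also emphasizes open-ended red teaming for risk identification, distinguishes the risk management policy from the risk register, designates clear risk owners, introduces Key Risk Indicators (KRIs), and proposes a multi-layered governance approach.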

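🔑 SaferAI's risk management framework emphasizes doing most risk management before the final training run begins, so that risk management work runs in parallel with capabilities work and does not delay the release of safe products.

🎯 The framework emphasizes open-ended red teaming to identify possible new risk factors. For example, the emergence of chain-of-thought substantially improved model performance but also increased risk, so phenomena of this kind need to be caught and addressed before deployment.

📊 The framework clearly distinguishes the risk management policy from the risk register. The policy stays relatively stable, while the risk register is updated frequently and contains a catalog of risks with all the corresponding information. The framework suggests documenting 6 categories of information for each risk.

👤 The framework clearly designates a risk owner, i.e. the individual ultimately accountable for ensuring a risk is managed appropriately. This is a simple but under-discussed practice.

📈 The framework introduces Key Risk Indicators (KRIs), i.e. proxy measures used to monitor and assess various risk sources. Evaluations are one important type of risk indicator, but not the only one. For example, the percentage of API interactions that are jailbroken could also serve as a KRI. Thresholds can be set on all types of KRIs, e.g. on the number of incidents post-deployment.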

Published on March 13, 2025 6:29 PM GMT

We (SaferAI) propose a risk management framework which, if followed, we think should improve substantially upon existing Frontier Safety Frameworks. It borrows a range of practices and concepts from other areas of risk management to introduce conceptual clarity and to generalize some early intuitions that the field of AI safety came up with independently.

To keep the framework readable for people in the AI field, we have not yet made it fully adequate.^[1]

To give you a taste, here are some of the unique features of our risk management framework:

- The emphasis on the majority of the risk management happening before the final training run begins. This enables the work to be parallelized with capabilities work and not delay the release of safe products.

- An emphasis on open-ended red teaming for risk identification, to identify possible new risk factors. A canonical example of a hard-to-foresee emerging phenomenon changing the risk profile of a model is chain-of-thought, which makes a model substantially stronger, and hence riskier across the board, when prompted adequately. If a new phenomenon comparable to chain-of-thought appears, you ideally want a procedure that would catch it before deployment.

- The distinction between the risk management policy, meant to be pretty stable, and the risk register, which is used and updated very frequently and contains a catalog of risks along with all the corresponding information. In this risk register, we suggest a set of 6 categories of information to document for each risk.

- The clear designation of a risk owner, i.e. an individual ultimately accountable for ensuring a risk is managed appropriately. This is a simple yet undiscussed practice.

- The introduction of Key Risk Indicators (KRIs), i.e. proxy measures tracked to monitor and assess various risk sources. Evaluations are one important type of risk indicator, but not the only one. The percentage of API interactions that are jailbroken could be another. This notion makes it possible to generalize a lot of the thinking that has gone into evaluations. You could set thresholds on all types of KRIs, e.g. on the number of incidents post-deployment.

- A proposed multi-layered governance approach producing checks and balances, inspired by structures in other industries.
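To make the KRI idea concrete, here is a minimal Python sketch of thresholds set on two kinds of indicators mentioned above: the percentage of jailbroken API interactions and the number of post-deployment incidents. Everything in it is an illustrative assumption rather than part of the framework: the class and function names, the threshold values, and the owner labels are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KeyRiskIndicator:
    """A proxy measure tracked to monitor one risk source (hypothetical schema)."""
    name: str
    threshold: float  # illustrative value; the post does not specify thresholds
    owner: str        # risk owner accountable for acting on a breach

    def is_breached(self, observed: float) -> bool:
        # Treat the KRI as breached once the observed value reaches the threshold.
        return observed >= self.threshold


def jailbreak_rate(jailbroken: int, total: int) -> float:
    """Percentage of API interactions flagged as jailbroken."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * jailbroken / total


def breached_kris(kris, observations):
    """Return the names of KRIs whose observed value reached its threshold."""
    return [
        k.name
        for k in kris
        if k.name in observations and k.is_breached(observations[k.name])
    ]


# Hypothetical indicators and thresholds, for illustration only.
kris = [
    KeyRiskIndicator("jailbreak_rate_pct", threshold=1.0, owner="risk-owner-a"),
    KeyRiskIndicator("post_deployment_incidents", threshold=3, owner="risk-owner-b"),
]
observations = {
    "jailbreak_rate_pct": jailbreak_rate(jailbroken=25, total=10_000),
    "post_deployment_incidents": 4,
}
print(breached_kris(kris, observations))  # ['post_deployment_incidents']
```

The point of the sketch is that an evaluation score, a jailbreak rate, and an incident count can all be handled by the same threshold check, with each breach routed to a named risk owner.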

We summarize below the risk management framework components:

We welcome feedback here, by DM, or by email.

^[1] One example: since the conditions of deployment and the number of possible instances of a model are a significant risk factor, a fully adequate risk management framework would condition mitigations on joint (capabilities, deployment conditions) thresholds. For simplicity, and so that the transition from developers' current risk frameworks is reasonably feasible, we reserve that consideration for future updates of the framework, once the field has stepped up.

