少点错误 2024年11月01日
Toward Safety Cases For AI Scheming
index_new5.html
../../../zaker_core/zaker_tpl_static/wap/tpl_guoji1.html

 

本文探讨了前沿AI系统开发者面临的安全决策挑战,特别是AI系统是否会通过scheming导致危险。文中提到了如何构建AI系统安全案例,包括三个核心论点及相关挑战,还阐述了近期安全案例的关键支柱及未来所需的技术。

文中将scheming定义为AI系统隐蔽且策略性地追求不一致目标,此概念的提出为后续讨论奠定基础。

提出构建AI系统安全案例的三个核心论点:展示AI系统无scheming能力、无通过scheming造成伤害的能力、有控制措施可预防不良后果,并讨论了如何用对齐论证支持安全案例。

阐述了近期安全案例的三个关键支柱:监测模型输出中的scheming、证明模型无难以理解的scheming能力、证明scheming时无造成伤害的能力,同时提到若模型scheming难以理解,需足够好的解释技术来监测。

Published on October 31, 2024 5:20 PM GMT

Developers of frontier AI systems will face increasingly challenging decisions about whether their AI systems are safe enough to develop and deploy. One reason why systems may not be safe is if they engage in scheming. In our new report "Towards evaluations-based safety cases for AI scheming", written in collaboration with researchers from the UK AI Safety Institute, METR, Redwood Research and UC Berkeley, we sketch how developers of AI systems could make a structured rationale – 'a safety case' – that an AI system is unlikely to cause catastrophic outcomes through scheming. 

Note: This is a small step in advancing the discussion. We think it currently lacks crucial details that would be required to make a strong safety case.

Read the full report.

Figure 1. A condensed version of an example safety case sketch, included in the report. Provided for illustration.

Scheming and Safety Cases

For the purpose of this report, we define scheming as AI systems covertly and strategically pursuing misaligned goals.[1]

The present work is part of an emerging approach to justifying safety of general-purpose AI systems: safety cases, which provide structured arguments that deploying an AI system is unlikely to cause significant harm (Clymer et al, 2024, Irving, 2024, Buhl et al, 2024). Similar to how nuclear power plant and aircraft developers make safety cases before deploying their systems, developers of highly-capable AI systems could use safety cases to demonstrate that their systems won't cause catastrophic harm - whether through misuse, misalignment or scheming specifically.

Core Arguments and Challenges

We propose three core arguments that could be used in safety cases for scheming:

We also discuss how safety cases might be supported by alignment arguments using (i) alignment evaluations (and arguing that models aren't alignment-faking) or (ii) arguing that the training process itself is unlikely to lead to scheming (something that we can't do yet).

To illustrate how these arguments work in practice, our report includes an example safety case sketch for a hypothetical AI system deployed internally by an AI developer.

Making these arguments requires evidence of satisfying multiple assumptions - from showing that AI systems do not sandbag capability evaluations, to showing that control measures remain effective over time (e.g. human auditing within a control protocol remains accurate months into the deployment; model weights will not be stolen and controls removed; etc). We provide initial ideas for how developers may try to satisfy these assumptions, but ultimately many of them are open problems.

Safety cases in the near future

Our evaluation of frontier models like OpenAI's o1-preview, described in its model card, shows that some publicly available AI systems are now capable of basic scheming. However, this scheming is visible directly in model outputs and legible to humans (even if not shown to users) and thus could be monitored by AI developers.

We think near-future safety cases will rely on three key pillars:

If models become capable of scheming illegibly, safety cases would need interpretability techniques to be good enough to monitor scheming reasoning within model internals. We are optimistic that such techniques could exist in the future, but we don't think they are ready yet.

At Apollo, we are working on evaluations and basic science that could help enable such safety cases.

Read the full report.

  1. ^

    Our definition of scheming is broader than deceptive alignment (Hubinger et al, 2019) or scheming as used by Carlsmith, 2023, who both describe scheming as a strategy that could be used by AI systems during training. We use a broader definition of scheming that extends to intervals outside of training, because we think that scheming-like behaviors could emerge in more diverse circumstances — including via in-context learning during deployment. We think that scheming-like behaviors motivate largely (but not entirely) similar mitigation measures regardless of the emergence circumstances. Thus, we think that when taking a risk-modeling or safety-case perspective, using a broader definition is necessary to ensure different forms of this threat model are taken into account.



Discuss

Fish AI Reader

Fish AI Reader

AI辅助创作,多种专业模板,深度分析,高质量内容生成。从观点提取到深度思考,FishAI为您提供全方位的创作支持。新版本引入自定义参数,让您的创作更加个性化和精准。

FishAI

FishAI

鱼阅,AI 时代的下一个智能信息助手,助你摆脱信息焦虑

联系邮箱 441953276@qq.com

相关标签

AI系统安全 scheming 安全案例 核心论点
相关文章