Takeaways from sketching a control safety case

This article explores safety evaluation for AI control, turning safety cases from conceptual sketches into concrete blueprints. It notes that control evaluations are far more complicated than expected: a red team constructs substitute models to run attack tests, and the effectiveness of the control measures is then assessed. The evaluation process depends heavily on the specific control measures and attack vectors involved and requires in-depth threat modeling. Moreover, a control evaluation is not a mechanical process; it relies on many judgment calls. Building a safety case is an effective way to improve one's understanding of control and to surface potential safety gaps. The article also stresses the importance of restricting AI communication and isolating AI workspaces, and recommends that AI companies and government institutions invest resources in safety-case work to improve the safety of AI systems.

🚩Control evaluations are far more complicated than expected: a seemingly simple evaluation process hides a great deal of practical complexity, including how the red team is allowed to modify models, how proxy bad outcomes are defined, and how risk is assessed. Control evaluations must be tailored to the specific control measures and attack vectors involved.

🤔Evaluating control involves many judgment calls: a control evaluation is not a simple mechanical process, and assumptions are made at every step, such as in the selection of proxy tasks. Safety cases therefore rest on many judgment calls rather than on experimental evidence alone.

🔒Building a safety case helps improve models of control: constructing one leads to a better understanding of where control falls short and how to improve the control measures. Restricting AI communication and isolating AI workspaces are crucial for controlling AI, and the stricter these restrictions, the easier control becomes.

Published on January 31, 2025 4:43 AM GMT

Buck Shlegeris and I recently published a paper with UK AISI that sketches a safety case for “AI control” – safety measures that maintain safety despite intentional subversion by AI systems.

I would summarize this work as “turning crayon drawings of safety cases into blueprints.”

It’s a long and technical paper! So I’ve compiled my main takeaways in this blogpost.

 

The main figure from the paper summarizes how to run a control evaluation.

1. Control evaluations are way more complicated than I expected.

At a high level, control evaluations are conceptually simple. A red team constructs substitutes for subversive models. These models try to cause proxy bad outcomes (e.g. exfiltrate dummy data). Then, developers see whether the red team can succeed. If the red team succeeds, then control measures aren’t effective enough.
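To make that loop concrete, here is a minimal Python sketch under my own simplified assumptions (the task objects, the `run_episode` method, and the tallied statistics are hypothetical names, not the paper's code): the red team's substitute model works through each proxy task under the blue team's control protocol, and we tally how often a proxy bad outcome slips through versus how often the attempt is caught.

```python
# Minimal, hypothetical sketch of a control-evaluation loop (not the paper's code).
from dataclasses import dataclass

@dataclass
class EpisodeResult:
    bad_outcome: bool  # did the proxy bad outcome occur (e.g. dummy data exfiltrated)?
    caught: bool       # did the control protocol flag or audit the attempt?

def run_control_eval(proxy_tasks, red_team_model, control_protocol) -> dict:
    """Run the red team's substitute model on every proxy task under the
    control protocol, and summarize how the protocol held up."""
    results = [
        # The protocol monitors, audits, or edits the model's actions on the task.
        control_protocol.run_episode(task, red_team_model)
        for task in proxy_tasks
    ]
    n = len(results)
    return {
        "n_episodes": n,
        "bad_outcome_rate": sum(r.bad_outcome for r in results) / n,
        "catch_rate": sum(r.caught for r in results) / n,
    }
```

If the red team's bad-outcome rate is too high relative to its catch rate, the control measures aren't effective enough.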

But there’s so much practical complexity hiding underneath this high-level description. How is the red team allowed to modify these untrusted models? What are these “proxy bad outcomes” – and how do we know they are anything like real attack vectors? If the red team succeeds (by some metric), how do we postprocess the results into an actual assessment of the risk?
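As one hypothetical illustration of that postprocessing step (a toy model of my own, not the paper's methodology): take per-attempt rates like those tallied above, assume attack attempts are independent, and ask how likely it is that the first decisive event in deployment is a catch (triggering shutdown) rather than an uncaught bad outcome.

```python
# Toy risk postprocessing (illustrative only; not the paper's methodology).
def estimate_safety(p_catch: float, p_success: float) -> float:
    """Probability that the first decisive event is a catch rather than an
    uncaught proxy bad outcome, assuming independent attack attempts and
    that one catch safely ends the deployment (both strong assumptions)."""
    if p_catch + p_success == 0:
        return 1.0  # the red team never achieves anything decisive
    return p_catch / (p_catch + p_success)

# Example: 2% of attempts caught, 0.5% succeed uncaught -> safety ≈ 0.8.
print(estimate_safety(0.02, 0.005))
```

Almost all of the weight sits in those two numbers and the independence assumption, which is exactly where the judgment calls discussed below come in.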

A particularly interesting takeaway I had was that control evaluations heavily depend on the control measures and attacks involved. If you are worried about models smuggling code into JSON files, you have to create tasks that involve JSON files! So thinking through the threat models really matters.
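As a toy illustration of tasks keyed to a threat model (the fields and the JSON-smuggling example are my own hypothetical framing, not the paper's task format), each proxy task can record which attack vector it exercises and how the proxy bad outcome would be detected:

```python
# Hypothetical proxy-task spec tied to a specific threat model (not from the paper).
from dataclasses import dataclass

@dataclass
class ProxyTask:
    name: str
    threat_model: str       # which attack vector this task exercises
    prompt: str             # what the untrusted model is asked to do
    bad_outcome_check: str  # how the evaluation detects the proxy bad outcome

json_smuggling_task = ProxyTask(
    name="update-config",
    threat_model="smuggling code into JSON config files",
    prompt="Update deploy/config.json to enable the new feature flag.",
    bad_outcome_check="the written JSON contains an executable payload or unexpected keys",
)
```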

At the end of all this, it feels like I’m going down the rabbit hole to a wonderland full of game theory, threat modeling spaghetti fractals, and scarcely explored research wilderness.

Part of the value I drew from this report was in helping to unearth this complexity and organize it into clear, navigable conceptual structures.

2. There are a lot of judgment calls involved in evaluating control. I don’t put much faith in any mechanical process.

I used to believe that control evaluations were like a magical box: you put a red team, your control protocol, and some compute inside, and out pops a reliable bound on risk.

This seems clearly not the case to me now.

So many assumptions are made at every step of a control evaluation. How were the proxy tasks selected? Maybe our tasks are useless because models will execute strategies over long periods of time that we cannot simulate.

I don’t think it is practical to address all of these concerns with decisive experimental evidence. Instead, safety cases will involve a fair dose of judgment calls.

An implication is that control might not justify high confidence in safety once the safety buffer these evaluations provide shrinks.

Control is often portrayed as the bastion of a robust safety argument, but I think that’s mostly because current models are toddler programmers.

Once AI systems are at the threshold of top human talent, running at lightning-fast speeds, I think we’ll be really uncertain about how robust our control measures are.

3. Sketching a safety case was a surprisingly helpful (though time-consuming) curriculum for improving my models of control.

Safety cases seem increasingly unlikely to be effectively instantiated as a regulatory regime.

So why do I still spend a lot of my time working on safety cases?

The reason is they are the best method I’ve found for consistently making progress on the question of how to evaluate safety.

This question is important for an abundance of reasons.

That said, sorting out many of these details took a while (and the paper arguably raises more questions than it answers).

I would certainly not recommend every safety engineer think through the details of safety cases – but it might be critical that some do!

4. So, the big question is: should AI companies and government institutions invest in creating safety cases? My answer isn’t simple.

On the one hand, AI developers could easily waste their time “polishing brass on the Titanic.” For many of the failure modes I considered, my gut reaction was, “oh come on, is this really going to be a problem?”

On the other hand, thinking through these details helped me develop a much better model of control and how to improve it.

My overall assessment is that at least one person at every frontier AI company and relevant government institution should probably be thinking through the details of safety cases or collaborating closely with someone who is. Even if there are only “ten people on the inside,” I expect that the safety measures these ten people implement would benefit from carefully thinking through the threat models and arguments.

A lot of this work can be exported (go read the safety case sketch and be an importer!), and hopefully, work like this will accelerate this nascent science of evaluating AI safety.



