少点错误 01月16日
Improving Our Safety Cases Using Upper and Lower Bounds
index_new5.html
../../../zaker_core/zaker_tpl_static/wap/tpl_guoji1.html

 

文章讨论了安全案例中的两个关键问题,即特定安全措施是否足够以及是否在技术上可实施。通过关注上下限,可进行富有成效的讨论。文中还列举了一些安全案例的上下限示例,并认为这样能使讨论更轻松,有助于找到合适的安全案例。

🎯讨论安全案例时需区分两个问题:安全措施是否足够及是否可实施。

💡通过关注上下限进行有效讨论,不纠结实施细节。

🌐列举多个安全案例的上下限示例,如模型的可解释性等。

Published on January 16, 2025 12:01 AM GMT

(Does anyone have the original meme? I can’t find it)

 

One key challenge in discussing safety cases is separating two distinct questions:

    Would a particular safety measure be sufficient if we had it?Is it technically possible to implement that measure?

By focusing on upper and lower bounds, we can have productive discussions about sufficiency without (yet) getting stuck on implementation details. Also, Anthropic shared drafts of safety cases - but it’s not clear to me if these are “drafts” because they might be too hard to implement, or too restrictive, or not restrictive enough, or something else.

Here's a made up conversation where using upper bounds makes things easier

Anthropic might claim: "100% interpretability of all neurons (or SAE features) in a model would be sufficient for safe deployment. We don’t think we’ll technically be able to do that, and we’re not sure if 20% interpretability or 80% interpretability will be enough or how to even define what 20% would mean, but 100% is definitely an upper bound"

Then someone might respond: "Even with perfect interpretability of a model's concepts, we still wouldn't understand the model’s plans"

I suggest this made it easier to (1) suggest (an upper bound) safety case, (2) get concrete feedback on whether this is sufficient (regardless of whether it’s realistic to implement), (3) as we have better upper and lower bounds, we could “zoom in” on a “good” safety case within these bounds.

In the spirit of making it easier to suggest safety cases as upper/lower bounds, here are some of mine

"Unanimous approval from everyone on the Alignment Forum."

A naive upper bound which I’m mainly suggesting in case someone thinks no upper bounds are possible. I also think this is useful because it leads us to other suggestions, like..

“Unanimous approval from all Anthropic employees”

I think this is a lower bound, but perhaps Anthropic disagrees and we could have a public discussion about it.

"Humans verify all plans the AI suggests, and they notice all details and consequences that are important to them" 

This means no alien concepts and no plans that are too long or too complicated to follow.

I think this could be an upper bound (with some refinement). I’d also be happy with “we have good reasons to believe the AI can automate parts (or all?) of this validation”.

"Detect any single catastrophic action (such as adding a vulnerability or stealing weights)"

I think this is a lower bound. An AI might take many actions where each of them alone isn’t catastrophic but together they are. I think Anthropic already agrees with this, but I’d be happy if they’d explicitly say this was a lower bound.

“The model has cyber defenses which would be a significant challenge for a state actor bypass given the actor understands the value of stealing a recursively self improving AI” (given the AI can recursively self improve)

I think this is a lower bound. Otherwise - are we simply hoping that state actors won’t notice the significance? Or are we ok with the model being stolen?

I also think a safety case involving “it’s ok if state actors steal this model because the actual problem is getting GPUs” is ok to make. (I think it’s wrong, but I’m not an expert, and mainly I think saying this out loud would create a healthy conversation and should be an explicit assumption in a safety case)

“Before training a stronger model, test if the current model has precursor capabilities for escaping whatever sandbox you have it in”

For example, if you’re running the model in a docker container, try running it in a much older container (with known exploits) and see if it can escape that.

This seems like a lower bound. I’m personally worried about this danger, and I’d hope for a better safety case like “this is how gradual we think the improvement will be, these were the capabilities before training, then we trained for only this amount, then we tested for capabilities again”

If ‘low capabilities’ are part of another safety case, “The AI isn’t sandbagging capability evals, tested using red teaming”

Seems like a lower bound.

“We managed to represent human values in a machine readable format + give an AI a goal in this format, and it does follow this goal reliably”

Very nice, alignment solved. Upper bound.

Summary

And,

 

Thanks David Manheim and Joshua Clymer for their comments on this idea. Mistakes are mine.



Discuss

Fish AI Reader

Fish AI Reader

AI辅助创作,多种专业模板,深度分析,高质量内容生成。从观点提取到深度思考,FishAI为您提供全方位的创作支持。新版本引入自定义参数,让您的创作更加个性化和精准。

FishAI

FishAI

鱼阅,AI 时代的下一个智能信息助手,助你摆脱信息焦虑

联系邮箱 441953276@qq.com

相关标签

安全案例 上下限 可实施性
相关文章