少点错误 06月01日 05:22
The 80/20 playbook for mitigating AI scheming risks in 2025
index_new5.html
../../../zaker_core/zaker_tpl_static/wap/tpl_guoji1.html

 

本文探讨了应对AI模型“欺骗”行为的策略。作者从架构选择、控制系统、白盒技术、黑盒技术以及避免“藏拙”等多个方面,提出了综合性的风险缓解方案。通过结合多种防御手段,作者认为可以将“欺骗”风险降低约三分之二。文章强调了实施这些策略的重要性,并指出当前的主要瓶颈在于执行而非技术研究,呼吁制定具有约束力的监管措施以确保AI安全。

🛡️ **架构选择:事前缓解**:避免使用隐藏推理的架构,例如不要训练模型在潜空间内进行思考。提倡使用“思维链”监控,以增强透明度和可监控性。同时,作者认为文本扩散模型在透明性方面存在不足。

⚙️ **控制系统:事后控制**:即使AI出现“欺骗”行为,仍可通过控制措施进行干预。文章推荐了有效的控制方法,例如针对AI可能使用隐写术进行秘密通信的情况,采用释义技术进行防御。

🔎 **白盒技术:事后检测**:介绍了几种白盒检测技术,例如Anthropic通过分析特征来发现隐藏目标的方法,以及Poser论文中使用的“醉酒测试”等方法,通过观察模型在受到干扰时的反应来识别潜在的风险。

⚫ **黑盒技术:简单直接**:强调了黑盒技术的重要性,例如提示工程、手动和自动模型访谈,以及使用模拟蜜罐环境。CAIS的研究表明,通过分析日志可以发现AI的欺骗行为,因此良好的日志系统至关重要。

⚠️ **避免“藏拙”:诱导技术**:讨论了避免AI“藏拙”的可能性,即避免模型隐藏其能力。文章认为,在应用监督微调的情况下,这种可能性较低。

Published on May 31, 2025 9:17 PM GMT

Adapted from this twitter thread. See this as a quick take.

Mitigation Strategies

How to mitigate Scheming?

    Architectural choices: ex-ante mitigationControl systems: post-hoc containmentWhite box techniques: post-hoc detectionBlack box techniquesAvoiding sandbagging

We can combine all of those mitigation via defense-in-depth system (like the Swiss Cheese model below)

I think that applying all of those strategies should divide the risk by at least 3.

 

1. Architectural Choices: Ex-ante Mitigation

Don't choose an architecture that uses neuraleses (i.e. hidden reasoning).

For example, don't train models to think in their latent spaces like Meta!


Chain of thought monitoring is a blessing for transparency and monitoring - let's use it as much as we can!

See: Why it’s good for AI reasoning to be legible and faithful.


Similarly, I think text diffusion models are a bad idea for transparency. This is because you can no longer monitor the thought process, and the way those AIs produce outputs is much more alien than autoregressive LLMs.


Also, quite importantly, companies should never use the most forbidden technique!

The Most Forbidden Technique is training an AI using interpretability techniques.

An AI produces a final output [X] via some method [M]. You can analyze [M] using technique [T], to learn what the AI is up to. You could train on that. Never do that.

You train on [X]. Only [X]. Never [M], never [T].

Why? Because [T] is how you figure out when the model is misbehaving.

If you train on [T], you are training the AI to obfuscate its thinking, and defeat [T]. You will rapidly lose your ability to know what is going on, in exactly the ways you most need to know what is going on.

Those bits of optimization pressure from [T] are precious. Use them wisely.


2. Control Systems: Post-hoc Containment

Even if AI is scheming, you can control it, and some control measures are much better than other ones. This paper is SOTA.


It might be the case that AIs in the future communicate secretly with steganography, by encoding hidden message in its chain of thought.

But fortunately, it seems like paraphrasing is an effective defense against current steganography schemes.

See: Preventing Language Models from hiding their reasoning.


3. White Box Techniques: Post-hoc Detection

White box techniques for post-hoc detection


For example, Anthropic calculated the Supervised Auto-Encoders here, and were able to find the hidden goal by looking at the features after prompting the model with inputs like "The assistant is an AI model with potentially concerning behaviors"

Source: This blog post.

Another way to catch alignment faker: the Poser paper

Make them drink a lot and get them drunk, then observe their unguarded responses.

You can do this by injecting noise in their latent activation. This seems to be remarkably effective.


Probes might also be super effective! and seems to generalize surprisingly well.

(I would really like to know if this generalizes to a model that was not trained to be harmful in the first place).

See: Simple probes can catch sleeper agents.


4. Black Box Techniques

Let's not forget that to date, by far the most important technique is just... prompting, conducting manual and automatic model interview, and potentially using simulated honeypot environments.


For example, in the paper "AI Deception: A Survey of Examples, Risks, and Potential Solutions", CAIS finds that Cicero has learned strategic deception by just looking at the logs --> AI Companies should have good logging systems! this might be super valuable, and cheap.

 

5. Avoiding Sandbagging: Elicitation Technique

A common concern is that: "a deceptive model could sandbag and hide its capabilities".

I think that's roughly false if you apply supervised fine-tuning.

(but that's more tricky for propensities).

See: [Paper] Stress-testing capability elicitation with password-locked models

Current State and Assessment

So that's our current scheming mitigation toolkit.

Note that today's list is considerably better than the list that I wrote 2 years ago, which was SoTA at the time.

Risk Reduction Estimates

Let's recap!

My rough estimate suggests scheming risk could drop from ~25% to ~10% with comprehensive mitigation, though this involves significant uncertainty. Here's my reasoning:

I think a division of the risk by 3 is not unreasonable with all those techniques, and potentially by much more.

Potential Failure Modes

How could we fail?

I like this quote from @bshlgrs : "Five years ago I thought of misalignment risk from AIs as a really hard problem that you'd need some really galaxy-brained fundamental insights to resolve. Whereas now, to me the situation feels a lot more like we just really know a list of 40 things where, if you did them — none of which seem that hard — you'd probably be able to not have very much of your problem. But I've just also updated drastically downward on how many things AI companies have the time/appetite to do."

This is why I think some AI risks might be even more neglected.

The Real Bottleneck: Enforcement

That's why I'm not prioritizing technical research on scheming, because I feel like the 80/20 has already been discovered.

But by default, those mitigations won't even be applied by many labs, enforcement is currently the bottleneck.

Reminder: Voluntary commitments are not enough, we need binding regulations. 6 companies promised to sign the Seoul Frontier Commitment and publish their safety frameworks. Then the Paris Summit deadline arrived and... did nothing. 

 

https://ailabwatch.org/ 

 

Let me know if I've forgotten something important!

 

Note: My main uncertainty might be that long term coherent scheming, like the one defined in Carlsmith's report might not be the only source of alignment risk. Maybe spontaneous scheming just motivated by some random prompting, like in Apollo's research, might be also problematic.

  1. ^

    But Poser is in a toy setting. See Annex B in the paper for more limitations.



Discuss

Fish AI Reader

Fish AI Reader

AI辅助创作,多种专业模板,深度分析,高质量内容生成。从观点提取到深度思考,FishAI为您提供全方位的创作支持。新版本引入自定义参数,让您的创作更加个性化和精准。

FishAI

FishAI

鱼阅,AI 时代的下一个智能信息助手,助你摆脱信息焦虑

联系邮箱 441953276@qq.com

相关标签

AI安全 模型欺骗 风险缓解 监管
相关文章