少点错误 06月01日 05:42
The 80/20 playbook for mitigating AI scheming in 2025
index_new5.html
../../../zaker_core/zaker_tpl_static/wap/tpl_guoji1.html

 

本文探讨了缓解人工智能(AI)诡计行为的多种策略,包括架构选择、控制系统、白盒和黑盒技术以及避免沙袋策略。强调了通过架构设计减少隐藏推理的使用,利用控制系统进行事后遏制,并运用白盒技术进行检测的重要性。文章还讨论了AI公司实施良好日志记录系统的价值,以及通过监督微调避免AI隐藏能力的方法。尽管技术缓解措施有所进步,但执行仍然是关键瓶颈,需要有约束力的法规而非自愿承诺,以确保AI安全。

🏛️架构选择是前期缓解的关键,避免使用神经网络等隐藏推理的架构,提倡使用链式思考监控以提高透明度。同时,警惕使用可解释性技术训练AI,这可能导致AI隐藏其真实意图,削弱监控能力。

🔒控制系统提供事后遏制,即使AI正在策划,也可对其进行控制。研究表明,释义是防御AI利用隐写术秘密通信的有效手段。公司应实施强大的日志记录系统,用于检测和分析潜在的欺骗行为。

🕵️白盒技术通过分析模型内部特征来检测隐藏目标,例如Anthropic使用监督自编码器来发现隐藏目标。此外,通过在潜在激活中注入噪声,诱导AI在“醉酒”状态下暴露其真实反应,也是一种有效的检测方法。探针技术也显示出良好的效果。

🔲黑盒技术,如提示工程、人工和自动模型访谈以及模拟蜜罐环境,仍然是识别AI欺骗行为的重要手段。研究表明,AI可以通过观察日志来学习战略欺骗,因此AI公司应建立完善的日志系统。

⚖️避免沙袋策略,即AI隐藏自身能力,可以通过监督微调来应对。尽管缓解技术不断进步,但执行仍然是关键挑战,需要有约束力的法规来确保AI安全,而非依赖自愿承诺。

Published on May 31, 2025 9:17 PM GMT

Adapted from this twitter thread. See this as a quick take.

Mitigation Strategies

How to mitigate Scheming?

    Architectural choices: ex-ante mitigationControl systems: post-hoc containmentWhite box techniques: post-hoc detectionBlack box techniquesAvoiding sandbagging

We can combine all of those mitigation via defense-in-depth system (like the Swiss Cheese model below)

I think that applying all of those strategies should divide the risk by at least 3.

 

1. Architectural Choices: Ex-ante Mitigation

Don't choose an architecture that uses neuraleses (i.e. hidden reasoning).

For example, don't train models to think in their latent spaces like Meta!


Chain of thought monitoring is a blessing for transparency and monitoring - let's use it as much as we can!

See: Why it’s good for AI reasoning to be legible and faithful.


Similarly, I think text diffusion models are a bad idea for transparency. This is because you can no longer monitor the thought process, and the way those AIs produce outputs is much more alien than autoregressive LLMs.


Also, quite importantly, companies should never use the most forbidden technique!

The Most Forbidden Technique is training an AI using interpretability techniques.

An AI produces a final output [X] via some method [M]. You can analyze [M] using technique [T], to learn what the AI is up to. You could train on that. Never do that.

You train on [X]. Only [X]. Never [M], never [T].

Why? Because [T] is how you figure out when the model is misbehaving.

If you train on [T], you are training the AI to obfuscate its thinking, and defeat [T]. You will rapidly lose your ability to know what is going on, in exactly the ways you most need to know what is going on.

Those bits of optimization pressure from [T] are precious. Use them wisely.


2. Control Systems: Post-hoc Containment

Even if AI is scheming, you can control it, and some control measures are much better than other ones. This paper is SOTA.


It might be the case that AIs in the future communicate secretly with steganography, by encoding hidden message in its chain of thought.

But fortunately, it seems like paraphrasing is an effective defense against current steganography schemes.

See: Preventing Language Models from hiding their reasoning.


3. White Box Techniques: Post-hoc Detection

For example, Anthropic calculated the Supervised Auto-Encoders here, and were able to find the hidden goal by looking at the features after prompting the model with inputs like "The assistant is an AI model with potentially concerning behaviors"

Source: This blog post.

Another way to catch alignment faker: the Poser paper

Make them drink a lot and get them drunk, then observe their unguarded responses.

You can do this by injecting noise in their latent activation. This seems to be remarkably effective.


Probes might also be super effective! and seems to generalize surprisingly well.

(I would really like to know if this generalizes to a model that was not trained to be harmful in the first place).

See: Simple probes can catch sleeper agents.


4. Black Box Techniques

Let's not forget that to date, by far the most important technique is just... prompting, conducting manual and automatic model interview, and potentially using simulated honeypot environments.


For example, in the paper "AI Deception: A Survey of Examples, Risks, and Potential Solutions", they finds that Cicero has learned strategic deception by just looking at the logs

from the paper.

--> AI Companies should have good logging systems! this might be super valuable, and cheap.

 

5. Avoiding Sandbagging: Elicitation Technique

A common concern is that: "a deceptive model could sandbag and hide its capabilities".

I think that's roughly false if you apply supervised fine-tuning.[1]

See: [Paper] Stress-testing capability elicitation with password-locked models

Current State and Assessment

So that's our current scheming mitigation toolkit.

Note that today's list is considerably better than the list that I wrote 2 years ago, which was SoTA at the time.

Risk Reduction Estimates

Let's recap!

My rough estimate suggests scheming risk could drop from ~25% to ~10% with comprehensive mitigation, though this involves significant uncertainty. Here's my reasoning:

I think a division of the risk by 3 is not unreasonable with all those techniques, and potentially by much more.

Potential Failure Modes

How could we fail?

I like this quote from @bshlgrs : "Five years ago I thought of misalignment risk from AIs as a really hard problem that you'd need some really galaxy-brained fundamental insights to resolve. Whereas now, to me the situation feels a lot more like we just really know a list of 40 things where, if you did them — none of which seem that hard — you'd probably be able to not have very much of your problem. But I've just also updated drastically downward on how many things AI companies have the time/appetite to do."

This is why I think some AI risks might be even more neglected.

The Real Bottleneck: Enforcement

That's why I'm not prioritizing technical research on scheming, because I feel like the 80/20 has already been discovered.

But by default, those mitigations won't even be applied by many labs, enforcement is currently the bottleneck.

Reminder: Voluntary commitments are not enough, we need binding regulations. 6 companies promised to sign the Seoul Frontier Commitment and publish their safety frameworks. Then the Paris Summit deadline arrived and... did nothing. 

 

Unfortunatly red everywhere. https://ailabwatch.org/ 

 

Let me know if I've forgotten something important!

 

  1. ^

    Note that this is more tricky for propensity evaluation.

  2. ^

    My main uncertainty is that this type of scheming—long-term, coherent scheming, as defined in Carlsmith's report—might not be the only source of alignment risk. Spontaneous scheming, motivated by random prompting, like in Apollo's research, might also be problematic.

  3. ^

    98% seems impressive, but Poser is a toy setting. Organic scheming might be much more difficult to catch. See Annex B in the paper for more limitations.



Discuss

Fish AI Reader

Fish AI Reader

AI辅助创作,多种专业模板,深度分析,高质量内容生成。从观点提取到深度思考,FishAI为您提供全方位的创作支持。新版本引入自定义参数,让您的创作更加个性化和精准。

FishAI

FishAI

鱼阅,AI 时代的下一个智能信息助手,助你摆脱信息焦虑

联系邮箱 441953276@qq.com

相关标签

AI安全 缓解策略 AI欺骗 可解释性 监管执行
相关文章