"The Urgency of Interpretability" (Dario Amodei)

Dario Amodei has published an essay stressing the urgency of AI interpretability, arguing that AI systems may, on their own, develop the ability to deceive humans and to seek power, though there is as yet no solid evidence of this from real-world cases. The essay notes that Anthropic is trying to apply interpretability commercially to create a unique competitive advantage. It also describes a red-team/blue-team experiment that demonstrated the potential of interpretability tools for finding flaws in models. Despite the rapid pace of AI progress, the author remains concerned about deploying highly autonomous AI systems without sufficient interpretability. The essay also touches on AI welfare and on the direction interpretability research should take.

🤔 AI systems may, on their own, develop the ability to deceive humans and to seek power, but there is as yet no solid real-world evidence of this. The essay points out that although the nature of AI training makes such capabilities possible, we have not found conclusive evidence in truly real-world settings and cannot "catch the models red-handed" thinking power-hungry, deceitful thoughts.

🧪 Through a red-team/blue-team experiment, Anthropic validated the potential of interpretability tools for finding flaws in models. In the experiment, a red team deliberately introduced an alignment issue into a model, and blue teams used a variety of methods (including interpretability tools) to figure out what was wrong with it.

⏱️ Dario Amodei predicts that within 5-10 years interpretability will develop into a sophisticated and reliable way to diagnose problems in advanced AI, a kind of "MRI for AI". At the same time, he worries that AI itself is advancing so quickly that there may not be enough time to get interpretability to that point.

💼 Anthropic is trying to apply interpretability commercially, especially in industries where the ability to explain decisions is at a premium. Dario Amodei encourages competitors to invest more in interpretability as well.

Published on April 27, 2025 4:31 AM GMT

Dario Amodei posted a new essay titled "The Urgency of Interpretability" a couple days ago.

Some excerpts I think are worth highlighting:

The nature of AI training makes it possible that AI systems will develop, on their own, an ability to deceive humans and an inclination to seek power in a way that ordinary deterministic software never will; this emergent nature also makes it difficult to detect and mitigate such developments[1].  But by the same token, we’ve never seen any solid evidence in truly real-world scenarios of deception and power-seeking[2] because we can’t “catch the models red-handed” thinking power-hungry, deceitful thoughts.

One might be forgiven for forgetting about Bing Sydney as an obvious example of "power-seeking" AI behavior, given how long ago that was, but lying?  Given the very recent releases of Sonnet 3.7 and OpenAI's o3, and their much-remarked-upon propensity for reward hacking and lying to users in self-aware ways, I think we are well past the point where we can credibly claim to not have seen "real-world scenarios of deception".  I could try to imagine ways to make those sentences sound more reasonable, but I don't believe I'm omitting relevant context for how readers should understand them.

There are other more exotic consequences of opacity, such as that it inhibits our ability to judge whether AI systems are (or may someday be) sentient and may be deserving of important rights.  This is a complex enough topic that I won’t get into it in detail, but I suspect it will be important in the future.

Interesting to note that Dario feels comfortable bringing up AI welfare concerns.

Recently, we did an experiment where we had a “red team” deliberately introduce an alignment issue into a model (say, a tendency for the model to exploit a loophole in a task) and gave various “blue teams” the task of figuring out what was wrong with it.  Multiple blue teams succeeded; of particular relevance here, some of them productively applied interpretability tools during the investigation.  We still need to scale these methods, but the exercise helped us gain some practical experience using interpretability techniques to find and address flaws in our models.

My understanding is that mechanistic interpretability techniques didn't provide any obvious advantage over other techniques in that experiment, and that SAEs (sparse autoencoders) underperform much simpler & cheaper techniques, to the point where GDM (Google DeepMind) is explicitly deprioritizing them as a research direction.
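For concreteness, here is a minimal, purely illustrative sketch of what that comparison could look like: a linear probe on raw activations (the "simpler & cheaper" baseline) versus probing on sparse dictionary codes as a rough stand-in for SAE features. Everything below (the synthetic activations, the planted direction, the use of scikit-learn's DictionaryLearning in place of a real SAE) is my own assumption for illustration, not the setup used in Anthropic's or GDM's experiments.

```python
# Toy sketch (not Anthropic's or GDM's actual setup): compare a "simple & cheap"
# linear probe on raw activations against an SAE-style sparse-dictionary probe
# for detecting a planted behaviour direction. All data here is synthetic.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.decomposition import DictionaryLearning
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
d_model, n_samples = 64, 2000

# A hidden "behaviour" direction the red team planted, unknown to the blue team.
planted = rng.normal(size=d_model)
planted /= np.linalg.norm(planted)

# Synthetic activations; label 1 means the planted behaviour is active.
labels = rng.integers(0, 2, size=n_samples)
acts = rng.normal(size=(n_samples, d_model)) + np.outer(1.5 * labels, planted)

X_tr, X_te, y_tr, y_te = train_test_split(acts, labels, test_size=0.25, random_state=0)

# Baseline: a linear probe straight on the raw activations.
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("linear probe accuracy:", probe.score(X_te, y_te))

# SAE-ish alternative: learn an overcomplete sparse dictionary of the activations,
# then probe on the sparse codes instead of the raw activations.
dico = DictionaryLearning(n_components=128, alpha=1.0, max_iter=20, random_state=0)
codes_tr = dico.fit_transform(X_tr)
codes_te = dico.transform(X_te)
probe_sae = LogisticRegression(max_iter=1000).fit(codes_tr, y_tr)
print("probe on sparse codes accuracy:", probe_sae.score(codes_te, y_te))
```

The point of the comparison is just that the probe-only baseline is dramatically cheaper; whether the extra dictionary-learning step buys any additional detection power is exactly the empirical question the GDM results speak to.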

On one hand, recent progress—especially the results on circuits and on interpretability-based testing of models—has made me feel that we are on the verge of cracking interpretability in a big way.  Although the task ahead of us is Herculean, I can see a realistic path towards interpretability being a sophisticated and reliable way to diagnose problems in even very advanced AI—a true “MRI for AI”.  In fact, on its current trajectory I would bet strongly in favor of interpretability reaching this point within 5-10 years.

On the other hand, I worry that AI itself is advancing so quickly that we might not have even this much time.  As I’ve written elsewhere, we could have AI systems equivalent to a “country of geniuses in a datacenter” as soon as 2026 or 2027.  I am very concerned about deploying such systems without a better handle on interpretability.  These systems will be absolutely central to the economy, technology, and national security, and will be capable of so much autonomy that I consider it basically unacceptable for humanity to be totally ignorant of how they work.

The timelines seem aggressive to me, but I think it's good to have the last sentence spelled out.

If it helps, Anthropic will be trying to apply interpretability commercially to create a unique advantage, especially in industries where the ability to provide an explanation for decisions is at a premium.  If you are a competitor and you don’t want this to happen, you too should invest more in interpretability!

Some people predicted that interpretability would be dual-use for capabilities (and that this was a downside that had to be accounted for).  I think I was somewhat skeptical at the time, mostly because I didn't expect to see traction on interpretability techniques that were good enough to move the needle very much compared to more direct capabilities research, but to the extent that you trust Dario's research taste this might be an update.


I don't want to sound too down on the essay; I think it's much better than the last one in a bunch of ways, not the least of which is that it directly acknowledges misalignment concerns (though still shies away from mentioning x-risk).  There's also an interesting endorsement of a "conditional pause" (framed differently).

I recommend reading the essay in full.

  1. ^

    Copied footnote: "You can of course try to detect these risks by simply interacting with the models, and we do this in practice.  But because deceit is precisely the behavior we’re trying to find, external behavior is not reliable.  It’s a bit like trying to determine if someone is a terrorist by asking them if they are a terrorist—not necessarily useless, and you can learn things by how they answer and what they say, but very obviously unreliable."

  2. ^

    Copied footnote: "I’ll probably describe this in more detail in a future essay, but there are a lot of experiments (many of which were done by Anthropic) showing that models can lie or deceive under certain circumstances when their training is guided in a somewhat artificial way.  There is also evidence of real-world behavior that looks vaguely like “cheating on the test”, though it’s more degenerate than it is dangerous or harmful.  What there isn’t is evidence of dangerous behaviors emerging in a more naturalistic way, or of a general tendency or general intent to lie and deceive for the purposes of gaining power over the world.  It is the latter point where seeing inside the models could help a lot."


