What progress have we made on automated auditing?

This post examines the problem of automating audits of model internals: given a model and a specification of intended behavior, automatically determine whether the model satisfies the spec. The author surveys progress in related research, notes that challenges remain both in theory and in practice, and calls for more researchers to pay attention to this important area.

🤔 **The challenge of auditing model internals:** Automated auditing of model internals aims to determine, automatically, whether a model satisfies a specification of intended behavior. This is closely related to static backdoor detection: determining whether a model contains a backdoor function that can transform any input into one on which the model behaves differently. In theory, for some model classes, static backdoor detection is impossible even given white-box access to the model. In practice, existing research has focused mainly on toy models, with limited results.

🤖 **Existing research progress:** Despite the theoretical and practical challenges, research on automating model-internals audits has made some progress. For example, some work tries to attack the problem by provably bounding model performance, but so far it is limited to toy models and the bounds are weak. Other work tries to detect backdoors by spotting anomalous patterns in model activations, with results that have not been ideal.

💡 **Future directions:** Future research needs to further explore automated approaches to auditing model internals, including developing more effective algorithms, building stronger theoretical frameworks, and collecting more real-world datasets.

🚧 **Connections to other safety research:** Auditing model internals is closely related to other research in model safety, such as adversarial example detection and robustness evaluation. These lines of work can learn from one another and jointly advance the field.

🔍 **Summary:** Automating audits of model internals is a challenging but very important research direction. Future research needs to further explore effective automated methods to help ensure model safety.

Published on July 6, 2024 1:49 AM GMT

One use case for model internals work is to perform automated auditing of models:

https://www.alignmentforum.org/posts/cQwT8asti3kyA62zc/automating-auditing-an-ambitious-concrete-technical-research

That is, given a specification of intended behavior, the attacker produces a model that doesn't satisfy the spec, and the auditor needs to determine how the model fails to satisfy it. This is closely related to static backdoor detection: given a model M, determine whether there exists a backdoor function that, for any input, transforms that input into one where M behaves differently.[1]
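
To make the game concrete, here's a minimal interface sketch in Python; all the names (Spec, audit, BackdoorFn) are mine, not from the linked post:

```python
from typing import Callable, Optional

# A spec judges whether the model's output on a given input is acceptable.
Spec = Callable[[str, str], bool]   # (input, output) -> passes?
Model = Callable[[str], str]        # white-box in practice; simplified here

def audit(model: Model, spec: Spec) -> Optional[str]:
    """The auditor's job: return a concrete input on which the model
    violates the spec, or None if it believes no violation exists.
    Filling in this function for real models and specs is the open problem."""
    raise NotImplementedError

# Static backdoor detection is the related decision problem: does there
# exist a backdoor function mapping any input x to a nearby x' with
# model(x') != model(x)?
BackdoorFn = Callable[[str], str]
```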

There's some theoretical work (Goldwasser et al. 2022) arguing that for some model classes, static backdoor detection is impossible even given white-box model access -- specifically, they prove their results for random feature regression and the (very similar) setting of wide 1-layer ReLU networks.
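
To unpack what the impossibility claim is about, here's a rough paraphrase of the Goldwasser et al. setup (from memory, so treat the details as approximate): a backdoor is a pair of efficient algorithms, one that outputs a backdoored model $\tilde{M}$ together with a secret key $\mathsf{bk}$, and one that uses the key to perturb inputs.

```latex
% Paraphrase of the Goldwasser et al. (2022) backdoor definition; details approximate.
% Activation: for (almost) every input x, the key lets you find a nearby x'
% on which the backdoored model's behavior flips:
\[
x' = \mathsf{Activate}(x, \mathsf{bk}), \qquad
\|x' - x\| \le \epsilon, \qquad
\tilde{M}(x') \ne \tilde{M}(x).
\]
% White-box undetectability: even given the full weights, no efficient
% auditor can distinguish backdoored models from honestly trained ones:
\[
\tilde{M} \;\approx_{c}\; M .
\]
```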

Relatedly, there's been some work on provably bounding model performance (Gross et al. 2024) -- if this succeeded on "real" models and "real" specifications, it would solve the automated auditing game. But the results so far are on toy transformers, and are quite weak in general (in part because the task is so difficult).[2]
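
As a toy illustration of what "provably bounding performance" means: on a small enough discrete domain you can get a bound by sheer enumeration, and the point of the proofs line of work is to certify bounds far more cheaply than this brute force. The model and task below are stand-ins of my own, not the setup from Gross et al.:

```python
import itertools
import torch

VOCAB_SIZE, SEQ_LEN = 8, 3  # tiny enough to enumerate the whole domain

def certified_accuracy(model: torch.nn.Module) -> float:
    """Brute-force 'proof' of accuracy on a max-of-3 task: check every
    input in the finite domain. Real proof-based bounds aim to be much
    cheaper to verify than this enumeration."""
    correct, total = 0, 0
    for x in itertools.product(range(VOCAB_SIZE), repeat=SEQ_LEN):
        pred = model(torch.tensor(x, dtype=torch.float32)).argmax().item()
        correct += int(pred == max(x))
        total += 1
    return correct / total

# Usage: an untrained stand-in model scores near chance (~1/VOCAB_SIZE).
model = torch.nn.Sequential(
    torch.nn.Linear(SEQ_LEN, 32), torch.nn.ReLU(), torch.nn.Linear(32, VOCAB_SIZE)
)
print(f"certified accuracy >= {certified_accuracy(model):.3f}")
```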

As this is machine learning, just because something is impossible in theory and difficult on toy models doesn't mean we can't do it in practice. It seems plausible to me that there are non-zero empirical results on automatically auditing model internals. So I'm curious: how much progress have we made on automated auditing empirically? What work exists in this area? What does the state of the art in automated auditing look like?

 

  1. ^

    Note that I'm not asking about mechanistic anomaly detection/dynamic backdoor detection; I'm aware that it's pretty easy to distinguish whether a particular example is backdoored using baseline techniques like "fit a Gaussian density on activations and look at the log prob of the activations on each input" (see the sketch after these footnotes).

  2. ^

    I'm also aware of some of the work in the trojan detection space, including the 2023 Trojan Detection Challenge, where performance on extracting embedded triggers was little better than chance.
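
For reference, the baseline mentioned in footnote 1 looks roughly like this (a sketch; the layer choice, shapes, and threshold are all placeholders of mine):

```python
import numpy as np

def fit_gaussian(clean_acts: np.ndarray):
    """Fit a multivariate Gaussian to activations from known-clean inputs.
    clean_acts: (n_examples, d_model) activations from some fixed layer."""
    mu = clean_acts.mean(axis=0)
    # A small ridge term keeps the covariance invertible.
    cov = np.cov(clean_acts, rowvar=False) + 1e-4 * np.eye(clean_acts.shape[1])
    return mu, cov

def log_prob(acts: np.ndarray, mu: np.ndarray, cov: np.ndarray) -> np.ndarray:
    """Gaussian log-density of each row of acts; backdoored inputs tend
    to land in the low-probability tail."""
    d = mu.shape[0]
    diff = acts - mu
    _, logdet = np.linalg.slogdet(cov)
    maha = np.einsum("nd,de,ne->n", diff, np.linalg.inv(cov), diff)
    return -0.5 * (maha + logdet + d * np.log(2 * np.pi))

# Usage with placeholder data: flag inputs whose log-prob falls far below
# the clean distribution's typical range.
clean = np.random.randn(1000, 64)
mu, cov = fit_gaussian(clean)
suspicious = log_prob(np.random.randn(5, 64) + 4.0, mu, cov)  # shifted -> low log-prob
print(suspicious)
```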


