LessWrong, August 6, 2024
Mechanistic Anomaly Detection Research Update

 

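The EleutherAI interpretability team built on Neel Nanda's attribution patching technique to pioneer new methods for detecting anomalous behavior in language models, but these methods did not consistently outperform non-mechanistic baselines. The team found that methods evaluating entire batches of test data improve anomaly detection performance, and that adversarial examples in image classifiers are relatively easy to detect with off-the-shelf techniques, although the adversarial robustness of the anomaly detectors themselves was not tested.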

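🥇 The EleutherAI interpretability team used Neel Nanda's attribution patching technique to explore new mechanistic methods for detecting anomalous behavior in language models, but these methods do not in all cases outperform non-mechanistic baselines that look only at activations.

💪 Methods that evaluate entire batches of test data achieve better anomaly detection performance than methods that consider test points one at a time. The team achieved very good performance on many, but not all, tasks.

🎯 Adversarial examples in image classifiers are relatively easy to detect with off-the-shelf techniques, although the adversarial robustness of the team's own anomaly detectors was not tested.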

Published on August 6, 2024 10:33 AM GMT

Over the last few months, the EleutherAI interpretability team pioneered novel, mechanistic methods for detecting anomalous behavior in language models based on Neel Nanda's attribution patching technique. Unfortunately, none of these methods consistently outperform non-mechanistic baselines that look only at activations.
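For readers unfamiliar with attribution patching, here is a minimal sketch of the general idea: it estimates the effect of swapping one activation for another using a first-order (gradient times difference) approximation, instead of rerunning the model once per patch. The toy `metric` function, the weights, and the activation values below are all hypothetical and chosen only for illustration. This is not the detector implemented in the cupbearer repository, and it does not show how the team turns these scores into anomaly scores.

```python
import math

def metric(acts, weights):
    """Toy scalar output computed from a vector of activations (hypothetical model)."""
    return math.tanh(sum(w * a for w, a in zip(weights, acts)))

def metric_grad(acts, weights):
    """Analytic gradient of `metric` with respect to each activation."""
    out = metric(acts, weights)
    return [(1.0 - out * out) * w for w in weights]

def attribution_patching(clean, patch, weights):
    """First-order estimate of the effect of patching each activation.

    Needs one gradient evaluation in total: (patch - clean) * d(metric)/d(activation).
    """
    grad = metric_grad(clean, weights)
    return [(p - c) * g for c, p, g in zip(clean, patch, grad)]

def activation_patching(clean, patch, weights):
    """Exact effect of patching each activation, one forward pass per activation."""
    base = metric(clean, weights)
    effects = []
    for i, p in enumerate(patch):
        patched = list(clean)
        patched[i] = p
        effects.append(metric(patched, weights) - base)
    return effects

# Hypothetical values, small enough that the linear approximation is accurate.
weights = [0.5, -0.3, 0.8]
clean = [0.2, 0.1, -0.4]
patch = [0.25, 0.0, -0.35]

approx = attribution_patching(clean, patch, weights)
exact = activation_patching(clean, patch, weights)

for i, (a, e) in enumerate(zip(approx, exact)):
    print(f"activation {i}: attribution estimate {a:+.5f}, exact effect {e:+.5f}")
```

The approximation is only as good as the local linearity of the metric: for large differences between the clean and patched activations, the two columns printed above would diverge.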

We find that methods that evaluate entire batches of test data achieve better anomaly detection performance than methods that consider test points one at a time. We achieve very good performance on many, but not all, of the tasks we looked at.

We also find that it is relatively easy to detect adversarial examples in image classifiers with off-the-shelf techniques, although we did not test whether our anomaly detectors are themselves adversarially robust.

Thanks to @David Johnston and Arkajyoti Chakraborty for all their hard work on this project, as well as @Erik Jenner for fruitful discussion, ideas, and code!

Code: https://github.com/EleutherAI/cupbearer/tree/attribution_detector


