LessWrong, July 29, 2024

An Interpretability Illusion from Population Statistics in Causal Analysis


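This article examines a misleading phenomenon that can arise in interpretability analysis: an interpretability illusion. Through two case studies, the author shows that causal analysis based on a single dataset can lead to wrong conclusions. Even when an intervention appears to have a predictable effect on a metric over a given dataset, it may be effective on only a subset of that dataset and have no effect on the rest.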

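🤔**The nature of the illusion**: The core claim is that causal analysis on one specific dataset can mislead. An intervention that appears to shift a metric predictably on the whole dataset may in fact act on only a subset of it and leave the other subsets unchanged.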

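🤖**Case study: an attention-out SAE feature in the IOI task**: The first case study shows the illusion in practice. The feature appeared to help on the IOI data as a whole, but it actually mattered only on the BABA variant.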

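🧭**Case study: steering vectors in model evaluation**: The author also analyzes steering vectors and points out that evaluating them by a population-average probability can mask sample-level differences. Comparing population-level with sample-level steering shows that different samples steer differently, and that some steer in the opposite direction from the one intended.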

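💡**Takeaways**: Interpretability analyses should choose their datasets carefully and account for the diversity of the data. The author suggests running causal analyses on multiple datasets to check that conclusions are reliable, and notes that some earlier mechanistic interpretability work may need to be re-evaluated in light of this illusion.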

Published on July 29, 2024 2:50 PM GMT

This is an informal note on an interpretability illusion I've personally encountered, twice, in two different settings.

Causal Analysis

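Let f be a model, I an intervention on it (for example, an ablation), M a downstream metric, and D a dataset. The hypothesis is that applying I changes the metric value m in a predictable way, for example by lowering it. To test this, we compute

m_{clean} = M(f; D)

m_{ablate} = M(I(f); D)

and take the hypothesis to hold if

m_{clean} > m_{ablate}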
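The comparison above can be sketched in a few lines of Python. This is only a toy illustration: the linear model `f`, the `ablate` helper, and the mean-output `metric` are all hypothetical stand-ins, not anything from the original experiments.

```python
import numpy as np

def metric(model, dataset):
    """M(f; D): mean model output over the dataset (stand-in for e.g. a logit difference)."""
    return float(np.mean([model(x) for x in dataset]))

def ablate(model, index):
    """I(f): wrap the model so that input component `index` is zeroed before the call."""
    def ablated(x):
        x = np.array(x, dtype=float)  # copy, so the caller's data is untouched
        x[index] = 0.0
        return model(x)
    return ablated

# Hypothetical toy model f: a fixed linear readout.
weights = np.array([2.0, 0.5, -1.0])

def f(x):
    return float(weights @ np.asarray(x, dtype=float))

rng = np.random.default_rng(0)
D = rng.uniform(0.5, 1.5, size=(200, 3))  # all-positive toy inputs

m_clean = metric(f, D)              # M(f; D)
m_ablate = metric(ablate(f, 0), D)  # M(I(f); D)
print(m_clean, m_ablate, m_clean > m_ablate)
```

Note that the only thing this check looks at is a pair of dataset-level averages, which is exactly what leaves room for the illusion.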

Dataset Choice Admits Illusions

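Consider an alternative dataset D^{alt} on which

m_{clean} \approx m_{ablate}

If we evaluate on the combined dataset

D' = D \cup D^{alt}

with both parts weighted equally, the metric is

m' = 0.5 m + 0.5 m^{alt}

so the clean value still exceeds the ablated value on D', even though the intervention has no effect on D^{alt}.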

Therefore, even if the hypothesis seems true, it may actually be true only for a slice of the data that we test the model on.
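A small numeric sketch makes the point concrete. The per-example values below are made up for illustration (they are not results from any experiment), and the two slices are assumed to be the same size so that the pooled metric is a 0.5/0.5 mix.

```python
import numpy as np

# Hypothetical per-example metric values, chosen by hand.
# On D the ablation lowers the metric; on D_alt it changes nothing.
clean_D = np.array([3.0, 2.8, 3.2, 3.0])
ablate_D = np.array([1.0, 0.8, 1.2, 1.0])
clean_alt = np.array([2.0, 2.1, 1.9, 2.0])
ablate_alt = clean_alt.copy()

def effect(clean, ablated):
    """Difference of dataset-level means: m_clean - m_ablate."""
    return float(clean.mean() - ablated.mean())

effect_D = effect(clean_D, ablate_D)        # large effect on D
effect_alt = effect(clean_alt, ablate_alt)  # no effect on D_alt
effect_pooled = effect(
    np.concatenate([clean_D, clean_alt]),
    np.concatenate([ablate_D, ablate_alt]),
)                                           # still clearly positive on D'

print(effect_D, effect_alt, effect_pooled)
```

The pooled effect is just the average of the two slice effects, so it stays positive while hiding the fact that one slice contributes nothing. Reporting the effect per slice, and not only on the union, is one cheap way to catch this.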

Case Studies

Attention-out SAE feature in IOI

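In recent work with Jacob Drori, we found an attention-out SAE feature that looked important for IOI when evaluated over all of the IOI data, but that turned out to matter only on the BABA variant.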

Steering Vectors

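Previous work on steering vectors evaluates steering with a population-average probability. In recent work on steering vectors, we found that this average can hide sample-level differences.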

As an example, here's a comparison of population-level steering and sample-level steering on the believes-in-gun-rights dataset (from Model-written Evals).

While the population-level statistics show a smooth increase, the sample-level statistics tell a more interesting story; different examples steer differently, and in particular there seem to be a significant fraction where steering actually works the opposite of how we'd like.
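To see how a smooth population curve can coexist with a sizeable anti-steering fraction, here is a minimal simulation. Everything in it is an assumption made for illustration: the data is synthetic (not the believes-in-gun-rights results), each sample is assumed to respond linearly to the steering multiplier, and the slope distribution is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 500
multipliers = np.linspace(-2.0, 2.0, 9)  # steering coefficients

# Assumption: each sample responds linearly, with its own slope and offset.
slopes = rng.normal(loc=1.0, scale=1.5, size=n_samples)
base = rng.normal(size=n_samples)

# propensity[i, j]: metric for sample i at steering multiplier j
propensity = base[:, None] + slopes[:, None] * multipliers[None, :]

# Population-level view: average over samples at each multiplier.
population_curve = propensity.mean(axis=0)

# Sample-level view: fit one line per sample and look at the slope sign.
per_sample_slope = np.polyfit(multipliers, propensity.T, deg=1)[0]
frac_anti = float((per_sample_slope < 0).mean())

print("population curve increasing:", bool(np.all(np.diff(population_curve) > 0)))
print("fraction steering the wrong way:", frac_anti)
```

The population curve only reflects the mean slope, so it increases as long as that mean is positive, regardless of how many individual samples have negative slopes.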

Conclusion

There is an extensive and growing literature on interpretability illusions, but I don't think I've heard other people talk about this particular one before. It's also quite plausible that some previous mech interp work needs to be re-evaluated in light of this illusion.
