Distribution dependence in Mech Interp

This article examines the trade-off in model interpretability research between weights-based methods and methods that depend on the data distribution. Weights-based analysis tries to find internal mechanisms from the model weights alone, but relying entirely on weight analysis is infeasible. The article argues that models should be understood in combination with the data distribution, much like protein binding affinities: first obtain a global, weights-level view, then pin down what actually happens on real data. Different use cases demand explanations of different scope; for instance, adversarial inputs call for broader explanations, while debugging a specific problem only requires attention to a narrow distribution. The article also discusses the challenges of typical-case analysis and proposes using Markov chains as an intermediate approximation for analysing model behaviour on a broader but tractable input distribution.

⚖️ Weights-based interpretability aims for explanations that are independent of the underlying input distribution, while narrow-distribution analysis focuses on explicitly listing every possible input text; what the two have in common is that the description length of the underlying input distribution is small in both cases.

🧩 The challenge for typical-case analysis is that we cannot formally specify the underlying input distribution, since it consists of noisy human text with endless variation. One partial solution is to scope explanations to a particular dataset with a wide variety of data, and hope that those explanations generalize to typical text outside the dataset.

🔗 The article proposes a middle ground: use a Markov chain as an intermediate approximation. By analysing model behaviour on the Markov-generated distribution, identifying mechanisms that activate consistently, and developing explanations that hold on that distribution, we can strike a balance between fully specifying the true distribution of human language (impossible) and avoiding distributional assumptions entirely (too broad).

Published on May 13, 2025 3:45 PM GMT

Weights-based interpretability is a style of Mech Interp which attempts to find mechanisms inside models just by looking at their weights. A Mathematical Framework for Transformer Circuits is an example of this approach: its analysis of one-layer models uncovers many behaviours without running the model over a dataset.

But a completely weights-based analysis of models is infeasible, even for one-layer models, as Buck argued back in early 2023 in One Layer Transformers aren't equivalent to a set of skip trigrams. There needs to be some appeal to the data distribution.

Chris Olah responded to Buck's post with a clarification of how he views Mech Interp, drawing an analogy to protein binding affinities. In his view, we should think of weights-based interpretability as a "global story", like knowing the binding affinities of proteins. We can then recover an "on-distribution story" by looking at which proteins "block" other proteins in practice. He argues that the global story is important for avoiding the interpretability illusions that come from studying narrow distributions. His motivation is understanding whether models will be safe off-distribution, which is why he took "the mechanistic approach rather than picking low-hanging fruit in correlational interpretability."

Scopes of explanations:

One framing is to place mechanistic explanations on a continuum according to the underlying input distribution over which they hold. Weights-based interpretability aims for explanations that hold regardless of the underlying input distribution. At the opposite extreme are explanations of model behaviour on narrow distributions, such as the IOI task.

Each use case for interp demands different scopes. If we want models to be robust to adversarial inputs, we want explanations that hold regardless of the underlying input distribution. If we want to understand the typical-case behaviour, we want explanations that hold on typical inputs. And if we just want to debug a specific problematic behaviour, we only need explanations that hold for the narrow distribution where the problematic behaviour occurs.

What weights-based interpretability and narrow distribution analysis have in common is that in both cases the description length of the underlying input distribution is small. It's small for weights-based interpretability because explanations are required to be independent of the underlying input distribution. It's small for narrow distribution analysis because we can usually explicitly list every possible input text.
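To make the narrow-distribution case concrete, here is a minimal sketch of how an IOI-style distribution can be enumerated exhaustively. The template and word lists below are illustrative stand-ins rather than the exact IOI dataset specification:

```python
# Illustrative sketch (not the exact IOI dataset spec): a narrow distribution
# built from one template and small word lists can be enumerated exhaustively,
# so the description length of the distribution is tiny.
from itertools import permutations, product

TEMPLATE = "When {A} and {B} went to the {place}, {B} gave a {obj} to {A}"
NAMES = ["Mary", "John", "Tom", "Anna"]
PLACES = ["store", "park"]
OBJECTS = ["drink", "book"]

prompts = [
    TEMPLATE.format(A=a, B=b, place=place, obj=obj)
    for (a, b), place, obj in product(permutations(NAMES, 2), PLACES, OBJECTS)
]

print(len(prompts))   # 12 ordered name pairs * 2 places * 2 objects = 48 prompts
print(prompts[0])     # "When Mary and John went to the store, John gave a drink to Mary"
```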

Typical Case Analysis:

The trouble with constructing explanations for typical-case behaviour, which is arguably the most interesting case, is that the underlying input distribution is impossible for us to formally specify. It consists of noisy human text with endless variations.

One way to partially resolve this is to scope our explanations to a particular dataset with a wide variety of data, and then hope that our dataset-scoped explanations generalize to typical text from outside the dataset. Sparse autoencoders (SAEs) let us find dataset-scoped variables in an unsupervised manner, and autointerp methods then attempt to automatically construct dataset-scoped explanations. But we might worry that our explanations are merely features of the dataset, and so don't extend to typical inputs.
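As a rough illustration of the SAE step, here is a minimal sketch in PyTorch. It assumes we already have a tensor of model activations collected from the dataset we are scoping to; the architecture and hyperparameters are placeholders, not any particular published SAE setup:

```python
# Minimal sparse-autoencoder sketch (PyTorch). `acts` stands in for an
# (n_samples, d_model) tensor of activations collected from the dataset
# we are scoping explanations to; here it is random placeholder data.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.enc = nn.Linear(d_model, d_hidden)
        self.dec = nn.Linear(d_hidden, d_model)

    def forward(self, x):
        feats = torch.relu(self.enc(x))   # sparse, dataset-scoped "variables"
        return self.dec(feats), feats

d_model, d_hidden, l1_coeff = 512, 4096, 1e-3
sae = SparseAutoencoder(d_model, d_hidden)
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)
acts = torch.randn(4096, d_model)         # placeholder for real activations

for step in range(200):
    recon, feats = sae(acts)
    # reconstruction error plus an L1 penalty that encourages sparse features
    loss = ((recon - acts) ** 2).mean() + l1_coeff * feats.abs().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```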

We can also attempt to approach the typical case from above: pick an input distribution that is wider than the typical-case distribution, and construct explanations that hold with high probability on this wider distribution. The extreme case here is weights-based interpretability, where very little can be said. But there are potentially useful intermediates between fully specifying the typical-case distribution (impossible) and not narrowing the distribution down at all.

To illustrate this approach, consider how we might use Markov chains as an intermediate approximation. Instead of attempting to fully characterize human language (impossible) or avoiding distribution assumptions entirely (too general), we could model input text as being generated by a Markov process where each word depends only on the preceding n words. This simplified model creates a well-defined probability distribution over sequences that shares some statistical properties with natural language. 
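As a sketch of what this intermediate approximation could look like in practice (the corpus here is a toy placeholder; a real analysis would fit the chain on a large text dataset):

```python
# Sketch of the intermediate approximation: fit an order-n Markov chain over
# words from some corpus, then sample inputs from it.
import random
from collections import defaultdict

def fit_markov(corpus_tokens, n=2):
    """Count transitions from each n-word context to the next word."""
    counts = defaultdict(lambda: defaultdict(int))
    for i in range(len(corpus_tokens) - n):
        ctx = tuple(corpus_tokens[i:i + n])
        counts[ctx][corpus_tokens[i + n]] += 1
    return counts

def sample(counts, n=2, length=20, seed_ctx=None):
    """Generate a sequence by repeatedly sampling the next word given the last n."""
    ctx = seed_ctx or random.choice(list(counts))
    out = list(ctx)
    for _ in range(length):
        nxt = counts.get(tuple(out[-n:]))
        if not nxt:
            break
        words, freqs = zip(*nxt.items())
        out.append(random.choices(words, weights=freqs)[0])
    return " ".join(out)

tokens = "the cat sat on the mat and the dog sat on the rug".split()  # toy corpus
chain = fit_markov(tokens, n=2)
print(sample(chain, n=2))
```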

We could then: 

1. Analyze model behavior on this Markov-generated distribution 

2. Identify mechanisms that consistently activate across this distribution (a rough sketch of these first two steps follows the list) 

3. Develop explanations that hold with high probability for inputs from this distribution 
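Here is a rough sketch of what steps 1 and 2 might look like, using GPT-2 via Hugging Face transformers purely for illustration; the layer, threshold, and notion of "consistently activates" are stand-in choices rather than a canonical recipe:

```python
# Sketch of steps 1-2: run the model over Markov-generated inputs and flag
# hidden dimensions that exceed a threshold on most token positions.
# `markov_samples` would come from the Markov chain sketched above.
import torch
from transformers import GPT2Tokenizer, GPT2Model

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2").eval()

markov_samples = ["the cat sat on the mat", "the dog sat on the rug"]  # placeholder
layer, threshold = 6, 2.0
active_fraction = []

with torch.no_grad():
    for text in markov_samples:
        ids = tok(text, return_tensors="pt")
        hidden = model(**ids, output_hidden_states=True).hidden_states[layer][0]
        # fraction of token positions where each dimension exceeds the threshold
        active_fraction.append((hidden > threshold).float().mean(dim=0))

mean_active = torch.stack(active_fraction).mean(dim=0)
consistent = (mean_active > 0.9).nonzero().flatten()
print("dimensions active on >90% of Markov-generated tokens:", consistent.tolist())
```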

While human language is clearly more complex than any Markov model, we can make a reasonable heuristic argument: if natural language doesn't deviate too drastically from the typical patterns of our Markov distribution, then explanations that hold for the Markov model might extend to typical human inputs. The key advantage is that the Markov distribution is mathematically tractable in ways that the true distribution of human language is not.
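To make this heuristic slightly more precise, suppose (as an assumption introduced here, not something specified above) that we measure closeness in total variation distance. If an explanation holds on a set of inputs $E$ with $P(E) \geq 1 - \epsilon$ under the Markov distribution $P$, and the true distribution of human text $Q$ satisfies $d_{TV}(P, Q) \leq \delta$, then $Q(E) \geq P(E) - \delta \geq 1 - \epsilon - \delta$. So the explanation continues to hold with high probability on $Q$ whenever both $\epsilon$ and $\delta$ are small; the heuristic step is precisely the assumption that $\delta$ is small.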

Conclusion: 

The tension between different scopes of interpretability reflects a fundamental property of machine learning models. These models demonstrate an impressive ability to generalize beyond their training data, yet remain vulnerable to carefully crafted adversarial examples. This pattern strongly suggests the existence of underlying regularities in typical input distributions that enable generalization while simultaneously setting its boundaries. These regularities must be describable in a way that is more compact than the dataset itself. If the model could only memorize the training data without extracting more generalizable patterns, we wouldn't observe the out-of-distribution generalization capabilities that these models clearly demonstrate.



