Weights-based interpretability is a style of Mech Interp which attempts to find mechanisms inside models purely by looking at their weights. A Mathematical Framework for Transformer Circuits is an example of this approach: its analysis of one-layer attention-only models uncovers many behaviours (such as skip trigrams) without running the model over a dataset.
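To make this concrete, here is a minimal sketch of that style of analysis, with random matrices standing in for trained weights (the shapes and names follow the paper's notation, but the values and the printed "skip trigram" are purely hypothetical). The QK circuit describes which source tokens a destination token attends to, the OV circuit describes how attending to a token shifts the output logits, and both are computed from weights alone:

```python
import numpy as np

# Minimal sketch of the one-layer "skip-trigram" analysis from
# A Mathematical Framework for Transformer Circuits, using random
# stand-ins for trained weights. No forward passes over data are needed.
rng = np.random.default_rng(0)
d_vocab, d_model, d_head = 50, 32, 8

W_E = rng.normal(size=(d_model, d_vocab))   # embedding
W_U = rng.normal(size=(d_vocab, d_model))   # unembedding
W_Q = rng.normal(size=(d_head, d_model))    # one head's query weights
W_K = rng.normal(size=(d_head, d_model))    # key weights
W_V = rng.normal(size=(d_head, d_model))    # value weights
W_O = rng.normal(size=(d_model, d_head))    # output weights

# QK circuit: how strongly each destination token attends to each source token.
qk_circuit = W_E.T @ W_Q.T @ W_K @ W_E      # (d_vocab, d_vocab)

# OV circuit: how attending to a source token changes the output logits.
ov_circuit = W_U @ W_O @ W_V @ W_E          # (d_vocab, d_vocab)

# A skip trigram "A ... B -> C" scores highly when B attends to A (QK circuit)
# and attending to A boosts the logit of C (OV circuit).
dest_token = 3
src_token = int(np.argmax(qk_circuit[dest_token]))
boosted_token = int(np.argmax(ov_circuit[:, src_token]))
print(f"skip trigram: {src_token} ... {dest_token} -> {boosted_token}")
```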
But a completely weights-based analysis of models is infeasible, even for one-layer models, as Buck argued back in early 2023 in One-layer transformers aren't equivalent to a set of skip trigrams. There needs to be some appeal to the data distribution.
Chris Olah responded to Buck's post with a clarification of how he views Mech Interp, drawing an analogy to protein binding affinities. In his view, weights-based interpretability gives us a "global story", like knowing the binding affinities of a set of proteins; we can then recover an "on-distribution story" by looking at which proteins "block" other proteins in practice. He argues that the global story is important for avoiding the interpretability illusions that come from studying narrow distributions: his motivation is understanding whether models will be safe off-distribution, which is why he took "the mechanistic approach rather than picking low-hanging fruit in correlational interpretability."
Scopes of Explanations:
One framing is that each mechanistic explanation holds over some underlying input distribution, and that these distributions lie on a continuum of breadth. At one extreme, weights-based interpretability aims for explanations that hold regardless of the underlying input distribution. At the other extreme are explanations of model behaviour on narrow distributions, such as the IOI task.
Each use case for interp demands a different scope. If we want models to be robust to adversarial inputs, we want explanations that hold regardless of the underlying input distribution. If we want to understand typical-case behaviour, we want explanations that hold on typical inputs. And if we just want to debug a specific problematic behaviour, we only need explanations that hold on the narrow distribution where that behaviour occurs.
What weights-based interpretability and narrow-distribution analysis have in common is that in both cases the description length of the underlying input distribution is small. It's small for weights-based interpretability because explanations are required to hold independently of the underlying input distribution, so there is nothing about the distribution to describe. It's small for narrow-distribution analysis because we can usually explicitly list every possible input text.
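As a toy illustration of the second case, consider a hypothetical IOI-style task (the template and name list below are illustrative, not the exact ones from the IOI paper): the entire input distribution can be written down as a short program that enumerates every prompt.

```python
from itertools import permutations

# Illustrative (hypothetical) IOI-style narrow distribution: the whole
# input distribution is a short, explicit list of prompts.
names = ["Mary", "John", "Tom", "Anne"]
template = "When {A} and {B} went to the store, {B} gave a drink to"

prompts = [template.format(A=a, B=b) for a, b in permutations(names, 2)]
print(len(prompts))   # 12 prompts -- the entire distribution, listed explicitly
print(prompts[0])
```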
Typical Case Analysis:
The trouble with constructing explanations for typical-case behaviour, which is arguably the most interesting case, is that the underlying input distribution is impossible for us to formally specify: it consists of noisy human text with endless variations.
One way to partially resolve this is to scope our explanations to a particular dataset with a wide variety of data, and then hope that explanations scoped to that dataset generalize to typical text outside it. SAEs let us find dataset-scoped variables in an unsupervised manner, and autointerp methods then attempt to automatically construct dataset-scoped explanations. But we might worry that our explanations are just features of the dataset, and so don't extend to typical inputs.
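For concreteness, here is a minimal sparse-autoencoder sketch, not any particular library's implementation, just the standard reconstruction-plus-L1 setup, with random data standing in for activations collected over a dataset. Each decoder direction is a candidate dataset-scoped variable that an autointerp pipeline would then try to explain from example texts:

```python
import torch
import torch.nn as nn

# Minimal sparse autoencoder sketch. `acts` stands in for residual-stream
# activations collected over a dataset; here it is random data for illustration.
d_model, d_sae, n_tokens = 64, 512, 4096
acts = torch.randn(n_tokens, d_model)

W_enc = nn.Parameter(torch.randn(d_model, d_sae) * 0.02)
W_dec = nn.Parameter(torch.randn(d_sae, d_model) * 0.02)
b_enc = nn.Parameter(torch.zeros(d_sae))
b_dec = nn.Parameter(torch.zeros(d_model))
opt = torch.optim.Adam([W_enc, W_dec, b_enc, b_dec], lr=1e-3)

l1_coeff = 1e-3
for step in range(200):
    f = torch.relu((acts - b_dec) @ W_enc + b_enc)   # sparse feature activations
    recon = f @ W_dec + b_dec                        # reconstruction of the activations
    loss = ((recon - acts) ** 2).mean() + l1_coeff * f.abs().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

# Each row of W_dec is a candidate "dataset-scoped variable": a direction whose
# activating examples an autointerp method would attempt to explain.
```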
We can also approach the typical case from above: pick an input distribution that is wider than the typical-case distribution, and construct explanations that hold with high probability on this wider distribution. The extreme case here is weights-based interpretability, where very little can be said. But there are potentially useful intermediates between fully specifying the typical-case distribution (impossible) and not narrowing the distribution down at all.
To illustrate this approach, consider how we might use Markov chains as an intermediate approximation. Instead of attempting to fully characterize human language (impossible) or avoiding distribution assumptions entirely (too general), we could model input text as being generated by a Markov process where each word depends only on the preceding n words. This simplified model creates a well-defined probability distribution over sequences that shares some statistical properties with natural language.
We could then:
1. Analyze model behavior on this Markov-generated distribution
2. Identify mechanisms that consistently activate across this distribution
3. Develop explanations that hold with high probability for inputs from this distribution
While human language is clearly more complex than any Markov model, we can make a reasonable heuristic argument: if natural language doesn't deviate too drastically from the typical patterns of our Markov distribution, then explanations that hold for the Markov model might extend to typical human inputs. The key advantage is that the Markov distribution is mathematically tractable in ways that the true distribution of human language is not.
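Here is a sketch of this recipe, assuming an order-n Markov model fit to a toy corpus as the intermediate distribution (the corpus is illustrative, and the transformer we would run over the samples is left implicit):

```python
import random
from collections import defaultdict

# Hedged sketch of the intermediate-distribution idea: fit an order-n Markov
# model to a corpus, then sample from it to get a tractable stand-in for
# "typical" text.
def fit_markov(tokens, n=2):
    counts = defaultdict(list)
    for i in range(len(tokens) - n):
        counts[tuple(tokens[i:i + n])].append(tokens[i + n])
    return counts

def sample_markov(counts, n=2, length=20, seed=0):
    rng = random.Random(seed)
    context = list(rng.choice(list(counts.keys())))
    out = list(context)
    for _ in range(length):
        nxt = rng.choice(counts.get(tuple(context), ["<unk>"]))
        out.append(nxt)
        context = out[-n:]
    return out

corpus = "the cat sat on the mat and the dog sat on the rug".split()
counts = fit_markov(corpus, n=2)
samples = [sample_markov(counts, n=2, length=10, seed=s) for s in range(3)]

# One could then run the transformer over `samples`, check which mechanisms
# (heads, features) activate consistently, and state explanations that hold
# with high probability over this well-defined distribution.
for s in samples:
    print(" ".join(s))
```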
Conclusion:
The tension between different scopes of interpretability reflects a fundamental property of machine learning models: they generalize impressively beyond their training data, yet remain vulnerable to carefully crafted adversarial examples. This pattern strongly suggests that typical input distributions contain underlying regularities which enable generalization while also setting its boundaries, and that these regularities are describable more compactly than the dataset itself. If models could only memorize their training data without extracting more general patterns, we wouldn't observe the out-of-distribution generalization they clearly demonstrate.