LessWrong · July 16, 2024
Mech Interp Lacks Good Paradigms


Published on July 16, 2024 3:47 PM GMT

Note: I wrote this post rather quickly as an exercise in sharing rough / unpolished thoughts. I am also not an expert on some of the things I've written about. If you spot mistakes or would like to point out missed work / perspectives, please feel free! 

Note 2: I originally sent this link to some people for feedback, but I'm having trouble viewing the comments on the draft. The post is also in a reasonably complete state, so I've decided to just publish it. If you're one of those people, feedback is still very much welcome in the actual comments section! 

Mechanistic Interpretability (MI) is a popular and rapidly growing field of technical AI safety research. As a field, it's extremely accessible, requiring comparatively few computational resources, and it facilitates rapid learning thanks to a very short feedback loop. This means that many junior researchers' first foray into AI safety research is in MI (myself included); indeed, this happens to the extent that some people feel MI is over-subscribed relative to other technical agendas. However, how useful is this MI research?  

A very common claim on MI's theory of impact (ToI) is that MI helps us advance towards a "grand unifying theory" (GUT) of deep learning. One of my big cruxes for this ToI is whether MI admits "paradigms" which facilitate correct thinking and understanding of the models we aim to interpret. 

In this post, I'll critically examine several leading candidates for "paradigms" in MI, considering the available evidence for / against, and identifying important flaws or holes which point to good future research (IMO). At the end, I'll conclude with a summary of the main points and an overview of the technical research items I've outlined. 

Towards a Grand Unifying Theory (GUT) with MI

Proponents of this argument believe that, by improving our basic understanding of neural nets, MI yields valuable insights that can be used to improve our agents, e.g. by improving architectures or by improving their training processes. This allows us to make sure future models are safe and aligned. 

Some people who have espoused this opinion: 

As a concrete example of work that I think would not have been possible without fundamental insights from MI: steering vectors (a.k.a. representation engineering) and circuit breakers, which were clearly inspired by the wealth of MI work demonstrating the linear representation hypothesis. 
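To make this concrete, here is a minimal numpy sketch of the idea behind activation steering, using the common difference-of-means recipe: compute a "concept direction" from contrastive activations, then add it to a hidden state at inference time. All names and numbers here are illustrative stand-ins, not any particular library's API.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 16

# Stand-in activations: mean hidden states on "positive" vs "negative" prompts.
pos_acts = rng.normal(0.0, 1.0, size=(8, d_model)) + 2.0  # shifted cluster
neg_acts = rng.normal(0.0, 1.0, size=(8, d_model))

# The steering vector is the difference of class means (the contrastive recipe).
steering_vec = pos_acts.mean(axis=0) - neg_acts.mean(axis=0)

def steer(hidden_state: np.ndarray, vec: np.ndarray, alpha: float = 1.0) -> np.ndarray:
    """Add a scaled concept direction to a residual-stream activation."""
    return hidden_state + alpha * vec

h = rng.normal(size=d_model)
h_steered = steer(h, steering_vec, alpha=2.0)

# Steering moves the activation along the "positive" direction by exactly
# alpha times the vector's norm.
unit = steering_vec / np.linalg.norm(steering_vec)
print(h_steered @ unit - h @ unit)
```

Note that this only works at all because the concept is (by the LRH) encoded as a direction; that is the sense in which the technique was "obviously inspired" by the hypothesis.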

It's also important to remember that the value of fundamental science often seems much lower in hindsight, because humans quickly adjust their perspectives. Even if MI insights seem like common sense to us nowadays, their value in enabling significant advances can't be overstated. 

(Aside) A corollary of this argument is that MI could likely have significant capabilities externalities. Becoming better at building powerful and instruction-aligned agents may inadvertently accelerate us towards AGI. This point has been made in depth elsewhere, so I won't elaborate further here. 

A GUT Needs Paradigms

In his seminal book, The Structure of Scientific Revolutions, Thomas Kuhn catalogues scientific progress in many different fields (spanning physics, chemistry, and biology), and distills general trends about how these fields progress. Central to his analysis is the notion of a "paradigm" - an overarching framework for thinking about a field. Kuhn argues that the establishment of accepted paradigms is a sign of "maturity" in the development of a field. 

Paradigms Are Instrumental for Progress

In the absence of a paradigm, it's very hard to draw the right conclusions from data, for two reasons. 

Multiple hypotheses could explain the data. Kuhn argues that, in the absence of a paradigm, a reasonable researcher might reach "any one of a number of incompatible conclusions". For example, we might incorporate variables with no actual predictive power into our explanation, like an ancient guru using the motions of stars to predict the future. The variables we choose to use are a function of our prior experience in other fields and of "accidents" in the process of our investigation. 

We may not have sufficiently good mental abstractions to understand what we're seeing. 

A famous thought experiment in neuroscience considers what results popular interpretability techniques would yield on microprocessors, which are an example of a complex information-processing system that we understand at all levels. Techniques such as "lesion experiments" (a.k.a. activation patching) completely fail to elucidate the underlying structure - without a more advanced interpretation, MI experiments aren't very useful. 
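For readers unfamiliar with the technique: activation patching itself is mechanically simple, and the thought experiment's point is precisely that the mechanics alone don't produce understanding. A toy numpy sketch, with a two-layer MLP standing in for a model (all names illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 4))
W2 = rng.normal(size=(4, 1))

def forward(x, patched_hidden=None):
    """Tiny 2-layer MLP; optionally overwrite the hidden layer (the 'patch')."""
    h = np.maximum(0.0, x @ W1)   # ReLU hidden layer
    if patched_hidden is not None:
        h = patched_hidden        # lesion/patch: swap in a cached activation
    return (h @ W2).item()

clean_x = np.array([1.0, 0.5, -0.3, 0.8])
corrupt_x = np.array([-1.0, 0.2, 0.9, -0.5])

clean_h = np.maximum(0.0, clean_x @ W1)   # cache the clean hidden activation
clean_out = forward(clean_x)
corrupt_out = forward(corrupt_x)
patched_out = forward(corrupt_x, patched_hidden=clean_h)

# If the patched component carries all task-relevant information, patching
# restores the clean output (exactly true here: there is only one path).
print(np.isclose(patched_out, clean_out))
```

In a real network one measures what fraction of the clean behaviour is restored; the microprocessor result is that such measurements, by themselves, don't tell you what the circuit is doing.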

As another example of how an absence of paradigms leads to illusions, consider Lucius Bushnaq's thought experiment on interpreting a hypothetical feature tracking the entropy of a physical system:

It seems to sort of activate more when the system is warmer. But that's not all it's doing. Sometimes it also goes up when two separated pockets of different gases mix together, for example. Must be polysemantic.

In a pre-paradigmatic field, Kuhn argues that all research amounts to no more than a "fact-gathering" exercise conducted mostly at random. The value of a paradigm is in forcing concerted and systematic study of a common set of phenomena in a standardised way, facilitating subsequent progress in the field. Furthermore, by concentrating attention on "mysteries" which the paradigm fails to explain, a paradigm sets up subsequent work to find the next paradigm. 

Three Desiderata for Paradigms

In my view, which differs slightly from Kuhn's treatment, a paradigm has three important properties.

1. A paradigm has good epistemics, i.e. we believe it to be true because it explains existing data very well, or because it is strongly implied by other things we assume to be true.
2. A paradigm is general, i.e. it applies to many seemingly-distinct cases with a high degree of accuracy. For example, Newton's law of gravitation explains both the motion of celestial bodies and the trajectories of thrown objects.
3. A paradigm is open-ended, i.e. it can easily be used as a building block for future work. For example, the results associated with a paradigm may result in useful practical applications. (Note: this necessitates a minimum degree of rigour.) Alternatively, attempting to validate the premises associated with a paradigm might point the way to very informative experiments.

In the subsequent discussion, I will continually return to these criteria for evaluating subsequent paradigms (or "precursor" paradigms). 

Examining Paradigms in Mechanistic Interpretability

Our most "paradigm-y" things at the moment include:

TL;DR: my view is that these all fall short in some way. The first actually is a great paradigm, just insufficient for what we want. The latter two are not sufficiently rigorous to serve as building blocks for theory.  

A Mathematical Framework for Transformer Circuits 

The mathematical framework blogpost, published by Anthropic in 2021, is a seminal example of what I consider to be a great paradigm in MI, perhaps my favourite, that has pushed the field forward a lot.

An overview:

Epistemics: 5/5 

Generality: 5/5

Open-Endedness: 5/5

Note there are still some minor problems re: epistemics. 

However, I think these holes in the theory mostly amount to "book-keeping" and can be resolved with some technical care. I also think these holes don't matter much in practice, and the wealth of successful work on analyzing transformer circuits is significant evidence in favour of this point.  

Instead, my main criticism is that "A Mathematical Framework" is not high-level enough. As excellent as it is, this paradigm feels like a level of abstraction "below" what we want. 

Concrete analogy: answering biological questions with chemistry. If we think of a biological organism (a.k.a. a model), "Mathematical Framework" is like a complete description of its chemistry (a.k.a. its circuitry). The latter is definitely foundational and informationally complete w.r.t. the former. At the same time, it is totally insufficient for answering higher-level questions like "does it have goals?" 
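To illustrate the "chemistry level" the framework operates at: its core observation is that, because attention heads read from and write to the residual stream linearly, the output decomposes exactly into a direct path plus one term per head. A numpy sketch of a one-layer attention-only model with a fixed attention pattern (all dimensions illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n_tok, d_model, n_heads, d_head = 3, 8, 2, 4

x = rng.normal(size=(n_tok, d_model))        # residual stream
W_U = rng.normal(size=(d_model, 10))         # unembedding

heads = []
for _ in range(n_heads):
    W_V = rng.normal(size=(d_model, d_head))
    W_O = rng.normal(size=(d_head, d_model))
    A = np.full((n_tok, n_tok), 1.0 / n_tok)  # fixed (frozen) attention pattern
    heads.append(A @ x @ W_V @ W_O)           # this head's OV-path output

# Full model output: direct path plus the sum of per-head contributions.
resid_out = x + sum(heads)
logits = resid_out @ W_U

# The decomposition is exact: logits split linearly into path terms.
direct_logits = x @ W_U
head_logits = [h @ W_U for h in heads]
print(np.allclose(logits, direct_logits + sum(head_logits)))
```

The decomposition is exact only once the attention pattern is held fixed, which is precisely the framework's move of separating the (nonlinear) QK circuit from the (linear) OV circuit.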

The Linear Representation Hypothesis

An overview of the LRH. 

Epistemics: 3/5

Generality: 2/5

Open-Endedness: 2/5

My biggest problem with LRH is that it's very difficult to make meaningful predictions a priori with it. 
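For concreteness, here is what the LRH claims operationally, as a numpy sketch with synthetic data. The data is generated assuming the LRH (a binary concept written along a single direction, plus isotropic noise), so this illustrates the claim rather than providing evidence for it:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 32, 200

# Synthetic "hidden states": concept written along one direction + noise.
concept_dir = rng.normal(size=d)
concept_dir /= np.linalg.norm(concept_dir)
labels = np.array([0] * (n // 2) + [1] * (n // 2))
acts = rng.normal(size=(n, d)) + 4.0 * np.outer(labels, concept_dir)

# A difference-of-means probe recovers the concept direction...
probe = acts[labels == 1].mean(axis=0) - acts[labels == 0].mean(axis=0)
probe /= np.linalg.norm(probe)
cosine = abs(probe @ concept_dir)

# ...and a threshold on the projection separates the concept linearly.
scores = acts @ probe
preds = (scores > scores.mean()).astype(int)
accuracy = (preds == labels).mean()
print(cosine, accuracy)
```

The a-priori-prediction problem is visible even here: the hypothesis tells you a direction exists, but not which concepts get directions, at what layer, or at what scale, so probes like this are found post hoc.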

The Superposition Hypothesis

An overview of superposition. 

I think the assumptions here are quite reasonable, which facilitates a high generality score. However, the epistemics could be better. 

Epistemics: 3/5

The largest body of evidence for superposition so far is that we consistently observe polysemantic neurons in models, across all sizes, architectures, and domains. Polysemanticity itself is directly observable; superposition is one inferred explanation for it. 

Secondly, superposition has also been demonstrated in toy models. However, it's unclear what disanalogies there are between superposition in toy models and superposition in real LLMs. For example, the assumptions about the data may not hold, or maybe two-layer ReLU networks are qualitatively quite different from much deeper transformer architectures. 
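The basic geometry those toy models demonstrate can be sketched in a few lines of numpy: with more features than dimensions, random nearly-orthogonal directions plus a thresholded readout recover sparse feature activations with only small interference. This is an illustration of the mechanism, not a reproduction of the trained toy models:

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, d_model = 200, 100   # more features than dimensions

# Each feature gets a random unit direction. With n_features > d_model they
# cannot be orthogonal, but random directions are *nearly* orthogonal
# (overlaps ~ 1/sqrt(d_model)), which is what makes superposition workable.
W = rng.normal(size=(n_features, d_model))
W /= np.linalg.norm(W, axis=1, keepdims=True)

# Sparse input: only a few features active at once (the key assumption).
active = [3, 17, 42]
x = np.zeros(n_features)
x[active] = 1.0

h = x @ W                               # compress into d_model dimensions
x_hat = np.maximum(0.0, h @ W.T - 0.2)  # ReLU + bias filters interference

recovered = np.argsort(x_hat)[-3:]      # the active features come out on top
print(sorted(recovered))
```

The disanalogy worry in the text applies here too: this sketch bakes in the sparsity and independence assumptions about the data, which may not hold for real LLM activations.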

The last main evidence in favour of superposition is that interp approaches inspired by superposition, i.e. sparse autoencoders, have seemed to work really well. However, this evidence is also not very solid. 

As a specific alternative hypothesis on what SAEs are doing, a common argument is that SAEs simply cluster the data, and interpretability comes from having tighter clusters. 
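For reference, the SAE itself is a very simple object; the contested part is whether its latents are "features" or mere clusters. A minimal numpy sketch of the forward pass and training objective, with random (untrained) weights:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_sae = 16, 64   # overcomplete: many more latents than dimensions

# Randomly initialised SAE weights -- shapes only, not learned features;
# training fits these to a large set of real model activations.
W_enc = rng.normal(0.0, 0.1, size=(d_model, d_sae))
b_enc = np.zeros(d_sae)
W_dec = rng.normal(0.0, 0.1, size=(d_sae, d_model))

def sae_forward(act):
    """One SAE pass: ReLU-sparse codes, then a linear decode."""
    codes = np.maximum(0.0, act @ W_enc + b_enc)
    recon = codes @ W_dec
    return codes, recon

act = rng.normal(size=d_model)
codes, recon = sae_forward(act)

# Training minimises reconstruction error plus an L1 sparsity penalty:
l1_coeff = 1e-3
loss = np.sum((recon - act) ** 2) + l1_coeff * np.abs(codes).sum()
# Even untrained, the ReLU zeroes out roughly half the codes.
print((codes == 0).mean(), loss)
```

Nothing in this objective forces the learned dictionary elements to be causally meaningful directions rather than cluster centroids, which is exactly why the clustering alternative is hard to rule out.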

Generality: 4/5

Open-Endedness: 5/5 

Assuming superposition as a concept turns out to be basically correct, it illuminates a lot of useful follow-up work. 

A point which I have not seen made elsewhere: I think we have not yet fully elucidated the "kinds" of superposition that can occur in models. Arbitrarily complex things can be in superposition, and features could just be the tip of the iceberg. 

Overall, I am actually quite confident that superposition is essentially correct. That is why I'm currently working on circuit analysis using SAEs. But I think there's insufficient evidence at the moment to reject leading alternative hypotheses and cement its status as a paradigm. 

Other Bodies of Theory

Note: I am considerably less familiar with these other bodies of theory than I am with the preceding three, so there may be errors or inaccuracies here. Please feel free to point those out as necessary.

There are some other bodies of theory which currently don't make the top cut, but which I think are promising nonetheless as things which could yield paradigms, given time.

Singular Learning Theory

Ratings are offered here, but they should be taken with a very large pinch of salt because I am not very familiar with SLT.

Overview: 

Epistemics: 5/5

Generality: 3/5

In principle, because SLT covers everything within the class of "singular" models, its claims are extremely general - even more so than "A Mathematical Framework", which applies only to neural nets, and more specifically only to transformer architectures. 

However, I'm not confident that SLT results are general, for the following reasons:

Open-Endedness: 3/5

On the whole, I think there is a lot of untapped potential here for SLT to be a paradigm, but this potential is quite far from being fully realized at the moment due to both issues with communicating the foundations of SLT to the broader MI community and a lack of "killer applications". 

Computational Mechanics

No rating is offered here because I haven't engaged sufficiently with the material. I'm including Comp Mech mostly for completeness.

The seminal work I am aware of shows that transformers model belief states using simplices, a specific representational geometry. Brief comments here: 

I don't have great takes on what Comp Mech aims to do as a field, and in any case it hasn't made a significant impact (yet) on the MI literature. I'll revisit this in the future if it becomes relevant.

The Polytope Hypothesis

This is the idea that the correct atoms of models' feature geometry are "polytopes". This is a very nascent trend I observe in some recent papers [Kiho Park et al, Adam Shai et al, Polytope Lens, and circular features in day-of-the-week math]. I intend to write a more complete note about this in follow-up work. 

Distilling A Technical Research Agenda

Note: Here, I summarize technical research items proposed in previous sections, which I think would be exciting.  I've omitted the "other bodies of theory" for now because I think my takes will not be very good. 

Generally: More killer applications of existing paradigms. 

On "Mathematical Framework": 

On the LRH: 

On superposition: 

Conclusion 

In summary, I think it's important to critically evaluate whether MI has succeeded in delivering general paradigms with high explanatory power over nontrivial phenomena. My take on this so far is that we have a few separate attempts but all of these are lacking at the moment. That's alright, since incomplete paradigms are still useful, and this highlights good avenues for future research. 

If you made it this far, thanks for reading my rather rambly thoughts. Feedback is appreciated! 


