A lot of the exciting results in Anthropic's recent On the Biology of a Large Language Model came from discovering features that belonged to a wider family of features. The 'planning to say rabbit' feature is interesting not because of the specific circuit, but because it is an instance of a category of features suggesting models plan ahead. We assume that the model has 'planning to say <token>' features for every token, rather than just for 'rabbit'.
We know of other categories of features: single-token, duplicate-token, induction, contextual, and more.
There are surely lots of categories of features which we are currently missing, or dismissing as meaningless, because we can't immediately see the pattern. There are simply too many features to examine manually. If we could cluster together features from the same computational category, we would stand a better chance of figuring out the role of those features by looking at each category as a whole.
Notice that this type of clustering is almost orthogonal to semantic clustering. We expect there to be 'contextual features' across all manner of different contexts, and for there to be single-token features for every token regardless of meaning.
I'm unsure how accurate this framing is, or how best to cluster features.
One idea I've had is to do some kind of attribution patching (or is it activation patching?) to figure out whether certain categories of features consistently use particular attention heads.
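Here's a rough sketch of what I have in mind, assuming TransformerLens-style hooks on GPT2-Small and an SAE trained on the residual stream at some layer. Everything specific below (the layer, the feature index, the contrast prompts, and the random stand-in encoder weights) is a placeholder rather than a real experiment; the point is the shape of the computation: score each head by (corrupted activation − clean activation) times the gradient of the feature's activation with respect to that head's output.

```python
import torch
from transformer_lens import HookedTransformer, utils

model = HookedTransformer.from_pretrained("gpt2")

SAE_LAYER = 6        # hypothetical: the SAE reads blocks.6.hook_resid_pre
FEATURE_IDX = 1234   # hypothetical: which SAE feature to attribute
d_sae = 24576
W_enc = torch.randn(model.cfg.d_model, d_sae) / model.cfg.d_model ** 0.5  # stand-in encoder
b_enc = torch.zeros(d_sae)

# Hypothetical contrast pair, chosen so both prompts tokenise to the same length
clean_tokens = model.to_tokens("When John and Mary went to the store, John gave a drink to")
corrupt_tokens = model.to_tokens("When Alice and Bob went to the store, Alice gave a drink to")

def feature_preact(resid):
    # SAE encoder pre-activation for one feature at the final position
    # (pre-activation, so the gradient isn't zeroed by a ReLU in this toy setup)
    return resid[0, -1] @ W_enc[:, FEATURE_IDX] + b_enc[FEATURE_IDX]

def run_and_cache(tokens):
    """Forward pass caching per-head outputs (hook_z) below the SAE layer,
    returning them along with the SAE feature's pre-activation."""
    zs, resid = {}, {}

    def save_z(z, hook):
        z.retain_grad()               # keep .grad so we can read it after backward()
        zs[hook.layer()] = z
        return z

    def save_resid(r, hook):
        resid["resid_pre"] = r
        return r

    model.run_with_hooks(
        tokens,
        fwd_hooks=[(utils.get_act_name("z", l), save_z) for l in range(SAE_LAYER)]
        + [(utils.get_act_name("resid_pre", SAE_LAYER), save_resid)],
    )
    return zs, feature_preact(resid["resid_pre"])

clean_zs, clean_metric = run_and_cache(clean_tokens)
corrupt_zs, _ = run_and_cache(corrupt_tokens)

clean_metric.backward()  # populates .grad on the cached clean hook_z tensors

# Attribution-patching estimate: (corrupt - clean) * d(feature)/d(head output),
# summed over batch, position and head dimension -> one score per head
head_scores = {}
for layer, z in clean_zs.items():
    delta = corrupt_zs[layer].detach() - z.detach()           # [batch, pos, head, d_head]
    head_scores[layer] = (delta * z.grad).sum(dim=(0, 1, 3))  # [n_heads]

for layer, scores in head_scores.items():
    print(f"layer {layer}:", scores.round(decimals=3).tolist())
```

If this works, the resulting vector of per-head scores is a fingerprint of how a feature is computed, which is exactly what we'd want to cluster on.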
For instance, I know from investigating the first layer of GPT2-Small that contextual features all arise from the same set of attention heads working together. The 'known entity' feature in GPT2-Small is computed using a single local head. Duplicate-token features are computed using duplicate token heads, and induction features are computed using induction heads.
So perhaps if we could find which attention heads and MLP layers are used in computing each SAE feature, we could cluster on that basis.
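As a sketch of that clustering step (again illustrative: the file of stacked per-head scores from the procedure above, the cluster count, and so on are assumptions I haven't tested):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import normalize

# Hypothetical file: per-head attribution scores stacked into [n_features, n_heads]
head_attrib = np.load("feature_head_attributions.npy")

# L2-normalise each feature's attribution vector so clustering compares the
# *pattern* of heads used, not the overall attribution magnitude
X = normalize(head_attrib, norm="l2", axis=1)

kmeans = KMeans(n_clusters=20, n_init=10, random_state=0).fit(X)
for c in range(kmeans.n_clusters):
    members = np.where(kmeans.labels_ == c)[0]
    top_heads = np.argsort(-np.abs(kmeans.cluster_centers_[c]))[:3]
    print(f"cluster {c}: {len(members)} features, dominant heads {top_heads.tolist()}")
```

L2-normalising each row means features get grouped by which heads they rely on rather than by how strongly they activate, which is the computational rather than semantic clustering I'm after.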