A lot of the exciting results in Anthropic's recent On the Biology of a Large Language Model came from discovering features that belonged to a wider family of features. The 'planning to say rabbit' feature is interesting not because of the specific circuit, but because it is an instance of a category of features suggesting models plan ahead. We assume that the model has 'planning to say <token>' features for every token, rather than just for 'rabbit'.
We know of other categories of features: single-token, duplicate-token, induction, contextual, and more.
There are surely lots of categories of features which we are currently missing, or dismissing as meaningless, because we can't immediately see the pattern. There are simply too many features to examine manually. If we could cluster together features from the same computational category, we would stand a better chance of figuring out the role of those features by looking at each category as a whole.
Notice that this type of clustering is almost orthogonal to semantic clustering. We expect there to be 'contextual features' across all manner of different contexts, and for there to be single-token features for every token regardless of meaning.
I'm unsure how accurate this framing is, or how best to cluster features.
One idea I've had is to do some kind of attribution patching (or is it activation patching?) to figure out whether certain categories of features consistently use particular attention heads.
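Here's a rough sketch of what I have in mind, assuming TransformerLens-style hooks on GPT2-Small and an SAE trained on the residual stream at some layer. Everything specific below (the layer, the feature index, the contrast prompts, and the random stand-in encoder weights) is a placeholder rather than a real experiment; the point is the shape of the computation: score each head by (corrupted activation − clean activation) times the gradient of the feature's activation with respect to that head's output.

```python
import torch
from transformer_lens import HookedTransformer, utils

model = HookedTransformer.from_pretrained("gpt2")

SAE_LAYER = 6        # hypothetical: the SAE reads blocks.6.hook_resid_pre
FEATURE_IDX = 1234   # hypothetical: which SAE feature to attribute
d_sae = 24576
W_enc = torch.randn(model.cfg.d_model, d_sae) / model.cfg.d_model ** 0.5  # stand-in encoder
b_enc = torch.zeros(d_sae)

# Hypothetical contrast pair, chosen so both prompts tokenise to the same length
clean_tokens = model.to_tokens("When John and Mary went to the store, John gave a drink to")
corrupt_tokens = model.to_tokens("When Alice and Bob went to the store, Alice gave a drink to")

def feature_preact(resid):
    # SAE encoder pre-activation for one feature at the final position
    # (pre-activation, so the gradient isn't zeroed by a ReLU in this toy setup)
    return resid[0, -1] @ W_enc[:, FEATURE_IDX] + b_enc[FEATURE_IDX]

def run_and_cache(tokens):
    """Forward pass caching per-head outputs (hook_z) below the SAE layer,
    returning them along with the SAE feature's pre-activation."""
    zs, resid = {}, {}

    def save_z(z, hook):
        z.retain_grad()               # keep .grad so we can read it after backward()
        zs[hook.layer()] = z
        return z

    def save_resid(r, hook):
        resid["resid_pre"] = r
        return r

    model.run_with_hooks(
        tokens,
        fwd_hooks=[(utils.get_act_name("z", l), save_z) for l in range(SAE_LAYER)]
        + [(utils.get_act_name("resid_pre", SAE_LAYER), save_resid)],
    )
    return zs, feature_preact(resid["resid_pre"])

clean_zs, clean_metric = run_and_cache(clean_tokens)
corrupt_zs, _ = run_and_cache(corrupt_tokens)

clean_metric.backward()  # populates .grad on the cached clean hook_z tensors

# Attribution-patching estimate: (corrupt - clean) * d(feature)/d(head output),
# summed over batch, position and head dimension -> one score per head
head_scores = {}
for layer, z in clean_zs.items():
    delta = corrupt_zs[layer].detach() - z.detach()           # [batch, pos, head, d_head]
    head_scores[layer] = (delta * z.grad).sum(dim=(0, 1, 3))  # [n_heads]

for layer, scores in head_scores.items():
    print(f"layer {layer}:", scores.round(decimals=3).tolist())
```

If this works, the resulting vector of per-head scores is a fingerprint of how a feature is computed, which is exactly what we'd want to cluster on.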
For instance, I know from investigating the first layer of GPT2-Small that contextual features all arise from the same set of attention heads working together. The 'known entity' feature in GPT2-Small is computed using a single local head. Duplicate-token features are computed using duplicate token heads, and induction features are computed using induction heads.
So perhaps if we could find which attention heads and MLP layers are used in computing each SAE feature, we could cluster on that basis.
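As a sketch of that clustering step (again illustrative: the file of stacked per-head scores from the procedure above, the cluster count, and so on are assumptions I haven't tested):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import normalize

# Hypothetical file: per-head attribution scores stacked into [n_features, n_heads]
head_attrib = np.load("feature_head_attributions.npy")

# L2-normalise each feature's attribution vector so clustering compares the
# *pattern* of heads used, not the overall attribution magnitude
X = normalize(head_attrib, norm="l2", axis=1)

kmeans = KMeans(n_clusters=20, n_init=10, random_state=0).fit(X)
for c in range(kmeans.n_clusters):
    members = np.where(kmeans.labels_ == c)[0]
    top_heads = np.argsort(-np.abs(kmeans.cluster_centers_[c]))[:3]
    print(f"cluster {c}: {len(members)} features, dominant heads {top_heads.tolist()}")
```

L2-normalising each row means features get grouped by which heads they rely on rather than by how strongly they activate, which is the computational rather than semantic clustering I'm after.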