Topological Data Analysis and Mechanistic Interpretability

This article explores how topological data analysis (TDA) can be used for mechanistic interpretability research, particularly for convolutional neural networks and large language models (LLMs). It first introduces the basic concepts of TDA, including what topology is and the Mapper algorithm. It then shows how TDA can be applied to the internal structure of convolutional neural networks and to their training process, revealing the "responsibilities" of the neurons in each layer and how structure changes during training. For LLMs, the article proposes combining TDA with sparse autoencoders (SAEs) and gives some initial examples showing how TDA can help reveal information in the geometry of SAE features.

💡 Topological data analysis (TDA) studies the shape of data using methods from topology; it is insensitive to deformations such as stretching and compression, and can identify latent structure and patterns in a data set.

🔗 The Mapper algorithm is the core TDA method for building graph representations of data: by constructing a covering of a data set and computing its nerve graph, it can effectively reveal complex topological structure.

🌐 TDA can be used to understand the internal mechanisms of neural networks: analyzing the weight vectors of a convolutional neural network shows which features each layer learns and how that structure changes over the course of training.

🔑 Combining sparse autoencoders (SAEs) with TDA gives a deeper view of the inner workings of large language models (LLMs), revealing geometric structure in feature space and improving interpretability.

Published on February 24, 2025 7:56 PM GMT

This article was written in response to a post on LessWrong from the Apollo Research interpretability team. This post represents our initial attempts at acting on the topological data analysis suggestions.

In this post, we'll look at some ways to use topological data analysis (TDA) for mechanistic interpretability. We'll first show how one can apply TDA in a very simple way to the internals of convolutional neural networks to obtain information about the "responsibilities" of the various layers, as well as about the training process. For LLMs, though, simply approaching weights or activations "raw" yields limited insights, and one needs additional methods like sparse autoencoders (SAEs) to obtain useful information about the internals. We will discuss this methodology, and give a few initial examples where TDA helps reveal structure in SAE feature geometry.

I. Topological Data Modeling

The term topology refers to the study of shape using methods that are insensitive to deformations such as stretching, compressing, or shearing. For example, topology does not “see” the difference between a circle and an ellipse, but it does recognize the difference between the digit 0 and the digit 8. No matter how I stretch or compress the digit 0, I can never achieve the two loops that are present in the digit 8. Shapes can often be represented by graphs or their higher dimensional analogues called simplicial complexes. For instance, one can think of a hexagon as modeling a circle, with the understanding that the modeling is accomplished with a small amount of error:

Of course data sets can have notions of shape, too. For example, here is a data set that we can recognize as having a circular shape, even though it only consists of samples and is not a complete circle.

A circular shape may be an indication of periodic behavior. In a mechanistic interpretability context, Engels et al. showed that some LLM SAE features are organized in a circular pattern, and that those features correspond to temporally periodic structures like days of the week or months of the year.

There are numerous other examples where periodic data is shown as a circle when graphed, notably in dynamical systems like predator/prey models.
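To make this concrete, here is a minimal sketch (ours, not from the original analysis) of how persistent homology, one of the standard TDA tools, can detect circular structure in a sampled point cloud. It assumes the open-source `ripser` package; any persistent homology library would work similarly.

```python
# A minimal sketch of detecting circular structure in a point cloud with
# persistent homology, using the open-source `ripser` package (an assumption:
# any persistent-homology library would do).
import numpy as np
from ripser import ripser

# Sample noisy points from a circle -- a stand-in for, e.g., SAE features
# that encode a periodic concept such as days of the week.
rng = np.random.default_rng(0)
theta = rng.uniform(0, 2 * np.pi, size=200)
points = np.column_stack([np.cos(theta), np.sin(theta)])
points += 0.05 * rng.standard_normal(points.shape)

# Compute persistence diagrams up to dimension 1 (components and loops).
diagrams = ripser(points, maxdim=1)["dgms"]
h1 = diagrams[1]

# A single long-lived H1 interval indicates one dominant loop, i.e. a circle.
lifetimes = h1[:, 1] - h1[:, 0]
print("most persistent loop lifetime:", lifetimes.max())
print("number of loops with lifetime > 0.5:", int((lifetimes > 0.5).sum()))
```

A single long-lived class in the degree-one diagram is the algebraic signature of the kind of circular shape shown above.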

II. Mapper

Mapper is the name for a family of methods that use topological ideas to build graphs representing data sets. The core concept behind Mapper is the nerve of a covering. A covering of a set $X$ is a family of subsets $\{U_i\}_{i \in I}$ so that $X = \bigcup_{i \in I} U_i$. The nerve graph of the covering is the graph $N$ whose vertices $v_i$ correspond to the sets $U_i$, and where vertices $v_i$ and $v_j$ form an edge in $N$ if $U_i \cap U_j \neq \emptyset$, i.e. if $U_i$ and $U_j$ overlap. As an example, suppose the set $X$ is as shown below, with a covering by four sets colored red, yellow, blue, and green, with overlaps as indicated.

The nerve graph has four vertices, one for each of the covering sets. The vertices corresponding to the yellow and red sets are connected by an edge because they overlap. The  vertices corresponding to the yellow and blue sets are not connected by an edge because they do not overlap. Building a graph representation of a dataset by constructing a good covering is a powerful technique, motivated by fundamental results like the nerve lemmas, which give guarantees about topological equivalence of a space with the nerve of a sufficiently nice cover of that space. There are numerous strategies for constructing such graph models motivated by this simple construction.  Of course, the graphs constructed often have many more vertices than the model above.  This kind of graph modeling is a part of an area of data science called Topological Data Analysis.
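As a small illustration of the nerve construction (our sketch, with a hypothetical hard-coded covering), the following computes the nerve graph of a covering given as sets of data-point indices:

```python
# A minimal sketch of the nerve graph of a covering: vertices are the cover
# sets, and two vertices are joined by an edge when the sets overlap.
# The cover here is hypothetical and hard-coded for illustration.
from itertools import combinations
import networkx as nx

def nerve_graph(cover):
    """cover: dict mapping a set name to the set of data-point indices it contains."""
    graph = nx.Graph()
    graph.add_nodes_from(cover)
    for (name_a, set_a), (name_b, set_b) in combinations(cover.items(), 2):
        if set_a & set_b:  # non-empty intersection -> edge
            graph.add_edge(name_a, name_b)
    return graph

# Four overlapping cover sets, mimicking the red/yellow/blue/green example above.
cover = {
    "red":    {0, 1, 2, 3},
    "yellow": {3, 4, 5},
    "blue":   {6, 7, 8},
    "green":  {2, 8, 9},
}
print(nerve_graph(cover).edges())
# [('red', 'yellow'), ('red', 'green'), ('blue', 'green')]
```

Practical Mapper implementations build the covering automatically, typically by pulling back an interval cover of a projection (a "lens") and refining it by clustering, but the graph itself is always the nerve of that covering.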

Graphical modeling can be used to understand the internals of neural networks, as illustrated below, from this paper (a presentation can be found here).

These graphs were obtained from VGG16, a convolutional neural network pre-trained on ImageNet. For each layer, we constructed the data set of weight vectors for each neuron, including only those vectors satisfying a certain local density threshold. One can see that in the first two layers the graph model is circular, showing that the weight vectors are concentrated around those which detect approximations to linear gradients. Later layers consistently include these, along with additional features. For example, layer four includes weight vectors which detect a horizontal line against a dark background. Layer five includes a white "bull's-eye" and a crossing of two lines. Later layers include combinations of these. The coloring of the nodes encodes the number of data points in the set corresponding to each node, so red nodes contain more points than green or blue ones. These visualizations demonstrate the presence of geometric structure in VGG16's weight vectors, indicating that specific, interpretable features are learned at each layer.
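For readers who want to experiment, here is a hedged sketch of this kind of analysis: it pulls the first-layer filters out of a pre-trained VGG16 and builds a Mapper-style graph on the flattened weight vectors. This is not the exact pipeline behind the figures above; it assumes `torchvision` and the open-source KeplerMapper package, and the preprocessing, cover, and clustering parameters are illustrative choices only.

```python
# A hedged sketch: extract first-layer conv filters from pre-trained VGG16
# and build a Mapper graph on the resulting weight vectors.
import numpy as np
import kmapper as km
from sklearn.cluster import DBSCAN
from sklearn.decomposition import PCA
from torchvision.models import vgg16, VGG16_Weights

model = vgg16(weights=VGG16_Weights.IMAGENET1K_V1)

# First conv layer: 64 filters of shape (3, 3, 3) -> 64 vectors in R^27.
weights = model.features[0].weight.detach().numpy().reshape(64, -1)

# Mean-center and normalize each weight vector, so filters are compared by
# their "shape" rather than their scale.
weights = weights - weights.mean(axis=1, keepdims=True)
weights = weights / np.linalg.norm(weights, axis=1, keepdims=True)

mapper = km.KeplerMapper(verbose=0)
lens = mapper.fit_transform(weights, projection=PCA(n_components=2))
graph = mapper.map(
    lens,
    weights,
    cover=km.Cover(n_cubes=8, perc_overlap=0.4),
    clusterer=DBSCAN(eps=0.6, min_samples=2),
)
print(len(graph["nodes"]), "nodes in the Mapper graph")
```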

A second example performs the same kind of analysis for a convolutional neural network with two hidden layers, but observes how the structure changes over the course of training. In this case, in the first layer, one can see roughly random behavior after 100 iterations, but after 200 iterations one sees concentration (as indicated by the redness) around points on the circular boundary, which correspond to the linear gradients as in VGG16. This pattern becomes even more pronounced after 400 iterations, but begins to degrade after 500 iterations. In the second layer, one sees a very weak circular boundary through the first 500 iterations, becoming more pronounced after that. One can hypothesize that the second layer is "compensating" for the degradation occurring in the first layer. The first layer has opted to retain the linear gradients in the vertical and horizontal directions, but has additionally included a black bull's-eye on a lighter background. This is unexpected behavior, and is probably due to the small number of layers in this network. What we would have expected is behavior similar to that seen in VGG16 above, in which the earliest layers respond to the simplest local behavior, namely an edge, and later layers to more complex behaviors.
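The bookkeeping behind this kind of training-time analysis is simple: snapshot the layer's weight vectors at chosen iteration counts and build a graph model for each snapshot. The sketch below is hypothetical (a placeholder model and random data, not the network from this section) and only illustrates the snapshotting step.

```python
# A hypothetical sketch: snapshot the first conv layer's weight vectors at
# fixed iteration counts, so a Mapper graph can be built for each snapshot.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(1, 16, 3), nn.ReLU(),
    nn.Conv2d(16, 32, 3), nn.ReLU(),
    nn.Flatten(), nn.Linear(32 * 24 * 24, 10),  # 28x28 input -> 24x24 after two 3x3 convs
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

snapshots = {}                     # iteration -> (num_filters, weight_dim) array
checkpoints = {100, 200, 400, 500}

for iteration in range(1, 501):
    # Placeholder batch; in practice this comes from the training data loader.
    x = torch.randn(8, 1, 28, 28)
    y = torch.randint(0, 10, (8,))
    optimizer.zero_grad()
    loss_fn(model(x), y).backward()
    optimizer.step()

    if iteration in checkpoints:
        w = model[0].weight.detach().clone().reshape(16, -1).numpy()
        snapshots[iteration] = w   # feed each snapshot to the Mapper pipeline above
```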

III. Mechanistic Interpretability

Apollo Research recently led an extensive report on open problems in mechanistic interpretability, with a large portion focused on open questions about SAEs. Some of the issues that stood out to us were:

1. As it is, the method does not create a usable geometry on the space of features. Geometry (and, we would add, topology) of feature sets is a useful way of organizing the features and obtaining understanding and interpretations of them. It is well known that geometries of feature spaces are often extremely useful in signal processing. Fourier analysis uses the circular geometry of periodic data in a critical way, and the field of graph signal processing illustrates the power of geometry in organizing the features of a data set (see here and here for more details).

2. SAEs give an organization of the activations in neural networks, level by level, but do not directly give information about mechanisms. How can one represent mechanisms?

3. The ultimate goal is to extract interpretable features that accurately describe the internal processes of a model. Sparsity is used as a proxy for interpretability in SAEs. However, it is not clear whether sparsity is the best proxy for interpretability, or even always a helpful one. There are methods being developed which may improve the situation, notably minimum description length. We believe that geometrically inspired measures will yield improved interpretability.

IV. Graph Modeling of SAE features

Question 1 above concerns the need for a geometry on feature spaces. This is a key ingredient in the interpretability of features. We think TDA can help understand this feature geometry, and we'll show a few simple examples we've tried on the SAE features constructed by OpenAI for GPT-2-small. The graphs we build are constructed using BluelightAI's Cobalt software, which employs a variant of the Mapper technique outlined in Section II. We did need to implement a few workarounds to make this work, and we plan to share a cleaned-up Colab notebook detailing the process in the near future. The largest component of the graph constructed on these SAE features is displayed below. We used cosine similarity to compare features.

Each node of the graph corresponds to a collection of the SAE features. Below we will show selections A and B from the above diagram, and indicate what words or concepts trigger the features in each node or region.  Each SAE feature activates with varying frequency on different sets of words, and collections of features are labeled by the most frequently occurring words in the collection.
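For illustration, here is one very simple way to put a graph structure on SAE features using cosine similarity. This is not the Cobalt/Mapper construction used for the figure above, just a k-nearest-neighbor sketch; `feature_directions` is assumed to be an array of SAE decoder directions, for example loaded from OpenAI's released GPT-2-small autoencoders.

```python
# A hedged sketch of a cosine-similarity kNN graph on SAE feature directions.
# `feature_directions` is assumed to be an (n_features, d_model) array of SAE
# decoder directions; random stand-ins are used here for illustration.
import numpy as np
import networkx as nx
from sklearn.neighbors import NearestNeighbors

def knn_cosine_graph(feature_directions: np.ndarray, k: int = 5) -> nx.Graph:
    nn = NearestNeighbors(n_neighbors=k + 1, metric="cosine")
    nn.fit(feature_directions)
    distances, indices = nn.kneighbors(feature_directions)

    graph = nx.Graph()
    graph.add_nodes_from(range(len(feature_directions)))
    for i, (dists, nbrs) in enumerate(zip(distances, indices)):
        for dist, j in zip(dists[1:], nbrs[1:]):      # skip the point itself
            graph.add_edge(i, int(j), weight=1.0 - dist)  # cosine similarity
    return graph

# Replace the random directions below with real SAE feature vectors.
directions = np.random.default_rng(0).standard_normal((1000, 768))
g = knn_cosine_graph(directions, k=5)
largest = g.subgraph(max(nx.connected_components(g), key=len))
print(largest.number_of_nodes(), "features in the largest component")
```

Mapper-style constructions like Cobalt's additionally cluster within overlapping regions of a cover, which tends to give coarser, more readable graphs than a raw kNN graph.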

Selection A contains a three step progression, which looks like this:

(1) Regard to, attributed to, associated with, likening to

(2) Connect, link, belong, contribute, refer

(3) In conjunction with, listening to, according to, access to

All three have to do with relationships. (1) describes types of relationships, and those relationships are conceptual. (2) describes more explicit types of relationships, and (3) gives even more explicit and detailed forms of relating.

Selection B contains a “Y”-shape enclosed in the upper left, and we can interpret it like this:

V. Next steps for TDA and SAE features

We have a lot more ideas for how to use TDA to help better understand SAEs and neural network activation spaces more generally.

We looked at the geometric structure of SAE features themselves here, but we think these features may also be useful as a way to better understand the topological structure of activation space:

VI. Summary

We have demonstrated the use of topological data analysis in the study of SAEs for large language models, and obtained conceptual understanding of groups of these features. This methodology is quite powerful, and holds promise for a mechanistic understanding of the internals of large language models.

VII. Acknowledgments

We thank Lee Sharkey for his helpful comments and suggestions.



