Explaining GPT-2-Small Forward Passes with Edge-Level Autoencoder Circuits

Published on July 22, 2025 8:36 PM GMT

Repo: https://github.com/DavidUdell/sparse_circuit_discovery

TL;DR: A SPAR project from a while back. A replication of an unsupervised circuit discovery algorithm in GPT-2-small, with a negative result.

Thanks to Justis Mills for draft feedback and to Neuronpedia for interpretability data.

Introduction

I (David) first heard about sparse autoencoders at a Bay Area party. I had been talking about how activation additions give us a map from expressions in natural language over to model activations. I said that what I really wanted, though, was the inverse map: the map going from activation vectors over to their natural language content. And, apparently, here it was: sparse autoencoders!

The field of mechanistic interpretability quickly became confident that sparse autoencoder features were the right ontology for understanding model internals. At the high point of the hype, you may recall, Anthropic declared that interpretability had been reduced to "an engineering problem."

Frustratingly, though, the only naturalistic model behavior that had been concretely explained at that point was indirect object identification, dating from back before the sparse autoencoder revolution! In a world where sparse autoencoders just solved it, I would expect everyone and their dog to be sharing pseudocode fully explaining pieces of LLM behavior. On the sparse autoencoder circuits front, here was a representative proffered explanatory circuit at that time:

I cannot directly interpret this image as pseudocode, is the thing. Note that it's definitely true that the features highlighted here are often quite relevant and suggestive. But the edges between features don't really add to my understanding of what is going on. I get the feeling that the crisp mechanistic understanding is tantalizingly close here... but still out of reach.

So it struck me as strange that so much capital was being put into tweaking and refining the sparse autoencoder architecture, and not into leveraging existing autoencoders to explain stuff—I thought the whole point was that we had the right ontology for explanations already in our hands! Maybe not full explanations of every model behavior, but full explanations of some model behaviors nonetheless.

The theory of change that this then prompted for our SPAR project was: show how any naturalistic transformer behavior can be concretely explained with sparse autoencoder circuit discovery. Alternatively, show how concrete sparse autoencoder circuits fail as full explanations. So that's what we did here: we replicated the then-new circuit discovery algorithm, ironed out a few bugs, and then tried to mechanistically explain how some GPT-2 forward passes worked. It did not work out, in the end.

Our bottom-line conclusion, driven by our datapoint, is that the localistic approximations of vanilla sparse autoencoders cannot be strung together into fully explanatory circuits. Rather, more global, layer-crossing approximations of the model's internals are probably what is needed to get working explanatory circuits.

Gradient-Based Unsupervised Circuit Discovery

An explanation of the circuit discovery algorithm from Marks et al. (2024) that we replicate.

We want to be able to explain a model's behavior in any arbitrary forward pass by highlighting something in its internals. Concretely, we want to pinpoint a circuit that was responsible for that output in that context. That circuit will be a set of sparse autoencoder features and their causal relations, leading to the logit for the token being upweighted and other-token logits being downweighted.

The most basic idea for capturing causality in mechanistic interpretability is: take derivatives. If you want to know how a scalar $x$ affected a scalar $y$, well, take the derivative of $y$ with respect to $x$. If you want to know how anything affected the loss, of course, take the derivative of the loss with respect to it. But the strategy is fully general: if you want to know how one activation affected another activation, take the derivative of the one with respect to the other. And if you want to know how your sparse autoencoder features all affect one another, just take derivatives among them.

The idea of the sparse autoencoders themselves is less immediately obvious. Once you have them, though, they are a differentiable quantity that automatically lend themselves to causal approximation. A natural next step after you get apparently comprehensible sparse autoencoder features is to see whether their causal relationships match (or fail to match) your understanding of feature contents.

On the implementation side, derivatives batch well in PyTorch. In a small number of backward passes, we can estimate how sparse autoencoder features causally interrelate.
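For illustration, here is a minimal, self-contained sketch of that kind of feature-to-feature derivative in PyTorch. The tensors below are stand-ins we made up for this post, not hooks into GPT-2 or the repo's actual code.

```python
import torch

# Stand-ins for SAE feature activations at two sublayers of one forward pass;
# a random linear map plays the role of the model computation between them.
torch.manual_seed(0)
upstream_acts = torch.randn(8, requires_grad=True)    # features at sublayer l
mixing = torch.randn(4, 8)                            # stand-in for the model's weights
downstream_acts = torch.relu(mixing @ upstream_acts)  # features at sublayer l + 1

# d(downstream_j)/d(upstream_i) for every pair: one backward pass per
# downstream feature (torch.func.jacrev can batch this).
jacobian = torch.stack([
    torch.autograd.grad(downstream_acts[j], upstream_acts, retain_graph=True)[0]
    for j in range(downstream_acts.shape[0])
])
print(jacobian.shape)  # torch.Size([4, 8])
```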

More precisely, let $L$ be the loss (concretely: cross-entropy). Let $a_i$ be a sparse-autoencoder feature activation, a scalar, with a suppressed index for its model sublayer $l$:

The node $n_i$ for a feature $a_i$ is given by

$$n_i = \frac{\partial L}{\partial a_i} \, a_i.$$

Interpret a node as a feature's individual contribution to the loss.

An edge between two feature activations $a_i$ and $a_j$, with $a_i$ upstream of $a_j$, is given by

$$e_{i \to j} = \frac{\partial L}{\partial a_j} \, \frac{\partial a_j}{\partial a_i} \, a_i.$$

Interpret an edge as the $a_i$ feature's individual contribution to the loss by way of affecting the $a_j$ feature.

Collect the absolutely largest nodes at every sublayer $l$ in a forward pass. Compute the edges between all neighboring nodes in the collection. Finally, correct for any double-counted causal effects between the nodes.[1]
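As a compressed sketch of how those node and edge values can be computed with autograd: the snippet below follows the formulas above for a single pair of neighboring sublayers, treating zero ablation as the implicit baseline and omitting the double-counting correction of footnote 1. The function and variable names are ours, not the repo's.

```python
import torch

def node_and_edge_attributions(loss, up_acts, down_acts, k=5):
    """Approximate nodes and edges for one pair of neighboring sublayers.

    `loss` is the scalar cross-entropy for the forward pass; `up_acts` and
    `down_acts` are SAE feature activations at the two sublayers, recorded
    without detaching, so that they are still part of the autograd graph.
    """
    # Node values n_i = (dL/da_i) * a_i, at both sublayers.
    grad_up, grad_down = torch.autograd.grad(
        loss, (up_acts, down_acts), retain_graph=True
    )
    up_nodes = grad_up * up_acts
    down_nodes = grad_down * down_acts

    # Keep only the absolutely largest k nodes at each sublayer.
    top_up = up_nodes.abs().topk(k).indices.tolist()
    top_down = down_nodes.abs().topk(k).indices.tolist()

    # Edge values e_{i -> j} = (dL/da_j) * (da_j/da_i) * a_i between kept nodes.
    edges = {}
    for j in top_down:
        d_downj_d_up = torch.autograd.grad(
            down_acts[j], up_acts, retain_graph=True
        )[0]
        for i in top_up:
            edges[(i, j)] = (grad_down[j] * d_downj_d_up[i] * up_acts[i]).item()

    return up_nodes.detach(), down_nodes.detach(), edges
```

Running something like this over every pair of neighboring sublayers, and then subtracting off the double-counted paths described in footnote 1, yields the edge graph discussed next.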

The graph of edges you have now (or, precisely, the subset of that graph with a path to the logits) is a gradient approximation of the actual causal relationships between sparse autoencoder features. It is, putatively, an explanation of that forward pass.

Sanity Check

A spot check of the new method's validity.

Last time, we looked at the prompt "Copyright (C". GPT-2-small knows that the final closing parenthesis should be coming: the model puts the majority of its probability on that token. How does it know to do that?

Well, last time, we saw persistent "causal threads" in GPT-2-small's residual stream over that forward pass. Certain sparse autoencoder features can be seen causally propagating themselves across a forward pass, just preserving their own meaning for later layers. The features that we observe doing this specifically look like hypotheses about what that model's next token will be. For example, for the Copyright (C prompt there are a couple of causal threads about word completions for the capital letter C. There is a causal thread for tracking that you're inside an open parenthetical, and one for acronym completions.

Run that same forward pass through this new gradient-based method, using the old residual autoencoders only. This method also picks out at least one of those same causal threads—a C-words one—as its top highlight.

The graph is definitely hard to read. Just round it off to: we're passing this sanity check, as we're at least recovering a common structure with both methods.

Results

Okay, here's our argument against the gradient-based method. If a circuit is explanatory, when you walk back through its graph, each node and edge adds to the story. In particular, the very final nodes should be explanatory; if they aren't, that screens off the rest of the graph being explanatory.

Below are the single top feature nodes for the token following each prompt. These nodes are then the main interpretable proximal cause of the model's logits, according to this sparse autoencoder circuit discovery framework. We interpret each node using its bottom-most blue row of boxes, which represents the tokens that the feature most promoted in this forward pass. (Importantly, that is causal, not correlational, interpretability data.) So, focus on that bottom-most blue row of promoted logits.[2]

Top Proximal Causes

1. Copyright (C

Top Completions

Token    Probability
)        82%
VS       1%
AL       1%
IR       0%
)(       0%

Top feature node: Closing parentheses; forms of "be"

 

2. The biggest name in basketball is Michael

Top Completions

Token      Probability
 Jordan    81%
 Carter    4%
 Kidd      3%
 Be        2%
 Malone    1%

Top feature node: Various Michael last names

 

3. Make America Great

Top Completions

Token     Probability
 Again    95%
 again    3%
"         0%
Again     0%
."        0%

Top feature node: Capitalized transition words

 

4. To be or not to be, that is the

Top Completions

Token        Probability
 question    6%
 way         5%
 only        3%
 nature      3%
 point       2%

Top feature node: Question

 

5. Good morning starshine, the Earth says

Top Completions

Token     Probability
 hello    9%
 it       7%
,         6%
:         4%
 that     4%

Top feature node: Code

 

Causal Structure From All Sublayers

Of those five examples, only two seem to correctly call the model's top next token: the closing-parentheses feature and the question feature. But even when you condition on having an actually predictive proximal feature, the upstream sublayer edges and nodes flowing into it (plotting whose contributions is the whole point of this method) are not illuminating.

The edges from each upstream sublayer that most strongly affected the closing parentheses feature were:

[Unclear meaning]

 

[Unclear meaning]

 

Casual exclamations

 

The edges from each upstream sublayer that most strongly affected the question feature were:

[Unclear meaning; same node as before]

 

[Unclear meaning]

 

Casual exclamations; same node as before

 

When you ablate out a causal thread with a clear natural interpretation to it, the logit effects seem quite sensible.[3] Also, you can often scan over a causal graph that you get out of this method and cherry-pick a sensible, informative node activation: you can learn, e.g., that a particular attention layer seems to have been particularly counterfactual for that token.

Our complaint is that you are really not getting mechanistic understanding of the reasons why the model is writing what it is into the residual stream. It was that "reasons why" that we were after here in the first place.

Conclusion

We went into this project hoping to plug a research gap and get out a concrete algorithmic explanation of what is going on in a naturalistic forward pass, taking autoencoder features as primitives. We found that this didn't work with this algorithm.

Relatively recently, Anthropic published work showing cross-layer transcoder circuit discovery that does work to that standard. They give, for example, the full cognitive algorithm that Claude uses for two-digit addition. Their result leads us to think that it is the "cross-layer-ness" of what they were doing that is really the special sauce there. If the autoencoder circuits we played with here are built out of local approximations to what the model is representing at various points in the forward pass, cross-layer transcoders are instead built out of global approximations. Our overall update is that the additional work of getting that global approximation is necessary to make circuits research work in naturalistic transformers.

  1. ^

    Say that you have three nodes, $a$, $b$, and $c$, at model sublayers $l_1$, $l_2$, and $l_3$, respectively. Because of the residual stream, model topology is such that, in a forward pass, causality goes as follows:

    $$a \to b, \qquad a \to c, \qquad b \to c.$$

    If you want the value of the edge $a \to c$, you cannot just compute the effect of node $a$ on node $c$. You will also have to subtract off any effect due to the confounding path $a \to b \to c$.


    Say you now have four nodes, $a$, $b$, $c$, and $d$, at model sublayers last_resid_out, attn_out, mlp_out, and resid_out, respectively. Causality here goes:

    $$a \to b, \qquad a \to c, \qquad a \to d, \qquad b \to c, \qquad b \to d, \qquad c \to d.$$

    The edges $a \to b$, $b \to c$, and $c \to d$ can be simply computed without any needed corrections.

    The edge $a \to c$ has the confounding path $a \to b \to c$.

    The edge $b \to d$ has the confounding path $b \to c \to d$.

    The edge $a \to d$ has the confounding paths $a \to b \to d$, $a \to c \to d$, and $a \to b \to c \to d$.
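    To make that bookkeeping concrete, here is a small sketch (our own illustration, not code from the repo) that enumerates the confounding paths for a given edge, under the assumption that every earlier sublayer feeds every later one via the residual stream:

    ```python
    from itertools import combinations

    # Sublayers in forward-pass order, matching the four-node example above.
    nodes = ["last_resid_out", "attn_out", "mlp_out", "resid_out"]

    def confounding_paths(src, dst):
        """All src -> dst paths routing through at least one intermediate sublayer."""
        i, j = nodes.index(src), nodes.index(dst)
        intermediates = nodes[i + 1 : j]
        paths = []
        # Any nonempty, order-preserving subset of the intermediates gives one path.
        for r in range(1, len(intermediates) + 1):
            for subset in combinations(intermediates, r):
                paths.append((src, *subset, dst))
        return paths

    for path in confounding_paths("last_resid_out", "resid_out"):
        print(" -> ".join(path))
    # last_resid_out -> attn_out -> resid_out
    # last_resid_out -> mlp_out -> resid_out
    # last_resid_out -> attn_out -> mlp_out -> resid_out
    ```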

  2. ^

    The topmost piece of interpretability data is a set of sequences for which the feature most activated, shaded accordingly. It is correlational data.

    Red rows of logit boxes are just the opposite of the blue rows: they are the logits that are causally most suppressed by the feature.

    The reason that there are sometimes multiple blue (and red) rows in a cell is that the rows are sourced from both local data and, when available, from Neuronpedia. The reason to focus on the bottom-most blue row is that that is the local data row, giving the causal effects of that feature for this particular forward pass.

  3. ^

    This wasn't done suitably at scale last time, but validation results (ablation over a significant dataset subset) do clean up at scale.


