LessWrong, July 6, 2024
[Interim research report] Activation plateaus & sensitive directions in GPT2

 

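This research investigates how neural networks compute new features in superposition, and proposes a hypothesis: neural networks need some form of error correction to keep interference from spiraling out of control. The research tests this hypothesis with two experiments: an activation plateau experiment and a sensitive direction experiment. The results show that real activations exhibit a plateau effect under small perturbations, i.e. the perturbation affects the model output less than expected. In addition, perturbing a real activation towards another real activation affects the model output more readily than perturbing it in a random direction.

🤔 **Activation plateau experiment:** This experiment tests whether neural networks have an error correction mechanism that suppresses small noise in feature activations. The results show that, under random-direction perturbations, the model output changes less for real activations than for random activations. This suggests that a real activation sits at the center of a plateau: within a certain range, perturbations have little effect on the model output.

🧭 **Sensitive direction experiment:** This experiment further tests the error correction mechanism. The results show that perturbing a real activation towards another real activation changes the model output faster than perturbing it in a random direction. This indicates that real-activation directions are more sensitive than random directions, i.e. more easily affected by perturbation.

💡 **Sparse autoencoders (SAEs) and model behavior:** The researchers hypothesize that SAE features have the same error correction properties as the features in models of computation in superposition. To test this, they plan the following experiments: 1. Check whether SAE features also exhibit a plateau effect, i.e. whether the model output changes less than expected under perturbation. 2. Check whether perturbing a real activation in an SAE feature direction has a similar effect to perturbing it towards another real activation.

🤔 **A potential problem with SAE features:** The researchers point out that SAE features may merely reflect properties of the training dataset rather than the model's actual internal computation. They therefore hope to find a cheaper or more efficient way to identify model features without relying on expensive SAE training.

🚀 **Future research directions:** The researchers plan to further study the properties of SAE features, for example: 1. Are SAE features consistent with the features in models of computation in superposition? 2. Are there different types of SAE features, such as atomic and composite features? 3. How do linear combinations of features affect model behavior?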

Published on July 5, 2024 5:05 PM GMT

This part-report / part-proposal describes ongoing research, but I'd like to share early results for feedback. I am especially interested in any comment finding mistakes or trivial explanations for these results. I will work on this proposal with a LASR Labs team over the next 3 months. If you are working (or want to work) on something similar I would love to chat!
Experiments and write-up by Stefan, with substantial inspiration and advice from Jake (who doesn’t necessarily endorse every sloppy statement I write). Work produced at Apollo Research.

TL;DR: Toy models of how neural networks compute new features in superposition seem to imply that neural networks that utilize superposition require some form of error correction to avoid interference spiraling out of control. This means small variations along a feature direction shouldn't affect model outputs, which I can test:

    Activation plateaus: Real activations should be resistant to small perturbations. There should be a "plateau" in the output as a function of perturbation size.
    Sensitive directions: Perturbations towards the direction of a feature should change the model output earlier (at a lower perturbation size) than perturbations into a random direction.

I find that both of these predictions hold; the latter when I operationalize "feature" as the difference between two real model activations. As next steps we are planning to

My motivation for this project is to get data on computation in superposition, and to get dataset-independent evidence for (SAE-)features.

Core results & discussion

I run two different experiments that test the error correction hypothesis:

    Activation Plateaus: A real activation is the center of a plateau, in the sense that perturbing the activation affects the model output less than expected. Concretely: applying random-direction perturbations to an activation generated from a random openwebtext input (“real activation”) has less effect than applying the same perturbations to a random activation (generated from a Normal distribution). This effect on the model can be measured in KL divergence of logits (shown below) but also L2 difference or cosine similarity of late-layer activations.

    The left edge of this plot corresponds to no perturbation. As we move to the right, the orange lines (real base activation) "plateau" (the perturbation effect stays minimal for a while), while the blue lines (random base activation) rise immediately. Note that a distance of ~50 corresponds to the typical distance between two activation vectors. This plot shows the effect of changing the base activation; the perturbation direction is random in both cases.

    Sensitive directions: Perturbing a (real) activation into a direction towards another real activation (“poor man’s feature directions”) affects the model outputs more than perturbing the same activation into a random direction. In the plot below, focus on the size of the “plateau” on the left-hand side.

    This plot varies the perturbation direction; the base activation is a real activation in both cases. The orange lines (other real activation direction) exhibit a shorter plateau than the blue lines (random direction). This means the real activation directions are more sensitive. To show why we usually sample random activations and directions from a mean- and covariance-adjusted distribution, we show the naive (isotropic) random direction in purple. It behaves very differently and thus is not a good baseline.
      Naive random direction vs mean & covariance-adjusted random: Naive isotropic random directions are much less sensitive. Thus we use mean & covariance-adjusted random activations everywhere else in this report.
    The sensitive direction results are related to Gurnee (2024, SAE-replacement-error direction vs naive random direction) and Lindsey (2024, Anthropic April Updates, SAE-feature direction vs naive random direction).

The theoretical explanation for activation plateaus & sensitive direction may be error correction (also referred to as noise suppression):

Proposal: Connecting SAEs to model behaviour

The leading theory for how concepts are represented in neural networks in superposition: We think that NNs represent information as a series of sparsely-active features, which are represented as directions in activation space.[3] Superposition allows this list of features to be much larger than the dimension of activation space, and has been demonstrated in toy models.

Sparse autoencoders (SAEs) are a method that can recover individual features from a dataset of features in superposition. SAEs are trained to convert activations into a list of sparsely-active individual features and back into activations with low reconstruction loss and high feature-sparsity. The training input for SAEs are model activations, typically generated by running the model on a dataset similar to its training data.
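To make the encode/decode description above concrete, here is a minimal NumPy sketch of an SAE forward pass and loss. It assumes a common ReLU encoder with an L1 sparsity penalty; the report does not specify an architecture, so the parameter names (`W_enc`, `b_enc`, `W_dec`, `b_dec`, `l1_coeff`) and the toy sizes are illustrative assumptions only.

```python
import numpy as np

def sae_forward(x, W_enc, b_enc, W_dec, b_dec):
    """Map activations to non-negative feature activations and back."""
    # ReLU encoder: an assumed, commonly used choice (not taken from the report).
    features = np.maximum(0.0, (x - b_dec) @ W_enc + b_enc)
    reconstruction = features @ W_dec + b_dec
    return features, reconstruction

def sae_loss(x, features, reconstruction, l1_coeff=1e-3):
    """Reconstruction error plus an L1 penalty that encourages sparse features."""
    recon_error = np.mean(np.sum((x - reconstruction) ** 2, axis=-1))
    sparsity = np.mean(np.sum(np.abs(features), axis=-1))
    return recon_error + l1_coeff * sparsity

# Toy sizes: more features than activation dimensions (the superposition setting).
rng = np.random.default_rng(0)
d_model, n_features = 16, 64
W_enc = rng.normal(scale=0.1, size=(d_model, n_features))
b_enc = np.zeros(n_features)
W_dec = rng.normal(scale=0.1, size=(n_features, d_model))
b_dec = np.zeros(d_model)

x = rng.normal(size=(4, d_model))  # stand-in for a batch of model activations
features, reconstruction = sae_forward(x, W_enc, b_enc, W_dec, b_dec)
loss = sae_loss(x, features, reconstruction)
```

Training would minimize `loss` over a dataset of model activations, which is where the dependence on the choice of dataset enters.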

If SAE-features are features in the sense that computation in superposition toy models suggest, then they should show the same error correction properties we saw with real feature directions. Thus we predict

Why do I think this is a useful direction to study SAEs?

    There are weird effects around how SAE features affect model behavior that we don't fully understand. Let's figure out what is going on and what we can learn!
    A (neglected?) failure mode of the SAE agenda is that SAE features could be an interpretability illusion in the sense that they do not represent the internal computation of the model but properties of the training dataset.
      I worry that SAEs find a feature only because a concept is frequent in the dataset rather than because the model uses the concept. (I discuss this in detail in a shortform post). A dataset-independent way to find/confirm SAE features (even if non-competitive) would be great!
    I’d like to find a cheaper or more targeted way to find (some or all) model features.
      Training SAEs is expensive, and cost trades off against feature completeness. If we could take an individual prompt and find (all?) active features, this would be extremely useful for evaluations and interpretability research.

Conclusion

Summary: I run some experiments testing computation-in-superposition predictions on GPT2 activations, finding

    Plateaus around model activations, as if the model was error-correcting small perturbations
    Perturbing activations into the direction of other activations has more effect than random

I hope this research will allow us to understand computation in superposition better, and to connect behavioral properties of model activations to (SAE-)features.

Limitations: There may just be trivial explanations for results like these! Section 1 results really just say “GPT2 is weird if you go off distribution” (and happen to align with a theory prediction), but there could be lots of plausible explanations for this. Section 2 results are more specific, but still there might be simple explanations for this behavior (e.g. relevant properties of activation space beyond the covariance thing we noticed), and I would love to hear takes in the comments!

Future work: We are currently investigating these behavior properties for SAE-features, questions like

    Do SAE features behave as predicted by Toy Models of Computation in Superposition?
    Are there different types of SAE features? Atomic and composite features?
    How do linear combinations of features behave? Does this give us a handle on the total number of features?

Call to action: This direction feels underexplored, I think there’s a lot of new data to be generated here! I’d love to hear from anyone considering working on this!

I also want to encourage feedback in the comments: Trivial explanations I missed? Past literature that explored this? Reasons why this direction might be less promising than I think?

Acknowledgements: We thank Dan Braun, Lee Sharkey, Lucius Bushnaq, Marius Hobbhahn, Nix Goldowsky-Dill, and the whole Apollo team for feedback and discussions of these results. We thank Wes Gurnee and Rudolf Laine for comments on a previous (March 2024) report on this project.

Appendix

Methodology

The experiments in this report focus on perturbing the residual stream of a model (via activation patching) and measuring the corresponding change in model outputs (KL divergence and more).

All experiments use GPT2-small. Inputs are 10-token sequences taken from openwebtext (apollo-research/Skylion007-openwebtext-tokenizer-gpt2). We choose an early perturbation layer (blocks.1.hook_resid_pre). We read the results off at the logits (KL divergence of logprobs) or at a late layer (L2 difference of activations at blocks.11.hook_resid_post or ln_final.hook_normalized). We use only the last position index for perturbation and read-off.

Generating activations: We use model activations to measure activation plateaus, and to generate the perturbation directions for sensitivity tests. We consider 4 types of activations

All activation vectors have zero layer-mean (each activation has zero mean along the hidden dimension), but not zero dataset-mean (i.e. I mean-center in the same way as TransformerLens but the activation dataset mean is not the zero vector). I don’t fix the norm of activation vectors (yet).
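The distinction between the two means can be illustrated with a small NumPy sketch. The toy data below stands in for real residual-stream activations and is an assumption for illustration; only the centering along the hidden dimension mirrors what the text describes.

```python
import numpy as np

def center_hidden(acts):
    """Subtract each activation's own mean over the hidden (last) dimension."""
    return acts - acts.mean(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
# Toy stand-in for real activations: a shared per-dimension offset plus noise.
acts = rng.normal(loc=1.0, size=(1000, 768)) + rng.normal(size=768)

centered = center_hidden(acts)
layer_mean = centered.mean(axis=-1)   # one number per activation, ~0 after centering
dataset_mean = centered.mean(axis=0)  # one 768-d vector, generally not the zero vector
```

Here `layer_mean` is zero up to floating point error for every activation, while `dataset_mean` keeps the shared offset pattern, which is why the dataset mean has to be accounted for separately when sampling random activations.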

Generating directions: We need to choose a direction in which to perturb the activations. In most cases we generate an activation according to the list above and take the difference between it and the base activation to obtain a direction.
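A minimal sketch of this procedure, assuming we already hold a matrix of real activations: a mean- and covariance-adjusted random activation is drawn from a Gaussian fitted to that matrix, a naive one from an isotropic Gaussian, and the direction is the difference to the base activation. The function names, the toy data, and the choice of isotropic scale are assumptions for illustration, not the report's code.

```python
import numpy as np

def sample_adjusted_random(real_acts, n, rng):
    """Random activations matching the dataset mean and covariance."""
    mean = real_acts.mean(axis=0)
    cov = np.cov(real_acts, rowvar=False)
    return rng.multivariate_normal(mean, cov, size=n)

def sample_naive_random(real_acts, n, rng):
    """Isotropic baseline: zero mean, one shared scale (assumed: the overall std)."""
    return rng.normal(0.0, real_acts.std(), size=(n, real_acts.shape[1]))

def direction_to(base, target):
    """Perturbation direction pointing from the base activation to the target."""
    return target - base

rng = np.random.default_rng(0)
# Toy stand-in for real activations (8-d instead of 768-d to keep it fast).
scales = np.linspace(0.5, 3.0, 8)
offset = np.linspace(-1.0, 1.0, 8)
real_acts = rng.normal(size=(500, 8)) * scales + offset

base = real_acts[0]
adjusted = sample_adjusted_random(real_acts, 20000, rng)
naive = sample_naive_random(real_acts, 5, rng)
random_direction = direction_to(base, adjusted[0])
real_other_direction = direction_to(base, real_acts[1])
```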

The real-other direction is a proxy for getting feature directions without having to rely on SAEs. The difference between two real activations should be a couple hundred features (about half of them negative) because each real activation should consist of a number (~L0) of features.

Perturbations: I perturb the base activation by adding α·direction to the activation vector. I do this in two ways:
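As a minimal sketch, the Straight and Sphere modes named in the detailed results could look like this. Straight mode adds α times a unit-normalized direction, so that α is the L2 perturbation distance. For Sphere mode the text only states that the norm is kept constant; the rotation by an angle θ in the plane spanned by the base activation and the direction is an assumed construction, and the function names are hypothetical.

```python
import numpy as np

def perturb_straight(base, direction, alpha):
    """Move a distance alpha from base along the (unit-normalized) direction."""
    unit = direction / np.linalg.norm(direction)
    return base + alpha * unit

def perturb_sphere(base, direction, theta):
    """Rotate base by angle theta (radians) towards direction, keeping its norm.

    Assumed construction: rotation in the plane spanned by base and direction.
    """
    base_norm = np.linalg.norm(base)
    base_hat = base / base_norm
    # Component of the direction orthogonal to the base activation.
    ortho = direction - (direction @ base_hat) * base_hat
    ortho_hat = ortho / np.linalg.norm(ortho)
    return base_norm * (np.cos(theta) * base_hat + np.sin(theta) * ortho_hat)

rng = np.random.default_rng(0)
base = rng.normal(size=768)
direction = rng.normal(size=768)

straight = perturb_straight(base, direction, alpha=10.0)
sphere = perturb_sphere(base, direction, theta=np.deg2rad(40.0))
```

In this construction the Sphere-mode result has the same norm as `base` and makes an angle of exactly θ with it, which matches reporting results as a function of angle.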

Effect on model outputs: To measure the effectiveness of a perturbation I measure the KL divergence between the perturbed model output logits and the base output logits, the L2 difference and cosine similarity between the activations in a late layer taken on the base and perturbed runs.
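The three metrics can be written down in a few lines of NumPy. This is a sketch under assumptions: the KL direction KL(base ‖ perturbed) is my choice (the text does not specify it), and the random arrays below merely stand in for real logits.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-probabilities over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def kl_divergence(base_logits, perturbed_logits):
    """KL(base || perturbed) between next-token distributions (assumed direction)."""
    log_p = log_softmax(base_logits)
    log_q = log_softmax(perturbed_logits)
    return np.sum(np.exp(log_p) * (log_p - log_q), axis=-1)

def l2_difference(a, b):
    """Euclidean distance between base-run and perturbed-run activations."""
    return np.linalg.norm(a - b, axis=-1)

def cosine_similarity(a, b):
    """Cosine of the angle between base-run and perturbed-run activations."""
    return np.sum(a * b, axis=-1) / (
        np.linalg.norm(a, axis=-1) * np.linalg.norm(b, axis=-1)
    )

rng = np.random.default_rng(0)
base_logits = rng.normal(size=(3, 50))  # stand-in for base output logits
perturbed_logits = base_logits + 0.1 * rng.normal(size=(3, 50))
kl = kl_divergence(base_logits, perturbed_logits)
```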

Experiment types: I run three types of experiments. In all cases I measure the effect of activation perturbations (as a function of perturbation L2 distance α or angle θ) on model outputs (KL divergence, final-layer L2 distance).

Detailed results

1. Activation Plateaus

Intuition: Look at the model-output landscape when interpolating activations on the plane spanned by three real activations. The plot below shows the KL divergence (wrongly normalized, z-axis and color) for all activations on that plane. We see plateaus around the real activations (black vertical lines) with outputs changing less per shift in activations. This gives an intuitive picture; for the quantitative study we switch to a 1D version and switch from interpolation to perturbation into random directions.

We sample a series of base activations (random or real-other) and perturb the activations from there towards a series of random directions (as discussed above we either perturb in Straight or Sphere modes). Below we show the KL div as a function of perturbation angle (Sphere case) for both types. The real-other activations clearly exhibit the plateau phenomenon—the KL div barely changes until the perturbation reaches 40°—while random activations do not follow this pattern.

Straight mode (perturbing straight into a direction):

Sphere mode (perturbing while keeping norm constant – the change between this and the plot above is due to straight/sphere mode, the seed does not have a big effect):

2. Sensitive directions

Now we perturb a given (real) base activation into different kinds of directions. This is different from the previous experiment where we applied the same (random) perturbation to different base activations.

I take a given real base activation (seed / prompt shown in plot titles) and perturb it into a random direction (sample a random other activation with the appropriate covariance matrix, and define the direction as the difference between the new activation and the base activation) or a real-other direction (sample a new activation by running a random openwebtext sequence through the model, then take the difference as the direction). I normalize the directions to have the same norm, and observe the effect on the model (KL div) as a function of angle (Sphere mode) or perturbation size (Straight mode). In all cases the real-other directions appear to be more sensitive, jumping up at a lower angle and lower perturbation distance.

Straight mode (perturbing straight into a direction):

Sphere mode (perturbing while keeping norm constant):

Other seeds (base prompts) looked qualitatively similar.

A brief investigation shows that we can find metrics, such as “at what angle does the KL divergence exceed 0.5” to reasonably distinguish the two classes of perturbation, though I think that the curves above look more distinguishable than suggested by the histograms below. (This may be an optical illusion, or show that I haven’t spent time finding the optimal classifier.)
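A threshold metric of this kind can be sketched as follows: given a scan of KL divergence over angles (or distances), return the first point where the curve reaches the threshold. The linear interpolation between scan points and the function name are assumptions for illustration.

```python
import numpy as np

def first_crossing(xs, kls, threshold=0.5):
    """Smallest x at which the KL curve first reaches the threshold.

    Interpolates linearly between scan points; returns nan if never reached.
    """
    xs = np.asarray(xs, dtype=float)
    kls = np.asarray(kls, dtype=float)
    above = np.nonzero(kls >= threshold)[0]
    if above.size == 0:
        return float("nan")
    i = above[0]
    if i == 0:
        return float(xs[0])
    x0, x1 = xs[i - 1], xs[i]
    y0, y1 = kls[i - 1], kls[i]  # y0 < threshold <= y1, so y1 - y0 > 0
    return float(x0 + (threshold - y0) * (x1 - x0) / (y1 - y0))

# Made-up scan values: flat plateau, then a rise.
angles = [0.0, 10.0, 20.0, 30.0]
kl_curve = [0.0, 0.0, 1.0, 2.0]
crossing_angle = first_crossing(angles, kl_curve)  # halfway between 10 and 20
```

Computing this number for many random and real-other directions gives the two distributions that a histogram or a simple classifier could then compare.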

3. Local optima in sensitivity

real-other directions are more sensitive than random directions. We think this is because they focus perturbations into a small (~L0) number of feature directions, reaching the hypothetical error correction threshold earlier.

We conjecture that, if we could perturb activations into a single feature direction, the perturbation would be even more focused and reach the error correction threshold earlier (concretely: the perturbation distance required to reach KL-div=0.5 would be lower). This is compatible with Lindsey (2024)’s observations that SAE directions are unusually sensitive (though they did not compare to real-other or combinations of SAE directions). If that were true, we might be able to find SAE directions as local maxima of sensitivity: A perturbation into 1*feature direction should be more sensitive than a perturbation into 0.99*feature direction + some other direction.

As a precursor to this investigation we investigate the sensitivity of various directions between real-other and random directions. In the plot below, every corner corresponds to a random direction (except for the top corners in the 2nd row, these correspond to real-other directions). And the color corresponds to the perturbation size (measured in Sphere mode, as angle) required to reach 0.5 KL divergence (so every point corresponds to a direction, and we run a scan over perturbation lengths on every point). The non-corner points correspond to interpolated directions (precisely: we interpolate the targets before calculating the direction). This shows us whether “nearby” directions are similarly precise as the exact real-other direction.
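The interpolation over the triangle can be sketched with barycentric weights: each point is a convex combination of three target activations, and the direction is computed only after mixing the targets. The grid resolution, function names, and toy vectors are assumptions for illustration.

```python
import numpy as np

def barycentric_grid(steps):
    """All weight triples (w0, w1, w2) >= 0 summing to 1 on a regular grid."""
    triples = [
        (i, j, steps - i - j)
        for i in range(steps + 1)
        for j in range(steps + 1 - i)
    ]
    return np.array(triples, dtype=float) / steps

def interpolated_directions(base, targets, weights):
    """Mix the three targets first, then take the difference to the base."""
    mixed_targets = weights @ targets  # shape (n_points, d)
    return mixed_targets - base

rng = np.random.default_rng(0)
base = rng.normal(size=768)
targets = rng.normal(size=(3, 768))  # corners, e.g. one real-other and two random

weights = barycentric_grid(steps=4)
directions = interpolated_directions(base, targets, weights)
```

Each row of `directions` would then be scanned over perturbation sizes to find the angle at which the KL divergence reaches 0.5, giving one color value per point in the triangle.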

The upper row is a sanity-check, interpolating between 3 random directions. We expect the plot to be symmetric. The lower row is an interpolation between a real-other direction (top) and two random directions (bottom corners). We see, as expected, the top corner appears to be a local optimum of sensitivity:

While these plots initially seem to suggest a local optimum at the top corner (2nd row), they only test two (random) directions in 768d space. If real-other directions consist of ~L0 number of features, and the previous hypothesis is true, we expect there to be an L0-dimensional space in which the direction is not a local optimum. We plan to continue these investigations in future work.

  1. ^

    Empirically this is a bit messy: Inputting a random direction into an SAE activates between 10 and 20000 features (lognormal distribution with a peak around 30). But that is using the encoder, I'm not sure if I should be doing that.

  2. ^

    The real-other direction is expected to turn on some features, but also to dampen existing features. My explanation focuses on turning on inactive features, and ignores the slight dampening of active features.

  3. ^

    This is not fully true—we know some directions represent non-sparse positional features, and there is information in the geometry of features—but we leave this aside for now.


