Evaluating Synthetic Activations composed of SAE Latents in GPT-2

Published on September 25, 2024 8:37 PM GMT

This research was completed for London AI Safety Research (LASR) Labs 2024. The team was supervised by Stefan Heimersheim (Apollo Research). Find out more about the programme and express interest in upcoming iterations here.

TL;DR

We find that real activations cannot be mimicked using a “bag of SAE latents” with no internal structure, and that geometric and statistical properties of SAE latents play an important role in the composition of a realistic activation (Figure 1).

Figure 1:  The CDF of model sensitivity (over 1000 samples) for perturbations towards real, synthetic, and random activations. Synthetic-baseline activations (purple, simple “bags of SAE latents”) cannot match the behavior of real activations (orange), but synthetic-structured activations (green, taking into account SAE latent geometry) achieve a closer match. This shows that the geometry of SAE latents matters.

Paper Overview

Sparse Auto-Encoders (SAEs) are an increasingly popular method in mechanistic interpretability used to decompose a model’s residual stream into monosemantic and interpretable latents. As the community’s reliance on SAEs grows, it is crucial to verify that they accurately capture abstractions that the model uses in its computation. Recent work has shown that linearly perturbing a language model's activations at an early layer can result in step-function-like changes (“blowups”) in the model's final layer activations (Heimersheim and Mendel, 2024). Importantly, it has been found that perturbing in the direction of model-generated (“real”) activations caused blowups sooner than perturbing in random directions (Figure 2), i.e. the model is more sensitive to perturbations towards real activations than perturbations in random directions. This behavior distinguishes real activations from random directions, and we use it to evaluate synthetic activations made out of SAE latents. 
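To make the setup concrete, here is a minimal sketch of one perturbation sweep, assuming the TransformerLens API and GPT-2 small. The hook point, step count, and step size follow the description in Figure 2; the prompt and the perturbation target are placeholders (in the experiments, targets are real, random, or synthetic activations).

```python
# Minimal sketch of a single perturbation sweep (assumes TransformerLens).
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")
tokens = model.to_tokens("An example prompt")       # placeholder prompt
HOOK = "blocks.1.hook_resid_pre"                    # residual stream at Layer 1

# Unperturbed reference run.
clean_logits, cache = model.run_with_cache(tokens)
base = cache[HOOK][0, -1].clone()                   # Layer-1 activation, final position
clean_logprobs = clean_logits[0, -1].log_softmax(-1)

def run_patched(act):
    """Re-run the model with `act` patched into Layer 1 at the final position."""
    def hook(resid, hook):
        resid[0, -1] = act
        return resid
    return model.run_with_hooks(tokens, fwd_hooks=[(HOOK, hook)])

target = base + torch.randn_like(base)              # placeholder perturbation target
direction = (target - base) / (target - base).norm()

# 100 steps of size 0.5, as in Figure 2; record the KL divergence of the
# next-token distribution between the perturbed and unperturbed model.
kl_curve = []
for step in range(1, 101):
    logits = run_patched(base + 0.5 * step * direction)
    logprobs = logits[0, -1].log_softmax(-1)
    kl = torch.nn.functional.kl_div(
        logprobs, clean_logprobs, log_target=True, reduction="sum"
    )
    kl_curve.append(kl.item())
```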

In order to test the model’s sensitivity to different directions, we perturb 1000 real activations towards different types of activations and present the results in Figure 3. Here is a summary of the different types of activations we compare (a rough construction sketch for the synthetic types follows the list):

- Model-generated: An activation directly taken from Layer 1 for a random prompt given to the model.
- Random: A randomly sampled point from a normal distribution with the same mean and covariance as model-generated activations.
- Synthetic Baseline: An activation composed of SAE latents with similar sparsity and latent activations as active latents found in real activations (“bag of SAE latents”).
- Synthetic Structured: An activation composed of SAE latents where we take into account the pairwise cosine similarities of active latents found in real activations.
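For concreteness, here is a rough numpy sketch of how the random and synthetic-baseline types can be constructed. The decoder matrix, latent count, sparsity level, and magnitude distribution below are placeholders standing in for statistics measured from real SAE activations; this is illustrative, not the paper’s exact procedure.

```python
# Rough sketch of "random" and "synthetic-baseline" activations (numpy).
# All statistics here are placeholders for empirically measured values.
import numpy as np

rng = np.random.default_rng(0)
d_model, n_latents = 768, 24576                     # GPT-2 small width; latent count is a placeholder
W_dec = rng.standard_normal((n_latents, d_model))   # placeholder SAE decoder directions
W_dec /= np.linalg.norm(W_dec, axis=1, keepdims=True)

# Random: sample from a Gaussian with the mean/covariance of real activations.
mu = np.zeros(d_model)                              # placeholder empirical mean
cov = np.eye(d_model)                               # placeholder empirical covariance
random_act = rng.multivariate_normal(mu, cov)

# Synthetic baseline ("bag of SAE latents"): sample k active latents uniformly
# and combine their decoder directions with sampled activation magnitudes.
k = 60                                              # placeholder sparsity (number of active latents)
active = rng.choice(n_latents, size=k, replace=False)
mags = rng.gamma(shape=2.0, scale=1.0, size=k)      # placeholder magnitude distribution
baseline_act = mags @ W_dec[active]
```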
Figure 2:  The L2 distance at the final layer (left) and the KL divergence of the next-token prediction probabilities (right) between the perturbed and unperturbed model, as three model-generated (real) activations at Layer 1 are slowly perturbed towards other real activations (orange) and random points sampled from a Gaussian approximation of model activations (blue). The x-axis represents the total length of the perturbation broken into 100 steps of size 0.5 each. The dot on each solid line represents the maximum slope (MS) step for each perturbation, which is the metric we use to define the location of the blowup. The dashed lines represent the average L2 distance and KL divergence per step for 1000 perturbations of both types. The linear part at the start of the curves represents the activation plateau, and the sharp rise in the curves represents the blowup. Our findings confirm that perturbations towards real activations cause blowups sooner than perturbations in random directions.
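Given a divergence curve like the ones in Figure 2, the maximum-slope (MS) step can be located directly; a minimal sketch, matching the definition in the caption (the step with the largest per-step increase):

```python
import numpy as np

def max_slope_step(curve, step_size=0.5):
    """Return the step index where the per-step increase of `curve` is largest,
    i.e. the location of the blowup as defined in Figure 2."""
    slopes = np.diff(np.asarray(curve)) / step_size
    return int(np.argmax(slopes)) + 1

# e.g. max_slope_step(kl_curve) on the KL curve from the perturbation sketch above
```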

We find that the model is more sensitive to perturbations towards synthetic-baseline activations than random activations, which suggests that SAE latents encode more information about model computation than random directions do. However, there is still a large gap between synthetic-baseline activations and model-generated activations, which suggests that real activations cannot be explained using a “bag of SAE latents”. This gap implies that the internal structure of SAE latents that make up an activation matters, in addition to their sparsity and latent activation.

We noticed that the pairwise cosine similarities of active SAE latents with the top active latent follow a specific distribution. We attempt to emulate this distribution with the “synthetic-structured” activations and find that they look a lot more similar to real activations than synthetic-baseline activations do (Figure 3). This similarity provides more evidence that geometric and statistical properties of SAE latents are significant factors in how the model processes information and that just the presence or absence of a latent does not fully describe its semantic meaning.
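A hedged sketch of this construction: choose active latents so that their cosine similarities to the top active latent match a target distribution. The target distribution and the greedy matching rule below are illustrative placeholders, not the paper’s exact procedure.

```python
# Illustrative "synthetic-structured" construction (numpy). The decoder,
# sparsity, target similarity distribution, and matching rule are placeholders.
import numpy as np

rng = np.random.default_rng(1)
d_model, n_latents, k = 768, 24576, 60
W_dec = rng.standard_normal((n_latents, d_model))
W_dec /= np.linalg.norm(W_dec, axis=1, keepdims=True)

top = int(rng.integers(n_latents))            # stand-in for the top active latent
cos_sims = W_dec @ W_dec[top]                 # cosine similarity of every latent to `top`

# Target similarities drawn from a placeholder for the empirical distribution.
targets = rng.normal(loc=0.1, scale=0.05, size=k - 1)

# Greedily pick, for each target, the unused latent whose similarity is closest.
chosen = [top]
available = np.ones(n_latents, dtype=bool)
available[top] = False
for t in targets:
    candidates = np.where(available)[0]
    pick = candidates[np.argmin(np.abs(cos_sims[candidates] - t))]
    chosen.append(int(pick))
    available[pick] = False

mags = rng.gamma(shape=2.0, scale=1.0, size=k)  # placeholder magnitudes
structured_act = mags @ W_dec[chosen]
```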

Figure 3: The distributions of the max slope (MS) steps for perturbations towards model-generated (orange), random (blue), synthetic-baseline (purple), and synthetic-structured (green) activations. The left panel shows the counts of MS steps occurring in different bins along the length of the perturbation, and the right panel shows the corresponding CDF. We find that perturbing towards synthetic-structured activations is more similar to perturbing towards model-generated activations than perturbing towards synthetic-baseline activations is.

To test if these findings generalize to activation plateaus (also discussed in Heimersheim and Mendel, 2024), we run a different version of the experiment above, starting from different types of activations and perturbing in random directions. Here, we find that model-generated activations exhibit plateaus that are much more pronounced than those of even our synthetic-structured activations, which means that our synthetic activations are not as robust to noise as real activations. This suggests that there are additional properties of model-generated activations that our current methods do not fully capture, opening up the field for future research.
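A minimal sketch of this plateau test: starting from a given activation (here assumed to be a numpy vector), perturb in random directions and record how far the output stays stable. `divergence_at` stands in for a helper like the patched-run KL measurement in the perturbation sketch above; the tolerance and the averaging over directions are illustrative choices.

```python
# Sketch of the activation-plateau test (numpy). `divergence_at` is assumed to
# return a scalar divergence (e.g. final-layer KL) for a patched activation.
import numpy as np

def plateau_length(start, divergence_at, n_dirs=10, step_size=0.5,
                   n_steps=100, tol=0.1):
    """Average number of random-direction steps before the divergence exceeds
    `tol`; longer plateaus mean the starting activation is more robust to noise."""
    rng = np.random.default_rng(0)
    lengths = []
    for _ in range(n_dirs):
        d = rng.standard_normal(np.shape(start))
        d /= np.linalg.norm(d)
        steps = 0
        for s in range(1, n_steps + 1):
            if divergence_at(start + step_size * s * d) > tol:
                break
            steps = s
        lengths.append(steps)
    return float(np.mean(lengths))
```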

To read more about this work, including more details about our method, its limitations, and the various exploratory studies we ran, head to the full paper. Any feedback and comments on our research are highly appreciated!



