Compressed Computation is (probably) not Computation in Superposition
This work examines the Compressed Computation (CC) toy model introduced by Braun et al., which appears to compute more non-linear functions (100 target ReLU functions) than it has ReLU neurons (50). The results indicate that the model's performance advantage does not come from an efficient computational mechanism but instead depends on noisy labels. By isolating the noise term and analysing the model's behaviour at different noise levels, the study reveals the central role played by the noise, and it proposes a non-trained model based on the mixing matrix that shows performance similar to the trained model without matching all of its properties. This suggests that the CC model may not be a suitable toy model of CiS.

🔍 The core finding is that the performance of the Compressed Computation (CC) model depends strongly on noise in the form of a "mixing matrix" that mixes different input features into the labels. Without this mixing matrix, the model cannot beat the baseline.

📈 The CC model's performance scales with the magnitude of the mixing matrix: larger noise generally yields better performance, but beyond a certain point performance starts to degrade again.

🧠 The trained CC model concentrates on the subspace spanned by the top 50 singular vectors of the mixing matrix, indicating that the mixing matrix largely determines the learned solution.

💡 The study proposes a new model based on a semi non-negative matrix factorization (SNMF) of the mixing matrix. It exhibits qualitatively and quantitatively similar loss to the trained model, further supporting the central role of the noise.

Published on June 23, 2025 7:35 PM GMT

This research was completed during the Mentorship for Alignment Research Students (MARS 2.0) and Supervised Program for Alignment Research (SPAR spring 2025) programs. The team was supervised by Stefan (Apollo Research). Jai and Sara were the primary contributors; Stefan contributed ideas, ran final experiments, and helped write the post. Giorgi contributed in the early phases of the project. All results can be replicated using this codebase.

Summary

We investigate the toy model of Compressed Computation (CC), introduced by Braun et al. (2025): a model that seemingly computes more non-linear functions (100 target ReLU functions) than it has ReLU neurons (50). Our results cast doubt on whether the mechanism behind this toy model indeed computes more functions than it has neurons: we find that the model's performance relies solely on noisy labels, and that its performance advantage over baselines diminishes with lower noise.

Specifically, we show that the Braun et al. (2025) setup can be split into two loss terms: the ReLU task and a noise term that mixes the different input features (the "mixing matrix"). We isolate these terms and show that the optimal total loss increases as we reduce the magnitude of the mixing matrix. This suggests that the loss advantage of the trained model does not originate from a clever algorithm for computing the ReLU functions in superposition (computation in superposition, CiS), but from taking advantage of the noise. Additionally, we find that the directions represented by the trained model mainly lie in the subspace spanned by the eigenvectors of the mixing matrix with positive eigenvalues, suggesting that this matrix determines the learned solution. Finally, we present a non-trained model derived from the mixing matrix which improves upon previous baselines. This model exhibits a similar performance profile to the trained model, but does not match all its properties.

While we have not been able to fully reverse-engineer the CC model, this work reveals several key mechanisms behind the model. Our results suggest that CC is likely not a suitable toy model of CiS.

Introduction

Superposition in neural networks, introduced by Elhage et al. (2022), describes how neural networks can represent more sparse features than they have embedding dimensions, i.e. $n > d$ features in a $d$-dimensional embedding space. They also introduce Computation in Superposition (CiS) as computation performed entirely in superposition. The definition of CiS has been refined by Hänni et al. (2024) and Bushnaq & Mendel (2024) as a neural network computing more non-linear functions than it has non-linearities, still assuming sparsity.

Braun et al. (2025) propose a concrete toy model of Compressed Computation (CC) which seemingly implements CiS: It computes 100 ReLU functions using only 50 neurons. Their model is trained to compute $y_i = x_i + \mathrm{ReLU}(x_i)$ for $i \in \{1, \dots, 100\}$. Naively one would expect an MSE loss per feature of 0.0833[1]; however, the trained model achieves a loss well below this value. Furthermore, an inspection of the model weights and performance shows that it does not privilege any features, suggesting it computes all 100 functions.[2]
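
As a quick sanity check on this baseline (our own illustration, not code from the post's codebase), the snippet below estimates the per-feature MSE of the naive "implement half the features, ignore the rest" strategy, assuming active inputs are uniform on $[-1, 1]$ and that an ignored feature contributes an error of $\mathrm{ReLU}(x)$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Active inputs are uniform on [-1, 1]. A perfectly implemented feature has zero
# error; an ignored feature contributes an error of ReLU(x) (the residual stream
# already supplies the x part of the label x + ReLU(x)).
x = rng.uniform(-1.0, 1.0, size=1_000_000)
mse_ignored = (np.maximum(x, 0.0) ** 2).mean()   # ≈ 1/6, matching the analytic integral

# Average over 50 implemented (zero-loss) and 50 ignored features.
mse_per_feature = 0.5 * 0.0 + 0.5 * mse_ignored  # ≈ 1/12 ≈ 0.0833
print(mse_ignored, mse_per_feature)
```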

Contributions: Our key results are that

    1. The CC model’s performance is dependent on noise in the form of a "mixing matrix" which mixes different features into the labels. It does not beat baselines without this mixing matrix.
    2. The CC model’s performance scales with the mixing matrix's magnitude (higher is better, up to some point). We also find that the trained model focuses on the top 50 singular vectors of the mixing matrix (all its neuron directions mostly fall into this subspace).
    3. We introduce a new model, derived from the SNMF (semi non-negative matrix factorization) of the mixing matrix alone, that achieves qualitatively and quantitatively similar loss to the trained model.

Methods

We simplify the residual network proposed by Braun et al. (2025), showing that the residual connection (with fixed, non-trainable embedding matrices) serves as a source of noise that mixes the features. We show that a simple 1-layer MLP trained on labels of the form $y = \mathrm{ReLU}(x) + Mx$, where $M$ is the mixing matrix, produces the same results.

Figure 1: The original model architecture from Braun et al. (2025), and our simpler equivalent model.

We explore three settings for the mixing matrix $M$:

    1. The “embedding noise” case $M = W_E W_E^\top - I$ (the Gram matrix of the feature embeddings minus the identity), which is essentially equivalent to Braun et al. (2025).[3] The rows of $W_E$ are random unit vectors, following Braun et al. (2025).
    2. The “random noise” case, where we set the entries of $M$ to random values drawn from a normal distribution with mean 0 and standard deviation $\sigma$, typically between 0.01 and 0.05.
    3. The “clean” case, where we set $M = 0$.

The input vector is sparse: each input feature $x_i$ is independently drawn from a uniform distribution over $[-1, 1]$ with probability $p$, and otherwise set to zero. We also consider the maximally sparse case where exactly one input is nonzero in every sample. Following Braun et al. (2025) we use a batch size of 2048 and a learning rate of 0.003 with a cosine scheduler. To control for training exposure we train all models for 10,000 non-empty batches. The only trainable parameters are $W_{\text{in}}$ and $W_{\text{out}}$ (not $M$), though in some cases (Figure 5a) we optimize the magnitude of $M$ together with the weights.
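
For concreteness, here is a minimal PyTorch sketch of this simplified setup (our own reconstruction, not the accompanying codebase; the optimizer choice, initialisation, and the example feature probability $p = 0.01$ are assumptions):

```python
import torch

N_FEATURES, N_NEURONS, D_EMBED = 100, 50, 1000

def make_mixing_matrix(kind: str, sigma: float = 0.02) -> torch.Tensor:
    """The three settings for the mixing matrix M described above."""
    if kind == "clean":
        return torch.zeros(N_FEATURES, N_FEATURES)
    if kind == "random":
        return sigma * torch.randn(N_FEATURES, N_FEATURES)
    if kind == "embedding":
        # Rows of W_E are random unit vectors; M is the Gram matrix minus the
        # identity (our reconstruction of the embedding-noise case).
        w_e = torch.randn(N_FEATURES, D_EMBED)
        w_e = w_e / w_e.norm(dim=1, keepdim=True)
        return w_e @ w_e.T - torch.eye(N_FEATURES)
    raise ValueError(kind)

def sample_batch(batch_size: int, p: float, M: torch.Tensor):
    """Sparse inputs: each feature is U[-1, 1] with probability p, else zero."""
    x = torch.rand(batch_size, N_FEATURES) * 2 - 1
    x = x * (torch.rand(batch_size, N_FEATURES) < p).float()
    y = torch.relu(x) + x @ M.T            # ReLU task plus mixing-matrix noise
    return x, y

class OneLayerMLP(torch.nn.Module):
    """Bias-free 1-layer MLP: only W_in and W_out are trained, M is fixed."""
    def __init__(self):
        super().__init__()
        self.w_in = torch.nn.Linear(N_FEATURES, N_NEURONS, bias=False)
        self.w_out = torch.nn.Linear(N_NEURONS, N_FEATURES, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w_out(torch.relu(self.w_in(x)))

M = make_mixing_matrix("random", sigma=0.02)
model = OneLayerMLP()
opt = torch.optim.Adam(model.parameters(), lr=3e-3)
sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=10_000)
for step in range(10_000):
    x, y = sample_batch(2048, p=0.01, M=M)
    loss = ((model(x) - y) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step(); sched.step()
```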

Results

Qualitatively different solutions in sparse vs. dense input regimes

We reproduce the qualitative results of Braun et al. (2025) in our setting, using the embedding-derived $M$ as well as a random matrix with entries drawn from $\mathcal{N}(0,\,0.02)$. In both cases we find qualitatively different behaviour at high and low sparsities, as illustrated in Figure 2.

Figure 2: Loss per feature as a function of evaluation sparsity. Each solid line corresponds to a model trained at a given sparsity. The models learn one of two solution types, depending on the input sparsity used during training: the “compressed computation” (CC) solution (violet) or a dense solution (green). Both types beat the naive baseline (dashed line) in their respective regime. Black circles connected by a dotted line represent the results seen by Braun et al. (2025), where models were evaluated only at their training sparsity.

    - In the sparse regime (low probability $p$) we find solutions that perform well on sparse inputs, and less well on dense inputs. They typically exhibit a similar input-output response for all features (Figure 3a), and weights distributed across all features (Figure 4a, equivalent to Figure 6 in Braun et al. 2025). The maximally sparse case (exactly one input active) behaves very similarly to the low-$p$ case.
    - In the dense regime (high probability $p$) we find solutions with a constant per-feature loss on sparse inputs, but better performance on dense inputs. These solutions tend to implement half the input features with a single neuron each, while ignoring the other half (Figures 3b and 4b).

Braun et al. (2025) studied the solutions trained in the sparse regime, referred to as Compressed Computation (CC).[4] In our analysis we will mostly focus on the CC model, but we also provide an analysis of the dense model in a later section.

Figure 3: Input/output behaviour of the two model types (for one-hot inputs): In the “compressed computation” solution (left panel), all features are similarly well represented: each input activates the corresponding output feature. In contrast, the dense solution (right panel) shows a strong (and more accurate) response for half the features, while barely responding to the other half. The green dashed line indicates the expected response under perfect performance.
Figure 4: Weights representing each input feature, split by neuron. Each bar corresponds to a feature (x-axis) and shows the adjusted weight value, split by neuron index (color). The CC solution (left) uses combinations of neurons to represent each feature (to around 70%), whereas the dense solution (right) allocates a single neuron to fully (~100%) represent 50 out of 100 features.

Quantitative analysis of the Compressed Computation model

We quantitatively compare the model of Braun et al. (2025), with the embedding-derived $M$, to three simple mixing matrices: (a) a fully random mixing matrix $M$, (b) a random but symmetric $M$, and (c) $M = 0$.
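
One plausible construction of the symmetric random mixing matrix in (b), a random lower-triangular matrix mirrored as described in the Figure 5 caption below, is sketched here (our own sketch; the treatment of the diagonal is an assumption):

```python
import torch

def symmetric_random_M(n: int = 100, sigma: float = 0.02) -> torch.Tensor:
    """Symmetric mixing matrix: draw a random lower-triangular matrix and mirror it."""
    lower = torch.tril(sigma * torch.randn(n, n))        # includes the diagonal
    return lower + torch.tril(lower, diagonal=-1).T      # mirror the strictly lower part
```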

We confirm that a fully random mixing matrix qualitatively reproduces the results (green curve in Figure 5a), and that a symmetric (but otherwise random) mixing matrix gives almost exactly the same quantitative results (red curve in Figure 5a) as the embedding case studied in Braun et al. (2025) (blue line).[5][6]

Importantly, we find that a dataset without a mixing matrix does not beat the naive loss. Braun et al. (2025) hypothesized that the residual stream (equivalent to our mixing matrix) was necessary due to training dynamics but was not the source of the loss improvement. We probe this hypothesis with two tests.

Firstly, we train the CC models over a range of noise scales $\sigma$ from 0 to 0.08, as shown in Figure 5b. We see that the loss decreases almost linearly with $\sigma$, for small $\sigma$. This is evidence that the loss advantage is indeed coming directly from the noise term.
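
A compact version of this sweep might look as follows (our own sketch; the $\sigma$ grid, step count, and optimizer are assumptions, and a careful replication would evaluate on held-out batches rather than the last training batch):

```python
import torch

def train_one(sigma: float, n_features: int = 100, n_neurons: int = 50,
              p: float = 0.01, steps: int = 2_000, batch: int = 2048) -> float:
    """Train a bias-free 1-layer MLP on y = ReLU(x) + Mx and return its final loss."""
    M = sigma * torch.randn(n_features, n_features)
    w_in = torch.nn.Parameter(torch.randn(n_neurons, n_features) / n_features**0.5)
    w_out = torch.nn.Parameter(torch.randn(n_features, n_neurons) / n_neurons**0.5)
    opt = torch.optim.Adam([w_in, w_out], lr=3e-3)
    for _ in range(steps):
        x = torch.rand(batch, n_features) * 2 - 1
        x = x * (torch.rand(batch, n_features) < p).float()
        y = torch.relu(x) + x @ M.T
        pred = torch.relu(x @ w_in.T) @ w_out.T
        loss = ((pred - y) ** 2).mean()
        opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()  # last-batch loss, a rough proxy for the converged loss

for sigma in [0.0, 0.01, 0.02, 0.04, 0.08]:
    print(f"sigma={sigma:.2f}  loss={train_one(sigma):.4f}")
```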

Figure 5, left: Loss per feature as a function of input sparsity, for different choices of $M$. We compare an embedding-like $M$ (Braun et al. 2025, blue) to a fully random $M$ (green) and a symmetric $M$ (red), the latter being a random lower-triangular matrix mirrored; in both cases we set the magnitude of $M$ that leads to the lowest loss. For comparison we also show a model trained on $M = 0$ (yellow). We find that all non-zero $M$ lead to a qualitatively similar profile, and that a symmetrized random $M$ gives almost the same result as Braun et al. (2025).
Right: Optimal loss as a function of mixing-matrix magnitude $\sigma$ (separately trained for every $\sigma$). For small $\sigma$ the loss decreases linearly with the mixing-matrix magnitude, suggesting the loss advantage over the naive solution stems from the mixing matrix $M$. At large values of $\sigma$, the loss increases again.

Secondly, to rule out training dynamics, we trained a CC model on a noisy dataset and then “transplanted” and fine-tuned it on the clean dataset (Figure 6). Here we find that the loss immediately rises to or above the naive loss when we switch to the clean dataset.

Figure 6: Training a model on the noisy dataset, and then fine-tuning on the clean ($M = 0$) case. We see that the loss jumps back up as soon as we switch to the clean task. This is evidence against the hypothesis that the CC solution failed to emerge in the clean case merely due to training dynamics.

We conclude that compressed computation is entirely dependent on feature mixing.

Mechanism of the Compressed Computation model

Based on the previous results it seems clear that the CC model exploits properties of the mixing matrix $M$. We have some mechanistic evidence that the MLP weights are optimized for the mixing matrix $M$, but we have not fully reverse-engineered the CC model. Additionally, we design model weights based on the semi non-negative matrix factorization (SNMF) of $M$; this construction also beats the naive loss but does not perform as well as the trained solution.

We find that the directions read ($W_{\text{in}}$) and written ($W_{\text{out}}$) by each MLP neuron lie mostly in the 50-dimensional subspace spanned by the eigenvectors of $M$ with positive eigenvalues (Figure 7a, top panels). We find this effect to be slightly stronger when taking the singular value decomposition (SVD) of $M$,[7] as shown in the bottom panels of Figure 7a. Additionally, we measure the cosine similarity between the eigen- and singular vectors and their projections through the ReLU-free MLP ($W_{\text{out}} W_{\text{in}}$), and find that only the top 50 singular vectors are represented fully (Figure 7b).[8]
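
The subspace check can be sketched as follows (our own illustration; `model` and `M` are assumed to come from the training sketch in the Methods section, and `M` is assumed symmetric so that an eigendecomposition applies):

```python
import torch

w_in = model.w_in.weight.detach()    # (50, 100): row j is the direction neuron j reads
w_out = model.w_out.weight.detach()  # (100, 50): column j is the direction neuron j writes

def fraction_in_subspace(directions: torch.Tensor, basis: torch.Tensor) -> torch.Tensor:
    """Fraction of each (unit-normalised) direction's squared norm in span(basis columns)."""
    directions = directions / directions.norm(dim=1, keepdim=True)
    coords = directions @ basis        # coordinates in the orthonormal basis
    return (coords ** 2).sum(dim=1)    # values in [0, 1]

# Subspace spanned by eigenvectors of M with positive eigenvalues.
eigvals, eigvecs = torch.linalg.eigh(M)
pos_basis = eigvecs[:, eigvals > 0]

# Subspace spanned by the top-50 right singular vectors of M.
_, _, Vh = torch.linalg.svd(M)
top_sv_basis = Vh[:50].T

for name, basis in [("positive eigenvectors", pos_basis),
                    ("top-50 singular vectors", top_sv_basis)]:
    print(name,
          fraction_in_subspace(w_in, basis).mean().item(),     # read directions
          fraction_in_subspace(w_out.T, basis).mean().item())  # write directions
```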

Figure 7, left: Cosine similarity between various eigen- and singular vectors (x-axis) and MLP neuron directions (y-axis). We show eigenvectors in the top panels, singular vectors in the bottom panels, the $W_{\text{in}}$ matrix in the left panels, and the $W_{\text{out}}$ matrix in the right panels. In all cases we see that the top-50 vectors (sorted by eigen- / singular value) have significant dot product with the neurons, while the remaining 50 vectors have near-zero dot products (black).
Right: We test how well the ReLU-free MLP (i.e. just the $W_{\text{out}} W_{\text{in}}$ projection) preserves various eigen- (orange) and singular (blue) directions. Confirming the previous result, we find the cosine similarity between the vectors before and after the projection to be high for only the top 50 vectors.

We hypothesize that the MLP attempts to represent the mixing matrix $M$, as well as an identity component (to account for the ReLU part of the labels). We provide further suggestive evidence of this in the correlation between the entries of the $W_{\text{out}} W_{\text{in}}$ matrix and the $M$ matrix (Figure 8a): we find that the entries of both matrices are strongly correlated, and the diagonal entries of $W_{\text{out}} W_{\text{in}}$ are offset by a constant. We don’t fully understand this relationship quantitatively, though (we saw that the numbers depend on the noise scale, but it’s not clear how exactly).
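
The comparison behind Figure 8a can be sketched like this (our illustration, again assuming `model` and `M` from the Methods sketch); it checks whether $W_{\text{out}} W_{\text{in}} \approx M + c\,I$ by correlating the off-diagonal entries and inspecting the diagonal offset:

```python
import torch

P = model.w_out.weight.detach() @ model.w_in.weight.detach()  # (100, 100), the ReLU-free MLP map
diag = torch.eye(P.shape[0], dtype=torch.bool)

def pearson(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    a, b = a - a.mean(), b - b.mean()
    return (a * b).sum() / (a.norm() * b.norm())

print("off-diagonal correlation:", pearson(P[~diag], M[~diag]).item())
print("mean diagonal offset    :", (P[diag] - M[diag]).mean().item())  # ≈ constant if P ≈ M + c·I
```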

Figure 8, left: Scatter plot of the entries of the product $W_{\text{out}} W_{\text{in}}$ and the mixing matrix $M$; the entries are clearly correlated. The diagonal entries of $W_{\text{out}} W_{\text{in}}$ are offset by a constant, and seem to be correlated at a higher slope.
Right: Visualization of the MLP weight matrices $W_{\text{in}}$ and $W_{\text{out}}$. We highlight that $W_{\text{in}}$ has mostly positive entries (this makes sense as it feeds into the ReLU), and both matrices have a small number of large entries.

The trained solution, however, is clearly not just an SVD with $W_{\text{in}}$ and $W_{\text{out}}$ determined by $U$ and $V$. Specifically, as shown in Figure 8b, we note that $W_{\text{in}}$ is almost non-negative (which makes sense, as it is followed by the ReLU).

Inspired by this, we attempt to design a solution using the SNMF of $M$, setting $W_{\text{in}}$ to the non-negative factor. Somewhat surprisingly, this actually beats the naive loss! Figure 9 shows the SNMF solution for different noise scales $\sigma$. The SNMF solution beats the naive loss for a range of $\sigma$ values, though this range is smaller than the range in which the trained MLP beats the naive loss (Figure 5b). Furthermore, the SNMF solution does not capture all properties that we notice in $W_{\text{in}}$ and $W_{\text{out}}$: the SNMF weights are less sparse, and the correlation between them and $M$ (analogous to Figure 8) looks completely different (not shown in post, see Figure here).
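
Since semi-NMF is less common than plain NMF, here is one way to construct such a solution (our own sketch using the multiplicative semi-NMF updates of Ding et al. (2010); the rank of 50 and assigning the non-negative factor to $W_{\text{in}}$ follow the description above, while other details, including the variable name `M_np`, are assumptions):

```python
import numpy as np

def semi_nmf(X: np.ndarray, rank: int, n_iter: int = 500, seed: int = 0):
    """Semi-NMF X ≈ F @ G.T with G >= 0, via the multiplicative updates of Ding et al. (2010)."""
    rng = np.random.default_rng(seed)
    G = rng.random((X.shape[1], rank))       # non-negative initialisation
    pos = lambda A: (np.abs(A) + A) / 2      # element-wise positive part
    neg = lambda A: (np.abs(A) - A) / 2      # element-wise negative part
    eps = 1e-9
    for _ in range(n_iter):
        F = X @ G @ np.linalg.pinv(G.T @ G)  # unconstrained factor
        XtF, FtF = X.T @ F, F.T @ F
        G *= np.sqrt((pos(XtF) + G @ neg(FtF) + eps) /
                     (neg(XtF) + G @ pos(FtF) + eps))
    return F, G

# M_np is assumed to be the (100, 100) mixing matrix as a numpy array
# (use M.numpy() first if it is a torch tensor).
F, G = semi_nmf(M_np, rank=50)
W_in = G.T   # (50, 100), non-negative factor, feeds the ReLU
W_out = F    # (100, 50), unconstrained factor
print("relative reconstruction error:",
      np.linalg.norm(M_np - F @ G.T) / np.linalg.norm(M_np))
```

The resulting $W_{\text{in}}$ and $W_{\text{out}}$ can then be plugged into the bias-free MLP from the Methods sketch and evaluated on the noisy task in place of the trained weights.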

Figure 9: Loss of the SNMF solution, compared to the naive solution. As in Figure 5b, the solution does better than the naive loss for a range of $\sigma$ values, though the range is smaller and the loss is higher than for the trained model.

We conclude that we haven’t fully understood how the learned solution works. However, we hope that this analysis shines some light on its mechanism. In particular, we hope that this analysis makes it clear that the initial assumption of CC being a model of CiS is likely wrong.

Mechanism of the dense solution

We now return to the solution the model finds in the high feature probability (dense) input regime. As shown in Figure 2, models trained on dense inputs outperform the naive baseline on dense inputs. Revisiting Figures 3b and 4b, we notice an intriguing pattern: roughly half of the features are well-learned, while the other half are only weakly represented.

Our hypothesis is that the model represents half the features correctly, and approximates the other half by emulating a bias term. Our architecture does not include biases, but we think the model can create an offset in the outputs by setting the corresponding output weight rows to positive values, averaging over all features. This essentially uses the other features to, on average, create an offset.
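
A hand-coded version of this naive + offset model can be written down directly (our own sketch on the clean task; the exact weight layout is an assumption consistent with the description above, and the offset strength $c$ is simply swept):

```python
import torch

N_FEATURES, N_NEURONS = 100, 50

def naive_plus_offset(c: float):
    """Hand-coded weights: neuron i reads feature i; the first 50 output rows reproduce
    ReLU(x_i) exactly, the remaining 50 rows are a constant c, creating an average offset."""
    w_in = torch.zeros(N_NEURONS, N_FEATURES)
    w_in[torch.arange(N_NEURONS), torch.arange(N_NEURONS)] = 1.0
    w_out = torch.zeros(N_FEATURES, N_NEURONS)
    w_out[torch.arange(N_NEURONS), torch.arange(N_NEURONS)] = 1.0
    w_out[N_NEURONS:, :] = c
    return w_in, w_out

def loss_per_feature(w_in, w_out, p: float, batch: int = 100_000) -> float:
    """Per-feature MSE on the clean task (labels y = ReLU(x), i.e. M = 0)."""
    x = torch.rand(batch, N_FEATURES) * 2 - 1
    x = x * (torch.rand(batch, N_FEATURES) < p).float()
    y = torch.relu(x)
    pred = torch.relu(x @ w_in.T) @ w_out.T
    return ((pred - y) ** 2).mean().item()

# Sweep the offset strength c at a dense feature probability, e.g. p = 0.5.
for c in [0.0, 0.005, 0.01, 0.02, 0.04]:
    print(c, loss_per_feature(*naive_plus_offset(c), p=0.5))
```

With $c = 0$ this reduces to the naive solution; at high feature probabilities a small positive $c$ lowers the loss, which is the effect shown in Figure 10.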

We test this “naive + offset” solution on the clean dataset ($M = 0$), as the behaviour seems to be the same regardless of noise (though we did not explore this further), and find that the hardcoded naive + offset solution does in fact match the trained models’ losses. Figure 10a shows the optimal weight value (the same value for every $W_{\text{out}}$ entry of non-represented features), and Figure 10b shows the corresponding loss as a function of feature probability. We find that the hardcoded models (dashed lines) closely match or exceed the trained models (solid lines).

We thus conclude that the high-density behaviour can be explained by a simple bias term, and is not particularly interesting. We have not explored this case with a non-zero mixing matrix, but we expect that one could generalize the naive + offset solution to noisy cases.

Figure 10, left: A non-zero offset in the $W_{\text{out}}$ entries of unrepresented features improves the loss in the dense regime. We determine the optimal value empirically for each input feature probability $p$.
Right: This hand-coded naive + offset model (dashed lines) consistently matches or outperforms the model trained on clean labels (solid lines) in the dense regime. (Note that this plot only shows the clean dataset ($M = 0$), which is why no solution outperforms the naive loss in the sparse regime.)

At high training feature probabilities, tuning the offset value in the naive + offset model significantly improves performance beyond the naive baseline. However, when the feature probability is 0.1 or lower, no choice of offset leads to better-than-naive performance. This further supports the idea that the model adopts one of two distinct strategies depending on the feature probability encountered during training.

Discussion

Our work sheds light on the mechanism behind the Braun et al. (2025) model of compressed computation. We conclusively ruled out the hypothesis that the embedding noise was only required for training, and showed that the resulting mixing matrix is instead the central component that allows models to beat the naive loss.

That said, we have not fully reverse-engineered the compressed computation model. We would be excited to hear about future work reverse-engineering this model (please leave a comment or DM / email StefanHex!). We think an analysis of how the solution relates to the eigenvectors of $M$, and of how it changes for different choices of $M$, would be very promising.[9]

Acknowledgements: We want to thank @Lucius Bushnaq and @Lee Sharkey for giving feedback at various stages of this project. We also thank @Adam Newgas, @jakemendel, @Dmitrii Krasheninnikov, @Dmitry Vaintrob,  and @Linda Linsefors for helpful discussions.

  1. ^

    This loss corresponds to implementing half the features and ignoring the other half. The number is derived from integrating the squared error of an ignored feature over the input distribution, $\int_{-1}^{1} \tfrac{1}{2}\,\mathrm{ReLU}(x)^2\,\mathrm{d}x = \tfrac{1}{6}$, and averaging over the implemented (zero-loss) and ignored halves, which gives $\tfrac{1}{12} \approx 0.0833$.

  2. ^

    We’re using the terms feature and function interchangeably; both refer to the 100 inputs and their representation in the model.

  3. ^

    Our $W_{\text{in}}$/$W_{\text{out}}$ matrices are just 100x50 dimensional, while Braun et al. (2025)’s are 1000x50 dimensional due to the embedding, but this does not make a difference.

  4. ^

    Braun et al. (2025) say “We suspect our model’s solutions to this task might not depend on the sparsity of inputs as much as would be expected, potentially making ‘compressed computation’ and ‘computation in superposition’ subtly distinct phenomena”. We now know that this was due to evaluating a model’s performance only at its training feature probability, rather than across a set of evaluation feature probabilities, i.e. they measured the black dotted line in Figure 2.

  5. ^

    For Figure 5 we chose the optimal noise scale for each model, which happens to be close to the standard deviation of the mixing matrix corresponding to the embeddings chosen in Braun et al. (2025). This was coincidental, as Braun et al. (2025) chose their embedding for unrelated reasons.

  6. ^

    Two further differences between Braun et al. (2025)'s embed-like case and the symmetrized case are (a) the diagonal of $M$ is zero in the embed-like case, and (b) the entries of $M$ are not completely independent in the embed-like case, as they are all derived from different combinations of the embedding matrix entries. We find that setting the diagonal of the random $M$ to zero almost matches the low feature probability loss of the embed-like case, though it slightly increases the high-$p$ loss.

  7. ^

    The eigenvalues of $M$ and the singular values of $M$ are approximately related.

  8. ^

    This is not due to $M$ being low-rank; the singular value spectrum is smooth (as expected from a random matrix).

  9. ^

    During our experiments we found that the loss depends on the rank of $M$, though, depending on the setting, reducing the rank could either increase or decrease the loss.


