Published on June 20, 2025 9:58 PM GMT
The setup is that we have some input distribution over inputs $x$ and a linear map $W$, and we perform APD with respect to the output $y = Wx$.
We take $x$ to be a normalised Gaussian (i.e. uniform on the unit sphere), for simplicity. In addition, we take $W$ to be the identity matrix. We also take .
APD initializes $C$ parameter components $P_c$, each formed as a sum of pairwise outer products of a set of vectors $u_{c,i}$ and $v_{c,i}$, i.e. $P_c = \sum_i u_{c,i} v_{c,i}^\top$. This outer-product parametrisation is used so that we can compute the simplicity loss efficiently later.
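As a concrete sketch of this parametrisation (the dimension, component count and factor count below are illustrative choices of mine, not the values used in the linked notebook):

```python
import torch

d = 10   # input/output dimension (W is d x d); illustrative value
C = 20   # number of parameter components; illustrative value
m = d    # number of rank-one factors per component; illustrative value

# Each component P_c is parametrised as a sum of outer products of the
# columns of U[c] and V[c]:  P_c = sum_i U[c, :, i] V[c, :, i]^T.
U = torch.randn(C, d, m, requires_grad=True)
V = torch.randn(C, d, m, requires_grad=True)

def components(U, V):
    # P[c] = sum over the m outer products, i.e. U[c] @ V[c].T
    return torch.einsum("cim,cjm->cij", U, V)

P = components(U, V)   # shape (C, d, d)
```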
The first step of APD is to calculate gradient attribution scores for each of our components with respect to an input $x$.
We have
$$A_c(x) = \sqrt{\sum_o \Big(\sum_{i,j} \frac{\partial y_o}{\partial W_{ij}}\,(P_c)_{ij}\Big)^2} = \|P_c\, x\|_2,$$
since for the linear map $y = Wx$ we have $\partial y_o / \partial W_{ij} = \delta_{oi}\, x_j$.
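Continuing the sketch, the attribution for a single input can then be computed directly from the components (the notebook may batch this differently):

```python
def attributions(P, x):
    # Gradient attribution of each component on input x: for y = Wx we have
    # dy_o/dW_ij = delta_{oi} x_j, so the score reduces to ||P_c @ x||_2.
    return torch.linalg.vector_norm(torch.einsum("cij,j->ci", P, x), dim=-1)

x = torch.randn(d)
x = x / x.norm()           # input drawn uniformly from the unit sphere
A = attributions(P, x)     # shape (C,)
```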
We select the top-$k$ components with the highest attribution scores, and then perform a second forward pass using only this sparse subset of components, training for reconstruction loss and training the active components to be low rank.
Let $\kappa(x) = \sum_{c \in \text{top-}k} P_c$ be the sum of the top-$k$ components, and $W' = \sum_c P_c$ be the sum of all the components. Then the reconstruction loss is $\|Wx - \kappa(x)\,x\|^2$ and the faithfulness loss is $\|W - W'\|_F^2$.
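In code, continuing the sketch above (the value of $k$ and the helper names are mine, not taken from the notebook):

```python
k = 4                  # number of active components; illustrative value
W = torch.eye(d)       # the target linear map (the identity here)

def active_components(P, x, k):
    # Indices of the k components with the largest attribution on x.
    return torch.topk(attributions(P, x), k).indices

def minimality_loss(P, W, x, k):
    kappa = P[active_components(P, x, k)].sum(dim=0)   # sum of the top-k components
    return ((W @ x - kappa @ x) ** 2).sum()            # ||W x - kappa(x) x||^2

def faithfulness_loss(P, W):
    return ((W - P.sum(dim=0)) ** 2).sum()             # ||W - sum_c P_c||_F^2
```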
The simplicity loss drives the active components towards low rank by penalizing the $L_p$-norm of their spectra (the Schatten $p$-norm) for $p < 1$, effectively making the spectra sparse (because we have a lower bound on the Frobenius norm of useful active components, so the model can't just drive the spectrum to $0$).
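A direct way to compute this penalty is from the singular values of the active components (shown here via an SVD for clarity; the $U$, $V$ parametrisation allows a cheaper computation, as mentioned above):

```python
p = 0.9   # Schatten exponent; illustrative value

def simplicity_loss(P, active_idx, p):
    # Penalise sum_i sigma_i(P_c)^p over the active components, which for
    # p < 1 pushes each spectrum towards sparsity (i.e. low rank).
    sigma = torch.linalg.svdvals(P[active_idx])   # singular values, shape (k, d)
    return (sigma ** p).sum()
```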
In practice, the faithfulness loss goes very close to $0$ quite quickly, and so we can restrict attention to the hyperparameters weighting the simplicity and minimality losses. I looked at $\beta\,\mathcal{L}_{\text{minimality}} + \mathcal{L}_{\text{simplicity}}$ as the loss function for varying values of $\beta$.
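Putting the pieces of the sketch together, one plausible form of the combined objective is the following (the exact weighting used in the notebook may differ):

```python
beta = 0.1   # weight on the minimality loss; swept over a range of values

def total_loss(P, W, x, k, p, beta):
    active_idx = active_components(P, x, k)
    return (faithfulness_loss(P, W)
            + beta * minimality_loss(P, W, x, k)
            + simplicity_loss(P, active_idx, p))
```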
The fact that there is a tradeoff between minimality and simplicity is a given. But it's interesting to look at what the extremes correspond to:
Small $\beta$:
A minimality loss (reconstruction loss) of $1$ corresponds to the same loss as the zero map (since $\|x\| = 1$), and for small values of $\beta$, the model learns components of the form $P_c \approx \frac{1}{C}W$, effectively spreading $W$ out across all components. But this means that the top-$k$ subset only gives a sparse partial reconstruction of $W$, leading to a high minimality loss when $k \ll C$.
Our simplicity loss is low even though the components we learn are not low rank. The assertion we made earlier, that penalizing the $L_p$-norm of the spectrum will lead to a sparse spectrum, assumed a large lower bound on the Frobenius norm of the active components, stopping us from driving the spectrum to $0$. But we only have this when our sparse reconstruction is reasonably accurate, i.e. when our minimality loss is reasonably low.
This is disappointing because it means that we get dull behaviour. As soon as the model gives up on the minimality loss, it no longer needs to worry about the simplicity loss (it can simply scale the spectra down), and it will just learn high-rank components:
Large $\beta$:
This time we get good sparse reconstruction, so low minimality loss. Our simplicity loss is high because the active components we learn are all high rank. In fact, in this case the model seems to consistently use the same active components, meaning we can just straightforwardly combine these components. So it seems like in this case APD was a success!
Modified simplicity loss:
The small-$\beta$ regime is boring because APD just learns to drive the spectrum to $0$, meaning that it has no incentive to learn low-rank matrices. To stop this, I chose to instead normalize the Schatten $p$-norm by the Frobenius norm (the $L_2$-norm of the spectrum), and use this as the simplicity loss.
In particular, the usual simplicity loss is given by $\sum_{c \in \text{top-}k} \|P_c\|_p^p$, where the $P_c$ are the active components. Instead we can use $\sum_{c \in \text{top-}k} \frac{\|P_c\|_p^p}{\|P_c\|_F^p}$, which we can compute efficiently using the same trick as for the Schatten $p$-norm.
Note that we have $\frac{\|P_c\|_p^p}{\|P_c\|_F^p} \ge 1$, with equality in the rank-1 case.
This modified simplicity loss is invariant under scaling any individual component. It is pretty unprincipled / hacky, but it does lead to more interesting behaviour for small $\beta$.
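A sketch of this normalized penalty, again computed via an SVD for clarity rather than the efficient factorised form:

```python
def modified_simplicity_loss(P, active_idx, p):
    # Normalise each active component's Schatten p-norm by its Frobenius norm.
    # The penalty is then invariant to rescaling a component and is minimised
    # (value 1 per component) exactly when the component is rank one.
    sigma = torch.linalg.svdvals(P[active_idx])             # shape (k, d)
    schatten_p = (sigma ** p).sum(dim=-1)                   # ||P_c||_p^p
    frobenius_p = ((sigma ** 2).sum(dim=-1)) ** (p / 2)     # ||P_c||_F^p
    return (schatten_p / frobenius_p).sum()
```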
Numerical instability:
Note that for $p < 1$, the derivative of $\sigma^p$ is $p\,\sigma^{p-1}$, where $p - 1 < 0$. Therefore gradients are badly behaved near $0$. We can fix this just by adding a small $\epsilon$ appropriately when computing the Schatten $p$-norm.
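For example (one possible placement of the $\epsilon$; the notebook may stabilise this differently):

```python
def stable_schatten(P_active, p, eps=1e-6):
    # d/dsigma sigma^p = p * sigma^(p-1) blows up as sigma -> 0 when p < 1,
    # so add a small eps inside the power to keep the gradient bounded.
    sigma = torch.linalg.svdvals(P_active)
    return ((sigma + eps) ** p).sum(dim=-1)
```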
Modified small-$\beta$ regime:
All the active components are now visibly low rank, and yet they still sum to a rough approximation of the diagonal, though the minimality loss is high.
Conclusion:
Studying APD for linear maps can help us improve our intuition for how it will behave for larger models. Here we used a spherically symmetric input, but it would be interesting to look at how APD behaves for non-homogeneous inputs.
Code to reproduce results:
https://colab.research.google.com/drive/1sBPytrtZNfBMpVYeaiAgwj7Kqle7qgeg?usp=sharing