APD for linear maps

Published on June 20, 2025 9:58 PM GMT

The setup is that we have some input distribution over $x \in \mathbb{R}^n$ and a linear map $W$, and we perform APD with respect to the output $y = Wx$.

We take $x$ to be a normalised Gaussian (i.e. uniform on the sphere), for simplicity. In addition, we take $W$ to be the identity matrix. We also take .

APD initializes $C$ components $P_c$, each formed as the sum of pairwise outer products of a set of $r$ vectors $U_{c,i}$ and $V_{c,i}$, i.e. $P_c = \sum_{i=1}^{r} U_{c,i} V_{c,i}^\top$. This outer-product parameterisation is used so that we can compute the simplicity loss efficiently later.
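To make this concrete, here is a minimal PyTorch sketch of the parameterisation (the sizes $n$, $C$, $r$ and the initialisation scale are illustrative choices of mine, not taken from the post):

```python
import torch

n, C, r = 100, 100, 10   # input dimension, number of components, factor rank (illustrative)

# Each component c is parameterised by factor matrices U_c, V_c of shape (n, r);
# the component itself is the sum of the r outer products U_{c,i} V_{c,i}^T.
U = torch.nn.Parameter(torch.randn(C, n, r) / n**0.5)
V = torch.nn.Parameter(torch.randn(C, n, r) / n**0.5)

# P_c = U_c V_c^T, an n x n matrix of rank at most r, for each of the C components.
P = torch.einsum('cik,cjk->cij', U, V)   # shape (C, n, n)
```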

The first step of APD is to calculate gradient attribution scores for each of our $C$ components with respect to an input $x$.

We have $\frac{\partial y_o}{\partial W_{ij}} = \delta_{oi}\,x_j$, so for a linear map the gradient attribution of component $P_c$ reduces to (a constant times) $\|P_c x\|$.

We select the top-k components with the highest attribution scores, and then perform a second forward pass on this sparse subset of components, training for reconstruction loss, and training for a low-rank sum.

Let $\hat{W}$ be the sum of the top-k components, and $W'$ be the sum of all the components. Then the reconstruction loss is $\|\hat{W}x - Wx\|^2$ and the faithfulness loss is $\|W' - W\|_F^2$.
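Putting the attribution, top-k selection, and the two losses together in one illustrative PyTorch sketch (the reduction of the gradient attribution to $\|P_c x\|$ is my own working-out of the linear case, and the sizes here are illustrative):

```python
import torch

n, C, r, k = 100, 100, 10, 10                 # illustrative sizes; k = number of active components
W = torch.eye(n)                              # the target map (the identity in this setup)
U, V = torch.randn(C, n, r), torch.randn(C, n, r)
P = torch.einsum('cik,cjk->cij', U, V)        # components P_c, as in the previous sketch

x = torch.randn(n)
x = x / x.norm()                              # normalised input, uniform on the sphere

# Attribution of component c for a linear map reduces to (a constant times) ||P_c x||.
attributions = torch.einsum('cij,j->ci', P, x).norm(dim=-1)   # shape (C,)
active = torch.topk(attributions, k).indices                  # indices of the top-k components

W_hat = P[active].sum(dim=0)                  # sum of the top-k (active) components
W_all = P.sum(dim=0)                          # sum of all C components

minimality_loss = ((W_hat @ x - W @ x) ** 2).sum()   # reconstruction loss on the sparse pass
faithfulness_loss = ((W_all - W) ** 2).sum()         # squared Frobenius distance to W
```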

The simplicity loss pushes components towards low rank by penalizing the $\ell_p$-norm of the spectra of the active components for $p \in (0,1)$, effectively making the spectra sparse (we have a lower bound on the Frobenius norm of useful active components, so the optimiser can't just drive the spectrum to $0$).
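A minimal SVD-based sketch of such a penalty (summing $\sigma_i^p$ over the spectra of the active components; the post computes its version efficiently from the $U$, $V$ factors rather than via an explicit SVD, and $p = 0.9$ is just an illustrative value):

```python
import torch

def simplicity_loss(P_active, p=0.9):
    """Sum of the p-th powers of the singular values of each active component."""
    s = torch.linalg.svdvals(P_active)   # spectra, shape (num_active, n)
    return (s ** p).sum()

# Example: a penalty on the spectra of 10 random 100 x 100 "active components".
loss = simplicity_loss(torch.randn(10, 100, 100))
```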

In practice, the faithfulness loss goes very close to $0$ quite quickly, so we can restrict attention to just changing the hyperparameters of the simplicity and minimality losses. I looked at a loss of the form $\alpha\,\mathcal{L}_{\text{minimality}} + \mathcal{L}_{\text{simplicity}}$ for varying values of $\alpha$.

[Plots for k = 10, C = 100 and for k = 20, C = 100]

The fact that there is a tradeoff between minimality and simplicity is a given. But it's interesting to look at what the extremes correspond to:

Small $\alpha$:

A minimality loss (reconstruction loss) of $1$ corresponds to the same loss as the zero map, and for small values of $\alpha$ the model learns components of the form $\frac{1}{C}W$, effectively spreading $W$ out across all $C$ components. But this means that we only get a sparse reconstruction $\frac{k}{C}W$ of $W$, leading to a high minimality loss when $k \ll C$.
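For a rough sense of scale under this reading (each active component approximately $\tfrac{1}{C}W$, and $\|Wx\| = \|x\| = 1$ since $W$ is the identity and $x$ lies on the unit sphere), the $k = 10$, $C = 100$ setting from the plots above gives

$$\mathbb{E}\,\bigl\|Wx - \tfrac{k}{C}\,Wx\bigr\|^{2} \;=\; \Bigl(1 - \tfrac{k}{C}\Bigr)^{2}\,\mathbb{E}\,\|Wx\|^{2} \;=\; \Bigl(1 - \tfrac{k}{C}\Bigr)^{2} \;=\; 0.81,$$

which is indeed close to the loss of the zero map.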

Our simplicity loss is low even though the components we learn are not low rank. The assertion we made earlier, that penalizing the $\ell_p$-norm will lead to a sparse spectrum, assumed a lower bound on the Frobenius norm of the active components, stopping us from driving the spectrum to $0$. But we only have this when our sparse reconstruction is reasonably accurate, i.e. when our minimality loss is reasonably low.

This is disappointing because it means that we get dull behaviour. As soon as the model gives up on the minimality loss, it no longer needs to worry about the simplicity loss, and it will just learn high-rank components:

[Image: a typical active component for low values of $\alpha$]

[Image: $\hat{W}$ (the sum of the active components) for small values of $\alpha$]

Large $\alpha$:

[Image: a typical active component for large values of $\alpha$]

[Image: $\hat{W}$ (the sum of the active components) for large values of $\alpha$]

This time we get good sparse reconstruction, so low minimality loss. Our simplicity loss is high because the active components we learn are all high rank. In fact, in this case the model seems to consistently use the same active components,  meaning we can just straightforwardly combine these components. So it seems like in this case APD was a success!

Modified simplicity loss:

The small $\alpha$ regime is boring because APD just learns to drive the spectrum to $0$, meaning that it has no incentive to learn low-rank matrices. To stop this, I chose to instead normalize the $\ell_p$ norm by the Frobenius norm (the $\ell_2$ norm of the spectrum), and use this as the simplicity loss.

In particular, the usual simplicity loss is given by $\sum_{c} \|P_c\|_{S_p}^p$, where the sum runs over the active components $P_c$. Instead we can use $\sum_{c} \|P_c\|_{S_p} / \|P_c\|_F$, which we can compute efficiently using the same trick as for the Schatten $p$-norm.

Note that we have $\|P_c\|_{S_p} \geq \|P_c\|_F$, with equality in the rank-1 case.

This modified simplicity loss is invariant under scaling any individual component. It is pretty unprincipled / hacky, but it does lead to more interesting behaviour for small $\alpha$.
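A sketch of this normalised penalty, computed from an explicit SVD for clarity (the post computes it from the $U$, $V$ factors; $p = 0.9$ and the small $\epsilon$ in the denominator are illustrative choices of mine):

```python
import torch

def modified_simplicity_loss(P_active, p=0.9, eps=1e-12):
    """Schatten p-norm of each active component, normalised by its Frobenius norm.

    The ratio is at least 1, with equality for rank-1 components, and it is
    invariant under rescaling any individual component.
    """
    s = torch.linalg.svdvals(P_active)                # spectra, shape (num_active, n)
    schatten_p = (s ** p).sum(dim=-1) ** (1.0 / p)    # l_p norm of each spectrum
    frobenius = s.norm(dim=-1) + eps                  # l_2 norm of each spectrum (Frobenius)
    return (schatten_p / frobenius).sum()
```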

Numerical instability:

Note that for $p \in (0,1)$, the derivative of $x^p$ is $p\,x^{p-1}$, where $p - 1 \in (-1, 0)$. Therefore gradients are badly behaved near $0$. We can fix this just by adding an $\epsilon$ appropriately when computing the Schatten $p$-norm.
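One reasonable placement of the $\epsilon$ (the post does not spell out exactly where it goes, so this is a guess at one sensible choice) is to shift the squared singular values before taking the $p$-th power:

```python
import torch

def schatten_p_penalty_stable(P_active, p=0.9, eps=1e-6):
    """Schatten-p style penalty with an epsilon so gradients stay finite near zero.

    For p in (0, 1) the derivative of x**p blows up as x -> 0, so we use
    (s**2 + eps)**(p / 2) in place of s**p.
    """
    s = torch.linalg.svdvals(P_active)   # singular values of each active component
    return ((s ** 2 + eps) ** (p / 2)).sum()
```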

Modified small $\alpha$ regime:

[Image: a typical active component for low values of $\alpha$ with the modified simplicity loss]

[Image: $\hat{W}$ (the sum of the active components) for small values of $\alpha$]

All the active components are now visibly low rank, and yet they still sum to something approximating a rough diagonal, though the minimality loss is high.

Conclusion:

Studying APD for linear maps can help us improve our intuition for how it will behave for larger models. Here we used a spherically symmetric input, but it would be interesting to look at how APD behaves for non-homogeneous inputs.

Code to reproduce results:

https://colab.research.google.com/drive/1sBPytrtZNfBMpVYeaiAgwj7Kqle7qgeg?usp=sharing

 

 


