Content feed of the TransferLab — appliedAI Institute, 27 November 2024
Symmetry Teleportation for Accelerated Optimization

 

Symmetry teleportation is a new optimization technique that exploits symmetries of the loss function to accelerate the convergence of gradient-based optimization algorithms. The article covers its principle, effect, implications, and experimental results.

📌 Symmetry teleportation exploits symmetries of the loss function to accelerate the convergence of algorithms such as SGD and AdaGrad

📌 The positive scale invariance of fully connected ReLU networks serves as an example of such a symmetry

📌 Teleportation can lead to early overfitting, but it can also improve generalization performance

📌 The authors demonstrate the improved convergence properties of teleportation through experiments

A novel approach, symmetry teleportation, enhances convergence speed in gradient-based optimization by exploiting symmetries in the loss landscape to let parameters traverse large distances on a loss level set.

Symmetry teleportation is an optimization technique that leverages symmetries in the loss landscape to accelerate convergence in gradient-based optimization algorithms. By allowing parameters to traverse large distances on the loss level set, symmetry teleportation enhances the convergence speed of optimization algorithms such as stochastic gradient descent (SGD) and AdaGrad. More precisely, if $G$ is a symmetry group of the loss function, which means

\begin{equation}
\mathcal{L}(g \cdot (w, X)) = \mathcal{L}(w, X), \quad \forall g \in G,
\end{equation}

the teleportation operation allows the optimization algorithm to move along the orbit of the current parameter configuration $w_t$ under the action of $G$. Usually, teleportation is used to maximize the gradient norm within the orbit, i.e.

\begin{align}
g_t &\leftarrow \operatorname{argmax}_{g \in G} \left\| \nabla \mathcal{L} \big|_{g \cdot w_t} \right\|^2, \\
w_t &\leftarrow g_t \cdot w_t,
\end{align}

leading to faster convergence.

A simple example of a symmetry group is the positive scale invariance of fully connected ReLU networks: one can scale the incoming weights of a hidden layer by a positive factor and the outgoing weights by the inverse of that factor without changing the network output.

Figure 1: Illustration of a teleportation operation along the orbit of a symmetry group action. While the loss remains unchanged, the gradient norm of the loss function varies along the orbit. See Figure 1 of [Zha24I].

Convergence Analysis

The key to understanding the improved convergence is the observation that, for a quadratic loss function, a point $w$ of maximal gradient norm within a level set has a gradient that is an eigenvector of the Hessian. Hence, if the symmetry group $G$ acts transitively on the level set, teleportation combined with a first-order optimization step is equivalent to a Newton step. Moreover, for a quadratic function, the group $O(n)$, where $n$ is the number of network parameters, is a symmetry group of the loss that acts transitively on the level sets. Since the loss function is well approximated by a quadratic close to a minimum, teleportation can be seen as a way to approximate a Newton step in the presence of symmetries. For more details, see Section 5 of [Zha23S].
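To make this argument concrete, the following minimal numerical check (not taken from the paper; the toy quadratic loss $\mathcal{L}(w) = \tfrac{1}{2} w^\top A w$ with a random positive definite $A$ is an assumption) verifies that the point of maximal gradient norm on a level set lies along the top eigenvector of the Hessian, and that a single gradient step from there coincides with a Newton step.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy quadratic loss L(w) = 1/2 w^T A w with a random SPD Hessian A (assumption).
n = 5
M = rng.normal(size=(n, n))
A = M @ M.T + n * np.eye(n)

def loss(w):
    return 0.5 * w @ A @ w

def grad(w):
    return A @ w

# Start at an arbitrary point and record its level-set value c = L(w0).
w0 = rng.normal(size=n)
c = loss(w0)

# "Teleport": among all points with loss c, the squared gradient norm
# w^T A^2 w is maximized along the top eigenvector of A (Lagrange condition
# A^2 w = mu A w, i.e. w is an eigenvector of A).
eigvals, eigvecs = np.linalg.eigh(A)
lam_max, v_max = eigvals[-1], eigvecs[:, -1]
w_tel = np.sqrt(2 * c / lam_max) * v_max

print(np.isclose(loss(w0), loss(w_tel)))                        # same loss value
print(np.linalg.norm(grad(w0)) <= np.linalg.norm(grad(w_tel)))  # gradient norm did not shrink

# At the teleported point the gradient is an eigenvector of the Hessian ...
g = grad(w_tel)
print(np.allclose(A @ g, lam_max * g))                          # True

# ... so a gradient step with step size 1/lambda_max equals a Newton step:
# both jump straight to the minimizer w = 0.
w_gd = w_tel - (1.0 / lam_max) * g
w_newton = w_tel - np.linalg.solve(A, g)
print(np.allclose(w_gd, w_newton), np.allclose(w_gd, 0.0))      # True True
```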
Influence on Generalization

A potentially harmful effect of teleportation is that it can lead to early overfitting in non-convex settings. By teleporting the parameters to a point with a high gradient norm, the optimization algorithm might move the optimization trajectory towards a sharper minimum. Since flat minima tend to generalize better, this can result in a decrease in generalization performance. On the other hand, using teleportation to minimize the gradient norm can lead to a flatter minimum and improve generalization, which is further studied in the follow-up paper [Zha24I].

Experiment

The authors conduct a series of experiments to showcase the improved convergence properties of teleportation. For this, they use a multilayer feed-forward network with three layers and a regression loss. The corresponding symmetries used for teleportation are described in Prop. 4.3 of [Zha23S]. A minimal sketch of a training loop with interleaved teleportation steps is shown below, after the discussion.

The results comparing the training loss and gradient norms of SGD and AdaGrad with and without teleportation are shown in Figure 2. One observes a faster decay of the training loss for both algorithms when teleportation is added. In addition, for the same loss value, the teleportation variants generate gradients with larger norms, explaining the improved convergence.

Figure 2: Gradient descent optimization of a multilayer perceptron, with and without teleportation, showing differences in loss evolution over epochs and time, as well as variations in gradient magnitude as a function of the loss value. The results highlight the improved convergence achieved through teleportation. See Figure 3 of [Zha23S].

In a second step, the authors investigate the behavior of teleportation on the MNIST classification problem, using a three-layer feed-forward neural network with LeakyReLU activation functions and a cross-entropy loss. Contrary to the first experiment, the authors focus on the evolution of the validation loss, see Figure 3. While they observe improved convergence of the training loss, this comes at the cost of very early overfitting and a slightly larger validation loss.

Figure 3: MNIST classification task; optimization with and without teleportation. Solid lines represent training loss and dashed lines validation loss. See Figure 4 of [Zha23S].

Discussion

Although the improvements in convergence might seem impressive, the slightly worse generalization behavior potentially outweighs the benefit of the speed-up. With this in mind, the authors invest further analysis into improving generalization with teleportation in the follow-up paper [Zha24I].
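As a complement to the description of the experiments, here is a minimal sketch of gradient descent with interleaved teleportation steps for a two-layer ReLU network, using the positive scale invariance discussed above. It is not the authors' implementation: the network sizes, learning rates, teleportation schedule, and the per-neuron parametrization $\lambda = \exp(\theta)$ of the scaling symmetry are assumptions made for illustration.

```python
import torch

torch.manual_seed(0)

# Toy regression data and a two-layer ReLU network y = relu(x W1^T) W2^T (assumption).
X, Y = torch.randn(128, 10), torch.randn(128, 1)
W1 = (0.1 * torch.randn(32, 10)).requires_grad_()
W2 = (0.1 * torch.randn(1, 32)).requires_grad_()

def loss_fn(W1, W2):
    return ((torch.relu(X @ W1.T) @ W2.T - Y) ** 2).mean()

def teleport(W1, W2, steps=10, lr=0.1):
    """Ascend the gradient norm along the orbit of the positive scale symmetry
    (W1, W2) -> (diag(s) W1, W2 diag(1/s)) with s = exp(theta) > 0, which leaves
    the output of the ReLU layer, and hence the loss, unchanged."""
    theta = torch.zeros(W1.shape[0], requires_grad=True)   # one log-scale per hidden unit
    for _ in range(steps):
        s = torch.exp(theta)
        V1, V2 = s[:, None] * W1, W2 / s[None, :]          # teleported weights g . w
        g1, g2 = torch.autograd.grad(loss_fn(V1, V2), (V1, V2), create_graph=True)
        grad_norm = (g1 ** 2).sum() + (g2 ** 2).sum()      # || grad L |_{g . w} ||^2
        (dtheta,) = torch.autograd.grad(grad_norm, theta)
        with torch.no_grad():
            theta += lr * dtheta                           # gradient ascent along the orbit
    with torch.no_grad():                                  # apply the group element found
        s = torch.exp(theta)
        W1.mul_(s[:, None])
        W2.div_(s[None, :])

opt = torch.optim.SGD([W1, W2], lr=0.05)
for epoch in range(100):
    if epoch % 20 == 0:
        teleport(W1, W2)                                   # occasional teleportation step
    opt.zero_grad()
    loss = loss_fn(W1, W2)
    loss.backward()
    opt.step()
    if epoch % 20 == 0:
        print(epoch, loss.item())
```

Parametrizing the group element by per-neuron log-scales keeps the inner maximization unconstrained while guaranteeing positive scaling factors; the teleportation step itself leaves the loss value unchanged and only changes where on the level set the next gradient step starts.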
