A novel approach, symmetry teleportation, enhances convergence speed in gradient-based optimization by exploiting symmetries in the loss landscape to let parameters traverse large distances on a loss level set.

Symmetry teleportation is an optimization technique that leverages symmetries in the loss landscape to accelerate convergence of gradient-based optimization algorithms. By allowing parameters to traverse large distances on the loss level set, symmetry teleportation enhances the convergence speed of algorithms such as stochastic gradient descent (SGD) and AdaGrad. More precisely, if $G$ is a symmetry group of the loss function, which means

\begin{equation}
\mathcal{L}(g \cdot (w, X)) = \mathcal{L}(w, X), \quad \forall g \in G,
\end{equation}

the teleportation operation allows the optimization algorithm to move along the orbit of the current parameter configuration $w_t$ under the action of $G$. Usually, teleportation is used to maximize the gradient norm within the orbit, i.e.

\begin{align}
g_t &\leftarrow \operatorname{argmax}_{g \in G} \left\| \nabla \mathcal{L} \big|_{g \cdot w_t} \right\|^2, \\
w_t &\leftarrow g_t \cdot w_t,
\end{align}

leading to faster convergence.

A simple example of a symmetry group is the positive scale invariance of fully connected ReLU networks: the incoming weights of a hidden layer can be scaled by a positive factor and the outgoing weights by the inverse of that factor without changing the network output (see the sketch below).

Figure 1: Illustration of a teleport operation along the orbit of a symmetry group action. While the loss remains unchanged, the gradient norm of the loss function varies along the orbit. See Figure 1 of [Zha24I].
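To make the orbit search concrete, here is a minimal sketch of a teleportation step for a two-layer ReLU regression network, using a one-parameter subgroup of the scaling symmetry described above. It is not the authors' implementation: the network sizes, data, and the grid search over the scale factor are illustrative choices (the paper instead optimizes over group elements, e.g. by gradient ascent), but it shows the defining property that the loss stays constant while the gradient norm changes along the orbit.

```python
# Sketch (not the authors' code): teleportation along the positive scaling
# symmetry of a two-layer ReLU network, L(W1, W2) = mean((W2 relu(W1 X) - Y)^2).
# Scaling the hidden layer, (W1, W2) -> (g * W1, W2 / g) with g > 0, leaves the
# loss unchanged but changes the gradient norm, which we maximize over g.
import torch

torch.manual_seed(0)
d_in, d_hidden, d_out, n = 5, 8, 3, 32
X, Y = torch.randn(d_in, n), torch.randn(d_out, n)
W1, W2 = torch.randn(d_hidden, d_in), torch.randn(d_out, d_hidden)

def loss(W1, W2):
    return ((W2 @ torch.relu(W1 @ X) - Y) ** 2).mean()

def grad_norm_sq(W1, W2):
    W1, W2 = W1.clone().requires_grad_(True), W2.clone().requires_grad_(True)
    g1, g2 = torch.autograd.grad(loss(W1, W2), (W1, W2))
    return (g1.norm() ** 2 + g2.norm() ** 2).item()

# Teleportation step: search a one-parameter subgroup of scalings g > 0 for the
# element that maximizes the gradient norm while keeping the loss constant.
candidates = torch.logspace(-1, 1, steps=201)  # g in [0.1, 10]
best_g = max(candidates, key=lambda g: grad_norm_sq(g * W1, W2 / g))
print("loss before/after teleport:",
      loss(W1, W2).item(), loss(best_g * W1, W2 / best_g).item())
print("squared grad norm before/after:",
      grad_norm_sq(W1, W2), grad_norm_sq(best_g * W1, W2 / best_g))
W1, W2 = best_g * W1, W2 / best_g  # continue SGD/AdaGrad from the teleported point
```

A more faithful version would give each hidden unit its own positive scale factor and optimize all of them jointly rather than searching a single scalar.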
Convergence Analysis

The key to understanding the improved convergence is the observation that, for a quadratic loss function, if $w$ has maximal gradient norm within a level set of the loss, then the gradient at $w$ is an eigenvector of the Hessian, so the gradient direction coincides with the Newton direction. Hence, if the symmetry group $G$ acts transitively on the level set, teleportation combined with a first-order optimization step is equivalent to a Newton step. Moreover, for a quadratic function, the group $O(n)$, where $n$ is the number of network parameters, is a symmetry group of the loss function that acts transitively on the level sets. Since the loss function is well approximated by a quadratic close to a minimum, teleportation can be seen as a way to approximate a Newton step in the presence of symmetries. For more details, see Section 5 of [Zha23S]; a small numerical illustration of this argument is given at the end of this article.

Influence on Generalization

A potentially harmful effect of teleportation is that it can lead to early overfitting in non-convex settings. By teleporting the parameters to a point with a high gradient norm, the optimization algorithm might move the trajectory towards a sharper minimum. Since flat minima tend to generalize better, this can result in a decrease in generalization performance. On the other hand, using teleportation to minimize the gradient norm can lead to a flatter minimum and improve generalization, which is studied further in the follow-up paper [Zha24I].

Experiment

The authors conduct a series of experiments to showcase the improved convergence properties of teleportation. For this, they use a multilayer feed-forward network with three layers and a regression loss. The corresponding symmetries used for teleportation are described in Prop. 4.3 of [Zha23S]. The results comparing the training loss and gradient norms of SGD and AdaGrad with and without teleportation are shown in Figure 2. One can observe a faster decay of the training loss for both algorithms when teleportation is added. In addition, for the same loss value, the teleportation variants produce gradients with larger norms, explaining the improved convergence.

Figure 2: Gradient descent optimization of a multilayer perceptron, with and without teleportation, showing differences in loss evolution over epochs and time, as well as variations in gradient magnitude as a function of the loss value. The results highlight the improved convergence achieved through teleportation. See Figure 3 of [Zha23S].

In a second step, the authors investigate the behavior of teleportation on the MNIST classification problem, using a three-layer feed-forward neural network with LeakyReLU activation functions and a cross-entropy loss. Contrary to the first experiment, the authors focus on the evolution of the validation loss, see Figure 3. While they observe improved convergence of the training loss, this comes at the cost of very early overfitting and a slightly larger validation loss.

Figure 3: MNIST classification task; optimization with and without teleportation. Solid lines represent the training loss and dashed ones the validation loss. See Figure 4 of [Zha23S].

Discussion

Although the improvements in convergence might seem impressive, the slightly worse generalization behavior potentially outweighs the benefit of the speed-up. With this in mind, the authors analyze further how to improve generalization with teleportation in the follow-up paper [Zha24I].
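The Newton-step argument from the convergence analysis can be checked numerically on a toy quadratic. The sketch below is an illustration under our own choices (a random positive definite Hessian, one concrete $O(n)$ action that leaves the quadratic invariant, and a closed-form choice of the optimal group element instead of a search over the orbit), not code from [Zha23S]: it teleports a point to the gradient-norm maximizer on its level set, verifies that the gradient there is an eigenvector of the Hessian, and confirms that a single gradient step from the teleported point coincides with a Newton step.

```python
# Illustrative check: for a quadratic loss L(w) = 1/2 (w - w*)^T A (w - w*),
# the loss is invariant under w -> w* + A^{-1/2} R A^{1/2} (w - w*) with R in O(n),
# and this action is transitive on level sets. At the point of maximal gradient
# norm on a level set, the gradient is an eigenvector of the Hessian A, and a
# gradient step with learning rate 1/lambda_max equals the Newton step.
import numpy as np

rng = np.random.default_rng(0)
n = 6
B = rng.standard_normal((n, n))
A = B @ B.T + n * np.eye(n)          # Hessian: symmetric positive definite
w_star = rng.standard_normal(n)      # minimizer
w0 = rng.standard_normal(n)          # starting point

loss = lambda w: 0.5 * (w - w_star) @ A @ (w - w_star)
grad = lambda w: A @ (w - w_star)

# Closed-form "teleport": rotate u = A^{1/2} (w0 - w*) onto the top eigenvector
# of A while preserving its norm, which maximizes the gradient norm on the level set.
eigval, eigvec = np.linalg.eigh(A)   # eigenvalues in ascending order
A_sqrt = eigvec @ np.diag(np.sqrt(eigval)) @ eigvec.T
A_inv_sqrt = eigvec @ np.diag(1.0 / np.sqrt(eigval)) @ eigvec.T
u = A_sqrt @ (w0 - w_star)
v_max = eigvec[:, -1]
w_tel = w_star + A_inv_sqrt @ (np.linalg.norm(u) * v_max)

print("loss unchanged by teleport:", np.isclose(loss(w0), loss(w_tel)))
print("grad norm before/after:", np.linalg.norm(grad(w0)), np.linalg.norm(grad(w_tel)))

g = grad(w_tel)
print("gradient is an eigenvector of the Hessian:", np.allclose(A @ g, eigval[-1] * g))

w_gd = w_tel - g / eigval[-1]                  # gradient step, lr = 1/lambda_max
w_newton = w_tel - np.linalg.solve(A, g)       # Newton step
print("GD step == Newton step == minimizer:",
      np.allclose(w_gd, w_newton), np.allclose(w_gd, w_star))
```

In this quadratic setting both steps land exactly at the minimizer $w^*$, which is the sense in which teleportation plus a first-order step approximates a Newton step near a minimum.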