MarkTechPost@AI · 13 hours ago
MIT Researchers Develop Methods to Control Transformer Sensitivity with Provable Lipschitz Bounds and Muon

MIT researchers have proposed a novel approach that stabilizes the training of large Transformer models by spectrally regulating the model weights, guaranteeing provable Lipschitz bounds. Unlike traditional "band-aid" fixes such as activation normalization, QK norm, or logit softcapping, the method directly addresses the activation and loss spikes caused by unconstrained growth of weight and activation norms. The work shows that this approach markedly improves robustness and predictability without sacrificing performance, while keeping activations within a very small range, opening up new possibilities for low-precision training and inference.

✨ **Addressing the root cause of unstable Transformer training**: The article points out that the main cause of instability when training large Transformer models is the unconstrained growth of activations and loss, which stems from runaway weight and activation norms. The MIT researchers attack this problem at its source by spectrally regulating the Transformer's weights, without relying on auxiliary techniques such as activation normalization, QK norm, or logit softcapping.

🔬 **Why Lipschitz bounds matter and how they are enforced**: A Lipschitz bound measures how sensitive a network's output is to perturbations of its inputs or weights. A lower Lipschitz bound means the model is less sensitive to changes or noise, which is crucial for stability, adversarial robustness, privacy, and generalization. The study shows that spectrally regulating the weights effectively constrains the model's Lipschitz constant, yielding provable stability.

🚀 **The Muon optimizer and weight spectral regulation**: One core innovation builds on the Muon optimizer, which spectrally regulates gradient updates so that no update increases the spectral norm of a weight matrix. The researchers extend this regulation to the weights themselves, applying operations that cap the singular values of every weight matrix. As a result, activation norms stay within a very small range, even compatible with fp8 precision.

💡 **The benefit of removing traditional stability tricks**: Experiments show that once spectral weight regulation is applied, traditional stabilization techniques such as layer normalization, QK norm, and logit tanh softcapping can be removed entirely. In a GPT-2-scale Transformer, the maximum activation entry never exceeded roughly 100, whereas an unconstrained baseline exceeded 148,000, greatly improving training stability and efficiency.

📈 **Trading off performance against the Lipschitz bound**: The study explores multiple methods for enforcing Lipschitz constraints and compares their ability to maintain high performance, guarantee a Lipschitz bound, and optimize the performance-Lipschitz tradeoff. The combination of the Muon optimizer with Spectral Cap (a new smooth regulation method) achieves the best loss-Lipschitz tradeoff, giving the model greater stability while remaining competitive in performance.

Training large-scale transformers stably has been a longstanding challenge in deep learning, particularly as models grow in size and expressivity. MIT researchers tackle a persistent problem at its root: the unstable growth of activations and loss spikes caused by unconstrained weight and activation norms. Their solution is to enforce provable Lipschitz bounds on the transformer by *spectrally regulating the weights*, with no use of activation normalization, QK norm, or logit softcapping tricks.

What is a Lipschitz Bound—and Why Enforce It?

A Lipschitz bound on a neural network quantifies the maximum amount by which the output can change in response to input (or weight) perturbations. Mathematically, a function $f$ is $K$-Lipschitz if

$$\|f(x_1) - f(x_2)\| \le K\,\|x_1 - x_2\| \quad \forall\, x_1, x_2$$
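
Lipschitz constants compose: if every layer is individually $K_i$-Lipschitz, the whole network is at most $\prod_i K_i$-Lipschitz, and a linear layer's $\ell_2$ Lipschitz constant is exactly its largest singular value, i.e. its spectral norm. That is why regulating weight spectra yields a certifiable bound on the full model. Below is a minimal sketch of this bookkeeping for a toy MLP with 1-Lipschitz activations (ReLU); the architecture and names are illustrative, not the paper's model.

```python
import torch
import torch.nn as nn

# Toy MLP: ReLU is 1-Lipschitz, so the product of the linear layers'
# spectral norms certifies an upper bound on the network's Lipschitz constant.
mlp = nn.Sequential(
    nn.Linear(64, 128), nn.ReLU(),
    nn.Linear(128, 128), nn.ReLU(),
    nn.Linear(128, 10),
)

def lipschitz_upper_bound(model: nn.Sequential) -> float:
    bound = 1.0
    for layer in model:
        if isinstance(layer, nn.Linear):
            # Spectral norm = largest singular value = the layer's l2 Lipschitz constant.
            bound *= torch.linalg.matrix_norm(layer.weight, ord=2).item()
    return bound

print(f"Certified Lipschitz upper bound: {lipschitz_upper_bound(mlp):.3e}")
```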

Motivation and Problem Statement

Traditionally, training stable transformers at scale has involved a variety of "band-aid" stabilization tricks:

    Activation normalization (layer norm),
    QK norm, and
    Logit softcapping (tanh).

But these do not directly address the underlying spectral norm (largest singular value) growth in the weights, a root cause of exploding activations and training instability—especially in large models.

The central hypothesis: If we spectrally regulate the weights themselves—beyond just the optimizer or activations—we can maintain tight control over Lipschitzness, potentially solving instability at its source.
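
To make "spectrally regulating the weights themselves" concrete, the sketch below hard-caps each weight matrix's singular values at a target value after every optimizer step, using an SVD projection. This is a simplified stand-in (the function name `cap_spectral_norm` and the cap value `sigma_max` are illustrative), not the paper's exact Spectral Cap or spectral normalization operators, which are designed to be smoother and cheaper.

```python
import torch

@torch.no_grad()
def cap_spectral_norm(weight: torch.Tensor, sigma_max: float = 1.0) -> torch.Tensor:
    """Project a weight matrix so that no singular value exceeds sigma_max."""
    U, S, Vh = torch.linalg.svd(weight, full_matrices=False)
    S = S.clamp(max=sigma_max)              # cap every singular value
    return U @ torch.diag(S) @ Vh

# Usage sketch: apply to every 2-D weight after each optimizer step.
# for p in model.parameters():
#     if p.ndim == 2:
#         p.copy_(cap_spectral_norm(p, sigma_max=1.0))
```

Because each projected matrix then has spectral norm at most `sigma_max`, the per-layer Lipschitz constants (and hence the product bound sketched earlier) stay under explicit control throughout training.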

Key Innovations

Weight Spectral Regulation and the Muon Optimizer
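
As described above, the Muon optimizer spectrally regulates the update itself: each weight matrix's update is (approximately) orthogonalized so that all of its singular values are close to 1, giving every step a controlled spectral norm. The sketch below uses the classic cubic Newton-Schulz iteration for this orthogonalization; it is a hedged illustration of the idea (Muon implementations typically use a tuned higher-order polynomial and momentum), not the paper's exact optimizer.

```python
import torch

def orthogonalize(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Approximate the orthogonal polar factor of G (i.e. U @ Vh from its SVD)."""
    X = G / (G.norm() + 1e-7)                # rescale so the iteration converges
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T                              # iterate on the wide orientation
    for _ in range(steps):
        X = 1.5 * X - 0.5 * (X @ X.T @ X)    # cubic Newton-Schulz step
    return X.T if transposed else X

# Muon-style update sketch for a 2-D weight W with momentum buffer M:
# M = beta * M + W.grad
# W.data -= lr * orthogonalize(M)
```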

Removing Stability Tricks

In all experiments, no layer normalization, no QK norm, and no logit tanh were used. Yet training remained stable and accuracy was preserved, with activations staying in a compact, predictable range (see the table below).

Table Sample (NanoGPT Experiment)

| Model | Max Activation | Layer Stability Tricks | Validation Accuracy | Lipschitz Bound |
|---|---|---|---|---|
| Baseline (Speedrun) | 148,480 | Yes | 39.4% | None enforced |
| Lipschitz Transformer | 160 | None | 39.5% | 10^264 |

Methods for Enforcing Lipschitz Constraints

A variety of weight norm constraint methods were explored and compared for their ability to:

    Maintain high performance,
    Guarantee a Lipschitz bound, and
    Optimize the performance-Lipschitz tradeoff.

Techniques

Experimental Results and Insights

Model Evaluation at Various Scales

    Shakespeare (Small Transformer, <2-Lipschitz):
      Achieves 60% validation accuracy with a provable Lipschitz bound below 2.
      Outperforms the unconstrained baseline in validation loss.
    NanoGPT (145M Parameters):
      With a Lipschitz bound <10, validation accuracy reaches 21.2%.
      Matching the strong unconstrained baseline (39.4% accuracy) required a much larger upper bound of 10^264. This highlights how strict Lipschitz constraints often trade off with expressivity at large scales, for now.

Weight Constraint Method Efficiency

Stability and Robustness

Activation Magnitudes

Limitations and Open Questions

Conclusion

Spectral weight regulation—especially when paired with the Muon optimizer—can stably train large transformers with enforced Lipschitz bounds, without activation normalization or other band-aid tricks. This addresses instability at a deeper level and keeps activations in a compact, predictable range, greatly improving adversarial robustness and potentially hardware efficiency.

This line of work points to new, efficient computational primitives for neural network regulation, with broad applications for privacy, safety, and low-precision AI deployment.


Check out the Paper, GitHub Page and Hugging Face Project Page.

