MarkTechPost@AI, September 29, 2024
Revisiting Weight Decay: Beyond Regularization in Modern Deep Learning

This article examines the role of weight decay in modern deep learning. Weight decay is important for limiting network capacity and suppressing irrelevant weight components, yet despite its widespread use its effects are not fully understood. The work studies its influence on optimization dynamics and its role across different architectures and training setups, and uses experiments to reveal its key effects in LLM training.

💪 Weight decay is crucial in machine learning: it aligns with the principle of Occam's razor and is central to discussions of generalization bounds. However, its correlation with generalization in deep networks has been questioned, and its role in state-of-the-art networks such as GPT-3 remains unclear.

🎉 Recent research has made notable progress, highlighting the distinct effects of weight decay and ℓ2 regularization, particularly for optimizers such as Adam, as well as weight decay's influence on optimization dynamics, including the effective learning rate in scale-invariant networks.

🌟 Researchers at EPFL propose a new perspective: weight decay substantially alters the optimization dynamics of both overparameterized and underparameterized networks and prevents sudden loss divergences in bfloat16 mixed-precision training, with findings that apply to architectures ranging from ResNets to LLMs.

🚀 Experiments show that weight decay works through two main mechanisms in LLMs: improving optimization and preventing loss divergence when training with bfloat16 precision. The study also identifies three distinct effects in deep learning: providing regularization when combined with stochastic noise, improving optimization of the training loss, and ensuring stability of low-precision training.

Weight decay and ℓ2 regularization are crucial in machine learning, especially in limiting network capacity and reducing irrelevant weight components. These techniques align with Occam’s razor principles and are central to discussions on generalization bounds. However, recent studies have questioned the correlation between norm-based measures and generalization in deep networks. Although weight decay is widely used in state-of-the-art deep networks like GPT-3, CLIP, and PALM, its effect is still not fully understood. The emergence of new architectures like transformers and nearly one-epoch language modeling has further complicated the applicability of classical results to modern deep-learning settings.

Efforts to understand and utilize weight decay have progressed significantly over time. Recent studies have highlighted the distinct effects of weight decay and ℓ2 regularization, especially for optimizers like Adam, as well as weight decay's influence on optimization dynamics, including its impact on effective learning rates in scale-invariant networks. Other lines of work examine its role in regularizing the input Jacobian and in creating specific dampening effects in certain optimizers. Moreover, a recent investigation examines the relationship between weight decay, training duration, and generalization performance. While weight decay has been shown to improve test accuracy, the improvements are often modest, suggesting that implicit regularization plays a significant role in deep learning.
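
To make the distinction concrete, the sketch below (a simplified illustration, not the authors' code; all values are placeholders) contrasts classic ℓ2 regularization, where the penalty gradient λw is folded into the gradient and therefore passes through Adam's adaptive scaling, with decoupled weight decay in the AdamW style, where the weights are shrunk directly:

```python
import torch

def adam_like_step(w, grad, m, v, lr=1e-3, beta1=0.9, beta2=0.999,
                   eps=1e-8, l2=0.0, weight_decay=0.0):
    """One simplified Adam-style update (bias correction omitted for brevity).

    l2:           classic L2 regularization -- lambda * w is added to the
                  gradient, so it is rescaled by Adam's adaptive step.
    weight_decay: decoupled weight decay (AdamW-style) -- applied directly
                  to the weights, independent of the adaptive scaling.
    """
    grad = grad + l2 * w                      # L2 penalty enters the moments
    m = beta1 * m + (1 - beta1) * grad        # first moment estimate
    v = beta2 * v + (1 - beta2) * grad ** 2   # second moment estimate
    w = w - lr * m / (v.sqrt() + eps)         # adaptive update
    w = w - lr * weight_decay * w             # decoupled shrinkage (AdamW)
    return w, m, v

w = torch.ones(4)
m, v = torch.zeros(4), torch.zeros(4)
# Identical lambda, but the penalty acts differently: the L2 term passes
# through Adam's adaptive scaling, while decoupled decay shrinks every
# weight by the same factor.
w_l2, *_ = adam_like_step(w.clone(), torch.randn(4), m.clone(), v.clone(), l2=0.1)
w_wd, *_ = adam_like_step(w.clone(), torch.randn(4), m.clone(), v.clone(), weight_decay=0.1)
```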

Researchers from the Theory of Machine Learning Lab at EPFL have proposed a new perspective on the role of weight decay in modern deep learning. Their work challenges the traditional view, rooted in classical learning theory, of weight decay as primarily a regularization technique. They show that weight decay significantly modifies optimization dynamics in both overparameterized and underparameterized networks, and that it prevents sudden loss divergences in bfloat16 mixed-precision training, a crucial aspect of LLM training. These effects hold across architectures ranging from ResNets to LLMs, indicating that the primary advantage of weight decay lies in its ability to shape training dynamics rather than in acting as an explicit regularizer.

The experiments are carried out by training GPT-2 models on OpenWebText using the NanoGPT repository. A 124M-parameter model (GPT-2-Small) is trained for 50,000 iterations, with modifications to keep the runs practical within academic compute constraints. Training and validation losses remain closely aligned across different weight decay values. The researchers propose two primary mechanisms for weight decay in LLMs: better optimization of the training loss, and prevention of loss divergence when training with bfloat16 precision.
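
As a rough illustration of how weight decay enters such a training setup (a hedged sketch in plain PyTorch, not the authors' NanoGPT configuration; the model, learning rate, betas, and decay value are placeholders), weight decay is typically supplied to AdamW, and in common practice it is applied only to weight matrices while biases and normalization parameters are left undecayed:

```python
import torch
from torch import nn

# Placeholder model standing in for a GPT-2-Small-style network.
model = nn.Sequential(nn.Linear(768, 3072), nn.GELU(), nn.Linear(3072, 768))

# Split parameters: matrices get weight decay, biases/norm scales do not.
decay, no_decay = [], []
for name, p in model.named_parameters():
    (decay if p.dim() >= 2 else no_decay).append(p)

optimizer = torch.optim.AdamW(
    [{"params": decay, "weight_decay": 0.1},   # 0.1 is a commonly used LLM value
     {"params": no_decay, "weight_decay": 0.0}],
    lr=6e-4, betas=(0.9, 0.95),                # placeholder hyperparameters
)
```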

These findings contrast with data-limited settings, where generalization is the key concern, and highlight the importance of optimization speed and training stability in LLM training.

Experimental results reveal a crucial effect of weight decay in enabling stable bfloat16 mixed-precision training for LLMs. Bfloat16 training accelerates the process and reduces GPU memory usage, enabling larger models and bigger batch sizes. However, even the more stable bfloat16 can exhibit late-training loss spikes that harm model performance, and weight decay is found to prevent these divergences. While float16 training is known to overflow on moderately large values (those exceeding 65,519), bfloat16 poses a different challenge: its limited mantissa precision can cause problems when adding quantities of very different scales. Weight decay effectively mitigates these precision-related issues by preventing excessive weight growth.
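
A small sketch of the underlying numerical issue (an illustrative example, not taken from the paper): bfloat16 keeps float32's dynamic range but only about 8 bits of mantissa, so adding a small update to a large weight can be rounded away entirely, which is one way large weight magnitudes can destabilize low-precision training:

```python
import torch

big = torch.tensor(1024.0)   # stand-in for a large weight value
small = torch.tensor(1.0)    # stand-in for a small gradient update

# In float32 the update is retained.
print((big + small).item())                        # 1025.0

# In bfloat16 the spacing between representable values near 1024 is 8,
# so the +1 update is rounded away and the weight does not move.
print((big.bfloat16() + small.bfloat16()).item())  # 1024.0
```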

In this paper, the researchers present a new perspective on the role of weight decay in modern deep learning. They conclude that weight decay has three distinct effects: it provides regularization when combined with stochastic noise, it improves optimization of the training loss, and it ensures stability of low-precision training.

The researchers challenge the traditional idea that weight decay acts primarily as an explicit regularizer. Instead, they argue that its widespread use in modern deep learning stems from its ability to induce beneficial changes in optimization dynamics. This viewpoint offers a unified explanation for the success of weight decay across different architectures and training settings, from vision tasks with ResNets to LLMs, and has practical implications for model training and hyperparameter tuning in deep learning.


Check out the Paper and GitHub. All credit for this research goes to the researchers of this project.
