MarkTechPost@AI September 30, 2024
Scaling Laws and Model Comparison: New Frontiers in Large-Scale Machine Learning

The shift in machine learning from a generalization-centric paradigm to a scaling-centric paradigm brings new challenges. Researchers need to reevaluate traditional methods, balance computational constraints against performance gains, and develop new principles and methodologies to meet these challenges.

🎯 The scaling-centric paradigm focuses on reducing approximation error through scaling rather than reducing generalization error through regularization. This challenges conventional wisdom: practices that worked well in the generalization-centric paradigm may not be optimal in the scaling-centric one.

📏 Two machine learning paradigms have emerged. The generalization-centric paradigm operates at relatively small data scales and splits into two sub-paradigms, the classical bias-variance trade-off regime and the modern over-parameterized regime; the scaling-centric paradigm is characterized by large data and model scales.

💻 In the scaling paradigm, the effectiveness of traditional regularization techniques is being reevaluated. With vast amounts of training data, overfitting is less of a concern, so the traditional benefits of regularization may no longer apply, and alternative methods better suited to this paradigm need to be explored.

🤔 Model comparison in the scaling paradigm faces unique challenges: traditional validation-set approaches become impractical at massive scales, and scaling law crossover means that performance rankings observed at small scales do not necessarily carry over to larger models.

Large language models (LLMs) have gained significant attention in machine learning, shifting the focus from optimizing generalization on small datasets to reducing approximation error on massive text corpora. This paradigm shift presents researchers with new challenges in model development and training methodologies. The primary objective has evolved from preventing overfitting through regularization techniques to effectively scaling up models to consume vast amounts of data. Researchers now face the challenge of balancing computational constraints with the need for improved performance on downstream tasks. This shift necessitates a reevaluation of traditional approaches and the development of robust strategies to harness the power of large-scale language pretraining while addressing the limitations imposed by available computing resources.

The shift from a generalization-centric paradigm to a scaling-centric paradigm in machine learning has necessitated reevaluating traditional approaches. Google DeepMind researchers have identified key differences between these paradigms, focusing on minimizing approximation error through scaling rather than reducing generalization error through regularization. This shift challenges conventional wisdom, as practices that were effective in the generalization-centric paradigm may not yield optimal results in the scaling-centric approach. The phenomenon of “scaling law crossover” further complicates matters, as techniques that enhance performance at smaller scales may not translate effectively to larger ones. To mitigate these challenges, researchers propose developing new principles and methodologies to guide scaling efforts and effectively compare models at unprecedented scales where conducting multiple experiments is often infeasible.

Machine learning aims to develop functions capable of making accurate predictions on unseen data by understanding the underlying structure of the data. This process involves minimizing the test loss on unseen data while learning from a training set. The test error can be decomposed into the generalization gap and the approximation error (training error).
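
In symbols, for a trained model the decomposition reads:

test error = training error + (test error − training error)
           = approximation error + generalization gap

The generalization-centric paradigm concentrates on shrinking the second term through regularization, while the scaling-centric paradigm concentrates on driving down the first term by scaling up data and models.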

Two distinct paradigms have emerged in machine learning, differentiated by the relative and absolute scales of data and models:

1. The generalization-centric paradigm, which operates with relatively small data scales, is further divided into two sub-paradigms:

   a) The classical bias-variance trade-off regime, where model capacity is intentionally constrained.

   b) The modern over-parameterized regime, where model scale significantly surpasses data scale.

2. The scaling-centric paradigm, characterized by large data and model scales, with data scale exceeding model scale.

These paradigms present different challenges and require distinct approaches to optimize model performance and achieve desired outcomes.

The proposed method employs a decoder-only transformer architecture trained on the C4 dataset, utilizing the NanoDO codebase. Key architectural features include Rotary Positional Embedding, QK-Norm for attention computation, and untied head and embedding weights. The model uses GELU activation with F = 4D, where D is the model dimension and F is the hidden dimension of the MLP. Attention heads are configured with a head dimension of 64, and the sequence length is set to 512.

The model’s vocabulary size is 32,101, and the total parameter count is approximately 12D²L, where L is the number of transformer layers. Most models are trained to Chinchilla optimality, using 20 × (12D²L + DV) tokens, where V is the vocabulary size. Compute requirements are estimated with the standard approximation of roughly 6 × N × T floating-point operations, where N is the parameter count and T is the number of training tokens.
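
As a rough illustration, these sizing formulas translate directly into a few lines of Python; the model dimension and depth below are placeholder values, not configurations taken from the paper.

```python
# Back-of-the-envelope sizing for a decoder-only transformer, using the
# approximations quoted above. D and L are illustrative placeholders.

D = 512          # model dimension (placeholder)
L = 8            # number of transformer layers (placeholder)
V = 32_101       # vocabulary size from the setup above

non_embedding_params = 12 * D**2 * L      # ~12 D^2 L
embedding_params = D * V                  # the DV term in the token budget
n_params = non_embedding_params + embedding_params  # embedding term included,
                                                    # matching the token-budget formula

tokens = 20 * n_params                    # Chinchilla-optimal budget: 20 x (12 D^2 L + DV)
flops = 6 * n_params * tokens             # training compute: ~6 x parameters x tokens

print(f"parameters      ~ {n_params:,}")
print(f"training tokens ~ {tokens:,}")
print(f"training FLOPs  ~ {flops:.2e}")
```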

For optimization, the method employs AdamW with β1 = 0.9, β2 = 0.95, ϵ = 1e-20, and a coupled weight decay λ = 0.1. This combination of architectural choices and optimization strategies aims to enhance the model’s performance in the scaling-centric paradigm.
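
Since NanoDO is built on JAX, the optimizer described above can be sketched with optax roughly as follows; the learning rate is a placeholder, as the schedule used in the paper is not reproduced here.

```python
import optax

LEARNING_RATE = 1e-3  # placeholder; not the value or schedule from the paper

# AdamW with the hyperparameters quoted above. optax.adamw applies the
# weight-decay term before the final learning-rate scaling, so the decay is
# effectively coupled to the learning rate, as described in the text.
optimizer = optax.adamw(
    learning_rate=LEARNING_RATE,
    b1=0.9,
    b2=0.95,
    eps=1e-20,
    weight_decay=0.1,
)
```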

In the scaling-centric paradigm, traditional regularization techniques are being reevaluated for their effectiveness. Three regularization methods commonly used in the generalization-centric paradigm are explicit L2 regularization, the implicit regularization of large learning rates, and the implicit regularization of small batch sizes. These techniques have been instrumental in mitigating overfitting and reducing the gap between training and test losses in smaller-scale models.
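
As a point of reference, explicit L2 regularization simply adds a penalty proportional to the squared parameter norm to the training objective; the snippet below is a generic sketch, not code from the paper.

```python
import numpy as np

def l2_regularized_loss(train_loss, params, lam=0.1):
    """Add an explicit L2 penalty, (lam / 2) * sum of squared parameters,
    to an unregularized training loss. Purely illustrative."""
    l2_penalty = sum(float(np.sum(p ** 2)) for p in params)
    return train_loss + 0.5 * lam * l2_penalty

# Toy usage with placeholder parameter arrays and loss value.
params = [np.ones((4, 4)), np.ones(4)]
print(l2_regularized_loss(train_loss=1.25, params=params))  # 1.25 + 0.05 * 20 = 2.25
```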

However, in the context of large language models and the scaling-centric paradigm, the necessity of these regularization techniques is being questioned. As models operate in a regime where overfitting is less of a concern due to the vast amount of training data, the traditional benefits of regularization may no longer apply. This shift prompts researchers to reconsider the role of regularization in model training and to explore alternative approaches that may be more suitable for the scaling-centric paradigm.

The scaling-centric paradigm presents unique challenges in model comparison as traditional validation set approaches become impractical at massive scales. The phenomenon of scaling law crossover further complicates matters, as performance rankings observed at smaller scales may not hold true for larger models. This raises the critical question of how to effectively compare models when training is feasible only once at scale.
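
The crossover effect is easy to visualize with toy numbers: if each training recipe's loss follows a power-law fit of the form loss(C) = a · C^(−b) + c in compute C, a recipe that is ahead at small compute can fall behind one with a steeper exponent once compute is large enough. All coefficients below are invented for illustration.

```python
import numpy as np

# Two hypothetical training recipes with power-law loss curves,
# loss(C) = a * C**(-b) + c. Coefficients are invented for illustration.
def loss_recipe_a(C):
    return 2.0 * C ** (-0.05) + 1.8    # better at small compute, shallower exponent

def loss_recipe_b(C):
    return 18.0 * C ** (-0.10) + 1.8   # worse at small compute, steeper exponent

compute = np.logspace(15, 24, 400)     # training FLOPs from small- to large-scale runs
gap = loss_recipe_a(compute) - loss_recipe_b(compute)
crossover = compute[np.argmin(np.abs(gap))]

print(f"recipe A wins below ~{crossover:.1e} FLOPs; recipe B wins above it")
```

Any ranking established with small-scale experiments to the left of that crossing point would be misleading at the target scale, which is precisely why comparing methods only at small scale is unreliable in this paradigm.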

In contrast, the generalization-centric paradigm relies heavily on regularization as a guiding principle. This approach has led to insights into hyperparameter choices, weight decay effects, and the benefits of over-parameterization. It also explains the effectiveness of techniques like weight sharing in CNNs, locality, and hierarchy in neural network architectures.

However, the scaling-centric paradigm may require new guiding principles. While regularization has been crucial for understanding and improving generalization in smaller models, its role and effectiveness in large-scale language models are being reevaluated. Researchers are now challenged to develop robust methodologies and principles that can guide the development and comparison of models in this new paradigm, where traditional approaches may no longer apply.


Check out the Paper. All credit for this research goes to the researchers of this project.



The post Scaling Laws and Model Comparison: New Frontiers in Large-Scale Machine Learning appeared first on MarkTechPost.
