MarkTechPost@AI 2024年10月20日
Nvidia AI Introduces the Normalized Transformer (nGPT): A Hypersphere-based Transformer Achieving 4-20x Faster Training and Improved Stability for LLMs

Researchers at NVIDIA propose nGPT, a new architecture aimed at improving the training efficiency of Transformer models without sacrificing performance. It integrates normalization throughout the entire architecture, making training faster and more stable, and it performs strongly across a range of experiments.

🌐 nGPT performs representation learning on the hypersphere: every vector in the model is normalized to unit norm, so input tokens move across the hypersphere's surface while each layer incrementally contributes to the final output prediction, reducing the number of training steps required.

🎯 nGPT's structure revolves around a systematic normalization process: all embeddings, as well as the attention and MLP matrices, are constrained to the hypersphere, ensuring a uniform representation across all network layers; learnable scaling parameters are also introduced.

🎉 The results show that, for the same training budget, nGPT achieves significantly lower validation loss than GPT and outperforms it on a range of downstream tasks, improving both embedding separability and accuracy.

The rise of Transformer-based models has significantly advanced the field of natural language processing. However, the training of these models is often computationally intensive, requiring substantial resources and time. This research addresses the issue of improving the training efficiency of Transformer models without compromising their performance. Specifically, it seeks to explore whether the benefits of normalization, often applied as a separate component, can be integrated throughout the Transformer architecture in a more cohesive manner.

Researchers from NVIDIA propose a novel architecture called the Normalized Transformer (nGPT), which incorporates representation learning on the hypersphere. In this approach, all vectors involved in the embeddings, MLP, attention matrices, and hidden states are normalized to unit norm. This normalization allows the input tokens to move across the surface of a hypersphere, with each model layer incrementally contributing towards the final output prediction. By conceptualizing the entire transformation process as movement on a hypersphere, the researchers aim to make the training process both faster and more stable. The nGPT model reportedly reduces the number of training steps required by a factor of 4 to 20, depending on the sequence length.
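Because every vector is kept at unit norm, a layer's contribution can be pictured as a small displacement along the hypersphere followed by re-projection onto it. A minimal NumPy sketch of this idea (the helper names and the single scalar `alpha` step size are illustrative assumptions, not the paper's exact parameterization):

```python
import numpy as np

def unit_norm(x, axis=-1, eps=1e-8):
    """Project vectors back onto the unit hypersphere."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def hypersphere_step(h, layer_output, alpha=0.1):
    """Move a hidden state toward a layer's (normalized) suggestion,
    then re-normalize so the result stays on the hypersphere."""
    target = unit_norm(layer_output)
    return unit_norm(h + alpha * (target - h))

# Toy example: a batch of unit-norm hidden states nudged toward raw layer outputs.
rng = np.random.default_rng(0)
h = unit_norm(rng.normal(size=(4, 16)))   # batch of 4 unit vectors
out = rng.normal(size=(4, 16))            # raw (unnormalized) layer output
h_new = hypersphere_step(h, out)
```

Each application of `hypersphere_step` leaves the hidden state on the unit sphere, which is what lets the whole forward pass be read as movement across the hypersphere's surface.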

The structure of the Normalized Transformer revolves around a systematic normalization process. All embeddings, as well as attention and MLP matrices, are constrained to lie on a hypersphere, ensuring uniform representation across all network layers. Specifically, the embeddings and the outputs from the attention mechanism and MLP are normalized, treating each vector operation as a dot product representing cosine similarity. Furthermore, instead of using traditional weight decay and additional normalization layers like LayerNorm or RMSNorm, the authors introduce learnable scaling parameters to control the impact of normalization. The normalization and optimization process in nGPT is designed as a variable-metric optimization on the hypersphere, with the update steps controlled by learnable eigen learning rates that adaptively adjust each layer’s contributions.
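The per-layer update described above can be sketched as a toy block in which the attention and MLP sublayers each pull the hidden state along the hypersphere, with per-dimension step sizes standing in for the learnable eigen learning rates. The sublayers here are random linear maps and the step-size values are arbitrary; this is a hedged illustration of the update structure, not the paper's implementation:

```python
import numpy as np

def unit_norm(x, eps=1e-8):
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

class NormalizedBlock:
    """Toy nGPT-style block: each sublayer output is normalized, and the
    hidden state moves toward it with per-dimension step sizes (standing in
    for the learnable eigen learning rates), then is re-normalized."""
    def __init__(self, dim, rng):
        self.W_attn = rng.normal(size=(dim, dim)) / np.sqrt(dim)  # stand-in for attention
        self.W_mlp = rng.normal(size=(dim, dim)) / np.sqrt(dim)   # stand-in for the MLP
        self.alpha_attn = np.full(dim, 0.05)  # learnable in the real model
        self.alpha_mlp = np.full(dim, 0.05)

    def __call__(self, h):
        h_attn = unit_norm(h @ self.W_attn)
        h = unit_norm(h + self.alpha_attn * (h_attn - h))
        h_mlp = unit_norm(h @ self.W_mlp)
        h = unit_norm(h + self.alpha_mlp * (h_mlp - h))
        return h

rng = np.random.default_rng(1)
block = NormalizedBlock(16, rng)
h = unit_norm(rng.normal(size=(2, 16)))
h_out = block(h)
```

Because the state is re-normalized after every sublayer, no separate LayerNorm/RMSNorm layers or weight decay are needed; the step-size vectors alone control how far each sublayer moves the representation.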

The results of the research are compelling. The authors conducted experiments using the OpenWebText dataset, training both a baseline GPT model and the new nGPT model. For the same training budget, nGPT demonstrated a significant reduction in validation loss compared to GPT, particularly at longer context lengths. For instance, with a context length of 4k tokens, nGPT achieved the same validation loss as GPT with only one-tenth of the iterations. The experiments also confirmed that nGPT consistently outperformed the baseline GPT on a range of downstream tasks, providing not only faster convergence but also improved generalization. The introduction of hyperspherical representation learning led to better embedding separability, which correlated with higher accuracy on benchmark tests.

In conclusion, the Normalized Transformer (nGPT) presents a significant advancement in the efficient training of large language models. By unifying the findings of previous studies on normalization and embedding representation, the authors created a model that is more efficient in terms of computational resources while still maintaining high performance. The approach of utilizing the hypersphere as the foundation for all transformations allows for more stable and consistent training, potentially paving the way for future optimizations in the architecture of Transformer models. The researchers suggest that this method could be extended to more complex encoder-decoder architectures and other hybrid model frameworks.


Check out the Paper. All credit for this research goes to the researchers of this project.


