Adam Optimizer Causes Privileged Basis in Transformer Language Models

 

This article shows that in transformer-based language models the Adam optimizer causes privileged basis dimensions (outlier dimensions) to emerge, whereas training with SGD does not, improving our understanding of language model training dynamics.

🎯 The Adam optimizer is conclusively shown to be the primary cause of privileged basis dimensions in transformer-based language models. By training identical 12M-parameter models on the TinyStories dataset with Adam and with SGD, the authors find that the Adam-trained model's excess kurtosis quickly exceeds 100, with clear outlier channels, while the SGD-trained model's excess kurtosis stays near 0, with no privileged basis.

📚 The authors cite Anthropic's research on privileged bases as the primary motivation for this work. Their experiments use the TinyStories dataset, a small, synthetically generated collection of English children's stories that reproduces the outlier-channel behavior of language models at a much smaller scale and lower training cost.

🔍 To measure basis privilege, the authors follow Anthropic in using excess kurtosis as the metric. During Adam-optimized LM training, outlier channels in the activations become increasingly prominent as training progresses, and the excess kurtosis of the activations rises accordingly.

🌐 Future research will use sparse autoencoders to check whether the learned features align with the outlier channels, and will investigate the specific mechanism within Adam that causes features to align with basis dimensions.

Published on September 6, 2024 5:55 PM GMT

Diego Caples (diego@activated-ai.com)

Rob Neuhaus (rob@activated-ai.com)

Introduction

In principle, neuron activations in a transformer-based language model's residual stream should all be of roughly the same scale. In practice, however, the dimensions vary in scale unexpectedly widely. Mathematical theories of the transformer architecture do not predict this: they expect rotational equivariance within a model, where no dimension is more important than any other. Is there something wrong with our reasonably informed intuitions of how transformers work? What explains these outlier channels?

Previously, Anthropic researched the existence of these privileged basis dimensions (dimensions more important / larger than expected) and ruled out several causes. By elimination, they reached the hypothesis that per-channel normalization in the Adam optimizer was the cause of privileged basis. However, they did not prove this was the case.

We conclusively show that Adam causes outlier channels / a privileged basis within the transformer residual stream. When the Adam optimizer is replaced with SGD, the trained models do not develop a privileged basis.

As a whole, this work improves mechanistic understanding of transformer LM training dynamics and confirms that our mathematical models of transformers are not flawed. Rather, they simply do not take into account the training process.

Our code is open source at the LLM outlier channel exploration GitHub.

Key Results

Background

Recommended Reading

More About Anthropic’s Work

We consider Anthropic’s research on privileged bases the primary motivator for this work. In Anthropic’s Privileged Bases in the Transformer Residual Stream, they demonstrate a privileged basis in a 200M parameter LLM and perform experiments to rule out possible causes, but do not find a definitive one. They hypothesize that outlier channels are caused by Adam’s lack of rotational equivariance, and suggest that training with SGD could isolate Adam as the cause.

Adam vs SGD, and Rotational Equivariance

Consider an experiment where we rotate the parameter space of a neural network, train it, and then invert the rotation. With Stochastic Gradient Descent (SGD), this process yields the same model as if we hadn't rotated at all. However, with the Adam optimizer, we end up with a different model.

This difference can be explained by the presence or absence of a property called rotational equivariance. SGD is rotationally equivariant: optimizer steps are always directly proportional to the gradient of the loss function, regardless of the chosen coordinate system. In contrast, Adam is not rotationally equivariant: its steps are not proportional to the gradient, because each update depends on coordinate-wise gradient statistics. As we show later, this difference is what leads to a privileged basis within LMs.
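The rotation experiment can be made concrete with a toy example. The sketch below is our own illustration (not the authors' code): it trains a small linear regression in the original parameter basis and in a randomly rotated basis, then maps the rotated solution back. With SGD the two results agree up to floating point error; with Adam they generally differ, because Adam divides each coordinate's update by its own running second-moment estimate.

```python
import torch

torch.manual_seed(0)
d = 8
Q, _ = torch.linalg.qr(torch.randn(d, d))   # random orthogonal matrix

x = torch.randn(256, d)
w_true = torch.randn(d)
y = x @ w_true

def train(rotate: bool, opt_name: str, steps: int = 100, lr: float = 1e-2) -> torch.Tensor:
    w = torch.zeros(d, requires_grad=True)
    opt = (torch.optim.SGD([w], lr=lr) if opt_name == "sgd"
           else torch.optim.Adam([w], lr=lr))
    inputs = x @ Q.T if rotate else x       # rotating the inputs rotates the parameter basis
    for _ in range(steps):
        opt.zero_grad()
        loss = ((inputs @ w - y) ** 2).mean()
        loss.backward()
        opt.step()
    return (Q.T @ w.detach()) if rotate else w.detach()   # map back to the original basis

for name in ["sgd", "adam"]:
    diff = (train(False, name) - train(True, name)).norm().item()
    print(f"{name}: distance between un-rotated and baseline solutions = {diff:.2e}")
```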

Kurtosis

Motivated by Anthropic, we use excess kurtosis as a metric for measuring basis privilege.

We encourage the reader to read Anthropic’s reasoning for why this is a good metric, but here we aim to demonstrate graphically that excess kurtosis is a reasonable choice for measuring basis privilege.
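For reference, the excess kurtosis of a vector of activations is its fourth standardized moment minus 3, so a roughly Gaussian vector scores about 0 and a few large outlier channels drive the value up. A minimal sketch of the computation (the per-token usage here is our assumption; the authors' exact implementation may aggregate differently):

```python
import torch

def excess_kurtosis(acts: torch.Tensor) -> torch.Tensor:
    """Excess kurtosis of one token's residual-stream activations, shape (d_model,)."""
    z = (acts - acts.mean()) / acts.std(unbiased=False)
    return (z ** 4).mean() - 3.0   # ~0 for Gaussian activations; large when a few channels dominate
```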

We plot the middle-layer residual stream activations for the last token of the string:

“Lilly saw a big red apple!”

as an Adam-optimized LM training run progresses.

Model Excess Kurtosis through Training

 

Note how as training progresses, the outlier channels in the activation become increasingly prominent. The excess kurtosis of the activations increases accordingly.
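As an illustration of how such activations can be captured, here is a hedged sketch using a forward hook on a Hugging Face GPT-2 model as a stand-in; the model, tokenizer, and layer attribute names are assumptions about that library, not the authors' training code.

```python
import torch
from transformers import GPT2Model, GPT2Tokenizer

model = GPT2Model.from_pretrained("gpt2")
tok = GPT2Tokenizer.from_pretrained("gpt2")

captured = {}
def hook(module, inputs, output):
    # inputs[0] is the hidden state entering this block: (batch, seq, d_model)
    captured["resid"] = inputs[0].detach()

mid = len(model.h) // 2                       # a middle transformer block
handle = model.h[mid].register_forward_hook(hook)

batch = tok("Lilly saw a big red apple!", return_tensors="pt")
with torch.no_grad():
    model(**batch)
handle.remove()

acts = captured["resid"][0, -1]               # last token's residual stream, shape (d_model,)
z = (acts - acts.mean()) / acts.std(unbiased=False)
print("excess kurtosis:", ((z ** 4).mean() - 3.0).item())
```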

TinyStories

We use the TinyStories dataset in all of our experiments. TinyStories is a small, synthetically generated dataset of English children’s stories. Its authors showed that ~10M parameter LMs trained on the dataset can generate coherent and creative stories and demonstrate emergent properties previously found only in much larger LMs. This enables us to reproduce LMs with outlier channels at a much smaller scale than previous works.

Experiments

Replicating Outlier Channels at Small Scale

To test if training with SGD prevents privileged basis, we first need to have a model that replicates outlier channel behavior.

We train a 12M parameter transformer LM with Adam. It is capable of generating coherent stories. As the model trains, the excess kurtosis increases, exceeding 100 by the time training terminates. Clear outlier channels are present (as seen in the Model Excess Kurtosis through Training figure above).

Training an LM with SGD

Next, we train the exact same model, this time using SGD with momentum (which is still rotationally equivariant). Adam reaches the same loss in roughly 16x fewer steps than SGD; it is the small size of the model that makes it affordable to train with SGD for so long.
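The only difference between the two runs is the optimizer constructor. A minimal sketch of how that swap might look (the learning rates and momentum value here are illustrative assumptions, not the authors' exact hyperparameters):

```python
import torch

def make_optimizer(model: torch.nn.Module, use_adam: bool) -> torch.optim.Optimizer:
    # Illustrative hyperparameters only; the authors' exact values may differ.
    if use_adam:
        return torch.optim.Adam(model.parameters(), lr=3e-4)
    # SGD with momentum is still rotationally equivariant: its update is a running
    # average of gradients with a single global scale, with no per-coordinate rescaling.
    return torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
```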

Comparing the excess kurtosis of SGD and Adam shows a stark contrast:

Note: the SGD-with-momentum model’s excess kurtosis is the flat green line at 0.

While the Adam-trained model’s excess kurtosis quickly exceeds 100, the excess kurtosis of the SGD trained model remains approximately 0 throughout training. It is clear that Adam is causally responsible for privileged basis/outlier channels.

Conclusions

We conclusively demonstrate that the Adam optimizer is the primary cause of privileged basis dimensions in transformer-based language models. By training identical 12M parameter models on the TinyStories dataset using both Adam and SGD with momentum, we observed a clear difference in the development of outlier channels:

These findings have several important implications:

Future Research



