【手搓大模型】从零手写GPT2

系列前言

What I cannot create, I do not understand.
—— Richard Feynman

理解大模型最好的方式，应该是亲自动手、从零开始实现。大模型之大在于参数（动辄几十B），而不在于代码量（即便很强大的模型也不过几百行代码）。这样我们便可以在动手写代码中，去思考问题、发现问题、解决问题。

本文不深究背后原理，提供尽可能简单的实现，以便整体理解大模型。

参考Sebastian Raschka和Andrej Karpathy的教程，并进行重新组织，并对核心代码做了优化，使之更简单更清晰。

零基础，具备基本的Python技能，了解Pytorch和Tensor的基本操作。

资源：所有代码均运行在个人电脑上，无需GPU。使用数据均为公开数据集。

系列文章：将会分为以下5篇

【手搓大模型】从零手写GPT2 — Embedding

手搓大模型】从零手写GPT2 — Attention

【手搓大模型】从零手写GPT2 — Model

【手搓大模型】从零训练GPT2:

【手搓大模型】从零微调GPT2:

大模型的本质是空间映射（Mapping Between Spaces）与空间优化（Optimization in Latent Space）。大模型的代码，本质是把输入空间映射到输出空间的函数近似器（function approximator）；而大模型不是直接编程实现规则，而是通过大规模训练过程，在参数空间（weights）中搜索最优解，使得映射函数拟合输入输出的真实分布。

Embedding是把原始输入空间，如文本、语音、图片、视频等，映射到中间空间latent space的过程。

Tokenize: from text to words/tokens

对于文本而言，最先进行的是分词；比如下面代码用最简单的字符分割，把长短的text拆分中单词或token。

import redef tokenize(text):    # Split by punctuation and whitespace    tokens = re.split(r'([,.:;?_!"()']|--|\s)', text)    # Remove empty strings and strip whitespace    tokens = [t.strip() for t in tokens if t.strip()]    return tokens

我们拿Gutenberg中不到1000字的文本为例，分词结果如下：

with open("Peter_Rabbit.txt", "r", encoding="utf-8") as f:    raw_text = f.read()tokens = tokenize(raw_text)print(tokens[:10])

['Once', 'upon', 'a', 'time', 'there', 'were', 'four', 'little', 'Rabbits', ',']

Encode: from token to ID

假设大模型还只是婴儿，看到的所有知识仅是上面的文本。我们可以对文本分词后，从0开始编号。

def build_vocab(whole_text):    tokens = tokenize(whole_text)    vocab = {token:id for id,token in enumerate(sorted(set(tokens)))}    return vocabvocab = build_vocab(raw_text)print(len(vocab))print(list(vocab.items())[:20])

405
[('!', 0), ("'", 1), (',', 2), ('--', 3), ('.', 4), (':', 5), (';', 6), ('A', 7), ('After', 8), ('Also', 9), ('An', 10), ('And', 11), ('Benjamin', 12), ('Bunny', 13), ('But', 14), ('Cotton-tail', 15), ('Cottontail', 16), ('END', 17), ('Father', 18), ('First', 19)]

可见上文仅有405个不同的token。

而编码的过程，就是把不同的token，映射到从0开始的编号中，这个过程就是小学生的“查字典”过程，看看某个单词在字典里的序号是多少。

def encode(vocab, text):    return [vocab[token] for token in tokenize(text)]print(encode(vocab, "Once upon a time there were four little Rabbits"))

[33, 373, 46, 354, 346, 386, 155, 210, 38]

如上示例，encode返回了每个token的编号。

Decode: from ID to token

而Decode的过程，刚好相反，是从编号还原到原始文本的过程。

def decode(vocab, ids):    vocab_inverse = {id:token for token,id in vocab.items()}    text= " ".join([vocab_inverse[id] for id in ids])    return textprint(decode(vocab,[33, 373, 46, 354, 346, 386, 155, 210, 38]))

Once upon a time there were four little Rabbits

如上所示，我们成功根据ID还原了原始文本。

Tokenizer: vocab, encode, decode

class SimpleTokenizerV1:    def __init__(self, vocab):        self.vocab = vocab        self.vocab_inverse = {id:token for token,id in vocab.items()}    def encode(self, text):        return [self.vocab[token] for token in tokenize(text)]    def decode(self, ids):        return " ".join([self.vocab_inverse[id] for id in ids])tokenizer = SimpleTokenizerV1(vocab)print(tokenizer.decode(tokenizer.encode("Once upon a time there were four little Rabbits")))

Once upon a time there were four little Rabbits

把上面的代码放在一起，可以验证先把文本encode，再decode还原。

注：有时候会不成功；因为此处故意使用了特别简单的分词；你可以想办法完善。

Special token: UNKnown/EndOfSentence

上面的字典vocab明显太小，如果遇到新的单词，就会报错，如下：

print(tokenizer.decode(tokenizer.encode("Once upon a time there were four little Rabbits, and they were all very happy.")))

KeyError Traceback (most recent call last)
Cell In[24], line 1
----> 1 print(tokenizer.decode(tokenizer.encode("Once upon a time there were four little Rabbits, and they were all very happy.")))
Cell In[15], line 7, in SimpleTokenizerV1.encode(self, text)
6 def encode(self, text):
----> 7 return [self.vocab[token] for token in tokenize(text)]
Cell In[15], line 7, in (.0)
6 def encode(self, text):
----> 7 return [self.vocab[token] for token in tokenize(text)]
KeyError: 'they'

回想幼儿园的学生，遇到不认识的字，会画个圆圈。同理，我们可以为vocab添加未知token。

vocab['<unk>'] = len(vocab)print(list(vocab.items())[-5:])

[('wriggled', 401), ('you', 402), ('young', 403), ('your', 404), ('', 405)]

如上，我们在字段最后加上了，以代表所有未知单词。

改进下上述代码，再次运行。

class SimpleTokenizerV2:    def __init__(self, vocab):        self.vocab = vocab        self.vocab_inverse = {id:token for token,id in vocab.items()}    def encode(self, text):        unk_id = self.vocab.get("<unk>")        return [self.vocab.get(token,unk_id) for token in tokenize(text)]    def decode(self, ids):        return " ".join([self.vocab_inverse[id] for id in ids])tokenizer = SimpleTokenizerV2(vocab)print(tokenizer.decode(tokenizer.encode("Once upon a time there were four little Rabbits, and they were all very happy.")))

Once upon a time there were four little Rabbits , and were all very .

可见至少没有报错了。当然，这依然不完美，因为我们无法完全还原最初的文本，所有不认识的token都成了unknown，当然会造成信息的损失。

BytePair Encoding: break words into chunks/subwords

现在回想幼儿园学生学习新单词的另一个方法：拆分。

以现在大模型最常用的tokenizer为例，一个单词可能会被拆分成如下4个token：

import tiktokentokenizer = tiktoken.get_encoding("gpt2")print(tokenizer.encode("unbelievability"))

[403, 6667, 11203, 1799]

print(tokenizer.decode([403,12,6667,12,11203,12,1799]))

un-bel-iev-ability

tiktoken是OpenAI发布的高性能分词器，上述过程也可以在线可视化看到：

可以查看下gpt2使用的词表大小是50257。

print("vocab size of gpt2: ",tokenizer.n_vocab)

vocab size of gpt2: 50257

通过这种拆分的方式，gpt2可以应对任何未知单词，因为最坏情况下可以拆成26个字母和标点。而 Byte Pair的过程，就是反复合并频率最高的token对，构建出固定大小的词表。

注：BPE的原理和详细过程这里不赘述。可以从上面直观理解，unbelievability之所以拆分出un和ability，显然是因为他们是英文里面常见的前缀和后缀。

词表一点都不神秘，跟小学生用的字典没有本质上的区别，请看gpt2的vocab，刚好有50257个，最后一个是"<|endoftext|>": 50256

BPE 就像把一个完整的词语打碎成字符拼图，再根据频率统计一步步把常见组合粘回去，形成一个适合机器处理的 token 库。而合并的过程，需要根据merges的指引，直到不能再合并，最终每个 token 必须存在于 vocab.json 中才能作为模型输入。

在本文中，我们会始终使用该分词器：tokenizer = tiktoken.get_encoding("gpt2")。

Data Sampling with Sliding Window

用gpt2分词器把上述短文转成token IDs，如下：

with open("Peter_Rabbit.txt", "r", encoding="utf-8") as f:    raw_text = f.read()enc_text = tokenizer.encode(raw_text)print("tokens: ",len(enc_text))print("first 15 token IDs: ", enc_text[:15])print("first 15 tokens: ","|".join(tokenizer.decode([token]) for token in enc_text[:15]))

tokens: 1547
first 15 token IDs: [7454, 2402, 257, 640, 612, 547, 1440, 1310, 22502, 896, 11, 290, 511, 3891, 198]
first 15 tokens: Once| upon| a| time| there| were| four| little| Rabb|its|,| and| their| names|

从此我们的关注点都在token ID上，而不在关注原始的token；也就是不再看原文、只记单词编号。

大模型在训练和推理过程中，本质上就是一个自回归（autoregressive）过程。即我们在训练大模型的过程中，需要依次把单词逐一输入，并把当前的预测输出，作为下次的输入。

而模型一次能够“看到”的最大上下文长度，叫context_size。在gpt2中是1024，表示模型支持的输入序列最多是1024个token。

假设context_size是5，那么这一过程如下：

Once --> upon
Once upon --> a
Once upon a --> time
Once upon a time --> there
Once upon a time there --> were

我们用token ID来表示：

context_size = 5for i in range(1,context_size+1):    context = enc_text[:i]    desired = enc_text[i]    print(context, "-->", desired)

[7454] --> 2402
[7454, 2402] --> 257
[7454, 2402, 257] --> 640
[7454, 2402, 257, 640] --> 612
[7454, 2402, 257, 640, 612] --> 547

而上述就是滑动窗口的过程，每次target相比input偏移1。

而在训练的过程中，我们需要把输入分batch，并进行shuffle，完整代码示例如下：

from torch.utils.data import Datasetimport torchclass GPTDatasetV1(Dataset):    def __init__(self, txt,tokenizer, context_size, stride):        token_ids = tokenizer.encode(txt)        assert len(token_ids) > context_size, "Text is too short"        self.input_ids = [torch.tensor(token_ids[i:i+context_size])                          for i in range(0, len(token_ids)-context_size, stride)]        self.target_ids = [torch.tensor(token_ids[i+1:i+context_size+1])                          for i in range(0, len(token_ids)-context_size, stride)]    def __len__(self):        return len(self.input_ids)    def __getitem__(self, idx):        return self.input_ids[idx], self.target_ids[idx]def dataloader_v1(txt,batch_size=3,context_size=5,stride=2,shuffle=False,drop_last=True,num_workers=0):    tokenizer = tiktoken.get_encoding("gpt2")    dataset = GPTDatasetV1(txt,tokenizer,context_size,stride)    return DataLoader(dataset, batch_size, shuffle=shuffle, drop_last=drop_last, num_workers=num_workers)

依然读取上述短文，通过dataloader和构建迭代器读取inputs和targets，均是token id，如下：

with open("Peter_Rabbit.txt", "r", encoding="utf-8") as f:    raw_text = f.read()dataloader = dataloader_v1(raw_text)data_iter = iter(dataloader)inputs, targets = next(data_iter)print("shape of input: ",inputs.shape)print("first batch, input: \n", inputs,"\n targets: \n", targets)

结果如下：

shape of inputs: torch.Size([3, 5])
first batch, input:
tensor([[ 7454, 2402, 257, 640, 612],
[ 257, 640, 612, 547, 1440],
[ 612, 547, 1440, 1310, 22502]])
targets:
tensor([[ 2402, 257, 640, 612, 547],
[ 640, 612, 547, 1440, 1310],
[ 547, 1440, 1310, 22502, 896]])

在上述示例中，batch=3，context_size=5，所以inputs和targets的维度都是[3, 5]，也就分为3个batch，每个batch里面最多5个token id；而targets相比inputs总是偏移1；stride=2表示每次采样的偏移是2，也就是inputs的第二行相比第一行偏移2。因此batch对应二维张量的行数，context size对应列数，stride对应行间偏移。而targets与inputs的偏移总是1，这是由上述大模型自回归训练的本质决定的。

Token Embedding: From Words to Vectors

Vectors are

high-dimensionaldenselearnable

Embedding is

looking up vectors from a big tableusually a matrix with shape (vocab_size, embed_dim)initialized with random valuesupdated during training

上面我们把单词转换成了离散的编号数字，其实已经很接近大模型所能理解的空间了。只不过前面是连续的整数，如0,1,2。而大模型所希望的空间是高纬度的、连续的浮点数张量。为什么呢？最重要的原因是embedding空间的张量是可学习的参数，因此需要是“可微”的，便于做微分计算。而可学习参数的另一个隐含意思是，初始值没有那么重要，模型训练的过程中会不断地调整优化参数，直到达到最终较为完美的状态。

注：训练过程是根据梯度指明的方向优化参数的过程，可结合Pytorch理解计算图和自动微分过程；此处不赘述。

此处我们简单示例下Embedding空间，假设字典的总单词数是10，而Embedding的维度是4，如下：

from torch import nnvocab_size = 10embed_dim = 4torch.manual_seed(123)token_embedding_layer = nn.Embedding(vocab_size, embed_dim)print("token_embedding_layer shape: ", token_embedding_layer.weight.shape)print("token_embedding_layer weight: ", token_embedding_layer.weight)

token_embedding_layer shape: torch.Size([10, 4])
token_embedding_layer weight: Parameter containing:
tensor([[ 0.3374, -0.1778, -0.3035, -0.5880],
[ 0.3486, 0.6603, -0.2196, -0.3792],
[ 0.7671, -1.1925, 0.6984, -1.4097],
[ 0.1794, 1.8951, 0.4954, 0.2692],
[-0.0770, -1.0205, -0.1690, 0.9178],
[ 1.5810, 1.3010, 1.2753, -0.2010],
[ 0.9624, 0.2492, -0.4845, -2.0929],
[-0.8199, -0.4210, -0.9620, 1.2825],
[-0.3430, -0.6821, -0.9887, -1.7018],
[-0.7498, -1.1285, 0.4135, 0.2892]], requires_grad=True)

可见，最终生成的Embedding空间的维度是(vocab_size, embed_dim)；而我们随机初始化了该空间，得到的是10行4列的二维张量。相当于构建了另一本“字典”，只不过这个字典后续是要被训练优化的。

而Embedding的过程，就是查询“新字典”的过程。

input_ids = torch.tensor([2,3,5])token_embeddings = token_embedding_layer(input_ids)print("token_embeddings: \n", token_embeddings) # return row 2,3,5 of weights

token_embeddings:
tensor([[ 0.7671, -1.1925, 0.6984, -1.4097],
[ 0.1794, 1.8951, 0.4954, 0.2692],
[ 1.5810, 1.3010, 1.2753, -0.2010]], grad_fn=)

如上，假设对token id分别为2、3、5的token做Embedding，那么返回的是上面二维张量的第2、3、5行（从0开始）。

上面仅是示例，在真实的GPT2中，使用的是(50257 tokens × 768 dimensions)。随机初始化如下：

from torch import nnvocab_size = 50527embed_dim = 768torch.manual_seed(123)token_embedding_layer_gpt2 = nn.Embedding(vocab_size, embed_dim)print("token_embedding_layer_gpt2 shape: ", token_embedding_layer_gpt2.weight.shape)print("token_embedding_layer_gpt2 weight: ", token_embedding_layer_gpt2.weight)

token_embedding_layer_gpt2 shape: torch.Size([50527, 768])
token_embedding_layer_gpt2 weight: Parameter containing:
tensor([[ 0.3374, -0.1778, -0.3035, ..., -0.3181, -1.3936, 0.5226],
[ 0.2579, 0.3420, -0.8168, ..., -0.4098, 0.4978, -0.3721],
[ 0.7957, 0.5350, 0.9427, ..., -1.0749, 0.0955, -1.4138],
...,
[-1.8239, 0.0192, 0.9472, ..., -0.2287, 1.0394, 0.1882],
[-0.8952, -1.3001, 1.4985, ..., -0.5879, -0.0340, -0.0092],
[-1.3114, -2.2304, -0.4247, ..., 0.8176, 1.3480, -0.5107]],
requires_grad=True)

可以把上面巨大的二维张量，想象成如下的表格：

Token ID     |     Embedding vector (768 dims)-----------------------------------------------0            |    [0.12, -0.03, ...,  0.88]1            |    [0.54,  0.21, ..., -0.77]...          |    ...50526        |    [...]

同样的，Embedding的过程依然是查询这个巨大的表格的过程。

input_ids = torch.tensor([2,3,5])print(token_embedding_layer_gpt2(input_ids))

tensor([[ 0.7957, 0.5350, 0.9427, ..., -1.0749, 0.0955, -1.4138],
[-0.0312, 1.6913, -2.2380, ..., 0.2379, -1.1839, -0.3179],
[-0.4334, -0.5095, -0.7118, ..., 0.8329, 0.2992, 0.2496]],
grad_fn=)

如上示例，对token id=2进行Embedding，取的是表格的第2行。

综上可知，Embedding的过程，就是把token转换为tensor的过程，就是把一维离散的token id映射到高纬、连续的稠密空间的过程。而这个过程，就是平淡无奇的lookup操作。

Position Embedding: From Position to Vectors

position embeddin is

a matrix with shape (context_size, embed_dim)initialized with random valuesa learnable parameter, updated during training

前面讲述了token的嵌入过程，其实就是在大表中根据token id查询的过程。然而，我们需要注意到，在上面表格里的不同行之间，似乎是无关的。

如“You eat fish”和“Fish eat you”在token embeding的层面是相似的（Transformer本身是顺序无感的），但表达的语义却是截然不同的。因此，需要引入位置信息，从0开始进行位置编号。

如下例子，第0的position是you还是fish，一目了然：

"You eat fish"   ↓        ↓       ↓[you] + P0 [eat] + P1 [fish] + P2"Fish eat you"   ↓         ↓         ↓[fish] + P0 [eat] + P1 [you] + P2→ 即使 Token 一样，只要位置不同，最终向量就不同。→ Transformer 能区分主语、宾语等结构含义。

显然，位置编码应该从0开始，直到context_size-1。

回想如前所述，使用离散整数会导致空间表达能力削弱、无法进行自动求导优化，因此同样地，我们依然需要把位置编号转化为高维稠密的张量。

如下，假设context_size是5，Embedding维度是4：

from torch import nncontext_size = 5embed_dim = 4torch.manual_seed(123)position_embedding_layer = nn.Embedding(context_size, embed_dim)print("position_embedding_layer shape: ", position_embedding_layer.weight.shape)print("position_embedding_layer weight: ", position_embedding_layer.weight)

position_embedding_layer shape: torch.Size([5, 4])
position_embedding_layer weight: Parameter containing:
tensor([[ 0.3374, -0.1778, -0.3035, -0.5880],
[ 1.5810, 1.3010, 1.2753, -0.2010],
[-0.1606, -0.4015, 0.6957, -1.8061],
[-1.1589, 0.3255, -0.6315, -2.8400],
[-0.7849, -1.4096, -0.4076, 0.7953]], requires_grad=True)

我们就会得到5*4的二维张量。

请注意位置张量的维度是(context_size, embed_dim)，也就是说跟token张量的列数是一样的，但是行数不一样。

Position Tensor本质上是另一本“位置字典”，embedding的过程是根据位置编号查询的过程。

如下示例，

input_ids = torch.tensor([2,3,5])# use Position of input_ids, NOT values of itposition_embeddings = position_embedding_layer(torch.arange(len(input_ids)))print("position_embeddings: \n", position_embeddings) # return row 0,1,2 of weights

position_embeddings:
tensor([[ 0.3374, -0.1778, -0.3035, -0.5880],
[ 1.5810, 1.3010, 1.2753, -0.2010],
[-0.1606, -0.4015, 0.6957, -1.8061]], grad_fn=)

返回的是上述Embedding的前3行，因为输入是3个token。请特别注意，做position Embedding的时候，使用的是token id的位置，而不是token id的值。

注：此处PE使用的是Learnable Absolute Positional Embeddings，除此之外，还有不可训练的、固定位置编码，如Sinusoidal PE（固定位置、使用sin/cos）、RoPE（相对旋转）等，性能更强。不过从初学角度上，Learnable PE是最简单、最容易理解的位置编码思想。

Input Embedding: token_embedding + position_embedding

综上，有了token embeding和pos embedding之后，只需要简单相加，就可以得到最终的input embedding。

input_embeddings = token_embeddings + position_embeddingsprint("shape of input_embeddings : ",input_embeddings.shape)print("input_embeddings: ", input_embeddings)

shape of input_embeddings :  torch.Size([3, 4])input_embeddings:  tensor([[ 1.1045, -1.3703,  0.3948, -1.9977],        [ 1.7603,  3.1962,  1.7707,  0.0682],        [ 1.4204,  0.8996,  1.9710, -2.0070]], grad_fn=<AddBackward0>)

你可以手动计算验证，0.7671+0.3374=1.1045

GPT2使用的位置编号是(1024 positions × 768 dimensions)，1024代表模型最大能处理的输入长度，768是与token Embedding维度相同，选择了一个相对高纬稠密的空间。

我们通过如下代码复现：

from torch import nncontext_size = 1024embed_dim = 768torch.manual_seed(123)position_embedding_layer_gpt2 = nn.Embedding(context_size, embed_dim)print("position_embedding_layer_gpt2 shape: ", position_embedding_layer_gpt2.weight.shape)print("position_embedding_layer_gpt2 weight: ", position_embedding_layer_gpt2.weight)

position_embedding_layer_gpt2 shape: torch.Size([1024, 768])
position_embedding_layer_gpt2 weight: Parameter containing:
tensor([[ 0.3374, -0.1778, -0.3035, ..., -0.3181, -1.3936, 0.5226],
[ 0.2579, 0.3420, -0.8168, ..., -0.4098, 0.4978, -0.3721],
[ 0.7957, 0.5350, 0.9427, ..., -1.0749, 0.0955, -1.4138],
...,
[-1.2094, 0.6397, 0.6342, ..., -0.4582, 1.4911, 1.2406],
[-0.2253, -0.1078, 0.0479, ..., 0.2521, -0.2893, -0.5639],
[-0.5375, -1.1562, 2.2554, ..., 1.4322, 1.2488, 0.1897]],
requires_grad=True)

我们依然用上面短文为例：

with open("Peter_Rabbit.txt", "r", encoding="utf-8") as f:    raw_text = f.read()dataloader = dataloader_v1(raw_text,batch_size=3, context_size=1024,stride=2)data_iter = iter(dataloader)inputs, targets = next(data_iter)print("shape of input: ",inputs.shape)print("first batch, input: \n", inputs,"\n targets: \n", targets)

shape of input: torch.Size([3, 1024])
first batch, input:
tensor([[ 7454, 2402, 257, ..., 480, 517, 290],
[ 257, 640, 612, ..., 290, 517, 36907],
[ 612, 547, 1440, ..., 36907, 13, 1763]])
targets:
tensor([[ 2402, 257, 640, ..., 517, 290, 517],
[ 640, 612, 547, ..., 517, 36907, 13],
[ 547, 1440, 1310, ..., 13, 1763, 1473]])

输入input的维度是：3批 * 1024的context长度。

我们可以对batch0的input进行Embedding，如下：

token_embeddings = token_embedding_layer_gpt2(inputs)print("shape of token_embeddings: ",token_embeddings.shape)position_embeddings = position_embedding_layer_gpt2(torch.arange(context_size))print("shape of position_embeddings: ",position_embeddings.shape)# token_embeddings shape: [batch_size, seq_len, embedding_dim]# position_embeddings shape: [seq_len, embedding_dim]# PyTorch automatically broadcasts position_embeddings across batch dimensioninput_embeddings = token_embeddings + position_embeddingsprint("shape of input_embeddings : ",input_embeddings.shape)

shape of token_embeddings: torch.Size([3, 1024, 768])
shape of position_embeddings: torch.Size([1024, 768])
shape of input_embeddings : torch.Size([3, 1024, 768])

请特别注意tensor shape的变化；其中position embedding是没有batch维度的，但有赖于pytorch的广播机制，依然可以自动地相加。

注：tensor是大模型的“语言”，所有大模型的操作本质上都是对tensor的操作。建议熟悉tensor常用的概念与操作，如broadcast/view/reshape/squeeze/transpose/einsum/mul/matmul/dot等。Tensor是描述集合与高维空间最简单优雅的语言。初学者不要把tensor想象成复杂的数学难题，可以当成简单的“外语”，只是一种简洁的约定和表达手段。

至此，我们得到了大模型真正能够处理的输入空间，即inputs embedding，包含了token embeding的token相关信息，以及token位置的position embedding信息。