Building My Tiny Transformer Model with Trae

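I am taking part in the Trae "Super Experience Officer" creative practice essay contest. Free download link for the Trae used in this article: www.trae.com.cn/?utm_source…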

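Introduction

After reading the paper "Attention Is All You Need", I learned about the Transformer: the first sequence transduction model built entirely on attention, dropping recurrence and convolution. As a complete beginner in machine translation, implementing a simple version of the Transformer felt like a must!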

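Building the model

Download Trae and put the AI to the test: let's smoothly build an EncoderLayer, one of the Transformer's building blocks!

I gave the AI a simple prompt:

attention is all you need

The AI quickly connected this to the paper.

Next, I told it I wanted to implement some simple code based on the paper.

A simplified version of the code appeared right away!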

import torch
import torch.nn as nn
import torch.nn.functional as F


class ScaledDotProductAttention(nn.Module):
    def __init__(self):
        super(ScaledDotProductAttention, self).__init__()

    def forward(self, query, key, value, mask=None):
        d_k = query.size(-1)
        scores = torch.matmul(query, key.transpose(-2, -1)) / torch.sqrt(torch.tensor(d_k, dtype=torch.float32))

        if mask is not None:
            scores = scores.masked_fill(mask == 0, -1e9)

        attn_weights = F.softmax(scores, dim=-1)
        output = torch.matmul(attn_weights, value)
        return output, attn_weights


class MultiHeadAttention(nn.Module):
    def __init__(self, num_heads, d_model):
        super(MultiHeadAttention, self).__init__()
        assert d_model % num_heads == 0, "d_model must be divisible by num_heads"

        self.d_k = d_model // num_heads
        self.num_heads = num_heads
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)

    def split_heads(self, x):
        batch_size, seq_length, d_model = x.size()
        return x.view(batch_size, seq_length, self.num_heads, self.d_k).transpose(1, 2)

    def combine_heads(self, x):
        batch_size, num_heads, seq_length, d_k = x.size()
        return x.transpose(1, 2).contiguous().view(batch_size, seq_length, -1)

    def forward(self, query, key, value, mask=None):
        Q = self.split_heads(self.W_q(query))
        K = self.split_heads(self.W_k(key))
        V = self.split_heads(self.W_v(value))

        output, attn_weights = ScaledDotProductAttention()(Q, K, V, mask)
        output = self.combine_heads(output)
        output = self.W_o(output)

        return output, attn_weights


class PositionwiseFeedForward(nn.Module):
    def __init__(self, d_model, d_ff):
        super(PositionwiseFeedForward, self).__init__()
        self.fc1 = nn.Linear(d_model, d_ff)
        self.fc2 = nn.Linear(d_ff, d_model)
        self.relu = nn.ReLU()

    def forward(self, x):
        return self.fc2(self.relu(self.fc1(x)))


class EncoderLayer(nn.Module):
    def __init__(self, num_heads, d_model, d_ff, dropout):
        super(EncoderLayer, self).__init__()
        self.self_attn = MultiHeadAttention(num_heads, d_model)
        self.feed_forward = PositionwiseFeedForward(d_model, d_ff)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, mask):
        attn_output, _ = self.self_attn(x, x, x, mask)
        x = self.norm1(x + self.dropout(attn_output))

        ff_output = self.feed_forward(x)
        x = self.norm2(x + self.dropout(ff_output))

        return x


# Example usage
if __name__ == "__main__":
    num_heads = 8
    d_model = 512
    d_ff = 2048
    dropout = 0.1
    seq_length = 10
    batch_size = 32

    encoder_layer = EncoderLayer(num_heads, d_model, d_ff, dropout)
    input_tensor = torch.randn(batch_size, seq_length, d_model)
    mask = torch.ones(batch_size, 1, seq_length)

    output = encoder_layer(input_tensor, mask)
    print("Output shape:", output.shape)

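Trae environment setup

With the simple code in hand, the next step is debugging. For setting up the environment itself, see my earlier article; I won't go into detail here. To use Trae, you need to link your virtual environment to it, which is easy!

The environment setting is in the bottom-right corner.

From there, load the virtual environment you have already configured.

Once that is done, Trae works with your existing virtual environment!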
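To confirm that the interpreter selected in the editor really is the intended environment, a short check can be run from a scratch file. This is a generic Python sketch, not a Trae feature; `describe_environment` is a hypothetical helper name, and the `in_venv` test assumes a venv or virtualenv style environment.

```python
import importlib.util
import sys


def describe_environment():
    """Collect basic facts about the interpreter running this script."""
    return {
        "executable": sys.executable,
        # True for venv/virtualenv environments; conda environments
        # usually report False here, so check the executable path instead.
        "in_venv": sys.prefix != sys.base_prefix,
        # find_spec returns None when the package is not installed.
        "torch_installed": importlib.util.find_spec("torch") is not None,
    }


for name, value in describe_environment().items():
    print(f"{name}: {value}")
```

If `torch_installed` prints False, the code in this article will fail at `import torch`, which usually means a different interpreter is selected.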

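Debugging

Create a file, paste the code in, and run it.

An error!!! Don't panic, just ask the AI for help!

The error message shows that a function received mismatched inputs: in the multi-head attention implementation, the dimensions of the mask do not match those of the attention scores. The AI quickly modified the code and solved the problem.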
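The mismatch can be traced on paper without running the model. Assuming the listing above was run unchanged, the attention scores have shape (batch_size, num_heads, seq_length, seq_length) = (32, 8, 10, 10), while the mask has shape (32, 1, 10). The sketch below applies the right-aligned broadcasting rule to both shapes. `broadcast_shape` is a hypothetical helper written for this illustration, not a PyTorch function, and the reshaped mask (32, 1, 1, 10) is one possible fix, not necessarily the exact change the AI made.

```python
def broadcast_shape(a, b):
    """Return the broadcast result of two shapes, or raise ValueError.

    Hypothetical helper: mirrors the right-aligned broadcasting rule
    used by PyTorch and NumPy (sizes must be equal, or one of them 1).
    """
    result = []
    # Walk both shapes from the last dimension backwards.
    for i in range(1, max(len(a), len(b)) + 1):
        dim_a = a[-i] if i <= len(a) else 1
        dim_b = b[-i] if i <= len(b) else 1
        if dim_a != dim_b and dim_a != 1 and dim_b != 1:
            raise ValueError(f"cannot broadcast {a} with {b}: {dim_a} vs {dim_b}")
        result.append(max(dim_a, dim_b))
    return tuple(reversed(result))


scores_shape = (32, 8, 10, 10)  # (batch_size, num_heads, seq_length, seq_length)

# Mask from the example: its batch dimension (32) lines up with num_heads (8).
try:
    broadcast_shape(scores_shape, (32, 1, 10))
    original_mask_ok = True
except ValueError as err:
    original_mask_ok = False
    print("original mask:", err)

# One possible fix: add a singleton head dimension, e.g. mask.unsqueeze(1).
fixed_shape = broadcast_shape(scores_shape, (32, 1, 1, 10))
print("fixed mask broadcasts to:", fixed_shape)
```

With the extra singleton dimension, the same mask row is shared by all heads and all query positions, which matches a padding-style mask.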

Result

After the rounds of debugging described above, the code finally ran and produced its output.

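Analysis

The AI did produce a simple Transformer model very quickly, but do its internal functions really match the Transformer design? I asked the AI to analyze how each function is built.

ScaledDotProductAttention class

Reads d_k, the dimension of the query vectors. Computes the dot product of the query and key vectors and divides it by sqrt(d_k) for scaling. If a mask is provided, the scores at positions where the mask is 0 are set to a very large negative value (-1e9 in the code, standing in for negative infinity), so that after softmax the weights at those positions are close to 0. Applies softmax to the scores to get the attention weights. Multiplies the attention weights by the value vectors to get the output.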
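The steps above can be followed by hand on a tiny example. The sketch below re-implements them for a single head in plain Python, with no batching and no PyTorch. It is an illustration of the computation, not the code the AI generated, and the input values are made up.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(query, key, value, mask=None):
    """Attention for one head; query/key/value are lists of row vectors."""
    d_k = len(query[0])
    weights = []
    for i, q in enumerate(query):
        # Dot product of one query with every key, scaled by sqrt(d_k).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in key]
        if mask is not None:
            # Same convention as the listing: mask value 0 means "hide".
            scores = [s if m != 0 else -1e9 for s, m in zip(scores, mask[i])]
        weights.append(softmax(scores))
    # Each output row is the weighted sum of the value vectors.
    output = [
        [sum(w * v[j] for w, v in zip(row, value)) for j in range(len(value[0]))]
        for row in weights
    ]
    return output, weights


q = k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
mask = [[1, 1, 0]] * 3  # hide the third position from every query
out, attn = scaled_dot_product_attention(q, k, v, mask)
print([round(w, 4) for w in attn[0]])
```

Each row of `attn` sums to 1, and the weight on the masked third position is effectively 0, so the third value vector does not contribute to any output row.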

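MultiHeadAttention class

__init__

Checks that d_model is divisible by num_heads, sets d_k = d_model // num_heads, and creates four linear layers of size d_model by d_model: W_q, W_k, W_v and W_o.

forward

Generates the query, key and value vectors through the linear layers and splits them into multiple heads. If a mask is provided, expands its dimensions to match the multi-head attention shape (a step added during debugging, so it does not appear in the first listing). Calls the ScaledDotProductAttention class to compute the attention output. Merges the outputs of the heads and passes them through the linear layer W_o to get the final output.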
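Splitting and merging heads is only a rearrangement of the feature dimension. The sketch below mirrors what `split_heads` and `combine_heads` do, using nested lists for a single sequence. The batch dimension is left out for brevity, and the function bodies are a plain-Python illustration, not the `view`/`transpose` code from the listing.

```python
def split_heads(x, num_heads):
    """Split (seq_length, d_model) rows into (num_heads, seq_length, d_k)."""
    d_model = len(x[0])
    assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
    d_k = d_model // num_heads
    # Head h receives columns [h * d_k, (h + 1) * d_k) of every position.
    return [[row[h * d_k:(h + 1) * d_k] for row in x] for h in range(num_heads)]


def combine_heads(heads):
    """Inverse of split_heads: concatenate the heads' slices per position."""
    seq_length = len(heads[0])
    return [
        [value for head in heads for value in head[t]]
        for t in range(seq_length)
    ]


# One sequence with seq_length = 2 and d_model = 4, split into 2 heads.
x = [[1, 2, 3, 4],
     [5, 6, 7, 8]]
heads = split_heads(x, num_heads=2)
print(heads)                 # [[[1, 2], [5, 6]], [[3, 4], [7, 8]]]
print(combine_heads(heads))  # [[1, 2, 3, 4], [5, 6, 7, 8]]
```

With the article's settings (d_model = 512, num_heads = 8), each head works on d_k = 64 features per position.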

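PositionwiseFeedForward class

__init__

Creates two linear layers, fc1 (d_model to d_ff) and fc2 (d_ff back to d_model), and a ReLU activation. The forward method applies fc1, ReLU and fc2 in that order to every position.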

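EncoderLayer class

__init__

Creates the multi-head self-attention layer, the position-wise feed-forward network, two LayerNorm layers (norm1 and norm2) and a dropout layer.

forward

Calls the multi-head attention layer to compute self-attention and get the attention output. Adds the attention output (after dropout) to the input and passes the sum through the layer normalization layer norm1. Calls the position-wise feed-forward layer to get the feed-forward output. Adds the feed-forward output (after dropout) to the output of the previous step and passes the sum through the layer normalization layer norm2.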
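The "add, then normalize" pattern used twice in this layer can be checked on a single feature vector. The sketch below is a plain-Python illustration under simplifying assumptions: dropout is skipped (as in evaluation mode), and the learnable scale and shift of LayerNorm are left at their initial values of 1 and 0. The input numbers are made up.

```python
import math


def layer_norm(vector, eps=1e-5):
    """Normalize one feature vector to zero mean and unit variance.

    Uses the biased variance, like nn.LayerNorm; the learnable scale and
    shift are left at their initial values (1 and 0), so they are omitted.
    """
    mean = sum(vector) / len(vector)
    variance = sum((v - mean) ** 2 for v in vector) / len(vector)
    return [(v - mean) / math.sqrt(variance + eps) for v in vector]


def add_and_norm(x, sublayer_output):
    """Residual connection followed by layer normalization (post-norm)."""
    return layer_norm([a + b for a, b in zip(x, sublayer_output)])


x = [1.0, 2.0, 3.0, 4.0]              # one position's input features
attn_output = [0.5, -0.5, 0.5, -0.5]  # stand-in for the attention output
normed = add_and_norm(x, attn_output)
print([round(v, 3) for v in normed])
```

Normalizing after the residual addition is the post-norm arrangement, which is the order the listing's `forward` method uses.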

Conclusion

With Trae I built a simple Transformer encoder layer quickly. The efficiency is impressive!
