The Architecture of Mistral’s Sparse Mixture of Experts (SMoE)

Last Updated on June 4, 2024 by Editorial Team
Author(s): JAIGANESAN
Originally published on Towards AI.

Exploring Feed Forward Networks, the Gating Mechanism, Mixture of Experts (MoE), and Sparse Mixture of Experts (SMoE).

Photo by Ticka Kao on Unsplash

Introduction

In this article, we’ll dive deeper into the specifics of Mistral’s SMoE (Sparse Mixture of Experts) [2] architecture. If you’re new to this topic or need a refresher, I recommend checking out my previous articles, “LLM: In and Out” and “Breaking Down Mistral 7B”, which cover the basics of LLMs and the Mistral 7B architecture.

Large Language Model (LLM): In and Out — Delving into the architecture of LLMs: unraveling the mechanics behind large language models like GPT, LLaMA, etc. (pub.towardsai.net)

Breaking down Mistral 7B ⚡ — Exploring Mistral’s rotary positional embedding, sliding window attention, KV cache with rolling buffer, and more. (pub.towardsai.net)

If you’re already familiar with the inner workings of Mistral 7B [1], you can continue with this article. Before we explore the topic further, we first need to understand the Mistral 8x7B (MoE) architecture. According to the research paper, the main distinction between Mistral 7B and Mistral 8x7B (SMoE) lies in the addition of the Mixture of Experts layer in Mistral 8x7B. Let’s look closer at the Mistral 8x7B architecture.

This architecture has a model dimension of 4096 and is made up of 32 layers. The model uses 32 attention heads and 8 key-value (KV) heads, and each head has a dimension of 128. The feed-forward network (FFN) has a hidden dimension of 14336. The context length is 32768 and the vocabulary size is 32000. The number of experts in the MoE layer is 8, and the top 2 experts are selected for each token.

Image 1: Mistral 8x7B architecture. Created by author.

If you have any doubt about where this architecture comes from, I recommend you check out the Mistral inference GitHub code and the research paper.

GitHub – mistralai/mistral-inference: Official inference library for Mistral models (github.com)

In this article, we’ll explore three key concepts that make up Mistral’s MoE architecture. Here’s what I am going to explore:

✨ Feed Forward Network (FFN)
✨ Mixture of Experts (MoE)
✨ Mistral’s Sparse Mixture of Experts (SMoE)

1. Feed Forward Network (SwiGLU FFN)

Note: If you’ve already read my previous article on Mistral 7B, feel free to skip this section and jump straight to the Mixture of Experts part.

Before we dive into the Mixture of Experts (MoE), we should understand the experts. Yes, what you read is right: the experts in MoE are nothing but feed-forward networks. So let’s dive into the feed-forward network (FFN) of Mistral.

Image 2: FFN in Mistral. Created by author. The numbers in the image are the input and output dimensions of the respective neurons and the input layer.

Why do we need the FFN at all? Attention gives the relationships between words (semantic and syntactic relationships, and how one word is related to the others), while the FFN learns to represent the words: it learns to write the sequence of words from the attention information, producing the next word from the context (the autoregressive property). The FFN (a neural network) is backed by the Universal Approximation Theorem (UAT), which informally says that a feed-forward network with enough hidden units can approximate essentially any function. In the same way, the FFN learns to represent the language and to write the words from the attention information.
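Before we walk through Mistral’s FeedForward code, here is a minimal sketch of the hyperparameters listed in the architecture overview, collected into a ModelArgs-style container. This is my own illustration: the field names mirror the style of the mistral-inference configuration but are assumptions, not the library’s actual class; only the values come from the architecture description above. The FeedForward class in the next section reads args.dim and args.hidden_dim from an object like this.

from dataclasses import dataclass

@dataclass
class ModelArgs:
    dim: int = 4096               # model (embedding) dimension
    n_layers: int = 32            # number of transformer layers
    n_heads: int = 32             # attention heads
    n_kv_heads: int = 8           # key-value heads
    head_dim: int = 128           # dimension per head
    hidden_dim: int = 14336       # FFN hidden dimension
    max_seq_len: int = 32768      # context length
    vocab_size: int = 32000       # vocabulary size
    num_experts: int = 8          # experts per MoE layer
    num_experts_per_tok: int = 2  # top-k experts routed per token

Keeping these values together also makes one relationship easy to see: the FFN hidden dimension (14336) is 3.5 times the model dimension (4096).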
The FFN takes as its input the output of the attention layer, which has undergone RMS normalization. For example, let’s take the input token size as 9; it has gone through the attention layer and the normalization layer. This FFN input matrix has a shape of (9, 4096), where 9 represents the number of tokens, or sequence length (i.e., the number of vectors), and 4096 is the model dimension. Before we dive into the illustration of a feed-forward network, let’s take a closer look at the code of the FeedForward network from Mistral 8x7B.

import torch
import torch.nn as nn

class FeedForward(nn.Module):
    def __init__(self, args: ModelArgs):
        super().__init__()
        # Image 3 represents the PyTorch linear layer operation.
        self.w1 = nn.Linear(args.dim, args.hidden_dim, bias=False)
        self.w2 = nn.Linear(args.hidden_dim, args.dim, bias=False)
        self.w3 = nn.Linear(args.dim, args.hidden_dim, bias=False)

    def forward(self, x) -> torch.Tensor:
        # SwiGLU = nn.functional.silu(self.w1(x)) * self.w3(x)
        # self.w3(x) acts as a gating mechanism.
        # Swish activation (beta = 1) = nn.functional.silu(self.w1(x))
        # The element-wise product introduces non-linearity and preserves
        # the high magnitude of the vector.
        return self.w2(nn.functional.silu(self.w1(x)) * self.w3(x))

Image 3: Linear transformation. Source: pytorch.org. x is the input vector, A is the weight matrix, and b is the bias vector.

This code shows that there are two hidden layer operations, not two stacked fully connected hidden layers (this is the SwiGLU mechanism). Let’s break it down: the input matrix is fed into two hidden layers, self.w1 and self.w3, and their outputs are then multiplied point-wise. The output of self.w1 goes through SiLU activation, but self.w3 does not. As we know, Mistral uses the SwiGLU activation function (which, as you can see from the code, consists of two hidden layer operations):

SwiGLU(x) = SiLU(W_1 · x) ⊙ (W_3 · x)

Why do they use two hidden layer operations (SwiGLU)? From my understanding, the second hidden layer operation is there to maintain the magnitude of the vector, because SiLU reduces the magnitude of the vectors it acts on: when the sigmoid output is multiplied by the input x, the magnitude decreases unless the sigmoid output is 1. That is why Mistral uses another hidden layer in the FFN and multiplies its output with the SiLU-activated hidden layer, keeping the magnitude of the vector that is passed to the next layer high; otherwise, it could lead to a vanishing gradient problem. This operation is called SwiGLU activation: it applies non-linearity element-wise and preserves the magnitude. There are two hidden layer weight matrices of size (14336, 4096), as […]
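To make the shapes concrete, here is a small sketch of my own (not from the article or the mistral-inference repository) that instantiates the FeedForward module above using the illustrative ModelArgs sketch from earlier and pushes a dummy 9-token input through it. It confirms that w1 and w3 each store a (14336, 4096) weight matrix and that the output comes back in the model dimension.

ffn = FeedForward(ModelArgs())   # uses the illustrative ModelArgs sketch above
x = torch.randn(9, 4096)         # 9 tokens, each a 4096-dimensional vector

out = ffn(x)

print(ffn.w1.weight.shape)       # torch.Size([14336, 4096]) -- up-projection
print(ffn.w3.weight.shape)       # torch.Size([14336, 4096]) -- gating projection
print(ffn.w2.weight.shape)       # torch.Size([4096, 14336]) -- down-projection
print(out.shape)                 # torch.Size([9, 4096])     -- back to the model dimension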
