Ahead of AI
Model Merging, Mixtures of Experts, and Towards Smaller LLMs

Large language models developed rapidly in 2023, and the open-source and research advances of 2024 are bringing further progress. This article covers several recent papers on techniques such as model merging, weight averaging, and proxy-tuning that improve model performance while reducing costs.

Model merging and weight averaging are currently among the most prominent methods. They combine multiple models or checkpoints into a single one, which can enhance training convergence, improve overall performance, and increase robustness; examples include stochastic weight averaging (SWA) and exponential moving averages (EMA).

The Model Ratatouille method reuses multiple finetuned iterations of the same base model across various auxiliary tasks for model merging. The approach is particularly relevant for LLMs, and the same idea can also be applied to LoRA adapters.

The WARM paper aims to strengthen the RLHF alignment step for LLMs by weight-averaging finetuned reward models, mitigating reward hacking; in experiments, the merged reward model achieved a 79.4% win rate.

Proxy-tuning is a promising technique for improving large language models. It works at the decoding stage by adjusting the target LLM's logits: the logit difference between a small base model and its finetuned counterpart is computed and added to the target model's logits.

2023 was the year in which the potential and complexity of Large Language Models (LLMs) grew rapidly. Looking at the open-source and research advancements in 2024, it seems we are entering a welcome phase of making models better (and smaller) without increasing their size.

In this month's article, I am highlighting four recent papers consistent with this theme:

1. Weight averaging and model merging allow us to combine multiple LLMs into a single, better one without the typical drawbacks of traditional ensembles, such as increased resource requirements.

2. Proxy-tuning boosts the performance of an existing large LLM by using two small LLMs, without changing the large model's weights.

3. Creating mixture-of-experts models by combining multiple smaller modules results in LLMs that compete with and often surpass their larger counterparts in efficiency and effectiveness.

4. Pretraining a small 1.1B parameter LLM reduces development and operational costs and opens new avenues for educational and research applications.


1. WARM: On the Benefits of Weight Averaged Reward Models

In the paper WARM: On the Benefits of Weight Averaged Reward Models (Jan 22), researchers propose a weight averaging approach for LLM reward models. ("Reward models" refer to those used in reinforcement learning with human feedback, RLHF, for alignment.)

What is weight averaging? Since weight averaging and model merging for LLMs seem to be the most interesting themes in 2024, I want to briefly introduce this topic before diving further into the WARM paper.

Understanding Model Merging and Weight Averaging

Model merging and weight averaging, while not new, are currently the most prominent methods, dominating Open LLM leaderboards. Let's briefly discuss these two concepts. (I may write a more detailed article some time in the future.)

Both weight averaging and model merging involve combining multiple models or checkpoints into a single entity. What are the advantages? Similar to the concept of creating model ensembles, this approach of combining multiple models into one can enhance training convergence, improve overall performance, and increase robustness. It's worth highlighting that unlike traditional ensemble methods, model merging and weight averaging result in a singular model rather than maintaining multiple separate models, as illustrated in the figure below.

Comparison between weight averaging and model merging (left) and traditional ensemble methods like majority voting (right).

Traditionally, weight averaging involves averaging a single model's weights (parameters) at different points in its training process. Typically, it's done towards the end of the training when the model has nearly converged. A common form of this technique is Stochastic Weight Averaging (SWA), where we decay an initially large learning rate, and weights are averaged over several iterations during periods of decayed (but still relatively high) learning rates.

Stochastic Weight Averaging (SWA) averages a model's weights towards the end of the training cycle.

Since a model's training trajectory can be uneven, the strategy is to average the models towards the end of the training when the learning rate is low (if a scheduler is used), as illustrated in the figure above, where the training is nearing convergence.
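
To make this more concrete, here is a minimal sketch using PyTorch's built-in SWA utilities. The tiny linear model, learning rates, and training schedule below are arbitrary placeholders, not the setup from any of the papers discussed here:

import torch
from torch.optim.swa_utils import AveragedModel, SWALR

model = torch.nn.Linear(10, 2)                  # stand-in for a real model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
swa_model = AveragedModel(model)                # maintains the running weight average
swa_scheduler = SWALR(optimizer, swa_lr=0.05)   # decayed, but still relatively high, learning rate
swa_start = 75                                  # start averaging towards the end of training

for epoch in range(100):
    # ... regular training steps for one epoch would go here ...
    if epoch >= swa_start:
        swa_model.update_parameters(model)      # fold the current weights into the average
        swa_scheduler.step()

At the end of training, swa_model holds the averaged weights that would be used for evaluation or deployment.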

Alternatively, the Exponential Moving Average (EMA) method computes a smoothed version of the weights by exponentially down-weighting the contributions of older parameter states.
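
As a minimal sketch (the decay value of 0.999 is just a commonly used placeholder), an EMA update applied after each optimizer step could look like this:

import torch

@torch.no_grad()
def ema_update(ema_model, model, decay=0.999):
    # new_ema = decay * old_ema + (1 - decay) * current_weights,
    # so older parameter states contribute exponentially less over time.
    for ema_p, p in zip(ema_model.parameters(), model.parameters()):
        ema_p.mul_(decay).add_(p, alpha=1 - decay)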

In 2022, Latest Weight Averaging (LaWA) demonstrated that averaging the weights of the latest k checkpoints, each taken at the end of an epoch, can expedite training progress in terms of loss and accuracy by several epochs. This was shown to be effective for ResNet vision models and RoBERTa language models.

Then, in 2023, Early Weight Averaging Meets High Learning Rates for LLM Pre-training explored a modified version of LaWA with higher learning rates and an earlier start in averaging checkpoints during training. The researchers found that this approach significantly outperformed standard SWA and EMA techniques.

The modified LaWA method from the Early Weight Averaging meets High Learning Rates for LLM Pre-training paper (https://arxiv.org/abs/2306.03241).

While weight averaging combines multiple checkpoints of the same model into a single model, model merging involves combining multiple different trained models into a single model. Each of these models may have been trained independently, possibly on different datasets or tasks. 

Model merging goes back a long way, but perhaps the most recent and influential paper relevant to LLMs is Model Ratatouille: Recycling Diverse Models for Out-of-Distribution Generalization (thanks to Alexandre Ramé for drawing my attention to this).

The idea behind Model Ratatouille is to reuse multiple fine-tuned iterations of the identical base model across various diverse auxiliary tasks, as illustrated in the figure below.

Model merging via Model Ratatouille compared to other finetuning strategies. (OOD = out of distribution / generalization); annotated figure from https://arxiv.org/abs/2212.10445

To provide a bit more detail, the Model Ratatouille method can be summarized as shown in the figure below.

The Model Ratatouille method for model merging; annotated figure from https://arxiv.org/abs/2212.10445
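
In rough code terms, my reading of the recipe looks as follows; the finetune and average_weights callables are hypothetical placeholders, not functions from the paper's codebase:

def model_ratatouille(base_model, auxiliary_tasks, target_task, finetune, average_weights):
    # finetune(model, task) returns a finetuned copy of `model`;
    # average_weights(models) linearly averages the weights of `models`.
    target_models = []
    for aux_task in auxiliary_tasks:
        aux_model = finetune(base_model, aux_task)               # recycle: finetune the base on an auxiliary task
        target_models.append(finetune(aux_model, target_task))   # then finetune each on the target task
    return average_weights(target_models)                        # finally, merge by weight averaging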

Note that this overall merging idea can also be applied to LoRA adapters, as shown in LoraHub: Efficient Cross-Task Generalization via Dynamic LoRA Composition.

I am planning to write more about model merging in the future. In the meantime, I can warmly recommend the model merging reading list by Omar Sanseviero.

Weight Averaged Reward Models

Having discussed the concepts of weight averaging and model merging, let's briefly go back to the newly released paper WARM: On the Benefits of Weight Averaged Reward Models, published on January 22nd.

This research primarily aims to enhance the RLHF alignment step for LLMs (RLHF is the method behind ChatGPT and Llama 2 Chat; for additional details on this process, please refer to my article LLM Training: RLHF and Its Alternatives). Specifically, the researchers attempt to mitigate reward hacking in LLMs by averaging the weights of finetuned reward models.

Reward hacking occurs when an LLM learns to manipulate or exploit its reward system's flaws to attain high scores or rewards, without genuinely fulfilling the intended task or achieving the essential objectives.

Weight averaging makes reward modeling more robust to reward hacking (annotated figure from WARM paper: https://arxiv.org/abs/2401.12187)

To address reward hacking, the researchers suggest combining LLM reward models through weight averaging. The merged reward model resulting from this process attained a 79.4% win rate over a single reward model.

How does WARM function? The method is fairly straightforward: similar to stochastic weight averaging, WARM averages the weights of multiple models (in this case, reward models), as depicted in the figure below.

An outline of how WARM is used in the RLHF process. The only new aspect here is that the method uses a reward model obtained from weight averaging instead of training a single reward model (annotated figure from WARM paper: https://arxiv.org/abs/2401.12187).

Previously, we discussed several weight averaging methods. How exactly does WARM average the weights to obtain the reward model? Here, the authors use a simple linear average, as in stochastic weight averaging. The difference, however, is that the models are not sampled from the same trajectory but are created independently from the pretrained model, as in Model Ratatouille. Alternatively, WARM also offers a so-called Baklava procedure that samples models along a finetuning trajectory. The differences are compared in the figure below.

A comparison between the different model merging and averaging methods. Thanks to Alexandre Ramé for providing this adjusted version of the figure from the Model Ratatouille paper.
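
As a minimal sketch of the linear averaging step itself (assuming all reward models share an identical architecture, which is what makes this possible; the helper below is not from the WARM codebase):

import torch

def average_state_dicts(state_dicts):
    # Uniform linear average over M models, e.g., M reward models
    # finetuned independently from the same pretrained initialization.
    avg = {}
    for name in state_dicts[0]:
        avg[name] = torch.stack([sd[name].float() for sd in state_dicts]).mean(dim=0)
    return avg

# Usage sketch: merged = average_state_dicts([m.state_dict() for m in reward_models])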

Following the WARM procedure above and averaging 10 reward models, the researchers found that an RL policy trained with the WARM reward model has a 79.4% win rate against a policy trained with a single reward model, as shown in the figure below.

WARM outperforms the single best reward model at step 3000 (annotated figure from WARM paper: https://arxiv.org/abs/2401.12187).

Conclusion

Model merging is not a new technique, but it seems that in the context of LLMs, it's particularly promising since LLMs are very expensive and resource-intensive. Hence, methods that take advantage of multiple existing LLMs created during training (without extra work) are particularly attractive. Also, in contrast to traditional ensembles that require running multiple models at the same time, weight-averaged models are relatively lightweight and don't cost more than a single model during inference time.

Looking ahead, I think that the future of model merging in LLMs holds exciting prospects. In particular, I also expect more creative ways of merging models on the horizon.



2. Tuning Language Models by Proxy

The paper Tuning Language Models by Proxy introduces a promising technique for improving Large Language Models (LLMs) called proxy-tuning. This method (sort of) finetunes LLMs without altering their weights.

Proxy-tuning works through a straightforward process at the decoding stage by adjusting the logits of the target LLM. Specifically, it involves calculating the difference in logits between a smaller base model and a finetuned model. This difference is then added to the logits of the target model. (Logits are the raw output values generated by the model's final layer. Before being transformed into probabilities through a function like softmax, these logits represent the unnormalized scores for each possible output token in the LLM's vocabulary.)

Annotated illustration of proxy-tuning via https://arxiv.org/abs/2401.08565

To illustrate this concept more clearly, consider the objective of improving a large target model, M1 (for example, Llama 2 70B). The process involves two smaller models:

    A smaller base model, M2 (for instance, Llama 2 7B Base);

    A finetuned version of that smaller model, M3 (for instance, Llama 2 7B Chat).

The enhancement is achieved by applying the difference in predictions (logits) of these smaller models to the target model M1. The output logits of the improved target model, M1*, are computed as M1*(x) = M1(x) + [M3(x) - M2(x)]. After obtaining these output logits, they are converted into probabilities using the softmax function. These probabilities are then used to sample the final output, i.e., the generated text, using nucleus sampling or top-k decoding.
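
Here is a minimal sketch of this decoding-time arithmetic; the toy logit values and the plain sampling step are placeholders, since the real implementation operates on full vocabulary-sized logit vectors at every decoding step:

import torch

def proxy_tuned_logits(logits_large_base, logits_small_base, logits_small_tuned):
    # M1*(x) = M1(x) + [M3(x) - M2(x)]: shift the large base model's logits
    # by the difference between the small finetuned and small base models.
    return logits_large_base + (logits_small_tuned - logits_small_base)

# Toy example with a 5-token vocabulary (arbitrary numbers, just to show the mechanics):
m1 = torch.tensor([2.0, 1.0, 0.5, 0.1, -1.0])   # large base model M1
m2 = torch.tensor([1.5, 1.2, 0.4, 0.0, -0.5])   # small base model M2
m3 = torch.tensor([0.5, 2.5, 0.4, 0.2, -0.5])   # small finetuned model M3

adjusted = proxy_tuned_logits(m1, m2, m3)
probs = torch.softmax(adjusted, dim=-1)                 # logits -> probabilities
next_token = torch.multinomial(probs, num_samples=1)    # sample the next token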

How well does proxy-tuning work in practice?

The experimental results are impressively positive. The researchers applied their methods in three distinct scenarios:

    Instruction-Tuning: Improving the 70B Llama 2 Base model to match the performance of the Llama 2 70B Chat model.

    Domain Adaptation: Improving the 70B Llama 2 Base model on coding tasks, aiming to reach the performance level of CodeLlama 70B.

    Task-Specific Finetuning: Improving the 70B Llama 2 Base model for specialized tasks such as TriviaQA or math problems.

In each scenario, significant improvements were observed compared to the original base models. The figure below focuses on the comparison between the Llama 2 70B Base and Chat models for conciseness. However, the paper provides additional benchmarks for CodeLlama.

Modified and annotated figure from the proxy-tuning paper, https://arxiv.org/abs/2401.08565

As one can see, based on the benchmarks shown in the figure above, the proxy-tuned 70B Llama 2 model performs much better than the 70B base model, and it's almost as good as the directly tuned Llama 70B Chat model.

Practical Considerations

The use case or motivation for using this method could be increasing R&D efficiency: developing and testing new training or model enhancements on smaller models to reduce costs. These improvements can then be applied to enhance larger base models without having to train the large models themselves.

However, implementing this approach in a real-world setting still involves using three different models: 

    1. A large general-purpose base model;

    2. A smaller general-purpose model;

    3. Several small specialized models tailored to specific use cases or client needs.

So, why choose this approach over LoRA (Low-Rank Adaptation), which doesn't require the smaller general-purpose model (2) and can replace the multiple small specialized models (3) with sets of small LoRA matrices?

There are two potential advantages to the proxy-tuning approach:

a) It might outperform LoRA in certain contexts, although there's no direct comparison available yet.

b) It's useful when the large base model (1) is a "black box", and its internal weights are inaccessible.

However, there's a catch: the smaller models must share the same vocabulary as the larger target model. (In theory, if someone knows the vocabulary of GPT-4 and can access its logit outputs, they could create specialized GPT-4 models using this method.)



3. Mixtral of Experts

The Mixtral 8x7B paper is finally here! Mixtral 8x7B is a sparse mixture of experts (sparse MoE) model that currently ranks as one of the best-performing and most interesting openly available large language models (LLMs). The model repository is released under an Apache 2.0 license and, according to the paper, is free to use for both academic and commercial purposes.

What is an MoE? An MoE, or Mixture of Experts, is a type of ensemble model that combines several smaller "expert" subnetworks. Each subnetwork is responsible for handling different types of tasks or, more concretely, tokens. By using multiple smaller subnetworks instead of one large network, MoEs aim to allocate computational resources more efficiently. This enables them to scale more effectively and potentially achieve better performance across a wider range of tasks.  (Also see the brief Mixture of Experts 101 section in my previous article.)

In the Mixtral of Experts paper, the authors describe how they built Mixtral 8x7B. This model compares very favorably to the much larger Llama 2 70B model.

Annotated figure from Mixtral of Experts paper (https://arxiv.org/abs/2401.04088) showing that Mixtral 8x7B matches or even outperforms the larger Llama 2 70B model on many benchmarks

Mixtral Architecture

The key idea in Mixtral 8x7B is to replace each feed-forward module in a transformer architecture with 8 expert layers, as illustrated in the figure below.

Annotated transformer architecture from Attention Is All You Need, https://arxiv.org/abs/1706.03762

A feed-forward module is essentially a multilayer perceptron. In PyTorch-like code, it looks like this:

import torch

class FeedForward(torch.nn.Module):
    def __init__(self, embed_dim, coef, dropout=0.1):
        super().__init__()
        self.layers = torch.nn.Sequential(
            torch.nn.Linear(embed_dim, coef * embed_dim),
            torch.nn.ReLU(),
            torch.nn.Linear(coef * embed_dim, embed_dim),
            torch.nn.Dropout(dropout),
        )

    def forward(self, x):
        return self.layers(x)

In addition, there is a routing module (also known as a gating network) that redirects each of the token embeddings to the 8 expert feed-forward modules. The outputs from these 8 expert feed-forward layers are then summed, as illustrated in the figure below.

Annotated figure from Mixtral of Experts paper explaining the MoE module, https://arxiv.org/abs/2401.04088

Mathematically, the figure above can be written as follows for the 8 experts {E1, E2, ..., E8}:

y = Σi G(x)i · Ei(x),  summing over the experts i = 1, ..., 8

Here, G represents the router (or gating network), and Ei are the outputs of the expert modules. Based on the equation above, the MoE layer computes a weighted sum of the expert outputs Ei, where the weights are provided by the gating network G(x)i for the inputs x.

At first glance, it might seem like Mixtral is simply adding additional parameters to an LLM via these expert (feed-forward) modules to represent a sort of weighted ensemble approach. However, there's an additional tweak: Mixtral is a sparse MoE, which means that only a subset of the experts is used for each input:

G(x) = Softmax(TopK(x · Wg)),

where Wg is the router's weight matrix, and TopK keeps only the K largest routing values while setting the rest to −∞ (so that they become 0 after the softmax).

In the specific case of Mixtral 8x7B, the authors specify TopK=2, meaning that only 2 experts are used at a time. So, based on the equation above, an output from G(x) might look as follows: [0, 0, 0.63, 0, 0, 0.37, 0, 0]. This indicates that the third expert contributes 63% and the sixth expert contributes 37% to the output.
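
To make the routing concrete, below is a minimal sketch of a top-2 sparse MoE layer that reuses the FeedForward module shown earlier. It uses a simplified per-token loop for readability and is not Mixtral's actual implementation:

import torch

class SparseMoE(torch.nn.Module):
    def __init__(self, embed_dim, num_experts=8, top_k=2, coef=4):
        super().__init__()
        self.top_k = top_k
        self.gate = torch.nn.Linear(embed_dim, num_experts)   # router / gating network
        self.experts = torch.nn.ModuleList(
            [FeedForward(embed_dim, coef) for _ in range(num_experts)]
        )

    def forward(self, x):                                 # x: (num_tokens, embed_dim)
        scores = self.gate(x)                             # (num_tokens, num_experts)
        topk_scores, topk_idx = scores.topk(self.top_k, dim=-1)
        weights = torch.softmax(topk_scores, dim=-1)       # e.g., [0.63, 0.37] for two experts
        outputs = []
        for t in range(x.shape[0]):                        # simplified per-token loop
            token_out = torch.zeros_like(x[t])
            for slot in range(self.top_k):
                expert = self.experts[topk_idx[t, slot].item()]
                token_out = token_out + weights[t, slot] * expert(x[t])
            outputs.append(token_out)
        return torch.stack(outputs)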

Model Size

Where does Mixtral 8x7B get its name, and what is the actual size of this sparse MoE model? The "8x" refers to the use of 8 expert subnetworks. The "7B" indicates that it combines Mistral 7B modules. However, it's important to note that the size of Mixtral is not 8x7B = 56B. The 7 billion parameters represent the size of an entire Mistral 7B model, but in Mixtral 8x7B, only the feed-forward layers are replaced by expert layers.

In total, Mixtral 8x7B comprises 47B parameters. If we consider the following equations, where FF stands for the feed-forward layers and NonFF refers to the non-feed-forward layers (for instance, attention weights):

FF + NonFF = 7B (Mistral 7B)
8 × FF + NonFF = 47B (Mixtral 8x7B)

Solving these equations, we find that the Mistral 7B model has 40B/7 ≈ 5.71B parameters in the feed-forward layers and 7B − 5.71B = 1.29B parameters in the non-feed-forward (for example, attention) layers. Intriguingly, most of the parameters in an LLM are contained in its feed-forward modules, not in the attention mechanism. And this is especially true for the Mixtral 8x7B model, which has 8 × 5.71B = 45.68B of its parameters in the expert (feed-forward) layers.
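
Plugging in the numbers as a quick back-of-the-envelope check (using only the parameter totals stated above):

mistral_total = 7e9        # Mistral 7B:      FF + NonFF
mixtral_total = 47e9       # Mixtral 8x7B:  8*FF + NonFF

ff = (mixtral_total - mistral_total) / 7    # ≈ 5.71B feed-forward parameters
non_ff = mistral_total - ff                 # ≈ 1.29B non-feed-forward parameters
expert_params = 8 * ff                      # ≈ 45.7B parameters across all 8 experts
active_params = 2 * ff + non_ff             # ≈ 12.7B parameters active per token (2 experts)

print(f"{ff/1e9:.2f}B  {non_ff/1e9:.2f}B  {expert_params/1e9:.2f}B  {active_params/1e9:.2f}B")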

With 47B parameters in total, Mixtral 8x7B is significantly smaller than, for example, the Llama 2 70B model. Moreover, since only 2 experts are active at each time step, the model utilizes only 13B parameters for each input token, making it much more efficient than a regular non-MoE 47B parameter model.

Annotated figure from Mixtral of Experts paper, https://arxiv.org/abs/2401.04088

Expert Specialization

The interesting question is whether the experts exhibit any task- or token-specific patterns. Unfortunately, the authors were unable to observe such specialization by topic (where "topics" refer to datasets like GitHub, Arxiv, Mathematics, Wikipedia, etc.).

However, the authors make an interesting observation: consecutive tokens in text datasets are often assigned to the same experts. Additionally, indentation tokens in Python code are frequently assigned to the same expert, as shown in the figure below.

Annotated figure from Mixtral of Experts paper, https://arxiv.org/abs/2401.04088

(The authors didn't specify which of the two experts per token is colored, but I assume they always colored the expert with the higher weight.)

Conclusion

Mixtral 8x7B has several advantages: it's openly available, matches or outperforms larger models such as Llama 2 70B, and employs sparse MoE modules in a relatively fresh (although not entirely new) way of building LLMs.

Its strong performance, coupled with parameter efficiency and the ability to handle context windows of up to 32k, will likely make it an attractive model for the foreseeable future (or, at least the upcoming months). I believe that MoE models will also be one of the main focus areas for most open-source projects in 2024, making Mixtral of Experts worth keeping on your radar.

If there's one small nitpick, it's that the authors did not share any information about the training datasets. (This may be understandable to avoid copyright debates.) 

Furthermore, even though such a study would be very expensive, it would be intriguing to see how a Mixtral 8x70B would compare to a Llama 2 70B model trained on the same dataset. In addition, I'd be interested in seeing a comparison of Mixtral 8x70B with the following two hypothetical models some time in the future to more directly compare the performance of MoE and non-MoE approaches:

(Fun fact: The Brave browser now uses Mixtral 8x7B as default LLM for their Leo assistant feature.)


4. TinyLlama: An Open-Source Small Language Model

After Microsoft's phi-2 made headlines in December, TinyLlama is the latest addition to the "small" LLM category. TinyLlama is not only small, with 1.1 billion parameters, but also fully open source. Here, "open source" means that the training code and model checkpoints are available under an unrestricted open-source license. You can find the GitHub repository here: https://github.com/jzhang38/TinyLlama.

What makes small LLMs (also referred to as SLMs, short for Small Language Models) so attractive? Small LLMs are:

    Accessible and affordable, meaning they can be run (in inference mode) in limited-resource settings (such as on laptops or small GPUs).

    Cheaper to develop and pretrain — these models only require a relatively small number of GPUs.

    Easier to customize for target tasks — small models can typically be finetuned on just a single GPU.

    More energy-efficient — this is an important consideration given concerns about the environmental impact of training and running large-scale AI models. Or, think of battery life when deploying LLMs on portable devices such as smartphones.

    Valuable for educational purposes — they are more manageable and thus easier to understand and tweak.

TinyLlama Performance

Besides being small and open-source, TinyLlama also performs relatively well on both common-sense reasoning and problem-solving benchmarks, compared to other open-source models of similar size.

Source: TinyLlama: An Open-Source Small Language Model, https://arxiv.org/abs/2401.02385

Of course, TinyLlama can't rival much larger models in those benchmarks, but since all the code is open source, it presents interesting opportunities for further studies and finetuning.

TinyLlama Learnings

For instance, an intriguing educational takeaway from the authors' training runs is that training the model for 3 epochs (instead of 1 epoch) on 1 trillion tokens is actually beneficial (although maybe not optimal according to the Chinchilla scaling laws via Hoffmann et al., 2022).

Source: Training Compute-Optimal Large Language Models, https://arxiv.org/abs/2203.15556

For instance, as shown in the plot below, the model keeps improving even though the data is repeated across multiple epochs of training.

Annotated figure from the TinyLlama paper (https://arxiv.org/abs/2401.02385); the paper includes similar plots for 6 other benchmarks

Studying the behavior on "too large" datasets or running the training for more than one epoch would not be trivial with larger models. In any case, I am excited to see what future finetuning experiments on TinyLlama will yield. (Although early experiments showed that it's currently lagging behind the small but 3x larger phi-2 model.)


Other Interesting Research Papers In January

Below is a selection of other interesting papers I stumbled upon this month. Given the length of the list, I have highlighted those I found particularly interesting with an asterisk (*).

*KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization by Hooper, Kim, Mohammadzadeh, Mahoney, et al. (31 Jan), https://arxiv.org/abs/2401.18079

*Rephrasing the Web: A Recipe for Compute and Data-Efficient Language Modeling by Maini, Seto, Bai, Grangier, et al. (29 Jan), https://arxiv.org/abs/2401.16380

MoE-LLaVA: Mixture of Experts for Large Vision-Language Models by Lin, Tang, Ye, Cui, et al. (29 Jan), https://arxiv.org/abs/2401.15947

EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty by Li, Wei, C. Zhang, and H. Zhang (26 Jan), https://arxiv.org/abs/2401.15077

Multimodal Pathway: Improve Transformers with Irrelevant Data from Other Modalities by Zhang, Ding, Gong, Ge, and Yue (25 Jan), https://arxiv.org/abs/2401.14405

Pix2gestalt: Amodal Segmentation by Synthesizing Wholes by Ozguroglu, Liu, Surís, Chen, et al. (25 Jan), https://arxiv.org/abs/2401.14398

Rethinking Patch Dependence for Masked Autoencoders by Fu, Lian, Wang, Shi, et al. (25 Jan), https://arxiv.org/abs/2401.14391

SpacTor-T5: Pre-training T5 Models with Span Corruption and Replaced Token Detection by Ye, Jiang, Rostamizadeh, Chakrabarti, et al. (24 Jan), https://arxiv.org/abs/2401.13160 

(1) Span corruption: instead of masking individual tokens as in BERT, T5 randomly replaces contiguous spans of tokens in the input text with a single mask token.

(2) Replaced token detection: some tokens (words or subwords) in the input text are replaced with other tokens, and the model's task is to identify which tokens have been replaced.


MambaByte: Token-free Selective State Space Model by Wang, Gangavarapu, Yan, and Rush (24 Jan), https://arxiv.org/abs/2401.13660

Spotting LLMs With Binoculars: Zero-Shot Detection of Machine-Generated Text by Hans, Schwarzschild, Cherepanova, Kazemi, et al. (22 Jan), https://arxiv.org/abs/2401.12070

*WARM: On the Benefits of Weight Averaged Reward Models by Ramé, Vieillard, Hussenot, Dadashi, et al. (22 Jan), https://arxiv.org/abs/2401.12187

SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities by Chen, Xu, Kirmani, Ichter, et al. (22 Jan), https://arxiv.org/abs/2401.12168

*Knowledge Fusion of Large Language Models by Wan, Huang, Cai, Quan, et al. (19 Jan), https://arxiv.org/abs/2401.10491

VMamba: Visual State Space Model by Liu, Tian, Zhao, Yu, et al. (18 Jan), https://arxiv.org/abs/2401.10166

* Self-Rewarding Language Models by Yuan, Pang, Cho, Sukhbaatar, et al. (18 Jan), https://arxiv.org/abs/2401.10020

DiffusionGPT: LLM-Driven Text-to-Image Generation System by Qin, Wu, Chen, Ren, et al. (18 Jan), https://arxiv.org/abs/2401.10061

ReFT: Reasoning with Reinforced Fine-Tuning by Luong, Zhang, Jie, Sun, et al. (17 Jan), https://arxiv.org/abs/2401.08967

RAG vs Fine-tuning: Pipelines, Tradeoffs, and a Case Study on Agriculture by Balaguer, Benara, de Freitas Cunha, Estevão Filho, et al. (16 Jan), https://arxiv.org/abs/2401.08406

Code Generation with AlphaCodium: From Prompt Engineering to Flow Engineering by Ridnik, Kredo, and Friedman (16 Jan), https://arxiv.org/abs/2401.08500

* Scalable Pre-training of Large Autoregressive Image Models by El-Nouby, Klein, Zhai, Bautista, et al. (16 Jan), https://arxiv.org/abs/2401.08541

* Tuning Language Models by Proxy by Liu, Han, Wang et al. (16 Jan), https://arxiv.org/abs/2401.08565

An Experimental Design Framework for Label-Efficient Supervised Finetuning of Large Language Models by Bhatt, Chen, Das et al. (12 Jan), https://arxiv.org/abs/2401.06692

A Closer Look at AUROC and AUPRC under Class Imbalance by McDermott, Hansen, Zhang, et al. (11 Jan), https://arxiv.org/abs/2401.06091

* The Unreasonable Effectiveness of Easy Training Data for Hard Tasks by Hase, Bansal, Clark, and Wiegreffe, https://arxiv.org/abs/2401.06751

Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training by Hubinger, Denison, Mu, et al. (10 Jan), https://arxiv.org/abs/2401.05566

Transformers are Multi-State RNNs by Oren, Hassid, Adi, and Schwartz (11 Jan), https://arxiv.org/abs/2401.06104

RoSA: Accurate Parameter-Efficient Fine-Tuning via Robust Adaptation by Nikdan, Tabesh, and Alistarh (9 Jan), https://arxiv.org/abs/2401.04679

A Minimaximalist Approach to Reinforcement Learning from Human Feedback by Swamy, Dann, Kidambi et al. (8 Jan), https://arxiv.org/abs/2401.04056

MoE-Mamba: Efficient Selective State Space Models with Mixture of Experts by Pioro, Ciebiera, Krol et al. (8 Jan), https://arxiv.org/abs/2401.04081

* Mixtral of Experts by Jiang, Sablayrolles, Roux, et al. (8 Jan), https://arxiv.org/abs/2401.04088

Soaring from 4K to 400K: Extending LLM’s Context with Activation Beacon by Zhang, Liu, Xiao, Shao, et al. (7 Jan), https://arxiv.org/abs/2401.03462

* Denoising Vision Transformers by Yang, Luo, Li et al. (5 Jan), https://arxiv.org/abs/2401.02957

DeepSeek LLM: Scaling Open-Source Language Models with Longtermism by Bi, Chen, Chen et al. (5 Jan), https://arxiv.org/abs/2401.02954

Blending Is All You Need: Cheaper, Better Alternative to Trillion-Parameters LLM by Lu, Liusie, Raina et al. (4 Jan) https://arxiv.org/abs/2401.02994

LLM Augmented LLMs: Expanding Capabilities through Composition by Bansal, Samanta, Dalmia et al. (4 Jan), https://arxiv.org/abs/2401.02412

LLaMA Pro: Progressive LLaMA with Block Expansion by Wu, Gan, Ge et al. (4 Jan), https://arxiv.org/abs/2401.02415

A Mechanistic Understanding of Alignment Algorithms: A Case Study on DPO and Toxicity by Lee, Bai, Pres et al. (3 Jan), https://arxiv.org/abs/2401.01967

LLaMA Beyond English: An Empirical Study on Language Capability Transfer by Zhao, Zhang, Gao et al. (2 Jan), https://arxiv.org/abs/2401.01055

* Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models by Chen, Deng, Yuan et al. (2 Jan), https://arxiv.org/abs/2401.01335

LLM Maybe LongLM: Self-Extend LLM Context Window Without Tuning by Jin, Han, Yang, et al. (2 Jan), https://arxiv.org/abs/2401.01325

A Comprehensive Study of Knowledge Editing for Large Language Models (2 Jan), by Zhang, Yao, Tian et al. https://arxiv.org/abs/2401.01286

Astraios: Parameter-Efficient Instruction Tuning Code Large Language Models (1 Jan), by Terry Zhuo, Zebaze, Suppattarachai et al., https://arxiv.org/abs/2401.00788

Building a Large Language Model from Scratch

Creating an LLM from scratch is an excellent way to gain a deep understanding of its inner workings. In my book project, "Building a Large Language Model from Scratch," I code and detail the entire process, from developing an LLM architecture to implementing pretraining, fine-tuning, and alignment steps.

For more information, please visit the GitHub repository at https://github.com/rasbt/LLMs-from-scratch.


This magazine is a personal passion project that does not offer direct compensation. However, for those who wish to support me, please consider purchasing a copy of one of my books. If you find them insightful and beneficial, please feel free to recommend them to your friends and colleagues.

Your support means a great deal! Thank you!
