Ahead of AI, October 22, 2024
LLM Research Insights: Instruction Masking and New LoRA Finetuning Experiments

This article covers research on instruction finetuning and LoRA for large language models, including whether instructions should be masked during instruction finetuning and how LoRA compares with full finetuning in the programming and math domains, along with the author's personal experience and related conclusions.

🎯 In instruction finetuning, the common practice is to mask the instruction when computing the loss, but the research discussed here finds that not masking the instruction (except for special prompt tokens) can benefit model performance; the effect depends on the ratio of instruction to response length and on the size of the dataset.

📚 LoRA learns less than full finetuning: updating fewer parameters limits learning capacity when turning a pretrained base model into an instruction-following model, and the gap between LoRA and full finetuning differs across domains.

💡 LoRA forgets less than full finetuning: when applied to datasets far from the source domain, full finetuning forgets more knowledge than LoRA, and the gap is smaller on math datasets.

This month, I am covering three new papers related to instruction finetuning and parameter-efficient finetuning with LoRA in large language models (LLMs). I work with these methods on a daily basis, so it's always exciting to see new research that provides practical insights.

This article might be a bit shorter than usual since I am currently finishing the final chapter of my book, "Build a Large Language Model From Scratch." Additionally, I am preparing for a virtual ACM Tech Talk on LLMs this Wednesday. The talk is free and open to all, and you are very welcome to join if you're interested!

1. Instruction Tuning With Loss Over Instructions

One paper that caught my eye this month is Instruction Tuning With Loss Over Instructions.

In this paper, the authors question a widely accepted practice in instruction finetuning: masking the instruction when calculating the loss. But before we discuss the findings, let's begin with a general overview.

1.1 Instruction Masking During Instruction Finetuning

Instruction finetuning (or instruction tuning for short) is the task of improving the responses of a pretrained LLM to follow instructions ("Summarize this article," "Translate this sentence," etc.).

An illustration of a dataset example for instruction finetuning

When instruction finetuning LLMs, it is common to mask out the instruction itself when calculating the loss. For instance, this is done by default in our LitGPT library, and I also use it in Chapter 7 of my Build a Large Language Model From Scratch book (however, I am now considering moving the masking to a reader exercise).

In other popular LLM libraries like Axolotl, this is also done automatically via the train_on_inputs: false default setting in the config.yaml. In the Hugging Face ecosystem, it is not done by default, but one can implement it through the DataCollatorForCompletionOnlyLM dataset collator (from the TRL library), as detailed in the documentation.

An illustration of input masking where the highlighted text is still fed to the LLM, but it is not used when calculating the loss during training.
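To make this concrete, here is a minimal PyTorch sketch of the masking mechanism (my own illustration with made-up token IDs, not code from any of the libraries mentioned above): the instruction positions in the label tensor are set to -100, which is the default ignore_index of PyTorch's cross-entropy loss, so those positions do not contribute to the loss.

```python
import torch
import torch.nn.functional as F

# Hypothetical token IDs for one training example: the first
# `instruction_len` tokens belong to the instruction/prompt,
# the remaining tokens belong to the response.
input_ids = torch.tensor([[101, 2023, 2003, 1996, 7899, 3430, 2058, 102]])
instruction_len = 4

labels = input_ids.clone()
labels[:, :instruction_len] = -100  # mask instruction tokens from the loss

# Suppose `logits` came from the model; shape: (batch, seq_len, vocab_size)
vocab_size = 30_000
logits = torch.randn(1, input_ids.shape[1], vocab_size)

# Shift so each position predicts the next token, then compute the loss;
# positions labeled -100 are skipped by cross_entropy.
loss = F.cross_entropy(
    logits[:, :-1, :].reshape(-1, vocab_size),
    labels[:, 1:].reshape(-1),
    ignore_index=-100,
)
print(loss)
```

Libraries such as LitGPT, Axolotl, or TRL wrap this label construction for you; the sketch only illustrates the underlying convention.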

Masking the input prompts, as described above, is a routine task, and some papers may include comparisons with and without this masking. For example, the QLoRA paper included a comparison in its appendix, finding that masking performs better.

Annotated table from "QLoRA: Efficient Finetuning of Quantized LLMs", https://arxiv.org/abs/2305.14314

Note that MMLU is a benchmark focused on multiple-choice questions, and the authors did not measure or study how masking affects conversational performance when the finetuned model is used as a chatbot.

1.2 Instruction Modeling

After the brief introduction to the topic in the previous section, let's examine the Instruction Tuning With Loss Over Instructions paper. In this study, the authors systematically investigate the differences in LLM performance when instructions are masked versus unmasked. 

Illustration of (1) no instruction masking, (2) instruction masking, and (3) boilerplate masking.

Approach number 1 in the figure above is the default approach when implementing an LLM because it does not require any extra work or modification of the loss function. It's the method that the authors refer to in the paper as "instruction modeling." (In the paper, they additionally mask special prompt tokens like <|user|>, <|assistant|>, and <|system|> that may occur in non-Alpaca prompt templates.)

Approach number 2 is currently the most common approach in practice, where everything except for the response is masked when computing the loss. In the paper, they refer to this approach as "instruction tuning" (the naming is a bit unfortunate in this context, because we are not "tuning" the instruction but excluding the instructions from the loss, compared to approach 1).

While drawing the figure above, I thought of an interesting third approach that is not explored in the paper: Masking prompt-specific boilerplate text. For instance, in Alpaca-style prompts, all examples begin with "Below is an instruction...". Compared to the actual instruction and input, this text is constant and could thus be excluded from the loss. This is not studied in the paper but might be an interesting additional experiment, which I am planning to add as another reader exercise (plus solution) to my LLMs from Scratch book.
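To sketch how this third variant could look in code, the snippet below (a hypothetical example of mine, not something from the paper) masks only the constant Alpaca boilerplate prefix while keeping the actual instruction and the response in the loss. It assumes a typical Hugging Face-style tokenizer with an encode() method.

```python
import torch

ALPACA_PREFIX = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
)

def build_labels(tokenizer, instruction, response, prefix=ALPACA_PREFIX):
    # Tokenize the boilerplate prefix and the rest of the prompt separately
    # so we know how many positions to exclude from the loss.
    prefix_ids = tokenizer.encode(prefix, add_special_tokens=False)
    rest_ids = tokenizer.encode(instruction + response, add_special_tokens=False)

    input_ids = torch.tensor(prefix_ids + rest_ids)
    labels = input_ids.clone()
    # Mask only the constant boilerplate; instruction and response stay in the loss.
    labels[: len(prefix_ids)] = -100
    return input_ids, labels
```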

It turns out that instruction modeling (that is, not masking the instructions) seems to outperform the masking approach, as shown in the figure below.

Not masking the instructions performs better than masking the instructions. Annotated from Instruction Tuning With Loss Over Instructions, (https://arxiv.org/abs/2405.14394)

However, the performance of "instruction modeling" depends on (a) the ratio between the instruction and response length and (b) the size of the dataset (in terms of the number of training examples).

The benefit of not masking instructions depends on the dataset length and size. Annotated from Instruction Tuning With Loss Over Instructions, (https://arxiv.org/abs/2405.14394)

A plausible explanation for the dataset length and size dependence shown above is that when responses are short and there are only a few training examples, it becomes easier for the model to memorize the responses. Including the instructions in the loss, that is, modeling a larger portion of the sequence, can then help reduce overfitting.
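Because the benefit hinges on these two quantities, it can be worth measuring them for your own data before deciding whether to mask. Below is a small sketch that assumes an Alpaca-style dataset, i.e., a list of dicts with "instruction", "input", and "output" fields, and any tokenizer exposing an encode() method.

```python
def dataset_stats(dataset, tokenizer):
    # `dataset` is assumed to be a list of dicts with "instruction",
    # "input", and "output" keys (Alpaca-style).
    ratios = []
    for example in dataset:
        instr = example["instruction"] + example.get("input", "")
        instr_len = len(tokenizer.encode(instr))
        resp_len = len(tokenizer.encode(example["output"]))
        ratios.append(instr_len / max(resp_len, 1))
    return {
        "num_examples": len(dataset),
        "avg_instruction_to_response_ratio": sum(ratios) / len(ratios),
    }
```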

1.3 Conclusion

In short, the authors found that going back to the basics, not masking the instructions, can benefit model performance.

It's positively surprising that the simpler approach of not masking the instruction (except for special prompt tokens like <|user|>, <|assistant|>, and <|system|>) performs better.

In my personal experience, I tried both approaches in the past and didn't see a clear winner. However, I usually didn't experiment with the masking much and instead focused on changing other settings when things didn't work well. In hindsight, it may make sense to experiment with (not) masking a bit more, as the effect seems to depend on the dataset size and length.

2. LoRA Learns Less and Forgets Less

LoRA Learns Less and Forgets Less is a comprehensive empirical study of low-rank adaptation (LoRA) for finetuning LLMs, comparing LoRA to full finetuning across two target domains: programming and mathematics. In addition to these domains, the comparison extends to two target tasks: instruction finetuning and continued pretraining.

If you are looking for a LoRA refresher before reading on, I recently covered it in "Improving LoRA: Implementing Weight-Decomposed Low-Rank Adaptation (DoRA) from Scratch".
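As a very condensed refresher, the core idea is to freeze the original weight and learn a low-rank update via two small matrices A and B. Below is a minimal sketch in the spirit of the from-scratch implementation from that article (hyperparameter values are illustrative).

```python
import torch
import torch.nn as nn

class LoRALayer(nn.Module):
    def __init__(self, in_dim, out_dim, rank, alpha):
        super().__init__()
        # A (in_dim x rank) and B (rank x out_dim) are the only trainable parts.
        self.A = nn.Parameter(torch.randn(in_dim, rank) * 0.01)
        self.B = nn.Parameter(torch.zeros(rank, out_dim))
        self.alpha = alpha

    def forward(self, x):
        return self.alpha * (x @ self.A @ self.B)

class LinearWithLoRA(nn.Module):
    def __init__(self, linear, rank=8, alpha=16):
        super().__init__()
        self.linear = linear  # the frozen pretrained layer
        self.lora = LoRALayer(linear.in_features, linear.out_features, rank, alpha)

    def forward(self, x):
        # Original output plus the low-rank update.
        return self.linear(x) + self.lora(x)
```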

2.1 LoRA Learns Less

Based on the first set of results, LoRA learns significantly less than full finetuning, as shown in the figure below. This is expected, in my view, because updating fewer parameters limits learning capacity. Learning new knowledge generally requires more capacity than converting a pretrained base model into an instruction-following model.

Full finetuning vs LoRA. The performance is measured on HumanEval, which is a dataset consisting of 164 coding challenges. Annotated figures from LoRA Learns Less and Forgets Less, https://arxiv.org/abs/2405.09673.

In the figure above, it is also interesting to note that the gap between LoRA and full finetuning is larger for continued pretraining than for instruction finetuning. This is consistent with the common intuition that pretraining primarily teaches an LLM new knowledge, whereas instruction finetuning primarily changes the behavior of an LLM.

Next, examining the same set of experiments in the domain of math instead of coding, we observe that the gap between full finetuning and LoRA shrinks, as illustrated in the figure below.

The full finetuning vs. LoRA comparison for math tasks. Annotated figures from LoRA Learns Less and Forgets Less, https://arxiv.org/abs/2405.09673.

One could argue that solving math problems is more closely related to the source domain of the LLM than coding is. In other words, the LLM may have encountered more math problems during pretraining than programming tasks. Furthermore, math problems are usually described in words, whereas programming requires an entirely new set of terms.

As a takeaway, so far we can conclude that the further the new task deviates from the pretraining data, the more advantageous it is to use full finetuning over LoRA when it comes to acquiring new knowledge, such as through continued pretraining.

2.2 LoRA Forgets Less

Above, we examined how LoRA compares to full finetuning in terms of knowledge updates. The next set of experiments assesses the two approaches in terms of forgetting information after additional training via continued pretraining and instruction finetuning. The difference from the previous results is that here, the performance is measured on the original source tasks.

Full finetuning vs LoRA on the original source tasks after training on programming data. Annotated figures from LoRA Learns Less and Forgets Less, https://arxiv.org/abs/2405.09673.

As we can see from the figure above, full finetuning forgets much more knowledge than LoRA when these methods are applied to datasets that are farther from the source domain (here, coding). The gaps are smaller for the math dataset, as shown below.

Full finetuning vs LoRA on the original source tasks after training on math data. Annotated figures from LoRA Learns Less and Forgets Less, https://arxiv.org/abs/2405.09673.

2.3 Conclusion

LoRA versus full finetuning? Perhaps as expected, it all boils down to a learning-forgetting trade-off. Full finetuning results in stronger performance in the new target domain, whereas LoRA maintains better performance in the original source domain *.

Intuitively, I suspect this is merely a side effect of LoRA changing fewer parameters in the model—the goal of LoRA, as its name suggests, is low-rank adaptation, which means not substantially modifying all model parameters. 

Moreover, in practice, it often isn't a question of whether to use full finetuning or LoRA, as the latter might be the only feasible option due to its memory savings and lower storage footprint.

Nonetheless, it's very interesting to see this thoroughly laid out and analyzed in great experimental detail. (The experiments were conducted with 7B and 13B Llama 2 models).

* (One caveat that Mariano Kamp pointed out to me was that they did not update the embedding layers in the LoRA experiments, which is crucial when adapting a model to a new task.)
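For readers using the Hugging Face PEFT library, embedding and output layers can be trained alongside the LoRA adapters via the modules_to_save setting. The sketch below assumes Llama-style module names ("embed_tokens", "lm_head", "q_proj", and so on), which may differ for other architectures.

```python
from peft import LoraConfig

# Hypothetical configuration; module names assume a Llama-style architecture.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    # Train (and save) the embedding and output layers in full
    # alongside the LoRA adapters:
    modules_to_save=["embed_tokens", "lm_head"],
)
```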

3. MoRA: High-Rank Updating for Parameter-Efficient Finetuning

It's always exciting when a new paper with a LoRA-like method for efficient LLM finetuning is published. In MoRA: High-Rank Updating for Parameter-Efficient Finetuning, the authors take a related yet opposite approach to low-rank adaptation by replacing the LoRA adapters with a square matrix.

For a refresher on how LoRA works, Appendix E of my Build a Large Language Model From Scratch book implements LoRA from scratch for a GPT model trained to classify spam messages.

Visualizations from the LoRA sections of my Build a Large Language Model From Scratch book

3.1 Low Vs High Ranks

As shown in the figure below, MoRA uses a small square matrix (M) instead of the two small LoRA matrices, A and B. We will return to discussing how this works in the next section.

LoRA and MoRA side by side, where W is the weight in a neural network layer. Figure source: MoRA: High-Rank Updating for Parameter-Efficient Finetuning, https://arxiv.org/abs/2405.12130

Why introduce another LoRA alternative, now with "high" ranks? The reason is that LoRA updates the original weights in a relatively limited way. In my view, this is by design, because we don't want to perturb or change the original model capabilities too much when we finetune a model with LoRA. However, while this low-rank updating is effective and sufficient for tasks like instruction finetuning, one downside of the low-rank updating is that it is relatively ineffective for incorporating new knowledge, such as via continued pretraining (also referred to as finetuning on regular text data, not instruction finetuning).

In the MoRA paper, the authors seek to develop a parameter-efficient finetuning method that can perform well for both instruction finetuning and absorbing new knowledge in continued pretraining. 

Below is a side-by-side comparison between LoRA, MoRA, and regular full finetuning (FFT) based on a synthetic dataset that requires the LLM to memorize specific identifier codes.

Performance of LoRA, MoRA and full finetuning (FFT) for memorizing identifier codes. Figure source: MoRA: High-Rank Updating for Parameter-Efficient Finetuning, https://arxiv.org/abs/2405.12130

In the synthetic benchmark shown in the figure above, MoRA has similar knowledge-uptake (or memorization) capabilities as full fine-tuning. LoRA with high ranks (r=256) can eventually memorize the synthetic identifier codes as well, but it requires more steps. LoRA with a small rank (r=8) fails to memorize.

Note that memorization in LLMs is not necessarily a good thing (except for historical dates and facts, of course). However, the fact that LoRA does not tend to easily memorize can also be advantageous in reducing the tendency to overfit the training data. That being said, this benchmark primarily probes whether LoRA adds enough capacity to learn new knowledge. Think of it as similar to batch overfitting, a popular debugging technique where we try to get the model to overfit on a small portion of the training set to ensure that we have implemented the architecture correctly. (We will look at a benchmark on real data in the next section.)
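For reference, such a batch-overfitting check usually looks something like the following generic sketch (the model and batch are placeholders; a Hugging Face-style model output with a .loss attribute is assumed).

```python
import torch

def overfit_one_batch(model, batch, num_steps=200, lr=1e-4):
    # Repeatedly train on a single small batch; the loss should approach
    # zero if the architecture and loss computation are implemented correctly.
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    for step in range(num_steps):
        optimizer.zero_grad()
        loss = model(**batch).loss  # assumes the batch includes labels
        loss.backward()
        optimizer.step()
        if step % 50 == 0:
            print(f"step {step}: loss {loss.item():.4f}")
```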

3.2 MoRA in a Nutshell

So, how is MoRA implemented?

Here, the authors use a trainable square matrix that is applied to the original weights W instead of the matrices AB in LoRA. This square matrix is usually much smaller than the original weight matrix. In fact, the number of trainable parameters in LoRA and MoRA can even be the same.

For example, if the original weight layer has 4096 × 4096 = 16,777,216 parameters, LoRA with r=8 has 4096×8 + 8×4096 = 65,536 parameters. With MoRA, we can match the number of parameters with r=256, i.e., 256 × 256 = 65,536.
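To double-check this parameter-count equivalence, a short calculation suffices:

```python
hidden_dim = 4096

# LoRA: two rectangular matrices, A (hidden_dim x r) and B (r x hidden_dim)
lora_rank = 8
lora_params = hidden_dim * lora_rank + lora_rank * hidden_dim
print(lora_params)  # 65536

# MoRA: a single square matrix M (r x r) with a much larger r
mora_rank = 256
mora_params = mora_rank * mora_rank
print(mora_params)  # 65536
```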

How do you apply this 256×256 matrix to the original 4096×4096 weight matrix? The authors define several non-parametric compression and decompression methods (the details are a bit out of scope for this post, but I tried to summarize them with PyTorch code in the figure below).

Left: Visualization of the matrix dimensions for LoRA and MoRA when the total number of parameters is the same. Right: Example code illustrating the compression and decompression of the MoRA matrix.

Figure source for the left subfigure: MoRA: High-Rank Updating for Parameter-Efficient Finetuning, https://arxiv.org/abs/2405.12130
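Since the figure's code is not reproduced here, below is a simplified sketch of the general compress-apply-decompress pattern. It uses a naive grouping-and-summing compression and a repeat-based decompression purely as placeholders; the paper's actual non-parametric operators (truncation, sharing, and rotation-based reshaping) differ in their details.

```python
import torch
import torch.nn as nn

class MoRAStyleLayer(nn.Module):
    """Simplified sketch: compress the input, apply a square matrix M, decompress."""

    def __init__(self, hidden_dim=4096, rank=256):
        super().__init__()
        assert hidden_dim % rank == 0
        self.rank = rank
        self.groups = hidden_dim // rank
        # The trainable square matrix; zero-initialized so the initial update is zero.
        self.M = nn.Parameter(torch.zeros(rank, rank))

    def forward(self, x):
        # x: (..., hidden_dim)
        # Compress: fold hidden_dim into groups of size `rank` and sum them.
        x_comp = x.reshape(*x.shape[:-1], self.groups, self.rank).sum(dim=-2)
        # Apply the square matrix.
        h = x_comp @ self.M
        # Decompress: expand the rank-sized output back to hidden_dim.
        # As in LoRA, this branch's output would be added to the frozen layer's output.
        return h.repeat_interleave(self.groups, dim=-1)
```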

How do LoRA and MoRA compare?

Looking at real dataset benchmarks (see the table below), MoRA and LoRA are in relatively similar ballparks. However, in the case of continued pretraining on biomedical and finance data, MoRA outperforms all LoRA variants; only full fine-tuning (FFT) performs better. Additionally, an interesting observation is that LoRA ties with or outperforms DoRA, a LoRA variant I covered a few months ago in "Improving LoRA: Implementing Weight-Decomposed Low-Rank Adaptation (DoRA) from Scratch":

Comparison of several LoRA variants with MoRA on real datasets. Annotated table from MoRA: High-Rank Updating for Parameter-Efficient Finetuning, https://arxiv.org/abs/2405.12130

3.3 Conclusion

It turns out that the relatively simple MoRA method works surprisingly well. And it indeed slightly outperforms LoRA in continued pretraining. However, it does lag behind LoRA a bit in instruction fine-tuning and mathematical reasoning with small ranks. For me, the results are not compelling enough to replace LoRA with MoRA. Nonetheless, the paper presents an interesting method with an interesting set of experiments.

4. Other Interesting Research Papers In May

Below is a selection of other interesting papers I stumbled upon this month. Given the length of this list, I highlighted those 10 I found particularly interesting with an asterisk (*). However, please note that this list and its annotations are purely based on my interests and relevance to my own projects.

Contextual Position Encoding: Learning to Count What's Important by Golovneva, Wang, Weston, and Sukhbaatar (29 May), https://arxiv.org/abs/2405.18719

LLaMA-NAS: Efficient Neural Architecture Search for Large Language Models by Sarah, Sridhar, Szankin, and Sundaresan (28 May), https://arxiv.org/abs/2405.18377

VeLoRA: Memory Efficient Training using Rank-1 Sub-Token Projections by Miles, Reddy, Elezi, and Deng (28 May), https://arxiv.org/abs/2405.17991

* gzip Predicts Data-dependent Scaling Laws by Pandey (26 May), https://arxiv.org/abs/2405.16684

Offline Regularised Reinforcement Learning for Large Language Models Alignment by Richemond, Tang, Guo, et al. (29 May), https://arxiv.org/abs/2405.19107

Trans-LoRA: Towards Data-free Transferable Parameter Efficient Finetuning by Wang, Ghosh, Cox, et al. (27 May), https://arxiv.org/abs/2405.17258

Stacking Your Transformers: A Closer Look at Model Growth for Efficient LLM Pre-Training (26 May) by Du, Luo, Qiu, et al., https://arxiv.org/abs/2405.15319

The Road Less Scheduled by Defazio, Yang, Mehta, et al. (24 May), https://arxiv.org/abs/2405.15682

Instruction Tuning With Loss Over Instructions by Shi, Yang, Wu, et al. (23 May), https://arxiv.org/abs/2405.14394

SimPO: Simple Preference Optimization with a Reference-Free Reward by Meng, Xia, and Chen (23 May), https://arxiv.org/abs/2405.14734

AlignGPT: Multi-modal Large Language Models with Adaptive Alignment Capability by Fei Zhao, Taotian Pang, and Chunhui Li (23 May), https://arxiv.org/abs/2405.14129

Dense Connector for MLLMs by Yao, Wu, Yang, et al. (22 May), https://arxiv.org/abs/2405.13800

Attention as an RNN by Feng, Tung, Hajimirsadeghi, et al. (22 May), https://arxiv.org/abs/2405.13956

* MoRA: High-Rank Updating for Parameter-Efficient Fine-Tuning by Jiang, Huang, Luo, et al. (20 May), https://arxiv.org/abs/2405.12130

SLAB: Efficient Transformers with Simplified Linear Attention and Progressive Re-parameterized Batch Normalization by Guo, Chen, Tang, and Wang (19 May), https://arxiv.org/abs/2405.11582

Towards Modular LLMs by Building and Reusing a Library of LoRAs by Ostapenko, Su, Ponti, et al. (17 May), https://arxiv.org/abs/2405.11157

Chameleon: Mixed-Modal Early-Fusion Foundation Models by the Chameleon Team at Meta AI (16 May), https://arxiv.org/abs/2405.09818

Xmodel-VLM: A Simple Baseline for Multimodal Vision Language Model by Xu, Liu, He, et al. (15 May), https://arxiv.org/abs/2405.09215

* LoRA Learns Less and Forgets Less by Biderman, Ortiz, Portes, et al. (15 May), https://arxiv.org/abs/2405.09673 

RLHF Workflow: From Reward Modeling to Online RLHF by Dong, Xiong, Pang, et al. (13 May), https://arxiv.org/abs/2405.07863

PHUDGE: Phi-3 as Scalable Judge by Deshwal and Chawla (12 May), https://arxiv.org/abs/2405.08029

Value Augmented Sampling for Language Model Alignment and Personalization by Han, Shenfeld, Srivastava, et al. (10 May), https://arxiv.org/abs/2405.06639

* Does Fine-Tuning LLMs on New Knowledge Encourage Hallucinations? by Gekhman, Yona, Aharoni, et al. (9 May), https://arxiv.org/abs/2405.05904

* Fishing for Magikarp: Automatically Detecting Under-trained Tokens in Large Language Models by Land and Bartolo (8 May), https://arxiv.org/abs/2405.05417

* DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model by Liu, Feng, Wang, et al. (8 May), https://arxiv.org/abs/2405.04434

You Only Cache Once: Decoder-Decoder Architectures for Language Models by Sun, Dong, Zhu, et al. (8 May), https://arxiv.org/abs/2405.05254

* xLSTM: Extended Long Short-Term Memory by Beck, Poeppel, Spanring, et al. (7 May), https://arxiv.org/abs/2405.04517

vAttention: Dynamic Memory Management for Serving LLMs without PagedAttention by Prabhu, Nayak, Mohan et al. (7 May), https://arxiv.org/abs/2405.04437

* Is Flash Attention Stable? by Golden, Hsia, Sun, et al. (5 May), https://arxiv.org/abs/2405.02803

* What Matters When Building Vision-Language Models? by Laurencon, Tronchon, Cord, and Sanh (3 May), https://arxiv.org/abs/2405.02246

Prometheus 2: An Open Source Language Model Specialized in Evaluating Other Language Models by Kim, Suk, Longpre, et al. (2 May), https://arxiv.org/abs/2405.01535

* A Careful Examination of Large Language Model Performance on Grade School Arithmetic by Zhang, Da, Lee, et al. (1 May), https://arxiv.org/abs/2405.00332

Self-Play Preference Optimization for Language Model Alignment by Wu, Sun, Yuan, et al. (1 May), https://arxiv.org/abs/2405.00675

Is Bigger Edit Batch Size Always Better? An Empirical Study on Model Editing with Llama-3 by Yoon, Gupta, and Anumanchipalli (1 May), https://arxiv.org/abs/2405.00664




This magazine is a personal passion project that does not offer direct compensation. However, for those who wish to support me, please consider purchasing a copy of one of my books. If you find them insightful and beneficial, please feel free to recommend them to your friends and colleagues. (Sharing your feedback with others via a book review on Amazon helps a lot, too!)

Your support means a great deal! Thank you!
