MarkTechPost@AI · February 9
Fine-Tuning of Llama-2 7B Chat for Python Code Generation: Using QLoRA, SFTTrainer, and Gradient Checkpointing on the Alpaca-14k Dataset

 

This article explains how to efficiently fine-tune the Llama-2 7B Chat model so that it excels at generating Python code. Using techniques such as QLoRA, gradient checkpointing, and supervised fine-tuning together with the Alpaca-14k dataset, it walks step by step through environment setup, LoRA parameter configuration, and memory-optimization strategies. The guide aims to help developers harness the capabilities of LLMs to produce high-quality Python code with minimal computational overhead, covering the full workflow from installing the required libraries, loading the dataset and model, to configuring training parameters, running the fine-tuning, and generating code with the fine-tuned model.

🚀 QLoRA fine-tuning: The article uses QLoRA to fine-tune the Llama-2 7B Chat model, significantly reducing compute requirements so that training can be completed even on limited hardware.

💾 Gradient checkpointing and memory optimization: Enabling gradient checkpointing, combined with other memory-optimization strategies, reduces memory usage during training and addresses the memory bottlenecks common when training large models.

📚 Alpaca-14k dataset: Supervised fine-tuning on the Alpaca-14k dataset, which targets Python code generation, improves the model's performance on code-generation tasks.

⚙️ Detailed parameter configuration: The article walks through configuring the LoRA, training, and SFT parameters, including key settings such as learning rate, batch size, and optimizer, giving readers an actionable reference.

🐍 Code-generation example: It shows how to generate Python code with the fine-tuned model and includes a concrete example, validating the effectiveness of the fine-tuning approach.

In this tutorial, we demonstrate how to efficiently fine-tune the Llama-2 7B Chat model for Python code generation using advanced techniques such as QLoRA, gradient checkpointing, and supervised fine-tuning with the SFTTrainer. Leveraging the Alpaca-14k dataset, we walk through setting up the environment, configuring LoRA parameters, and applying memory optimization strategies to train a model that excels in generating high-quality Python code. This step-by-step guide is designed for practitioners seeking to harness the power of LLMs with minimal computational overhead.

!pip install -q accelerate
!pip install -q peft
!pip install -q transformers
!pip install -q trl

First, install the required libraries for our project. They include accelerate, peft, transformers, and trl from the Python Package Index. The -q flag (quiet mode) keeps the output minimal.

import os
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    HfArgumentParser,
    TrainingArguments,
    pipeline,
    logging,
)
from peft import LoraConfig, PeftModel
from trl import SFTTrainer

Import the essential modules for our training setup. They include utilities for dataset loading, model/tokenizer, training arguments, logging, LoRA configuration, and the SFTTrainer.

# The model to train from the Hugging Face hub
model_name = "NousResearch/llama-2-7b-chat-hf"

# The instruction dataset to use
dataset_name = "user/minipython-Alpaca-14k"

# Fine-tuned model name
new_model = "/kaggle/working/llama-2-7b-codeAlpaca"

We specify the base model from the Hugging Face hub, the instruction dataset, and the new model’s name.

# QLoRA parameters

# LoRA attention dimension
lora_r = 64
# Alpha parameter for LoRA scaling
lora_alpha = 16
# Dropout probability for LoRA layers
lora_dropout = 0.1

Define the LoRA parameters for our model. lora_r sets the LoRA attention dimension, lora_alpha scales LoRA updates, and lora_dropout controls dropout probability.
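As a quick intuition for how these values interact, the LoRA update to a frozen weight matrix is scaled by lora_alpha / lora_r, so with r = 64 and alpha = 16 the low-rank delta is down-weighted by a factor of 0.25. The following minimal sketch is for illustration only; the shapes are illustrative and peft handles all of this internally.

# Illustrative sketch of the LoRA update rule; not part of the training code
import torch

d_out, d_in = 4096, 4096              # illustrative Llama-2 hidden sizes
r, alpha = 64, 16                     # same values as lora_r and lora_alpha above

W = torch.randn(d_out, d_in)          # frozen pretrained weight
A = torch.randn(r, d_in) * 0.01       # trainable low-rank factor A
B = torch.zeros(d_out, r)             # trainable low-rank factor B (initialized to zero)

scaling = alpha / r                   # 16 / 64 = 0.25
W_effective = W + scaling * (B @ A)   # weight the layer effectively uses in the forward pass
print(W_effective.shape, scaling)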

# TrainingArguments parameters

# Output directory where the model predictions and checkpoints will be stored
output_dir = "/kaggle/working/llama-2-7b-codeAlpaca"
# Number of training epochs
num_train_epochs = 1
# Enable fp16 training (set to True for mixed precision training)
fp16 = True
# Batch size per GPU for training
per_device_train_batch_size = 8
# Batch size per GPU for evaluation
per_device_eval_batch_size = 8
# Number of update steps to accumulate the gradients for
gradient_accumulation_steps = 2
# Enable gradient checkpointing
gradient_checkpointing = True
# Maximum gradient norm (gradient clipping)
max_grad_norm = 0.3
# Initial learning rate (AdamW optimizer)
learning_rate = 2e-4
# Weight decay to apply to all layers except bias/LayerNorm weights
weight_decay = 0.001
# Optimizer to use
optim = "adamw_torch"
# Learning rate schedule
lr_scheduler_type = "constant"
# Group sequences into batches with the same length
# Saves memory and speeds up training considerably
group_by_length = True
# Ratio of steps for a linear warmup
warmup_ratio = 0.03
# Save checkpoint every X update steps
save_steps = 100
# Log every X update steps
logging_steps = 10

These parameters configure the training process. They include output paths, number of epochs, precision (fp16), batch sizes, gradient accumulation, and checkpointing. Additional settings like learning rate, optimizer, and scheduling help fine-tune training behavior. Warmup and logging settings control how the model starts training and how we monitor progress.
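A small sanity check worth doing with these numbers (an illustrative calculation, not part of the tutorial's code): the effective batch size per optimizer step is the per-device batch size multiplied by the gradient accumulation steps.

# Effective batch size per optimizer step on a single GPU, using the values above
effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps
print(effective_batch_size)  # 8 * 2 = 16 sequences per weight update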

import torch

print("PyTorch Version:", torch.__version__)
print("CUDA Version:", torch.version.cuda)

Import PyTorch and print both the installed PyTorch version and the corresponding CUDA version.

!nvidia-smi

This command shows the GPU information, including driver version, CUDA version, and current GPU usage.

# SFT parameters

# Maximum sequence length to use
max_seq_length = None
# Pack multiple short examples in the same input sequence to increase efficiency
packing = False
# Load the entire model on GPU 0
device_map = {"": 0}

Define SFT parameters, such as the maximum sequence length, whether to pack multiple examples, and mapping the entire model to GPU 0.

# Load dataset
dataset = load_dataset(dataset_name, split="train")

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

# Load base model in float16
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto",
)

# Prepare model for training
model.gradient_checkpointing_enable()
model.enable_input_require_grads()

Load the dataset and tokenizer, configure the padding token and padding side, and load the base model in float16 with device_map="auto". Finally, we enable gradient checkpointing and ensure the model requires input gradients for training. Note that this load is not quantized; a quantized, QLoRA-style alternative is sketched below.
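The following is a minimal sketch of what a quantized load could look like if you want the memory savings that QLoRA implies. It is not part of the tutorial's code: it assumes bitsandbytes is installed (!pip install -q bitsandbytes) and uses 4-bit NF4 quantization via BitsAndBytesConfig together with peft's prepare_model_for_kbit_training.

# Hedged sketch: 4-bit, QLoRA-style loading with bitsandbytes, as an alternative
# to the float16 load above. Assumes bitsandbytes is installed.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import prepare_model_for_kbit_training

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # quantize base weights to 4 bits
    bnb_4bit_quant_type="nf4",             # NormalFloat4 quantization
    bnb_4bit_compute_dtype=torch.float16,  # do matmuls in fp16
)

quantized_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
)

# Enables gradient checkpointing and prepares the quantized model for LoRA training
quantized_model = prepare_model_for_kbit_training(quantized_model)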

from peft import get_peft_model

Import the get_peft_model function, which applies parameter-efficient fine-tuning (PEFT) to our base model.

# Load LoRA configuration
peft_config = LoraConfig(
    lora_alpha=lora_alpha,
    lora_dropout=lora_dropout,
    r=lora_r,
    bias="none",
    task_type="CAUSAL_LM",
)

# Apply LoRA to the model
model = get_peft_model(model, peft_config)

# Set training parameters
training_arguments = TrainingArguments(
    output_dir=output_dir,
    num_train_epochs=num_train_epochs,
    per_device_train_batch_size=per_device_train_batch_size,
    gradient_accumulation_steps=gradient_accumulation_steps,
    optim=optim,
    save_steps=save_steps,
    logging_steps=logging_steps,
    learning_rate=learning_rate,
    weight_decay=weight_decay,
    fp16=fp16,
    max_grad_norm=max_grad_norm,
    warmup_ratio=warmup_ratio,
    group_by_length=True,
    lr_scheduler_type=lr_scheduler_type,
)

# Set supervised fine-tuning parameters
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    tokenizer=tokenizer,
    args=training_arguments,
    packing=packing,
)

Configure and apply LoRA to our model using LoraConfig and get_peft_model. We then create TrainingArguments for model training, specifying epoch counts, batch sizes, and optimization settings. Lastly, we set up the SFTTrainer, passing it the model, dataset, tokenizer, and training arguments.
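Before launching training, it can be useful to confirm how few parameters LoRA actually leaves trainable. PeftModel provides print_trainable_parameters() for this; the call below is an optional check, not part of the original snippet.

# Optional: report trainable vs. frozen parameter counts for the LoRA-wrapped model
model.print_trainable_parameters()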

# Train model
trainer.train()

# Save trained model
trainer.model.save_pretrained(new_model)

Initiate the supervised fine-tuning process (trainer.train()) and then save the trained LoRA model to the specified directory.
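It is also common, though not shown in the original snippet, to save the tokenizer next to the adapter so the output directory is self-contained:

# Optional: store the tokenizer alongside the saved LoRA adapter
tokenizer.save_pretrained(new_model)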

# Run text generation pipeline with the fine-tuned model
prompt = "How can I write a Python program that calculates the mean, standard deviation, and coefficient of variation of a dataset from a CSV file?"
pipe = pipeline(task="text-generation", model=trainer.model, tokenizer=tokenizer, max_length=400)
result = pipe(f"<s>[INST] {prompt} [/INST]")
print(result[0]['generated_text'])

Create a text generation pipeline using our fine-tuned model and tokenizer. Then, we provide a prompt, generate text using the pipeline, and print the output.
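The [INST] ... [/INST] wrapper follows the Llama-2 chat prompt convention. If you plan to send several prompts, a small helper keeps the formatting in one place; format_prompt below is a hypothetical convenience function, not part of the original code.

# Hypothetical helper that wraps a question in the Llama-2 chat instruction format
def format_prompt(question: str) -> str:
    return f"<s>[INST] {question} [/INST]"

result = pipe(format_prompt("Write a Python function that reverses a string."))
print(result[0]["generated_text"])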

from kaggle_secrets import UserSecretsClient

user_secrets = UserSecretsClient()
secret_value_0 = user_secrets.get_secret("HF_TOKEN")

Access Kaggle Secrets to retrieve a stored Hugging Face token (HF_TOKEN). This token is used for authentication with the Hugging Face Hub.
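The tutorial does not show how the token is consumed; a typical next step, assumed here rather than taken from the original, is to authenticate with the Hub via huggingface_hub:

# Assumed usage: log in to the Hugging Face Hub with the retrieved token
from huggingface_hub import login

login(token=secret_value_0)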

# Empty VRAM
# del model
# del pipe
# del trainer
# del dataset
del tokenizer

import gc
gc.collect()
gc.collect()
torch.cuda.empty_cache()

The above snippet shows how to free up GPU memory by deleting references and clearing caches. We delete the tokenizer, run garbage collection, and empty the CUDA cache to reduce VRAM usage.

import torch

# Check the number of GPUs available
num_gpus = torch.cuda.device_count()
print(f"Number of GPUs available: {num_gpus}")

# Check if CUDA device 1 is available
if num_gpus > 1:
    print("cuda:1 is available.")
else:
    print("cuda:1 is not available.")

We import PyTorch and check the number of GPUs detected. Then, we print the count and conditionally report whether the GPU with ID 1 is available.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Specify the device ID for your desired GPU (e.g., 0 for the first GPU, 1 for the second GPU)
device_id = 1  # Change this based on your available GPUs
device = f"cuda:{device_id}"

# Load the base model on the specified GPU
base_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    low_cpu_mem_usage=True,
    return_dict=True,
    torch_dtype=torch.float16,
    device_map="auto",  # Use auto to load on the available device
)

# Load the LoRA weights
lora_model = PeftModel.from_pretrained(base_model, new_model)

# Move LoRA model to the specified GPU
lora_model.to(device)

# Merge the LoRA weights with the base model weights
model = lora_model.merge_and_unload()

# Ensure the merged model is on the correct device
model.to(device)

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

Select a GPU device (device_id 1) and load the base model with specified precision and memory optimizations. Then, load and merge LoRA weights into the base model, ensuring the merged model is moved to the designated GPU. Finally, load the tokenizer and configure it with appropriate padding settings.
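From here you could persist the merged weights or run a quick generation check. The sketch below is an assumption rather than part of the original code, and the merged output path is hypothetical.

# Hedged sketch: save the merged model (the output path is an assumption)
merged_dir = "/kaggle/working/llama-2-7b-codeAlpaca-merged"
model.save_pretrained(merged_dir)
tokenizer.save_pretrained(merged_dir)

# Quick generation check on the chosen device
inputs = tokenizer("<s>[INST] Write a Python one-liner to sum a list. [/INST]", return_tensors="pt").to(device)
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))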

In conclusion, by following this tutorial you have fine-tuned the Llama-2 7B Chat model to specialize in Python code generation. Integrating QLoRA, gradient checkpointing, and the SFTTrainer demonstrates a practical approach to managing resource constraints while achieving high performance.


Download the Colab Notebook here. All credit for this research goes to the researchers of this project.
