AWS Machine Learning Blog 2024年11月22日
Fine-tune large language models with Amazon SageMaker Autopilot
index_new5.html
../../../zaker_core/zaker_tpl_static/wap/tpl_guoji1.html

 

本文介绍如何使用Amazon SageMaker Autopilot和AutoMLV2 SDK,对Meta Llama2-7B基础模型进行微调,使其能够更好地处理问答任务。具体来说,文章展示了如何使用科学考试的多项选择题(涵盖物理、化学和生物学)来训练模型。这种微调方法可以扩展到其他任务,例如医疗保健、教育或金融服务领域的摘要或文本生成。文章还详细介绍了使用SageMaker Pipelines自动化数据准备、微调和模型创建等步骤,并使用fmeval库评估模型,最终将其注册到Amazon SageMaker模型注册中心。

🤔 **使用AutoMLV2 SDK简化模型微调流程:** AutoMLV2 SDK提供了高级函数和抽象,使得开发人员即使不熟悉AutoML概念,也能轻松创建和管理AutoML作业。它提供了CreateAutoMLJobV2 API,允许更精细的控制和定制。

🚀 **利用SageMaker Pipelines自动化模型训练和推理:** 文章采用SageMaker Pipelines来编排模型训练和推理的不同步骤,包括数据准备、模型训练、模型评估和模型注册等环节,实现了自动化和可扩展性。

📚 **使用SciQ数据集进行模型训练:** SciQ数据集包含关于物理、化学、生物学等学科的科学考试问题,文章将其格式化为CSV文件,并将其分为训练集和验证集,用于模型的微调。

⚙️ **配置AutoMLV2文本生成配置:** 在微调过程中,需要配置文本生成配置,包括指定基础模型(例如Llama2-7B)、接受模型EULA、设置训练轮数、学习率、批大小和学习率预热步数等超参数。

📊 **评估模型并注册到模型注册中心:** 文章使用fmeval库评估模型性能,并根据性能要求将模型注册到Amazon SageMaker模型注册中心,以便后续部署和使用。

Fine-tuning foundation models (FMs) is a process that involves exposing a pre-trained FM to task-specific data and fine-tuning its parameters. It can then develop a deeper understanding and produce more accurate and relevant outputs for that particular domain.

In this post, we show how to use an Amazon SageMaker Autopilot training job with the AutoMLV2 SDK to fine-tune a Meta Llama2-7B model on question answering tasks. Specifically, we train the model on multiple-choice science exam questions covering physics, chemistry, and biology. This fine-tuning approach can be extended to other tasks, such as summarization or text generation, in domains like healthcare, education, or financial services.

AutoMLV2 supports the instruction-based fine-tuning of a selection of general-purpose FMs powered by Amazon SageMaker JumpStart. We use Amazon SageMaker Pipelines, which helps automate the different steps, including data preparation, fine-tuning, and creating the model. We use the open source library fmeval to evaluate the model and register it in the Amazon SageMaker Model Registry based on its performance.

Solution overview

The following architecture diagram shows the various steps involved to create an automated and scalable process to fine-tune large language models (LLMs) using AutoMLV2. The AutoMLV2 SDK simplifies the process of creating and managing AutoML jobs by providing high-level functions and abstractions, making it straightforward for developers who may not be familiar with AutoML concepts. The CreateAutoMLJobV2 API offers a low-level interface that allows for more control and customization. Using the SDK offers benefits like faster prototyping, better usability, and pre-built functions, and the API is better for advanced customizations.

To implement the solution, we use SageMaker Pipelines in Amazon SageMaker Studio to orchestrate the different steps. The solution consists of two pipelines: training and inference.

To create the training pipeline, you complete the following steps:

Load and prepare the dataset.

    Create a SageMaker Autopilot CreateAutoMLJobV2 training job. Check the training job status. Deploy the best candidate model.

The following steps configure the inference pipeline:

Preprocess data for evaluation.

    Evaluate the model using the fmeval library. Register the model if it meets the required performance.

To deploy the solution, refer to the GitHub repo, which provides step-by-step instructions for fine-tuning Meta Llama2-7B using SageMaker Autopilot and SageMaker Pipelines.

Prerequisites

For this walkthrough, complete the following prerequisite steps:

    Set up an AWS account. Create a SageMaker Studio environment. Create two AWS Identity and Access Management (IAM) roles: LambdaExecutionRole and SageMakerExecutionRole, with permissions as outlined in the SageMaker notebook. The managed policies should be scoped down further for improved security. For instructions, refer to Create a role to delegate permissions to an IAM user. On the SageMaker Studio console, upload the code from the GitHub repo. Open the SageMaker notebook ipynb and run the cells.

Training pipeline

The following training pipeline shows a streamlined way to automate the fine-tuning of a pre-trained LLM and the deployment of the model to a real-time endpoint inference.

Prepare the data

For this project, we used the SciQ dataset, which contains science exam questions about physics, chemistry, biology, and other subjects. SageMaker Autopilot supports instruction-based fine-tuning datasets formatted as CSV files (default) or as Parquet files.

When you prepare your CSV file, make sure that it contains exactly two columns:

In this project, we start by removing the irrelevant columns. Next, we combine the question and support columns to create a comprehensive prompt, which is then placed in the input column. SageMaker Autopilot sets a maximum limit on the number of rows in the dataset and the context length based on the type of model being used. We select 10,000 rows from the dataset.

Finally, we divide the data into training and validation sets:

# Load and split dataset. Change this to your own datasetdataset = load_dataset("allenai/sciq", split="train")   dataset = dataset.train_test_split(test_size=0.1, shuffle=True)   dataset_training_df = pd.DataFrame(dataset['train'])   dataset_validation_df = pd.DataFrame(dataset['test'])   dataset_training_df = dataset_training_df.sample(n=10000, random_state=42, ignore_index=True)   # prepare training dataset to fit autopilot job.   fields = ['question', 'correct_answer', 'support']dataset_train_ist_df = dataset_training_df[fields].copy()  dataset_fine_tune_ist = Dataset.from_pandas(dataset_train_ist_df)dataset_fine_tune_ist_cpy= dataset_train_ist_df.copy()dataset_fine_tune_ist_cpy["input"] = ("Below is an instruction that describes a task, paired with an input that provides further context.""Write a response that appropriately completes the request.\n\n### Instruction:\n"+ dataset_fine_tune_ist_cpy["question"]+ "\n\n### Input:\n" + dataset_fine_tune_ist_cpy["support"])dataset_fine_tune_ist_cpy["output"] = dataset_fine_tune_ist_cpy["correct_answer"]autopilot_fields = ['input', 'output']dataset_fine_tune = Dataset.from_pandas(dataset_fine_tune_ist_cpy[autopilot_fields])dataset_fine_tune.to_csv(train_dataset_s3_path, index=False)

Create an CreateAutoMLJobV2 training job

AutoMLV2 makes it straightforward to train, optimize, and deploy machine learning (ML) models by automating the tasks involved in the ML development lifecycle. It provides a simple approach to create highly accurate models tailored to your specific problem type, whether it’s classification, regression, forecasting, or others. In this section, we go through the steps to train a model with AutoMLV2, using an LLM fine-tuning job as an example. For this project, we used the Meta Llama2-7B model. You can change the model by choosing from the supported LLMs for fine-tuning.

Define the text generation configuration

AutoMLV2 automates the entire ML process, from data preprocessing to model training and deployment. However, for AutoMLV2 to work effectively, it’s crucial to provide the right problem configuration. This configuration acts as a guide, helping SageMaker Autopilot understand the nature of your problem and select the most appropriate algorithm or approach. By specifying details such as the problem type (such as classification, regression, forecasting, or fine-tuning), you give AutoMLV2 the necessary information to tailor its solution to your specific requirements.

For a fine-tuning job, the configuration consists of determining the model to be used and its access configuration, in addition to the hyperparameters that optimize the model learning process. See the following code:

text_generation_config = AutoMLTextGenerationConfig(    base_model_name= "Llama2-7B", accept_eula= True, text_generation_hyper_params={"epochCount": "3", "learningRate": "0.00001", "batchSize": "1", "learningRateWarmupSteps": "1"},)

The definitions of each parameter used in text_generation_config are:

epochCount – The number of times the model goes through the entire training dataset. Its value should be a string containing an integer value within the range of 1–10. One epoch means the Meta Llama2-7B model has been exposed to the 10,000 samples and had a chance to learn from them. You can set it to 3, meaning the model will make three complete passes, or increase the number, if the model doesn’t converge with just three epochs.

learningRate – The step size at which a model’s parameters are updated during training. Its value should be a string containing a floating-point value within the range of 0–1. A learning rate of 0,00001 or 0,00002 is a good standard when fine-tuning LLMs like Meta Llama2-7B.

batchSize – The number of data samples used in each iteration of training. Its value should be a string containing an integer value within the range of 1–64. Start with 1 in order to not receive an out-of-memory error.

learningRateWarmupSteps – The number of training steps during which the learning rate gradually increases before reaching its target or maximum value. Its value should be a string containing an integer value within the range of 0–250. Start with 1.

The configuration settings can be adjusted to align with your specific requirements and the chosen FM.

Start the AutoMLV2 job

Next, set up the AutoMLV2 job by providing the problem configuration details, the AWS role with the necessary permissions, a base name for job identification, and the output path where the model artifacts will be saved. To initiate the training process in a pipeline step, we invoked the create_auto_ml_job_v2 method. In the following code snippet, the create_auto_ml_job_v2 method is called to create an AutoML job object with specific inputs. The AutoMLJobInputDataConfig parameter takes a list that includes an AutoMLDataChannel, which specifies the type of data (in this case, ‘S3Prefix’) and the location of the training dataset (given by train_dataset_s3_path.default_value) in an S3 bucket. The channel_type is set to ‘training’, indicating that this dataset is used for training the model.

sagemaker_client.create_auto_ml_job_v2(AutoMLJobName=event["AutopilotJobName"],AutoMLJobInputDataConfig=[{ "ChannelType": "training", "CompressionType": "None", "ContentType": "text/csv;header=present","DataSource": {"S3DataSource": {"S3DataType": "S3Prefix",                                "S3Uri": event["TrainDatasetS3Path"],}}}],DataSplitConfig={'ValidationFraction':0.1},OutputDataConfig={"S3OutputPath": event["TrainingOutputS3Path"]}, AutoMLProblemTypeConfig={ "TextGenerationJobConfig":{"BaseModelName": event["BaseModelName"],'TextGenerationHyperParameters':{"epochCount": event["epochCount"],"learningRate": event["learningRate"],"batchSize": event["batchSize"],"learningRateWarmupSteps": event["learningRateWarmupSteps"]},'ModelAccessConfig':{'AcceptEula': True}}},    RoleArn=event["AutopilotExecutionRoleArn"],)

Check SageMaker Autopilot job status

This step tracks the status of the Autopilot training job. In the script check_autopilot_job_status.py, we repeatedly check the status of the training job until it’s complete.

The callback step sends a token in an Amazon Simple Queue Service (Amazon SQS) queue, which invokes the AWS Lambda function to check the training job status. If the job is complete, the Lambda function sends a success message back to the callback step and the pipeline continues with the next step.

Deploy a model with AutoMLV2 using real-time inference

AutoMLV2 simplifies the deployment of models by automating the entire process, from model training to deployment. It takes care of the heavy lifting involved in selecting the best-performing model and preparing it for production use.

Furthermore, AutoMLV2 simplifies the deployment process. It can directly create a SageMaker model from the best candidate model and deploy it to a SageMaker endpoint with just a few lines of code.

In this section, we look at the code that deploys the best-performing model to a real-time SageMaker endpoint.

This pipeline step uses a Lambda step, which runs a serverless Lambda function. We use a Lambda step because the API call to create and deploy the SageMaker model is lightweight.

The first stage after the completion of the AutoMLV2 training process is to select the best candidate, making sure that the most accurate and efficient solution is chosen for deployment. We use the method describe_auto_ml_job_v2 to retrieve detailed information about a specific AutoMLV2 job. This method provides insights into the current status, configuration, and output of your AutoMLV2 job, allowing you to monitor its progress and access relevant information. See the following code:

autopilot_job = sagemaker_client.describe_auto_ml_job_v2(AutoMLJobName= event['autopilot_job_name'])best_candidate = autopilot_job['BestCandidate']

In SageMaker Autopilot, the best candidate model is selected based on minimizing cross-entropy loss, a default metric that measures the dissimilarity between predicted and actual word distributions during fine-tuning. Additionally, the model’s quality is evaluated using metrics like ROUGE scores (ROUGE-1, ROUGE-2, ROUGE-L, and ROUGE-L-Sum), which measure the similarity between machine-generated text and human-written reference text, along with perplexity, which assesses how well the model predicts the next word in a sequence. The model with the lowest cross-entropy and perplexity, combined with strong ROUGE scores, is considered the best candidate.

With the best candidate model identified, you can create a SageMaker model object, encapsulating the trained model artifacts and necessary dependencies. For that, we use the method create_model for the AutoML job object:

best_candidate_name = best_candidate['CandidateName']response = sagemaker_client.create_model(ModelName=best_candidate_name,PrimaryContainer={'Image': autopilot_job["BestCandidate"]["InferenceContainers"][0].pop("Image"),'ModelDataUrl': autopilot_job["BestCandidate"]["InferenceContainers"][0].pop("ModelDataUrl"),'ImageConfig': {'RepositoryAccessMode': 'Platform',},'Environment': {"HUGGINGFACE_HUB_CACHE": "/tmp", "TRANSFORMERS_CACHE": "/tmp","HF_MODEL_ID": "/opt/ml/model"}},ExecutionRoleArn=event["AutopilotExecutionRoleArn"])

Next, we create a SageMaker endpoint configuration and deploy a SageMaker endpoint for real-time inference using the best candidate model. We use the instance type ml.g5.12xlarge to deploy the model. You may need to increase your quota to use this instance.

endpoint_name = f"ep-{model_name}-automl"endpoint_config_name = f"{model_name}-endpoint-config"endpoint_configuration = sagemaker_client.create_endpoint_config(EndpointConfigName = endpoint_config_name,ProductionVariants=[{'VariantName': "Variant-1",'ModelName': model_name,'InstanceType': "ml.g5.12xlarge",'InitialInstanceCount': 1,}],)response = sagemaker_client.create_endpoint(EndpointName=endpoint_name,EndpointConfigName=endpoint_config_name)endpoint_arn = response["EndpointArn"] 

Inference pipeline

The inference pipeline is used for batch inference. It demonstrates a way to deploy and evaluate an FM and register it in SageMaker Model Registry. The following diagram shows the workflow starting with a preprocess data step, through model inference, to post-inference evaluation and conditional model registration.

Preprocess data for evaluation

The first crucial step in evaluating the performance of the fine-tuned LLM is to preprocess the data for evaluation. This preprocessing stage involves transforming the data into a format suitable for the evaluation process and verifying the compatibility with the chosen evaluation library.

In this particular case, we use a pipeline step to prepare the data for evaluation. The preprocessing script (preprocess_evaluation.py) creates a .jsonl (JSON Lines) file, which serves as the test dataset for the evaluation phase. The JSON Lines format is a convenient way to store structured data, where each line represents a single JSON object.

This test dataset is crucial for obtaining an unbiased evaluation of the model’s generalization capabilities and its ability to handle new, previously unseen inputs. After the evaluation_dataset.jsonl file is created, it’s saved in the appropriate path in an Amazon Simple Storage Service (Amazon S3) bucket.

Evaluate the model using the fmeval library

SageMaker Autopilot streamlines the entire ML workflow, automating steps from data preprocessing to model evaluation. After training multiple models, SageMaker Autopilot automatically ranks them based on selected performance metrics, such as cross-entropy loss for text generation tasks, and identifies the best-performing model.

However, when deeper, more granular insights are required, particularly during post-training evaluation with a testing dataset, we use fmeval, an open source library tailored for fine-tuning and evaluating FMs. Fmeval provides enhanced flexibility and control, allowing for a comprehensive assessment of model performance using custom metrics tailored to the specific use case. This makes sure the model behaves as expected in real-world applications. Fmeval facilitates the evaluation of LLMs across a broad range of tasks, including open-ended text generation, summarization, question answering, and classification. Additionally, fmeval assesses models on metrics such as accuracy, toxicity, semantic robustness, and prompt stereotyping, helping identify the optimal model for diverse use cases while maintaining ethical and robust performance.

To start using the library, follow these steps:

    Create a ModelRunner that can perform invocation on your LLM. ModelRunner encapsulates the logic for invoking different types of LLMs, exposing a predict method to simplify interactions with LLMs within the eval algorithm code. For this project, we use SageMakerModelRunner from fmeval. In the file py used by our pipeline, create a DataConfig object to use the evaluation_dataset created in the previous step. Next, use an evaluation algorithm with the custom dataset. For this project, we use the QAAccuracy algorithm, which measures how well the model performs in question answering tasks. The model is queried for a range of facts, and we evaluate the accuracy of its response by comparing the model’s output to target answers under different metrics:
      Exact match (EM) – Binary score. 1 if model output and target answer match exactly. Quasi-exact match – Binary score. Similar to exact match, but both model output and target answer are normalized first by removing articles and punctuation. Precision over words – The fraction of words in the prediction that are also found in the target answer. The text is normalized as before. Recall over words – The fraction of words in the target answer that are also found in the prediction. F1 over words – The harmonic mean of precision and recall over words (normalized).

As an output, the evaluation step produces a file (evaluation_metrics.json) that contains the computed metrics. This file is stored in Amazon S3 and is registered as a property file for later access in the pipeline.

Register the model

Before registering the fine-tuned model, we introduce a quality control step by implementing a condition based on the evaluation metrics obtained from the previous step. Specifically, we focus on the F1 score metric, which measures the harmonic mean of precision and recall between the normalized response and reference.

To make sure that only high-performing models are registered and deployed, we set a predetermined threshold for the F1 score metric. If the model’s performance meets or exceeds this threshold, it is suitable for registration and deployment. However, if the model fails to meet the specified threshold, the pipeline concludes without registering the model, stopping the deployment of suboptimal models.

Create and run the pipeline

After we define the pipeline steps, we combine them into a SageMaker pipeline. The steps are run sequentially. The pipeline runs the steps for an AutoML job, using SageMaker Autopilot for training, model evaluation, and model registration. See the following code:

pipeline = Pipeline(name="training-pipeline",parameters=[evaluation_dataset_s3_path,model_name,metrics_report_s3_path, output_s3_path,model_package_name,model_approval_status],steps=[step_preprocess_evaluation_data,step_evaluate_autopilot_model,step_condition,step_register_autopilot_model],sagemaker_session=sagemaker_session,)pipeline.upsert(role_arn=SAGEMAKER_EXECUTION_ROLE_ARN)pipeline_execution = pipeline.start()pipeline_execution.wait(delay=20, max_attempts=24 * 60 * 3)  # max wait: 24 hours

Clean up

To avoid unnecessary charges and maintain a clean environment after running the demos outlined in this post, it’s important to delete all deployed resources. Follow these steps to properly clean up:

    To delete deployed endpoints, use the SageMaker console or the AWS SDK. This step is essential because endpoints can accrue significant charges if left running. Delete both SageMaker pipelines created during this walkthrough. This will help prevent residual executions that might generate additional costs. Remove all artifacts stored in your S3 buckets that were used for training, storing model artifacts, or logging. Make sure you delete only the resources related to this project to help avoid data loss. Clean up any additional resources. Depending on your implementation and any additional configurations, there may be other resources to consider, such as IAM roles, Amazon CloudWatch logs, or other AWS services. Identify and delete any resources that are no longer needed.

Conclusion

In this post, we explored how AutoMLV2 streamlines the process of fine-tuning FMs by automating the heavy lifting involved in model development. We demonstrated an end-to-end solution that uses SageMaker Pipelines to orchestrate the steps of data preparation, model training, evaluation, and deployment. The fmeval library played a crucial role in assessing the fine-tuned LLM’s performance, enabling us to select the best-performing model based on relevant metrics. By seamlessly integrating with the SageMaker infrastructure, AutoMLV2 simplified the deployment process, allowing us to create a SageMaker endpoint for real-time inference with just a few lines of code.

Get started by accessing the code on the GitHub repo to train and deploy your own custom AutoML models.

For more information on SageMaker Pipelines and SageMaker Autopilot, refer to Amazon SageMaker Pipelines and SageMaker Autopilot, respectively.


About the Author

Hajer Mkacher is a Solutions Architect at AWS, specializing in the Healthcare and Life Sciences industries. With over a decade in software engineering, she leverages generative AI to create innovative solutions, acting as a trusted advisor to her customers. In her free time, Hajer enjoys painting or working on creative robotics projects with her family.

Fish AI Reader

Fish AI Reader

AI辅助创作,多种专业模板,深度分析,高质量内容生成。从观点提取到深度思考,FishAI为您提供全方位的创作支持。新版本引入自定义参数,让您的创作更加个性化和精准。

FishAI

FishAI

鱼阅,AI 时代的下一个智能信息助手,助你摆脱信息焦虑

联系邮箱 441953276@qq.com

相关标签

基础模型 AutoMLV2 Llama2-7B 微调 问答
相关文章