AWS Machine Learning Blog, November 20, 2024
Customize small language models on AWS with automotive terminology

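As artificial intelligence advances rapidly, customizing language models to meet the needs of specific industries is becoming increasingly important. Large language models (LLMs) handle a wide range of natural language tasks well, but they can struggle in specialized domains; in automotive diagnostics, for example, they may lack an understanding of specialized terminology and specific fault codes. Small language models (SLMs) have drawn attention as a response to this problem because they are efficient and easy to customize and deploy. This post describes how to customize an SLM on AWS to improve the accuracy and efficiency of question answering in the automotive domain, and walks through a case study of fine-tuning, deploying, and evaluating an SLM on the Automotive_NER dataset, ultimately achieving more precise automotive fault diagnosis.

🤔 **Large language models (LLMs) have limitations in specific domains such as the automotive industry:** Because they lack an understanding of specialized terminology and specific fault codes, LLMs may produce incorrect diagnoses or no answer at all in tasks such as automotive diagnostics. For example, when a user describes an “engine running rough” without providing a specific fault code, an LLM may offer many irrelevant suggestions.

🚀 **Small language models (SLMs) are an ideal choice for customized solutions:** SLMs are cheap to train and deploy, fast at inference, and light on resources, which makes them well suited to deployment on a variety of devices, for example in edge computing scenarios. Through fine-tuning, an SLM can quickly adapt to the needs of a specific domain such as automotive diagnostics.

💡 **Efficient techniques such as LoRA make model customization more efficient:** Techniques such as LoRA make customization easier and lower the barrier to entry, so more users can customize and deploy their own language models.

🧰 **Solutions from AWS:** AWS offers services such as Amazon Bedrock and Amazon SageMaker that help users build generative AI applications and deploy customized models, making it convenient to fine-tune, deploy, and evaluate SLMs.

📚 **The Automotive_NER dataset supports model training:** The Automotive_NER dataset contains a large amount of automotive text data, such as car brands, models, components, and defect descriptions. Its rich specialized terminology helps an SLM learn automotive domain knowledge and improves the model's accuracy and professionalism.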

In the rapidly evolving world of AI, the ability to customize language models for specific industries has become more important. Although large language models (LLMs) are adept at handling a wide range of natural language tasks, they perform better on general-purpose tasks than on specialized ones. This can create challenges when processing text data from highly specialized domains with their own distinct terminology, or in specialized tasks where the intrinsic knowledge of the LLM is not well suited to solutions such as Retrieval Augmented Generation (RAG).

For instance, in the automotive industry, users might not always provide specific diagnostic trouble codes (DTCs), which are often proprietary to each manufacturer. These codes, such as P0300 for a generic engine misfire or C1201 for an ABS system fault, are crucial for precise diagnosis. Without these specific codes, a general purpose LLM might struggle to provide accurate information. This lack of specificity can lead to hallucinations in the generated responses, where the model invents plausible but incorrect diagnoses, or sometimes result in no answers at all. For example, if a user simply describes “engine running rough” without providing the specific DTC, a general LLM might suggest a wide range of potential issues, some of which may be irrelevant to the actual problem, or fail to provide any meaningful diagnosis due to insufficient context. Similarly, in tasks like code generation and suggestions through chat-based applications, users might not specify the APIs they want to use. Instead, they often request help in resolving a general issue or in generating code that utilizes proprietary APIs and SDKs.

Moreover, generative AI applications for consumers can offer valuable insights into the types of interactions from end-users. With appropriate feedback mechanisms, these applications can also gather important data to continuously improve the behavior and responses generated by these models.

For these reasons, there is a growing trend in the adoption and customization of small language models (SLMs). SLMs are compact transformer models, primarily utilizing decoder-only or encoder-decoder architectures, typically with 1–8 billion parameters. They are generally more efficient and cost-effective to train and deploy than LLMs, and are highly effective when fine-tuned for specific domains or tasks. SLMs offer faster inference times and lower resource requirements, and are suitable for deployment on a wider range of devices, making them particularly valuable for specialized applications and edge computing scenarios. Additionally, more efficient techniques for customizing both LLMs and SLMs, such as Low Rank Adaptation (LoRA), are making these capabilities increasingly accessible to a broader range of customers.

AWS offers a wide range of solutions for interacting with language models. Amazon Bedrock is a fully managed service that offers foundation models (FMs) from Amazon and other AI companies to help you build generative AI applications and host customized models. Amazon SageMaker is a comprehensive, fully managed machine learning (ML) service to build, train, and deploy LLMs and other FMs at scale. You can fine-tune and deploy models with Amazon SageMaker JumpStart or directly through Hugging Face containers.

In this post, we guide you through the phases of customizing SLMs on AWS, with a specific focus on automotive terminology for diagnostics as a Q&A task. We begin with the data analysis phase and progress through the end-to-end process, covering fine-tuning, deployment, and evaluation. We compare a customized SLM with a general purpose LLM, using various metrics to assess vocabulary richness and overall accuracy. The goal is a clear understanding of how to customize language models for the automotive domain and what benefits this brings. Although this post focuses on the automotive domain, the approaches are applicable to other domains. You can find the source code for the post in the associated GitHub repository.

Solution overview

This solution uses multiple features of SageMaker and Amazon Bedrock, and can be divided into four main steps:

The following diagram illustrates the solution architecture.

Using the Automotive_NER dataset

The Automotive_NER dataset, available on the Hugging Face platform, is designed for named entity recognition (NER) tasks specific to the automotive domain. This dataset is specifically curated to help identify and classify various entities related to the automotive industry and uses domain-specific terminologies.

The dataset contains approximately 256,000 rows; each row contains annotated text data with entities related to the automotive domain, such as car brands, models, components, descriptions of defects, consequences, and corrective actions. The terminology used to describe defects, refer to components, or report error codes is standard in the automotive industry. The fine-tuning process enables the language model to learn the domain terminology better, which helps improve both the vocabulary used in the generated answers and their overall accuracy.

The following table is an example of rows contained in the dataset.

| COMPNAME | DESC_DEFECT | CONEQUENCE_DEFECT | CORRECTIVE_ACTION |
| --- | --- | --- | --- |
| ELECTRICAL SYSTEM:12V/24V/48V BATTERY:CABLES | CERTAIN PASSENGER VEHICLES EQUIPPED WITH ZETEC ENGINES, LOOSE OR BROKEN ATTACHMENTS AND MISROUTED BATTERY CABLES COULD LEAD TO CABLE INSULATION DAMAGE. | THIS, IN TURN, COULD CAUSE THE BATTERY CABLES TO SHORT RESULTING IN HEAT DAMAGE TO THE CABLES. BESIDES HEAT DAMAGE, THE “CHECK ENGINE” LIGHT MAY ILLUMINATE, THE VEHICLE MAY FAIL TO START, OR SMOKE, MELTING, OR FIRE COULD ALSO OCCUR. | DEALERS WILL INSPECT THE BATTERY CABLES FOR THE CONDITION OF THE CABLE INSULATION AND PROPER TIGHTENING OF THE TERMINAL ENDS. AS NECESSARY, CABLES WILL BE REROUTED, RETAINING CLIPS INSTALLED, AND DAMAGED BATTERY CABLES REPLACED. OWNER NOTIFICATION BEGAN FEBRUARY 10, 2003. OWNERS WHO DO NOT RECEIVE THE FREE REMEDY WITHIN A REASONABLE TIME SHOULD CONTACT FORD AT 1-866-436-7332. |
| ELECTRICAL SYSTEM:12V/24V/48V BATTERY:CABLES | CERTAIN PASSENGER VEHICLES EQUIPPED WITH ZETEC ENGINES, LOOSE OR BROKEN ATTACHMENTS AND MISROUTED BATTERY CABLES COULD LEAD TO CABLE INSULATION DAMAGE. | THIS, IN TURN, COULD CAUSE THE BATTERY CABLES TO SHORT RESULTING IN HEAT DAMAGE TO THE CABLES. BESIDES HEAT DAMAGE, THE “CHECK ENGINE” LIGHT MAY ILLUMINATE, THE VEHICLE MAY FAIL TO START, OR SMOKE, MELTING, OR FIRE COULD ALSO OCCUR. | DEALERS WILL INSPECT THE BATTERY CABLES FOR THE CONDITION OF THE CABLE INSULATION AND PROPER TIGHTENING OF THE TERMINAL ENDS. AS NECESSARY, CABLES WILL BE REROUTED, RETAINING CLIPS INSTALLED, AND DAMAGED BATTERY CABLES REPLACED. OWNER NOTIFICATION BEGAN FEBRUARY 10, 2003. OWNERS WHO DO NOT RECEIVE THE FREE REMEDY WITHIN A REASONABLE TIME SHOULD CONTACT FORD AT 1-866-436-7332. |
| EQUIPMENT:OTHER:LABELS | ON CERTAIN FOLDING TENT CAMPERS, THE FEDERAL CERTIFICATION (AND RVIA) LABELS HAVE THE INCORRECT GROSS VEHICLE WEIGHT RATING, TIRE SIZE, AND INFLATION PRESSURE LISTED. | IF THE TIRES WERE INFLATED TO 80 PSI, THEY COULD BLOW RESULTING IN A POSSIBLE CRASH. | OWNERS WILL BE MAILED CORRECT LABELS FOR INSTALLATION ON THEIR VEHICLES. OWNER NOTIFICATION BEGAN SEPTEMBER 23, 2002. OWNERS SHOULD CONTACT JAYCO AT 1-877-825-4782. |
| STRUCTURE | ON CERTAIN CLASS A MOTOR HOMES, THE FLOOR TRUSS NETWORK SUPPORT SYSTEM HAS A POTENTIAL TO WEAKEN CAUSING INTERNAL AND EXTERNAL FEATURES TO BECOME MISALIGNED. THE AFFECTED VEHICLES ARE 1999 – 2003 CLASS A MOTOR HOMES MANUFACTURED ON F53 20,500 POUND GROSS VEHICLE WEIGHT RATING (GVWR), FORD CHASSIS, AND 2000-2003 CLASS A MOTOR HOMES MANUFACTURED ON W-22 22,000 POUND GVWR, WORKHORSE CHASSIS. | CONDITIONS CAN RESULT IN THE BOTTOMING OUT THE SUSPENSION AND AMPLIFICATION OF THE STRESS PLACED ON THE FLOOR TRUSS NETWORK. THE ADDITIONAL STRESS CAN RESULT IN THE FRACTURE OF WELDS SECURING THE FLOOR TRUSS NETWORK SYSTEM TO THE CHASSIS FRAME RAIL AND/OR FRACTURE OF THE FLOOR TRUSS NETWORK SUPPORT SYSTEM. THE POSSIBILITY EXISTS THAT THERE COULD BE DAMAGE TO ELECTRICAL WIRING AND/OR FUEL LINES WHICH COULD POTENTIALLY LEAD TO A FIRE. | DEALERS WILL INSPECT THE FLOOR TRUSS NETWORK SUPPORT SYSTEM, REINFORCE THE EXISTING STRUCTURE, AND REPAIR, AS NEEDED, THE FLOOR TRUSS NETWORK SUPPORT. OWNER NOTIFICATION BEGAN NOVEMBER 5, 2002. OWNERS SHOULD CONTACT MONACO AT 1-800-685-6545. |
| STRUCTURE | ON CERTAIN CLASS A MOTOR HOMES, THE FLOOR TRUSS NETWORK SUPPORT SYSTEM HAS A POTENTIAL TO WEAKEN CAUSING INTERNAL AND EXTERNAL FEATURES TO BECOME MISALIGNED. THE AFFECTED VEHICLES ARE 1999 – 2003 CLASS A MOTOR HOMES MANUFACTURED ON F53 20,500 POUND GROSS VEHICLE WEIGHT RATING (GVWR), FORD CHASSIS, AND 2000-2003 CLASS A MOTOR HOMES MANUFACTURED ON W-22 22,000 POUND GVWR, WORKHORSE CHASSIS. | CONDITIONS CAN RESULT IN THE BOTTOMING OUT THE SUSPENSION AND AMPLIFICATION OF THE STRESS PLACED ON THE FLOOR TRUSS NETWORK. THE ADDITIONAL STRESS CAN RESULT IN THE FRACTURE OF WELDS SECURING THE FLOOR TRUSS NETWORK SYSTEM TO THE CHASSIS FRAME RAIL AND/OR FRACTURE OF THE FLOOR TRUSS NETWORK SUPPORT SYSTEM. THE POSSIBILITY EXISTS THAT THERE COULD BE DAMAGE TO ELECTRICAL WIRING AND/OR FUEL LINES WHICH COULD POTENTIALLY LEAD TO A FIRE. | DEALERS WILL INSPECT THE FLOOR TRUSS NETWORK SUPPORT SYSTEM, REINFORCE THE EXISTING STRUCTURE, AND REPAIR, AS NEEDED, THE FLOOR TRUSS NETWORK SUPPORT. OWNER NOTIFICATION BEGAN NOVEMBER 5, 2002. OWNERS SHOULD CONTACT MONACO AT 1-800-685-6545. |

Data analysis and preparation on SageMaker Studio

When you’re fine-tuning LLMs, the quality and composition of your training data are crucial (quality over quantity). For this post, we implemented a sophisticated method to select 6,000 rows out of 256,000. This method uses TF-IDF vectorization to identify the most significant and the rarest words in the dataset. By selecting rows containing these words, we maintained a balanced representation of common patterns and edge cases. This improves computational efficiency and creates a high-quality, diverse subset leading to effective model training.

The first step is to open a JupyterLab application previously created in our SageMaker Studio domain.

After you clone the git repository, install the required libraries and dependencies:

pip install -r requirements.txt

The next step is to read the dataset:

from datasets import load_dataset
import pandas as pd

dataset = load_dataset("sp01/Automotive_NER")
df = pd.DataFrame(dataset['train'])

The first step of our data preparation activity is to analyze the importance of the words in our dataset, for identifying both the most important (frequent and distinctive) words and the rarest words in the dataset, by using Term Frequency-Inverse Document Frequency (TF-IDF) vectorization.
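To make the ranking idea concrete, here is a minimal pure-Python sketch of scoring words by their summed TF-IDF weight. It assumes whitespace tokenization and uses the smoothed IDF formula and L2 row normalization that scikit-learn documents as the `TfidfVectorizer` defaults; the three-sentence corpus and the helper name `tfidf_word_scores` are made up for illustration and are not part of the original post.

```python
import math
from collections import Counter


def tfidf_word_scores(docs):
    """Sum each word's TF-IDF weight over all documents (illustrative helper)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each word
    doc_freq = Counter(word for tokens in tokenized for word in set(tokens))
    # Smoothed inverse document frequency
    idf = {w: math.log((1 + n_docs) / (1 + df)) + 1 for w, df in doc_freq.items()}

    scores = Counter()
    for tokens in tokenized:
        if not tokens:
            continue  # skip empty documents to avoid dividing by zero
        weights = {w: tf * idf[w] for w, tf in Counter(tokens).items()}
        norm = math.sqrt(sum(v * v for v in weights.values()))  # L2-normalize the row
        for w, v in weights.items():
            scores[w] += v / norm
    return scores


# Hypothetical toy corpus standing in for the defect descriptions
docs = [
    "battery cable insulation damage",
    "battery cable short",
    "tire label incorrect",
]
scores = tfidf_word_scores(docs)
ranked = sorted(scores, key=scores.get)  # lowest summed score first
bottom_words, top_words = ranked[:2], ranked[-2:]
```

In this toy corpus the words that recur across documents ("battery", "cable") accumulate the highest summed score, while words that appear once in a long document end up at the bottom of the ranking.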

Given the dataset’s size, we decided to run the fine-tuning job using Amazon SageMaker Training.

By using the @remote function capability of the SageMaker Python SDK, we can run our code as a remote job with ease.

In our case, the TF-IDF vectorization and the extraction of the top words and bottom words are performed in a SageMaker training job directly from our notebook, without any code changes, by simply adding the @remote decorator on top of our function. You can define the configurations required by the SageMaker training job, such as dependencies and training image, in a config.yaml file. For more details on the settings supported by the config file, see Using the SageMaker Python SDK.

See the following code:

SchemaVersion: '1.0'
SageMaker:
  PythonSDK:
    Modules:
      RemoteFunction:
        Dependencies: ./requirements.txt
        ImageUri: '763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:2.4-gpu-py311'
        InstanceType: ml.g5.12xlarge
        PreExecutionCommands:
          - 'export NCCL_P2P_DISABLE=1'
  Model:
    EnableNetworkIsolation: false

The next step is to define and execute our processing function:

import numpy as np
import re
from sagemaker.remote_function import remote
from sklearn.feature_extraction.text import TfidfVectorizer
import string

@remote(volume_size=10, job_name_prefix="preprocess-auto-ner-auto-merge", instance_type="ml.m4.10xlarge")
def preprocess(df,
               top_n=6000,
               bottom_n=6000
    ):
    # Download nltk stopwords
    import nltk
    nltk.download('stopwords')
    from nltk.corpus import stopwords

    # Build the stop word set once instead of on every call
    stop_words = set(stopwords.words('english'))

    # Define a function to preprocess text
    def preprocess_text(text):
        if not isinstance(text, str):
            # Return an empty string or handle the non-string value as needed
            return ''

        # Remove punctuation
        text = re.sub(r'[%s]' % re.escape(string.punctuation), '', text)

        # Convert to lowercase
        text = text.lower()

        # Remove stop words (optional)
        text = ' '.join([word for word in text.split() if word not in stop_words])

        return text

    print("Applying text preprocessing")

    # Preprocess the text columns
    df['DESC_DEFECT'] = df['DESC_DEFECT'].apply(preprocess_text)
    df['CONEQUENCE_DEFECT'] = df['CONEQUENCE_DEFECT'].apply(preprocess_text)
    df['CORRECTIVE_ACTION'] = df['CORRECTIVE_ACTION'].apply(preprocess_text)

    # Create a TfidfVectorizer object
    tfidf_vectorizer = TfidfVectorizer()

    print("Compute TF-IDF")

    # Fit and transform the text data
    X_tfidf = tfidf_vectorizer.fit_transform(df['DESC_DEFECT'] + ' ' + df['CONEQUENCE_DEFECT'] + ' ' + df['CORRECTIVE_ACTION'])

    # Get the feature names (words)
    feature_names = tfidf_vectorizer.get_feature_names_out()

    # Get the TF-IDF scores and sum them per word across all rows
    tfidf_scores = X_tfidf.toarray()
    word_scores = tfidf_scores.sum(axis=0)

    # Indices of the words with the highest and lowest summed scores
    sorted_indices = np.argsort(word_scores)
    top_word_indices = sorted_indices[-top_n:]
    bottom_word_indices = sorted_indices[:bottom_n]

    print("Extracting top and bottom words")

    # Get the top and bottom words
    top_words = [feature_names[i] for i in top_word_indices]
    bottom_words = [feature_names[i] for i in bottom_word_indices]

    return top_words, bottom_words

top_words, bottom_words = preprocess(df)

After we extract the top and bottom 6,000 words based on their TF-IDF scores from our original dataset, we classify each row in the dataset based on whether it contains any of these important or rare words. Rows are labeled as ‘top’ if they contain important words, ‘bottom’ if they contain rare words, or ‘neither’ if they contain neither:

# Create a function to check if a row contains important or rare words
def contains_important_or_rare_words(row):
    try:
        if ("DESC_DEFECT" in row.keys() and row["DESC_DEFECT"] is not None and
            "CONEQUENCE_DEFECT" in row.keys() and row["CONEQUENCE_DEFECT"] is not None and
            "CORRECTIVE_ACTION" in row.keys() and row["CORRECTIVE_ACTION"] is not None):
            text = row['DESC_DEFECT'] + ' ' + row['CONEQUENCE_DEFECT'] + ' ' + row['CORRECTIVE_ACTION']

            text_words = set(text.split())

            # Check if the row contains any important words (top_words)
            for word in top_words:
                if word in text_words:
                    return 'top'

            # Check if the row contains any rare words (bottom_words)
            for word in bottom_words:
                if word in text_words:
                    return 'bottom'

            return 'neither'
        else:
            return 'none'
    except Exception as e:
        raise e

df['word_type'] = df.apply(contains_important_or_rare_words, axis=1)

Next, we create a balanced subset of the dataset by selecting all rows containing important words (‘top’) and an equal number of rows containing rare words (‘bottom’). If there aren’t enough ‘bottom’ rows, we fill the remaining slots with ‘neither’ rows.

| | DESC_DEFECT | CONEQUENCE_DEFECT | CORRECTIVE_ACTION | word_type |
| --- | --- | --- | --- | --- |
| 2 | ON CERTAIN FOLDING TENT CAMPERS, THE FEDERAL C… | IF THE TIRES WERE INFLATED TO 80 PSI, THEY COU… | OWNERS WILL BE MAILED CORRECT LABELS FOR INSTA… | top |
| 2402 | CERTAIN PASSENGER VEHICLES EQUIPPED WITH DUNLO… | THIS COULD RESULT IN PREMATURE TIRE WEAR. | DEALERS WILL INSPECT AND IF NECESSARY REPLACE … | bottom |
| 0 | CERTAIN PASSENGER VEHICLES EQUIPPED WITH ZETEC… | THIS, IN TURN, COULD CAUSE THE BATTERY CABLES … | DEALERS WILL INSPECT THE BATTERY CABLES FOR TH… | neither |

Finally, we randomly sample 6,000 rows from this balanced set:

# Select all rows from each group
top_rows = df[df['word_type'] == 'top']
bottom_rows = df[df['word_type'] == 'bottom']
# Keep the 'neither' rows before df is reassigned, so they remain available as filler
neither_rows = df[df['word_type'] == 'neither']

# Combine the two groups, ensuring a balanced dataset
if len(bottom_rows) > 0:
    df = pd.concat([top_rows, bottom_rows.sample(n=len(bottom_rows), random_state=42)], ignore_index=True)
else:
    df = top_rows.copy()

# If the combined dataset has fewer than 6000 rows, top it up to 6010 with 'neither' rows
if len(df) < 6000:
    remaining_rows = neither_rows.sample(n=6010 - len(df), random_state=42)
    df = pd.concat([df, remaining_rows], ignore_index=True)

df = df.sample(n=6000, random_state=42)

Fine-tuning Meta Llama 3.1 8B with a SageMaker training job

After selecting the data, we need to prepare the resulting dataset for the fine-tuning activity. By examining the columns, we aim to adapt the model for two different tasks:

The following code is for the first prompt:

# User: {MFGNAME}{COMPNAME}{DESC_DEFECT}
# AI: {CONEQUENCE_DEFECT}

With this prompt, we instruct the model to highlight the possible consequences of a defect, given the manufacturer, component name, and description of the defect.

The following code is for the second prompt:

# User: {MFGNAME}{COMPNAME}{DESC_DEFECT}
# AI: {CORRECTIVE_ACTION}

With this second prompt, we instruct the model to suggest possible corrective actions for a given defect and component of a specific manufacturer.

First, let’s split the dataset into train, test, and validation subsets:

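from sklearn.model_selection import train_test_split

# Hold out 10% of the rows for testing
train, test = train_test_split(df, test_size=0.1, random_state=42)
# An integer test_size holds out that many rows (10) for validation
train, valid = train_test_split(train, test_size=10, random_state=42)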

Next, we create prompt templates to convert each row item into the two prompt formats previously described:

# template dataset to add prompt to each sample
def template_dataset_consequence(sample):
    # custom instruct prompt start
    prompt_template = f"""
    <|begin_of_text|><|start_header_id|>user<|end_header_id|>
    These are the information related to the defect

    Manufacturer: {{mfg_name}}
    Component: {{comp_name}}
    Description of a defect:
    {{desc_defect}}

    What are the consequences of defect?
    <|eot_id|><|start_header_id|>assistant<|end_header_id|>
    {{consequence_defect}}
    <|end_of_text|><|eot_id|>
    """
    sample["text"] = prompt_template.format(
        mfg_name=sample["MFGNAME"],
        comp_name=sample["COMPNAME"],
        desc_defect=sample["DESC_DEFECT"].lower(),
        consequence_defect=sample["CONEQUENCE_DEFECT"].lower())
    return sample

# template dataset to add prompt to each sample
def template_dataset_corrective_action(sample):
    # custom instruct prompt start
    prompt_template = f"""
    <|begin_of_text|><|start_header_id|>user<|end_header_id|>
    Manufacturer: {{mfg_name}}
    Component: {{comp_name}}

    Description of a defect:
    {{desc_defect}}

    What are the possible corrective actions?
    <|eot_id|><|start_header_id|>assistant<|end_header_id|>
    {{corrective_action}}
    <|end_of_text|><|eot_id|>
    """
    sample["text"] = prompt_template.format(
        mfg_name=sample["MFGNAME"],
        comp_name=sample["COMPNAME"],
        desc_defect=sample["DESC_DEFECT"].lower(),
        corrective_action=sample["CORRECTIVE_ACTION"].lower())
    return sample

Now we can apply the template functions template_dataset_consequence and template_dataset_corrective_action to our datasets:

As a final step, we concatenate the four resulting datasets for train and test:

Our final training dataset comprises approximately 12,000 elements, properly split into about 11,000 for training and 1,000 for testing.
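The sizes quoted above follow from the split parameters and from the fact that every row is rendered with both prompt templates. The short calculation below is only a sanity check, assuming the 6,000 selected rows, the 10% test split, and the 10-row validation split described earlier; the variable names are illustrative.

```python
selected_rows = 6000

# First split: test_size=0.1 holds out one tenth of the rows for testing
test_rows = selected_rows // 10
train_rows = selected_rows - test_rows

# Second split: test_size=10 holds out exactly 10 rows for validation
valid_rows = 10
train_rows -= valid_rows

# Each row yields two samples, one per prompt template
prompts_per_row = 2
train_samples = train_rows * prompts_per_row
test_samples = test_rows * prompts_per_row
total_samples = train_samples + test_samples

print(train_samples, test_samples, total_samples)
```

This gives 10,780 training samples and 1,200 test samples, 11,980 in total, which matches the rounded figures of about 11,000, 1,000, and 12,000.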

Now we can prepare the training script and define the training function train_fn and put the @remote decorator on the function.

The training function does the following:

Because we want to distribute the training across all the available GPUs in our instance, by using PyTorch Distributed Data Parallel (DDP), we use the Hugging Face Accelerate library that enables us to run the same PyTorch code across distributed configurations.

To optimize memory usage, we run mixed precision training:

from accelerate import Accelerator
from huggingface_hub import login
from peft import AutoPeftModelForCausalLM, LoraConfig, get_peft_model, prepare_model_for_kbit_training
from sagemaker.remote_function import remote
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, set_seed
import transformers

# Start training
# Note: model_id must already be defined when this decorator is evaluated
@remote(
    keep_alive_period_in_seconds=0,
    volume_size=100, job_name_prefix=f"train-{model_id.split('/')[-1].replace('.', '-')}-auto",
    use_torchrun=True,
    nproc_per_node=4)
def train_fn(
        model_name,
        train_ds,
        test_ds=None,
        lora_r=8,
        lora_alpha=16,
        lora_dropout=0.1,
        per_device_train_batch_size=8,
        per_device_eval_batch_size=8,
        gradient_accumulation_steps=1,
        learning_rate=2e-4,
        num_train_epochs=1,
        fsdp="",
        fsdp_config=None,
        gradient_checkpointing=False,
        merge_weights=False,
        seed=42,
        token=None
):
    set_seed(seed)

    accelerator = Accelerator()

    if token is not None:
        login(token=token)

    tokenizer = AutoTokenizer.from_pretrained(model_name)

    # Set Tokenizer pad Token
    tokenizer.pad_token = tokenizer.eos_token

    with accelerator.main_process_first():
        # tokenize and chunk dataset
        lm_train_dataset = train_ds.map(
            lambda sample: tokenizer(sample["text"]), remove_columns=list(train_ds.features)
        )

        print(f"Total number of train samples: {len(lm_train_dataset)}")

        if test_ds is not None:
            lm_test_dataset = test_ds.map(
                lambda sample: tokenizer(sample["text"]), remove_columns=list(test_ds.features)
            )

            print(f"Total number of test samples: {len(lm_test_dataset)}")
        else:
            lm_test_dataset = None

    torch_dtype = torch.bfloat16

    # Defining additional configs for FSDP
    if fsdp != "" and fsdp_config is not None:
        bnb_config_params = {
            "bnb_4bit_quant_storage": torch_dtype
        }

        model_configs = {
            "torch_dtype": torch_dtype
        }

        fsdp_configurations = {
            "fsdp": fsdp,
            "fsdp_config": fsdp_config,
            "gradient_checkpointing_kwargs": {
                "use_reentrant": False
            },
            "tf32": True
        }
    else:
        bnb_config_params = dict()
        model_configs = dict()
        fsdp_configurations = dict()

    # 4-bit NF4 quantization with bfloat16 compute
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_use_double_quant=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch_dtype,
        **bnb_config_params
    )

    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        trust_remote_code=True,
        quantization_config=bnb_config,
        attn_implementation="flash_attention_2",
        use_cache=not gradient_checkpointing,
        cache_dir="/tmp/.cache",
        **model_configs
    )

    if fsdp == "" and fsdp_config is None:
        model = prepare_model_for_kbit_training(model, use_gradient_checkpointing=gradient_checkpointing)

    if gradient_checkpointing:
        model.gradient_checkpointing_enable()

    # Attach LoRA adapters to all linear layers
    config = LoraConfig(
        r=lora_r,
        lora_alpha=lora_alpha,
        target_modules="all-linear",
        lora_dropout=lora_dropout,
        bias="none",
        task_type="CAUSAL_LM"
    )

    model = get_peft_model(model, config)
    # Report how many parameters the adapters make trainable
    model.print_trainable_parameters()

    trainer = transformers.Trainer(
        model=model,
        train_dataset=lm_train_dataset,
        eval_dataset=lm_test_dataset,
        args=transformers.TrainingArguments(
            per_device_train_batch_size=per_device_train_batch_size,
            per_device_eval_batch_size=per_device_eval_batch_size,
            gradient_accumulation_steps=gradient_accumulation_steps,
            gradient_checkpointing=gradient_checkpointing,
            logging_strategy="steps",
            logging_steps=1,
            log_on_each_node=False,
            num_train_epochs=num_train_epochs,
            learning_rate=learning_rate,
            bf16=True,
            ddp_find_unused_parameters=False,
            save_strategy="no",
            output_dir="outputs",
            **fsdp_configurations
        ),
        data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False),
    )

    trainer.train()

    if trainer.is_fsdp_enabled:
        trainer.accelerator.state.fsdp_plugin.set_state_dict_type("FULL_STATE_DICT")

    if merge_weights:
        output_dir = "/tmp/model"

        # merge adapter weights with base model and save
        # save int 4 model
        trainer.model.save_pretrained(output_dir, safe_serialization=False)

        if accelerator.is_main_process:
            # clear memory
            del model
            del trainer

            torch.cuda.empty_cache()

            # load PEFT model
            model = AutoPeftModelForCausalLM.from_pretrained(
                output_dir,
                torch_dtype=torch.float16,
                low_cpu_mem_usage=True,
                trust_remote_code=True,
            )

            # Merge LoRA and base model and save
            model = model.merge_and_unload()
            model.save_pretrained(
                "/opt/ml/model", safe_serialization=True, max_shard_size="2GB"
            )
    else:
        trainer.model.save_pretrained("/opt/ml/model", safe_serialization=True)

    if accelerator.is_main_process:
        tokenizer.save_pretrained("/opt/ml/model")

We can run a distributed job with the @remote function through the parameters use_torchrun and nproc_per_node, which indicate whether the SageMaker job should use torchrun as its entry point and how many GPUs to use. You can pass optional parameters like volume_size, subnets, and security_group_ids using the @remote decorator.

Finally, we run the job by invoking train_fn():

model_id = "meta-llama/Meta-Llama-3.1-8B-Instruct"

train_fn(
    model_id,
    train_ds=train_dataset,
    test_ds=test_dataset,
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    gradient_accumulation_steps=2,
    gradient_checkpointing=True,
    num_train_epochs=1,
    merge_weights=True,
    token="<HF_TOKEN>"
)

The training job runs on the SageMaker training cluster. The training job took about 42 minutes, by distributing the computation across the 4 available GPUs on the selected instance type ml.g5.12xlarge.

We chose to merge the LoRA adapter with the base model by setting the merge_weights parameter to True in our train_fn() function. Merging the weights provides us with a single, cohesive model that incorporates both the base knowledge and the domain-specific adaptations we’ve made through fine-tuning.
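As a rough illustration of what merging means, the sketch below folds a low-rank update into a frozen weight matrix using the usual LoRA formulation, W_merged = W + (alpha / r) · B · A. The tiny matrices, the rank of 1, and the helper names are made up for readability; the training job above uses r=8 and alpha=16 and delegates the real merge to the PEFT library.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(w, lora_a, lora_b, alpha, r):
    """Return W + (alpha / r) * B @ A as a new matrix."""
    scale = alpha / r
    delta = matmul(lora_b, lora_a)
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


# Hypothetical 2x3 base weight with a rank-1 adapter (B is 2x1, A is 1x3)
w = [[1, 0, 2], [0, 1, 1]]
lora_b = [[1], [2]]
lora_a = [[1, 0, 1]]
alpha, r = 2, 1

merged = merge_lora(w, lora_a, lora_b, alpha, r)

# The merged matrix gives the same output as base path plus adapter path
x = [[1], [1], [1]]
merged_out = matmul(merged, x)
base_out = matmul(w, x)
adapter_out = matmul(lora_b, matmul(lora_a, x))
separate_out = [[b[0] + (alpha / r) * a[0]] for b, a in zip(base_out, adapter_out)]
```

Because `merged_out` equals `separate_out`, the merged model needs no adapter-aware runtime at inference time, which is what makes a single set of weights convenient to host.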

By merging the model, we gain flexibility in our deployment options.

Model deployment

When deploying a fine-tuned model on AWS, multiple deployment strategies are available. In this post, we explore two deployment methods:

Model deployment with SageMaker real-time inference

SageMaker real-time inference gives you full control over the inference resources. It allows you to choose from a set of available instances and deployment options for hosting your model. By using the SageMaker built-in container Hugging Face Text Generation Inference (TGI), you can take advantage of the inference script and optimization options available in the container.

In this post, we deploy the fine-tuned model to a SageMaker endpoint for running inference, which will be used for evaluating the model in the next step.

We create the HuggingFaceModel object, which is a high-level SageMaker model class for working with Hugging Face models. The image_uri parameter specifies the container image URI for the model, and model_data points to the Amazon Simple Storage Service (Amazon S3) location containing the model artifact (automatically uploaded by the SageMaker training job). We also specify a set of environment variables to configure the number of GPUs (SM_NUM_GPUS), quantization methodology (QUANTIZE), and maximum input and total token lengths (MAX_INPUT_LENGTH and MAX_TOTAL_TOKENS).

import json
from sagemaker import get_execution_role
from sagemaker.huggingface import HuggingFaceModel

# image_uri, bucket_name, job_name, and number_of_gpu must be defined beforehand
model = HuggingFaceModel(
    image_uri=image_uri,
    model_data=f"s3://{bucket_name}/{job_name}/{job_name}/output/model.tar.gz",
    role=get_execution_role(),
    env={
        'HF_MODEL_ID': "/opt/ml/model", # path to where sagemaker stores the model
        'SM_NUM_GPUS': json.dumps(number_of_gpu), # Number of GPU used per replica
        'QUANTIZE': 'bitsandbytes',
        'MAX_INPUT_LENGTH': '4096',
        'MAX_TOTAL_TOKENS': '8192'
    }
)

After creating the model object, we can deploy it to an endpoint using the deploy method. The initial_instance_count and instance_type parameters specify the number and type of instances to use for the endpoint. The container_startup_health_check_timeout and model_data_download_timeout parameters set the timeout values for the container startup health check and model data download, respectively.

predictor = model.deploy(
    initial_instance_count=instance_count,
    instance_type=instance_type,
    container_startup_health_check_timeout=health_check_timeout,
    model_data_download_timeout=3600
)

It takes a few minutes to deploy the model before it becomes available for inference and evaluation. You can invoke the endpoint using the AWS SDK with the boto3 client for sagemaker-runtime, or directly using the SageMaker Python SDK through the predict API of the predictor created previously.

body = {
    'inputs': prompt,
    'parameters': {
        "top_p": 0.9,
        "temperature": 0.2,
        "max_new_tokens": 512,
        "return_full_text": False,
        "stop": [
            '<|eot_id|>',
            '<|end_of_text|>'
        ]
    }
}

response = predictor.predict(body)
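For the boto3 route mentioned earlier, a minimal sketch could look like the following. It assumes an endpoint that accepts the same JSON payload as the predictor call; the endpoint name passed by the caller and the helper names `build_invoke_request` and `invoke` are placeholders rather than code from the original post, and the shape of the parsed response depends on the serving container.

```python
import json


def build_invoke_request(endpoint_name, prompt, max_new_tokens=512):
    """Build keyword arguments for the sagemaker-runtime invoke_endpoint call."""
    payload = {
        "inputs": prompt,
        "parameters": {
            "top_p": 0.9,
            "temperature": 0.2,
            "max_new_tokens": max_new_tokens,
            "return_full_text": False,
            "stop": ["<|eot_id|>", "<|end_of_text|>"],
        },
    }
    return {
        "EndpointName": endpoint_name,
        "ContentType": "application/json",
        "Body": json.dumps(payload),
    }


def invoke(endpoint_name, prompt):
    """Send the prompt to the endpoint and return the parsed JSON response."""
    import boto3  # imported lazily so the request builder works without AWS access

    client = boto3.client("sagemaker-runtime")
    response = client.invoke_endpoint(**build_invoke_request(endpoint_name, prompt))
    return json.loads(response["Body"].read())


# Hypothetical endpoint name, used only to show the request shape
request = build_invoke_request("my-fine-tuned-endpoint", "What are the consequences of the defect?")
```

Keeping the request construction separate from the network call makes it easy to inspect or log the payload before any request is sent.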

Model deployment with Amazon Bedrock Custom Model Import

Amazon Bedrock Custom Model Import is a fully managed capability, currently in public preview, designed for importing and deploying custom language models. It allows you to interact with the deployed model both on-demand and by provisioning the throughput.

In this section, we use the Custom Model Import feature in Amazon Bedrock for deploying our fine-tuned model in the fully managed environment of Amazon Bedrock.

After defining the model and job_name variables, we import our model from the S3 bucket by supplying it in the Hugging Face weights format.

Next, we use a preexisting AWS Identity and Access Management (IAM) role that allows reading the binary file from Amazon S3 and create the import job resource in Amazon Bedrock for hosting our model.
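As a rough illustration of these two steps, the following sketch builds the import request and submits it with the boto3 client for bedrock. The function names and all argument values are placeholders introduced here, not taken from the post; the S3 URI is assumed to point at the folder that holds the fine-tuned weights in Hugging Face format.

```python
def build_import_job_request(job_name: str, model_name: str,
                             role_arn: str, s3_uri: str) -> dict:
    """Assemble the keyword arguments for create_model_import_job."""
    return {
        "jobName": job_name,
        "importedModelName": model_name,
        "roleArn": role_arn,  # IAM role that can read the weights from Amazon S3
        "modelDataSource": {"s3DataSource": {"s3Uri": s3_uri}},
    }


def start_import_job(request: dict) -> str:
    """Submit the import job; needs AWS credentials and Bedrock access."""
    import boto3  # imported lazily so the request builder works without boto3

    bedrock = boto3.client("bedrock")
    response = bedrock.create_model_import_job(**request)
    return response["jobArn"]


# Placeholder values for illustration only.
request = build_import_job_request(
    job_name="<JOB_NAME>",
    model_name="<MODEL_NAME>",
    role_arn="<ROLE_ARN>",
    s3_uri="s3://<BUCKET>/<PREFIX>/",
)
```

Calling start_import_job(request) returns the ARN of the import job, whose status can then be checked in the Amazon Bedrock console.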

It takes a few minutes to deploy the model, and it can be invoked using the AWS SDK with the boto3 client for bedrock-runtime by using the invoke_model API:

fine_tuned_model_id = "<MODEL_ARN>"

body = {
    "prompt": prompt,
    "temperature": 0.1,
    "top_p": 0.9,
}

response = bedrock_client.invoke_model(
    modelId=fine_tuned_model_id,
    body=json.dumps(body)
)
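The invoke_model call returns the model output as a streaming body that has to be read and decoded. A minimal sketch of that step follows; the helper name is hypothetical, and the stand-in response uses a made-up "output" key, because the field names in the real payload depend on the imported model.

```python
import io
import json


def read_model_output(response: dict) -> dict:
    """Decode the JSON document carried in the 'body' stream of the response."""
    return json.loads(response["body"].read())


# Local stand-in for a real response, so the helper can be tried without AWS access.
# The "output" key is invented for this demo; inspect the real payload for its fields.
fake_response = {
    "body": io.BytesIO(json.dumps({"output": "example text"}).encode("utf-8"))
}
decoded = read_model_output(fake_response)
print(decoded)
```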

Model evaluation

In this final step, we evaluate the fine-tuned model against the base models Meta Llama 3 8B Instruct and Meta Llama 3 70B Instruct on Amazon Bedrock. Our evaluation focuses on how well the model uses specific terminology for the automotive space and the improvements provided by fine-tuning in generating answers.

The fine-tuned model’s ability to understand components and error descriptions for diagnostics, as well as identify corrective actions and consequences in the generated answers, can be evaluated on two dimensions.

To evaluate the quality of the generated text and whether the vocabulary and terminology used are appropriate for the task and industry, we use the Bilingual Evaluation Understudy (BLEU) score. BLEU is an algorithm for evaluating the quality of text by calculating the n-gram overlap between the generated text and the reference text.

To evaluate the accuracy of the generated text and see whether the generated answer is similar to the expected one, we use the normalized Levenshtein distance. The Levenshtein distance counts the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into the other; normalizing it by the string length maps it to a value between 0 and 1. In the results that follow, higher values indicate a closer match to the expected answer.
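To make the two metrics concrete, here is a minimal pure-Python sketch of both. It is not the evaluation code used for this post: the function names are invented, the BLEU variant is a simple unsmoothed sentence-level version with whitespace tokenization, and the Levenshtein score is returned as a similarity (1.0 means identical), which is an assumption chosen to match the "higher is better" reading of the results.

```python
import math
from collections import Counter


def _ngrams(tokens: list, n: int) -> Counter:
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(reference: str, candidate: str, max_n: int = 4) -> float:
    """Unsmoothed BLEU: geometric mean of clipped n-gram precisions times a brevity penalty."""
    ref, cand = reference.split(), candidate.split()
    if not cand:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts, ref_counts = _ngrams(cand, n), _ngrams(ref, n)
        total = sum(cand_counts.values())
        # Clip each candidate n-gram count by its count in the reference.
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        if total == 0 or overlap == 0:
            return 0.0  # no smoothing: any empty n-gram order gives a score of zero
        log_precisions.append(math.log(overlap / total))
    brevity = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))
    return brevity * math.exp(sum(log_precisions) / max_n)


def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits turning a into b."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def normalized_similarity(a: str, b: str) -> float:
    """1 - distance / longest length, so 1.0 means the strings are identical."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest
```

This sketch returns exactly 0.0 when an n-gram order has no overlap; library implementations may instead return a tiny positive number in that case, which is effectively zero.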

The evaluation dataset comprises 10 unseen examples of component diagnostics extracted from the original training dataset.

The prompt template for the evaluation is structured as follows:

<|begin_of_text|><|start_header_id|>user<|end_header_id|>
Manufacturer: {row['MFGNAME']}
Component: {row['COMPNAME']}
Description of a defect:
{row['DESC_DEFECT']}
What are the consequences?
<|eot_id|><|start_header_id|>assistant<|end_header_id|>
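A small helper that fills this template from one dataset row might look like the following sketch. The function name and the sample row are hypothetical, and the exact placement of line breaks inside the template is an assumption.

```python
def build_prompt(row: dict) -> str:
    """Fill the evaluation prompt template with the fields of one example."""
    return (
        "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n"
        f"Manufacturer: {row['MFGNAME']}\n"
        f"Component: {row['COMPNAME']}\n"
        "Description of a defect:\n"
        f"{row['DESC_DEFECT']}\n"
        "What are the consequences?\n"
        "<|eot_id|><|start_header_id|>assistant<|end_header_id|>"
    )


# Invented sample row, for illustration only.
sample_row = {
    "MFGNAME": "EXAMPLE MOTORS",
    "COMPNAME": "SERVICE BRAKES, HYDRAULIC",
    "DESC_DEFECT": "THE BRAKE HOSE MAY CHAFE AGAINST THE SUSPENSION.",
}
prompt = build_prompt(sample_row)
print(prompt)
```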

BLEU score evaluation with base Meta Llama 3 8B and 70B Instruct

The following table and figures show the calculated values for the BLEU score comparison (higher is better) with Meta Llama 3 8B and 70B Instruct.

| # | Example | Fine-tuned score | Base score: Meta Llama 3 8B | Base score: Meta Llama 3 70B |
|---|---|---|---|---|
| 1 | 2733 | 0.2936 | 5.10E-155 | 4.85E-155 |
| 2 | 3382 | 0.1619 | 0.058 | 1.134E-78 |
| 3 | 1198 | 0.2338 | 1.144E-231 | 3.473E-155 |
| 4 | 2942 | 0.94854 | 2.622E-231 | 3.55E-155 |
| 5 | 5151 | 1.28E-155 | 0 | 0 |
| 6 | 2101 | 0.80345 | 1.34E-78 | 1.27E-78 |
| 7 | 5178 | 0.94854 | 0.045 | 3.66E-155 |
| 8 | 1595 | 0.40412 | 4.875E-155 | 0.1326 |
| 9 | 2313 | 0.94854 | 3.03E-155 | 9.10E-232 |
| 10 | 557 | 0.89315 | 8.66E-79 | 0.1954 |

By comparing the fine-tuned and base scores, we can assess the performance improvement (or degradation) achieved by fine-tuning the model in the vocabulary and terminology used.

The analysis suggests that, for the analyzed cases, the fine-tuned model outperforms both base models in the vocabulary and terminology used in the generated answer. The fine-tuned model also appears to be more consistent in its performance.

Normalized Levenshtein distance with base Meta Llama 3 8B and 70B Instruct

The following table and figures show the calculated values for the Normalized Levenshtein distance comparison with Meta Llama 3 8B and 70B Instruct.

| # | Example | Fine-tuned score | Base score: Llama 3 8B | Base score: Llama 3 70B |
|---|---|---|---|---|
| 1 | 2733 | 0.42198 | 0.29900 | 0.27226 |
| 2 | 3382 | 0.40322 | 0.25304 | 0.21717 |
| 3 | 1198 | 0.50617 | 0.26158 | 0.19320 |
| 4 | 2942 | 0.99328 | 0.18088 | 0.19420 |
| 5 | 5151 | 0.34286 | 0.01983 | 0.02163 |
| 6 | 2101 | 0.94309 | 0.25349 | 0.23206 |
| 7 | 5178 | 0.99107 | 0.14475 | 0.17613 |
| 8 | 1595 | 0.58182 | 0.19910 | 0.27317 |
| 9 | 2313 | 0.98519 | 0.21412 | 0.26956 |
| 10 | 557 | 0.98611 | 0.10877 | 0.32620 |

By comparing the fine-tuned and base scores, we can assess the performance improvement (or degradation) achieved by fine-tuning the model on the specific task or domain.

The analysis shows that the fine-tuned model clearly outperforms both base models across the selected examples, suggesting that the fine-tuning process has been effective in improving the model’s accuracy and generalization in understanding the specific cause of the component defect and providing suggestions on the consequences.
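One simple way to summarize the comparison is to average each column of the normalized Levenshtein table and count how often the fine-tuned model scores highest. The sketch below does this with the values copied from the table; the variable names are illustrative, and this aggregation is an addition for illustration rather than part of the original evaluation.

```python
from statistics import mean

# Normalized Levenshtein scores copied from the table above, one tuple per example:
# (fine-tuned, Llama 3 8B base, Llama 3 70B base).
scores = [
    (0.42198, 0.29900, 0.27226),
    (0.40322, 0.25304, 0.21717),
    (0.50617, 0.26158, 0.19320),
    (0.99328, 0.18088, 0.19420),
    (0.34286, 0.01983, 0.02163),
    (0.94309, 0.25349, 0.23206),
    (0.99107, 0.14475, 0.17613),
    (0.58182, 0.19910, 0.27317),
    (0.98519, 0.21412, 0.26956),
    (0.98611, 0.10877, 0.32620),
]

fine_tuned_mean = mean(row[0] for row in scores)
base_8b_mean = mean(row[1] for row in scores)
base_70b_mean = mean(row[2] for row in scores)

# Number of examples where the fine-tuned score beats both base models.
wins = sum(1 for ft, b8, b70 in scores if ft > max(b8, b70))

print(f"fine-tuned mean: {fine_tuned_mean:.3f}")
print(f"8B base mean:    {base_8b_mean:.3f}")
print(f"70B base mean:   {base_70b_mean:.3f}")
print(f"fine-tuned wins: {wins} of {len(scores)}")
```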

In the evaluation analysis performed for both selected metrics, we can also highlight some areas for improvement:

Clean up

After you complete your training and evaluation experiments, clean up your resources to avoid unnecessary charges. If you deployed the model with SageMaker, you can delete the created real-time endpoints using the SageMaker console. Next, delete any unused SageMaker Studio resources. If you deployed the model with Amazon Bedrock Custom Model Import, you can delete the imported model using the Amazon Bedrock console.
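If you prefer to clean up programmatically instead of through the consoles, a sketch along the following lines could work. The function names are hypothetical; the predictor argument is assumed to be the SageMaker Python SDK predictor returned by deploy, and bedrock_client is assumed to be a boto3 client for bedrock.

```python
def cleanup_sagemaker(predictor) -> None:
    """Remove the hosted model and the real-time endpoint behind a predictor."""
    # Delete the model first: it is looked up through the endpoint configuration,
    # which goes away together with the endpoint.
    predictor.delete_model()
    predictor.delete_endpoint()


def cleanup_bedrock(bedrock_client, model_arn: str) -> None:
    """Delete a model that was imported with Amazon Bedrock Custom Model Import."""
    bedrock_client.delete_imported_model(modelIdentifier=model_arn)
```

Passing the clients in as arguments keeps the helpers free of credentials handling, so the same functions work from a notebook or a scheduled job.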

Conclusion

This post demonstrated the process of customizing SLMs on AWS for domain-specific applications, focusing on automotive terminology for diagnostics. The provided steps and source code show how to analyze data, fine-tune models, deploy them efficiently, and evaluate their performance against larger base models using SageMaker and Amazon Bedrock. We further highlighted the benefits of customization by enhancing vocabulary within specialized domains.

You can evolve this solution further by implementing proper ML pipelines and LLMOps practices through Amazon SageMaker Pipelines. SageMaker Pipelines enables you to automate and streamline the end-to-end workflow, from data preparation to model deployment, enhancing reproducibility and efficiency. You can also improve the quality of training data using advanced data processing techniques. Additionally, using the Reinforcement Learning from Human Feedback (RLHF) approach can align the model response to human preferences. These enhancements can further elevate the performance of customized language models across various specialized domains. You can find the sample code discussed in this post on the GitHub repo.


About the authors

Bruno Pistone is a Senior Generative AI and ML Specialist Solutions Architect for AWS based in Milan. He works with large customers, helping them to deeply understand their technical needs and design AI and Machine Learning solutions that make the best use of the AWS Cloud and the Amazon Machine Learning stack. His expertise includes Machine Learning end to end, Machine Learning Industrialization, and Generative AI. He enjoys spending time with his friends and exploring new places, as well as travelling to new destinations.

Gopi Krishnamurthy is a Senior AI/ML Solutions Architect at Amazon Web Services based in New York City. He works with large Automotive and Industrial customers as their trusted advisor to transform their Machine Learning workloads and migrate to the cloud. His core interests include deep learning and serverless technologies. Outside of work, he likes to spend time with his family and explore a wide range of music.
