AWS Machine Learning Blog, July 24, 2024
AWS AI chips deliver high performance and low cost for Llama 3.1 models on AWS

AWS has announced AWS Trainium and AWS Inferentia support for fine-tuning and inference of the Llama 3.1 models. Llama 3.1 is a family of multilingual large language models in 8B, 70B, and 405B sizes that supports long context lengths and is optimized for inference. The models excel at multilingual dialogue use cases, can generate tool calls for capabilities such as search, image generation, and code execution, and support zero-shot tool use. The 405B model is the largest publicly available LLM and is well suited to enterprise-level applications and research and development.

🚀 The Llama 3.1 family comprises multilingual large language models in 8B, 70B, and 405B sizes, all supporting context lengths of up to 128k and optimized for inference with grouped query attention (GQA).

🔧 The Llama 3.1 instruction tuned models are optimized for multilingual dialogue use cases and can generate tool calls for specific tools such as search, image generation, and code execution, with support for zero-shot tool use.

🌍 According to Meta, Llama 3.1 405B is the world's largest publicly available LLM. It is suited to tasks such as synthetic data generation and model distillation, and excels at general knowledge, long-form text generation, and multilingual translation.

🤖 Architecturally, Llama 3.1 is identical to Llama 3, using an optimized transformer architecture aligned with human preferences through supervised fine-tuning (SFT) and reinforcement learning with human feedback (RLHF).

💻 AWS Trainium and Inferentia provide a high-performance, low-cost path for Llama 3.1, with support in Amazon Bedrock and Amazon SageMaker as well as custom ML pipelines on Amazon EC2 Trn1 and Inf2 instances.

Today, we are excited to announce AWS Trainium and AWS Inferentia support for fine-tuning and inference of the Llama 3.1 models. The Llama 3.1 family of multilingual large language models (LLMs) is a collection of pre-trained and instruction tuned generative models in 8B, 70B, and 405B sizes. In a previous post, we covered how to deploy Llama 3 models on AWS Trainium and Inferentia based instances in Amazon SageMaker JumpStart. In this post, we outline how to get started with fine-tuning and deploying the Llama 3.1 family of models on AWS AI chips, to realize their price-performance benefits.

Overview of Llama 3.1 models

The Llama 3.1 family of multilingual LLMs are a collection of pre-trained and instruction tuned generative models in 8B, 70B, and 405B sizes (text in/text and code out). All models support long context length (128k) and are optimized for inference with support for grouped query attention (GQA).
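
To illustrate the idea behind GQA, the following is a conceptual sketch (not the Llama implementation): each key/value head is shared by a group of query heads, which shrinks the key/value cache and speeds up inference.

import torch

# Conceptual sketch of grouped query attention (GQA), for illustration only.
# Each KV head is shared by a group of query heads (n_kv_heads < n_heads),
# reducing the KV cache size during inference.
def gqa_attention(q, k, v, n_heads=32, n_kv_heads=8):
    # q: [batch, seq, n_heads, head_dim]; k, v: [batch, seq, n_kv_heads, head_dim]
    group = n_heads // n_kv_heads
    # Repeat each KV head so it serves `group` query heads
    k = k.repeat_interleave(group, dim=2)
    v = v.repeat_interleave(group, dim=2)
    scores = torch.einsum("bqhd,bkhd->bhqk", q, k) / (q.shape[-1] ** 0.5)
    attn = torch.softmax(scores, dim=-1)
    return torch.einsum("bhqk,bkhd->bqhd", attn, v)

batch, seq, head_dim = 1, 16, 64
q = torch.randn(batch, seq, 32, head_dim)
k = torch.randn(batch, seq, 8, head_dim)
v = torch.randn(batch, seq, 8, head_dim)
out = gqa_attention(q, k, v)  # shape [1, 16, 32, 64]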

The Llama 3.1 instruction tuned models (8B, 70B, 405B) are optimized for multilingual dialogue use cases and outperform many of the publicly available chat models on common industry benchmarks. They have been trained to generate tool calls for a few specific tools for capabilities like search, image generation, code execution, and mathematical reasoning. In addition, they support zero-shot tool use.
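
To make the tool-use flow concrete, the following is an illustrative sketch with a hypothetical get_current_weather tool; the exact prompt template and tool-call output format are defined in Meta's Llama 3.1 documentation, so treat the JSON shape below as an assumption.

import json

# Illustrative only: a hypothetical tool and dispatcher showing the zero-shot
# tool-use flow. The tool schema is included in the prompt, and the model is
# expected to reply with a structured call that the application executes.
def get_current_weather(city: str) -> str:  # hypothetical tool
    return f"Sunny and 22C in {city}"

TOOLS = {"get_current_weather": get_current_weather}

tool_schema = {
    "name": "get_current_weather",
    "description": "Get the current weather for a city",
    "parameters": {"city": {"type": "string"}},
}

# Assumed example of a structured tool call returned by the model:
model_output = '{"name": "get_current_weather", "parameters": {"city": "Paris"}}'

call = json.loads(model_output)
result = TOOLS[call["name"]](**call["parameters"])
print(result)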

Llama 3.1 405B is the world’s largest publicly available LLM, according to Meta. The model sets a new standard for artificial intelligence (AI) and is ideal for enterprise-level applications and research and development. It’s well suited to tasks like synthetic data generation, where its outputs can be used to improve smaller Llama models after fine-tuning, and model distillation, which transfers knowledge from the 405B model to smaller models. The model excels at general knowledge, long-form text generation, multilingual translation, coding, math, tool use, enhanced contextual understanding, and advanced reasoning and decision-making.

Architecturally, the core LLM for Llama 3 and Llama 3.1 has the same dense architecture. They are auto-regressive language models that use an optimized transformer architecture. The tuned versions use supervised fine-tuning (SFT) and reinforcement learning with human feedback (RLHF) to align with human preferences for helpfulness and safety.

The responsible use guide from Meta can assist you in implementing additional fine-tuning that may be necessary to customize and optimize the models with appropriate safety mitigations.

Trainium powers Llama 3.1 on Amazon Bedrock and Amazon SageMaker

The fastest way to get started with Llama 3.1 on AWS is through Amazon Bedrock, which is powered by our purpose-built AI infrastructure including AWS Trainium. Through its fully managed API, Amazon Bedrock delivers the benefits of our purpose-built AI infrastructure and simplifies access to these powerful models so you can focus on building differentiated AI applications.
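
As a minimal sketch of what this looks like in practice, the following calls a Llama 3.1 model through the Bedrock Converse API with boto3. The model ID and Region shown are assumptions; check the Amazon Bedrock console for the exact Llama 3.1 model IDs available to you.

import boto3

# Minimal sketch: invoke Llama 3.1 through the Amazon Bedrock Converse API.
# The model ID and Region below are assumptions; confirm them in your
# Bedrock console before running.
client = boto3.client("bedrock-runtime", region_name="us-west-2")

response = client.converse(
    modelId="meta.llama3-1-8b-instruct-v1:0",
    messages=[{"role": "user", "content": [{"text": "What is the capital of France?"}]}],
    inferenceConfig={"maxTokens": 128, "temperature": 0.7, "topP": 0.95},
)

print(response["output"]["message"]["content"][0]["text"])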

If you need greater control over the underlying resources, you can fine-tune and deploy Llama 3.1 models with SageMaker. Trainium support for Llama 3.1 in SageMaker JumpStart is coming soon.
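
If you prefer the SDK route, the sketch below shows what a JumpStart deployment might look like with the SageMaker Python SDK. The model_id and instance type are assumptions; confirm the exact values in SageMaker Studio or the JumpStart model listings.

from sagemaker.jumpstart.model import JumpStartModel

# Sketch of deploying a Llama 3.1 model through SageMaker JumpStart.
# The model_id and instance_type are assumptions; list the available
# JumpStart model IDs in SageMaker Studio to confirm the exact values.
model = JumpStartModel(
    model_id="meta-textgeneration-llama-3-1-8b-instruct",  # assumed model ID
    instance_type="ml.inf2.24xlarge",                      # assumed Inferentia2 instance
)
predictor = model.deploy(accept_eula=True)

payload = {
    "inputs": "What is the capital of France?",
    "parameters": {"max_new_tokens": 128, "temperature": 0.7},
}
print(predictor.predict(payload))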

AWS Trainium and AWS Inferentia2 enable high performance and low cost for Llama 3.1 models

If you want to build your own ML pipelines for training and inference for greater flexibility and control, you can get started with Llama 3.1 on AWS AI chips using Amazon Elastic Compute Cloud (Amazon EC2) Trn1 and Inf2 instances. Let’s see how you can get started with the new Llama 3.1 8B and 70B models on Trainium using the AWS Neuron SDK.

Fine-tune Llama 3.1 on Trainium

To get started with fine-tuning either Llama 3.1 8B or Llama 3.1 70B, you can use the NeuronX Distributed library. NeuronX Distributed provides implementations of some of the more popular distributed training and inference techniques. To start fine-tuning, you can use the following samples:

Both samples are built on top of AWS ParallelCluster to manage the Trainium cluster infrastructure and Slurm for workload management. The following is an example Slurm command to initiate training for Llama 3.1 70B:

sbatch --exclusive \
    --nodes 32 \
    --cpus-per-task 128 \
    --wrap="srun bash $(pwd)/run_llama3_70B_tp_pp.sh"

Inside the Slurm script, we launch a distributed training process on our cluster. In the runner scripts, we load the pre-trained weights and configuration provided by Meta, and launch the training process: 

torchrun $DISTRIBUTED_ARGS run_llama_nxd.py \
    --train_batch_size $BS \
    --use_meta_device_init 1 \
    --training_dir $DATA_PATH \
    --training_config $SCRIPT_DIR/${MODEL_SIZE}config_llama${LLAMA_VERSION} \
    --max_steps $max_steps \
    --seq_len $SEQ_LEN \
    --pipeline_parallel_size $PP_DEGREE \
    --tensor_parallel_size $TP_DEGREE \
    --num_microbatches $NUM_MICROBATCHES \
    --lr 0.000015 \
    --min_lr 1e-06 \
    --beta1 0.9 \
    --beta2 0.95 \
    --weight_decay 0.1 \
    --warmup_steps 2000 \
    --constant_steps 0 \
    --use_zero1_optimizer 1 \
    --use_selective_checkpoint 1 \
    --use_flash_attention 1 \
    --qkv_linear 1 \
    --kv_replicator 4 \
    --pretrained_weight 1 \
    --save_load_xser 1 \
    --checkpoint_dir "/shared/llama${LLAMA_VERSION}${MODEL_SIZE}/" \
    --checkpoint_freq $checkpoint_freq \
    --num_kept_checkpoint -1 \
    --loading_step -1 \
    --tb_dir $tb_dir |& tee $LOG_PATH/log

exit ${PIPESTATUS[0]}

Deploy Llama 3.1 on Trainium

When your model is ready to deploy, you can do so by updating the model ID in the previous Llama 3 8B Neuron sample code:

from transformers_neuronx import LlamaForSampling

model_id = "meta-llama/Meta-Llama-3.1-8B"
neuron_model = LlamaForSampling.from_pretrained(
    model_id,
    neuron_config=neuron_config,  # neuron_config is constructed earlier in the Llama 3 Neuron sample
    batch_size=1,
    tp_degree=24,
    amp='bf16',
    n_positions=4096,
)
neuron_model.to_neuron()

You can use the same sample inference code:

import time

import torch
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_id)
prompt = "Hello, I'm a language model and I like to"
input_ids = tokenizer.encode(prompt, return_tensors="pt")

# run inference with top-k sampling
with torch.inference_mode():
    start = time.time()
    generated_sequences = neuron_model.sample(input_ids, sequence_length=2048, top_k=50)
    elapsed = time.time() - start

generated_sequences = [tokenizer.decode(seq) for seq in generated_sequences]
print(f'generated sequences {generated_sequences} in {elapsed} seconds')

For step-by-step details, refer to the new Llama 3.1 examples:

You can also use Hugging Face’s Optimum Neuron library to quickly deploy models directly from SageMaker through the Hugging Face Model Hub. From the Llama 3.1 model card hub, choose Deploy, then SageMaker, and finally AWS Inferentia & Trainium. Copy the example code into a SageMaker notebook, then choose Run.

import json
import sagemaker
import boto3
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

try:
    role = sagemaker.get_execution_role()
except ValueError:
    iam = boto3.client("iam")
    role = iam.get_role(RoleName="sagemaker_execution_role")["Role"]["Arn"]

# Hub Model configuration. https://huggingface.co/models
hub = {
    "HF_MODEL_ID": "meta-llama/Meta-Llama-3.1-8B",
    "HF_NUM_CORES": "2",
    "HF_AUTO_CAST_TYPE": "fp16",
    "MAX_BATCH_SIZE": "8",
    "MAX_INPUT_LENGTH": "3686",
    "MAX_TOTAL_TOKENS": "4096",
    "HF_TOKEN": "<REPLACE WITH YOUR TOKEN>",
}

assert hub["HF_TOKEN"] != "<REPLACE WITH YOUR TOKEN>", "Please replace '<REPLACE WITH YOUR TOKEN>' with your Hugging Face Hub API token"

# create Hugging Face Model Class
huggingface_model = HuggingFaceModel(
    image_uri=get_huggingface_llm_image_uri("huggingface-neuronx", version="0.0.23"),
    env=hub,
    role=role,
)

# deploy model to SageMaker Inference
predictor = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type="ml.inf2.xlarge",
    container_startup_health_check_timeout=1800,
    volume_size=512,
)

# send request
predictor.predict(
    {
        "inputs": "What is the capital of France?",
        "parameters": {
            "do_sample": True,
            "max_new_tokens": 128,
            "temperature": 0.7,
            "top_k": 50,
            "top_p": 0.95,
        },
    }
)

Additionally, if you want to use vLLM to deploy the models, you can refer to the continuous batching guide to create the environment. After you create the environment, you can use vLLM to deploy Llama 3.1 8B and 70B models on AWS Trainium or Inferentia. The following is an example of deploying Llama 3.1 8B:

from vllm import LLM, SamplingParams

# Sample prompts.
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

# Create a sampling params object.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

# Create an LLM.
llm = LLM(
    model="meta-llama/Meta-Llama-3.1-8B",
    max_num_seqs=8,
    # The max_model_len and block_size arguments are required to be same as max sequence length,
    # when targeting neuron device. Currently, this is a known limitation in continuous batching
    # support in transformers-neuronx.
    max_model_len=128,
    block_size=128,
    # The device can be automatically detected when AWS Neuron SDK is installed.
    # The device argument can be either unspecified for automated detection, or explicitly assigned.
    device="neuron",
    tensor_parallel_size=8,
)

# Generate texts from the prompts. The output is a list of RequestOutput objects
# that contain the prompt, generated text, and other information.
outputs = llm.generate(prompts, sampling_params)

# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

Conclusion

AWS Trainium and Inferentia deliver high performance and low cost for fine-tuning and deploying Llama 3.1 models. We are excited to see how you will use these powerful models and our purpose-built AI infrastructure to build differentiated AI applications. To learn more about how to get started with AWS AI chips, refer to Model Samples and Tutorials in AWS Neuron Documentation.


About the Authors

John Gray is a Sr. Solutions Architect in Annapurna Labs, AWS, based out of Seattle. In this role, John works with customers on their AI and machine learning use cases, architects solutions to cost-effectively solve their business problems, and helps them build a scalable prototype using AWS AI chips.

Pinak Panigrahi works with customers to build ML-driven solutions to solve strategic business problems on AWS. In his current role, he works on optimizing training and inference of generative AI models on AWS AI chips.

Kamran Khan is Head of Business Development for AWS Inferentia and Trainium at AWS. He has over a decade of experience helping customers deploy and optimize deep learning training and inference workloads using AWS Inferentia and AWS Trainium.

Shruti Koparkar is a Senior Product Marketing Manager at AWS. She helps customers explore, evaluate, and adopt Amazon EC2 accelerated computing infrastructure for their machine learning needs.
