Deploy Meta Llama 3.1-8B on AWS Inferentia using Amazon EKS and vLLM

With the rise of large language models (LLMs) like Meta Llama 3.1, there is an increasing need for scalable, reliable, and cost-effective solutions to deploy and serve these models. AWS Trainium and AWS Inferentia based instances, combined with Amazon Elastic Kubernetes Service (Amazon EKS), provide a performant and low-cost framework to run LLMs efficiently in a containerized environment.

In this post, we walk through the steps to deploy the Meta Llama 3.1-8B model on Inferentia 2 instances using Amazon EKS.

Solution overview

The steps to implement the solution are as follows:

Create the EKS cluster.

Set up the Inferentia 2 node group.

Install the Neuron device plugin and scheduling extension.

Prepare the Docker image.

Deploy the Meta Llama 3.1-8B model.

We also demonstrate how to test the solution and monitor performance, and discuss options for scaling and multi-tenancy.

Prerequisites

Before you begin, make sure you have the following utilities installed on your local machine or development environment. If you don’t have them installed, follow the instructions provided for each tool.

AWS Command Line Interface (AWS CLI)

eksctl

kubectl

Docker

In this post, the examples use an inf2.48xlarge instance; make sure you have a sufficient service quota to use this instance. For more information on how to view and increase your quotas, refer to Amazon EC2 service quotas.
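As a quick sanity check before starting, the following sketch reports which command-line tools are missing from your PATH. The function name check_tools is our own illustration and not part of any AWS tooling; the tool names in the usage comment are the ones the commands in this post rely on.

```shell
#!/bin/sh
# Hypothetical helper: report which of the given tools are not on PATH.
# Returns 0 when every tool is found, 1 otherwise.
check_tools() {
  missing=0
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "missing: $tool" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Example usage:
#   check_tools aws eksctl kubectl docker
```

The check only confirms that each binary exists; it does not verify versions or credentials.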

Create the EKS cluster

If you don’t have an existing EKS cluster, you can create one using eksctl. Adjust the following configuration to suit your needs, such as the Amazon EKS version, cluster name, and AWS Region. Before running the following commands, make sure you are authenticated with AWS:

export AWS_REGION=us-east-1
export CLUSTER_NAME=my-cluster
export EKS_VERSION=1.30
export AWS_ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)
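Because the later commands expand these variables inside configuration files, an empty value can silently produce an invalid manifest. The following is a minimal guard, assuming a POSIX shell; require_vars is a hypothetical helper name introduced here for illustration.

```shell
#!/bin/sh
# Hypothetical helper: fail if any named environment variable is unset or empty.
require_vars() {
  rc=0
  for name in "$@"; do
    # Indirect lookup of the variable called $name (POSIX-compatible).
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "unset or empty: $name" >&2
      rc=1
    fi
  done
  return "$rc"
}

# Example usage:
#   require_vars AWS_REGION CLUSTER_NAME EKS_VERSION AWS_ACCOUNT_ID
```

An empty AWS_ACCOUNT_ID usually means the aws sts call failed, for example because credentials are missing or expired.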

Then complete the following steps:

Create a configuration file named eks_cluster.yaml:

cat > eks_cluster.yaml <<EOF
---
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: $CLUSTER_NAME
  region: $AWS_REGION
  version: "$EKS_VERSION"
addons:
- name: vpc-cni
  version: latest
cloudWatch:
  clusterLogging:
    enableTypes: ["*"]
iam:
  withOIDC: true
EOF

This configuration file contains the following parameters:

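metadata.name: The name of the cluster, which is my-cluster in this example.

metadata.region: The AWS Region where the cluster is created, which is us-east-1 in this example.

metadata.version: The Kubernetes version for the cluster, which is 1.30 in this example. Review release notes for Kubernetes versions on standard support before choosing a version.

addons.vpc-cni: Installs the latest version of the Amazon VPC CNI add-on.

cloudWatch.clusterLogging: Enables all cluster log types and sends them to Amazon CloudWatch Logs.

iam.withOIDC: Enables an IAM OpenID Connect (OIDC) provider for the cluster.

After you create the eks_cluster.yaml file, create the cluster by running the following command: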

eksctl create cluster --config-file eks_cluster.yaml

This command will create the EKS cluster based on the configuration specified in the eks_cluster.yaml file. The process will take approximately 15–20 minutes to complete.

During the cluster creation process, eksctl will also create a default node group with a recommended instance type and configuration. However, in the next section, we create a separate node group with Inf2 instances, specifically for running the Meta Llama 3.1-8B model.

Configure kubectl to communicate with the new cluster:

aws eks update-kubeconfig --region $AWS_REGION --name $CLUSTER_NAME

Set up the Inferentia 2 node group

To run the Meta Llama 3.1-8B model, you’ll need to create an Inferentia 2 node group. Complete the following steps:

Retrieve the ID of the Amazon EKS optimized accelerated AMI:

export ACCELERATED_AMI=$(aws ssm get-parameter \
  --name /aws/service/eks/optimized-ami/$EKS_VERSION/amazon-linux-2-gpu/recommended/image_id \
  --region $AWS_REGION \
  --query "Parameter.Value" \
  --output text)

Create the node group manifest for eksctl:

cat > eks_nodegroup.yaml <<EOF
---
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: $CLUSTER_NAME
  region: $AWS_REGION
  version: "$EKS_VERSION"
managedNodeGroups:
  - name: neuron-group
    instanceType: inf2.48xlarge
    desiredCapacity: 1
    volumeSize: 512
    ami: "$ACCELERATED_AMI"
    amiFamily: AmazonLinux2
    iam:
      attachPolicyARNs:
      - arn:aws:iam::aws:policy/AmazonEKSWorkerNodePolicy
      - arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly
      - arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore
      - arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess
    overrideBootstrapCommand: |
      #!/bin/bash
      /etc/eks/bootstrap.sh $CLUSTER_NAME
EOF

Create the node group:

eksctl create nodegroup --config-file eks_nodegroup.yaml

This will take approximately 5 minutes.

Install the Neuron device plugin and scheduling extension

To set up your EKS cluster for running workloads on Inferentia chips, you need to install two key components: the Neuron device plugin and the Neuron scheduling extension.

The Neuron device plugin is essential for exposing Neuron cores and devices as resources in Kubernetes. The Neuron scheduling extension facilitates the optimal scheduling of pods requiring multiple Neuron cores or devices.

For detailed instructions on installing and verifying these components, refer to Kubernetes environment setup for Neuron. Following these instructions will help you make sure your EKS cluster is properly configured to schedule and run workloads that require Neuron devices, such as the Meta Llama 3.1-8B model.

Prepare the Docker image

To run the model, you’ll need to prepare a Docker image with the required dependencies. We use the following code to create an Amazon Elastic Container Registry (Amazon ECR) repository and then build a custom Docker image based on the AWS Deep Learning Container (DLC).

Set up environment variables:

export ECR_REPO_NAME=vllm-neuron

Create an ECR repository:

aws ecr create-repository --repository-name $ECR_REPO_NAME --region $AWS_REGION

Although the base Docker image already includes TorchServe, to keep things simple, this implementation uses the server provided by the vLLM repository, which is based on FastAPI. In your production scenario, you can connect TorchServe to vLLM with your own custom handler.

Create the Dockerfile:

cat > Dockerfile <<EOF
FROM public.ecr.aws/neuron/pytorch-inference-neuronx:2.1.2-neuronx-py310-sdk2.20.0-ubuntu20.04
# Clone the vllm repository
RUN git clone https://github.com/vllm-project/vllm.git
# Set the working directory
WORKDIR /vllm
RUN git checkout v0.6.0
# Set the environment variable
ENV VLLM_TARGET_DEVICE=neuron
# Install the dependencies
RUN python3 -m pip install -U -r requirements-neuron.txt
RUN python3 -m pip install .
# Modify the arg_utils.py file to support larger block_size option
RUN sed -i "/parser.add_argument('--block-size',/ {N;N;N;N;N;s/\[8, 16, 32\]/[8, 16, 32, 128, 256, 512, 1024, 2048, 4096, 8192]/}" vllm/engine/arg_utils.py
# Install ray
RUN python3 -m pip install ray
# Quote the requirement so the shell does not treat >= as a redirection
RUN pip install -U "triton>=3.0.0"
# Set the entry point
ENTRYPOINT ["python3", "-m", "vllm.entrypoints.openai.api_server"]
EOF

Use the following commands to authenticate Docker with Amazon ECR, build your Docker image, and push it to the repository you created. The account ID and Region come from the environment variables set earlier, which avoids hard-coded values.

# Authenticate Docker to your ECR registry
aws ecr get-login-password --region $AWS_REGION | docker login --username AWS --password-stdin $AWS_ACCOUNT_ID.dkr.ecr.$AWS_REGION.amazonaws.com
# Build the Docker image
docker build -t ${ECR_REPO_NAME}:latest .
# Tag the image
docker tag ${ECR_REPO_NAME}:latest $AWS_ACCOUNT_ID.dkr.ecr.$AWS_REGION.amazonaws.com/${ECR_REPO_NAME}:latest
# Push the image to ECR
docker push $AWS_ACCOUNT_ID.dkr.ecr.$AWS_REGION.amazonaws.com/${ECR_REPO_NAME}:latest
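The deployment specification needs the full URI of the image you just pushed. The following sketch assembles that URI from the same components used in the tag and push commands; ecr_image_uri is a hypothetical helper name, and the tag defaults to latest to match the commands above.

```shell
#!/bin/sh
# Hypothetical helper: print the ECR image URI.
# $1 = AWS account ID, $2 = Region, $3 = repository name, $4 = tag (default: latest)
ecr_image_uri() {
  printf '%s.dkr.ecr.%s.amazonaws.com/%s:%s\n' "$1" "$2" "$3" "${4:-latest}"
}

# Example usage:
#   ecr_image_uri "$AWS_ACCOUNT_ID" "$AWS_REGION" "$ECR_REPO_NAME"
```

You can paste the printed value into the image field of the deployment specification.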

Deploy the Meta Llama 3.1-8B model

With the setup complete, you can now deploy the model using a Kubernetes deployment. The following is an example deployment specification that requests specific resources and sets up multiple replicas:

cat > neuronx-vllm-deployment.yaml <<EOF
apiVersion: apps/v1
kind: Deployment
metadata:
  name: neuronx-vllm-deployment
  labels:
    app: neuronx-vllm
spec:
  replicas: 3
  selector:
    matchLabels:
      app: neuronx-vllm
  template:
    metadata:
      labels:
        app: neuronx-vllm
    spec:
      schedulerName: my-scheduler
      containers:
      - name: neuronx-vllm
        image: <replace with the url to the docker image you pushed to the ECR>
        resources:
          limits:
            cpu: 32
            memory: "64G"
            aws.amazon.com/neuroncore: "8"
          requests:
            cpu: 32
            memory: "64G"
            aws.amazon.com/neuroncore: "8"
        ports:
        - containerPort: 8000
        env:
        - name: HF_TOKEN
          value: <your huggingface token>
        - name: FI_EFA_FORK_SAFE
          value: "1"
        args:
        - "--model"
        - "meta-llama/Meta-Llama-3.1-8B"
        - "--tensor-parallel-size"
        - "8"
        - "--max-num-seqs"
        - "64"
        - "--max-model-len"
        - "8192"
        - "--block-size"
        - "8192"
EOF

Apply the deployment specification with kubectl apply -f neuronx-vllm-deployment.yaml.

This deployment configuration sets up multiple replicas of the Meta Llama 3.1-8B model using tensor parallelism (TP) of 8. In the current setup, we’re hosting three copies of the model across the available Neuron cores: an inf2.48xlarge instance provides 24 Neuron cores, so three replicas of 8 cores each use all of them. This configuration allows for the efficient utilization of the hardware resources while enabling multiple concurrent inference requests.

The use of TP=8 helps in distributing the model across multiple Neuron cores, which improves inference performance and throughput. The specific number of replicas and cores used may vary depending on your particular hardware setup and performance requirements.

To modify the setup, update the neuronx-vllm-deployment.yaml file, adjusting the replicas field in the deployment specification and, in the container specification, the aws.amazon.com/neuroncore resource values together with the --tensor-parallel-size argument. Always verify that the total number of cores used (replicas * cores per replica) doesn’t exceed your available hardware resources and that the number of attention heads is evenly divisible by the TP degree for optimal performance.
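The two checks described above are simple arithmetic, so they can be sketched as a small shell function. The name check_layout is our own; the example values in the usage comment (24 Neuron cores for one inf2.48xlarge node and 32 attention heads for Meta Llama 3.1-8B) are assumptions you should confirm against your own hardware and model configuration.

```shell
#!/bin/sh
# Hypothetical helper: validate a replica/TP layout before editing the manifest.
# $1 = replicas, $2 = TP degree (cores per replica),
# $3 = total Neuron cores available, $4 = attention heads in the model
check_layout() {
  replicas=$1; tp=$2; total_cores=$3; heads=$4
  needed=$((replicas * tp))
  if [ "$needed" -gt "$total_cores" ]; then
    echo "need $needed cores but only $total_cores available" >&2
    return 1
  fi
  if [ $((heads % tp)) -ne 0 ]; then
    echo "$heads attention heads not divisible by TP=$tp" >&2
    return 1
  fi
  echo "ok: $needed of $total_cores cores used"
}

# Example usage (assumed values, see the lead-in):
#   check_layout 3 8 24 32
```

This sketch treats all cores as one pool. With more than one node, a replica must still fit within a single node.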

The deployment also includes environment variables for the Hugging Face token and EFA fork safety. The args section (see the preceding code) configures the model and its parameters, including an increased max model length and block size of 8192.

Test the deployment

After you deploy the model, it’s important to monitor its progress and verify its readiness. Complete the following steps:

Check the deployment status:

kubectl get deployments

This will show you the desired, current, and up-to-date number of replicas.

Monitor the pods:

kubectl get pods -l app=neuronx-vllm -w

The -w flag will watch for changes. You’ll see the pods transitioning from "Pending" to "ContainerCreating" to "Running".

Check the logs of a specific pod:

kubectl logs <pod-name>

The initial startup process takes around 15 minutes. During this time, the model is being compiled for the Neuron cores. You’ll see the compilation progress in the logs.

To support proper management of your vLLM pods, you should configure Kubernetes probes in your deployment. These probes help Kubernetes determine when a pod is ready to serve traffic, when it’s alive, and when it has successfully started.

Add the following probe configurations to your container spec in the deployment YAML:

spec:
  containers:
  - name: neuronx-vllm
    # ... other container configurations ...
    readinessProbe:
      httpGet:
        path: /health
        port: 8000
      initialDelaySeconds: 1800
      periodSeconds: 10
    livenessProbe:
      httpGet:
        path: /health
        port: 8000
      initialDelaySeconds: 1800
      periodSeconds: 15
    startupProbe:
      httpGet:
        path: /health
        port: 8000
      initialDelaySeconds: 1800
      failureThreshold: 30
      periodSeconds: 10

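The configuration comprises three probes:

Readiness probe: Determines when the pod is ready to receive traffic.

Liveness probe: Detects when the container is no longer healthy so that Kubernetes can restart it.

Startup probe: Gives the container time to finish starting up, including model compilation, before the other probes take effect.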

These probes assume that your vLLM application exposes a /health endpoint. If it doesn’t, you’ll need to implement one or adjust the probe configurations accordingly.

With these probes in place, Kubernetes will do the following:

Only send traffic to pods that are ready

Restart pods that are no longer alive

Allow sufficient time for initial startup and compilation

This configuration helps facilitate high availability and proper functioning of your vLLM deployment.
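To judge whether the startup probe leaves enough room for compilation, it helps to estimate the longest startup it tolerates. As a rough approximation (ignoring per-request timeouts), that is initialDelaySeconds plus failureThreshold times periodSeconds. The function name below is our own illustration.

```shell
#!/bin/sh
# Hypothetical helper: approximate the maximum startup time a startup probe allows.
# $1 = initialDelaySeconds, $2 = failureThreshold, $3 = periodSeconds
startup_budget_seconds() {
  echo $(( $1 + $2 * $3 ))
}

# Example usage with the values from the probe configuration above:
#   startup_budget_seconds 1800 30 10
```

With the values shown in the probe configuration, the budget is 2100 seconds (35 minutes), which comfortably covers a compilation of around 15 minutes.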

Now you’re ready to access the pods.

List the pods that carry the label app=neuronx-vllm:

kubectl get pods -l app=neuronx-vllm

This command will output a list of pods, and you’ll need the name of the pod you want to forward.

Use kubectl port-forward to forward the port to your local machine:

kubectl port-forward <pod-name> 8000:8000

This command forwards port 8000 on the pod to port 8000 on your local machine. You can now access the inference server at http://localhost:8000.
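Because the model can still be compiling when the port-forward starts, it can be useful to poll the health endpoint before sending requests. The following is a minimal sketch, assuming curl is installed and that the server exposes /health as the probes do; wait_for_health and its default timeout and interval are hypothetical choices, not part of vLLM or kubectl.

```shell
#!/bin/sh
# Hypothetical helper: poll a URL until it returns a successful HTTP status.
# $1 = URL, $2 = timeout in seconds (default 1200), $3 = poll interval (default 10)
wait_for_health() {
  url=$1; timeout=${2:-1200}; interval=${3:-10}
  elapsed=0
  while [ "$elapsed" -lt "$timeout" ]; do
    # -s: silent, -f: treat HTTP errors as failures, --max-time: avoid hanging
    if curl -sf --max-time 5 -o /dev/null "$url"; then
      echo "ready after ${elapsed}s"
      return 0
    fi
    sleep "$interval"
    elapsed=$((elapsed + interval))
  done
  echo "not ready after ${timeout}s" >&2
  return 1
}

# Example usage (run while the port-forward is active in another terminal):
#   wait_for_health http://localhost:8000/health
```

The elapsed counter only counts sleep intervals, so the real wait can be slightly longer than the reported value.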

Because we’re forwarding a port directly from a single pod, requests will only be sent to that specific pod. As a result, traffic won’t be balanced across all replicas of your deployment. This is suitable for testing and development purposes, but it doesn’t utilize the deployment efficiently in a production scenario where load balancing across multiple replicas is crucial to handle higher traffic and provide fault tolerance.

In a production environment, a proper solution like a Kubernetes service with a LoadBalancer or Ingress should be used to distribute traffic across available pods. This facilitates the efficient utilization of resources, a balanced load, and improved reliability of the inference service.

You can test the inference server by making a request from your local machine. The following code is an example of how to make an inference call using curl:

curl -X POST http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Meta-Llama-3.1-8B",
    "prompt": "Explain the theory of relativity.",
    "max_tokens": 100
  }'

This setup allows you to test and interact with your inference server locally without needing to expose your service publicly or set up complex networking configurations. For production use, make sure that load balancing and scalability considerations are addressed appropriately.

For more information about routing, see Route application and HTTP traffic with Application Load Balancers.

Monitor performance

AWS offers powerful tools to monitor and optimize your vLLM deployment on Inferentia chips. The AWS Neuron Monitor container, used with Prometheus and Grafana, provides advanced visualization of your ML application performance. Additionally, CloudWatch Container Insights for Neuron offers deep, Neuron-specific analytics.

These tools allow you to track Inferentia chip utilization, model performance, and overall cluster health. By analyzing this data, you can make informed decisions about resource allocation and scaling to meet your workload requirements.

Remember that the initial 15-minute startup time for model compilation is a one-time process per deployment, with subsequent restarts being faster due to caching.

To learn more about setting up and using these monitoring capabilities, see Scale and simplify ML workload monitoring on Amazon EKS with AWS Neuron Monitor container.

Scaling and multi-tenancy

As your application’s demand grows, you may need to scale your deployment to handle more requests. Scaling your Meta Llama 3.1-8B deployment on Amazon EKS with Neuron cores involves two coordinated steps:

Increasing the number of nodes in your EKS node group to provide additional Neuron cores

Increasing the number of replicas in your deployment to utilize these new resources

You can scale your deployment manually. Use the AWS Management Console or AWS CLI to increase the size of your EKS node group. When new nodes are available, scale your deployment with the following code:

kubectl scale deployment neuronx-vllm-deployment --replicas=<new-number>
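Before choosing a new replica count, you can estimate how many nodes it requires. The sketch below assumes each replica must fit entirely on one node; nodes_needed is a hypothetical helper, and the example values (8 cores per replica, 24 cores per inf2.48xlarge node) match the deployment in this post.

```shell
#!/bin/sh
# Hypothetical helper: number of nodes required for a given replica count.
# $1 = replicas, $2 = Neuron cores per replica, $3 = Neuron cores per node
nodes_needed() {
  replicas=$1; cores_per_replica=$2; cores_per_node=$3
  # Whole replicas that fit on one node (a replica cannot span nodes here).
  per_node=$((cores_per_node / cores_per_replica))
  if [ "$per_node" -lt 1 ]; then
    echo "a replica does not fit on one node" >&2
    return 1
  fi
  # Ceiling division: round up to a whole number of nodes.
  echo $(( (replicas + per_node - 1) / per_node ))
}

# Example usage:
#   nodes_needed 6 8 24
```

For example, six replicas of 8 cores each need two 24-core nodes, while seven replicas need three.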

Alternatively, you can set up auto scaling:

Configure auto scaling for your EKS node group to automatically add nodes based on resource demands

Use Horizontal Pod Autoscaling (HPA) to automatically adjust the number of replicas in your deployment

You can configure the node group’s auto scaling to respond to increased CPU, memory, or custom metric demands, automatically provisioning new nodes with Neuron cores as needed. This makes sure that as the number of incoming requests grows, both your infrastructure and your deployment can scale accordingly.

Example scaling solutions include:

Cluster Autoscaler with Karpenter: For details, see Scale cluster compute with Karpenter and Cluster Autoscaler.

Multi-cluster federation

You should consider the following when scaling:

Alignment of resources

Compilation time

Cost management

Performance testing

By coordinating the scaling of both your node group and your deployment, you can effectively handle increased request volumes while maintaining optimal performance. The auto scaling capabilities of both your node group and deployment can work together to automatically adjust your cluster’s capacity based on incoming request volumes, providing a more responsive and efficient scaling solution.

Clean up

Use the following code to delete the cluster created in this solution:

eksctl delete cluster --name $CLUSTER_NAME --region $AWS_REGION

Conclusion

Deploying LLMs like Meta Llama 3.1-8B at scale poses significant computational challenges. Using Inferentia 2 instances and Amazon EKS can help overcome these challenges by enabling efficient model deployment in a containerized, scalable, and multi-tenant environment.

This solution combines the exceptional performance and cost-effectiveness of Inferentia 2 chips with the robust and flexible landscape of Amazon EKS. Inferentia 2 chips deliver high throughput and low latency inference, ideal for LLMs. Amazon EKS provides dynamic scaling, efficient resource utilization, and multi-tenancy capabilities.

The process involves setting up an EKS cluster, configuring an Inferentia 2 node group, installing Neuron components, and deploying the model as a Kubernetes pod. This approach facilitates high availability, resilience, and efficient resource sharing for language model services, while allowing for automatic scaling, load balancing, and self-healing capabilities.

For the complete code and detailed implementation steps, visit the GitHub repository.

About the Authors

Dmitri Laptev is a Senior GenAI Solutions Architect at AWS, based in Munich. With 17 years of experience in the IT industry, his interest in AI and ML dates back to his university years, fostering a long-standing passion for these technologies. Dmitri is enthusiastic about cloud computing and the ever-evolving landscape of technology.

Maurits de Groot is a Solutions Architect at Amazon Web Services, based out of Amsterdam. He specializes in machine learning-related topics and has a predilection for startups. In his spare time, he enjoys skiing and bouldering.

Ziwen Ning is a Senior Software Development Engineer at AWS. He currently focuses on enhancing the AI/ML experience through the integration of AWS Neuron with containerized environments and Kubernetes. In his free time, he enjoys challenging himself with kickboxing, badminton, and other various sports, and immersing himself in music.

Jianying Lang is a Principal Solutions Architect at the AWS Worldwide Specialist Organization (WWSO). She has over 15 years of working experience in the HPC and AI fields. At AWS, she focuses on helping customers deploy, optimize, and scale their AI/ML workloads on accelerated computing instances. She is passionate about combining the techniques in HPC and AI fields. Jianying holds a PhD in Computational Physics from the University of Colorado at Boulder.