AWS Machine Learning Blog, July 17, 2024
Accelerate your generative AI distributed training workloads with the NVIDIA NeMo Framework on Amazon EKS

In today’s rapidly evolving landscape of artificial intelligence (AI), training large language models (LLMs) poses significant challenges. These models often require enormous computational resources and sophisticated infrastructure to handle the vast amounts of data and complex algorithms involved. Without a structured framework, the process can become prohibitively time-consuming, costly, and complex. Enterprises struggle with managing distributed training workloads, efficient resource utilization, and model accuracy and performance. This is where the NVIDIA NeMo Framework comes into play. In this post, we present a step-by-step guide to run distributed training workloads on an Amazon Elastic Kubernetes Service (Amazon EKS) cluster.

NVIDIA NeMo Framework

NVIDIA NeMo is an end-to-end cloud-centered framework for training and deploying generative AI models with billions and trillions of parameters at scale. The NVIDIA NeMo Framework provides a comprehensive set of tools, scripts, and recipes to support each stage of the LLM journey, from data preparation to training and deployment. It offers a variety of customization techniques and is optimized for at-scale inference of models for both language and image applications, using multi-GPU and multi-node configurations. NVIDIA NeMo simplifies generative AI model development, making it more cost-effective and efficient for enterprises. By providing end-to-end pipelines, advanced parallelism techniques, memory-saving strategies, and distributed checkpointing, NVIDIA NeMo makes sure AI model training is streamlined, scalable, and high-performing.

The following are benefits of using NVIDIA NeMo for distributed training:

- End-to-end pipelines for different stages such as data preparation and training, which allow a plug-and-play approach for your custom data
- Parallelism techniques, including data parallelism, tensor parallelism, pipeline parallelism, sequence parallelism, expert parallelism, and context parallelism
- Memory-saving techniques, including selective activation recomputation, CPU offloading (activations, weights), attention mechanisms (Flash Attention, grouped query attention, multi-query attention, and sliding window attention), and distributed optimizers (Torch FSDP, distributed optimizer)
- Data loaders for different architectures
- Distributed checkpointing for saving and restoring checkpoints during model training

Solution overview

You can deploy and manage NVIDIA NeMo using either Slurm or Kubernetes orchestration platforms. Amazon EKS is a managed Kubernetes service that makes it straightforward to run Kubernetes clusters on AWS. It manages the availability and scalability of the Kubernetes control plane, and it provides compute node auto scaling and lifecycle management support to help you run highly available container applications.

Amazon EKS is an ideal platform for running distributed training workloads due to its robust integrations with AWS services and performance features. It seamlessly integrates with Amazon FSx for Lustre, a high-throughput file system, enabling fast data access and management using persistent volume claims with the FSx CSI driver. Amazon EKS also integrates with Amazon CloudWatch for comprehensive logging and monitoring, providing insights into cluster performance and resource utilization. It supports Amazon Simple Storage Service (Amazon S3) for scalable and durable data storage and management, providing accessibility for large datasets. Enhanced network performance is achieved with Elastic Fabric Adapter (EFA), which offers low-latency, high-throughput connectivity between nodes. These features collectively make Amazon EKS a powerful and efficient choice for optimizing AI and machine learning (ML) training workflows.

The following diagram shows the solution architecture.

In this post, we present the steps to run distributed training workloads on an EKS cluster. The high-level steps are as follows:

    1. Set up an EFA enabled 2-node 24xlarge cluster.
    2. Set up an FSx for Lustre file system so you can have a shared data repository for storing the training dataset and model checkpoints.
    3. Set up an environment for NVIDIA NeMo.
    4. Modify the NVIDIA NeMo Kubernetes manifests to prepare a dataset and train a model.

Prerequisites

You need to be able to launch a CPU-based Amazon Elastic Compute Cloud (Amazon EC2) instance that you’ll use to create the EKS cluster. When your instance is up and running, SSH into your EC2 instance and install the following CLIs:

- The latest version of the AWS Command Line Interface (AWS CLI)
- kubectl
- eksctl
- helm

These steps may change if you are on a non-Linux platform. Consult the preceding documentation for installing the CLIs on other platforms accordingly. We also require that you have a capacity reservation with p4de.24xlarge instances and have the capacityReservationID.
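If you need to look up your capacityReservationID, one way to do it (a minimal sketch; it assumes the reservation lives in us-west-2, so adjust the region to match yours) is with the AWS CLI:

    aws ec2 describe-capacity-reservations \
        --region us-west-2 \
        --query "CapacityReservations[?InstanceType=='p4de.24xlarge'].[CapacityReservationId,State,AvailableInstanceCount]" \
        --output table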

Launch an EKS cluster

EC2 p4de.24xlarge instances are equipped with NVIDIA A100 80 GB GPUs, which are highly popular for distributed training of generative AI workloads. For more information, refer to Amazon EC2 Instance Types. In this section, we show how to create an EKS cluster with an On-Demand Capacity Reservation for p4de.24xlarge instances.

    We provide the cluster creation config in p4de-cluster-config.yaml. See the following code:
git clone https://github.com/aws-samples/awsome-distributed-training.git
cd awsome-distributed-training/3.test_cases/2.nemo-launcher/EKS
eksctl create cluster -f p4de-cluster-config.yaml

The following are key points to note when creating this cluster:

    After the cluster is created, you can enable kubectl to communicate with your cluster by adding a new context to the kubectl config file:
    aws eks update-kubeconfig --region region-code --name my-cluster
    You can confirm communication with your cluster by running the following command:
    kubectl get svc

    NAME         TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)   AGE
    kubernetes   ClusterIP   10.100.0.1   <none>        443/TCP   28h

Next, you can install the AWS EFA Kubernetes Device Plugin. EFA is a network interface for EC2 instances that enhances the performance of inter-node communications, which is critical for distributed training workloads that involve GPUs. This plugin allows Kubernetes to recognize and utilize the EFA device, facilitating high-throughput, low-latency networking necessary for efficient distributed training and deep learning applications.

    Install the plugin with the following code:
helm repo add eks https://aws.github.io/eks-charts
helm install efa eks/aws-efa-k8s-device-plugin -n kube-system

The NVIDIA device plugin for Kubernetes enables GPU support within your EKS cluster by exposing the GPUs to the Kubernetes API server through the kubelet. It advertises the available GPU resources, allowing Kubernetes to schedule and manage GPU-accelerated workloads.

    Install the plugin with the following code:
    wget https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.3/nvidia-device-plugin.yml
    kubectl apply -f nvidia-device-plugin.yml
    Run the following command to verify all the pods:
    kubectl get pods --all-namespaces

    You can run kubectl get nodes to verify the nodes.

Alternatively, you can use the EKS node viewer tool to view nodes, their costs, and their status in your cluster. After it’s installed, enter eks-node-viewer to get the following view.
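If you don’t have the tool yet, one common way to install it (an assumption on our part; check the awslabs/eks-node-viewer repository for the installation method that fits your platform) is with Go:

    go install github.com/awslabs/eks-node-viewer/cmd/eks-node-viewer@latest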

The node viewer displays the IP addresses of our two p4de.24xlarge compute nodes.

    We can choose one of these private IP DNS names to further examine and describe the node as follows:
kubectl describe node ip-192-168-165-37.us-west-2.compute.internal

The preceding command returns a lot of detail about the node. To make sure EFA is installed correctly, confirm that you see details as shown in the following screenshot.

For p4 nodes, you will see vpc.amazonaws.com/efa:4 and for p5.48xlarge nodes, you should see vpc.amazonaws.com/efa:32.
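As a quick sanity check (a minimal sketch using the node name from the earlier describe command), you can grep the node description for the GPU and EFA resources advertised by the device plugins:

    kubectl describe node ip-192-168-165-37.us-west-2.compute.internal | grep -E "nvidia.com/gpu|vpc.amazonaws.com/efa"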

If EFA is enabled in the node group, make sure a security group is attached to the nodes with a rule that allows all outbound traffic originating from the same security group. This is required for EFA to work. For instructions, see Get started with EFA and MPI. This security group is intended for testing purposes only. For your production environments, we recommend that you create an inbound SSH rule that allows traffic only from the IP address from which you are connecting, such as the IP address of your computer, or a range of IP addresses in your local network.
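If you need to add that outbound rule manually, the following sketch shows one way to do it with the AWS CLI; <node-security-group-id> is a placeholder for the security group attached to your compute nodes:

    # Allow all outbound traffic whose destination is the same security group (required for EFA)
    aws ec2 authorize-security-group-egress \
        --group-id <node-security-group-id> \
        --ip-permissions "IpProtocol=-1,UserIdGroupPairs=[{GroupId=<node-security-group-id>}]"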

Create an FSx for Lustre file system

For distributed training applications, typically hundreds of GPU instances are used, with each node containing multiple GPUs. It is crucial that all nodes can access a shared file system to train on the same dataset efficiently. For this purpose, a high-performance file system with high throughput and low latency is essential. We recommend using the FSx for Lustre file system for large-scale distributed training, because it meets these requirements and provides seamless data access for all nodes involved in the training process.

To have an FSx for Lustre file system mounted on your EKS cluster, complete the following steps:

    Use the following scripts to create an AWS Identity and Access Management (IAM) role and attach the FSx policy:
    export FSX_POLICY_NAME=fsx-csi

    wget https://github.com/aws-samples/aws-do-eks/blob/main/Container-Root/eks/deployment/csi/fsx/fsx-policy.json
    export FSX_POLICY_DOC=file://fsx-policy.json

    # From the EC2 Auto Scaling group
    export EKS_INSTANCE_PROFILE_NAME=eks-1ec6fc6b-1a19-d65d-66ac-293ff0a20eb9

    POLICY_ARN=$(aws iam create-policy --policy-name ${FSX_POLICY_NAME} --policy-document $FSX_POLICY_DOC --query "Policy.Arn" --output text)

    INSTANCE_PROFILE=$(aws iam list-instance-profiles --query InstanceProfiles[?InstanceProfileName=="'${EKS_INSTANCE_PROFILE_NAME}'"].{InstanceProfileName:InstanceProfileName} --output text)

    ROLE_NAME=$(aws iam get-instance-profile --instance-profile-name ${INSTANCE_PROFILE} --query InstanceProfile.Roles[0].RoleName --output text)

    # Attach the FSx policy to role ${ROLE_NAME}
    aws iam attach-role-policy --policy-arn ${POLICY_ARN} --role-name ${ROLE_NAME}
    Use the following script to create a security group that allows EKS nodes to access the file system:
    # From the EC2 console
    export MY_REGION=us-west-2

    # FSX_SUBNET_ID should be the same subnet ID the compute nodes are in. You can get this from the EKS console.
    export FSX_SUBNET_ID=subnet-0edecd850cff2cfad

    # From the EC2 Auto Scaling group
    export FSX_SECURITY_GROUP_NAME=eks-fsx-sg

    # Get VPC_ID from the EKS console
    export VPC_ID=vpc-04411d49af198a6ea

    # Create the security group
    export SECURITY_GROUP_ID=$(aws ec2 create-security-group --vpc-id ${VPC_ID} --region ${MY_REGION} --group-name ${FSX_SECURITY_GROUP_NAME} --description "FSx for Lustre Security Group" --query "GroupId" --output text)

    export SUBNET_CIDR=$(aws ec2 describe-subnets --region ${MY_REGION} --query Subnets[?SubnetId=="'${FSX_SUBNET_ID}'"].{CIDR:CidrBlock} --output text)

    # Ingress rule for Lustre traffic (port 988)
    aws ec2 authorize-security-group-ingress --region ${MY_REGION} --group-id ${SECURITY_GROUP_ID} --protocol tcp --port 988 --cidr ${SUBNET_CIDR}
    Create a 1.2 TB Persistent_2 FSx for Lustre file system from the FSx for Lustre console in the same Availability Zone as your compute instances (FSX_SUBNET_ID), VPC of Amazon EKS (VPC_ID), and the security group you created (SECURITY_GROUP_ID). After the file system is created, note the file system ID, DNS name, and mount name from the file system details page.
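If you prefer the AWS CLI over the console, the following sketch creates an equivalent file system; it assumes the environment variables from the previous step and a PerUnitStorageThroughput of 250 MB/s/TiB, so adjust the values to your needs:

    aws fsx create-file-system \
        --region ${MY_REGION} \
        --file-system-type LUSTRE \
        --storage-capacity 1200 \
        --subnet-ids ${FSX_SUBNET_ID} \
        --security-group-ids ${SECURITY_GROUP_ID} \
        --lustre-configuration "DeploymentType=PERSISTENT_2,PerUnitStorageThroughput=250"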

Before mounting the file system, you need to install the FSx CSI driver that allows EKS clusters to manage the lifecycle of FSx for Lustre file systems.

    Install the FSx CSI driver as follows:
    echo "Installing FSx CSI driver ..."kubectl apply -k "github.com/kubernetes-sigs/aws-fsx-csi-driver/deploy/kubernetes/overlays/stable/?ref=master"echo "FSx pods in kube-system namespace ..."kubectl -n kube-system get pods | grep fsx
    Next, to mount the file system, apply the manifests provided in the fsx-storage-class.yaml, fsx-pv.yaml, and fsx-pvc.yaml files:
    # Storage class
    kubectl apply -f fsx-storage-class.yaml
    kubectl get sc

    # Persistent volume
    kubectl apply -f fsx-pv.yaml

    # Persistent volume claim
    kubectl apply -f fsx-pvc.yaml

You can check to make sure that the volumes are in Bound state.
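For example, a quick check looks like this; the persistent volume and claim names come from the manifests you applied in the previous step:

    kubectl get pv,pvc
    # The STATUS column for both the persistent volume and the claim should show Bound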

Set up the environment for NVIDIA NeMo

For this post, we use the NVIDIA device plugin for Kubernetes, but if you need to install the GPU Operator, you can do so as follows:

helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
helm install --wait --generate-name -n gpu-operator --create-namespace nvidia/gpu-operator

To enable distributed training, we use the KubeFlow Training Operator, which is essential for managing and scheduling ML training jobs in a Kubernetes environment. This operator simplifies the process of running distributed training jobs by automating the deployment and scaling of the necessary components. See the following code:

# Deploy the Kubeflow training operator
kubectl apply -k "github.com/kubeflow/training-operator/manifests/overlays/standalone?ref=v1.7.0"

# From https://github.com/aws-samples/aws-do-eks/blob/main/Container-Root/eks/deployment/kubeflow/training-operator/deploy.sh
# Configure RBAC resources
kubectl apply -f ./clusterrole-hpa-access.yaml
kubectl apply -f ./clusterrolebinding-training-operator-hpa-access.yaml

Additionally, we use the KubeFlow MPI Operator for preprocessing training data in parallel. The MPI Operator facilitates running Message Passing Interface (MPI) jobs, which are crucial for parallelizing the preprocessing tasks across multiple nodes, thereby speeding up the training process. See the following code:

kubectl apply -f https://raw.githubusercontent.com/kubeflow/mpi-operator/v0.4.0/deploy/v2beta1/mpi-operator.yaml

# From https://github.com/aws-samples/aws-do-eks/blob/main/Container-Root/eks/deployment/kubeflow/mpi-operator/clusterrole-mpi-operator.yaml
# Add lease permissions for the mpi-operator cluster role
kubectl apply -f ./clusterrole-mpi-operator.yaml

The NVIDIA NeMo Framework is available publicly in the image nvcr.io/nvidia/nemo:24.01.framework. We provide an AWS optimized Dockerfile for use with P4 and P5 instances. We recommend the following library versions for optimal performance:

ENV EFA_INSTALLER_VERSION=1.30.0
ENV AWS_OFI_NCCL_VERSION=1.8.1-aws
ENV NCCL_VERSION=2.19.4-1

You can build and push the image to Amazon Elastic Container Registry (Amazon ECR) as follows:

## AWS
export AWS_REGION=us-west-2
export ACCOUNT=$(aws sts get-caller-identity --query Account --output text)

## Docker image
export REGISTRY=${ACCOUNT}.dkr.ecr.${AWS_REGION}.amazonaws.com/
export IMAGE=nemo-aws
export TAG=":24.01.framework"

docker build -t ${REGISTRY}${IMAGE}${TAG} -f 0.Dockerfile .

echo "Logging in to $REGISTRY ..."
aws ecr get-login-password | docker login --username AWS --password-stdin $REGISTRY

# Create the repository if it does not exist
REGISTRY_COUNT=$(aws ecr describe-repositories | grep ${IMAGE} | wc -l)
if [ "$REGISTRY_COUNT" == "0" ]; then
    echo ""
    echo "Creating repository ${IMAGE} ..."
    aws ecr create-repository --repository-name ${IMAGE}
fi

# Push the image
docker image push ${REGISTRY}${IMAGE}${TAG}

The NVIDIA NeMo Framework requires users to provide config files with job and model information. You can copy the launcher scripts from the container as follows:

# Run the container
docker run -it ${REGISTRY}${IMAGE}${TAG} bash

# Copy the launcher scripts out of the container (use docker ps to find the container ID)
docker cp -a <container-id>:/opt/NeMo-Megatron-Launcher/ <Path-to-save-launcher-scripts>

In a Slurm cluster implementation, the launcher scripts, data, and results folders could all reside in a file system that both the head node (the node from which jobs are submitted) and the compute nodes can access. In this Amazon EKS implementation, however, the node that you used to create the EKS cluster doesn’t have access to the file system mounted on the cluster. To work around this, put the launcher scripts on the head node, and the results and data folders in the file system that the compute nodes have access to.

Run NVIDIA NeMo on an EKS cluster

We’re now ready to set up NVIDIA NeMo Kubernetes manifests for data preparation and model training. For more information about running it on premises, see Running NeMo Framework on Kubernetes. There are some modifications to be done for it to run on Amazon EKS, as shown in the following steps. We provide the launcher scripts in the GitHub repo.

    Modify the launcher_scripts/conf/cluster/k8s.yaml file as follows. The subPath field is the path where FSx for Lustre is mounted, which is /fsx-shared in this case.
    shm_size: 512Gi  # Amount of system memory to allocate in Pods. Should end in "Gi" for gigabytes.
    volumes:
      persistentVolumeClaim:
        # This claim should be created before running
        claimName: fsx-pvc
        subPath: fsx-shared  # path is mirrored into pod (no leading slash b/c relative to root)
    # NOTE: These args will soon be deprecated
    nfs_server: null  # Hostname or IP address for the NFS server where data is stored.
    nfs_path: null  # Path to store data in the NFS server.
    ib_resource_name: null  # Specify the resource name for IB devices according to kubernetes, such as "nvidia.com/hostdev" for Mellanox IB adapters. Can also be a list, but must be same length as ib_count
    ib_count: null  # Specify the number of IB devices to include per node in each pod. Can also be a list, but must be same length as ib_resource_name
    ib_network_annotation: ""  # Specify the networks as comma separated values
    dns_policy: null  # Specify a dnsPolicy to use in all pods, if necessary
    Install the required Python packages so that NeMo Launcher can submit jobs to the Kubernetes cluster:
sudo apt install python3-pip
pip install -r <Path-to-NeMo-Megatron-Launcher>/requirements.txt

Next, we copy the bpe and nsfw folders from the launcher scripts (under launcher_scripts/data/) to the /fsx-shared/data folder on the file system:

    To copy files to and from EKS pods, you can start a pod just for this purpose. Create a file fsx-share-test.yaml as follows:
    apiVersion: v1
    kind: Pod
    metadata:
      name: fsx-share-test
    spec:
      containers:
      - name: fsx-share-test
        image: ubuntu
        command: ["/bin/bash"]
        args: ["-c", "while true; do echo \"hello from FSx\" - $(date -u) >> /fsx-shared/test.txt; sleep 120; done"]
        volumeMounts:
        - name: fsx-pv
          mountPath: /fsx-shared
      volumes:
      - name: fsx-pv
        persistentVolumeClaim:
          claimName: fsx-pvc
    Run this pod and copy the files:
    kubectl apply -f fsx-share-test.yaml
    kubectl cp <Path-to-NeMo-Megatron-Launcher>/launcher_scripts/data/bpe fsx-share-test:/fsx-shared/data/
    kubectl cp <Path-to-NeMo-Megatron-Launcher>/launcher_scripts/data/nsfw fsx-share-test:/fsx-shared/data/
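To confirm the copy worked, you can list the directory through the same helper pod (a quick check, nothing more):

    kubectl exec -it fsx-share-test -- ls -R /fsx-shared/data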

A few files need to be updated for data preparation to work with the EKS cluster.

    Modify the launcher_scripts/conf/config.yaml file:
      - For cluster, use k8s.
      - For training, use gpt3/126m.
      - For stages, this should be just data_preparation and no other stages.
      - For launcher_scripts_path, use the path to the NeMo Megatron launcher scripts, which should end with /launcher_scripts.
      - For data_dir, use /fsx-shared/data (the location to store and read the data).
      - For base_results_dir, use /fsx-shared/results (the location to store the results, checkpoints, and logs).
      - For container, use ${REGISTRY}${IMAGE}${TAG} (the image URI you pushed to Amazon ECR earlier).
    Modify the conf/data_preparation/gpt3/download_gpt3_pile.yaml file:
      - Set node_array_size to 2.
      - Set file_numbers to "0-5". With five files, it should be around 350 GB of data.
    Modify the nemo_launcher/core/k8s_templates/data_preparation/data-prep.yaml file:
      - If you get the error that mpirun is not found, add the full path to the executable: /opt/amazon/openmpi/bin/mpirun.
      - Add /fsx-shared in the container volume mount path.
      - Add the following volume:
volumes:
  - name: fsx-pv
    persistentVolumeClaim:
      claimName: fsx-pvc
    Launch the data preparation job:
    python3 main.py

This script creates a Helm chart for the selected stage (in this case, data_preparation) and runs the Helm chart automatically. Refer to Run NeMo Framework on Kubernetes for an explanation of the data preparation process. Make sure python3 is installed.

    You can monitor your job status and logs using three commands: helm list, kubectl get pods, and kubectl logs --follow.
    When the job is finished, you can remove the Helm chart:
    helm uninstall download-gpt3-pile

You can see the downloaded data in the /fsx-shared folder by opening a shell in one of the pods with kubectl exec -it nlp-worker-0 bash.
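For example, a minimal check from outside the pod looks like this (nlp-worker-0 is the worker pod name used above):

    kubectl exec -it nlp-worker-0 -- ls -lh /fsx-shared/data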

Training

Now that our data preparation is complete, we’re ready to train our model with the created dataset. Complete the following steps:

    Modify a parameter in the conf/config.yaml file:
      Set stages to training and no other stages.
    Modify parameters in conf/training/gpt3/126m.yaml:
      - Set num_nodes to 2.
      - Set devices to 1.
      - On line 18, change use_distributed_sampler: False to replace_sampler_ddp: False.

Optionally, if you want to use a mock dataset instead of a real dataset for testing purposes, you can modify the data section as follows. You are essentially changing data_impl: mmap to data_impl: mock and assigning an empty list to data_prefix.

data:
  data_impl: mock
  splits_string: "99990,8,2"
  seq_length: 2048
  skip_warmup: True
  num_workers: 2
  dataloader_type: single # cyclic
  reset_position_ids: False # Reset position ids after end-of-document token
  reset_attention_mask: False # Reset attention mask after end-of-document token
  eod_mask_loss: False # Mask loss for the end of document tokens
  index_mapping_dir: null
  data_prefix: [] # Should be weight path weight path... for a blended dataset
  # You can just comment out the default "data_prefix" values like below.
  # - ${data_dir}/my-gpt3_00_text_document
  # - .0333
    Modify the parameters in the nemo_launcher/core/k8s_templates/training/training.yaml file.
    Run python3 main.py to start training, and you should see the training pods by running kubectl get pods as follows:
    NAME                    READY   STATUS    RESTARTS   AGE
    nlp-training-worker-0   1/1     Running   0          168m
    nlp-training-worker-1   1/1     Running   0          168m

In addition to monitoring your job using helm list, kubectl get pods, and kubectl logs --follow, you can also open a shell in your pod with kubectl exec and use nvidia-smi to check GPU status.
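For example (the pod name comes from the kubectl get pods output shown earlier):

    kubectl exec -it nlp-training-worker-0 -- nvidia-smi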

    When the job is finished, you can remove the Helm chart:
    helm uninstall gpt3-126m

Model checkpoints are saved at /fsx-shared/results/checkpoints, along with other training logs and TensorBoard events. By default, checkpoints are saved every 2,000 steps. You can modify the conf/training/gpt3/126m.yaml file to make changes in the training setup.

Troubleshooting deployment failures

If deployment fails due to incorrect setup or configuration, complete the following debug steps:

    1. Find the error message by running kubectl logs --follow PODNAME and kubectl describe pod PODNAME.
    2. Stop any running jobs by removing the Helm chart. This can be done by running helm uninstall CHARTNAME.

Pods should be spun down after removing the Helm chart.

    You can double-check by running kubectl get pods. If pods are not spun down, you can manually stop them by running kubectl delete pod PODNAME.

Based on the error message, you may find errors from:

Clean up

It’s important to spin down resources after model training in order to avoid costs associated with running idle instances. To clean up our setup, we must delete the FSx for Lustre file system before deleting the cluster because it’s associated with a subnet in the cluster’s VPC.

    To delete the file system integration with the EKS cluster, run the following command:
    kubectl delete -f ./fsx-storage-class.yaml

Not only will this delete the persistent volume, it will also delete the FSx for Lustre file system, and all the data on the file system will be lost.

    When Step 1 is complete, delete the cluster by using the following script:
    eksctl delete cluster -f p4de-cluster-config.yaml

This will delete all the existing pods, remove the cluster, and delete the VPC you created in the beginning.

Conclusion

In this post, we demonstrated how to train generative AI models at scale using the NeMo Framework within an EKS cluster. We covered the challenges of training LLMs and how NeMo’s comprehensive tools and optimizations address these challenges, making the process more efficient and cost-effective. With NeMo, you can manage and scale distributed training workloads effectively. This post works with P4de instances. Another popular instance for generative AI distributed training workloads is the p5.48xlarge instance with the NVIDIA H100 80 GB GPU. To add a P5 node group to an existing EKS cluster, refer to AWS CLI scripts for EKS management.

To help you get started, we have published a GitHub repository that provides step-by-step instructions for creating an EKS cluster with P4de instances, mounting an FSx for Lustre file system, and running distributed training workloads with NeMo. This guide empowers you to harness the full potential of NeMo and Amazon EKS for your AI model training needs.


About the authors

Ankur Srivastava is a Sr. Solutions Architect in the ML Frameworks Team. He focuses on helping customers with self-managed distributed training and inference at scale on AWS. His experience includes industrial predictive maintenance, digital twins, and probabilistic design optimization. He completed his doctoral studies in mechanical engineering at Rice University and postdoctoral research at the Massachusetts Institute of Technology.

Akshit Arora is a senior data scientist at NVIDIA, where he works on deploying conversational AI models on GPUs at scale. He’s a graduate of University of Colorado at Boulder, where he applied deep learning to improve knowledge tracking on a K-12 online tutoring platform. His work spans multilingual text-to-speech, time series classification, ed-tech, and practical applications of deep learning.

Eliuth Triana Isaza is a Developer Relations Manager at NVIDIA empowering Amazon’s AI MLOps, DevOps, Scientists and AWS technical experts to master the NVIDIA computing stack for accelerating and optimizing Generative AI Foundation models spanning from data curation, GPU training, model inference and production deployment on AWS GPU instances. In addition, Eliuth is a passionate mountain biker, skier, tennis and poker player.

Wenhan Tan is a Solutions Architect at NVIDIA, assisting customers in adopting NVIDIA AI solutions at large scale. His work focuses on accelerating deep learning applications and addressing inference and training challenges.
