AWS Machine Learning Blog, July 17, 2024
Accelerate your generative AI distributed training workloads with the NVIDIA NeMo Framework on Amazon EKS

In today’s rapidly evolving landscape of artificial intelligence (AI), training large language models (LLMs) poses significant challenges. These models often require enormous computational resources and sophisticated infrastructure to handle the vast amounts of data and complex algorithms involved. Without a structured framework, the process can become prohibitively time-consuming, costly, and complex. Enterprises struggle with managing distributed training workloads, efficient resource utilization, and model accuracy and performance. This is where the NVIDIA NeMo Framework comes into play. In this post, we present a step-by-step guide to run distributed training workloads on an Amazon Elastic Kubernetes Service (Amazon EKS) cluster.

NVIDIA NeMo Framework

NVIDIA NeMo is an end-to-end cloud-centered framework for training and deploying generative AI models with billions and trillions of parameters at scale. The NVIDIA NeMo Framework provides a comprehensive set of tools, scripts, and recipes to support each stage of the LLM journey, from data preparation to training and deployment. It offers a variety of customization techniques and is optimized for at-scale inference of models for both language and image applications, using multi-GPU and multi-node configurations. NVIDIA NeMo simplifies generative AI model development, making it more cost-effective and efficient for enterprises. By providing end-to-end pipelines, advanced parallelism techniques, memory-saving strategies, and distributed checkpointing, NVIDIA NeMo makes sure AI model training is streamlined, scalable, and high-performing.

The following are benefits of using NVIDIA NeMo for distributed training:

- End-to-end pipelines for different stages such as data preparation and training, which allow a plug-and-play approach for your custom data
- Parallelism techniques, including data parallelism, tensor parallelism, pipeline parallelism, sequence parallelism, expert parallelism, and context parallelism
- Memory-saving techniques, including selective activation recomputation, CPU offloading (activations, weights), attention mechanisms (Flash Attention, grouped query attention, multi-query attention, and sliding window attention), and distributed optimizers (Torch FSDP, distributed optimizer)
- Data loaders for different architectures
- Distributed checkpointing for saving and restoring checkpoints during model training

Solution overview

You can deploy and manage NVIDIA NeMo using either Slurm or Kubernetes orchestration platforms. Amazon EKS is a managed Kubernetes service that makes it straightforward to run Kubernetes clusters on AWS. It manages the availability and scalability of the Kubernetes control plane, and it provides compute node auto scaling and lifecycle management support to help you run highly available container applications.

Amazon EKS is an ideal platform for running distributed training workloads due to its robust integrations with AWS services and performance features. It seamlessly integrates with Amazon FSx for Lustre, a high-throughput file system, enabling fast data access and management using persistent volume claims with the FSx CSI driver. Amazon EKS also integrates with Amazon CloudWatch for comprehensive logging and monitoring, providing insights into cluster performance and resource utilization. It supports Amazon Simple Storage Service (Amazon S3) for scalable and durable data storage and management, providing accessibility for large datasets. Enhanced network performance is achieved with Elastic Fabric Adapter (EFA), which offers low-latency, high-throughput connectivity between nodes. These features collectively make Amazon EKS a powerful and efficient choice for optimizing AI and machine learning (ML) training workflows.

The following diagram shows the solution architecture.

In this post, we present the steps to run distributed training workloads on an EKS cluster. The high-level steps are as follows:

    1. Set up an EFA enabled 2-node 24xlarge cluster.
    2. Set up an FSx for Lustre file system so you can have a shared data repository for storing the training dataset and model checkpoints.
    3. Set up an environment for NVIDIA NeMo.
    4. Modify the NVIDIA NeMo Kubernetes manifests to prepare a dataset and train a model.

Prerequisites

You need to be able to launch a CPU-based Amazon Elastic Compute Cloud (Amazon EC2) instance that you’ll use to create the EKS cluster. When your instance is up and running, SSH into your EC2 instance and install the following CLIs:

- The latest version of the AWS Command Line Interface (AWS CLI)
- kubectl
- eksctl
- helm

These steps may change if you are on a non-Linux platform. Consult the preceding documentation for installing the CLIs on other platforms accordingly. We also require that you have a capacity reservation with p4de.24xlarge instances and have the capacityReservationID.
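If you need to look up your capacityReservationID, one way to do it (a minimal sketch; it assumes the reservation lives in us-west-2, so adjust the region to match yours) is with the AWS CLI:

    aws ec2 describe-capacity-reservations \
        --region us-west-2 \
        --query "CapacityReservations[?InstanceType=='p4de.24xlarge'].[CapacityReservationId,State,AvailableInstanceCount]" \
        --output table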

Launch an EKS cluster

EC2 p4de.24xlarge instances are equipped with NVIDIA A100 80 GB GPUs, which are highly popular for distributed training of generative AI workloads. For more information, refer to Amazon EC2 Instance Types. In this section, we show how to create an EKS cluster with an On-Demand Capacity Reservation for p4de.24xlarge instances.

    We provide the cluster creation config in p4de-cluster-config.yaml. See the following code:
git clone https://github.com/aws-samples/awsome-distributed-training.git
cd awsome-distributed-training/3.test_cases/2.nemo-launcher/EKS
eksctl create cluster -f p4de-cluster-config.yaml

The following are key points to note when creating this cluster:

    After the cluster is created, you can enable kubectl to communicate with your cluster by adding a new context to the kubectl config file:
    aws eks update-kubeconfig --region region-code --name my-cluster
    You can confirm communication with your cluster by running the following command:
    kubectl get svc

    NAME         TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)   AGE
    kubernetes   ClusterIP   10.100.0.1   <none>        443/TCP   28h

Next, you can install the AWS EFA Kubernetes Device Plugin. EFA is a network interface for EC2 instances that enhances the performance of inter-node communications, which is critical for distributed training workloads that involve GPUs. This plugin allows Kubernetes to recognize and utilize the EFA device, facilitating high-throughput, low-latency networking necessary for efficient distributed training and deep learning applications.

    Install the plugin with the following code:
helm repo add eks https://aws.github.io/eks-charts
helm install efa eks/aws-efa-k8s-device-plugin -n kube-system

The NVIDIA device plugin for Kubernetes enables GPU support within your EKS cluster by exposing the GPUs to the Kubernetes API server through the kubelet. It advertises the available GPU resources, allowing Kubernetes to schedule and manage GPU-accelerated workloads.

    Install the plugin with the following code:
    wget https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.3/nvidia-device-plugin.yml
    kubectl apply -f nvidia-device-plugin.yml
    Run the following command to verify all the pods:
    kubectl get pods --all-namespaces

    You can run kubectl get nodes to verify the nodes.

Alternatively, you can use the EKS node viewer tool to view nodes, their costs, and their status in your cluster. After it’s installed, enter eks-node-viewer to get the following view.
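If you don’t have the tool yet, one common way to install it (an assumption on our part; check the awslabs/eks-node-viewer repository for the installation method that fits your platform) is with Go:

    go install github.com/awslabs/eks-node-viewer/cmd/eks-node-viewer@latest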

The node viewer displays the IP addresses of our two p4de.24xlarge compute nodes.

    We can choose one of these private IP DNS names to further examine and describe the node as follows:
kubectl describe node ip-192-168-165-37.us-west-2.compute.internal

The preceding command returns a lot of detail about the node. To make sure EFA is installed correctly, confirm that you see details as shown in the following screenshot.

For p4 nodes, you will see vpc.amazonaws.com/efa:4 and for p5.48xlarge nodes, you should see vpc.amazonaws.com/efa:32.
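As a quick sanity check (a minimal sketch using the node name from the earlier describe command), you can grep the node description for the GPU and EFA resources advertised by the device plugins:

    kubectl describe node ip-192-168-165-37.us-west-2.compute.internal | grep -E "nvidia.com/gpu|vpc.amazonaws.com/efa"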

If EFA is enabled in the node group, make sure a security group is attached to the nodes with a rule that allows all outbound traffic originating from the same security group. This is required for EFA to work. For instructions, see Get started with EFA and MPI. This security group is intended for testing purposes only. For your production environments, we recommend that you create an inbound SSH rule that allows traffic only from the IP address from which you are connecting, such as the IP address of your computer, or a range of IP addresses in your local network.
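If you need to add that outbound rule manually, the following sketch shows one way to do it with the AWS CLI; <node-security-group-id> is a placeholder for the security group attached to your compute nodes:

    # Allow all outbound traffic whose destination is the same security group (required for EFA)
    aws ec2 authorize-security-group-egress \
        --group-id <node-security-group-id> \
        --ip-permissions "IpProtocol=-1,UserIdGroupPairs=[{GroupId=<node-security-group-id>}]"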

Create an FSx for Lustre file system

For distributed training applications, typically hundreds of GPU instances are used, with each node containing multiple GPUs. It is crucial that all nodes can access a shared file system to train on the same dataset efficiently. For this purpose, a high-performance file system with high throughput and low latency is essential. We recommend using the FSx for Lustre file system for large-scale distributed training, because it meets these requirements and provides seamless data access for all nodes involved in the training process.

To have an FSx for Lustre file system mounted on your EKS cluster, complete the following steps:

    Use the following scripts to create an AWS Identity and Access Management (IAM) role and attach the FSx policy:
    export FSX_POLICY_NAME=fsx-csi

    wget https://github.com/aws-samples/aws-do-eks/blob/main/Container-Root/eks/deployment/csi/fsx/fsx-policy.json
    export FSX_POLICY_DOC=file://fsx-policy.json

    # From the EC2 Auto Scaling group
    export EKS_INSTANCE_PROFILE_NAME=eks-1ec6fc6b-1a19-d65d-66ac-293ff0a20eb9

    POLICY_ARN=$(aws iam create-policy --policy-name ${FSX_POLICY_NAME} --policy-document $FSX_POLICY_DOC --query "Policy.Arn" --output text)

    INSTANCE_PROFILE=$(aws iam list-instance-profiles --query InstanceProfiles[?InstanceProfileName=="'${EKS_INSTANCE_PROFILE_NAME}'"].{InstanceProfileName:InstanceProfileName} --output text)

    ROLE_NAME=$(aws iam get-instance-profile --instance-profile-name ${INSTANCE_PROFILE} --query InstanceProfile.Roles[0].RoleName --output text)

    # Attach the FSx policy to role ${ROLE_NAME}
    aws iam attach-role-policy --policy-arn ${POLICY_ARN} --role-name ${ROLE_NAME}
    Use the following script to create a security group that allows EKS nodes to access the file system:
    # From the EC2 console
    export MY_REGION=us-west-2

    # FSX_SUBNET_ID should be the same subnet ID the compute nodes are in. You can get this from the EKS console.
    export FSX_SUBNET_ID=subnet-0edecd850cff2cfad

    # From the EC2 Auto Scaling group
    export FSX_SECURITY_GROUP_NAME=eks-fsx-sg

    # Get VPC_ID from the EKS console
    export VPC_ID=vpc-04411d49af198a6ea

    # Create the security group
    export SECURITY_GROUP_ID=$(aws ec2 create-security-group --vpc-id ${VPC_ID} --region ${MY_REGION} --group-name ${FSX_SECURITY_GROUP_NAME} --description "FSx for Lustre Security Group" --query "GroupId" --output text)

    export SUBNET_CIDR=$(aws ec2 describe-subnets --region ${MY_REGION} --query Subnets[?SubnetId=="'${FSX_SUBNET_ID}'"].{CIDR:CidrBlock} --output text)

    # Ingress rule for Lustre traffic (port 988)
    aws ec2 authorize-security-group-ingress --region ${MY_REGION} --group-id ${SECURITY_GROUP_ID} --protocol tcp --port 988 --cidr ${SUBNET_CIDR}
    Create a 1.2 TB Persistent_2 FSx for Lustre file system from the FSx for Lustre console in the same Availability Zone as your compute instances (FSX_SUBNET_ID), VPC of Amazon EKS (VPC_ID), and the security group you created (SECURITY_GROUP_ID). After the file system is created, note the file system ID, DNS name, and mount name from the file system details page.
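If you prefer the AWS CLI over the console, the following sketch creates an equivalent file system; it assumes the environment variables from the previous step and a PerUnitStorageThroughput of 250 MB/s/TiB, so adjust the values to your needs:

    aws fsx create-file-system \
        --region ${MY_REGION} \
        --file-system-type LUSTRE \
        --storage-capacity 1200 \
        --subnet-ids ${FSX_SUBNET_ID} \
        --security-group-ids ${SECURITY_GROUP_ID} \
        --lustre-configuration "DeploymentType=PERSISTENT_2,PerUnitStorageThroughput=250"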

Before mounting the file system, you need to install the FSx CSI driver that allows EKS clusters to manage the lifecycle of FSx for Lustre file systems.

    Install the FSx CSI driver as follows:
    echo "Installing FSx CSI driver ..."kubectl apply -k "github.com/kubernetes-sigs/aws-fsx-csi-driver/deploy/kubernetes/overlays/stable/?ref=master"echo "FSx pods in kube-system namespace ..."kubectl -n kube-system get pods | grep fsx
    Next, to mount the file system, apply the manifests provided in the fsx-storage-class.yaml, fsx-pv.yaml, and fsx-pvc.yaml files:
    # Storage class
    kubectl apply -f fsx-storage-class.yaml
    kubectl get sc

    # Persistent volume
    kubectl apply -f fsx-pv.yaml

    # Persistent volume claim
    kubectl apply -f fsx-pvc.yaml

You can check to make sure that the volumes are in Bound state.
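For example, a quick check looks like this; the persistent volume and claim names come from the manifests you applied in the previous step:

    kubectl get pv,pvc
    # The STATUS column for both the persistent volume and the claim should show Bound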

Set up the environment for NVIDIA NeMo

For this post, we use the NVIDIA device plugin for Kubernetes, but if you need to install the GPU Operator, you can do so as follows:

helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
helm install --wait --generate-name -n gpu-operator --create-namespace nvidia/gpu-operator

To enable distributed training, we use the KubeFlow Training Operator, which is essential for managing and scheduling ML training jobs in a Kubernetes environment. This operator simplifies the process of running distributed training jobs by automating the deployment and scaling of the necessary components. See the following code:

# Deploy the Kubeflow training operator
kubectl apply -k "github.com/kubeflow/training-operator/manifests/overlays/standalone?ref=v1.7.0"

# From https://github.com/aws-samples/aws-do-eks/blob/main/Container-Root/eks/deployment/kubeflow/training-operator/deploy.sh
# Configure RBAC resources
kubectl apply -f ./clusterrole-hpa-access.yaml
kubectl apply -f ./clusterrolebinding-training-operator-hpa-access.yaml

Additionally, we use the KubeFlow MPI Operator for preprocessing training data in parallel. The MPI Operator facilitates running Message Passing Interface (MPI) jobs, which are crucial for parallelizing the preprocessing tasks across multiple nodes, thereby speeding up the training process. See the following code:

kubectl apply -f https://raw.githubusercontent.com/kubeflow/mpi-operator/v0.4.0/deploy/v2beta1/mpi-operator.yaml

# From https://github.com/aws-samples/aws-do-eks/blob/main/Container-Root/eks/deployment/kubeflow/mpi-operator/clusterrole-mpi-operator.yaml
# Add lease permissions for the mpi-operator cluster role
kubectl apply -f ./clusterrole-mpi-operator.yaml

The NVIDIA NeMo Framework is available publicly in the image nvcr.io/nvidia/nemo:24.01.framework. We provide an AWS optimized Dockerfile for use with P4 and P5 instances. We recommend the following library versions for optimal performance:

ENV EFA_INSTALLER_VERSION=1.30.0
ENV AWS_OFI_NCCL_VERSION=1.8.1-aws
ENV NCCL_VERSION=2.19.4-1

You can build and push the image to Amazon Elastic Container Registry (Amazon ECR) as follows:

## AWS
export AWS_REGION=us-west-2
export ACCOUNT=$(aws sts get-caller-identity --query Account --output text)

## Docker image
export REGISTRY=${ACCOUNT}.dkr.ecr.${AWS_REGION}.amazonaws.com/
export IMAGE=nemo-aws
export TAG=":24.01.framework"

docker build -t ${REGISTRY}${IMAGE}${TAG} -f 0.Dockerfile .

echo "Logging in to $REGISTRY ..."
aws ecr get-login-password | docker login --username AWS --password-stdin $REGISTRY

# Create the repository if it does not exist
REGISTRY_COUNT=$(aws ecr describe-repositories | grep ${IMAGE} | wc -l)
if [ "$REGISTRY_COUNT" == "0" ]; then
    echo ""
    echo "Creating repository ${IMAGE} ..."
    aws ecr create-repository --repository-name ${IMAGE}
fi

# Push the image
docker image push ${REGISTRY}${IMAGE}${TAG}

The NVIDIA NeMo Framework requires users to provide config files with job and model information. You can copy the launcher scripts from the container as follows:

# Run the container
docker run -it ${REGISTRY}${IMAGE}${TAG} bash

# Copy the launcher scripts out of the container (use docker ps to find the container ID)
docker cp -a <container-id>:/opt/NeMo-Megatron-Launcher/ <Path-to-save-launcher-scripts>

In a Slurm cluster implementation, the launcher scripts, data, and results folders could all reside in a file system that both the head node (the node from which jobs are submitted) and the compute nodes can access. In this Amazon EKS implementation, however, the node that you used to create the EKS cluster doesn’t have access to the file system mounted on the cluster. To work around this, put the launcher scripts on the head node, and the results and data folders in the file system that the compute nodes have access to.

Run NVIDIA NeMo on an EKS cluster

We’re now ready to set up NVIDIA NeMo Kubernetes manifests for data preparation and model training. For more information about running it on premises, see Running NeMo Framework on Kubernetes. There are some modifications to be done for it to run on Amazon EKS, as shown in the following steps. We provide the launcher scripts in the GitHub repo.

    Modify the launcher_scripts/conf/cluster/k8s.yaml file as follows. The subPath field is the path where FSx for Lustre is mounted, which is /fsx-shared in this case.
    shm_size: 512Gi  # Amount of system memory to allocate in Pods. Should end in "Gi" for gigabytes.
    volumes:
      persistentVolumeClaim:
        # This claim should be created before running
        claimName: fsx-pvc
        subPath: fsx-shared  # path is mirrored into pod (no leading slash b/c relative to root)
    # NOTE: These args will soon be deprecated
    nfs_server: null  # Hostname or IP address for the NFS server where data is stored.
    nfs_path: null  # Path to store data in the NFS server.
    ib_resource_name: null  # Specify the resource name for IB devices according to kubernetes, such as "nvidia.com/hostdev" for Mellanox IB adapters. Can also be a list, but must be same length as ib_count
    ib_count: null  # Specify the number of IB devices to include per node in each pod. Can also be a list, but must be same length as ib_resource_name
    ib_network_annotation: ""  # Specify the networks as comma separated values
    dns_policy: null  # Specify a dnsPolicy to use in all pods, if necessary
    Install the required Python packages so that NeMo Launcher can submit jobs to the Kubernetes cluster:
sudo apt install python3-pip
pip install -r <Path-to-NeMo-Megatron-Launcher>/requirements.txt

Next, we copy the bpe and nsfw folders from the launcher scripts (under launcher_scripts/data/) to the /fsx-shared/data folder on the file system:

    To copy files to and from EKS pods, you can start a pod just for this purpose. Create a file fsx-share-test.yaml as follows:
    apiVersion: v1
    kind: Pod
    metadata:
      name: fsx-share-test
    spec:
      containers:
      - name: fsx-share-test
        image: ubuntu
        command: ["/bin/bash"]
        args: ["-c", "while true; do echo \"hello from FSx\" - $(date -u) >> /fsx-shared/test.txt; sleep 120; done"]
        volumeMounts:
        - name: fsx-pv
          mountPath: /fsx-shared
      volumes:
      - name: fsx-pv
        persistentVolumeClaim:
          claimName: fsx-pvc
    Run this pod and copy the files:
    kubectl apply -f fsx-share-test.yaml
    kubectl cp <Path-to-NeMo-Megatron-Launcher>/launcher_scripts/data/bpe fsx-share-test:/fsx-shared/data/
    kubectl cp <Path-to-NeMo-Megatron-Launcher>/launcher_scripts/data/nsfw fsx-share-test:/fsx-shared/data/
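To confirm the copy worked, you can list the directory through the same helper pod (a quick check, nothing more):

    kubectl exec -it fsx-share-test -- ls -R /fsx-shared/data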

A few files need to be updated for data preparation to work with the EKS cluster.

    Modify the launcher_scripts/conf/config.yaml file:
      - For cluster, use k8s.
      - For training, use gpt3/126m.
      - For stages, this should be just data_preparation and no other stages.
      - For launcher_scripts_path, use the path to the NeMo Megatron launcher scripts, which should end with /launcher_scripts.
      - For data_dir, use /fsx-shared/data (the location to store and read the data).
      - For base_results_dir, use /fsx-shared/results (the location to store the results, checkpoints, and logs).
      - For container, use ${REGISTRY}${IMAGE}${TAG} (the image URI you pushed to Amazon ECR earlier).
    Modify the conf/data_preparation/gpt3/download_gpt3_pile.yaml file:
      - Set node_array_size to 2.
      - Set file_numbers to "0-5". With five files, it should be around 350 GB of data.
    Modify the nemo_launcher/core/k8s_templates/data_preparation/data-prep.yaml file:
      - If you get the error that mpirun is not found, add the full path to the executable: /opt/amazon/openmpi/bin/mpirun.
      - Add /fsx-shared in the container volume mount path.
      - Add the following volume:
volumes:
  - name: fsx-pv
    persistentVolumeClaim:
      claimName: fsx-pvc
    Launch the data preparation job:
    python3 main.py

This script creates a Helm chart for the selected stage (in this case, data_preparation) and runs the Helm chart automatically. Refer to Run NeMo Framework on Kubernetes for an explanation of the data preparation process. Make sure python3 is installed.

    You can monitor your job status and logs using three commands: helm list, kubectl get pods, and kubectl logs --follow.
    When the job is finished, you can remove the Helm chart:
    helm uninstall download-gpt3-pile

You can see the downloaded data in the /fsx-shared folder by opening a shell in one of the pods with kubectl exec -it nlp-worker-0 bash.
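For example, a minimal check from outside the pod looks like this (nlp-worker-0 is the worker pod name used above):

    kubectl exec -it nlp-worker-0 -- ls -lh /fsx-shared/data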

Training

Now that our data preparation is complete, we’re ready to train our model with the created dataset. Complete the following steps:

    Modify a parameter in the conf/config.yaml file:
      Set stages to training and no other stages.
    Modify parameters in conf/training/gpt3/126m.yaml:
      - Set num_nodes to 2.
      - Set devices to 1.
      - On line 18, change use_distributed_sampler: False to replace_sampler_ddp: False.

Optionally, if you want to use a mock dataset instead of a real dataset for testing purposes, you can modify the data section as follows. You are essentially changing data_impl: mmap to data_impl: mock and assigning an empty list to data_prefix.

data:
  data_impl: mock
  splits_string: "99990,8,2"
  seq_length: 2048
  skip_warmup: True
  num_workers: 2
  dataloader_type: single # cyclic
  reset_position_ids: False # Reset position ids after end-of-document token
  reset_attention_mask: False # Reset attention mask after end-of-document token
  eod_mask_loss: False # Mask loss for the end of document tokens
  index_mapping_dir: null
  data_prefix: [] # Should be weight path weight path... for a blended dataset
  # You can just comment out the default "data_prefix" values like below.
  # - ${data_dir}/my-gpt3_00_text_document
  # - .0333
    Modify the parameters in the nemo_launcher/core/k8s_templates/training/training.yaml file.
    Run python3 main.py to start training, and you should see the training pods by running kubectl get pods as follows:
    NAME                    READY   STATUS    RESTARTS   AGE
    nlp-training-worker-0   1/1     Running   0          168m
    nlp-training-worker-1   1/1     Running   0          168m

In addition to monitoring your job using helm list, kubectl get pods, and kubectl logs --follow, you can also open a shell in your pod with kubectl exec and use nvidia-smi to check GPU status.
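For example (the pod name comes from the kubectl get pods output shown earlier):

    kubectl exec -it nlp-training-worker-0 -- nvidia-smi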

    When the job is finished, you can remove the Helm chart:
    helm uninstall gpt3-126m

Model checkpoints are saved at /fsx-shared/results/checkpoints, along with other training logs and TensorBoard events. By default, checkpoints are saved every 2,000 steps. You can modify the conf/training/gpt3/126m.yaml file to make changes in the training setup.

Troubleshooting deployment failures

If deployment fails due to incorrect setup or configuration, complete the following debug steps:

    1. Find the error message by running kubectl logs --follow PODNAME and kubectl describe pod PODNAME.
    2. Stop any running jobs by removing the Helm chart. This can be done by running helm uninstall CHARTNAME.

Pods should be spun down after removing the Helm chart.

    You can double-check by running kubectl get pods. If pods are not spun down, you can manually stop them by running kubectl delete pod PODNAME.

Based on the error message, you may find errors from:

Clean up

It’s important to spin down resources after model training in order to avoid costs associated with running idle instances. To clean up our setup, we must delete the FSx for Lustre file system before deleting the cluster because it’s associated with a subnet in the cluster’s VPC.

    To delete the file system integration with the EKS cluster, run the following command:
    kubectl delete -f ./fsx-storage-class.yaml

Not only will this delete the persistent volume, it will also delete the FSx for Lustre file system, and all the data on the file system will be lost.

    When Step 1 is complete, delete the cluster by using the following script:
    eksctl delete cluster -f p4de-cluster-config.yaml

This will delete all the existing pods, remove the cluster, and delete the VPC you created in the beginning.

Conclusion

In this post, we demonstrated how to train generative AI models at scale using the NeMo Framework within an EKS cluster. We covered the challenges of training LLMs and how NeMo’s comprehensive tools and optimizations address these challenges, making the process more efficient and cost-effective. With NeMo, you can manage and scale distributed training workloads effectively. This post works with P4de instances. Another popular instance for generative AI distributed training workloads is the p5.48xlarge instance with the NVIDIA H100 80 GB GPU. To add a P5 node group to an existing EKS cluster, refer to AWS CLI scripts for EKS management.

To help you get started, we have published a GitHub repository that provides step-by-step instructions for creating an EKS cluster with P4de instances, mounting an FSx for Lustre file system, and running distributed training workloads with NeMo. This guide empowers you to harness the full potential of NeMo and Amazon EKS for your AI model training needs.


About the authors

Ankur Srivastava is a Sr. Solutions Architect in the ML Frameworks Team. He focuses on helping customers with self-managed distributed training and inference at scale on AWS. His experience includes industrial predictive maintenance, digital twins, and probabilistic design optimization. He completed his doctoral studies in mechanical engineering at Rice University and postdoctoral research at the Massachusetts Institute of Technology.

Akshit Arora is a senior data scientist at NVIDIA, where he works on deploying conversational AI models on GPUs at scale. He’s a graduate of University of Colorado at Boulder, where he applied deep learning to improve knowledge tracking on a K-12 online tutoring platform. His work spans multilingual text-to-speech, time series classification, ed-tech, and practical applications of deep learning.

Eliuth Triana Isaza is a Developer Relations Manager at NVIDIA empowering Amazon’s AI MLOps, DevOps, Scientists and AWS technical experts to master the NVIDIA computing stack for accelerating and optimizing Generative AI Foundation models spanning from data curation, GPU training, model inference and production deployment on AWS GPU instances. In addition, Eliuth is a passionate mountain biker, skier, tennis and poker player.

Wenhan Tan is a Solutions Architect at NVIDIA, assisting customers in adopting NVIDIA AI solutions at large scale. His work focuses on accelerating deep learning applications and addressing inference and training challenges.
