Introducing Amazon EKS support in Amazon SageMaker HyperPod

We are thrilled to introduce Amazon Elastic Kubernetes Service (Amazon EKS) support in Amazon SageMaker HyperPod, a purpose-built infrastructure engineered with resilience at its core. This capability allows for the seamless addition of SageMaker HyperPod managed compute to EKS clusters, using automated node and job resiliency features for foundation model (FM) development.

FMs are typically trained on large-scale compute clusters with hundreds or thousands of accelerators. Under such circumstances, hardware failures pose a significant challenge, because a single accelerator failure among thousands can halt the entire training process. For example, Meta Llama 3 405B pre-training over 54 days on 16K NVIDIA H100 Tensor Core GPUs experienced 419 unexpected interruptions, with 78% attributed to confirmed or suspected hardware issues, and with 58.7% of these interruptions being GPU-related problems, including NVLink failures and HBM3 memory failures.

From its inception, SageMaker HyperPod was designed with a focus on managed resiliency features to mitigate such hardware failures, enabling FM builders such as Thomson Reuters, Perplexity AI, and Hugging Face to scale their FM training and inference on Slurm clusters. With EKS support in HyperPod, you can now also benefit from these resiliency features on Kubernetes clusters by managing machine learning (ML) workloads using the HyperPod compute and the managed Kubernetes control plane on the EKS cluster.

AI startups like Observea and Articul8, and enterprises like Thomson Reuters use this new feature set to manage their ML model development lifecycle:

“Through our use of SageMaker HyperPod, our customers and internal teams no longer have to worry about operating and configuring the Kubernetes control plane, and SageMaker HyperPod provides the network performance and optimized configurations to support complex HPC workloads. With Amazon EKS support in SageMaker HyperPod, we can reduce time we spent for undifferentiated heavy lifting in infrastructure management and reduce operational costs by over 30%.”

– Observea

“As a Kubernetes house, we are now thrilled to welcome the launch of Amazon EKS support for SageMaker HyperPod. This is a game changer for us as it integrates seamlessly with our existing training pipelines and makes it even easier for us to manage and operate our large-scale Kubernetes clusters. In addition, this also helps our end customers as we are now able to package and productize this capability into our GenAI platform, enabling our customers to run their own training and fine-tuning workloads in a more streamlined manner.”

– Articul8 AI

This post is designed for Kubernetes cluster administrators and ML scientists, providing an overview of the key features that SageMaker HyperPod introduces to facilitate large-scale model training on an EKS cluster.

The post is organized into the following three sections:

Overview of Amazon EKS support in SageMaker HyperPod

HyperPod cluster setup and node resiliency features

Training job resiliency with the job auto resume functionality

Kubernetes CLI (kubectl)

HyperPod CLI (hyperpod)

Overview of EKS support in SageMaker HyperPod

This section provides a high-level overview of Amazon EKS support in SageMaker HyperPod, introduces three key resiliency features HyperPod compute provides on the EKS cluster, and discusses how SageMaker HyperPod provides smooth user experiences for admins and scientists.

Architecture overview

Amazon EKS support in HyperPod uses a 1-to-1 mapping between an EKS cluster (serving as the Kubernetes control plane) and a HyperPod compute (attached as a group of worker nodes). The architecture involves three virtual private clouds (VPCs), each hosting different types of resources:

Amazon EKS VPC

EKS control plane

Kubernetes API

Network Load Balancer

HyperPod VPC

elastic network interface

SageMaker user VPC

Amazon FSx for Lustre

Amazon Simple Storage Service

data repository association

Cross-account ENIs also bridge communication between HyperPod compute instances and other AWS services on your account, such as Amazon Elastic Container Registry (Amazon ECR) and Amazon CloudWatch.

The following diagram illustrates the high-level architecture of Amazon EKS support in HyperPod.

HyperPod-managed resiliency features

Amazon EKS support in HyperPod provides the following three capabilities to make sure the cluster stays healthy and training jobs continue under unexpected interruptions:

Deep health checks

AWS Trainium

Elastic Fabric Adapter (EFA)

Automated node recovery

managed, lightweight, and non-invasive checks

Job auto resume

Kubeflow Training Operator for PyTorch

User experiences

In addition to the managed resiliency features described above, the Amazon EKS integration in SageMaker HyperPod provides smooth user experiences for both admins and scientists, which are critical for managing a large cluster and running large-scale training jobs on it:

Admin experience

Scientist experience

HyperPod CLI

.yaml

kubectl

Kueue

managed MLflow

SageMaker distributed training libraries

KubeRay

HyperPod compute setup and node resiliency features

In this section, we provide a detailed guide on integrating HyperPod managed compute into your EKS cluster as Kubernetes worker nodes, and discuss how its built-in resiliency features provide infrastructure stability.

Prerequisites

You need to have the following in place prior to the HyperPod compute deployment:

EKS cluster

prerequisites

AWS CloudFormation template

Custom resources

HyperPod compute setup

With the aforementioned resources successfully deployed, you’re now prepared to create the HyperPod compute. The cluster configuration is specified using a JSON file; the following code provides an example:

cat > cluster-config.json << EOL
{
    "ClusterName": "ml-cluster",
    "Orchestrator": {
        "Eks": {
            "ClusterArn": "${EKS_CLUSTER_ARN}"
        }
    },
    "InstanceGroups": [
        {
            "InstanceGroupName": "worker-group-1",
            "InstanceType": "ml.p5.48xlarge",
            "InstanceCount": 4,
            "LifeCycleConfig": {
                "SourceS3Uri": "s3://${BUCKET_NAME}",
                "OnCreate": "on_create.sh"
            },
            "ExecutionRole": "${EXECUTION_ROLE}",
            "ThreadsPerCore": 1,
            "OnStartDeepHealthChecks": [
                "InstanceStress",
                "InstanceConnectivity"
            ]
        }
    ],
    "VpcConfig": {
        "SecurityGroupIds": [
            "$SECURITY_GROUP"
        ],
        "Subnets": [
            "$SUBNET_ID"
        ]
    },
    "NodeRecovery": "Automatic"
}
EOL

The provided configuration file contains two key highlights:

"OnStartDeepHealthChecks": ["InstanceStress", "InstanceConnectivity"]

"NodeRecovery": "Automatic"
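Because the heredoc above expands shell variables when it is written, an unset variable would silently produce an empty value in cluster-config.json. The following is a minimal sketch of a guard you could run first. The helper name require_vars is hypothetical and is not part of the AWS CLI; the variable names are the ones used in the configuration above.

```shell
#!/bin/sh
# Sketch: report which of the named shell variables are unset or empty.
# require_vars is a hypothetical helper, not an AWS CLI feature.
require_vars() {
    missing=""
    for name in "$@"; do
        # Indirectly read the variable called $name (empty if unset).
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            missing="$missing $name"
        fi
    done
    if [ -n "$missing" ]; then
        echo "missing:$missing"
        return 1
    fi
    echo "ok"
}
```

For example, running require_vars EKS_CLUSTER_ARN BUCKET_NAME EXECUTION_ROLE SECURITY_GROUP SUBNET_ID before the cat command prints ok only when every variable has a value.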

You can create a HyperPod compute with the following AWS CLI command (AWS CLI version 2.17.47 or newer is required):

aws sagemaker create-cluster \
    --cli-input-json file://cluster-config.json

{
    "ClusterArn": "arn:aws:sagemaker:us-east-2:xxxxxxxxxx:cluster/wccy5z4n4m49"
}

To verify the cluster status, you can use the following command:

aws sagemaker list-clusters --output table

This command displays the cluster details, including the cluster name, status, and creation time:

-----------------------------------------------------------------------------------------------------------------------
|                                                    ListClusters                                                     |
+---------------------------------------------------------------------------------------------------------------------+
||                                                 ClusterSummaries                                                  ||
|+----------------------------------------------------------------+--------------+----------------+------------------+|
||                           ClusterArn                           | ClusterName  | ClusterStatus  |  CreationTime    ||
|+----------------------------------------------------------------+--------------+----------------+------------------+|
||  arn:aws:sagemaker:us-east-2:111111111111:cluster/wccy5z4n4m49 |  ml-cluster  |  Creating      |  1723724079.337  ||
|+----------------------------------------------------------------+--------------+----------------+------------------+|

Alternatively, you can verify the cluster status through the SageMaker console. After a brief period, you can observe that the status for all nodes transitions to Running.
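Instead of rerunning list-clusters by hand, you can poll the cluster status in a loop. The following is a sketch only: the function names and the 30-second default interval are illustrative, and it assumes that describe-cluster reports InService once the cluster is ready and Failed if creation does not succeed.

```shell
#!/bin/sh
# Classify a cluster status string (hypothetical helper).
# Assumption: InService means ready and Failed means creation failed;
# every other status is treated as still in progress.
cluster_state() {
    case "$1" in
        InService) echo "ready" ;;
        Failed) echo "failed" ;;
        *) echo "pending" ;;
    esac
}

# Poll the cluster until it is ready or has failed.
# $1 = cluster name, $2 = seconds between polls (illustrative default: 30).
wait_for_cluster() {
    name="$1"
    interval="${2:-30}"
    while :; do
        status=$(aws sagemaker describe-cluster --cluster-name "$name" \
            --query 'ClusterStatus' --output text) || return 2
        case "$(cluster_state "$status")" in
            ready) echo "$name is $status"; return 0 ;;
            failed) echo "$name is $status" >&2; return 1 ;;
        esac
        sleep "$interval"
    done
}
```

With the configuration used in this post, you would call wait_for_cluster ml-cluster.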

Node resiliency features

To gain further insight into the instances, you can use kubectl get nodes and examine the node labels. The sagemaker.amazonaws.com/node-health-status label reveals the life stage of each node. For instance, nodes with the ml.m5.2xlarge instance type are labeled as Schedulable, indicating that they have successfully passed the regular HyperPod health check. Conversely, nodes with the ml.p5.48xlarge instance type are labeled as Unschedulable, indicating that they have entered the initial deep health checks. The following code shows an example:

# kubectl get nodes --show-labels=true
NAME                         ...  LABELS
hyperpod-i-023cfe933b3b34369 ...  beta.kubernetes.io/instance-type=ml.m5.2xlarge,sagemaker.amazonaws.com/node-health-status=Schedulable,  ...
hyperpod-i-045961b6424401838 ...  beta.kubernetes.io/instance-type=ml.p5.48xlarge,sagemaker.amazonaws.com/node-health-status=Unschedulable, ...
hyperpod-i-074b81fdb5bf52e19 ...  beta.kubernetes.io/instance-type=ml.p5.48xlarge,sagemaker.amazonaws.com/node-health-status=Unschedulable, ...
hyperpod-i-0ae97710b3033cdb1 ...  beta.kubernetes.io/instance-type=ml.m5.2xlarge,sagemaker.amazonaws.com/node-health-status=Schedulable,  ...
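On a large cluster, a per-status count is easier to read than the full label listing. The following sketch summarizes the sagemaker.amazonaws.com/node-health-status label from the text output shown above. The helper name count_health_status is hypothetical, and the parsing assumes the label value contains only letters, as in the examples in this post.

```shell
#!/bin/sh
# Count nodes per node-health-status value.
# Reads `kubectl get nodes --show-labels` text on stdin and prints
# one "Status=count" line per distinct status.
count_health_status() {
    grep -o 'sagemaker\.amazonaws\.com/node-health-status=[A-Za-z]*' \
        | sed 's/.*=//' \
        | sort \
        | uniq -c \
        | awk '{print $2 "=" $1}'
}
```

A typical invocation is kubectl get nodes --show-labels | count_health_status. To list only the nodes in one state, a standard label selector also works, for example kubectl get nodes -l sagemaker.amazonaws.com/node-health-status=Schedulable.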

The deep health check logs are stored in the CloudWatch log group /aws/sagemaker/Clusters/<cluster_name>/<cluster_id>, in log streams named DeepHealthCheckResults/<log_stream_id>. When the deep health checks identify an issue, the output log provides detailed information, including the instance ID that failed the deep health checks and the specific failure reason. For example:

# Example1
{
    "level": "error",
    "ts": "2024-08-15T21:15:22Z",
    "msg": "Encountered FaultyInstance. Replace the Instance. Region: us-east-2,InstanceType: p5.48xlarge. ERROR:Bandwidth has less than threshold: Expected minimumthreshold :80,NCCL Test output Bw: 30"
}

# Example2
{
    "level": "error",
    "ts": "2024-08-15T21:15:22Z",
    "msg": "Encountered Unknownerror. Replace the Instance. Region: us-east-2,InstanceType: p5.48xlarge. ERROR: Crash detected in dcgm test"
}
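When you review many of these entries, it can help to pull out only the failure messages. The following is a minimal sketch that assumes each log event has been saved as a single line of JSON, with the "level" and "msg" fields shown in the examples above. The helper name failed_check_messages is hypothetical, and the text matching is a convenience, not a full JSON parser.

```shell
#!/bin/sh
# Print the "msg" field of every error-level log line read from stdin.
# Assumes one JSON object per line and no escaped quotes inside "msg".
failed_check_messages() {
    grep '"level": *"error"' \
        | sed -n 's/.*"msg": *"\([^"]*\)".*/\1/p'
}
```

For example, after exporting the DeepHealthCheckResults log events to a local file (the file name is up to you), run failed_check_messages < deep-health-check.log to list the failure reasons.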

You can check the progress of the deep health check with the following values for the sagemaker.amazonaws.com/deep-health-check label on each node:

sagemaker.amazonaws.com/deep-health-check: InProgress

sagemaker.amazonaws.com/deep-health-check: Passed

sagemaker.amazonaws.com/deep-health-check: Failed

If a node fails the deep health checks, it will be replaced. Otherwise, it will be marked with the Schedulable label:

sagemaker.amazonaws.com/node-health-status: Schedulable

To replace a specific node in your cluster manually, modify its sagemaker.amazonaws.com/node-health-status label.

For the complete list of resilience-related Kubernetes labels, refer to the AWS documentation.
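As an illustration of modifying the label, the following sketch wraps the standard kubectl label command. The helper name relabel_node and its dry-run default are hypothetical. The status value that triggers a replacement is deliberately left as an argument: take the exact value from the AWS documentation on resilience-related labels rather than from this example.

```shell
#!/bin/sh
# Print (default) or run the kubectl command that overwrites the
# node-health-status label on one node.
# $1 = node name, $2 = label value (take it from the AWS documentation),
# $3 = "apply" to run the command; any other value only prints it.
relabel_node() {
    node="$1"
    status="$2"
    mode="${3:-dry-run}"
    label="sagemaker.amazonaws.com/node-health-status=$status"
    if [ "$mode" = "apply" ]; then
        kubectl label node "$node" "$label" --overwrite
    else
        echo "kubectl label node $node $label --overwrite"
    fi
}
```

Printing the command first gives you a chance to confirm the node name and label value before you change a live cluster.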

Even after the initial deep health checks, HyperPod periodically runs regular health checks. To view the health events detected by the HyperPod health monitoring agent, you can check the CloudWatch log stream:

Example log group name

/aws/sagemaker/Clusters/<cluster_name>/<cluster_id>

Example log stream name

SagemakerHealthMonitoringAgent/<your_node_group_name>/<instance_id>

The SagemakerHealthMonitoringAgent log stream for each node contains only the detection events from the health monitoring agent. For example:

# Example1
{
    "level": "info",
    "ts": "2024-09-06T03:15:11Z",
    "msg": "NPD caught ",
    "condition type: ": "KernelDeadlock",
    "with condition details ": {
        "type": "KernelDeadlock",
        "status": "False",
        "transition": "2024-09-06T03:15:11.539932213Z",
        "reason": "KernelHasNoDeadlock",
        "message": "kernel has no deadlock"
    },
    "HealthMonitoringAgentDetectionEvent": "HealthEvent"
}

# Example2
{
    "level": "info",
    "ts": "2024-09-06T03:15:11Z",
    "msg": "NPD caught ",
    "condition type: ": "NvidiaErrorTerminate",
    "with condition details ": {
        "type": "NvidiaErrorTerminate",
        "status": "False",
        "transition": "2024-09-06T03:15:11.539932283Z",
        "reason": "NvidiaNoErrorRequiredTerminate",
        "message": "Nvidia no error required terminate"
    },
    "HealthMonitoringAgentDetectionEvent": "HealthEvent"
}
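To get a quick overview of what the agent has reported on a node, you can list the distinct "reason" values in a saved copy of the log stream. This is a sketch under the assumption that the events look like the examples above. The helper name health_event_reasons is hypothetical, and the text matching works whether each event is stored on one line or pretty-printed.

```shell
#!/bin/sh
# List the distinct "reason" values found in health monitoring agent
# log text read from stdin, one per line, sorted.
health_event_reasons() {
    grep -o '"reason": *"[^"]*"' \
        | sed 's/.*: *"\(.*\)"/\1/' \
        | sort -u
}
```

For the two example events above, the output would be KernelHasNoDeadlock and NvidiaNoErrorRequiredTerminate, each on its own line.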

When the deep health checks or the health monitoring agent identify issues on a node, the node is labeled with sagemaker.amazonaws.com/node-health-status=UnschedulablePendingReplace:NoSchedule so that no pods are scheduled on it, and the node is then replaced or rebooted.

You can monitor the health status of HyperPod nodes through CloudWatch Container Insights, now with enhanced observability for Amazon EKS. Container Insights helps collect, aggregate, and summarize metrics and logs from containerized applications and microservices, providing detailed insights into performance, health, and status metrics for CPU, GPU, Trainium, EFA, and file system up to the container level. For the complete list of metrics tracked, see Amazon EKS and Kubernetes Container Insights metrics. With the Container Insights integration with SageMaker HyperPod, you can also check the individual node health status and the total number of schedulable and unschedulable nodes, as shown in the following screenshots.

You can find the Container Insights setup guide in the Amazon EKS Support in Amazon SageMaker HyperPod Workshop.

Training job resiliency with the job auto resume functionality

In addition to the infrastructure resiliency features, you can use the job auto resume capability with the Kubeflow Training Operator for PyTorch so that training jobs recover and continue after interruptions or failures. The job auto resume feature attempts to continue the job, whereas the HyperPod node auto recovery functionality works on resolving node failures (node reboot or replacement as needed) to minimize training downtime. This section demonstrates the job auto resume feature using a PyTorch FSDP example in the awsome-distributed-training repository.

To enable the job auto resume feature, you create a PyTorchJob with the fsdp.yaml manifest, which includes the following annotations and nodeSelector:

apiVersion: "kubeflow.org/v1"
kind: PyTorchJob
metadata:
    name: fsdpjob
    namespace: kubeflow
    # config for HyperPod job auto-resume
    annotations: {
        sagemaker.amazonaws.com/enable-job-auto-resume: "true",
        sagemaker.amazonaws.com/job-max-retry-count: "2"
    }
spec:
  pytorchReplicaSpecs:
    ......
    Worker:
      replicas: 10
      restartPolicy: OnFailure
      template:
        spec:
          nodeSelector:
            sagemaker.amazonaws.com/node-health-status: Schedulable
          ......

With the annotations sagemaker.amazonaws.com/enable-job-auto-resume: "true" and sagemaker.amazonaws.com/job-max-retry-count: "2", SageMaker HyperPod resumes interrupted training jobs up to two times and schedules the resumed jobs onto healthy nodes. These healthy nodes are identified by the node selector label sagemaker.amazonaws.com/node-health-status: Schedulable, ensuring that only nodes that have passed basic health checks and are available for running workloads are used for resumed jobs.
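Before you submit the job, a quick check that the manifest actually carries the annotation and node selector can catch copy-and-paste mistakes. The following is a sketch only: check_auto_resume is a hypothetical helper that performs a plain-text search, so it does not validate YAML structure or where in the manifest the keys appear.

```shell
#!/bin/sh
# Check that the manifest file in $1 enables job auto resume and
# selects Schedulable nodes. Plain-text search, not a YAML parser.
check_auto_resume() {
    file="$1"
    grep -q 'sagemaker\.amazonaws\.com/enable-job-auto-resume: *"true"' "$file" || {
        echo "auto-resume annotation missing"
        return 1
    }
    grep -q 'sagemaker\.amazonaws\.com/node-health-status: *Schedulable' "$file" || {
        echo "Schedulable nodeSelector missing"
        return 1
    }
    echo "ok"
}
```

For example, check_auto_resume fsdp.yaml prints ok when both settings are present.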

Submit the PyTorchJob using the kubectl command:

kubectl apply -f fsdp.yaml

With the job auto resume feature enabled, if a job fails due to a hardware failure or any transient issues during training, SageMaker HyperPod initiates the node replacement workflow and restarts the job after the faulty nodes are replaced. You can verify the status of job auto resume by describing the PyTorchJob:

kubectl describe pytorchjob -n kubeflow <job-name>

In the event of a hardware failure, the Kubeflow training job restarts as follows:

Start Time: 2024-07-11T05:53:10Z
Events:
Type     Reason                   Age                    From                   Message
----     ------                   ----                   ----                   -------
Normal   SuccessfulCreateService  9m45s                  pytorchjob-controller  Created service: pt-job-1-worker-0
Normal   SuccessfulCreateService  9m45s                  pytorchjob-controller  Created service: pt-job-1-worker-1
Normal   SuccessfulCreateService  9m45s                  pytorchjob-controller  Created service: pt-job-1-master-0
Warning  PyTorchJobRestarting     7m59s                  pytorchjob-controller  PyTorchJob pt-job-1 is restarting because 1 Master replica(s) failed.
Normal   SuccessfulCreatePod      7m58s (x2 over 9m45s)  pytorchjob-controller  Created pod: pt-job-1-worker-0
Normal   SuccessfulCreatePod      7m58s (x2 over 9m45s)  pytorchjob-controller  Created pod: pt-job-1-worker-1
Normal   SuccessfulCreatePod      7m58s (x2 over 9m45s)  pytorchjob-controller  Created pod: pt-job-1-master-0
Warning  PyTorchJobRestarting     7m58s                  pytorchjob-controller  PyTorchJob pt-job-1 is restarting because 1 Worker replica(s) failed
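If you only want to know whether a job has restarted, you can count the PyTorchJobRestarting events in the describe output. The following sketch uses a hypothetical helper, count_restart_events. Note that the number of events is not necessarily the number of retries, because a single restart can emit one event per failed replica type, as in the output above.

```shell
#!/bin/sh
# Count PyTorchJobRestarting events in `kubectl describe pytorchjob`
# text read from stdin. Prints 0 when there are none.
count_restart_events() {
    grep -c 'PyTorchJobRestarting'
}
```

A typical invocation is kubectl describe pytorchjob -n kubeflow <job-name> | count_restart_events. You can also ask the API server for these events directly with kubectl get events -n kubeflow --field-selector reason=PyTorchJobRestarting.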

When you submit a training job with the HyperPod CLI, you can also request the job to be auto resumed in the following way:

hyperpod start-job \
    --config-file ./config.yaml \
    --auto-resume true \
    --max-retry 2

Refer to config.yaml for the full configuration. For other CLI options, refer to the documentation in the GitHub repository.

Clean up

To delete your SageMaker HyperPod compute, use either the SageMaker console or the following AWS Command Line Interface (AWS CLI) command:

aws sagemaker delete-cluster --cluster-name <cluster_name>

Cluster deletion can take a few minutes. Deletion is complete when the cluster no longer appears on the SageMaker console.
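You can make the same check from the command line by polling list-clusters until the cluster name disappears. The following is a sketch: the function names and the 30-second default interval are illustrative, and it assumes the cluster name contains no whitespace.

```shell
#!/bin/sh
# Succeed if the exact name in $1 appears in the whitespace-separated
# list of cluster names read from stdin (hypothetical helper).
cluster_in_list() {
    tr -s ' \t' '\n\n' | grep -qxF "$1"
}

# Poll until the cluster is no longer listed.
# $1 = cluster name, $2 = seconds between polls (illustrative default: 30).
wait_for_deletion() {
    name="$1"
    interval="${2:-30}"
    while :; do
        names=$(aws sagemaker list-clusters \
            --query 'ClusterSummaries[].ClusterName' --output text) || return 2
        if ! printf '%s\n' "$names" | cluster_in_list "$name"; then
            echo "$name is no longer listed"
            return 0
        fi
        sleep "$interval"
    done
}
```

The exact-match comparison matters here: a cluster named ml-cluster-2 should not keep the loop waiting for ml-cluster.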

Conclusion

With the support for Amazon EKS in SageMaker HyperPod, customers who have standardized their FM development workflows on Kubernetes can adopt SageMaker HyperPod and manage their cluster resources through a familiar Kubernetes interface. When training an FM, SageMaker HyperPod automatically monitors cluster health, and when an infrastructure fault such as a GPU failure occurs, SageMaker HyperPod automatically remediates the issue and restarts the training process from the last saved checkpoint, without any human intervention. Amazon EKS support in SageMaker HyperPod further enhances this capability with deep health checks: whenever a new instance is added to the SageMaker HyperPod compute, it undergoes a deep health check to identify and replace potentially problematic instances. SageMaker HyperPod then automatically replaces or reboots nodes identified as faulty and, in the event of unexpected interruptions, resumes training through node replacement and job resubmission.

For an end-to-end tutorial on cluster management and FM training, visit the Amazon EKS Support in Amazon SageMaker HyperPod Workshop. For more information on infrastructure deployment and additional distributed training test cases, refer to the awsome-distributed-training repository. If you’re interested in deploying HyperPod with step-by-step commands, you can start from the aws-do-hyperpod repository.

About the authors

Keita Watanabe is a Senior GenAI Specialist Solutions Architect in the world-wide specialist organization at Amazon Web Services, where he helps develop machine learning solutions using OSS projects such as Slurm and Kubernetes. His background is in machine learning research and development. Prior to joining AWS, Keita worked in the ecommerce industry as a research scientist developing image retrieval systems for product search. Keita holds a PhD in Science from the University of Tokyo.

Alex Iankoulski is a full-stack software and infrastructure architect who likes to do deep, hands-on work. He is currently a Principal Solutions Architect in the world-wide specialist organization at AWS. In his role, he focuses on helping customers with the orchestration and scaling of ML and AI workloads on container-powered AWS services. He is also the author of the open source do framework and a Docker captain who loves applying container technologies to accelerate the pace of innovation while solving the world’s biggest challenges. During the past 10 years, Alex has worked on democratizing generative AI and ML, combating climate change, and making travel safer, healthcare better, and energy smarter.

Tomonori Shimomura is a Senior Solutions Architect on the Amazon SageMaker team, where he provides in-depth technical consultation to SageMaker customers and suggests product improvements to the product team. Before joining Amazon, he worked on the design and development of embedded software for video game consoles, and now he leverages his in-depth skills in cloud-side technology. In his free time, he enjoys playing video games, reading books, and writing software.

Arun Kumar Lokanatha is a Senior ML Solutions Architect with the Amazon SageMaker team. He specializes in large language model training workloads, helping customers build LLM workloads using SageMaker HyperPod, SageMaker training jobs, and SageMaker distributed training. Outside of work, he enjoys running, hiking, and cooking.

Manoj Ravi is a Senior Product Manager on the Amazon SageMaker team. He is passionate about building next-gen AI products and works on applications and tools to make foundation model development and deployment effortless for customers. He holds an MBA from the Haas School of Business and a master’s degree from Carnegie Mellon University. In his spare time, Manoj enjoys playing tennis and pursuing landscape photography.