AWS Machine Learning Blog, September 12, 2024
Introducing Amazon EKS support in Amazon SageMaker HyperPod

 

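Amazon SageMaker HyperPod now supports Amazon EKS, providing purpose-built infrastructure for foundation model (FM) development with resilience at its core. This capability lets you seamlessly add SageMaker HyperPod managed compute to EKS clusters and use automated node and job resiliency features for FM development.

✌ **Resilient training for foundation models (FMs)**: FMs are typically trained on very large compute clusters with hundreds or thousands of accelerators, so a single accelerator failure can halt the entire training process. SageMaker HyperPod addresses this with automatic node recovery and job auto resume.

✍ **HyperPod compute setup and node resiliency features**: To integrate HyperPod managed compute into an EKS cluster, some prerequisites must be in place before you deploy the HyperPod compute, including the EKS cluster and custom resources. HyperPod uses Helm charts to simplify this deployment. You create the HyperPod compute from a JSON configuration file that contains key settings such as "OnStartDeepHealthChecks" and "NodeRecovery".

✎ **Training job resiliency**: SageMaker HyperPod provides a job auto resume capability that uses the Kubeflow Training Operator for PyTorch to recover and continue training jobs after interruptions or failures. The job waits and restarts once the faulty nodes have been replaced.

✏ **User experience**: HyperPod provides a smooth experience for both admins and scientists, which is critical for managing large clusters and running large-scale training jobs. Admins can use the API and console to create and manage node groups in the EKS cluster and can SSH into cluster nodes. Scientists can use the HyperPod CLI to submit and manage training jobs without using kubectl.

✌ **Architecture overview**: HyperPod on Amazon EKS supports a 1-to-1 mapping between an EKS cluster (the Kubernetes control plane) and a HyperPod compute (a group of worker nodes). The architecture includes three virtual private clouds (VPCs): the Amazon EKS VPC, the HyperPod VPC, and the SageMaker user VPC. Cross-account ENIs let HyperPod compute instances communicate with other AWS services in your account, such as Amazon ECR and Amazon CloudWatch.

✍ **HyperPod-managed resiliency features**: HyperPod provides three capabilities that keep the cluster healthy and let training jobs continue through unexpected interruptions: deep health checks, automatic node recovery, and job auto resume.

✎ **Third-party tool integration**: HyperPod supports third-party tools such as KubeRay that run on the Kubernetes API, so you can bring the job submission and management capabilities you use on other Kubernetes clusters into your HyperPod environment.

✏ **Performance improvements**: HyperPod lets you use the SageMaker distributed training libraries, which can improve training performance by up to 20%.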

We are thrilled to introduce Amazon Elastic Kubernetes Service (Amazon EKS) support in Amazon SageMaker HyperPod, a purpose-built infrastructure engineered with resilience at its core. This capability allows for the seamless addition of SageMaker HyperPod managed compute to EKS clusters, using automated node and job resiliency features for foundation model (FM) development.

FMs are typically trained on large-scale compute clusters with hundreds or thousands of accelerators. Under such circumstances, hardware failures pose a significant challenge, because a single accelerator failure among thousands can halt the entire training process. For example, Meta Llama 3 405B pre-training over 54 days on 16K NVIDIA H100 Tensor Core GPUs experienced 419 unexpected interruptions, with 78% attributed to confirmed or suspected hardware issues, and with 58.7% of these interruptions being GPU-related problems, including NVLink failures and HBM3 memory failures.
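To put those figures in perspective, the following Python snippet does the back-of-the-envelope arithmetic. It uses only the numbers quoted above and treats the interruptions as evenly spread over the run, which is a simplifying assumption.

```python
# Figures quoted in the paragraph above (Meta Llama 3 405B pre-training).
interruptions = 419
days = 54

per_day = interruptions / days                   # average interruptions per day
hours_between = days * 24 / interruptions        # average gap between interruptions
hardware_related = round(interruptions * 0.78)   # 78% attributed to hardware issues

print(f"{per_day:.1f} interruptions/day, one every {hours_between:.1f} hours")
print(f"about {hardware_related} hardware-related interruptions")
```

At roughly one interruption every three hours, manual recovery quickly becomes the bottleneck, which is the case for automated node and job recovery.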

Since its inception, SageMaker HyperPod was designed with a focus on managed resiliency features to mitigate such hardware failures, enabling FM builders such as Thomson Reuters, Perplexity AI, and Hugging Face to scale their FM training and inference on Slurm clusters. With the EKS support in HyperPod, you can now also benefit from the resiliency features on Kubernetes clusters by managing machine learning (ML) workloads using the HyperPod compute and managed Kubernetes control plane on the EKS cluster.

AI startups like Observea and Articul8, and enterprises like Thomson Reuters use this new feature set to manage their ML model development lifecycle:

“Through our use of SageMaker HyperPod, our customers and internal teams no longer have to worry about operating and configuring the Kubernetes control plane, and SageMaker HyperPod provides the network performance and optimized configurations to support complex HPC workloads. With Amazon EKS support in SageMaker HyperPod, we can reduce time we spent for undifferentiated heavy lifting in infrastructure management and reduce operational costs by over 30%.”

– Observea

“As a Kubernetes house, we are now thrilled to welcome the launch of Amazon EKS support for SageMaker HyperPod. This is a game changer for us as it integrates seamlessly with our existing training pipelines and makes it even easier for us to manage and operate our large-scale Kubernetes clusters. In addition, this also helps our end customers as we are now able to package and productize this capability into our GenAI platform, enabling our customers to run their own training and fine-tuning workloads in a more streamlined manner.”

– Articul8 AI

This post is designed for Kubernetes cluster administrators and ML scientists, providing an overview of the key features that SageMaker HyperPod introduces to facilitate large-scale model training on an EKS cluster.

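The post is organized into three sections: an overview of EKS support in SageMaker HyperPod, HyperPod compute setup and node resiliency features, and training job resiliency with the job auto resume functionality.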

Overview of EKS support in SageMaker HyperPod

This section provides a high-level overview of Amazon EKS support in SageMaker HyperPod, introduces three key resiliency features HyperPod compute provides on the EKS cluster, and discusses how SageMaker HyperPod provides smooth user experiences for admins and scientists.

Architecture overview

Amazon EKS support in HyperPod supports a 1-to-1 mapping between an EKS cluster (serving as a Kubernetes control plane) and a HyperPod compute (attached as a group of worker nodes). You have three virtual private clouds (VPCs) in this architecture, hosting different types of resources: the Amazon EKS VPC, the HyperPod VPC, and the SageMaker user VPC.

Cross-account ENIs also bridge communication between HyperPod compute instances and other AWS services on your account, such as Amazon Elastic Container Registry (Amazon ECR) and Amazon CloudWatch.

The following diagram illustrates the high-level architecture of Amazon EKS support in HyperPod.

HyperPod-managed resiliency features

Amazon EKS support in HyperPod provides the following three capabilities to make sure the cluster stays healthy and training jobs continue under unexpected interruptions: deep health checks, automatic node recovery, and job auto resume.

User experiences

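In addition to the aforementioned managed resiliency features, SageMaker HyperPod provides smooth user experiences for both admins and scientists, which are critical for managing a large cluster and running large-scale training jobs on it as part of the Amazon EKS integration. Admins can use the API and console to create and manage node groups in the EKS cluster and can SSH into cluster nodes. Scientists can use the HyperPod CLI to submit and manage training jobs without using kubectl.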

HyperPod compute setup and node resiliency features

In this section, we provide a detailed guide on integrating HyperPod managed compute into your EKS cluster as Kubernetes worker nodes, and discuss how its built-in resiliency features provide infrastructure stability.

Prerequisites

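You need to have the following in place prior to the HyperPod compute deployment: an EKS cluster and the associated custom resources. HyperPod uses Helm charts to simplify this deployment.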

HyperPod compute setup

With the aforementioned resources successfully deployed, you’re now prepared to create the HyperPod compute. The cluster configuration is specified using a JSON file; the following code provides an example:

cat > cluster-config.json << EOL
{
    "ClusterName": "ml-cluster",
    "Orchestrator": {
        "Eks": {
            "ClusterArn": "${EKS_CLUSTER_ARN}"
        }
    },
    "InstanceGroups": [
        {
            "InstanceGroupName": "worker-group-1",
            "InstanceType": "ml.p5.48xlarge",
            "InstanceCount": 4,
            "LifeCycleConfig": {
                "SourceS3Uri": "s3://${BUCKET_NAME}",
                "OnCreate": "on_create.sh"
            },
            "ExecutionRole": "${EXECUTION_ROLE}",
            "ThreadsPerCore": 1,
            "OnStartDeepHealthChecks": [
                "InstanceStress",
                "InstanceConnectivity"
            ]
        }
    ],
    "VpcConfig": {
        "SecurityGroupIds": [
            "$SECURITY_GROUP"
        ],
        "Subnets": [
            "$SUBNET_ID"
        ]
    },
    "NodeRecovery": "Automatic"
}
EOL

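The provided configuration file contains two key highlights: the OnStartDeepHealthChecks setting, which here lists InstanceStress and InstanceConnectivity, and NodeRecovery, which is set to Automatic.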
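Before creating the cluster, it can help to sanity-check the two resiliency-related settings, OnStartDeepHealthChecks and NodeRecovery. The following Python sketch is one way to do that. The check_cluster_config helper, the placeholder ARN, and the second instance group are hypothetical; only the field names come from the example configuration above.

```python
import json


def check_cluster_config(config: dict) -> list[str]:
    """Return human-readable findings; an empty list means nothing was flagged."""
    findings = []
    if not config.get("Orchestrator", {}).get("Eks", {}).get("ClusterArn"):
        findings.append("Orchestrator.Eks.ClusterArn is missing")
    if config.get("NodeRecovery") != "Automatic":
        findings.append("NodeRecovery is not set to Automatic")
    for group in config.get("InstanceGroups", []):
        if not group.get("OnStartDeepHealthChecks"):
            name = group.get("InstanceGroupName", "<unnamed>")
            findings.append(f"{name}: no OnStartDeepHealthChecks configured")
    return findings


# Trimmed-down sample; the ARN and worker-group-2 are made up for illustration.
sample = json.loads("""
{
  "ClusterName": "ml-cluster",
  "Orchestrator": {"Eks": {"ClusterArn": "arn:aws:eks:region:account:cluster/example"}},
  "InstanceGroups": [
    {"InstanceGroupName": "worker-group-1",
     "OnStartDeepHealthChecks": ["InstanceStress", "InstanceConnectivity"]},
    {"InstanceGroupName": "worker-group-2"}
  ],
  "NodeRecovery": "Automatic"
}
""")

print(check_cluster_config(sample))
# → ['worker-group-2: no OnStartDeepHealthChecks configured']
```

Whether an instance group without deep health checks is a problem depends on your setup, so treat the findings as prompts for review rather than errors.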

You can create a HyperPod compute with the following aws command (you need version 2.17.47 or newer):

aws sagemaker create-cluster \
    --cli-input-json file://cluster-config.json
{
    "ClusterArn": "arn:aws:sagemaker:us-east-2:xxxxxxxxxx:cluster/wccy5z4n4m49"
}

To verify the cluster status, you can use the following command:

aws sagemaker list-clusters --output table 

This command displays the cluster details, including the cluster name, status, and creation time:

-----------------------------------------------------------------------------------------------------------------------
|                                                    ListClusters                                                     |
+---------------------------------------------------------------------------------------------------------------------+
||                                                 ClusterSummaries                                                  ||
|+----------------------------------------------------------------+--------------+----------------+------------------+|
||                           ClusterArn                           | ClusterName  | ClusterStatus  |  CreationTime    ||
|+----------------------------------------------------------------+--------------+----------------+------------------+|
||  arn:aws:sagemaker:us-east-2:111111111111:cluster/wccy5z4n4m49 |  ml-cluster  |  Creating      |  1723724079.337  ||
|+----------------------------------------------------------------+--------------+----------------+------------------+|
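The CreationTime value in this output appears to be a Unix epoch timestamp in seconds. Assuming that reading is right, a minimal Python sketch to turn it into a readable UTC time looks like this:

```python
from datetime import datetime, timezone

creation_time = 1723724079.337  # CreationTime value from the table output above
created = datetime.fromtimestamp(creation_time, tz=timezone.utc)

print(created.isoformat(timespec="seconds"))
# → 2024-08-15T12:14:39+00:00
```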

Alternatively, you can verify the cluster status through the SageMaker console. After a brief period, you can observe that the status for all nodes transitions to Running.
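If you prefer to script the wait rather than refresh the console, a generic polling loop is enough. The following Python sketch is an assumption-laden illustration: wait_for_status is a hypothetical helper, fetch_status stands in for whichever call you use to read the status, and "InService" is assumed here as the target status, so substitute the value your tooling actually reports.

```python
import time
from typing import Callable


def wait_for_status(
    fetch_status: Callable[[], str],
    target: str,
    interval_s: float = 30.0,
    max_polls: int = 60,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll fetch_status until it returns target; give up after max_polls tries."""
    for attempt in range(max_polls):
        if fetch_status() == target:
            return True
        if attempt < max_polls - 1:
            sleep(interval_s)  # wait before the next poll
    return False


# Simulated status sequence so the sketch runs without AWS access.
fake_statuses = iter(["Creating", "Creating", "InService"])
reached = wait_for_status(lambda: next(fake_statuses), "InService", sleep=lambda _: None)
print(reached)  # → True
```

Passing the status reader and the sleep function as arguments keeps the loop independent of any particular SDK and easy to exercise without waiting.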

Node resiliency features

To gain further insight into the instances, you can use kubectl get nodes and examine the node labels. The sagemaker.amazonaws.com/node-health-status label reveals the life stage of each node. For instance, nodes with the ml.m5.2xlarge instance type are labeled as Schedulable, indicating that they have successfully passed the regular HyperPod health check. Conversely, nodes with the ml.p5.48xlarge instance type are labeled as Unschedulable, indicating that they have entered the initial deep health checks. The following code shows an example:

# kubectl get nodes --show-labels=true
NAME                         ...  LABELS
hyperpod-i-023cfe933b3b34369 ...  beta.kubernetes.io/instance-type=ml.m5.2xlarge,sagemaker.amazonaws.com/node-health-status=Schedulable,  ...
hyperpod-i-045961b6424401838 ...  beta.kubernetes.io/instance-type=ml.p5.48xlarge,sagemaker.amazonaws.com/node-health-status=Unschedulable, ...
hyperpod-i-074b81fdb5bf52e19 ...  beta.kubernetes.io/instance-type=ml.p5.48xlarge,sagemaker.amazonaws.com/node-health-status=Unschedulable, ...
hyperpod-i-0ae97710b3033cdb1 ...  beta.kubernetes.io/instance-type=ml.m5.2xlarge,sagemaker.amazonaws.com/node-health-status=Schedulable,  ...
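On a large cluster, reading labels row by row gets tedious. As a rough sketch, the following Python code tallies nodes by instance type and health status from rows shaped like the output above. The sample rows keep only the node name and the two labels shown in this post, and the helper names are hypothetical.

```python
from collections import Counter

HEALTH_LABEL = "sagemaker.amazonaws.com/node-health-status"
TYPE_LABEL = "beta.kubernetes.io/instance-type"

# Node name first, comma-separated labels last; elided columns are dropped.
sample_rows = [
    "hyperpod-i-023cfe933b3b34369 beta.kubernetes.io/instance-type=ml.m5.2xlarge,sagemaker.amazonaws.com/node-health-status=Schedulable",
    "hyperpod-i-045961b6424401838 beta.kubernetes.io/instance-type=ml.p5.48xlarge,sagemaker.amazonaws.com/node-health-status=Unschedulable",
    "hyperpod-i-074b81fdb5bf52e19 beta.kubernetes.io/instance-type=ml.p5.48xlarge,sagemaker.amazonaws.com/node-health-status=Unschedulable",
    "hyperpod-i-0ae97710b3033cdb1 beta.kubernetes.io/instance-type=ml.m5.2xlarge,sagemaker.amazonaws.com/node-health-status=Schedulable",
]


def parse_labels(row: str) -> dict[str, str]:
    """Split the last whitespace-separated field into a label dictionary."""
    label_field = row.split()[-1]
    return dict(pair.split("=", 1) for pair in label_field.split(",") if "=" in pair)


def summarize(rows: list[str]) -> Counter:
    """Count nodes per (instance type, health status) pair."""
    counts: Counter = Counter()
    for row in rows:
        labels = parse_labels(row)
        key = (labels.get(TYPE_LABEL, "unknown"), labels.get(HEALTH_LABEL, "unknown"))
        counts[key] += 1
    return counts


for (instance_type, status), count in sorted(summarize(sample_rows).items()):
    print(f"{instance_type:16} {status:14} {count}")
```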

The deep health check logs are stored in the CloudWatch log group at /aws/sagemaker/Clusters/<cluster_name>/<cluster_id>. The log streams are logged at DeepHealthCheckResults/<log_stream_id>. When the deep health checks identify an issue, the output log provides detailed information, including the instance ID that failed the deep health checks and the specific failure reason. For example:

# Example1
{
    "level": "error",
    "ts": "2024-08-15T21:15:22Z",
    "msg": "Encountered FaultyInstance. Replace the Instance. Region: us-east-2,InstanceType: p5.48xlarge. ERROR:Bandwidth has less than threshold: Expected minimumthreshold :80,NCCL Test output Bw: 30"
}
# Example2
{
    "level": "error",
    "ts": "2024-08-15T21:15:22Z",
    "msg": "Encountered Unknownerror. Replace the Instance. Region: us-east-2,InstanceType: p5.48xlarge. ERROR: Crash detected in dcgm test"
}
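If you export these records from CloudWatch, pulling out the failure reason is a small parsing job. The following Python sketch assumes each record is one JSON object per line and that the reason follows the "ERROR:" marker, as in the two examples above. The failure_reason helper is hypothetical, and other message formats may exist.

```python
import json

log_lines = [
    '{"level": "error", "ts": "2024-08-15T21:15:22Z", "msg": "Encountered FaultyInstance. Replace the Instance. Region: us-east-2,InstanceType: p5.48xlarge. ERROR:Bandwidth has less than threshold: Expected minimumthreshold :80,NCCL Test output Bw: 30"}',
    '{"level": "error", "ts": "2024-08-15T21:15:22Z", "msg": "Encountered Unknownerror. Replace the Instance. Region: us-east-2,InstanceType: p5.48xlarge. ERROR: Crash detected in dcgm test"}',
]


def failure_reason(line: str):
    """Return the text after 'ERROR:' for error records, or None for other levels."""
    record = json.loads(line)
    if record.get("level") != "error":
        return None
    _, marker, reason = record["msg"].partition("ERROR:")
    # Fall back to the whole message if the marker is absent.
    return reason.strip() if marker else record["msg"]


for line in log_lines:
    print(failure_reason(line))
```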

You can check the progress of the deep health check with the following values for the sagemaker.amazonaws.com/deep-health-check label on each node:

If a node fails the deep health checks, it will be replaced. Otherwise, it will be marked with the Schedulable label:

sagemaker.amazonaws.com/node-health-status: Schedulable

To replace a specific node in your cluster manually, modify its sagemaker.amazonaws.com/node-health-status label.

For the complete list of resilience-related Kubernetes labels, refer to the AWS documentation.

Even after the initial deep health checks, HyperPod periodically runs regular health checks. To view the health events detected by the HyperPod health monitoring agent, you can check the CloudWatch stream log:

The SagemakerHealthMonitoringAgent log stream for each node contains only the detection events from the health monitoring agent. For example:

# Example1
{
    "level": "info",
    "ts": "2024-09-06T03:15:11Z",
    "msg": "NPD caught ",
    "condition type: ": "KernelDeadlock",
    "with condition details ": {
        "type": "KernelDeadlock",
        "status": "False",
        "transition": "2024-09-06T03:15:11.539932213Z",
        "reason": "KernelHasNoDeadlock",
        "message": "kernel has no deadlock"
    },
    "HealthMonitoringAgentDetectionEvent": "HealthEvent"
}
# Example2
{
    "level": "info",
    "ts": "2024-09-06T03:15:11Z",
    "msg": "NPD caught ",
    "condition type: ": "NvidiaErrorTerminate",
    "with condition details ": {
        "type": "NvidiaErrorTerminate",
        "status": "False",
        "transition": "2024-09-06T03:15:11.539932283Z",
        "reason": "NvidiaNoErrorRequiredTerminate",
        "message": "Nvidia no error required terminate"
    },
    "HealthMonitoringAgentDetectionEvent": "HealthEvent"
}
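Both sample events report a status of "False", which under the usual Kubernetes node-condition convention means the problem is not present. Assuming that convention holds here, the following Python sketch filters a batch of events down to the conditions that are actually active. The active_conditions helper and the third event (with its "ExampleReason" value) are hypothetical and exist only to show a positive match.

```python
import json

DETAILS_KEY = "with condition details "  # the key carries a trailing space in the sample logs

sample_events = [
    {"level": "info", DETAILS_KEY: {"type": "KernelDeadlock", "status": "False",
                                    "reason": "KernelHasNoDeadlock"}},
    {"level": "info", DETAILS_KEY: {"type": "NvidiaErrorTerminate", "status": "False",
                                    "reason": "NvidiaNoErrorRequiredTerminate"}},
    # Hypothetical event, not taken from real logs.
    {"level": "info", DETAILS_KEY: {"type": "KernelDeadlock", "status": "True",
                                    "reason": "ExampleReason"}},
]
sample_lines = [json.dumps(event) for event in sample_events]


def active_conditions(lines: list[str]) -> list[tuple[str, str]]:
    """Return (type, reason) for every event whose condition status is "True"."""
    found = []
    for line in lines:
        details = json.loads(line).get(DETAILS_KEY, {})
        if details.get("status") == "True":
            found.append((details.get("type"), details.get("reason")))
    return found


print(active_conditions(sample_lines))
# → [('KernelDeadlock', 'ExampleReason')]
```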

When the deep health checks or the health monitoring agent identify issues in a node, the node is labeled with sagemaker.amazonaws.com/node-health-status=UnschedulablePendingReplace:NoSchedule to avoid scheduling pods, and then the node is replaced or rebooted.

You can monitor the health status of HyperPod nodes through CloudWatch Container Insights, now with enhanced observability for Amazon EKS. Container Insights helps collect, aggregate, and summarize metrics and logs from containerized applications and microservices, providing detailed insights into performance, health, and status metrics for CPU, GPU, Trainium, EFA, and file system up to the container level. For the complete list of metrics tracked, see Amazon EKS and Kubernetes Container Insights metrics. With the Container Insights integration with SageMaker HyperPod, you can also check the individual node health status and the total number of schedulable and unschedulable nodes, as shown in the following screenshots.

You can find the Container Insights setup guide in the Amazon EKS Support in Amazon SageMaker HyperPod Workshop.

Training job resiliency with the job auto resume functionality

In addition to infrastructure resiliency features, you can use the job auto resume capability with the Kubeflow Training Operator for PyTorch to recover and continue training jobs in the event of interruptions or failures. The job auto resume feature attempts to continue the job, whereas the HyperPod node auto recovery functionality works on resolving node failures (node reboot or replacement as needed) to minimize training downtime. This section demonstrates the job auto resume feature using a PyTorch FSDP example from the awsome-distributed-training repository.

To enable the job auto resume feature, you create a PyTorchJob with the fsdp.yaml manifest, which includes the following annotations and nodeSelector:

apiVersion: "kubeflow.org/v1"
kind: PyTorchJob
metadata:
    name: fsdpjob
    namespace: kubeflow
    # config for HyperPod job auto-resume
    annotations: {
        sagemaker.amazonaws.com/enable-job-auto-resume: "true",
        sagemaker.amazonaws.com/job-max-retry-count: "2"
    }
spec:
  pytorchReplicaSpecs:
    ......
    Worker:
      replicas: 10
      restartPolicy: OnFailure
      template:
        spec:
          nodeSelector:
            sagemaker.amazonaws.com/node-health-status: Schedulable
          ......

With the annotations sagemaker.amazonaws.com/enable-job-auto-resume: "true" and sagemaker.amazonaws.com/job-max-retry-count: "2", SageMaker HyperPod resumes interrupted training jobs up to two times and schedules the resumed jobs onto healthy nodes. These healthy nodes are identified by the node selector label sagemaker.amazonaws.com/node-health-status: Schedulable, ensuring that only nodes that have passed basic health checks and are available for running workloads are used for resumed jobs.
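Because these settings are easy to mistype, a small check before submission can save a failed run. The following Python sketch inspects a manifest that has already been loaded into a dictionary (for example with a YAML parser of your choice). The auto_resume_findings helper is hypothetical; the annotation and label names are the ones shown above.

```python
ANNOTATION_ENABLE = "sagemaker.amazonaws.com/enable-job-auto-resume"
ANNOTATION_RETRIES = "sagemaker.amazonaws.com/job-max-retry-count"
HEALTH_LABEL = "sagemaker.amazonaws.com/node-health-status"


def auto_resume_findings(job: dict) -> list[str]:
    """Return a list of issues; an empty list means the auto-resume settings look complete."""
    findings = []
    annotations = job.get("metadata", {}).get("annotations", {})
    if annotations.get(ANNOTATION_ENABLE) != "true":
        findings.append("enable-job-auto-resume annotation is not 'true'")
    if not str(annotations.get(ANNOTATION_RETRIES, "")).isdigit():
        findings.append("job-max-retry-count annotation is missing or not a number")
    replica_specs = job.get("spec", {}).get("pytorchReplicaSpecs", {})
    for role, replica in replica_specs.items():
        selector = replica.get("template", {}).get("spec", {}).get("nodeSelector", {})
        if selector.get(HEALTH_LABEL) != "Schedulable":
            findings.append(f"{role}: nodeSelector does not require Schedulable nodes")
    return findings


# Dictionary mirroring the parts of fsdp.yaml shown above.
job = {
    "metadata": {
        "name": "fsdpjob",
        "annotations": {ANNOTATION_ENABLE: "true", ANNOTATION_RETRIES: "2"},
    },
    "spec": {
        "pytorchReplicaSpecs": {
            "Worker": {
                "replicas": 10,
                "template": {"spec": {"nodeSelector": {HEALTH_LABEL: "Schedulable"}}},
            }
        }
    },
}

print(auto_resume_findings(job))  # → []
```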

Submit the PyTorchJob using the kubectl command:

kubectl apply -f fsdp.yaml

With the job auto resume feature enabled, if a job fails due to a hardware failure or any transient issues during training, SageMaker HyperPod initiates the node replacement workflow and restarts the job after the faulty nodes are replaced. You can verify the status of job auto resume by describing the PyTorchJob:

kubectl describe pytorchjob -n kubeflow <job-name>

In the event of a hardware failure, the Kubeflow training job restarts as follows:

Start Time: 2024-07-11T05:53:10Z
Enable job auto-resume 27
Events:
Type Reason Age From
Message
---- ------ ---- ----
Normal SuccessfulCreateService 9m45s pytorchjob-controller
Created service: pt-job-1-worker-0
Normal SuccessfulCreateService 9m45s pytorchjob-controller
Created service: pt-job-1-worker-1
Normal SuccessfulCreateService 9m45s pytorchjob-controller
Created service: pt-job-1-master-0
Warning PyTorchJobRestarting 7m59s pytorchjob-controller
PyTorchJob pt-job-1 is restarting because 1 Master replica(s) failed.
Normal SuccessfulCreatePod 7m58s (x2 over 9m45s) pytorchjob-controller
Created pod: pt-job-1-worker-0
Normal SuccessfulCreatePod 7m58s (x2 over 9m45s) pytorchjob-controller
Created pod: pt-job-1-worker-1
Normal SuccessfulCreatePod 7m58s (x2 over 9m45s) pytorchjob-controller
Created pod: pt-job-1-master-0
Warning PyTorchJobRestarting 7m58s pytorchjob-controller
PyTorchJob pt-job-1 is restarting because 1 Worker replica(s) failed

When you submit a training job with the HyperPod CLI, you can also request the job to be auto resumed in the following way:

hyperpod start-job \
    --config-file ./config.yaml \
    --auto-resume true \
    --max-retry 2

Refer to config.yaml for the full configuration. For other CLI options, refer to the documentation in the GitHub repository.

Clean up

To delete your SageMaker HyperPod compute, use either the SageMaker console or the following AWS Command Line Interface (AWS CLI) command:

aws sagemaker delete-cluster --cluster-name <cluster_name>

Cluster deletion can take a few minutes. Deletion is complete when the cluster no longer appears on the SageMaker console.

Conclusion

With the support for Amazon EKS in SageMaker HyperPod, customers who have standardized their FM development workflows on Kubernetes can adopt SageMaker HyperPod and manage their cluster resources using a familiar Kubernetes interface in SageMaker HyperPod. When training an FM, SageMaker HyperPod automatically monitors cluster health, and when an infrastructure fault such as a GPU failure occurs, SageMaker HyperPod automatically remediates the issue and restarts the training process from the last saved checkpoint, without any human intervention. Amazon EKS further enhances this capability by running deep health checks. Whenever a new instance is added to the SageMaker HyperPod compute, it undergoes a deep health check process to identify and replace potentially problematic instances. SageMaker HyperPod then automatically replaces or reboots nodes identified as faulty and resumes training processes in the event of unexpected interruptions, involving node replacement and job resubmission.

For an end-to-end tutorial on cluster management and FM training, visit the Amazon EKS Support in Amazon SageMaker HyperPod Workshop. For more information on infrastructure deployment and additional distributed training test cases, refer to the awsome-distributed-training repository. If you’re interested in deploying HyperPod with step-by-step commands, you can start from the aws-do-hyperpod repository.


About the authors

Keita Watanabe is a Senior GenAI Specialist Solutions Architect in the world-wide specialist organization at Amazon Web Services, where he helps develop machine learning solutions using OSS projects such as Slurm and Kubernetes. His background is in machine learning research and development. Prior to joining AWS, Keita worked in the ecommerce industry as a research scientist developing image retrieval systems for product search. Keita holds a PhD in Science from the University of Tokyo.

Alex Iankoulski is a full-stack software and infrastructure architect who likes to do deep, hands-on work. He is currently a Principal Solutions Architect in the world-wide specialist organization at AWS. In his role, he focuses on helping customers with the orchestration and scaling of ML and AI workloads on container-powered AWS services. He is also the author of the open source do framework and a Docker captain who loves applying container technologies to accelerate the pace of innovation while solving the world’s biggest challenges. During the past 10 years, Alex has worked on democratizing generative AI and ML, combating climate change, and making travel safer, healthcare better, and energy smarter.

Tomonori Shimomura is a Senior Solutions Architect on the Amazon SageMaker team, where he provides in-depth technical consultation to SageMaker customers and suggests product improvements to the product team. Before joining Amazon, he worked on the design and development of embedded software for video game consoles, and now he leverages his in-depth skills in cloud-side technology. In his free time, he enjoys playing video games, reading books, and writing software.

Arun Kumar Lokanatha is a Senior ML Solutions Architect with the Amazon SageMaker team. He specializes in large language model training workloads, helping customers build LLM workloads using SageMaker HyperPod, SageMaker training jobs, and SageMaker distributed training. Outside of work, he enjoys running, hiking, and cooking.

Manoj Ravi is a Senior Product Manager on the Amazon SageMaker team. He is passionate about building next-gen AI products and works on applications and tools to make foundation model development and deployment effortless for customers. He holds an MBA from the Haas School of Business and a master’s degree from Carnegie Mellon University. In his spare time, Manoj enjoys playing tennis and pursuing landscape photography.
