AWS Machine Learning Blog, July 26, 2024
Node problem detection and recovery for AWS Neuron nodes within Amazon EKS clusters

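This post introduces the AWS Neuron node problem detector and recovery DaemonSet, a component that automatically detects and reports various node-level problems on AWS Trainium and AWS Inferentia in Amazon EKS, improving the reliability of ML training and reducing the waste caused by hardware failure.

✌ **Node problem detection**: This component detects Neuron device failures by continuously monitoring the kernel message (kmsg) logs on the worker nodes. If it detects an error message related to a Neuron device, it changes NodeCondition to NeuronHasError on the Kubernetes API server.

✍ **Node recovery agent**: This component periodically checks the Prometheus metrics exposed by the node problem detector. When it finds a node condition indicating a problem with a Neuron device, it takes automated action. First, it marks the affected instance in the relevant Auto Scaling group as unhealthy, which invokes the Auto Scaling group to stop the instance and launch a replacement. In addition, the node recovery agent publishes Amazon CloudWatch metrics so that users can monitor and alert on these events.

✊ **Testing the node problem detection and recovery solution**: After the plugin is installed, you can see the Neuron conditions by running kubectl describe node. A device error is simulated by injecting error logs into the instance.

✋ **Prerequisites**: Before you start, make sure you have installed the following tools on your machine: the latest version of the AWS Command Line Interface (AWS CLI), eksctl, kubectl, Terraform, and the Session Manager plugin.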

Implementing hardware resiliency in your training infrastructure is crucial to mitigating risks and enabling uninterrupted model training. By implementing features such as proactive health monitoring and automated recovery mechanisms, organizations can create a fault-tolerant environment capable of handling hardware failures or other issues without compromising the integrity of the training process.

In this post, we introduce the AWS Neuron node problem detector and recovery DaemonSet for AWS Trainium and AWS Inferentia on Amazon Elastic Kubernetes Service (Amazon EKS). This component quickly detects the rare cases in which a Neuron device fails by tailing monitoring logs. It marks worker nodes that have a defective Neuron device as unhealthy and promptly replaces them with new worker nodes. By speeding up issue detection and remediation, it increases the reliability of your ML training and reduces the time and cost wasted on hardware failure.

This solution is applicable if you’re using managed nodes or self-managed node groups (which use Amazon EC2 Auto Scaling groups) on Amazon EKS. At the time of writing this post, automatic recovery of nodes provisioned by Karpenter is not yet supported.

Solution overview

The solution is based on the node problem detector and recovery DaemonSet, a powerful tool designed to automatically detect and report various node-level problems in a Kubernetes cluster.

The node problem detector component will continuously monitor the kernel message (kmsg) logs on the worker nodes. If it detects error messages specifically related to the Neuron device (which is the Trainium or AWS Inferentia chip), it will change NodeCondition to NeuronHasError on the Kubernetes API server.
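The detection step can be pictured as a pattern match over kernel log lines. The following is a minimal sketch, not the detector's actual implementation: the helper name `neuron_reason_from_kmsg` is hypothetical, and the `NEURON_HW_ERR=<CODE>` token and the `NeuronHasError_<CODE>` / `NeuronHasNoError` reason strings are taken from the test output shown later in this post.

```shell
# Hypothetical helper: map one kernel log line to a node condition reason.
# Assumes errors appear as a NEURON_HW_ERR=<CODE> token in the line.
neuron_reason_from_kmsg() {
  # Extract the error code (uppercase letters and underscores), if present.
  code=$(printf '%s\n' "$1" | sed -n 's/.*NEURON_HW_ERR=\([A-Z_][A-Z_]*\).*/\1/p')
  if [ -n "$code" ]; then
    printf 'NeuronHasError_%s\n' "$code"
  else
    printf 'NeuronHasNoError\n'
  fi
}

# Usage:
#   neuron_reason_from_kmsg "test NEURON_HW_ERR=DMA_ERROR test"
```

A line without the token maps to the healthy reason, which matches the default NeuronHealth condition a node reports before any error is injected.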

The node recovery agent is a separate component that periodically checks the Prometheus metrics exposed by the node problem detector. When it finds a node condition indicating an issue with the Neuron device, it will take automated actions. First, it will mark the affected instance in the relevant Auto Scaling group as unhealthy, which will invoke the Auto Scaling group to stop the instance and launch a replacement. Additionally, the node recovery agent will publish Amazon CloudWatch metrics for users to monitor and alert on these events.
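The two recovery actions correspond to standard AWS CLI calls, which can be useful for understanding or manually reproducing what the agent does. The sketch below is an assumption-laden illustration rather than the agent's code: the function name is hypothetical, the `NeuronHealthCheck` namespace comes from the IAM policy used in this post, and the real agent may attach metric dimensions that are omitted here.

```shell
# Hypothetical helper mirroring the two recovery actions described above.
# Needs credentials allowing autoscaling:SetInstanceHealth and
# cloudwatch:PutMetricData (namespace NeuronHealthCheck).
mark_unhealthy_and_report() {
  instance_id="$1"   # EC2 instance ID of the affected worker node
  metric_name="$2"   # for example NeuronHasError_DMA_ERROR
  region="$3"

  # Tell the Auto Scaling group that the instance should be replaced.
  aws autoscaling set-instance-health \
    --instance-id "$instance_id" \
    --health-status Unhealthy \
    --region "$region" || return 1

  # Publish a data point so the event can be monitored and alarmed on.
  aws cloudwatch put-metric-data \
    --namespace NeuronHealthCheck \
    --metric-name "$metric_name" \
    --value 1 \
    --region "$region"
}

# Usage:
#   mark_unhealthy_and_report i-0123456789abcdef0 NeuronHasError_DMA_ERROR us-east-2
```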

The following diagram illustrates the solution architecture and workflow.

In the following walkthrough, we create an EKS cluster with Trn1 worker nodes, deploy the Neuron plugin for the node problem detector, and inject an error message into the node. We then observe the failing node being stopped and replaced with a new one, and find a metric in CloudWatch indicating the error.

Prerequisites

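Before you start, make sure you have installed the following tools on your machine:

    The latest version of the AWS Command Line Interface (AWS CLI)
    eksctl
    kubectl
    Terraform
    The Session Manager plugin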
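A quick way to confirm the tooling is in place is to check that each command is on your PATH. This is a small sketch; the function name `missing_tools` is hypothetical, and the binary names (`aws`, `eksctl`, `kubectl`, `terraform`, `session-manager-plugin`) are the usual executable names for the tools this post relies on.

```shell
# Print the name of every listed command that is not found on PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Usage: prints nothing when everything is installed.
#   missing_tools aws eksctl kubectl terraform session-manager-plugin
```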

Deploy the node problem detection and recovery plugin

Complete the following steps to configure the node problem detection and recovery plugin:

    Create an EKS cluster using the Data on EKS Terraform module:

        git clone https://github.com/awslabs/data-on-eks.git
        export TF_VAR_region=us-east-2
        export TF_VAR_trn1_32xl_desired_size=4
        export TF_VAR_trn1_32xl_min_size=4
        cd data-on-eks/ai-ml/trainium-inferentia/ && chmod +x install.sh
        ./install.sh
        aws eks --region us-east-2 describe-cluster --name trainium-inferentia

        # Creates k8s config file to authenticate with EKS
        aws eks --region us-east-2 update-kubeconfig --name trainium-inferentia
        kubectl get nodes
        NAME                                           STATUS   ROLES    AGE   VERSION
        ip-100-64-161-213.us-east-2.compute.internal   Ready    <none>   31d   v1.29.0-eks-5e0fdde
        ip-100-64-227-31.us-east-2.compute.internal    Ready    <none>   31d   v1.29.0-eks-5e0fdde
        ip-100-64-70-179.us-east-2.compute.internal    Ready    <none>   31d   v1.29.0-eks-5e0fdde

    Install the required AWS Identity and Access Management (IAM) role for the service account and the node problem detector plugin. Create a policy as shown below. Update the Resource key value to match your node group ARN that contains the Trainium and AWS Inferentia nodes, and update the ec2:ResourceTag/aws:autoscaling:groupName key value to match the Auto Scaling group name.

You can get these values from the Amazon EKS console. Choose Clusters in the navigation pane, open the trainium-inferentia cluster, choose Node groups, and locate your node group.

    # Create npd-policy-trimmed.json
    cat << EOF > npd-policy-trimmed.json
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Action": [
                    "autoscaling:SetInstanceHealth",
                    "autoscaling:DescribeAutoScalingInstances"
                ],
                "Effect": "Allow",
                "Resource": <arn of the Auto Scaling group corresponding to the Neuron nodes for the cluster>
            },
            {
                "Action": [
                    "ec2:DescribeInstances"
                ],
                "Effect": "Allow",
                "Resource": "*",
                "Condition": {
                    "ForAllValues:StringEquals": {
                        "ec2:ResourceTag/aws:autoscaling:groupName": <name of the Auto Scaling group corresponding to the Neuron nodes for the cluster>
                    }
                }
            },
            {
                "Action": [
                    "cloudwatch:PutMetricData"
                ],
                "Effect": "Allow",
                "Resource": "*",
                "Condition": {
                    "StringEquals": {
                        "cloudwatch:Namespace": "NeuronHealthCheck"
                    }
                }
            }
        ]
    }
    EOF

This component will be installed as a DaemonSet in your EKS cluster.

    # To create the policy, aws cli can be used as shown below where npd-policy-trimmed.json is the policy json constructed from the template above.
    aws iam create-policy \
        --policy-name NeuronProblemDetectorPolicy \
        --policy-document file://npd-policy-trimmed.json

    # Note the ARN
    CLUSTER_NAME=trainium-inferentia # Your EKS Cluster Name
    AWS_REGION=us-east-2
    ACCOUNT_ID=$(aws sts get-caller-identity --query 'Account' --output text)
    POLICY_ARN=arn:aws:iam::$ACCOUNT_ID:policy/NeuronProblemDetectorPolicy

    eksctl create addon --cluster $CLUSTER_NAME --name eks-pod-identity-agent \
        --region $AWS_REGION

    eksctl create podidentityassociation \
        --cluster $CLUSTER_NAME \
        --namespace neuron-healthcheck-system \
        --service-account-name node-problem-detector \
        --permission-policy-arns="$POLICY_ARN" \
        --region $AWS_REGION

    # Install the Neuron NPD and recovery plugin
    kubectl create ns neuron-healthcheck-system
    curl https://raw.githubusercontent.com/aws-neuron/aws-neuron-sdk/215b421ac448d85f89be056e27e29842a6b03c9c/src/k8/neuron-problem-detector/k8s-neuron-problem-detector-and-recovery.yml | kubectl apply -f -
    curl https://raw.githubusercontent.com/aws-neuron/aws-neuron-sdk/215b421ac448d85f89be056e27e29842a6b03c9c/src/k8/neuron-problem-detector/k8s-neuron-problem-detector-and-recovery-rbac.yml | kubectl apply -f -
    curl https://raw.githubusercontent.com/aws-neuron/aws-neuron-sdk/215b421ac448d85f89be056e27e29842a6b03c9c/src/k8/neuron-problem-detector/k8s-neuron-problem-detector-and-recovery-config.yml | kubectl apply -f -

    # Expected result (with 4 Neuron nodes in cluster):
    kubectl get pod -n neuron-healthcheck-system
    NAME                          READY   STATUS    RESTARTS   AGE
    node-problem-detector-49p6w   2/2     Running   0          31s
    node-problem-detector-j7wct   2/2     Running   0          31s
    node-problem-detector-qr6jm   2/2     Running   0          31s
    node-problem-detector-vwq8x   2/2     Running   0          31s

The container images in the Kubernetes manifests are stored in public repositories such as registry.k8s.io and public.ecr.aws. For production environments, we recommend limiting external dependencies of this kind by hosting the container images in a private registry and syncing them from the public repositories. For detailed implementation, refer to the blog post: Announcing pull through cache for registry.k8s.io in Amazon Elastic Container Registry.

By default, the node problem detector will not take any action on a failed node. If you would like the agent to terminate the EC2 instance automatically, update the DaemonSet as follows:

    kubectl edit -n neuron-healthcheck-system ds/node-problem-detector
    ...
       env:
       - name: ENABLE_RECOVERY
         value: "true"
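If you prefer a non-interactive change (for example, in a provisioning script), `kubectl set env` can apply the same variable. This is a sketch under assumptions: the wrapper function is hypothetical, and because the post does not say which container of the DaemonSet reads ENABLE_RECOVERY, it defaults to all containers (kubectl's own default) and lets you pass a container name to narrow it.

```shell
# Hypothetical wrapper: set ENABLE_RECOVERY=true on the DaemonSet.
# Optional first argument: container name (defaults to all containers).
enable_neuron_recovery() {
  container="${1:-*}"
  kubectl set env daemonset/node-problem-detector \
    --namespace neuron-healthcheck-system \
    --containers="$container" \
    ENABLE_RECOVERY=true
}

# Usage:
#   enable_neuron_recovery
```

Changing the pod template this way triggers a rolling update of the DaemonSet pods, just as saving the interactive edit does.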

Test the node problem detector and recovery solution

After the plugin is installed, you can see Neuron conditions show up by running kubectl describe node. We simulate a device error by injecting error logs in the instance:

    # Verify node conditions on any node. Neuron conditions should show up.
    kubectl describe node ip-100-64-58-151.us-east-2.compute.internal | grep Conditions: -A7
    Conditions:
      Type             Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
      ----             ------  -----------------                 ------------------                ------                       -------
      NeuronHealth     False   Fri, 29 Mar 2024 15:52:08 +0800   Thu, 28 Mar 2024 13:59:19 +0800   NeuronHasNoError             Neuron has no error
      MemoryPressure   False   Fri, 29 Mar 2024 15:51:03 +0800   Thu, 28 Mar 2024 13:58:39 +0800   KubeletHasSufficientMemory   kubelet has sufficient memory available
      DiskPressure     False   Fri, 29 Mar 2024 15:51:03 +0800   Thu, 28 Mar 2024 13:58:39 +0800   KubeletHasNoDiskPressure     kubelet has no disk pressure
      PIDPressure      False   Fri, 29 Mar 2024 15:51:03 +0800   Thu, 28 Mar 2024 13:58:39 +0800   KubeletHasSufficientPID      kubelet has sufficient PID available
      Ready            True    Fri, 29 Mar 2024 15:51:03 +0800   Thu, 28 Mar 2024 13:59:08 +0800   KubeletReady                 kubelet is posting ready status

    # To get provider id
    kubectl describe node ip-100-64-58-151.us-east-2.compute.internal | grep -i provider | sed -E 's/.*\/([^\/]+)$/\1/'
    i-0381404aa69eae3f6

    # SSH into the worker node and simulate the hardware error on the neuron device
    aws ssm start-session --target i-0381404aa69eae3f6 --region us-east-2
    Starting session with SessionId: lindarr-0069460593240662a
    sh-4.2$
    sh-4.2$ sudo bash
    [root@ip-192-168-93-211 bin]# echo "test NEURON_HW_ERR=DMA_ERROR test" >> /dev/kmsg
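The instance ID is the last path segment of the node's provider ID, which on AWS has the form `aws:///<availability-zone>/<instance-id>`. As an alternative to the grep and sed pipeline, the following sketch does the same extraction with shell parameter expansion; the function name is hypothetical.

```shell
# Hypothetical helper: return the EC2 instance ID from a Kubernetes
# provider ID such as aws:///us-east-2a/i-0381404aa69eae3f6.
instance_id_from_provider_id() {
  # Strip everything up to and including the last slash.
  printf '%s\n' "${1##*/}"
}

# Usage (NODE is a placeholder for your node name):
#   provider_id=$(kubectl get node "$NODE" -o jsonpath='{.spec.providerID}')
#   instance_id_from_provider_id "$provider_id"
```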

Around 2 minutes later, you can see that the error has been identified:

    kubectl describe node ip-100-64-58-151.us-east-2.compute.internal | grep 'Conditions:' -A7
    Conditions:
      Type             Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
      ----             ------  -----------------                 ------------------                ------                       -------
      NeuronHealth     True    Fri, 29 Mar 2024 17:42:43 +0800   Fri, 29 Mar 2024 17:42:38 +0800   NeuronHasError_DMA_ERROR     test NEURON_HW_ERR=DMA_ERROR test
    ...
    Events:
      Type     Reason                    Age   From            Message
      ----     ------                    ----  ----            -------
      Warning  NeuronHasError_DMA_ERROR  36s   kernel-monitor  Node condition NeuronHealth is now: True, reason: NeuronHasError_DMA_ERROR, message: "test NEURON_HW_ERR=DMA_ERROR test"
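Rather than rerunning the describe command by hand, you can poll the NeuronHealth condition until it flips to True. The following is a minimal sketch: the function name, the default of 30 attempts, and the 10-second interval are arbitrary choices for illustration, while the condition type NeuronHealth is taken from the output above.

```shell
# Hypothetical helper: poll a node until its NeuronHealth condition is True.
# Arguments: node name, number of attempts (default 30), seconds between polls (default 10).
wait_for_neuron_error() {
  node="$1"
  attempts="${2:-30}"
  interval="${3:-10}"
  i=0
  while [ "$i" -lt "$attempts" ]; do
    # Read only the status field of the NeuronHealth condition.
    status=$(kubectl get node "$node" \
      -o jsonpath='{.status.conditions[?(@.type=="NeuronHealth")].status}')
    if [ "$status" = "True" ]; then
      echo "NeuronHealth is True on $node"
      return 0
    fi
    i=$((i + 1))
    sleep "$interval"
  done
  echo "timed out waiting for NeuronHealth=True on $node" >&2
  return 1
}

# Usage:
#   wait_for_neuron_error ip-100-64-58-151.us-east-2.compute.internal
```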

Now that the error has been detected by the node problem detector, and the recovery agent has automatically taken the action to set the node as unhealthy, Amazon EKS will cordon the node and evict the pods on the node:

    # Verify the Node scheduling is disabled.
    kubectl get node
    NAME                                           STATUS                        ROLES    AGE    VERSION
    ip-100-64-1-48.us-east-2.compute.internal      Ready                         <none>   156m   v1.29.0-eks-5e0fdde
    ip-100-64-103-26.us-east-2.compute.internal    Ready                         <none>   94s    v1.29.0-eks-5e0fdde
    ip-100-64-239-245.us-east-2.compute.internal   Ready                         <none>   154m   v1.29.0-eks-5e0fdde
    ip-100-64-52-40.us-east-2.compute.internal     Ready                         <none>   156m   v1.29.0-eks-5e0fdde
    ip-100-64-58-151.us-east-2.compute.internal    NotReady,SchedulingDisabled   <none>   27h    v1.29.0-eks-5e0fdde

You can open the CloudWatch console and verify the metric for NeuronHealthCheck. You can see the CloudWatch NeuronHasError_DMA_ERROR metric has the value 1.

After replacement, you can see a new worker node has been created:

    # The new node with age 28s is the new node
    kubectl get node
    NAME                                           STATUS   ROLES    AGE   VERSION
    ip-192-168-65-77.us-east-2.compute.internal    Ready    <none>   28s   v1.29.0-eks-5e0fdde
    ip-192-168-81-176.us-east-2.compute.internal   Ready    <none>   9d    v1.29.5-eks-5e0fdde
    ip-192-168-91-218.us-east-2.compute.internal   Ready    <none>   9d    v1.29.0-eks-5e0fdde
    ip-192-168-94-83.us-east-2.compute.internal    Ready    <none>   9d    v1.29.0-eks-5e0fdde

Let’s look at a real-world scenario, in which you’re running a distributed training job, using an MPI operator as outlined in Llama-2 on Trainium, and there is an irrecoverable Neuron error in one of the nodes. Without the plugin, the training job becomes stuck, wasting time and compute cost. With the plugin deployed, the node problem detector proactively removes the problem node from the cluster. Because the training script saves checkpoints periodically, training resumes from the most recent checkpoint.

The following screenshot shows example logs from a distributed training job.

The training has been started. (You can ignore loss=nan for now; it’s a known issue and will be removed. For immediate use, refer to the reduced_train_loss metric.)

The following screenshot shows the checkpoint created at step 77.

Training stopped at step 86 after one of the nodes had a problem. The error was injected manually for testing.

After the faulty node was detected and replaced by the Neuron plugin for node problem detection and recovery, the training process resumed at step 77, which was the last checkpoint.

Although Auto Scaling groups will stop unhealthy nodes, they may encounter issues preventing the launch of replacement nodes. In such cases, training jobs will stall and require manual intervention. However, the stopped node will not incur further charges on the associated EC2 instance.

If you want to take custom actions in addition to stopping instances, you can create CloudWatch alarms that watch the metrics NeuronHasError_DMA_ERROR, NeuronHasError_HANG_ON_COLLECTIVES, NeuronHasError_HBM_UNCORRECTABLE_ERROR, NeuronHasError_SRAM_UNCORRECTABLE_ERROR, and NeuronHasError_NC_UNCORRECTABLE_ERROR, and use a CloudWatch Metrics Insights query such as SELECT AVG(NeuronHasError_DMA_ERROR) FROM NeuronHealthCheck to aggregate these values when evaluating the alarms. The following screenshots show an example.
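An alarm of this kind can also be created from the AWS CLI. The sketch below is one possible shape, not a prescribed configuration: the function names, the alarm naming scheme, the 60-second period, the threshold of 0, and the SNS topic used as the alarm action are all assumptions for illustration. The query string follows the form quoted above.

```shell
# Build the Metrics Insights query for one Neuron error metric.
neuron_error_query() {
  printf 'SELECT AVG(%s) FROM NeuronHealthCheck\n' "$1"
}

# Hypothetical helper: create an alarm that fires when the query result
# rises above 0. Arguments: metric name, ARN of an SNS topic to notify.
create_neuron_alarm() {
  metric="$1"
  topic_arn="$2"
  query=$(neuron_error_query "$metric")
  aws cloudwatch put-metric-alarm \
    --alarm-name "neuron-$metric" \
    --metrics "[{\"Id\":\"q1\",\"Expression\":\"$query\",\"Period\":60,\"ReturnData\":true}]" \
    --comparison-operator GreaterThanThreshold \
    --threshold 0 \
    --evaluation-periods 1 \
    --alarm-actions "$topic_arn"
}

# Usage (TOPIC_ARN is a placeholder):
#   for m in NeuronHasError_DMA_ERROR NeuronHasError_HANG_ON_COLLECTIVES \
#            NeuronHasError_HBM_UNCORRECTABLE_ERROR \
#            NeuronHasError_SRAM_UNCORRECTABLE_ERROR \
#            NeuronHasError_NC_UNCORRECTABLE_ERROR; do
#     create_neuron_alarm "$m" "$TOPIC_ARN"
#   done
```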

Clean up

To clean up all the provisioned resources for this post, run the cleanup script:

    # neuron-problem-detector-role-$CLUSTER_NAME
    eksctl delete podidentityassociation \
        --service-account-name node-problem-detector \
        --namespace neuron-healthcheck-system \
        --cluster $CLUSTER_NAME \
        --region $AWS_REGION

    # delete the EKS Cluster
    cd data-on-eks/ai-ml/trainium-inferentia
    ./cleanup.sh

Conclusion

In this post, we showed how the Neuron problem detector and recovery DaemonSet for Amazon EKS works for EC2 instances powered by Trainium and AWS Inferentia. If you’re running Neuron-based EC2 instances and using managed nodes or self-managed node groups, you can deploy the detector and recovery DaemonSet in your EKS cluster to improve the reliability and fault tolerance of your machine learning training workloads in the event of node failure.


About the authors

Harish Rao is a senior solutions architect at AWS, specializing in large-scale distributed AI training and inference. He empowers customers to harness the power of AI to drive innovation and solve complex challenges. Outside of work, Harish embraces an active lifestyle, enjoying the tranquility of hiking, the intensity of racquetball, and the mental clarity of mindfulness practices.

Ziwen Ning is a software development engineer at AWS. He currently focuses on enhancing the AI/ML experience through the integration of AWS Neuron with containerized environments and Kubernetes. In his free time, he enjoys challenging himself with badminton, swimming and other various sports, and immersing himself in music.

Geeta Gharpure is a senior software developer on the Annapurna ML engineering team. She is focused on running large scale AI/ML workloads on Kubernetes. She lives in Sunnyvale, CA and enjoys listening to Audible in her free time.

Darren Lin is a Cloud Native Specialist Solutions Architect at AWS who focuses on domains such as Linux, Kubernetes, Container, Observability, and Open Source Technologies. In his spare time, he likes to work out and have fun with his family.
