AWS Machine Learning Blog, September 19, 2024
Accelerate pre-training of Mistral’s Mathstral model with highly resilient clusters on Amazon SageMaker HyperPod

In recent years, foundation model (FM) sizes have been increasing. It is important to consider the massive amount of compute often required to train these models. The compute clusters used in these scenarios are composed of thousands of AI accelerators such as GPUs or AWS Trainium and AWS Inferentia, custom machine learning (ML) chips designed by Amazon Web Services (AWS) to accelerate deep learning workloads in the cloud.

When using compute clusters of massive size, a single failure can often throw a training job off course and may require multiple hours of discovery and remediation from customers. According to a report on OPT-175B training, about 178,000 GPU hours were wasted due to various training failures, amounting to 16 percent of the total training time. Similarly, a study by Meta AI and Carnegie Mellon University found that, in the worst cases, 43 percent of compute time was wasted because of overheads due to hardware failures. This can adversely impact a customer’s ability to keep up with the pace of innovation in generative AI and can also increase the time-to-market for their models.

Amazon SageMaker HyperPod is a service that is purpose-built to accelerate FM training, removing the undifferentiated heavy-lifting involved in managing and optimizing a large training compute cluster. With SageMaker HyperPod, you can train FMs for weeks to months without disruption. To make FM training more resilient to hardware failures, SageMaker HyperPod continually monitors cluster health, repairs and replaces faulty nodes without disrupting training, and uses customer-defined checkpoints to automatically resume training from the last point of failure.

Why SageMaker HyperPod?

SageMaker HyperPod offers several benefits that make it a good choice for FM training:

- **Spare instances at no additional cost:** SageMaker HyperPod maintains a pool of dedicated spare nodes, at no additional cost to you, so faulty nodes can be swapped out automatically without disrupting or delaying large training jobs.
- **Cluster placement groups:** Each instance group is launched in a cluster placement group within the same network spine, to get the best inter-node latency and maximize bandwidth between nodes. This is ideal for tightly coupled workloads such as distributed training, where low-latency communication is critical for synchronizing gradient updates.
- **Pre-configured deep learning AMI:** The SageMaker HyperPod agent runs a SageMaker HyperPod DL AMI built on top of the AWS Deep Learning Base GPU AMI (Ubuntu 20.04), with additional packages that support open source tools such as Slurm, plus the SageMaker HyperPod cluster software that enables health checks and auto-resume.
- **Reusable scaling scripts:** A set of scalable and reusable scripts simplifies infrastructure setup and launching multiple training runs, making it straightforward to adapt to different training scenarios or run several jobs in parallel.
- **Auto-resume:** When a node fails, SageMaker HyperPod automatically replaces it with a healthy instance from the spare pool and resumes the job from the last saved checkpoint, with minimal disruption to training.
- **Real-time performance dashboards:** SageMaker HyperPod integrates with dashboards for monitoring node health, GPU utilization, network traffic, and other key metrics, which can be set up in just a few clicks.

In this post, we present to you an in-depth guide to starting a continual pre-training job using PyTorch Fully Sharded Data Parallel (FSDP) for Mistral AI’s Mathstral model with SageMaker HyperPod. We review components of the Slurm orchestrated SageMaker HyperPod cluster setup, primarily focusing on the resiliency and feature set of SageMaker HyperPod, including automatic fault detection and integration with open source tools such as Amazon Managed Service for Prometheus and Amazon Managed Grafana.

Overview of SageMaker HyperPod resiliency

Some of the health check metrics used by SageMaker HyperPod include:

SageMaker HyperPod continuously performs health checks on crucial components, including GPUs, AWS Trainium cores, and EFA networking devices. This proactive approach allows the HyperPod health check agent to identify hardware failures or potential performance degradation. When hardware failures are detected, SageMaker HyperPod identifies the faulty instances and can use its auto-resume functionality to initiate a replacement process without manual intervention. This feature automatically detects hardware failures, seamlessly replaces faulty instances, and resumes jobs from the last saved checkpoint. In addition, SageMaker HyperPod gives you the ability to manually replace a node in case a node is stuck with an issue that is not being resolved by the auto-resume functionality: you can manually set the node’s state to fail, and SageMaker HyperPod will replace it with a healthy instance. For a more in-depth dive into resiliency with SageMaker HyperPod, refer to the Resiliency section of this post.
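If you want to check node health programmatically in addition to relying on the automated checks, the cluster APIs expose per-node status. The following is a minimal boto3 sketch, assuming a recent boto3 version that includes the SageMaker HyperPod cluster APIs and credentials configured for the cluster’s Region; the cluster name ml-cluster matches the cluster created later in this post.

import boto3

# Minimal sketch: list the nodes of a SageMaker HyperPod cluster and print each
# instance's status. Assumes a recent boto3 with the HyperPod cluster APIs and
# credentials/Region already configured.
sagemaker = boto3.client("sagemaker")

response = sagemaker.list_cluster_nodes(ClusterName="ml-cluster")
for node in response["ClusterNodeSummaries"]:
    print(
        node["InstanceGroupName"],
        node["InstanceId"],
        node["InstanceStatus"]["Status"],  # for example Running, Pending, or Failure
    )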

Overview of SageMaker HyperPod observability

To achieve comprehensive observability into your SageMaker HyperPod cluster resources and software components, you can integrate your cluster with Amazon Managed Service for Prometheus and Amazon Managed Grafana. The integration with Amazon Managed Service for Prometheus enables the export of metrics related to your SageMaker HyperPod cluster resources, providing insights into their performance, utilization, and health. The integration with Amazon Managed Grafana enables the visualization of these metrics through various Grafana dashboards that offer an intuitive interface for monitoring and analyzing the cluster’s behavior. By using these services, you gain a centralized and unified view of your SageMaker HyperPod cluster, facilitating proactive monitoring, troubleshooting, and optimization of your distributed training workloads. The Observability section of this post goes into more detail on which metrics are exported and what the dashboards look like in Amazon Managed Grafana.

This post is primarily focused on Amazon Managed Service for Prometheus and Amazon Managed Grafana for observability. To explore more observability integrations with SageMaker HyperPod like Nvidia Nsight, refer to the validation and observability folder of the awsome-distributed-training GitHub repo.

These resiliency and observability features collectively contribute to a more reliable and efficient training environment, minimize downtime, and optimize resource usage. By directly integrating with Amazon Managed Service for Prometheus and Amazon Managed Grafana and abstracting the management of hardware failures and job resumption, SageMaker HyperPod allows data scientists and ML engineers to focus on model development rather than infrastructure management.

Mathstral model from Mistral AI

Mathstral is a model designed for math reasoning and scientific discovery, is based on the original Mistral 7B model, and features a 32k context window. The release of Mathstral aligns with Mistral AI’s broader effort to support academic and scientific research, particularly through their collaboration with Project Numina. As a 7B model, Mathstral sets a new standard in the performance and latency space for math and reasoning generation compared to similar models. Mathstral can achieve significantly better results with more inference-time computation.
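If you want to look at the base model and tokenizer before starting continual pre-training, the following is a minimal sketch using the Hugging Face transformers library. The model ID is taken from the --tokenizer argument used later in this post; downloading the full FP32 checkpoint requires roughly 26 GB of disk space, and a GPU is assumed for inference.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Minimal sketch: load Mathstral 7B from the Hugging Face Hub and run a short math prompt.
# The model ID matches the --tokenizer value used in the training arguments later in this post.
model_id = "mistralai/mathstral-7B-v0.1"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16).to("cuda")

prompt = "What is the derivative of x^3 + 2x?"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))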

Overview of PyTorch FSDP

In distributed data parallel (DDP) training, each process or worker owns a replica of the model and processes a batch of data. Then, it uses all-reduce to sum up gradients over different workers. In DDP, the model weights and optimizer states are replicated across all workers. DDP maintains a full copy of the model on each GPU and requires enough memory on each GPU to store the entire model. For training larger FMs, an approach like FSDP is recommended, because these FMs require more than a single GPU. FSDP is a type of data parallelism that shards model parameters, optimizer states, and gradients across DDP ranks. This approach reduces the memory requirements on individual GPUs and distributes the memory load across GPUs. With FSDP’s improved efficiency, researchers and developers can use fewer GPUs, thereby minimizing operational costs and achieving faster model convergence.

When training with FSDP, the GPU memory footprint is smaller than when training with DDP across all workers. This makes the training of some very large models feasible by allowing them to be loaded into memory with a lower memory footprint. However, this comes at the cost of increased communication volume. For more information on FSDP, refer to PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel.
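To make the sharding concrete, the following is a simplified sketch of how a model can be wrapped with FSDP under a torchrun launch. The train.py script in the awsome-distributed-training repo implements a more complete version of this pattern, so treat the code below as an illustration of the technique rather than the repo’s exact implementation.

import functools
import os

import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, ShardingStrategy
from torch.distributed.fsdp.wrap import size_based_auto_wrap_policy

# Illustrative FSDP sketch (not the repo's exact train.py).
# Launch with torchrun so RANK/WORLD_SIZE/LOCAL_RANK are set, for example:
#   torchrun --nproc_per_node=8 --nnodes=4 fsdp_sketch.py
def main():
    dist.init_process_group("nccl")
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

    # Stand-in for the real transformer model.
    model = torch.nn.Sequential(
        torch.nn.Linear(4096, 4096), torch.nn.GELU(), torch.nn.Linear(4096, 4096)
    ).cuda()

    # FULL_SHARD corresponds to --sharding_strategy="full" in the training arguments:
    # parameters, gradients, and optimizer states are all sharded across ranks.
    model = FSDP(
        model,
        sharding_strategy=ShardingStrategy.FULL_SHARD,
        auto_wrap_policy=functools.partial(size_based_auto_wrap_policy, min_num_params=1_000_000),
        limit_all_gathers=True,  # mirrors --limit_all_gathers=1
    )
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, betas=(0.9, 0.95), weight_decay=0.2)

    for _ in range(3):  # toy loop; the real job streams batches from the C4 dataset
        batch = torch.randn(1, 4096, device="cuda")
        loss = model(batch).pow(2).mean()
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()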

Solution overview

The following image shows the architecture diagram for the resources deployed as part of SageMaker HyperPod for our use case of training the Mathstral model. In your account, you will have a VPC provisioned with a public and a private subnet, and an S3 bucket synced to your FSx for Lustre file system through a data repository link. In the service team account, your cluster of P4de instances is provisioned, along with the head node and the login node, from which you submit the training job to your cluster.

Prerequisites

In the context of this post, we use four p4de.24xlarge instances. You can find more information on the p4de.24xlarge instance type at Amazon EC2 P4 Instances. To get the best inter-node latency, we launch these instances together in a cluster and only run jobs on a single instance group. You can also use a variety of other instance types to follow along with this post.

For more information on getting access to instances in a partition group, refer to the Getting Started section in this post. Note that Mathstral 7B at full precision (FP32) is approximately 26 GB in size, so you need to make sure that your cluster configuration has sufficient GPU memory to load the model along with the gradients, activations, and optimizer moments. This accounts for a total of roughly 107 GB, in addition to the training assets required to kick off a job successfully. For demonstration purposes, we use FSDP for this continued pre-training job.
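As a rough back-of-the-envelope check of that figure (assuming an Adam-style optimizer that keeps two FP32 moments per parameter, and ignoring activations and framework overhead), the training state comes to roughly four times the model size, which FSDP then shards across the cluster:

# Rough estimate of the FP32 training state for Mathstral 7B (activations and
# framework overhead ignored; Adam-style optimizer with two moments assumed).
model_gb = 26                      # Mathstral 7B weights at FP32, per the estimate above
gradients_gb = model_gb            # one FP32 gradient per parameter
optimizer_gb = 2 * model_gb        # exp_avg and exp_avg_sq moments

total_gb = model_gb + gradients_gb + optimizer_gb
per_gpu_gb = total_gb / (4 * 8)    # FULL_SHARD across 4 p4de.24xlarge nodes x 8 GPUs

print(f"total ~{total_gb} GB (in line with the ~107 GB figure above)")
print(f"~{per_gpu_gb:.1f} GB of sharded state per GPU, before activations")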

The following sections describe setting up your infrastructure and environment with SageMaker HyperPod. For detailed instructions and code, we recommend that you follow along with the Amazon SageMaker HyperPod workshop. The prerequisites and cluster setup parts of this workshop go over all the required components needed in order to set up your cluster. The workshop also provides resources to troubleshoot commonly faced issues during setup.

Set up your infrastructure

Deploy HyperPod VPC stack

To set up your cluster, you first need to create some resources. The following resources can be created by deploying this SageMaker HyperPod VPC CloudFormation stack. By default usw2-az4 is specified as the Availability Zone. Change this to reflect the Availability Zone where you have your cluster. This VPC stack creates the following resources:

Deploy the observability stack

In order to use the observability integration with SageMaker HyperPod, you need to deploy the SageMaker HyperPod Observability CloudFormation stack, which can then be used to monitor your cluster metrics in real time.

Set up your environment

Let’s move on to environment setup. In order to deploy this solution, you need to use a Linux-based development environment. This section briefly describes the steps required to set up your cluster. For detailed instructions and code, we recommend that you follow along with the Amazon SageMaker HyperPod workshop.

Set up your cluster

This section guides you through the process of deploying a cluster to train with. You need to set up the following:

| Instance size | GPU devices | Total GPU memory | vCPUs | CPU memory | EFA bandwidth |
|---------------|-------------|------------------|-------|------------|---------------|
| p4de.24xlarge | 8           | 640 GB           | 96    | 1,152 GB   | 400 Gbps      |

Set up the AWS CLI

Before creating the cluster and its associated resources, you need to set up the AWS Command Line Interface (AWS CLI) using the latest version (or version 2.17.1 at a minimum).

To check the AWS CLI version, use the following command.

aws --version

To update the AWS CLI to the latest version, use the following command.

sudo ./aws/install --update

The AWS CLI plugin for Session Manager, a capability of AWS Systems Manager, must be installed to access your cluster. To install the Session Manager plugin on Amazon Linux 2, use the following command:

sudo yum install -y https://s3.amazonaws.com/session-manager-downloads/plugin/latest/linux_64bit/session-manager-plugin.rpm

For detailed steps on installing and setting up the AWS CLI, follow the steps provided in the Install AWS CLI section of the Amazon SageMaker HyperPod workshop.

Source environment variables

An important part of the setup is to source in all the environment variables, using the output from the VPC CloudFormation stack deployed in a previous step. Use the following command.

curl 'https://static.us-east-1.prod.workshops.aws/public/e3e1b2f1-8140-43eb-a316-e76f569119dd/static/scripts/create_config.sh' --output create_config.sh
bash create_config.sh
source env_vars

Once you have sourced them in, confirm that they were correctly set using the following command.

cat env_vars

Set up lifecycle scripts

SageMaker HyperPod uses a collection of lifecycle scripts  to bootstrap the cluster. These scripts are responsible for several actions, including setting up Slurm and mounting the FSx for Lustre file system. You need to customize these scripts in order to mount your FSx for Lustre file system. For detailed steps on setting up these lifecycle scripts, refer to the Set Up Lifecycle Scripts section of the workshop.

Make sure to complete the Enable Optional Lifecycle Scripts section after step 4 of the Set Up Lifecycle Scripts section; it enables installation of the exporter services on the cluster, which are required for the cluster to emit metrics to Amazon Managed Service for Prometheus.

Additionally, the observability stack requires the following two AWS managed IAM policies to be added to your AmazonSagemakerClusterExecutionRole prior to creating your cluster.

aws iam attach-role-policy --role-name $ROLENAME --policy-arn arn:aws:iam::aws:policy/AmazonPrometheusRemoteWriteAccess
aws iam attach-role-policy --role-name $ROLENAME --policy-arn arn:aws:iam::aws:policy/AWSCloudFormationReadOnlyAccess

Once you have uploaded the lifecycle scripts to Amazon S3, you can then create your cluster.

Create your cluster

To create your cluster, you need your cluster configuration. Because you use p4de.24xlarge for this example, copy the following cluster configuration.

source env_vars
cat > cluster-config.json << EOL
{
    "ClusterName": "ml-cluster",
    "InstanceGroups": [
      {
        "InstanceGroupName": "controller-machine",
        "InstanceType": "ml.m5.12xlarge",
        "InstanceStorageConfigs": [
          {
            "EbsVolumeConfig": {
              "VolumeSizeInGB": 500
            }
          }
        ],
        "InstanceCount": 1,
        "LifeCycleConfig": {
          "SourceS3Uri": "s3://${BUCKET}/src",
          "OnCreate": "on_create.sh"
        },
        "ExecutionRole": "${ROLE}",
        "ThreadsPerCore": 1
      },
      {
        "InstanceGroupName": "worker-group-1",
        "InstanceType": "ml.p4de.24xlarge",
        "InstanceCount": 4,
        "LifeCycleConfig": {
          "SourceS3Uri": "s3://${BUCKET}/src",
          "OnCreate": "on_create.sh"
        },
        "ExecutionRole": "${ROLE}",
        "ThreadsPerCore": 1
      }
    ],
    "VpcConfig": {
      "SecurityGroupIds": ["$SECURITY_GROUP"],
      "Subnets": ["$SUBNET_ID"]
    }
}
EOL

If you use a different instance type for your cluster, refer to the Create Cluster section of the workshop to create your cluster-config.json file.

SageMaker HyperPod also gives you the ability to update your clusters to increase the size of an existing worker group or to create a new worker group that adds additional instance types to your cluster. For steps on updating the cluster to create additional worker groups that use other instance types, refer to the Heterogeneous Clusters section of the workshop.

Once you’ve created the cluster-config.json file, follow the Create Cluster steps in the workshop to create the FSx for Lustre configuration (provisioning_parameters.json) file and upload it to Amazon S3. Then, you can validate the configuration using the validate-config.py file in the awsome-distributed-training GitHub repo.

Once this validation is completed, you can create your cluster. Use the following command.

aws sagemaker create-cluster \
    --cli-input-json file://cluster-config.json \
    --region $AWS_REGION

To check the state of your cluster, run the following command.

aws sagemaker list-clusters --output table

You should then be able to observe the cluster creating.

---------------------------------------------------------------------------------------------------------------
|                                                 ListClusters                                                  |
+---------------------------------------------------------------------------------------------------------------+
||                                              ClusterSummaries                                               ||
|+---------------------------------------------+----------------+---------------+-------------------------------+|
||                  ClusterArn                 |  ClusterName   | ClusterStatus |         CreationTime          ||
|+---------------------------------------------+----------------+---------------+-------------------------------+|
||  arn:aws:sagemaker:us-west-2:{cluster arn}  |  ml-cluster    |   Creating    |     time taken to create      ||
|+---------------------------------------------+----------------+---------------+-------------------------------+|

Now that you’ve created a cluster, you can monitor the status in the SageMaker console. This will show you cluster status, running instances, and node groups and allow you to modify the cluster. In the SageMaker HyperPod console, find your cluster and select it, as shown in the following screenshot.
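If you prefer to poll the cluster state programmatically rather than re-running the CLI command or refreshing the console, a minimal boto3 sketch (again assuming a recent boto3 version with the HyperPod cluster APIs) looks like the following.

import time

import boto3

# Poll the HyperPod cluster until it leaves the Creating state.
sagemaker = boto3.client("sagemaker")

while True:
    status = sagemaker.describe_cluster(ClusterName="ml-cluster")["ClusterStatus"]
    print("ClusterStatus:", status)
    if status != "Creating":
        break
    time.sleep(60)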

Once the Cluster status changes to InService, you can connect using Secure Shell (SSH). Make sure that you completed the step in Set up the AWS CLI to install the SSM plugin. You can then use the easy-ssh.sh script from the repo, which wraps the SSM command, to connect to the controller-machine over SSH.

curl -O https://raw.githubusercontent.com/aws-samples/awsome-distributed-training/main/1.architectures/5.sagemaker-hyperpod/easy-ssh.sh
chmod +x easy-ssh.sh
./easy-ssh.sh -c controller-machine ml-cluster

Use the following command to switch to the ubuntu user.

sudo su - ubuntu

Refer to the Get to know your Cluster  section in the SageMaker HyperPod workshop to familiarize yourself with the commands you need to use in the later sections.

Finally, set up SSH access to the compute nodes. To do this, add an SSH key pair to the /fsx/ubuntu directory. Because all the compute nodes mount this directory, you only have to do this once for the ubuntu user to access all the compute nodes. For instructions, refer to the SSH Access to compute section of the workshop.

Congrats on setting up your environment! Now that you’ve completed the necessary steps, you can move on to your training job.

Run your pre-training job

Follow these steps on your cluster head node:

    Navigate to your shared FSx for Lustre file system. If you followed the tutorial linked previously, it will be located at /fsx. Use the following command to clone the awsome-distributed-training repo.
cd /fsx
git clone https://github.com/aws-samples/awsome-distributed-training/
cd awsome-distributed-training/3.test_cases/10.FSDP
    Run the 0.create_conda_env.sh script.

This script will first download and install Miniconda, then create a Conda environment called pt_fsdp. The Conda environment installs PyTorch on AWS, which is a package that is built to run PyTorch workloads on AWS. Specifically, it lets you use EFA out of the box, since OFI-NCCL is pre-built in the Conda package. PyTorch on AWS also provides the latest versions of CUDA, cuDNN, and NCCL for the best performance on GPU-based instances. Dependencies required to run your FSDP training job will be installed in this Conda environment, and since this Conda environment is created on the /fsx file system, it’ll be shared across all your training nodes.

bash 0.create_conda_env.sh
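Once the environment is built, you can optionally run a quick sanity check from a compute node (with the pt_fsdp environment activated) to confirm that PyTorch sees the GPUs and was built with NCCL. This check is only a convenience and is not part of the repo’s scripts.

# Optional sanity check, run from the pt_fsdp Conda environment on a compute node.
import torch

print("torch:", torch.__version__, "built for CUDA:", torch.version.cuda)
print("NCCL version:", torch.cuda.nccl.version())
if torch.cuda.is_available():
    print("GPUs visible:", torch.cuda.device_count(), torch.cuda.get_device_name(0))
else:
    print("No GPUs visible (expected on the controller machine; run this on a compute node)")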

For this training job, you use the C4 dataset, which is several hundred gigabytes. Instead of downloading the whole thing, the create_streaming_dataloaders function will stream the dataset from HuggingFace, so there’s no data prep required for running this training.

If you want to use your own dataset instead, you can format it as a HuggingFace dataset and pass its location to the --dataset_path argument.
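The streaming behavior can be reproduced in a few lines with the Hugging Face datasets library. The snippet below is a simplified illustration of the idea behind create_streaming_dataloaders, not the repo’s exact implementation; the dataset and config names match the --dataset="c4" and --dataset_config_name="en" training arguments.

from datasets import load_dataset
from transformers import AutoTokenizer

# Simplified illustration of streaming C4 (not the repo's exact code).
# With streaming=True, shards are fetched lazily, so no upfront data prep is needed.
dataset = load_dataset("allenai/c4", "en", split="train", streaming=True)
tokenizer = AutoTokenizer.from_pretrained("mistralai/mathstral-7B-v0.1")

for i, example in enumerate(dataset):
    tokens = tokenizer(example["text"], truncation=True, max_length=32768)
    print(len(tokens["input_ids"]), "tokens")
    if i == 2:  # only peek at a few streamed examples
        break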

Launch training

The script to launch the Mathstral training job can be found in 3.distributed-training-mistral-mathstral.sbatch. Depending on the number of nodes in your cluster, you can adjust it by modifying #SBATCH --nodes=4. Because you are using four p4de.24xlarge instances, it has been set to 4.

For the purpose of this post, you need to make sure that the FI_EFA variables for EFA are exported in the 3.distributed-training-mistral-mathstral.sbatch file. If you use instances not enabled for remote direct memory access (RDMA), such as the g5.12xlarge, comment out lines 21–22 of this file. These instances have EFA between nodes, but do not have the GPU direct RDMA access of p4d/e and p5 instances. In this walkthrough, we are using p4de instances, so we leave these lines uncommented.

## Plenty of EFA level variables
## Comment out for non-efa instances (G5, G4d, P3)
export FI_EFA_USE_DEVICE_RDMA=1 # use for p4de
export FI_LOG_LEVEL=1
export FI_PROVIDER=efa
export NCCL_DEBUG=INFO

Under User Variables, make sure to adjust GPUS_PER_NODE to match the number of GPUs on your instance type (8 for p4de).

You can also adjust the training parameters in TRAINING_ARGS. Additional parameters can be found in model/arguments.py.

We use the same directory for both --checkpoint_dir and --resume_from_checkpoint. If there are multiple checkpoints, --resume_from_checkpoint will automatically select the most recent one. This way, if the training is interrupted for any reason, it will automatically pick up the most recent checkpoint.
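A simplified sketch of that resume logic is shown below: point both flags at the same directory, and on startup resolve the most recent checkpoint if one exists. The repo’s checkpoint.py implements its own version of this, so treat the snippet purely as an illustration.

import os

def find_latest_checkpoint(checkpoint_dir: str):
    """Illustrative only: return the most recently modified checkpoint subdirectory,
    or None if no checkpoints have been written yet."""
    if not os.path.isdir(checkpoint_dir):
        return None
    candidates = [
        os.path.join(checkpoint_dir, d)
        for d in os.listdir(checkpoint_dir)
        if os.path.isdir(os.path.join(checkpoint_dir, d))
    ]
    return max(candidates, key=os.path.getmtime) if candidates else None

latest = find_latest_checkpoint("./checkpoints/mathstral-7B")
if latest is None:
    print("No Checkpoints Found - training starts from scratch")
else:
    print(f"Resuming from {latest}")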

Note: You may change these hyperparameters in the 3.distributed-training-mistral-mathstral.sbatch file. We are using arbitrary hyperparameters here for the sake of demonstration.

declare -a TRAINING_ARGS=(
    --train_batch_size=1 \
    --val_batch_size=1 \
    --max_steps=5000 \
    --seed=42 \
    --grad_clip=1.0 \
    --weight_decay=0.2 \
    --beta1=0.9 \
    --beta2=0.95 \
    --activation_checkpointing=1 \
    --intermediate_size=14336 \
    --num_key_value_heads=8 \
    --logging_freq=1 \
    --max_context_width=32768 \
    --vocab_size=32768 \
    --hidden_width=4096 \
    --num_layers=32 \
    --num_heads=32 \
    --resid_pdrop=0.1 \
    --embd_pdrop=0.1 \
    --attn_pdrop=0.1 \
    --summary_first_pdrop=0.1 \
    --initializer_range=0.02 \
    --model_type="mistral" \
    --rotary_pct=0.25 \
    --rotary_emb_base=10000 \
    --lr=0.0001 \
    --lr_decay_style="cosine" \
    --min_lr=1e-5 \
    --warmup=0.0032 \
    --plateau=0.0 \
    --dataset="c4" \
    --tokenizer="mistralai/mathstral-7B-v0.1" \
    --epochs=3 \
    --checkpoint_dir="./checkpoints/mathstral-7B" \
    --resume_from_checkpoint="./checkpoints/mathstral-7B" \
    --checkpoint_freq=50 \
    --validation_freq=500 \
    --dataset_config_name="en" \
    --limit_all_gathers=1 \
    --sharding_strategy="full" \ # https://pytorch.org/docs/stable/fsdp.html
    --offload_activations=1
)

To launch your training, run the following command.

sbatch 3.distributed-training-mistral-mathstral.sbatch

You’ll find a new file in the FSDP directory of the form slurm-[job-number].out. This will be continuously updated with your training logs. Don’t be worried if you notice a long stream of NCCL logs (we prefer to use NCCL_DEBUG=INFO for verbose logging). After about a minute, you should observe your Mathstral model training, with an output similar to the following.

...
+ TORCHRUN=./pt_fsdp/bin/torchrun
+ export TRAIN_SCRIPT=./train.py
+ TRAIN_SCRIPT=./train.py
+ TRAINING_ARGS=(--train_batch_size=1 --val_batch_size=1 --max_steps=5000 --seed=42 --grad_clip=1.0 --weight_decay=0.2 --beta1=0.9 --beta2=0.95 --activation_checkpointing=1 --intermediate_size=14336 --num_key_value_heads=8 --logging_freq=1 --max_context_width=32768 --vocab_size=32768 --hidden_width=4096 --num_layers=32 --num_heads=32 --resid_pdrop=0.1 --embd_pdrop=0.1 --attn_pdrop=0.1 --summary_first_pdrop=0.1 --initializer_range=0.02 --model_type="mistral" --rotary_pct=0.25 --rotary_emb_base=10000 --lr=0.0001 --lr_decay_style="cosine" --min_lr=1e-5 --warmup=0.0032 --plateau=0.0 --dataset="c4" --tokenizer="mistralai/mathstral-7B-v0.1" --epochs=3 --checkpoint_dir="./checkpoints/mathstral-7B" --resume_from_checkpoint="./checkpoints/mathstral-7B" --checkpoint_freq=50 --validation_freq=500 --dataset_config_name="en" --limit_all_gathers=1 --sharding_strategy="full" \ # https://pytorch.org/docs/stable/fsdp.html --offload_activations=1)
+ declare -a TRAINING_ARGS
+ AUTO_RESUME=
+ '[' -d /opt/sagemaker_cluster ']'
+ echo 'Detected Hyperpod cluster.. enabling --auto-resume=1'
Detected Hyperpod cluster.. enabling --auto-resume=1
+ AUTO_RESUME=--auto-resume=1
+ srun --auto-resume=1 -l ./pt_fsdp/bin/torchrun --nproc_per_node=8 --nnodes=4 --rdzv_id=35 --rdzv_backend=c10d --rdzv_endpoint=ip-10-2-39-253 ./train.py --train_batch_size=1 --val_batch_size=1 --max_steps=5000 --seed=42 --grad_clip=1.0 --weight_decay=0.2 --beta1=0.9 --beta2=0.95 --activation_checkpointing=1 --intermediate_size=14336 --num_key_value_heads=8 --logging_freq=1 --max_context_width=32768 --vocab_size=32768 --hidden_width=4096 --num_layers=32 --num_heads=32 --resid_pdrop=0.1 --embd_pdrop=0.1 --attn_pdrop=0.1 --summary_first_pdrop=0.1 --initializer_range=0.02 --model_type=mistral --rotary_pct=0.25 --rotary_emb_base=10000 --lr=0.0001 --lr_decay_style=cosine --min_lr=1e-5 --warmup=0.0032 --plateau=0.0 --dataset=c4 --tokenizer=mistralai/mathstral-7B-v0.1 --epochs=3 --checkpoint_dir=./checkpoints/mathstral-7B --resume_from_checkpoint=./checkpoints/mathstral-7B --checkpoint_freq=50 --validation_freq=500 --dataset_config_name=en --limit_all_gathers=1 --sharding_strategy=full ' #' https://pytorch.org/docs/stable/fsdp.html --offload_activations=1
...
3: 2024-07-19 03:31:38 I [train.py:155] Creating Model
3: 2024-07-19 03:33:08 I [train.py:171] Created model with total parameters: 7248023552 (7.25 B)
3: ...
3: 2024-07-19 03:33:23 I [train.py:209] Wrapped model with FSDP
3: 2024-07-19 03:33:23 I [train.py:226] Created optimizer
3: 2024-07-19 03:33:23 I [checkpoint.py:70] No Checkpoints Found
...
3: 2024-07-19 03:33:35 I [train.py:102] Batch 0 Loss: 11.19900, Speed: 5.10 samples/sec, lr: 0.000006
3: 2024-07-19 03:33:38 I [train.py:102] Batch 1 Loss: 11.18291, Speed: 10.96 samples/sec, lr: 0.000013
3: 2024-07-19 03:33:40 I [train.py:102] Batch 2 Loss: 11.09163, Speed: 11.22 samples/sec, lr: 0.000019
3: 2024-07-19 03:33:43 I [train.py:102] Batch 3 Loss: 10.86621, Speed: 11.19 samples/sec, lr: 0.000025
3: 2024-07-19 03:33:46 I [train.py:102] Batch 4 Loss: 10.58236, Speed: 11.12 samples/sec, lr: 0.000031
3: 2024-07-19 03:33:49 I [train.py:102] Batch 5 Loss: 10.08024, Speed: 11.18 samples/sec, lr: 0.000038
3: 2024-07-19 03:33:52 I [train.py:102] Batch 6 Loss: 10.15507, Speed: 11.23 samples/sec, lr: 0.000044
3: 2024-07-19 03:33:55 I [train.py:102] Batch 7 Loss: 9.97296, Speed: 10.42 samples/sec, lr: 0.000050
3: 2024-07-19 03:33:58 I [train.py:102] Batch 8 Loss: 10.13596, Speed: 11.21 samples/sec, lr: 0.000056
3: 2024-07-19 03:34:01 I [train.py:102] Batch 9 Loss: 9.93156, Speed: 11.10 samples/sec, lr: 0.000063

Observability

SageMaker HyperPod can optionally be integrated with Amazon Managed Service for Prometheus and Amazon Managed Grafana to export metrics about your cluster and cluster-nodes to an Amazon Managed Grafana dashboard.

For more details about configuring Amazon Managed Service for Prometheus and Amazon Managed Grafana, refer to the Prometheus Configuration and Amazon Managed Grafana sections in the SageMaker HyperPod workshop.

Slurm Exporter dashboard

The Amazon Managed Grafana Slurm dashboard (ID: 4323) provides visualization options for monitoring Slurm clusters. Prometheus Slurm exporter is installed on the controller node of the cluster. Some of the metrics exported include:

The following screenshot of the exporter dashboard shows the continued pre-training job for Mathstral being completed successfully.

Node Exporter dashboard

The Amazon Managed Grafana Node Exporter Full dashboard (ID: 1860) offers visualization options for monitoring system metrics collected by the Prometheus Node Exporter installed on the cluster nodes. Some of the key metrics you can visualize include:

DCGM Exporter dashboard

The Amazon Managed Grafana NVIDIA DCGM Exporter dashboard (ID: 12239) offers visualization options for monitoring NVIDIA GPU metrics collected by the DCGM Exporter. Some of the key metrics you can visualize include:

EFA Metrics dashboard

The Amazon Managed Grafana EFA Metrics dashboard (ID: 20579) offers visualization options for monitoring EFA related metrics collected by the EFA Node Exporter. Some of the key visualizations include:

FSx Metrics dashboard

The Amazon Managed Grafana FSx for Lustre dashboard (ID: 20906) offers visualization options for monitoring Amazon FSx for Lustre file system related metrics collected by Amazon CloudWatch. Some of the key visualizations include:

These metrics provide insights into various aspects of your FSx for Lustre file systems.

Resiliency

As mentioned previously, one of the value propositions of SageMaker HyperPod is that it provides a variety of cluster resiliency features such as cluster health checks, auto-resume, and the option to manually replace faulty nodes.

Based on the status of these health checks, SageMaker HyperPod detects whether nodes in the cluster are healthy or not. If a node is deemed unhealthy by any of the health checks, SageMaker HyperPod uses its auto-resume feature to automatically replace the faulty node, without any manual intervention.

Additionally, users have the option to implement checkpointing in their training procedure. Checkpointing, combined with auto-resume, means that once a faulty node is replaced, the training job can resume from the last saved checkpoint. This way, despite a hardware failure, a user’s training job can run with minimal loss in progress.

In this section, we demonstrate the resiliency and auto-resume feature of SageMaker HyperPod by simulating a hardware failure scenario and pointing you towards some logs that indicate the success of a replacement job. We use the same submitted FSDP training job, which has the following two important components enabled:

    Checkpointing is enabled and implemented.
    The --auto-resume=1 flag is set. You can verify this in the SLURM .out file.

This section in the provided sbatch file sets the --auto-resume=1 flag.

AUTO_RESUME=""if [ -d "/opt/sagemaker_cluster" ]; then    echo "Detected Hyperpod cluster.. enabling --auto-resume=1"    AUTO_RESUME="--auto-resume=1"fi
srun ${AUTO_RESUME} -l ${TORCHRUN} "${TORCHRUN_ARGS[@]}" $TRAIN_SCRIPT "${TRAINING_ARGS[@]}"

The sbatch file has the checkpointing flags checkpoint_freq, checkpoint_dir, resume_from_checkpoint, which tell the job how often to write checkpoints, where to write the checkpoints to, and what directory to read checkpoints from in case of failure, respectively.

Assuming that you already have your training job submitted, wait until a few checkpoints are written to the ./checkpoints directory (or the directory name you specified for checkpoint_dir). You can check whether any checkpoints were written by running ls -lt checkpoints/. This should return an output that resembles the following.

total 74
-rw-rw-r--  1 ubuntu ubuntu     1 Dec  9 00:21 latest_checkpointed_iteration.txt
drwxrwxr-x 10 ubuntu ubuntu 33280 Dec  9 00:20 iter_0000002
drwxrwxr-x 10 ubuntu ubuntu 33280 Dec  9 00:11 iter_0000001

You may also check the progress of your training job by running tail -f slurm-<job-id>.out, where <job-id> can be found by running squeue. You should observe an output that resembles the following.

1:  iteration        1/  508626 | consumed samples:          288 | elapsed time per iteration (ms): 440352.6 | learning rate: 0.000E+00 | global batch size:   288 | loss scale: 4294967296.0 | number of skipped iterations:   1 | number of nan iterations:   0 |
0: saving checkpoint at iteration       1 to /fsx/checkpoints
0:   successfully saved checkpoint at iteration       1 to /fsx/checkpoints
1: (min, max) time across ranks (ms):
1:     save-checkpoint ................................: (81611.24, 81611.82)

Once you’ve confirmed that your training job is running and that you have checkpoints written, you are ready to simulate a hardware failure.

As part of the output of running squeue, you have received an output that resembles the following.

JOBID PARTITION     NAME   USER ST  TIME  NODES NODELIST(REASON)
32          dev interact ubuntu  R  0:02      4  ip-10-2-9-98,...

This tells you which jobs are running and on which nodes. Locate your training job and choose any node except the first node in the list of nodes allocated to your job; the node you choose is the one you will inject an error into. Avoiding the first node is important because PyTorch uses node 0 (that is, the first node) as the coordination node for your training job.

Once you’ve identified the node to inject the error onto, connect to it using SSH with the following command.

ssh <NODE ip>

You can inject an ECC error by running the following command.

dcgmi test --inject --gpuid 0 -f 319 -v 4

This simulates a double-bit error (DBE) on a GPU of your chosen node. Additionally, to simulate a job failure, you can kill the training processes: find the process ID (PID) of any of the running Python processes (these are the workers running your FSDP training job) and terminate it. The -9 flag is the signal number for SIGKILL, which forces a process to stop without giving it a chance to clean up or perform any other actions.

ps -aux | grep python
kill -9 <PID>

Once the ECC error is injected and the Python process has stopped, you can exit the compute node. In the meantime, back on the controller machine, you can follow the slurmctld.log file using the following command.

tail -f /var/log/slurm/slurmctld.log

In there, you can observe the following lines, which show a failed job or node.

[2024-07-19T04:13:03.313] sched: Allocate JobId=35 NodeList=ip-10-2-39-253,ip-10-2-40-102,ip-10-2-76-26,ip-10-2-108-162 #CPUs=192 Partition=dev
[2024-07-19T04:50:31.682] _slurm_rpc_submit_batch_job: JobId=35 InitPrio=1 usec=727
[2024-07-19T04:50:31.803] update_node: node ip-10-2-39-253 reason set to: Action: Replace
[2024-07-19T04:50:31.803] update_node: node ip-10-2-39-253 state set to FAILING

Pay attention to the line that says update_node: node ip-10-2-39-253 reason set to: Action:Replace, which is the log that says that the node has failed and requires replacement.

If you look at your <slurm-job>.out file, you should observe logs like the following.

[Auto Resume] Info: JobID: 35 StepID: 0 Initiating communication with cluster agent to diagnose health of nodes
[Auto Resume] Info: JobID: 35 StepID: 0 Response from cluster agent: JobID=35, ResumeAction=RETRYSTEP
[Auto Resume] Info: JobID: 35 StepID: 0 Job failed - replacing nodes
[Auto Resume] Info: JobID: 35 StepID: 0 Job failed - Dropping unhealthy nodes
[Auto Resume] Info: JobID: 35 StepID: 0 Succesfully shrink job to retain healthy nodes
...
srun: job 35 queued and waiting for resources

This shows that job 35 (your training job) has been paused while the node replacement process is initiated. You can verify this by running squeue, where you will observe a job named auto-res. This is the auto-resume job that SageMaker HyperPod initiates to replace your faulty node.

JOBID PARTITION  NAME      USER ST    TIME NODES NODELIST(REASON)
35    dev    auto-res    ubuntu PD    0:00     4 (Resources)
...

You can also monitor your SageMaker HyperPod cluster using the AWS console. Under Instances, you should observe one of the nodes in worker-group-1 in Pending state, as shown in the following screenshot. This shows that the node is about to get replaced.

Once your node is replaced, you can observe the slurmctld.log file. Be on the alert for the following line:

update_node: node <YOUR-NODE-IP-ADDRESS> reason set to: AWS:Replaced

You can also verify that your node was successfully replaced using the HyperPod cluster tab in the Amazon SageMaker console.

Once your node is replaced, squeue should no longer display the auto-res job and should only display your original training job. The node is successfully replaced, without any manual intervention.

Because you enabled checkpointing, you can verify that the training job resumes from the latest checkpoint. In your <slurm-job>.out file, find the following lines, which show that a checkpoint was detected in the checkpoint directory (./checkpoints) and that the latest checkpoint was loaded, respectively.

...
Loading checkpoint from checkpoints/mathstral-10steps ...
...
Checkpoint loaded from checkpoints/mathstral-10steps ...
...

If you continue to monitor your <slurm-job>.out file, you should observe that your training job has resumed from the latest checkpoint.

Clean up

    To delete your cluster, enter the following command.
aws sagemaker delete-cluster --cluster-name ml-cluster

Once the delete operation finishes, confirm that the cluster no longer appears in the SageMaker HyperPod clusters section of the SageMaker console.

    To use the console to delete your SageMaker HyperPod VPC and Observability CloudFormation stacks, follow the directions at Delete a stack from the CloudFormation console. Alternatively, use the AWS CLI by entering the following command. Replace my-stack with the name of your stacks.
aws cloudformation delete-stack \
    --stack-name my-stack

Conclusion

In this post, we provided a comprehensive guide on using Amazon SageMaker HyperPod for training large-scale models such as Mistral AI’s Mathstral using PyTorch Fully Sharded Data Parallel (FSDP). The process highlighted the efficiency of distributed training on SageMaker HyperPod, showcasing the critical role of resiliency and observability features in maintaining uninterrupted, scalable training environments.

Because of the integration with tools such as Amazon Managed Service for Prometheus and Amazon Managed Grafana for real-time monitoring, along with the robust cluster management capabilities of SageMaker HyperPod, ML practitioners can focus on model development rather than infrastructure management. The detailed steps for setting up the infrastructure, deploying the observability stack, and running a training job demonstrate how SageMaker HyperPod helps tackle the complexities of distributed training.

Moreover, the automatic health checks and the auto-resume feature significantly reduce downtime and minimize the impact of hardware failures so that large-scale training jobs can proceed with minimal interruptions. This level of resilience is crucial for maintaining the pace of innovation in AI research, especially when dealing with massive FMs.

By following the outlined procedures and using the powerful tools provided by AWS, data scientists and engineers can optimize their training workflows, reduce operational overhead, and accelerate the development of state-of-the-art models.

Getting Started

Interested in getting started with SageMaker HyperPod? Reach out to your AWS Account Team or email aws-frameworks-gtm@amazon.com. To begin experimenting with other examples on SageMaker HyperPod, refer to the awsome-distributed-training GitHub repo and the Amazon SageMaker HyperPod workshop.


About the Authors

Niithiyn Vijeaswaran is a Solutions Architect at AWS. His area of focus is generative AI and AWS AI Accelerators. He holds a Bachelor’s degree in Computer Science and Bioinformatics. Niithiyn works closely with the Generative AI GTM team to enable AWS customers on multiple fronts and accelerate their adoption of generative AI. He’s an avid fan of the Dallas Mavericks and enjoys collecting sneakers.

Aman Shanbhag is an Associate Specialist Solutions Architect on the ML Frameworks team at Amazon Web Services, where he helps customers and partners with deploying ML Training and Inference solutions at scale. Before joining AWS, Aman graduated from Rice University with degrees in Computer Science, Mathematics, and Entrepreneurship.

Armando Diaz is a Solutions Architect at AWS. He focuses on generative AI, AI/ML, and data analytics. At AWS, Armando helps customers integrate cutting-edge generative AI capabilities into their systems, fostering innovation and competitive advantage. When he’s not at work, he enjoys spending time with his wife and family, hiking, and traveling the world.

Rohit Talluri is a Generative AI GTM Specialist (Tech BD) at Amazon Web Services (AWS). He is partnering with key GenAI foundation model providers, AWS service teams, strategic customers, founders, universities, venture ecosystems, and Amazon to develop technology strategy that enables the next generation of artificial intelligence, machine learning, and accelerated computing on AWS.

Anoop Saha is a Sr GTM Specialist at Amazon Web Services (AWS) focusing on Gen AI model training and inference. He is partnering with top foundation model builders, strategic customers, and AWS service teams to enable distributed training and inference at scale on AWS and lead joint GTM motions. Before AWS, Anoop has held several leadership roles at startups and large corporations, primarily focusing on silicon and system architecture of AI infrastructure.
