AWS Machine Learning Blog | December 5, 2024
Scale ML workflows with Amazon SageMaker Studio and Amazon SageMaker HyperPod

This post describes the integration of Amazon SageMaker Studio and Amazon SageMaker HyperPod, which addresses the challenge of taking machine learning workflows from initial prototypes to large-scale production deployment. The integration provides a comprehensive environment that supports the entire ML lifecycle, from development to deployment at scale, and the post details the steps and prerequisites for implementing the solution.

🎯 Set up the environment and permissions to access Amazon HyperPod clusters

📚 Create a JupyterLab space and mount a file system, avoiding data migration and code changes

👀 Discover and view HyperPod cluster information through SageMaker Studio

💻 Use a sample notebook to show how to run training jobs on a Slurm cluster

📈 Monitor tasks through the SageMaker Studio UI to optimize resource utilization

Scaling machine learning (ML) workflows from initial prototypes to large-scale production deployment can be a daunting task, but the integration of Amazon SageMaker Studio and Amazon SageMaker HyperPod offers a streamlined solution to this challenge. As teams progress from proof of concept to production-ready models, they often struggle with efficiently managing growing infrastructure and storage needs. This integration addresses these hurdles by providing data scientists and ML engineers with a comprehensive environment that supports the entire ML lifecycle, from development to deployment at scale.

In this post, we walk you through the process of scaling your ML workloads using SageMaker Studio and SageMaker HyperPod.

Solution overview

Implementing the solution consists of the following high-level steps:

    Set up your environment and the permissions to access Amazon HyperPod clusters in SageMaker Studio.
    Create a JupyterLab space and mount an Amazon FSx for Lustre file system to your space. This eliminates the need for data migration or code changes as you scale, and mitigates potential reproducibility issues that often arise from data discrepancies across different stages of model development.
    Use SageMaker Studio to discover the SageMaker HyperPod clusters, and view cluster details and metrics. When you have access to multiple clusters, this information can help you compare the specifications, current utilization, and queue status of each cluster to identify the one that meets the requirements of your specific ML task.
    Use the sample notebook to connect to the cluster and run a Meta Llama 2 training job with PyTorch FSDP on your Slurm cluster.
    After you submit the long-running job to the cluster, monitor the tasks directly through the SageMaker Studio UI. This gives you real-time insights into your distributed workflows and helps you quickly identify bottlenecks, optimize resource utilization, and improve overall workflow efficiency.

This integrated approach not only streamlines the transition from prototype to large-scale training but also enhances overall productivity by maintaining a familiar development experience even as you scale up to production-level workloads.

Prerequisites

Complete the following prerequisite steps:

    Create a SageMaker HyperPod Slurm cluster. For instructions, refer to the Amazon SageMaker HyperPod workshop or Tutorial for getting started with SageMaker HyperPod.
    Make sure you have the latest version of the AWS Command Line Interface (AWS CLI).
    Create a user in the Slurm head node or login node with a UID greater than 10000. Refer to Multi-User for instructions to create a user.
    Tag the SageMaker HyperPod cluster with the key hyperpod-cluster-filesystem and, as the value, the ID of the FSx for Lustre file system associated with the SageMaker HyperPod cluster. This is needed for Amazon SageMaker Studio to mount FSx for Lustre onto JupyterLab and Code Editor spaces. Use the following code snippet to add a tag to an existing SageMaker HyperPod cluster:
    aws sagemaker add-tags --resource-arn <cluster_ARN> \
    --tags Key=hyperpod-cluster-filesystem,Value=<fsx_id>
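
To confirm the tag was applied, you can list the tags on the cluster. This verification step is our addition, not part of the original walkthrough:

# Verify the file system tag is present on the cluster
aws sagemaker list-tags --resource-arn <cluster_ARN> \
    --query "Tags[?Key=='hyperpod-cluster-filesystem']"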

Set up your permissions

In the following sections, we outline the steps to create an Amazon SageMaker domain, create a user, set up a SageMaker Studio space, and connect to the SageMaker HyperPod cluster. By the end of these steps, you should be able to connect to a SageMaker HyperPod Slurm cluster and run a sample training workload. To follow the setup instructions, you need to have admin privileges. Complete the following steps:

    Create a new AWS Identity and Access Management (IAM) execution role with AmazonSageMakerFullAccess attached to the role. Also attach the following JSON policy to the role, which enables SageMaker Studio to access the SageMaker HyperPod cluster. Make sure the trust relationship on the role allows the sagemaker.amazonaws.com service to assume this role.
{    "Version": "2012-10-17",                "Statement": [        {            "Effect": "Allow",            "Action": [                "ssm:StartSession",                "ssm:TerminateSession"            ],            "Resource": "*"            }{            "Effect": "Allow",            "Action": [                "sagemaker:CreateCluster",                "sagemaker:ListClusters"            ],            "Resource": "*"            },        {            "Effect": "Allow",            "Action": [                "sagemaker:DescribeCluster",                "sagemaker:DescribeClusterNode",                "sagemaker:ListClusterNodes",                "sagemaker:UpdateCluster",                "sagemaker:UpdateClusterSoftware"            ],            "Resource": "arn:aws:sagemaker:region:account-id:cluster/*"            }    ]}
    In order to use the role you created to access the SageMaker HyperPod cluster head or login node using AWS Systems Manager, you need to add a tag to this IAM role, where Tag Key = "SSMSessionRunAs" and Tag Value = "<posix user>". The POSIX user is the user that is set up on the Slurm head node; Systems Manager uses this user to start sessions on the head node. When you activate Run As support, it prevents Session Manager from starting sessions using the ssm-user account on a managed node. To enable Run As in Session Manager, complete the following steps (a CLI snippet for tagging the role follows this list):
      On the Session Manager console, choose Preferences, then choose Edit.
      Don't specify any user name. The user name will be picked up from the SSMSessionRunAs role tag that you created earlier.
      In the Linux shell profile section, enter /bin/bash.
      Choose Save.
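
You can add this tag from the AWS CLI as well. A minimal sketch, assuming a role named sagemaker-hyperpod-execution-role and a Slurm POSIX user named ubuntu (both placeholders):

# Tag the execution role so Session Manager starts sessions as the Slurm POSIX user
aws iam tag-role --role-name sagemaker-hyperpod-execution-role \
    --tags Key=SSMSessionRunAs,Value=ubuntu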
    Create a new SageMaker Studio domain with the execution role created earlier, along with the other parameters required to access the SageMaker HyperPod cluster. Use the following script to create the domain, replacing the export variables accordingly. Here, VPC_ID and SUBNET_ID are the same as the SageMaker HyperPod cluster's VPC and subnet, and EXECUTION_ROLE_ARN is the role you created earlier.
export DOMAIN_NAME=<domain_name>
export VPC_ID=<vpc_id_for_hp_cluster>
export SUBNET_ID=<private_subnet_id>
export EXECUTION_ROLE_ARN=<execution_role_arn>
export FILE_SYSTEM_ID=<fsx_id>
export FILE_SYSTEM_PATH=<fsx_mount_path>   # referenced in the JSON below; for example, /
export USER_UID=10000
export USER_GID=1001
export REGION=us-east-2

cat > user_settings.json << EOL
{
    "ExecutionRole": "$EXECUTION_ROLE_ARN",
    "CustomPosixUserConfig":
    {
        "Uid": $USER_UID,
        "Gid": $USER_GID
    },
    "CustomFileSystemConfigs":
    [
        {
            "FSxLustreFileSystemConfig":
            {
                "FileSystemId": "$FILE_SYSTEM_ID",
                "FileSystemPath": "$FILE_SYSTEM_PATH"
            }
        }
    ]
}
EOL

aws sagemaker create-domain \
--domain-name $DOMAIN_NAME \
--vpc-id $VPC_ID \
--subnet-ids $SUBNET_ID \
--auth-mode IAM \
--default-user-settings file://user_settings.json \
--region $REGION

The UID and GID in the preceding configuration are set to 10000 and 1001 by default; these can be overridden to match the user created in Slurm, and this UID/GID is used to grant permissions to the FSx for Lustre file system. Also, setting this at the domain level gives every user the same UID. To give each user a separate UID, consider setting CustomPosixUserConfig when creating the user profile, as sketched below.
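
A minimal sketch of such a per-user override, assuming a hypothetical profile name user-a and UID 10001:

# Create a user profile with its own POSIX UID/GID instead of the domain default
aws sagemaker create-user-profile \
--domain-id <DomainId> \
--user-profile-name user-a \
--user-settings '{"CustomPosixUserConfig":{"Uid":10001,"Gid":1001}}' \
--region <REGION>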

    After you create the domain, you need to attach the SecurityGroupIdForInboundNfs security group created as part of domain creation to all elastic network interfaces (ENIs) of the FSx for Lustre volume:
      Locate the Amazon Elastic File System (Amazon EFS) file system associated with the domain and the corresponding security group attached to it. You can find the EFS file system on the Amazon EFS console; it's tagged with the domain ID, as shown in the following screenshot.
      Collect the corresponding security group, which will be named inbound-nfs-<domain-id> and can be found on the Network tab.
      On the FSx for Lustre console, locate your file system. To see all the ENIs attached to it, use the Amazon EC2 console. Alternatively, you can find the ENIs using the AWS CLI by calling the fsx:DescribeFileSystems API.
      For each ENI, attach the SecurityGroupIdForInboundNfs security group of the domain to it.

Alternatively, you can use the following script to automatically find and attach security groups to the ENIs associated with the FSx for Lustre volume. Replace the REGION, DOMAIN_ID, and FSX_ID attributes accordingly.

#!/bin/bash
export REGION=us-east-2
export DOMAIN_ID=d-xxxxx
export FSX_ID=fs-xxx

export EFS_ID=$(aws sagemaker describe-domain --domain-id $DOMAIN_ID --region $REGION --query 'HomeEfsFileSystemId' --output text)
export MOUNT_TARGET_ID=$(aws efs describe-mount-targets --file-system-id $EFS_ID --region $REGION --query 'MountTargets[0].MountTargetId' --output text)
export EFS_SG=$(aws efs describe-mount-target-security-groups --mount-target-id $MOUNT_TARGET_ID --query 'SecurityGroups[0]' --output text)
echo "Security group associated with the domain: $EFS_SG"

echo "Adding security group to FSx for Lustre file system ENIs"
# Get the network interface IDs associated with the FSx file system
NETWORK_INTERFACE_IDS=$(aws fsx describe-file-systems --file-system-ids $FSX_ID --query "FileSystems[0].NetworkInterfaceIds" --output text)

# Iterate through each network interface and attach the security group
for ENI_ID in $NETWORK_INTERFACE_IDS; do
    aws ec2 modify-network-interface-attribute --network-interface-id $ENI_ID --groups $EFS_SG
    echo "Attached security group $EFS_SG to network interface $ENI_ID"
done

Without this step, application creation will fail with an error.

    After you create the domain, you can use the domain to create a user profile. Replace the DOMAIN_ID value with the one created in the previous step.
export DOMAIN_ID=d-xxx
export USER_PROFILE_NAME=test
export REGION=us-east-2

aws sagemaker create-user-profile \
--domain-id $DOMAIN_ID \
--user-profile-name $USER_PROFILE_NAME \
--region $REGION

Create a JupyterLab space and mount the FSx for Lustre file system

Create a space using the FSx for Lustre file system with the following code:

export SPACE_NAME=hyperpod-space
export DOMAIN_ID=d-xxx
export USER_PROFILE_NAME=test
export FILE_SYSTEM_ID=fs-xxx
export REGION=us-east-2

aws sagemaker create-space --domain-id $DOMAIN_ID \
--space-name $SPACE_NAME \
--space-settings "AppType=JupyterLab,CustomFileSystems=[{FSxLustreFileSystem={FileSystemId=$FILE_SYSTEM_ID}}]" \
--ownership-settings OwnerUserProfileName=$USER_PROFILE_NAME \
--space-sharing-settings SharingType=Private \
--region $REGION
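
Space creation is asynchronous. Before creating an app in the space, you can confirm it's ready; this check is our addition, not part of the original steps:

# The space is ready once its status is InService
aws sagemaker describe-space --domain-id $DOMAIN_ID \
--space-name $SPACE_NAME --region $REGION --query 'Status'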

Create an application using the space with the following code:

export SPACE_NAME=hyperpod-space
export DOMAIN_ID=d-xxx
export APP_NAME=test-app
export INSTANCE_TYPE=ml.t3.medium
export REGION=us-east-2
export IMAGE_ARN=arn:aws:sagemaker:us-east-2:081975978581:image/sagemaker-distribution-cpu

# Use double quotes so the INSTANCE_TYPE and IMAGE_ARN variables expand
aws sagemaker create-app --space-name $SPACE_NAME \
--resource-spec "{\"InstanceType\":\"$INSTANCE_TYPE\",\"SageMakerImageArn\":\"$IMAGE_ARN\"}" \
--domain-id $DOMAIN_ID --app-type JupyterLab --app-name $APP_NAME --region $REGION
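
Similarly, the app takes a short while to provision; a quick status check (again, an added convenience):

# The app can be opened once its status is InService
aws sagemaker describe-app --domain-id $DOMAIN_ID \
--app-type JupyterLab --app-name $APP_NAME \
--space-name $SPACE_NAME --region $REGION --query 'Status'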

Discover clusters in SageMaker Studio

You should now have everything ready to access the SageMaker HyperPod cluster using SageMaker Studio. Complete the following steps:

    On the SageMaker console, choose Admin configurations, then choose Domains.
    Locate the user profile you created and launch SageMaker Studio.
    Under Compute in the navigation pane, choose HyperPod clusters.

Here you can view the SageMaker HyperPod clusters available in the account.
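
If you prefer the CLI, the same cluster information is available programmatically; a short sketch using the HyperPod cluster APIs:

# List HyperPod clusters in the account, then inspect a specific one
aws sagemaker list-clusters --region <REGION>
aws sagemaker describe-cluster --cluster-name <cluster_name> --region <REGION>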

    Identify the right cluster for your training workload by looking at the cluster details and the cluster hardware metrics.

You can also preview the cluster by choosing the arrow icon.

You can also go to the Settings and Details tabs to find more information about the cluster.

Work in SageMaker Studio and connect to the cluster

You can also launch either JupyterLab or Code Editor, both of which mount the cluster FSx for Lustre volume for development and debugging.

    In SageMaker Studio, choose JupyterLab from the available applications.
    Choose a space that has the FSx for Lustre file system mounted to get a consistent, reproducible environment.

The Cluster Filesystem column identifies which space has the cluster file system mounted.

This should launch JupyterLab with the FSx for Lustre volume mounted. By default, you should see the getting started notebook in your home folder, which has step-by-step instructions to run a Meta Llama 2 training job with PyTorch FSDP on the Slurm cluster. This example notebook demonstrates how you can use SageMaker Studio notebooks to transition from prototyping your training script to scaling up your workloads across multiple instances in the cluster environment. Additionally, you should see the FSx for Lustre file system you mounted to your JupyterLab space under /home/sagemaker-user/custom-file-systems/fsx_lustre.
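
The notebook drives the submission for you, but under the hood a job like this is submitted as a standard Slurm batch script. The following is a minimal sketch of what a multi-node PyTorch launch can look like; the node counts, port, and file names (train.py, llama2_fsdp.yaml) are illustrative placeholders, not taken from the sample notebook:

#!/bin/bash
#SBATCH --job-name=llama2-fsdp      # name shown in the Studio task list
#SBATCH --nodes=2                   # number of cluster nodes to train across
#SBATCH --ntasks-per-node=1         # one launcher process per node
#SBATCH --output=logs/%x_%j.out     # per-job stdout/stderr

# srun starts one torchrun per node; torchrun spawns one worker per GPU
srun torchrun \
    --nnodes=$SLURM_JOB_NUM_NODES \
    --nproc_per_node=8 \
    --rdzv_backend=c10d \
    --rdzv_endpoint=$(hostname):29500 \
    train.py --config llama2_fsdp.yaml

Submit the script with sbatch from the head or login node; the job then appears in the Studio task view described in the next section.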

Monitor the tasks on SageMaker Studio

You can go to SageMaker Studio and choose the cluster to view a list of tasks currently in the Slurm queue.

You can choose a task to get additional details, such as the scheduling and job state, resource usage, and job submission details and limits.

You can also perform actions such as release, requeue, suspend, and hold on these Slurm tasks using the UI.
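
These UI actions map to standard Slurm commands, which you can also run directly on the head or login node; shown here for reference:

squeue                         # list tasks currently in the queue
scontrol show job <job_id>     # scheduling details, resource usage, and limits
scontrol hold <job_id>         # hold a pending job
scontrol release <job_id>      # release a held job
scontrol requeue <job_id>      # requeue a job
scontrol suspend <job_id>      # suspend a running job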

Clean up

Complete the following steps to clean up your resources:

    Delete the space:
aws --region <REGION> sagemaker delete-space \
--domain-id <DomainId> \
--space-name <SpaceName>
    Delete the user profile:
aws --region <REGION> sagemaker delete-user-profile \
--domain-id <DomainId> \
--user-profile-name <UserProfileName>
    Delete the domain. To retain the EFS volume, specify HomeEfsFileSystem=Retain.
aws --region <REGION> sagemaker delete-domain \
--domain-id <DomainId> \
--retention-policy HomeEfsFileSystem=Delete
    Delete the SageMaker HyperPod cluster.
    Delete the IAM role you created.

Conclusion

In this post, we explored an approach to streamline your ML workflows using SageMaker Studio. We demonstrated how you can seamlessly transition from prototyping your training script within SageMaker Studio to scaling up your workload across multiple instances in a cluster environment. We also explained how to mount the cluster FSx for Lustre volume to your SageMaker Studio spaces to get a consistent reproducible environment.

This approach not only streamlines your development process but also allows you to initiate long-running jobs on the clusters and conveniently monitor their progress directly from SageMaker Studio.

We encourage you to try this out and share your feedback in the comments section.

Special thanks to Durga Sury (Sr. ML SA), Monidipa Chakraborty (Sr. SDE), and Sumedha Swamy (Sr. Manager PMT) for their support in launching this post.


About the Authors

Arun Kumar Lokanatha is a Senior ML Solutions Architect with the Amazon SageMaker team. He specializes in large language model training workloads, helping customers build LLM workloads using SageMaker HyperPod, SageMaker training jobs, and SageMaker distributed training. Outside of work, he enjoys running, hiking, and cooking.

Pooja Karadgi is a Senior Technical Product Manager at Amazon Web Services. At AWS, she is a part of the Amazon SageMaker Studio team and helps build products that cater to the needs of administrators and data scientists. She began her career as a software engineer before making the transition to product management. Outside of work, she enjoys crafting travel planners in spreadsheets, in true MBA fashion. Given the time she invests in creating these planners, it’s clear that she has a deep love for traveling, alongside a strong passion for hiking.
