AWS Machine Learning Blog, March 20
Running NVIDIA NeMo 2.0 Framework on Amazon SageMaker HyperPod

This post explores how to integrate NVIDIA NeMo Framework 2.0 with SageMaker HyperPod for efficient training of large language models, covering the setup process and a step-by-step guide.

The NVIDIA NeMo Framework is an end-to-end solution for developing AI models, offering a broad set of capabilities.

The post covers the key steps for integrating NeMo 2.0 with SageMaker HyperPod.

Deploying the solution requires setting up SageMaker HyperPod prerequisites, including networking and storage.

The NeMo Framework is then used to launch training jobs that efficiently train a LLaMA model.

This post is co-written with Abdullahi Olaoye, Akshit Arora, and Eliuth Triana Isaza at NVIDIA.

As enterprises continue to push the boundaries of generative AI, scalable and efficient model training frameworks are essential. The NVIDIA NeMo Framework provides a robust, end-to-end solution for developing, customizing, and deploying large-scale AI models, while Amazon SageMaker HyperPod delivers the distributed infrastructure needed to handle multi-GPU, multi-node workloads seamlessly.

In this blog post, we explore how to integrate NeMo 2.0 with SageMaker HyperPod to enable efficient training of large language models (LLMs). We cover the setup process and provide a step-by-step guide to running a NeMo job on a SageMaker HyperPod cluster.

NVIDIA NeMo Framework Overview

The NVIDIA NeMo Framework is an end-to-end solution for developing cutting-edge generative AI models such as LLMs, vision language models (VLMs), video and speech models, and others.

At its core, NeMo Framework provides model builders with capabilities that span data curation, large-scale training and customization, and model deployment.

By consolidating these powerful features into a unified framework, NeMo significantly reduces the complexity and cost associated with generative AI development. NeMo Framework 2.0 is a flexible, IDE-independent, Python-based framework that integrates into each developer’s workflow and provides capabilities such as code completion, type checking, programmatic extensions, and configuration customization. The NeMo Framework includes NeMo-Run, a library designed to streamline the configuration, execution, and management of machine learning experiments across various computing environments.

The end-to-end NeMo Framework includes key features, from data curation through distributed training to model customization and deployment, that streamline and accelerate AI development.

Solution overview

In this post, we show you how to efficiently train large-scale generative AI models with NVIDIA NeMo Framework 2.0 using SageMaker HyperPod, a managed distributed training service designed for high-performance workloads. This solution integrates NeMo Framework 2.0 with the scalable infrastructure of SageMaker HyperPod, creating seamless orchestration of multi-node, multi-GPU clusters.

The key steps to deploying this solution include setting up the SageMaker HyperPod prerequisites, launching the SageMaker HyperPod cluster, connecting to the cluster, building the NeMo Framework job container, and launching a pretraining job with NeMo-Run.

Architecture diagram

The architecture, shown in the preceding diagram, is built around an Amazon SageMaker HyperPod cluster.

Prerequisites

You deploy a SageMaker HyperPod cluster before running the job, but to deploy the cluster, you first need to create some prerequisite resources.

Note that there is a cost associated with running a SageMaker HyperPod cluster; see Amazon SageMaker AI Pricing (HyperPod pricing under On-Demand pricing) for more information.

The following prerequisite steps are adapted from the Amazon SageMaker HyperPod workshop, which you can visit for additional information.

Use the following steps to deploy the prerequisite resources.

    Sign in to the AWS Management Console using the AWS account in which you want to deploy the SageMaker HyperPod cluster. As prerequisites, you will create a VPC, subnets, an Amazon FSx for Lustre volume, an Amazon Simple Storage Service (Amazon S3) bucket, and an IAM role, so make sure that your IAM role or user for console access has permissions to create these resources. Use the CloudFormation template to go to your AWS CloudFormation console and launch the solution template. Template parameters:
      Change the Availability Zone to match the AWS Region where you’re deploying the template. See Availability Zone IDs for the AZ ID for your Region. All other parameters can be left as default or changed as needed for your use case.
    Select the acknowledgement box in the Capabilities section and create the stack.
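
If you prefer to script this step, you can also launch the same stack from the AWS CLI. The following is a minimal sketch, not the documented procedure: the template URL is a placeholder for the workshop template, the Availability Zone parameter name is an assumption based on the parameters described above, and the stack name matches the one assumed later in the cleanup section of this post.

# Sketch only: replace <template-url> with the workshop CloudFormation template URL.
# The parameter name below is illustrative; use the parameter names defined in the template.
$ aws cloudformation create-stack \
    --stack-name sagemaker-hyperpod \
    --template-url <template-url> \
    --parameters ParameterKey=AvailabilityZoneId,ParameterValue=<az-id> \
    --capabilities CAPABILITY_IAM CAPABILITY_NAMED_IAM
# Wait for stack creation to finish (about 10 minutes)
$ aws cloudformation wait stack-create-complete --stack-name sagemaker-hyperpod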

It takes about 10 minutes for the CloudFormation stack creation to complete. The following figure shows the deployment timeline of the CloudFormation stack for the prerequisite infrastructure components.

Launch the training job

With the prerequisite infrastructure deployed in your AWS account, you next deploy the SageMaker HyperPod cluster that you’ll use for the model training example. You will then use the NeMo Framework to launch training jobs efficiently.

Step 1: Set up a SageMaker HyperPod cluster

After the prerequisite resources are successfully deployed, create a SageMaker HyperPod cluster.

The deployment steps are adapted from the SageMaker HyperPod workshop, which you can review for additional information.

    Install and configure the AWS Command Line Interface (AWS CLI). If you already have it installed, verify that the version is at least 2.17.1 by running the following command:
$ aws --version
    Configure the environment variables using outputs from the CloudFormation stack deployed earlier.
$ curl -O https://raw.githubusercontent.com/aws-samples/awsome-distributed-training/main/1.architectures/5.sagemaker-hyperpod/create_config.sh
# Change the region below to the region you wish to use
$ AWS_REGION=us-east-1 bash create_config.sh
$ source env_vars
# Confirm environment variables
$ cat env_vars
    Download the lifecycle scripts and upload them to the S3 bucket created in the prerequisites. SageMaker HyperPod uses lifecycle scripts to bootstrap a cluster. Examples of actions the lifecycle script manages include setting up Slurm and mounting the FSx Lustre filesystem.
$ git clone --depth=1 https://github.com/aws-samples/awsome-distributed-training/
$ cd awsome-distributed-training && git checkout e67fc352de83e13ad6a34b4a33da11c8a71b4d9c
# upload script
$ aws s3 cp --recursive 1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config/ s3://${BUCKET}/src
    Create a cluster config file for setting up the cluster. The following is an example of creating a cluster config from a template. The example cluster config is for g5.48xlarge compute nodes accelerated by 8 x NVIDIA A10G GPUs. See Create Cluster for cluster config examples of additional Amazon Elastic Compute Cloud (Amazon EC2) instance types. A cluster config file contains the following information:
      Cluster name
      It defines three instance groups:
        Login-group: Acts as the entry point for users and administrators. Typically used for managing jobs, monitoring, and debugging.
        Controller-machine: The head node for the HyperPod Slurm cluster. It manages the overall orchestration of the distributed training process and handles job scheduling and communication within nodes.
        Worker-group: The group of nodes that executes the actual model training workload.
      VPC configuration
$ cd 3.test_cases/22.nemo-run/slurm
$ curl -O https://awsome-distributed-training.s3.amazonaws.com/blog-assets/nemo2.0-hyperpod/cluster-config-template.json
$ cp cluster-config-template.json cluster-config.json
# Replace the placeholders in the cluster config
$ source env_vars
$ sed -i "s/\$BUCKET/${BUCKET}/g" cluster-config.json
$ sed -i "s/\$ROLE/${ROLE}/g" cluster-config.json
$ sed -i "s/\$SECURITY_GROUP/${SECURITY_GROUP}/g" cluster-config.json
$ sed -i "s/\$SUBNET_ID/${SUBNET_ID}/g" cluster-config.json
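
After the placeholders are replaced, cluster-config.json follows the input shape of the SageMaker CreateCluster API. The following trimmed sketch illustrates that general structure for the three instance groups described above; the instance types, instance counts, lifecycle script name, and resource IDs are illustrative placeholders, and the template downloaded in the previous step is the source of truth.

{
  "ClusterName": "ml-cluster",
  "InstanceGroups": [
    {
      "InstanceGroupName": "controller-machine",
      "InstanceType": "ml.m5.12xlarge",
      "InstanceCount": 1,
      "LifeCycleConfig": { "SourceS3Uri": "s3://<bucket>/src", "OnCreate": "on_create.sh" },
      "ExecutionRole": "<role-arn>",
      "ThreadsPerCore": 1
    },
    {
      "InstanceGroupName": "login-group",
      "InstanceType": "ml.m5.4xlarge",
      "InstanceCount": 1,
      "LifeCycleConfig": { "SourceS3Uri": "s3://<bucket>/src", "OnCreate": "on_create.sh" },
      "ExecutionRole": "<role-arn>",
      "ThreadsPerCore": 1
    },
    {
      "InstanceGroupName": "worker-group-1",
      "InstanceType": "ml.g5.48xlarge",
      "InstanceCount": 2,
      "LifeCycleConfig": { "SourceS3Uri": "s3://<bucket>/src", "OnCreate": "on_create.sh" },
      "ExecutionRole": "<role-arn>",
      "ThreadsPerCore": 1
    }
  ],
  "VpcConfig": {
    "SecurityGroupIds": ["<security-group-id>"],
    "Subnets": ["<subnet-id>"]
  }
}

Each instance group points at the lifecycle scripts uploaded to the S3 bucket in the prerequisites, which is how SageMaker HyperPod bootstraps Slurm and mounts the FSx for Lustre file system on every node.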
    Create a config file based on the following example with the cluster provisioning parameters and upload it to the S3 bucket.
$ instance_type=$(jq '.InstanceGroups[] | select(.InstanceGroupName == "worker-group-1").InstanceType' cluster-config.json)
$ cat > provisioning_parameters.json << EOL
{
  "version": "1.0.0",
  "workload_manager": "slurm",
  "controller_group": "controller-machine",
  "login_group": "login-group",
  "worker_groups": [
    {
      "instance_group_name": "worker-group-1",
      "partition_name": ${instance_type}
    }
  ],
  "fsx_dns_name": "${FSX_ID}.fsx.${AWS_REGION}.amazonaws.com",
  "fsx_mountname": "${FSX_MOUNTNAME}"
}
EOL
# copy to the S3 bucket
$ aws s3 cp provisioning_parameters.json s3://${BUCKET}/src/
    Create the SageMaker HyperPod cluster
$ aws sagemaker create-cluster \
    --cli-input-json file://cluster-config.json --region $AWS_REGION
    Use the following code or the console to check the status of the cluster. The status should be Creating. Wait for the cluster status to be InService before proceeding.
$ aws sagemaker list-clusters --output table

The following screenshot shows the results of the --output table command, showing the cluster status as Creating.

The following screenshot shows the Cluster Management page and status of the cluster in the Amazon SageMaker AI console.

The following screenshot shows the results of the --output table command, showing the cluster status as InService.
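
You can also query a single cluster directly instead of listing all clusters. The following is a small sketch that assumes the cluster name ml-cluster used throughout this post.

# Query the status of one cluster; repeat until it reports InService
$ aws sagemaker describe-cluster \
    --cluster-name ml-cluster \
    --query 'ClusterStatus' \
    --output text \
    --region $AWS_REGION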

Step 2: SSH into the cluster

After the cluster is ready (that is, has a status of InService), you can connect to it using AWS Systems Manager Session Manager and an SSH helper script. See SSH into Cluster for more information.

    Install the AWS Systems Manager (SSM) Session Manager plugin. Create a local key pair that the helper script can add to the cluster for easier SSH access, and then run the following SSH helper script.
$ ssh-keygen -t rsa -q -f "$HOME/.ssh/id_rsa" -N ""
$ curl -O https://raw.githubusercontent.com/aws-samples/awsome-distributed-training/main/1.architectures/5.sagemaker-hyperpod/easy-ssh.sh
$ chmod +x easy-ssh.sh
$ ./easy-ssh.sh -c controller-machine ml-cluster

Step 3: Interact with the cluster and clone the repository

After connecting to the cluster, you can validate that the cluster is properly configured by running several commands. See Get to know your Cluster for more information.

    View the existing partition and nodes per partition
$ sinfo
    List the jobs that are in the queue or running.
$ squeue
    SSH to the compute nodes.
# First ssh into the cluster head node as ubuntu user
$ ssh ml-cluster
# SSH into one of the compute nodes
$ salloc -N 1
$ ssh $(srun hostname)
# Exit to the head node
$ exit
# Exit again to cancel the srun job above
$ exit
    Clone the code sample GitHub repository onto the cluster controller node (head node).
$ cd /fsx/ubuntu
$ git clone https://github.com/aws-samples/awsome-distributed-training/
$ cd awsome-distributed-training && git checkout e67fc352de83e13ad6a34b4a33da11c8a71b4d9c
$ cd 3.test_cases/22.nemo-run/slurm

Now, you’re ready to run your NeMo Framework Jobs on the SageMaker HyperPod cluster.

Step 4: Build the job container

The next step is to build the job container. By using a container, you can create a consistent, portable, and reproducible environment, helping to ensure that all dependencies, configurations, and optimizations remain intact. This is particularly important for high-performance computing (HPC) and AI workloads, where variations in the software stack can impact performance and compatibility.

To have a fully functioning and optimized environment, you need to add AWS-specific networking dependencies (EFA, the OFI plugin, an updated NCCL, and NCCL tests) to the NeMo Framework container from the NVIDIA GPU Cloud (NGC) Catalog. After building the Docker image, you will use Enroot to create a squash file from it. A squash file is a compressed, read-only file system that encapsulates the container image in a lightweight format. It helps reduce storage space, speeds up loading times, and improves efficiency when deploying the container across multiple nodes in a cluster. By converting the Docker image into a squash file, you can achieve a more optimized and performant execution environment, especially in distributed training scenarios.

Make sure that you have a registered account with NVIDIA and can access NGC. Retrieve the NGC API key following the instructions from NVIDIA. Use the following command to configure NGC. When prompted, use $oauthtoken for the login username and the API key from NGC for the password.

$ docker login nvcr.io

You can use the following commands to build the Docker image from the Dockerfile and create a SquashFS file from it.

$ docker build --progress=plain -t nemo_hyperpod:24.12 -f Dockerfile .
$ sudo enroot import -o /fsx/ubuntu/nemo-hyperpod-24-12-02102025.sqsh dockerd://nemo_hyperpod:24.12

Step 5: Set up NeMo-Run and other dependencies on the head node

Before continuing:

    NeMo-Run requires Python 3.10; verify that this is installed on the head node before proceeding. You can use the following steps to set up the NeMo-Run dependencies using a virtual environment. The steps create and activate a virtual environment and then execute the venv.sh script to install the dependencies, which include the NeMo toolkit, NeMo-Run, PyTorch, Megatron-LM, and others.
$ python3.10 -m venv temp-env
$ source temp-env/bin/activate
$ bash venv.sh
    To prepare for the pre-training of the LLaMA model in an offline mode and to help ensure consistent tokenization, use the widely adopted GPT-2 vocabulary and merges files. This approach helps avoid potential issues related to downloading tokenizer files during training:
$ mkdir -p /fsx/ubuntu/temp/megatron
$ wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-vocab.json -O /fsx/ubuntu/temp/megatron/megatron-gpt-345m_vocab
$ wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-merges.txt -O /fsx/ubuntu/temp/megatron/megatron-gpt-345m_merges

Step 6: Launch the pretraining job using NeMo-Run

Run the training script to start the LLaMA pretraining job. The training script run.py defines the configuration for a LLaMA 180M parameter model, defines a Slurm executor, defines the experiment, and launches the experiment.

The following function defines the model configuration.

def small_llama_cfg() -> llm.GPTConfig:
    return run.Config(
        llm.Llama3Config8B,
        rotary_base=500_000,
        seq_length=1024,
        num_layers=12,
        hidden_size=768,
        ffn_hidden_size=2688,
        num_attention_heads=16,
        init_method_std=0.023,
    )

The following function defines the Slurm executor.

def slurm_executor(
    account: str,
    partition: str,
    nodes: int,
    user: str = "local",
    host: str = "local",
    remote_job_dir: str = "/fsx/ubuntu/nemo2-sm-hyperpod/tmp/",
    time: str = "01:00:00",
    custom_mounts: Optional[list[str]] = None,
    custom_env_vars: Optional[dict[str, str]] = None,
    container_image: str = "/fsx/ubuntu/nemo-hyperpod-24-12-02102025.sqsh",
    retries: int = 0,
) -> run.SlurmExecutor:
    ...

The following function runs the experiment.

with run.Experiment(exp_name, log_level="INFO") as exp:
    exp.add(pretrain_recipe, executor=executor, tail_logs=True, name="training")
    # Run the experiment
    exp.run(detach=True)

Use the following command to run the training job.

$ python run.py --nodes 2 --max_steps 1000

The --nodes argument specifies the number of nodes to use during the pretraining job, while the --max_steps argument specifies the maximum number of training iterations. This is useful for controlling the duration of training.

The following figure shows the logs of a running training job.

You can download the training logs from the cluster to your local machine and use machine learning visualization tools like TensorBoard to visualize your experimentation. See Install TensorFlow 2 for information about installing TensorBoard. The following is an example of downloading logs from the cluster and visualizing the logs on TensorBoard.

    After installing TensorBoard, download the log files from the cluster to your workstation where TensorBoard is installed.
$ rsync -aP ml-cluster:/path/to/logs/checkpoints/tb_logs/events.out.tfevents.1741213162.ip-10-1-7-21.55692.0 .
    After the logs are downloaded, you can launch TensorBoard with the log files in the current directory.
$ tensorboard --logdir .

The following is a TensorBoard screenshot for a training job, showing the reduced_train_loss metric, which decreases over the training steps.

Troubleshooting

    If some of the nodes appear as down or down* in the sinfo output, the Slurm daemon (slurmd) on those nodes might have stopped responding.

Solution: Log in to the affected nodes and run sudo systemctl restart slurmd. The nodes should then return to an idle state.
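
The following is a short sketch of that recovery workflow from the head node; the worker node hostname is a placeholder for a node reported as down in your cluster.

# Identify nodes reported as down or down*
$ sinfo
# Connect to an affected worker node (placeholder hostname) and restart the Slurm daemon
$ ssh <worker-node-hostname>
$ sudo systemctl restart slurmd
$ exit
# Back on the head node, confirm the node has returned to the idle state
$ sinfo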

Clean up

Use the following steps to clean up the infrastructure created for this post and avoid incurring ongoing costs. You can also find cleanup instructions in Cleanup.

    Delete the SageMaker HyperPod cluster.
    $ aws sagemaker delete-cluster --cluster-name ml-cluster
    Delete the CloudFormation stack created in the prerequisites.
    $ aws cloudformation delete-stack --stack-name sagemaker-hyperpod
    $ aws cloudformation wait stack-delete-complete --stack-name sagemaker-hyperpod

Conclusion

Using the NVIDIA NeMo 2.0 framework on SageMaker HyperPod offers a scalable, cost-efficient, and streamlined approach to training large-scale generative AI models. By following the step-by-step deployment process, you can use the power of distributed computing with minimal setup complexity.


About the authors

Abdullahi Olaoye is a Senior AI Solutions Architect at NVIDIA, specializing in integrating NVIDIA AI libraries, frameworks, and products with cloud AI services and open-source tools to optimize AI model deployment, inference, and generative AI workflows. He collaborates with AWS to enhance AI workload performance and drive adoption of NVIDIA-powered AI and generative AI solutions.

Greeshma Nallapareddy is a Sr. Business Development Manager at AWS working with NVIDIA on go-to-market strategy to accelerate AI solutions for customers at scale. Her experience includes leading solutions architecture teams focused on working with startups.

Akshit Arora is a senior data scientist at NVIDIA, where he works on deploying conversational AI models on GPUs at scale. He’s a graduate of the University of Colorado at Boulder, where he applied deep learning to improve knowledge tracking on a K-12 online tutoring service. His work spans multilingual text-to-speech, time series classification, ed-tech, and practical applications of deep learning.

Ankur Srivastava is a Sr. Solutions Architect on the ML Frameworks Team. He focuses on helping customers with self-managed distributed training and inference at scale on AWS. His experience includes industrial predictive maintenance, digital twins, and probabilistic design optimization. He completed his doctoral studies in mechanical engineering at Rice University and postdoctoral research at the Massachusetts Institute of Technology.

Eliuth Triana Isaza is a Developer Relations Manager at NVIDIA, empowering Amazon AI MLOps and DevOps teams, scientists, and AWS technical experts to master the NVIDIA computing stack for accelerating and optimizing generative AI foundation models, spanning data curation, GPU training, model inference, and production deployment on AWS GPU instances. In addition, Eliuth is a passionate mountain biker, skier, and tennis and poker player.
