AWS Machine Learning Blog, March 20
Running NVIDIA NeMo 2.0 Framework on Amazon SageMaker HyperPod

This post explores how to integrate NVIDIA NeMo Framework 2.0 with SageMaker HyperPod for efficient training of large language models, covering the setup process and a step-by-step guide.

The NVIDIA NeMo Framework is an end-to-end solution for developing AI models, offering a broad set of capabilities.

The post covers the key steps for integrating NeMo 2.0 with SageMaker HyperPod.

Deploying the solution requires setting up SageMaker HyperPod prerequisites, including networking and storage.

The NeMo Framework is then used to launch training jobs that efficiently train a LLaMA model.

This post is co-written with Abdullahi Olaoye, Akshit Arora, and Eliuth Triana Isaza at NVIDIA.

As enterprises continue to push the boundaries of generative AI, scalable and efficient model training frameworks are essential. The NVIDIA NeMo Framework provides a robust, end-to-end solution for developing, customizing, and deploying large-scale AI models, while Amazon SageMaker HyperPod delivers the distributed infrastructure needed to handle multi-GPU, multi-node workloads seamlessly.

In this blog post, we explore how to integrate NeMo 2.0 with SageMaker HyperPod to enable efficient training of large language models (LLMs). We cover the setup process and provide a step-by-step guide to running a NeMo job on a SageMaker HyperPod cluster.

NVIDIA NeMo Framework Overview

The NVIDIA NeMo Framework is an end-to-end solution for developing cutting-edge generative AI models such as LLMs, vision language models (VLMs), video and speech models, and others.

At its core, NeMo Framework provides model builders with capabilities that span data curation, large-scale training and customization, and model deployment.

By consolidating these powerful features into a unified framework, NeMo significantly reduces the complexity and cost associated with generative AI development. NeMo Framework 2.0 is a flexible, IDE-independent, Python-based framework that integrates into each developer’s workflow and provides capabilities such as code completion, type checking, programmatic extensions, and configuration customization. The NeMo Framework includes NeMo-Run, a library designed to streamline the configuration, execution, and management of machine learning experiments across various computing environments.

The end-to-end NeMo Framework includes key features, from data curation through distributed training to model customization and deployment, that streamline and accelerate AI development.

Solution overview

In this post, we show you how to efficiently train large-scale generative AI models with NVIDIA NeMo Framework 2.0 using SageMaker HyperPod, a managed distributed training service designed for high-performance workloads. This solution integrates NeMo Framework 2.0 with the scalable infrastructure of SageMaker HyperPod, creating seamless orchestration of multi-node, multi-GPU clusters.

The key steps to deploying this solution include setting up the SageMaker HyperPod prerequisites, launching the SageMaker HyperPod cluster, connecting to the cluster, building the NeMo Framework job container, and launching a pretraining job with NeMo-Run.

Architecture diagram

The architecture, shown in the preceding diagram, is built around an Amazon SageMaker HyperPod cluster.

Prerequisites

You deploy a SageMaker HyperPod cluster before running the job, but to deploy the cluster, you first need to create some prerequisite resources.

Note that there is a cost associated with running a SageMaker HyperPod cluster; see Amazon SageMaker AI Pricing (HyperPod pricing under On-Demand pricing) for more information.

The following prerequisite steps are adapted from the Amazon SageMaker HyperPod workshop, which you can visit for additional information.

Use the following steps to deploy the prerequisite resources.

    Sign in to the AWS Management Console using the AWS account in which you want to deploy the SageMaker HyperPod cluster. As prerequisites, you will create a VPC, subnets, an Amazon FSx for Lustre volume, an Amazon Simple Storage Service (Amazon S3) bucket, and an IAM role, so make sure that your IAM role or user for console access has permissions to create these resources. Use the CloudFormation template to go to your AWS CloudFormation console and launch the solution template. Template parameters:
      Change the Availability Zone to match the AWS Region where you’re deploying the template. See Availability Zone IDs for the AZ ID for your Region. All other parameters can be left as default or changed as needed for your use case.
    Select the acknowledgement box in the Capabilities section and create the stack.
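
If you prefer to script this step, you can also launch the same stack from the AWS CLI. The following is a minimal sketch, not the documented procedure: the template URL is a placeholder for the workshop template, the Availability Zone parameter name is an assumption based on the parameters described above, and the stack name matches the one assumed later in the cleanup section of this post.

# Sketch only: replace <template-url> with the workshop CloudFormation template URL.
# The parameter name below is illustrative; use the parameter names defined in the template.
$ aws cloudformation create-stack \
    --stack-name sagemaker-hyperpod \
    --template-url <template-url> \
    --parameters ParameterKey=AvailabilityZoneId,ParameterValue=<az-id> \
    --capabilities CAPABILITY_IAM CAPABILITY_NAMED_IAM
# Wait for stack creation to finish (about 10 minutes)
$ aws cloudformation wait stack-create-complete --stack-name sagemaker-hyperpod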

It takes about 10 minutes for the CloudFormation stack creation to complete. The following figure shows the deployment timeline of the CloudFormation stack for the prerequisite infrastructure components.

Launch the training job

With the prerequisite infrastructure deployed in your AWS account, you next deploy the SageMaker HyperPod cluster that you’ll use for the model training example. You will then use the NeMo Framework to launch training jobs efficiently.

Step 1: Set up a SageMaker HyperPod cluster

After the prerequisite resources are successfully deployed, create a SageMaker HyperPod cluster.

The deployment steps are adapted from the SageMaker HyperPod workshop, which you can review for additional information.

    Install and configure the AWS Command Line Interface (AWS CLI). If you already have it installed, verify that the version is at least 2.17.1 by running the following command:
$ aws --version
    Configure the environment variables using outputs from the CloudFormation stack deployed earlier.
$ curl -O https://raw.githubusercontent.com/aws-samples/awsome-distributed-training/main/1.architectures/5.sagemaker-hyperpod/create_config.sh
# Change the region below to the region you wish to use
$ AWS_REGION=us-east-1 bash create_config.sh
$ source env_vars
# Confirm environment variables
$ cat env_vars
    Download the lifecycle scripts and upload them to the S3 bucket created in the prerequisites. SageMaker HyperPod uses lifecycle scripts to bootstrap a cluster. Examples of actions the lifecycle script manages include setting up Slurm and mounting the FSx Lustre filesystem.
$ git clone --depth=1 https://github.com/aws-samples/awsome-distributed-training/
$ cd awsome-distributed-training && git checkout e67fc352de83e13ad6a34b4a33da11c8a71b4d9c
# upload script
$ aws s3 cp --recursive 1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config/ s3://${BUCKET}/src
    Create a cluster config file for setting up the cluster. The following is an example of creating a cluster config from a template. The example cluster config is for g5.48xlarge compute nodes accelerated by 8 x NVIDIA A10G GPUs. See Create Cluster for cluster config examples of additional Amazon Elastic Compute Cloud (Amazon EC2) instance types. A cluster config file contains the following information:
      Cluster name
      It defines three instance groups:
        Login-group: Acts as the entry point for users and administrators. Typically used for managing jobs, monitoring, and debugging.
        Controller-machine: The head node for the HyperPod Slurm cluster. It manages the overall orchestration of the distributed training process and handles job scheduling and communication within nodes.
        Worker-group: The group of nodes that executes the actual model training workload.
      VPC configuration
$ cd 3.test_cases/22.nemo-run/slurm
$ curl -O https://awsome-distributed-training.s3.amazonaws.com/blog-assets/nemo2.0-hyperpod/cluster-config-template.json
$ cp cluster-config-template.json cluster-config.json
# Replace the placeholders in the cluster config
$ source env_vars
$ sed -i "s/\$BUCKET/${BUCKET}/g" cluster-config.json
$ sed -i "s/\$ROLE/${ROLE}/g" cluster-config.json
$ sed -i "s/\$SECURITY_GROUP/${SECURITY_GROUP}/g" cluster-config.json
$ sed -i "s/\$SUBNET_ID/${SUBNET_ID}/g" cluster-config.json
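
After the placeholders are replaced, cluster-config.json follows the input shape of the SageMaker CreateCluster API. The following trimmed sketch illustrates that general structure for the three instance groups described above; the instance types, instance counts, lifecycle script name, and resource IDs are illustrative placeholders, and the template downloaded in the previous step is the source of truth.

{
  "ClusterName": "ml-cluster",
  "InstanceGroups": [
    {
      "InstanceGroupName": "controller-machine",
      "InstanceType": "ml.m5.12xlarge",
      "InstanceCount": 1,
      "LifeCycleConfig": { "SourceS3Uri": "s3://<bucket>/src", "OnCreate": "on_create.sh" },
      "ExecutionRole": "<role-arn>",
      "ThreadsPerCore": 1
    },
    {
      "InstanceGroupName": "login-group",
      "InstanceType": "ml.m5.4xlarge",
      "InstanceCount": 1,
      "LifeCycleConfig": { "SourceS3Uri": "s3://<bucket>/src", "OnCreate": "on_create.sh" },
      "ExecutionRole": "<role-arn>",
      "ThreadsPerCore": 1
    },
    {
      "InstanceGroupName": "worker-group-1",
      "InstanceType": "ml.g5.48xlarge",
      "InstanceCount": 2,
      "LifeCycleConfig": { "SourceS3Uri": "s3://<bucket>/src", "OnCreate": "on_create.sh" },
      "ExecutionRole": "<role-arn>",
      "ThreadsPerCore": 1
    }
  ],
  "VpcConfig": {
    "SecurityGroupIds": ["<security-group-id>"],
    "Subnets": ["<subnet-id>"]
  }
}

Each instance group points at the lifecycle scripts uploaded to the S3 bucket in the prerequisites, which is how SageMaker HyperPod bootstraps Slurm and mounts the FSx for Lustre file system on every node.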
    Create a config file based on the following example with the cluster provisioning parameters and upload it to the S3 bucket.
$ instance_type=$(jq '.InstanceGroups[] | select(.InstanceGroupName == "worker-group-1").InstanceType' cluster-config.json)
$ cat > provisioning_parameters.json << EOL
{
  "version": "1.0.0",
  "workload_manager": "slurm",
  "controller_group": "controller-machine",
  "login_group": "login-group",
  "worker_groups": [
    {
      "instance_group_name": "worker-group-1",
      "partition_name": ${instance_type}
    }
  ],
  "fsx_dns_name": "${FSX_ID}.fsx.${AWS_REGION}.amazonaws.com",
  "fsx_mountname": "${FSX_MOUNTNAME}"
}
EOL
# copy to the S3 bucket
$ aws s3 cp provisioning_parameters.json s3://${BUCKET}/src/
    Create the SageMaker HyperPod cluster
$ aws sagemaker create-cluster \
    --cli-input-json file://cluster-config.json --region $AWS_REGION
    Use the following code or the console to check the status of the cluster. The status should be Creating. Wait for the cluster status to be InService before proceeding.
$ aws sagemaker list-clusters --output table

The following screenshot shows the results of the --output table command, showing the cluster status as Creating.

The following screenshot shows the Cluster Management page and status of the cluster in the Amazon SageMaker AI console.

The following screenshot shows the results of the --output table command, showing the cluster status as InService.
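
You can also query a single cluster directly instead of listing all clusters. The following is a small sketch that assumes the cluster name ml-cluster used throughout this post.

# Query the status of one cluster; repeat until it reports InService
$ aws sagemaker describe-cluster \
    --cluster-name ml-cluster \
    --query 'ClusterStatus' \
    --output text \
    --region $AWS_REGION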

Step 2: SSH into the cluster

After the cluster is ready (that is, has a status of InService), you can connect to it using AWS Systems Manager Session Manager and an SSH helper script. See SSH into Cluster for more information.

    Install the AWS Systems Manager (SSM) Session Manager plugin. Create a local key pair that the helper script can add to the cluster for easier SSH access, and then run the following SSH helper script.
$ ssh-keygen -t rsa -q -f "$HOME/.ssh/id_rsa" -N ""
$ curl -O https://raw.githubusercontent.com/aws-samples/awsome-distributed-training/main/1.architectures/5.sagemaker-hyperpod/easy-ssh.sh
$ chmod +x easy-ssh.sh
$ ./easy-ssh.sh -c controller-machine ml-cluster

Step 3: Interact with the cluster and clone the repository

After connecting to the cluster, you can validate that the cluster is properly configured by running several commands. See Get to know your Cluster for more information.

    View the existing partition and nodes per partition
$ sinfo
    List the jobs that are in the queue or running.
$ squeue
    SSH to the compute nodes.
# First ssh into the cluster head node as ubuntu user
$ ssh ml-cluster
# SSH into one of the compute nodes
$ salloc -N 1
$ ssh $(srun hostname)
# Exit to the head node
$ exit
# Exit again to cancel the srun job above
$ exit
    Clone the code sample GitHub repository onto the cluster controller node (head node).
$ cd /fsx/ubuntu
$ git clone https://github.com/aws-samples/awsome-distributed-training/
$ cd awsome-distributed-training && git checkout e67fc352de83e13ad6a34b4a33da11c8a71b4d9c
$ cd 3.test_cases/22.nemo-run/slurm

Now, you’re ready to run your NeMo Framework Jobs on the SageMaker HyperPod cluster.

Step 4: Build the job container

The next step is to build the job container. By using a container, you can create a consistent, portable, and reproducible environment, helping to ensure that all dependencies, configurations, and optimizations remain intact. This is particularly important for high-performance computing (HPC) and AI workloads, where variations in the software stack can impact performance and compatibility.

To have a fully functioning and optimized environment, you need to add AWS-specific networking dependencies (EFA, the OFI plugin, an updated NCCL, and NCCL tests) to the NeMo Framework container from the NVIDIA GPU Cloud (NGC) Catalog. After building the Docker image, you will use Enroot to create a squash file from it. A squash file is a compressed, read-only file system that encapsulates the container image in a lightweight format. It helps reduce storage space, speeds up loading times, and improves efficiency when deploying the container across multiple nodes in a cluster. By converting the Docker image into a squash file, you can achieve a more optimized and performant execution environment, especially in distributed training scenarios.

Make sure that you have a registered account with NVIDIA and can access NGC. Retrieve the NGC API key following the instructions from NVIDIA. Use the following command to configure NGC. When prompted, use $oauthtoken for the login username and the API key from NGC for the password.

$ docker login nvcr.io

You can use the following commands to build the Docker image from the Dockerfile and create a SquashFS file from it.

$ docker build --progress=plain -t nemo_hyperpod:24.12 -f Dockerfile .
$ sudo enroot import -o /fsx/ubuntu/nemo-hyperpod-24-12-02102025.sqsh dockerd://nemo_hyperpod:24.12

Step 5: Set up NeMo-Run and other dependencies on the head node

Before continuing:

    NeMo-Run requires Python 3.10; verify that this is installed on the head node before proceeding. You can use the following steps to set up the NeMo-Run dependencies using a virtual environment. The steps create and activate a virtual environment and then execute the venv.sh script to install the dependencies, which include the NeMo toolkit, NeMo-Run, PyTorch, Megatron-LM, and others.
$ python3.10 -m venv temp-env
$ source temp-env/bin/activate
$ bash venv.sh
    To prepare for the pre-training of the LLaMA model in an offline mode and to help ensure consistent tokenization, use the widely adopted GPT-2 vocabulary and merges files. This approach helps avoid potential issues related to downloading tokenizer files during training:
$ mkdir -p /fsx/ubuntu/temp/megatron
$ wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-vocab.json -O /fsx/ubuntu/temp/megatron/megatron-gpt-345m_vocab
$ wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-merges.txt -O /fsx/ubuntu/temp/megatron/megatron-gpt-345m_merges

Step 6: Launch the pretraining job using NeMo-Run

Run the training script to start the LLaMA pretraining job. The training script run.py defines the configuration for a LLaMA 180M parameter model, defines a Slurm executor, defines the experiment, and launches the experiment.

The following function defines the model configuration.

def small_llama_cfg() -> llm.GPTConfig:
    return run.Config(
        llm.Llama3Config8B,
        rotary_base=500_000,
        seq_length=1024,
        num_layers=12,
        hidden_size=768,
        ffn_hidden_size=2688,
        num_attention_heads=16,
        init_method_std=0.023,
    )

The following function defines the Slurm executor.

def slurm_executor(
    account: str,
    partition: str,
    nodes: int,
    user: str = "local",
    host: str = "local",
    remote_job_dir: str = "/fsx/ubuntu/nemo2-sm-hyperpod/tmp/",
    time: str = "01:00:00",
    custom_mounts: Optional[list[str]] = None,
    custom_env_vars: Optional[dict[str, str]] = None,
    container_image: str = "/fsx/ubuntu/nemo-hyperpod-24-12-02102025.sqsh",
    retries: int = 0,
) -> run.SlurmExecutor:
    ...

The following function runs the experiment.

with run.Experiment(exp_name, log_level="INFO") as exp:
    exp.add(pretrain_recipe, executor=executor, tail_logs=True, name="training")
    # Run the experiment
    exp.run(detach=True)

Use the following command to run the training job.

$ python run.py --nodes 2 --max_steps 1000

The --nodes argument specifies the number of nodes to use during the pretraining job, while the --max_steps argument specifies the maximum number of training iterations. This is useful for controlling the duration of training.

The following figure shows the logs of a running training job.

You can download the training logs from the cluster to your local machine and use machine learning visualization tools like TensorBoard to visualize your experimentation. See Install TensorFlow 2 for information about installing TensorBoard. The following is an example of downloading logs from the cluster and visualizing the logs on TensorBoard.

    After installing TensorBoard, download the log files from the cluster to your workstation where TensorBoard is installed.
$ rsync -aP ml-cluster:/path/to/logs/checkpoints/tb_logs/events.out.tfevents.1741213162.ip-10-1-7-21.55692.0 .
    After the logs are downloaded, you can launch TensorBoard with the log files in the current directory.
$ tensorboard --logdir .

The following is a TensorBoard screenshot for a training job, showing the reduced_train_loss metric, which decreases over the training steps.

Troubleshooting

    If some of the nodes appear as down or down* in the sinfo output, the Slurm daemon (slurmd) on those nodes might have stopped responding.

Solution: Log in to the affected nodes and run sudo systemctl restart slurmd. The nodes should then return to an idle state.
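
The following is a short sketch of that recovery workflow from the head node; the worker node hostname is a placeholder for a node reported as down in your cluster.

# Identify nodes reported as down or down*
$ sinfo
# Connect to an affected worker node (placeholder hostname) and restart the Slurm daemon
$ ssh <worker-node-hostname>
$ sudo systemctl restart slurmd
$ exit
# Back on the head node, confirm the node has returned to the idle state
$ sinfo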

Clean up

Use the following steps to clean up the infrastructure created for this post and avoid incurring ongoing costs. You can also find cleanup instructions in Cleanup.

    Delete the SageMaker HyperPod cluster.
    $ aws sagemaker delete-cluster --cluster-name ml-cluster
    Delete the CloudFormation stack created in the prerequisites.
    $ aws cloudformation delete-stack --stack-name sagemaker-hyperpod
    $ aws cloudformation wait stack-delete-complete --stack-name sagemaker-hyperpod

Conclusion

Using the NVIDIA NeMo 2.0 framework on SageMaker HyperPod offers a scalable, cost-efficient, and streamlined approach to training large-scale generative AI models. By following the step-by-step deployment process, you can use the power of distributed computing with minimal setup complexity.


About the authors

Abdullahi Olaoye is a Senior AI Solutions Architect at NVIDIA, specializing in integrating NVIDIA AI libraries, frameworks, and products with cloud AI services and open-source tools to optimize AI model deployment, inference, and generative AI workflows. He collaborates with AWS to enhance AI workload performance and drive adoption of NVIDIA-powered AI and generative AI solutions.

Greeshma Nallapareddy is a Sr. Business Development Manager at AWS working with NVIDIA on go-to-market strategy to accelerate AI solutions for customers at scale. Her experience includes leading solutions architecture teams focused on working with startups.

Akshit Arora is a senior data scientist at NVIDIA, where he works on deploying conversational AI models on GPUs at scale. He’s a graduate of the University of Colorado at Boulder, where he applied deep learning to improve knowledge tracking on a K-12 online tutoring service. His work spans multilingual text-to-speech, time series classification, ed-tech, and practical applications of deep learning.

Ankur Srivastava is a Sr. Solutions Architect on the ML Frameworks Team. He focuses on helping customers with self-managed distributed training and inference at scale on AWS. His experience includes industrial predictive maintenance, digital twins, and probabilistic design optimization. He completed his doctoral studies in mechanical engineering at Rice University and postdoctoral research at the Massachusetts Institute of Technology.

Eliuth Triana Isaza is a Developer Relations Manager at NVIDIA, empowering Amazon AI MLOps and DevOps teams, scientists, and AWS technical experts to master the NVIDIA computing stack for accelerating and optimizing generative AI foundation models, spanning data curation, GPU training, model inference, and production deployment on AWS GPU instances. In addition, Eliuth is a passionate mountain biker, skier, and tennis and poker player.
