AWS Machine Learning Blog, December 13, 2024
Accelerate your ML lifecycle using the new and improved Amazon SageMaker Python SDK – Part 1: ModelTrainer

Amazon SageMaker has redesigned its Python SDK to provide a unified object-oriented interface that makes it straightforward to interact with SageMaker services. The new SDK is designed with a tiered user experience in mind, where the new lower-level SDK (SageMaker Core) provides access to the full breadth of SageMaker features and configurations, allowing for greater flexibility and control for ML engineers. The higher-level abstracted layer is designed for data scientists with limited AWS expertise, offering a simplified interface that hides complex infrastructure details.

In this two-part series, we introduce the abstracted layer of the SageMaker Python SDK that allows you to train and deploy machine learning (ML) models by using the new ModelTrainer and the improved ModelBuilder classes.

In this post, we focus on the ModelTrainer class for simplifying the training experience. The ModelTrainer class provides significant improvements over the current Estimator class, which are discussed in detail in this post. We show you how to use the ModelTrainer class to train your ML models, which includes executing distributed training using a custom script or container. In Part 2, we show you how to build a model and deploy to a SageMaker endpoint using the improved ModelBuilder class.

Benefits of the ModelTrainer class

The new ModelTrainer class has been designed to address usability challenges associated with the Estimator class. Moving forward, ModelTrainer will be the preferred approach for model training, bringing significant enhancements that greatly improve the user experience. This evolution marks a step towards achieving a best-in-class developer experience for model training. The following are the key benefits:

- Reduced complexity: Configuration is consolidated into a few core parameters and intuitive configuration classes, minimizing cognitive overload so you can focus on model training rather than on configuration details.
- Seamless transition from local development to cloud training: Source code, data paths, and parameter specifications are automatically mapped to the remote execution environment, with no special handshakes or complex setup processes required.
- Flexible distributed training: You can provide the exact command to run in the container through the command parameter in SourceCode, and use torchrun, torchrun smp, or MPI strategies, decoupling the distributed training strategy from the training toolkit and framework-specific estimators.
- Simplified hyperparameter handling: Hyperparameters are passed to the training job as a single environment variable, so you can load them all at once through the SM_HPS variable.

To further explain each of these benefits, we demonstrate with examples in the following sections, and finally show you how to set up and run distributed training for the Meta Llama 3.1 8B model using the new ModelTrainer class.

Launch a training job using the ModelTrainer class

The ModelTrainer class simplifies the experience by letting you customize the training job, including providing a custom script, directly providing a command to run the training job, supporting local mode, and much more. However, you can spin up a SageMaker training job in script mode by providing minimal parameters—the SourceCode and the training image URI.

The following example illustrates how you can launch a training job with your own custom script by providing just the script and the training image URI (in this case, PyTorch), and an optional requirements file. Additional parameters such as the instance type and instance size are automatically set by the SDK to preset defaults, and parameters such as the AWS Identity and Access Management (IAM) role and SageMaker session are automatically detected from the current session and user's credentials. Admins and users can also override the defaults using the SDK defaults configuration file. For the detailed list of preset values, refer to the SDK documentation.

from sagemaker.modules.train import ModelTrainer
from sagemaker.modules.configs import SourceCode, InputData

# image URI for the training job
pytorch_image = "763104351884.dkr.ecr.us-west-2.amazonaws.com/pytorch-training:2.0.0-cpu-py310"
# you can find all available images here
# https://docs.aws.amazon.com/sagemaker/latest/dg-ecr-paths/sagemaker-algo-docker-registry-paths.html

# define the script to be run
source_code = SourceCode(
    source_dir="basic-script-mode",
    requirements="requirements.txt",
    entry_script="custom_script.py",
)

# define the ModelTrainer
model_trainer = ModelTrainer(
    training_image=pytorch_image,
    source_code=source_code,
    base_job_name="script-mode",
)

# pass the input data
input_data = InputData(
    channel_name="train",
    data_source=training_input_path,  # S3 path where training data is stored
)

# start the training job
model_trainer.train(input_data_config=[input_data], wait=False)

With purpose-built configurations, you can now reuse these objects to create multiple training jobs with different hyperparameters, for example, without having to re-define all the parameters.
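As an illustration, the following is a minimal sketch, reusing the pytorch_image, source_code, and input_data objects defined in the preceding example; the hyperparameter names and values are placeholders, not part of the original walkthrough:

# Sketch: reuse the same configuration objects across multiple training jobs,
# varying only the hyperparameters (values below are illustrative).
for learning_rate in [1e-5, 3e-5, 5e-5]:
    trainer = ModelTrainer(
        training_image=pytorch_image,
        source_code=source_code,
        base_job_name="script-mode-lr-sweep",
        hyperparameters={"learning_rate": learning_rate, "epochs": 2},
    )
    trainer.train(input_data_config=[input_data], wait=False)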

Run the job locally for experimentation

To run the preceding training job locally, you can simply set the training_mode parameter as shown in the following code:

from sagemaker.modules.train.model_trainer import Mode

...

model_trainer = ModelTrainer(
    training_image=pytorch_image,
    source_code=source_code,
    base_job_name="script-mode-local",
    training_mode=Mode.LOCAL_CONTAINER,
)

model_trainer.train()

The training job runs locally because training_mode is set to Mode.LOCAL_CONTAINER. If not explicitly set, the ModelTrainer runs a remote SageMaker training job by default. This behavior can also be enforced by changing the value to Mode.SAGEMAKER_TRAINING_JOB. For a full list of the available configs, including compute and networking, refer to the SDK documentation.
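For completeness, here is a short sketch of the equivalent explicit setting for a remote job, reusing the image, source code, and input data objects from the earlier script mode example:

# Explicitly request a remote SageMaker training job; this is also the default
# behavior when training_mode is not set.
model_trainer = ModelTrainer(
    training_image=pytorch_image,
    source_code=source_code,
    base_job_name="script-mode-remote",
    training_mode=Mode.SAGEMAKER_TRAINING_JOB,
)
model_trainer.train(input_data_config=[input_data], wait=False)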

Read hyperparameters in your custom script

The ModelTrainer supports multiple ways to read the hyperparameters that are passed to a training job. In addition to the existing support to read the hyperparameters as command line arguments in your custom script, ModelTrainer also supports reading the hyperparameters as individual environment variables, prefixed with SM_HP_<hyperparameter-key>, or as a single environment variable dictionary, SM_HPS.

Suppose the following hyperparameters are passed to the training job:

hyperparams = {
    "learning_rate": 1e-5,
    "epochs": 2,
}

model_trainer = ModelTrainer(
    ...
    hyperparameters=hyperparams,
    ...
)

You have the following options:

Option 1: Load all the hyperparameters at once as a JSON dictionary from the SM_HPS environment variable:

import json
import os

def main():
    hyperparams = json.loads(os.environ["SM_HPS"])
    learning_rate = hyperparams.get("learning_rate")
    epochs = hyperparams.get("epochs", 1)
    ...

Option 2: Read each hyperparameter as an individual environment variable prefixed with SM_HP_:

import os

def main():
    learning_rate = float(os.environ.get("SM_HP_LEARNING_RATE", 3e-5))
    epochs = int(os.environ.get("SM_HP_EPOCHS", 1))
    ...

Option 3: Read the hyperparameters as command line arguments:

import argparse

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--learning_rate", type=float, default=3e-5)
    parser.add_argument("--epochs", type=int, default=1)

    args = parser.parse_args()

    learning_rate = args.learning_rate
    epochs = args.epochs

Run distributed training jobs

SageMaker supports distributed training for deep learning tasks such as natural language processing and computer vision, so you can run secure and scalable data parallel and model parallel jobs. This is usually achieved by providing the right set of parameters when using an Estimator. For example, to use torchrun, you would define the distribution parameter in the PyTorch Estimator and set it to "torch_distributed": {"enabled": True}.
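For comparison, a rough sketch of how this looks with the legacy PyTorch Estimator follows; the role ARN, framework version, Python version, and instance settings are placeholders, not values taken from the original example:

from sagemaker.pytorch import PyTorch

# Legacy approach: distributed training is configured through the distribution
# parameter of the framework-specific estimator.
estimator = PyTorch(
    entry_point="fine_tune.py",
    source_dir="distributed-training-scripts",
    role="arn:aws:iam::111122223333:role/SageMakerExecutionRole",  # placeholder role ARN
    framework_version="2.2",
    py_version="py310",
    instance_count=1,
    instance_type="ml.g5.12xlarge",
    distribution={"torch_distributed": {"enabled": True}},
)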

The ModelTrainer class provides enhanced flexibility for users to specify custom commands directly through the command parameter in the SourceCode class, and supports torchrun, torchrun smp, and the MPI strategies. This capability is particularly useful when you need to launch a job with a custom launcher command that is not supported by the training toolkit.

In the following example, we show how to fine-tune the Meta Llama 3.1 8B model with torchrun, using the default launcher script, on a custom dataset that's preprocessed and saved in an Amazon Simple Storage Service (Amazon S3) location:

from sagemaker.modules.train import ModelTrainer
from sagemaker.modules.distributed import Torchrun
from sagemaker.modules.configs import Compute, SourceCode, InputData

# provide image URI - update the URI if you're in a different region
pytorch_image = "763104351884.dkr.ecr.us-west-2.amazonaws.com/pytorch-training:2.2.0-gpu-py310"

# Define the source code configuration for the distributed training job
source_code = SourceCode(
    source_dir="distributed-training-scripts",
    requirements="requirements.txt",
    entry_script="fine_tune.py",
)

torchrun = Torchrun()

hyperparameters = {
    ...
}

# Compute configuration for the training job
compute = Compute(
    instance_count=1,
    instance_type="ml.g5.12xlarge",
    volume_size_in_gb=96,
    keep_alive_period_in_seconds=3600,
)

# Initialize the ModelTrainer with the specified configurations
model_trainer = ModelTrainer(
    training_image=pytorch_image,
    source_code=source_code,
    compute=compute,
    distributed_runner=torchrun,
    hyperparameters=hyperparameters,
)

# pass the input data
input_data = InputData(
    channel_name="dataset",
    data_source="s3://your-bucket/your-prefix",  # S3 path where the processed data is stored
)

# Start the training job
model_trainer.train(input_data_config=[input_data], wait=False)

If you want to customize your torchrun launcher script, you can also directly provide the command using the command parameter:

# Define the source code configuration for the distributed training job
source_code = SourceCode(
    source_dir="distributed-training-scripts",
    requirements="requirements.txt",
    # Custom command for distributed training launcher script
    command="torchrun --nnodes 1 \
            --nproc_per_node 4 \
            --master_addr algo-1 \
            --master_port 7777 \
            fine_tune_llama.py",
)

# Initialize the ModelTrainer with the specified configurations
model_trainer = ModelTrainer(
    training_image=pytorch_image,
    source_code=source_code,
    compute=compute,
)

# Start the training job
model_trainer.train(...)

For more examples and end-to-end ML workflows using the SageMaker ModelTrainer, refer to the GitHub repo.

Conclusion

The newly launched SageMaker ModelTrainer class simplifies the user experience by reducing the number of parameters, introducing intuitive configurations, and supporting complex setups like bringing your own container and running distributed training. Data scientists can also seamlessly transition from local training to remote training and training on multiple nodes using the ModelTrainer.

We encourage you to try out the ModelTrainer class by referring to the SDK documentation and sample notebooks on the GitHub repo. The ModelTrainer class is available from the SageMaker SDK v2.x onwards, at no additional charge. In Part 2 of this series, we show you how to build a model and deploy to a SageMaker endpoint using the improved ModelBuilder class.


About the Authors

Durga Sury is a Senior Solutions Architect on the Amazon SageMaker team. Over the past 5 years, she has worked with multiple enterprise customers to set up a secure, scalable AI/ML platform built on SageMaker.

Shweta Singh is a Senior Product Manager in the Amazon SageMaker Machine Learning (ML) platform team at AWS, leading the SageMaker Python SDK. She has worked in several product roles at Amazon for over 5 years. She has a Bachelor of Science degree in Computer Engineering and a Master of Science in Financial Engineering, both from New York University.
