AWS Machine Learning Blog, December 13, 2024
Accelerate your ML lifecycle using the new and improved Amazon SageMaker Python SDK – Part 1: ModelTrainer

Amazon SageMaker has redesigned its Python SDK to provide a unified object-oriented interface that makes it straightforward to interact with SageMaker services. The new SDK is designed with a tiered user experience in mind, where the new lower-level SDK (SageMaker Core) provides access to the full breadth of SageMaker features and configurations, allowing for greater flexibility and control for ML engineers. The higher-level abstracted layer is designed for data scientists with limited AWS expertise, offering a simplified interface that hides complex infrastructure details.

In this two-part series, we introduce the abstracted layer of the SageMaker Python SDK that allows you to train and deploy machine learning (ML) models by using the new ModelTrainer and the improved ModelBuilder classes.

In this post, we focus on the ModelTrainer class for simplifying the training experience. The ModelTrainer class provides significant improvements over the current Estimator class, which are discussed in detail in this post. We show you how to use the ModelTrainer class to train your ML models, which includes executing distributed training using a custom script or container. In Part 2, we show you how to build a model and deploy to a SageMaker endpoint using the improved ModelBuilder class.

Benefits of the ModelTrainer class

The new ModelTrainer class has been designed to address usability challenges associated with the Estimator class. Moving forward, ModelTrainer will be the preferred approach for model training, bringing significant enhancements that greatly improve the user experience. This evolution marks a step towards achieving a best-in-class developer experience for model training. The following are the key benefits:

- Reduced complexity: Configuration is consolidated into a few core parameters and intuitive configuration classes, minimizing cognitive overload so you can focus on model training rather than on configuration details.
- Seamless transition from local development to cloud training: Source code, data paths, and parameter specifications are automatically mapped to the remote execution environment, with no special handshakes or complex setup processes required.
- Flexible distributed training: You can provide the exact command to run in the container through the command parameter in SourceCode, and use torchrun, torchrun smp, or MPI strategies, decoupling the distributed training strategy from the training toolkit and framework-specific estimators.
- Simplified hyperparameter handling: Hyperparameters are passed to the training job as a single environment variable, so you can load them all at once through the SM_HPS variable.

To further explain each of these benefits, we demonstrate with examples in the following sections, and finally show you how to set up and run distributed training for the Meta Llama 3.1 8B model using the new ModelTrainer class.

Launch a training job using the ModelTrainer class

The ModelTrainer class simplifies the experience by letting you customize the training job, including providing a custom script, directly providing a command to run the training job, supporting local mode, and much more. However, you can spin up a SageMaker training job in script mode by providing minimal parameters—the SourceCode and the training image URI.

The following example illustrates how you can launch a training job with your own custom script by providing just the script and the training image URI (in this case, PyTorch), and an optional requirements file. Additional parameters such as the instance type and instance size are automatically set by the SDK to preset defaults, and parameters such as the AWS Identity and Access Management (IAM) role and SageMaker session are automatically detected from the current session and user's credentials. Admins and users can also override the defaults using the SDK defaults configuration file. For the detailed list of preset values, refer to the SDK documentation.

from sagemaker.modules.train import ModelTrainer
from sagemaker.modules.configs import SourceCode, InputData

# image URI for the training job
pytorch_image = "763104351884.dkr.ecr.us-west-2.amazonaws.com/pytorch-training:2.0.0-cpu-py310"
# you can find all available images here
# https://docs.aws.amazon.com/sagemaker/latest/dg-ecr-paths/sagemaker-algo-docker-registry-paths.html

# define the script to be run
source_code = SourceCode(
    source_dir="basic-script-mode",
    requirements="requirements.txt",
    entry_script="custom_script.py",
)

# define the ModelTrainer
model_trainer = ModelTrainer(
    training_image=pytorch_image,
    source_code=source_code,
    base_job_name="script-mode",
)

# pass the input data
input_data = InputData(
    channel_name="train",
    data_source=training_input_path,  # S3 path where training data is stored
)

# start the training job
model_trainer.train(input_data_config=[input_data], wait=False)

With purpose-built configurations, you can now reuse these objects to create multiple training jobs with different hyperparameters, for example, without having to re-define all the parameters.
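As an illustration, the following is a minimal sketch, reusing the pytorch_image, source_code, and input_data objects defined in the preceding example; the hyperparameter names and values are placeholders, not part of the original walkthrough:

# Sketch: reuse the same configuration objects across multiple training jobs,
# varying only the hyperparameters (values below are illustrative).
for learning_rate in [1e-5, 3e-5, 5e-5]:
    trainer = ModelTrainer(
        training_image=pytorch_image,
        source_code=source_code,
        base_job_name="script-mode-lr-sweep",
        hyperparameters={"learning_rate": learning_rate, "epochs": 2},
    )
    trainer.train(input_data_config=[input_data], wait=False)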

Run the job locally for experimentation

To run the preceding training job locally, you can simply set the training_mode parameter as shown in the following code:

from sagemaker.modules.train.model_trainer import Mode

...

model_trainer = ModelTrainer(
    training_image=pytorch_image,
    source_code=source_code,
    base_job_name="script-mode-local",
    training_mode=Mode.LOCAL_CONTAINER,
)

model_trainer.train()

The training job runs locally because training_mode is set to Mode.LOCAL_CONTAINER. If not explicitly set, the ModelTrainer runs a remote SageMaker training job by default. This behavior can also be enforced by changing the value to Mode.SAGEMAKER_TRAINING_JOB. For a full list of the available configs, including compute and networking, refer to the SDK documentation.
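For completeness, here is a short sketch of the equivalent explicit setting for a remote job, reusing the image, source code, and input data objects from the earlier script mode example:

# Explicitly request a remote SageMaker training job; this is also the default
# behavior when training_mode is not set.
model_trainer = ModelTrainer(
    training_image=pytorch_image,
    source_code=source_code,
    base_job_name="script-mode-remote",
    training_mode=Mode.SAGEMAKER_TRAINING_JOB,
)
model_trainer.train(input_data_config=[input_data], wait=False)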

Read hyperparameters in your custom script

The ModelTrainer supports multiple ways to read the hyperparameters that are passed to a training job. In addition to the existing support to read the hyperparameters as command line arguments in your custom script, ModelTrainer also supports reading the hyperparameters as individual environment variables, prefixed with SM_HP_<hyperparameter-key>, or as a single environment variable dictionary, SM_HPS.

Suppose the following hyperparameters are passed to the training job:

hyperparams = {
    "learning_rate": 1e-5,
    "epochs": 2,
}

model_trainer = ModelTrainer(
    ...
    hyperparameters=hyperparams,
    ...
)

You have the following options:

Option 1: Load all the hyperparameters at once as a JSON dictionary from the SM_HPS environment variable:

import json
import os

def main():
    hyperparams = json.loads(os.environ["SM_HPS"])
    learning_rate = hyperparams.get("learning_rate")
    epochs = hyperparams.get("epochs", 1)
    ...

Option 2: Read each hyperparameter as an individual environment variable prefixed with SM_HP_:

import os

def main():
    learning_rate = float(os.environ.get("SM_HP_LEARNING_RATE", 3e-5))
    epochs = int(os.environ.get("SM_HP_EPOCHS", 1))
    ...

Option 3: Read the hyperparameters as command line arguments:

import argparse

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--learning_rate", type=float, default=3e-5)
    parser.add_argument("--epochs", type=int, default=1)

    args = parser.parse_args()

    learning_rate = args.learning_rate
    epochs = args.epochs

Run distributed training jobs

SageMaker supports distributed training for deep learning tasks such as natural language processing and computer vision, so you can run secure and scalable data parallel and model parallel jobs. This is usually achieved by providing the right set of parameters when using an Estimator. For example, to use torchrun, you would define the distribution parameter in the PyTorch Estimator and set it to "torch_distributed": {"enabled": True}.
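For comparison, a rough sketch of how this looks with the legacy PyTorch Estimator follows; the role ARN, framework version, Python version, and instance settings are placeholders, not values taken from the original example:

from sagemaker.pytorch import PyTorch

# Legacy approach: distributed training is configured through the distribution
# parameter of the framework-specific estimator.
estimator = PyTorch(
    entry_point="fine_tune.py",
    source_dir="distributed-training-scripts",
    role="arn:aws:iam::111122223333:role/SageMakerExecutionRole",  # placeholder role ARN
    framework_version="2.2",
    py_version="py310",
    instance_count=1,
    instance_type="ml.g5.12xlarge",
    distribution={"torch_distributed": {"enabled": True}},
)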

The ModelTrainer class provides enhanced flexibility for users to specify custom commands directly through the command parameter in the SourceCode class, and supports torchrun, torchrun smp, and the MPI strategies. This capability is particularly useful when you need to launch a job with a custom launcher command that is not supported by the training toolkit.

In the following example, we show how to fine-tune the Meta Llama 3.1 8B model with torchrun, using the default launcher script, on a custom dataset that's preprocessed and saved in an Amazon Simple Storage Service (Amazon S3) location:

from sagemaker.modules.train import ModelTrainer
from sagemaker.modules.distributed import Torchrun
from sagemaker.modules.configs import Compute, SourceCode, InputData

# provide image URI - update the URI if you're in a different region
pytorch_image = "763104351884.dkr.ecr.us-west-2.amazonaws.com/pytorch-training:2.2.0-gpu-py310"

# Define the source code configuration for the distributed training job
source_code = SourceCode(
    source_dir="distributed-training-scripts",
    requirements="requirements.txt",
    entry_script="fine_tune.py",
)

torchrun = Torchrun()

hyperparameters = {
    ...
}

# Compute configuration for the training job
compute = Compute(
    instance_count=1,
    instance_type="ml.g5.12xlarge",
    volume_size_in_gb=96,
    keep_alive_period_in_seconds=3600,
)

# Initialize the ModelTrainer with the specified configurations
model_trainer = ModelTrainer(
    training_image=pytorch_image,
    source_code=source_code,
    compute=compute,
    distributed_runner=torchrun,
    hyperparameters=hyperparameters,
)

# pass the input data
input_data = InputData(
    channel_name="dataset",
    data_source="s3://your-bucket/your-prefix",  # S3 path where the processed data is stored
)

# Start the training job
model_trainer.train(input_data_config=[input_data], wait=False)

If you want to customize your torchrun launcher script, you can also directly provide the command using the command parameter:

# Define the source code configuration for the distributed training job
source_code = SourceCode(
    source_dir="distributed-training-scripts",
    requirements="requirements.txt",
    # Custom command for distributed training launcher script
    command="torchrun --nnodes 1 \
            --nproc_per_node 4 \
            --master_addr algo-1 \
            --master_port 7777 \
            fine_tune_llama.py",
)

# Initialize the ModelTrainer with the specified configurations
model_trainer = ModelTrainer(
    training_image=pytorch_image,
    source_code=source_code,
    compute=compute,
)

# Start the training job
model_trainer.train(...)

For more examples and end-to-end ML workflows using the SageMaker ModelTrainer, refer to the GitHub repo.

Conclusion

The newly launched SageMaker ModelTrainer class simplifies the user experience by reducing the number of parameters, introducing intuitive configurations, and supporting complex setups like bringing your own container and running distributed training. Data scientists can also seamlessly transition from local training to remote training and training on multiple nodes using the ModelTrainer.

We encourage you to try out the ModelTrainer class by referring to the SDK documentation and sample notebooks on the GitHub repo. The ModelTrainer class is available from the SageMaker SDK v2.x onwards, at no additional charge. In Part 2 of this series, we show you how to build a model and deploy to a SageMaker endpoint using the improved ModelBuilder class.


About the Authors

Durga Sury is a Senior Solutions Architect on the Amazon SageMaker team. Over the past 5 years, she has worked with multiple enterprise customers to set up a secure, scalable AI/ML platform built on SageMaker.

Shweta Singh is a Senior Product Manager in the Amazon SageMaker Machine Learning (ML) platform team at AWS, leading the SageMaker Python SDK. She has worked in several product roles at Amazon for over 5 years. She has a Bachelor of Science degree in Computer Engineering and a Master of Science in Financial Engineering, both from New York University.
