AWS Machine Learning Blog, September 5, 2024
Deploy Amazon SageMaker pipelines using AWS Controllers for Kubernetes

This post describes how to use Kubernetes and SageMaker to manage the entire machine learning lifecycle, including training and inference.

Kubernetes' scalability and load-balancing capabilities make it well suited to the variable workloads of machine learning applications, but using it to manage the deployment stage adds complexity. Amazon SageMaker simplifies model building and deployment, and SageMaker Pipelines automates the model building workflow.

To simplify the process, you can use AWS Controllers for Kubernetes (ACK) to manage and deploy a SageMaker training pipeline. After configuring IAM permissions and installing the controller with the SageMaker Helm Chart, DevOps engineers can use Kubernetes to create and manage ML pipelines.

ML engineers use the SageMaker Python SDK to generate a pipeline definition in JSON format, which DevOps engineers retrieve to deploy and maintain the required infrastructure. SageMaker encrypts data by default with an AWS managed key, a custom key can also be specified, and access to pipeline artifacts should be restricted to a specific set of IAM roles.

In Kubernetes, an object specification describes the object's desired state and basic information, and tools such as kubectl communicate with the Kubernetes API through manifest files in YAML or JSON format. You create and submit a pipeline YAML specification, modifying the .spec.pipelineDefinition key to add the pipeline JSON definition, and then prepare and submit a separate pipeline execution YAML specification to run the pipeline in SageMaker.

The post also explains how to submit the pipeline and review and troubleshoot pipeline runs, including listing created pipelines and pipeline runs and getting pipeline details to check status and errors, as well as cleanup steps such as deleting created pipelines and canceling started pipeline runs.

Kubernetes is a popular orchestration platform for managing containers. Its scalability and load-balancing capabilities make it ideal for handling the variable workloads typical of machine learning (ML) applications. DevOps engineers often use Kubernetes to manage and scale ML applications, but before an ML model is available, it must be trained and evaluated and, if the quality of the obtained model is satisfactory, uploaded to a model registry.

Amazon SageMaker provides capabilities to remove the undifferentiated heavy lifting of building and deploying ML models. SageMaker simplifies the process of managing dependencies, container images, auto scaling, and monitoring. Specifically for the model building stage, Amazon SageMaker Pipelines automates the process by managing the infrastructure and resources needed to process data, train models, and run evaluation tests.

A challenge for DevOps engineers is the additional complexity that comes from using Kubernetes to manage the deployment stage while resorting to other tools (such as the AWS SDK or AWS CloudFormation) to manage the model building pipeline. One alternative to simplify this process is to use AWS Controllers for Kubernetes (ACK) to manage and deploy a SageMaker training pipeline. ACK allows you to take advantage of managed model building pipelines without needing to define resources outside of the Kubernetes cluster.

In this post, we introduce an example to help DevOps engineers manage the entire ML lifecycle—including training and inference—using the same toolkit.

Solution overview

We consider a use case in which an ML engineer configures a SageMaker model building pipeline using a Jupyter notebook. This configuration takes the form of a Directed Acyclic Graph (DAG) represented as a JSON pipeline definition. The JSON document can be stored and versioned in an Amazon Simple Storage Service (Amazon S3) bucket. If encryption is required, it can be implemented using an AWS Key Management Service (AWS KMS) managed key for Amazon S3. A DevOps engineer with access to fetch this definition file from Amazon S3 can load the pipeline definition into an ACK service controller for SageMaker, which is running as part of an Amazon Elastic Kubernetes Service (Amazon EKS) cluster. The DevOps engineer can then use the Kubernetes APIs provided by ACK to submit the pipeline definition and initiate one or more pipeline runs in SageMaker. This entire workflow is shown in the following solution diagram.
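For illustration, the definition file can be stored and retrieved with the AWS CLI; in the following sketch, the bucket name and object key are placeholders:

# ML engineer stores the versioned pipeline definition (bucket and key are placeholders)
aws s3 cp pipeline-definition.json s3://<<YOUR_BUCKET_NAME>>/pipelines/pipeline-definition.json
# DevOps engineer later fetches the definition before loading it into the ACK controller
aws s3 cp s3://<<YOUR_BUCKET_NAME>>/pipelines/pipeline-definition.json .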

Prerequisites

To follow along, you should have the following prerequisites:

Install the SageMaker ACK service controller

The SageMaker ACK service controller makes it straightforward for DevOps engineers to use Kubernetes as their control plane to create and manage ML pipelines. To install the controller in your EKS cluster, complete the following steps:

    Configure IAM permissions to make sure the controller has access to the appropriate AWS resources.
    Install the controller using a SageMaker Helm Chart to make it available on the client machine.

The following tutorial provides step-by-step instructions with the required commands to install the ACK service controller for SageMaker.
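For reference, the installation typically follows the standard ACK Helm pattern. The following is a minimal sketch; the namespace, AWS Region, and chart version shown here are assumptions to adjust for your environment:

# Install the ACK service controller for SageMaker from the public ECR Helm repository
# (namespace, region, and version below are placeholders)
export SERVICE=sagemaker
export AWS_REGION=us-east-1
export ACK_SYSTEM_NAMESPACE=ack-system
export RELEASE_VERSION=<<CONTROLLER_VERSION>>

helm install --create-namespace -n $ACK_SYSTEM_NAMESPACE ack-$SERVICE-controller \
  oci://public.ecr.aws/aws-controllers-k8s/$SERVICE-chart \
  --version=$RELEASE_VERSION --set=aws.region=$AWS_REGION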

Generate a pipeline JSON definition

In most companies, ML engineers are responsible for creating the ML pipeline in their organization. They often work with DevOps engineers to operate those pipelines. In SageMaker, ML engineers can use the SageMaker Python SDK to generate a pipeline definition in JSON format. A SageMaker pipeline definition must follow the provided schema, which includes base images, dependencies, steps, and instance types and sizes that are needed to fully define the pipeline. This definition then gets retrieved by the DevOps engineer for deploying and maintaining the infrastructure needed for the pipeline.
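As an illustrative sketch only (the role ARN, bucket, and data locations are placeholders, and your steps and hyperparameters will differ), a definition of this kind can be generated with the SageMaker Python SDK and written to a file:

# Minimal sketch: build a one-step training pipeline and export its JSON definition.
# Role ARN, bucket, and S3 prefixes are placeholders.
import sagemaker
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput
from sagemaker.workflow.pipeline import Pipeline
from sagemaker.workflow.steps import TrainingStep

session = sagemaker.Session()
role = "<<YOUR_SAGEMAKER_ROLE_ARN>>"
bucket = "<<YOUR_BUCKET_NAME>>"

# Built-in XGBoost algorithm image for the training step
image_uri = sagemaker.image_uris.retrieve(
    "xgboost", region=session.boto_region_name, version="1.7-1"
)

xgb_estimator = Estimator(
    image_uri=image_uri,
    role=role,
    instance_count=1,
    instance_type="ml.m4.xlarge",
    volume_size=5,
    output_path=f"s3://{bucket}/sagemaker/",
    sagemaker_session=session,
)
xgb_estimator.set_hyperparameters(
    max_depth=5, gamma=4, eta=0.2, min_child_weight=6,
    objective="multi:softmax", num_class=10, num_round=10,
)

train_step = TrainingStep(
    name="AbaloneTrain",
    estimator=xgb_estimator,
    inputs={
        "train": TrainingInput(
            s3_data=f"s3://{bucket}/sagemaker/xgboost/train/", content_type="text/libsvm"
        ),
        "validation": TrainingInput(
            s3_data=f"s3://{bucket}/sagemaker/xgboost/validation/", content_type="text/libsvm"
        ),
    },
)

pipeline = Pipeline(name="my-kubernetes-pipeline", steps=[train_step], sagemaker_session=session)

# pipeline.definition() returns the JSON document that the DevOps engineer deploys
with open("pipeline-definition.json", "w") as f:
    f.write(pipeline.definition())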

The following is a sample pipeline definition with one training step:

{
  "Version": "2020-12-01",
  "Steps": [
    {
      "Name": "AbaloneTrain",
      "Type": "Training",
      "Arguments": {
        "RoleArn": "<<YOUR_SAGEMAKER_ROLE_ARN>>",
        "HyperParameters": {
          "max_depth": "5",
          "gamma": "4",
          "eta": "0.2",
          "min_child_weight": "6",
          "objective": "multi:softmax",
          "num_class": "10",
          "num_round": "10"
        },
        "AlgorithmSpecification": {
          "TrainingImage": "683313688378.dkr.ecr.us-east-1.amazonaws.com/sagemaker-xgboost:1.7-1",
          "TrainingInputMode": "File"
        },
        "OutputDataConfig": {
          "S3OutputPath": "s3://<<YOUR_BUCKET_NAME>>/sagemaker/"
        },
        "ResourceConfig": {
          "InstanceCount": 1,
          "InstanceType": "ml.m4.xlarge",
          "VolumeSizeInGB": 5
        },
        "StoppingCondition": {
          "MaxRuntimeInSeconds": 86400
        },
        "InputDataConfig": [
          {
            "ChannelName": "train",
            "DataSource": {
              "S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": "s3://<<YOUR_BUCKET_NAME>>/sagemaker/xgboost/train/",
                "S3DataDistributionType": "FullyReplicated"
              }
            },
            "ContentType": "text/libsvm"
          },
          {
            "ChannelName": "validation",
            "DataSource": {
              "S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": "s3://<<YOUR_BUCKET_NAME>>/sagemaker/xgboost/validation/",
                "S3DataDistributionType": "FullyReplicated"
              }
            },
            "ContentType": "text/libsvm"
          }
        ]
      }
    }
  ]
}

With SageMaker, ML model artifacts and other system artifacts are encrypted in transit and at rest. SageMaker encrypts these by default using the AWS managed key for Amazon S3. You can optionally specify a custom key using the KmsKeyId property of the OutputDataConfig argument. For more information on how SageMaker protects data, see Data Protection in Amazon SageMaker.
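For example (the account ID and key ID below are placeholders), a custom key can be referenced in the training step's OutputDataConfig:

"OutputDataConfig": {
  "S3OutputPath": "s3://<<YOUR_BUCKET_NAME>>/sagemaker/",
  "KmsKeyId": "arn:aws:kms:us-east-1:<<YOUR_ACCOUNT_ID>>:key/<<YOUR_KMS_KEY_ID>>"
}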

Furthermore, we recommend securing access to the pipeline artifacts, such as model outputs and training data, to a specific set of IAM roles created for data scientists and ML engineers. This can be achieved by attaching an appropriate bucket policy. For more information on best practices for securing data in Amazon S3, see Top 10 security best practices for securing data in Amazon S3.
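As an illustration only (the account ID, role names, and prefix are placeholders, and you should tailor the actions and resources to your requirements), a bucket policy along these lines denies access to the pipeline artifact prefix for any principal other than an approved set of roles:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "RestrictPipelineArtifactAccess",
      "Effect": "Deny",
      "Principal": "*",
      "Action": "s3:*",
      "Resource": "arn:aws:s3:::<<YOUR_BUCKET_NAME>>/sagemaker/*",
      "Condition": {
        "ArnNotLike": {
          "aws:PrincipalArn": [
            "arn:aws:iam::<<YOUR_ACCOUNT_ID>>:role/<<DATA_SCIENTIST_ROLE>>",
            "arn:aws:iam::<<YOUR_ACCOUNT_ID>>:role/<<ML_ENGINEER_ROLE>>",
            "<<YOUR_SAGEMAKER_ROLE_ARN>>"
          ]
        }
      }
    }
  ]
}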

Create and submit a pipeline YAML specification

In the Kubernetes world, objects are the persistent entities in the Kubernetes cluster used to represent the state of your cluster. When you create an object in Kubernetes, you must provide the object specification that describes its desired state, as well as some basic information about the object (such as a name). Then, using tools such as kubectl, you provide the information in a manifest file in YAML (or JSON) format to communicate with the Kubernetes API.

Refer to the following Kubernetes YAML specification for a SageMaker pipeline. DevOps engineers need to modify the .spec.pipelineDefinition key in the file and add the ML engineer-provided pipeline JSON definition. They then prepare and submit a separate pipeline execution YAML specification to run the pipeline in SageMaker. There are two ways to add the pipeline JSON definition to the pipeline YAML specification:

    Copy and paste the pipeline JSON definition into the .spec.pipelineDefinition key of the YAML file.
    Convert the pipeline JSON definition into a string with a command line utility such as jq, for example:

jq -r tojson <pipeline-definition.json>

In this post, we use the first option and prepare the YAML specification (my-pipeline.yaml) as follows:

apiVersion: sagemaker.services.k8s.aws/v1alpha1
kind: Pipeline
metadata:
  name: my-kubernetes-pipeline
spec:
  parallelismConfiguration:
    maxParallelExecutionSteps: 2
  pipelineName: my-kubernetes-pipeline
  pipelineDefinition: |
    {
      "Version": "2020-12-01",
      "Steps": [
        {
          "Name": "AbaloneTrain",
          "Type": "Training",
          "Arguments": {
            "RoleArn": "<<YOUR_SAGEMAKER_ROLE_ARN>>",
            "HyperParameters": {
              "max_depth": "5",
              "gamma": "4",
              "eta": "0.2",
              "min_child_weight": "6",
              "objective": "multi:softmax",
              "num_class": "10",
              "num_round": "30"
            },
            "AlgorithmSpecification": {
              "TrainingImage": "683313688378.dkr.ecr.us-east-1.amazonaws.com/sagemaker-xgboost:1.7-1",
              "TrainingInputMode": "File"
            },
            "OutputDataConfig": {
              "S3OutputPath": "s3://<<YOUR_S3_BUCKET>>/sagemaker/"
            },
            "ResourceConfig": {
              "InstanceCount": 1,
              "InstanceType": "ml.m4.xlarge",
              "VolumeSizeInGB": 5
            },
            "StoppingCondition": {
              "MaxRuntimeInSeconds": 86400
            },
            "InputDataConfig": [
              {
                "ChannelName": "train",
                "DataSource": {
                  "S3DataSource": {
                    "S3DataType": "S3Prefix",
                    "S3Uri": "s3://<<YOUR_S3_BUCKET>>/sagemaker/xgboost/train/",
                    "S3DataDistributionType": "FullyReplicated"
                  }
                },
                "ContentType": "text/libsvm"
              },
              {
                "ChannelName": "validation",
                "DataSource": {
                  "S3DataSource": {
                    "S3DataType": "S3Prefix",
                    "S3Uri": "s3://<<YOUR_S3_BUCKET>>/sagemaker/xgboost/validation/",
                    "S3DataDistributionType": "FullyReplicated"
                  }
                },
                "ContentType": "text/libsvm"
              }
            ]
          }
        }
      ]
    }
  pipelineDisplayName: my-kubernetes-pipeline
  roleARN: <<YOUR_SAGEMAKER_ROLE_ARN>>

Submit the pipeline to SageMaker

To submit your prepared pipeline specification, apply the specification to your Kubernetes cluster as follows:

kubectl apply -f my-pipeline.yaml

Create and submit a pipeline execution YAML specification

Refer to the following Kubernetes YAML specification for a SageMaker pipeline execution. Prepare the pipeline execution YAML specification (pipeline-execution.yaml) as follows:

apiVersion: sagemaker.services.k8s.aws/v1alpha1
kind: PipelineExecution
metadata:
  name: my-kubernetes-pipeline-execution
spec:
  parallelismConfiguration:
    maxParallelExecutionSteps: 2
  pipelineExecutionDescription: "My first pipeline execution via Amazon EKS cluster."
  pipelineName: my-kubernetes-pipeline

To start a run of the pipeline, use the following code:

kubectl apply -f pipeline-execution.yaml

Review and troubleshoot the pipeline run

To list all pipelines created using the ACK controller, use the following command:

kubectl get pipeline

To list all pipeline runs, use the following command:

kubectl get pipelineexecution

To get more details about the pipeline after it’s submitted, like checking the status, errors, or parameters of the pipeline, use the following command:

kubectl describe pipeline my-kubernetes-pipeline

To troubleshoot a pipeline run by reviewing more details about the run, use the following command:

kubectl describe pipelineexecution my-kubernetes-pipeline-execution

Clean up

Use the following command to delete the pipeline you created:

kubectl delete pipeline my-kubernetes-pipeline

Use the following command to cancel the pipeline run you started:

kubectl delete pipelineexecution my-kubernetes-pipeline-execution

Conclusion

In this post, we presented an example of how ML engineers familiar with Jupyter notebooks and SageMaker environments can efficiently work with DevOps engineers familiar with Kubernetes and related tools to design and maintain an ML pipeline with the right infrastructure for their organization. This enables DevOps engineers to manage all the steps of the ML lifecycle with the tools and environment they are already used to, helping organizations innovate faster and more efficiently.

Explore the GitHub repository for ACK and the SageMaker controller to start managing your ML operations with Kubernetes.


About the Authors

Pratik Yeole is a Senior Solutions Architect working with global customers, helping customers build value-driven solutions on AWS. He has expertise in MLOps and containers domains. Outside of work, he enjoys time with friends, family, music, and cricket.

Felipe Lopez is a Senior AI/ML Specialist Solutions Architect at AWS. Prior to joining AWS, Felipe worked with GE Digital and SLB, where he focused on modeling and optimization products for industrial applications.
