<section class="blog-post-content"><p><a href="https://aws.amazon.com/sagemaker/pipelines/" target="_blank" rel="noopener noreferrer">Amazon SageMaker Pipelines</a> allows data scientists and machine learning (ML) engineers to automate training workflows, which helps you create a repeatable process to orchestrate model development steps for rapid experimentation and model retraining. You can automate the entire model build workflow, including data preparation, feature engineering, model training, model tuning, and model validation, and catalog it in the model registry. You can configure pipelines to run automatically at regular intervals or when certain events are triggered, or you can run them manually as needed.</p><p>In this post, we highlight some of the enhancements to the <a href="https://aws.amazon.com/sagemaker/" target="_blank" rel="noopener noreferrer">Amazon SageMaker</a> SDK and introduce new features of Amazon SageMaker Pipelines that make it easier for ML practitioners to build and train ML models.</p><p>Pipelines continues to innovate its developer experience, and with these recent releases, you can now use the service in a more customized way:</p><ul><li><a href="https://github.com/aws/sagemaker-python-sdk/releases/tag/v2.99.0" target="_blank" rel="noopener noreferrer">2.99.0</a>, <a href="https://github.com/aws/sagemaker-python-sdk/releases/tag/v2.101.1" target="_blank" rel="noopener noreferrer">2.101.1</a>, <a href="https://github.com/aws/sagemaker-python-sdk/releases/tag/v2.102.0" target="_blank" rel="noopener noreferrer">2.102.0</a>, <a href="https://github.com/aws/sagemaker-python-sdk/releases/tag/v2.104.0" target="_blank" rel="noopener noreferrer">2.104.0</a> – Updated documentation on <code>PipelineVariable</code> usage for estimator, processor, tuner, transformer, and model base classes, Amazon models, and framework models. 
There will be additional changes coming with newer versions of the SDK to support all subclasses of estimators and processors.</li><li><a href="https://github.com/aws/sagemaker-python-sdk/releases/tag/v2.90.0" target="_blank" rel="noopener noreferrer">2.90.0</a> – Availability of <a href="https://sagemaker.readthedocs.io/en/stable/workflows/pipelines/sagemaker.workflow.pipelines.html#sagemaker.workflow.model_step.ModelStep" target="_blank" rel="noopener noreferrer">ModelStep</a> for integrated model resource creation and registration tasks.</li><li><a href="https://github.com/aws/sagemaker-python-sdk/releases/tag/v2.88.2" target="_blank" rel="noopener noreferrer">2.88.2</a> – Availability of <a href="https://sagemaker.readthedocs.io/en/stable/workflows/pipelines/sagemaker.workflow.pipelines.html#sagemaker.workflow.pipeline_context.PipelineSession" target="_blank" rel="noopener noreferrer">PipelineSession</a> for managed interaction with SageMaker entities and resources.</li><li><a href="https://github.com/aws/sagemaker-python-sdk/releases/tag/v2.88.2" target="_blank" rel="noopener noreferrer">2.88.2</a> – Subclass compatibility for <a href="https://sagemaker.readthedocs.io/en/stable/workflows/pipelines/sagemaker.workflow.pipelines.html#steps" target="_blank" rel="noopener noreferrer">workflow pipeline job steps</a> so you can build job abstractions and configure and run processing, training, transform, and tuning jobs as you would without a pipeline.</li><li><a href="https://github.com/aws/sagemaker-python-sdk/releases/tag/v2.76.0" target="_blank" rel="noopener noreferrer">2.76.0</a> – Availability of <a href="https://sagemaker.readthedocs.io/en/stable/workflows/pipelines/sagemaker.workflow.pipelines.html#sagemaker.workflow.fail_step.FailStep" target="_blank" rel="noopener noreferrer">FailStep</a> to conditionally stop a pipeline with a failure status.</li></ul><p>In this post, we walk you through a workflow using a sample dataset with a focus on model building and 
deployment to demonstrate how to implement Pipelines’s new features. By the end, you should have enough information to successfully use these newer features and simplify your ML workloads.</p><h2>Features overview</h2><p>Pipelines offers the following new features:</p><ul><li><strong>Pipeline variable annotation</strong> – Certain method parameters accept multiple input types, including <code>PipelineVariables</code>, and additional documentation has been added to clarify where <code>PipelineVariables</code> are supported in both the latest stable version of SageMaker SDK documentation and the init signature of the functions. For example, in the following TensorFlow estimator, the init signature now shows that <code>model_dir</code> and <code>image_uri</code> support <code>PipelineVariables</code>, whereas the other parameters do not. For more information, refer to <a href="https://sagemaker.readthedocs.io/en/stable/frameworks/tensorflow/sagemaker.tensorflow.html" target="_blank" rel="noopener noreferrer">TensorFlow Estimator</a>.<ul><li>Before:</li><li>After:</li></ul></li></ul><ul><li><strong>Pipeline session</strong> – <a href="https://sagemaker.readthedocs.io/en/stable/workflows/pipelines/sagemaker.workflow.pipelines.html#sagemaker.workflow.pipeline_context.PipelineSession" target="_blank" rel="noopener noreferrer">PipelineSession</a> is a new concept introduced to bring unity across the SageMaker SDK and introduces lazy initialization of the pipeline resources (the run calls are captured but not run until the pipeline is created and run). 
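<p>As a rough illustration of the "captured but not run" idea, here is a toy sketch in plain Python. It does not use the SageMaker SDK, and <code>DeferredSession</code>, <code>submit</code>, <code>start</code>, and <code>preprocess</code> are hypothetical names invented for this example. It only shows the general pattern of recording a call first and executing it later.</p>

```python
from dataclasses import dataclass, field
from typing import Any, Callable

executed = []  # records which jobs have actually run


@dataclass
class DeferredSession:
    """Hypothetical stand-in that records calls instead of executing them."""

    captured: list = field(default_factory=list)

    def submit(self, job: Callable[..., Any], **kwargs: Any) -> dict:
        # Record the request only; the job itself is not invoked here.
        request = {"job": job, "kwargs": kwargs}
        self.captured.append(request)
        return request

    def start(self) -> list:
        # Only now are the captured requests executed, in submission order.
        return [req["job"](**req["kwargs"]) for req in self.captured]


def preprocess(rows: int) -> str:
    executed.append("preprocess")
    return f"preprocessed {rows} rows"


session = DeferredSession()
step_args = session.submit(preprocess, rows=100)  # captured, not run
ran_before_start = len(executed)                  # still 0 at this point
results = session.start()                         # the captured call runs here
print(ran_before_start, results)
```

<p>In this toy version, <code>submit</code> plays the role of a captured run call and <code>start</code> plays the role of creating and running the pipeline.</p>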
The <code>PipelineSession</code> context inherits from the SageMaker SDK <code>Session</code> class and implements convenient methods for you to interact with other SageMaker entities and resources, such as training jobs, endpoints, and input datasets stored in <a href="http://aws.amazon.com/s3" target="_blank" rel="noopener noreferrer">Amazon Simple Storage Service</a> (Amazon S3).</li><li><strong>Subclass compatibility with workflow pipeline job steps</strong> – You can now build job abstractions and configure and run processing, training, transform, and tuning jobs as you would without a pipeline.<ul><li>For example, creating a processing step with <code>SKLearnProcessor</code> previously required the following:</li><li>As we see in the preceding code, <code>ProcessingStep</code> needs essentially the same preprocessing configuration as <code>.run</code>, just without initiating the API call to start the job. With subclass compatibility now enabled for workflow pipeline job steps, we instead declare the <code>step_args</code> argument, which takes the preprocessing logic expressed through <code>.run</code>, so you can build a job abstraction and configure it as you would without Pipelines. We also pass in the <code>pipeline_session</code>, which is a <code>PipelineSession</code> object, instead of <code>sagemaker_session</code> to make sure the run calls are captured but not executed until the pipeline is created and run. See the following code:</li></ul></li></ul><ul><li><strong>Model step (a streamlined approach with model creation and registration steps)</strong> – Pipelines offers two step types to integrate with SageMaker models: <code>CreateModelStep</code> and <code>RegisterModel</code>. You can now achieve both using only the <code>ModelStep</code> type. Note that a <code>PipelineSession</code> is required to achieve this. 
This brings similarity between the pipeline steps and the SDK.<ul><li>Before:</li></ul><ul><li>After:</li></ul></li></ul><ul><li><strong>Fail step (conditional stop of the pipeline run)</strong> – <code>FailStep</code> allows a pipeline to be stopped with a failure status if a condition is met, such as if the model score is below a certain threshold.</li></ul><h2>Solution overview</h2><p>In this solution, your entry point is the <a href="https://docs.aws.amazon.com/sagemaker/latest/dg/studio.html" target="_blank" rel="noopener noreferrer">Amazon SageMaker Studio</a> integrated development environment (IDE) for rapid experimentation. Studio offers an environment to manage the end-to-end Pipelines experience. With Studio, you can bypass the <a href="http://aws.amazon.com/console" target="_blank" rel="noopener noreferrer">AWS Management Console</a> for your entire workflow management. For more information on managing Pipelines from within Studio, refer to <a href="https://docs.aws.amazon.com/sagemaker/latest/dg/pipelines-studio.html" target="_blank" rel="noopener noreferrer">View, Track, and Execute SageMaker Pipelines in SageMaker Studio</a>.</p><p>The following diagram illustrates the high-level architecture of the ML workflow with the different steps to train and generate inferences using the new features.</p><p><a href="https://d2908q01vomqb2.cloudfront.net/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59/2022/09/21/ML-10522-image001.jpg"><img class="alignnone size-full wp-image-42975" src="https://d2908q01vomqb2.cloudfront.net/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59/2022/09/21/ML-10522-image001.jpg" alt="" width="984" height="464" /></a></p><p>The pipeline includes the following steps:</p><ol><li>Preprocess data to build features required and split data into train, validation, and test datasets.</li><li>Create a training job with the SageMaker XGBoost framework.</li><li>Evaluate the trained model using the test dataset.</li><li>Check if the AUC score is above a 
predefined threshold.<ul><li>If the AUC score is less than the threshold, stop the pipeline run and mark it as failed.</li><li>If the AUC score is greater than the threshold, create a SageMaker model and register it in the SageMaker model registry.</li></ul></li><li>Apply batch transform on the given dataset using the model created in the previous step.</li></ol><h2>Prerequisites</h2><p>To follow along with this post, you need an AWS account with a <a href="https://docs.aws.amazon.com/sagemaker/latest/dg/gs-set-up.html" target="_blank" rel="noopener noreferrer">Studio domain</a>.</p><p>Pipelines is integrated directly with SageMaker entities and resources, so you don’t need to interact with any other AWS services. Because it’s a fully managed service, it also creates and manages the underlying resources for you. For more information on the various SageMaker components that are available both as standalone Python APIs and as integrated components of Studio, see the <a href="https://aws.amazon.com/sagemaker/" target="_blank" rel="noopener noreferrer">SageMaker product page</a>.</p><p>Before getting started, install SageMaker SDK version >= 2.104.0 and xlrd >= 1.0.0 within the Studio notebook using the following code snippet:</p><h2>ML workflow</h2><p>For this post, you use the following components:</p><ul><li><strong>Data preparation</strong><ul><li><a href="https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_ProcessingJob.html" target="_blank" rel="noopener noreferrer">SageMaker Processing</a> – SageMaker Processing is a fully managed service allowing you to run custom data transformations and feature engineering for ML workloads.</li></ul></li><li><strong>Model building</strong><ul><li><a href="https://docs.aws.amazon.com/sagemaker/latest/dg/notebooks.html" target="_blank" rel="noopener noreferrer">Studio notebooks</a> – One-click notebooks with elastic compute.</li><li><a 
href="https://sagemaker.readthedocs.io/en/stable/frameworks/xgboost/using_xgboost.html" target="_blank" rel="noopener noreferrer">SageMaker built-in algorithms</a> – XGBoost as a built-in algorithm.</li></ul></li><li><strong>Model training and evaluation</strong><ul><li><a href="https://docs.aws.amazon.com/sagemaker/latest/dg/how-it-works-training.html" target="_blank" rel="noopener noreferrer">One-click training</a> – The SageMaker distributed training feature. SageMaker provides distributed training libraries for data parallelism and model parallelism. The libraries are optimized for the SageMaker training environment, help adapt your distributed training jobs to SageMaker, and improve training speed and throughput.</li><li><a href="https://docs.aws.amazon.com/sagemaker/latest/dg/experiments.html" target="_blank" rel="noopener noreferrer">SageMaker Experiments</a> – Experiments is a capability of SageMaker that lets you organize, track, compare, and evaluate your ML iterations.</li><li><a href="https://docs.aws.amazon.com/sagemaker/latest/dg/how-it-works-batch.html" target="_blank" rel="noopener noreferrer">SageMaker batch transform</a> – Batch transform or offline scoring is a managed service in SageMaker that lets you predict on a larger dataset using your ML models.</li></ul></li><li><strong>Workflow orchestration</strong></li></ul><p>A SageMaker pipeline is a series of interconnected steps defined by a JSON pipeline definition. It encodes a pipeline using a directed acyclic graph (DAG). The DAG gives information on the requirements for and relationships between each step of the pipeline, and its structure is determined by the data dependencies between steps. 
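</p>
<p>As a rough illustration of how a run order can be derived from data dependencies alone, the following sketch uses Python's standard <code>graphlib</code> module. It is not how SageMaker builds its DAG internally; the step names, artifact names, and the <code>steps</code> dictionary layout are assumptions made up for this example.</p>

```python
from graphlib import TopologicalSorter

# Hypothetical steps, deliberately listed out of order: the artifacts each
# one produces and the artifacts it reads.
steps = {
    "CheckAUC": {"outputs": [], "inputs": ["evaluation_report"]},
    "Evaluate": {"outputs": ["evaluation_report"], "inputs": ["model", "test_data"]},
    "Train": {"outputs": ["model"], "inputs": ["train_data"]},
    "Preprocess": {"outputs": ["train_data", "test_data"], "inputs": []},
}

# Map every artifact to the step that produces it.
producer = {
    artifact: name for name, spec in steps.items() for artifact in spec["outputs"]
}

# A step depends on whichever steps produce the artifacts it reads.
depends_on = {
    name: {producer[artifact] for artifact in spec["inputs"]}
    for name, spec in steps.items()
}

# Topological sorting yields an order in which every producer runs first.
run_order = list(TopologicalSorter(depends_on).static_order())
print(run_order)  # ['Preprocess', 'Train', 'Evaluate', 'CheckAUC']
```

<p>No order is declared anywhere in the sketch; it falls out of which step reads which output.</p>
<p>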
These dependencies are created when the properties of a step’s output are passed as the input to another step.</p><p>The following diagram illustrates the different steps in the SageMaker pipeline (for a churn prediction use case) where the connections between the steps are inferred by SageMaker based on the inputs and outputs defined by the step definitions.</p><p>The next sections walk through creating each step of the pipeline and running the entire pipeline once created.</p><p><a href="https://d2908q01vomqb2.cloudfront.net/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59/2022/09/21/ML-10522-image003.jpg"><img class="alignnone size-full wp-image-42976" src="https://d2908q01vomqb2.cloudfront.net/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59/2022/09/21/ML-10522-image003.jpg" alt="" width="724" height="428" /></a></p><h2>Project structure</h2><p>Let’s start with the project structure:</p><ul><li><strong>/sm-pipelines-end-to-end-example</strong> – The project name<ul><li><strong>/data</strong> – The datasets</li><li><strong>/pipelines</strong> – The code files for pipeline components<ul><li>/customerchurn<ul><li>preprocess.py</li><li>evaluate.py</li></ul></li></ul></li><li><strong>sagemaker-pipelines-project.ipynb</strong> – A notebook walking through the modeling workflow using Pipelines’s new features</li></ul></li></ul><h2>Download the dataset</h2><p>To follow along with this post, you need to download and save the <a href="https://www.kaggle.com/uttamp/store-data" target="_blank" rel="noopener noreferrer">sample dataset</a> under the data folder within the project home directory, which saves the file in <a href="https://aws.amazon.com/efs/" target="_blank" rel="noopener noreferrer">Amazon Elastic File System</a> (Amazon EFS) within the Studio environment.</p><h2>Build the pipeline components</h2><p>Now you’re ready to build the pipeline components.</p><h3>Import statements and declare parameters and constants</h3><p>Create a Studio notebook called 
<code>sagemaker-pipelines-project.ipynb</code> within the project home directory. Enter the following code block in a cell, and run the cell to set up SageMaker and S3 client objects, create a <code>PipelineSession</code>, and set up the S3 bucket location using the default bucket that comes with a SageMaker session:</p><p>Pipelines supports parameterization, which allows you to specify input parameters at runtime without changing your pipeline code. You can use the classes available in the <code>sagemaker.workflow.parameters</code> module, such as <code>ParameterInteger</code>, <code>ParameterFloat</code>, and <code>ParameterString</code>, to specify pipeline parameters of various data types. Run the following code to set up multiple input parameters:</p><h3>Generate a batch dataset</h3><p>Generate the batch dataset, which you use later in the batch transform step:</p><h3>Upload data to an S3 bucket</h3><p>Upload the datasets to Amazon S3:</p><h3>Define a processing script and processing step</h3><p>In this step, you prepare a Python script that performs feature engineering and one-hot encoding and curates the training, validation, and test splits to be used for model building. Run the following code to build your processing script:</p><p>Next, run the following code block to instantiate the processor and the Pipelines step to run the processing script. Because the processing script is written with pandas, you use a <a href="https://docs.aws.amazon.com/sagemaker/latest/dg/use-scikit-learn-processing-container.html" target="_blank" rel="noopener noreferrer">SKLearnProcessor</a>. 
The Pipelines <code>ProcessingStep</code> takes the following arguments: the processor, the input S3 locations for raw datasets, and the output S3 locations to save processed datasets.</p><h3>Define a training step</h3><p>Set up model training using a SageMaker XGBoost estimator and the Pipelines <code>TrainingStep</code>:</p><h3>Define the evaluation script and model evaluation step</h3><p>Run the following code block to write the script that evaluates the model once it is trained. This script encapsulates the logic to check if the AUC score meets the specified threshold.</p><p>Next, run the following code block to instantiate the processor and the Pipelines step to run the evaluation script. Because the evaluation script uses the XGBoost package, you use a <code>ScriptProcessor</code> along with the XGBoost image. The Pipelines <code>ProcessingStep</code> takes the following arguments: the processor, the input S3 locations (here, the trained model and the test dataset), and the output S3 location to save the evaluation results.</p><h3>Define a create model step</h3><p>Run the following code block to create a SageMaker model using the Pipelines model step. This step uses the output of the training step to package the model for deployment. 
Note that the value for the instance type argument is passed using the Pipelines parameter you defined earlier in the post.</p><h3>Define a batch transform step</h3><p>Run the following code block to run batch transformation using the trained model with the batch input created in the first step:</p><h3>Define a register model step</h3><p>The following code registers the model within the SageMaker model registry using the Pipelines model step:</p><h3>Define a fail step to stop the pipeline</h3><p>The following code defines the Pipelines fail step to stop the pipeline run with an error message if the AUC score doesn’t meet the defined threshold:</p><h3>Define a condition step to check AUC score</h3><p>The following code defines a condition step that checks the AUC score and then either creates a model, runs a batch transformation, and registers the model in the model registry, or stops the pipeline run in a failed state:</p><h2>Build and run the pipeline</h2><p>After defining all of the component steps, you can assemble them into a Pipelines object. You don’t need to specify the order of the steps because Pipelines automatically infers the sequence based on the dependencies between them.</p><p>Run the following code in a cell in your notebook. If the pipeline already exists, the code updates the pipeline. If the pipeline doesn’t exist, it creates a new one.</p><h2>Conclusion</h2><p>In this post, we introduced some of the new features now available with Pipelines and used them, along with other built-in SageMaker features and the XGBoost algorithm, to develop, iterate on, and deploy a model for churn prediction. The solution can be extended with additional data sources to implement your own ML workflow. 
For more details on the steps available in the Pipelines workflow, refer to <a href="https://sagemaker.readthedocs.io/en/stable/amazon_sagemaker_model_building_pipeline.html" target="_blank" rel="noopener noreferrer">Amazon SageMaker Model Building Pipeline</a> and <a href="https://docs.aws.amazon.com/sagemaker/latest/dg/workflows.html" target="_blank" rel="noopener noreferrer">SageMaker Workflows</a>. The <a href="https://github.com/aws/amazon-sagemaker-examples" target="_blank" rel="noopener noreferrer">AWS SageMaker Examples</a> GitHub repo has more examples covering various use cases using Pipelines.</p><h3>About the Authors</h3><p class="c4"><strong><a href="https://d2908q01vomqb2.cloudfront.net/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59/2022/09/21/Jerry-Peng.png"><img class="size-full wp-image-42974 alignleft" src="https://d2908q01vomqb2.cloudfront.net/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59/2022/09/21/Jerry-Peng.png" alt="" width="100" height="132" /></a>Jerry Peng</strong> is a software development engineer with AWS SageMaker. He focuses on building end-to-end large-scale MLOps systems from training to model monitoring in production. He is also passionate about bringing the concept of MLOps to a broader audience.</p><p class="c4"><strong><a href="https://d2908q01vomqb2.cloudfront.net/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59/2022/09/21/Dewen-Qi.png"><img class="size-full wp-image-42972 alignleft" src="https://d2908q01vomqb2.cloudfront.net/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59/2022/09/21/Dewen-Qi.png" alt="" width="100" height="131" /></a>Dewen Qi</strong> is a Software Development Engineer in AWS. She currently focuses on developing and improving SageMaker Pipelines. 
Outside of work, she enjoys practicing the cello.</p><p class="c4"><strong><a href="https://d2908q01vomqb2.cloudfront.net/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59/2022/09/21/Gayatri-Ghanakota-.jpg"><img class="size-full wp-image-42973 alignleft" src="https://d2908q01vomqb2.cloudfront.net/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59/2022/09/21/Gayatri-Ghanakota-.jpg" alt="" width="100" height="96" /></a>Gayatri Ghanakota</strong> is a Sr. Machine Learning Engineer with AWS Professional Services. She is passionate about developing, deploying, and explaining AI/ML solutions across various domains. Prior to this role, she led multiple initiatives as a data scientist and ML engineer with top global firms in the financial and retail space. She holds a master’s degree in Computer Science with a specialization in Data Science from the University of Colorado, Boulder.</p><p class="c4"><strong><a href="https://d2908q01vomqb2.cloudfront.net/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59/2022/09/21/Rupinder-Grewal-.jpg"><img class="size-full wp-image-42978 alignleft" src="https://d2908q01vomqb2.cloudfront.net/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59/2022/09/21/Rupinder-Grewal-.jpg" alt="" width="100" height="89" /></a>Rupinder Grewal</strong> is a Sr. AI/ML Specialist Solutions Architect with AWS. He currently focuses on model serving and MLOps on SageMaker. Prior to this role, he worked as a machine learning engineer building and hosting models. Outside of work, he enjoys playing tennis and biking on mountain trails.</p><p class="c4"><strong><a href="https://d2908q01vomqb2.cloudfront.net/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59/2022/09/21/Ray-Li.png"><img class="size-full wp-image-42977 alignleft" src="https://d2908q01vomqb2.cloudfront.net/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59/2022/09/21/Ray-Li.png" alt="" width="100" height="103" /></a>Ray Li</strong> is a Sr. Data Scientist with AWS Professional Services. 
His specialty focuses on building and operationalizing AI/ML solutions for customers of varying sizes, ranging from startups to enterprise organizations. Outside of work, Ray enjoys fitness and traveling.</p></section>