AWS Machine Learning Blog | March 28
Integrating custom dependencies in Amazon SageMaker Canvas workflows

This post explains how to incorporate external dependencies into Amazon SageMaker Canvas. Although Canvas is a low-code no-code ML platform that guides users from data preparation through model deployment, some projects require specialized dependencies and libraries that aren't included by default. Using a dataset of computer screen shipping information as the example, the post shows how to package a custom script and its mpmath dependency into a .zip file, upload it to Amazon S3, use the uploaded code in a SageMaker Data Wrangler flow in Canvas to compute a total shipping distance, and then train and export a model.

When implementing machine learning (ML) workflows in Amazon SageMaker Canvas, organizations might need to consider external dependencies required for their specific use cases. Although SageMaker Canvas provides powerful no-code and low-code capabilities for rapid experimentation, some projects might require specialized dependencies and libraries that aren’t included by default in SageMaker Canvas. This post provides an example of how to incorporate code that relies on external dependencies into your SageMaker Canvas workflows.

Amazon SageMaker Canvas is a low-code no-code (LCNC) ML platform that guides users through every stage of the ML journey, from initial data preparation to final model deployment. Without writing a single line of code, users can explore datasets, transform data, build models, and generate predictions.

SageMaker Canvas offers comprehensive data wrangling capabilities that help you prepare your data for model building.

In this post, we demonstrate how to incorporate dependencies stored in Amazon Simple Storage Service (Amazon S3) within an Amazon SageMaker Data Wrangler flow. Using this approach, you can run custom scripts that depend on modules not inherently supported by SageMaker Canvas.

Solution overview

To showcase the integration of custom scripts and dependencies from Amazon S3 into SageMaker Canvas, we explore the following example workflow.

The solution follows three main steps:

1. Upload custom scripts and dependencies to Amazon S3.
2. Use SageMaker Data Wrangler in SageMaker Canvas to transform your data using the uploaded code.
3. Train and export the model.

The following diagram is the architecture for the solution.

In this example, we work with two complementary datasets available in SageMaker Canvas that contain shipping information for computer screen deliveries. By joining these datasets, we create a comprehensive dataset that captures various shipping metrics and delivery outcomes. Our goal is to build a predictive model that can determine whether future shipments will arrive on time based on historical shipping patterns and characteristics.

Prerequisites

As a prerequisite, you need access to Amazon S3 and Amazon SageMaker AI. If you don’t already have a SageMaker AI domain configured in your account, you also need permissions to create a SageMaker AI domain.

Create the data flow

To create the data flow, follow these steps:

1. On the Amazon SageMaker AI console, in the navigation pane, under Applications and IDEs, select Canvas, as shown in the following screenshot. You might need to create a SageMaker domain if you haven't done so already. After your domain is created, choose Open Canvas.
2. In Canvas, select the Datasets tab and select canvas-sample-shipping-logs.csv, as shown in the following screenshot. After the preview appears, choose + Create a data flow.

The initial data flow will open with one source and one data type.

3. At the top right of the screen, select Add data → tabular. Choose Canvas Datasets as the source and select canvas-sample-product-descriptions.csv. Choose Next, as shown in the following screenshot, then choose Import.
4. After both datasets have been added, select the plus sign. From the dropdown menu, choose Combine data. From the next dropdown menu, choose Join.
5. To perform an inner join on the ProductId column, in the right-hand menu, under Join type, choose Inner join. Under Join keys, choose ProductId, as shown in the following screenshot.
6. After the datasets have been joined, select the plus sign. In the dropdown menu, select + Add transform. A preview of the dataset will open.

The dataset contains XShippingDistance (long) and YShippingDistance (long) columns. For our purposes, we want to use a custom function that will find the total distance using the X and Y coordinates and then drop the individual coordinate columns. For this example, we find the total distance using a function that relies on the mpmath library.
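For example, a row with XShippingDistance = 3 and YShippingDistance = 4 would get TotalDistance = sqrt(3² + 4²) = 5.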

7. To call the custom function, select + Add transform. In the dropdown menu, select Custom transform. Change the editor to Python (Pandas) and try to run the following function from the Python editor:
from mpmath import sqrt  # Import sqrt from mpmath

def calculate_total_distance(df, x_col="XShippingDistance", y_col="YShippingDistance", new_col="TotalDistance"):
    # Use mpmath's sqrt to calculate the total distance for each row
    df[new_col] = df.apply(lambda row: float(sqrt(row[x_col]**2 + row[y_col]**2)), axis=1)

    # Drop the original x and y columns
    df = df.drop(columns=[x_col, y_col])

    return df

df = calculate_total_distance(df)

Running the function produces the following error: ModuleNotFoundError: No module named 'mpmath', as shown in the following screenshot.

This error occurs because mpmath isn’t a module that is inherently supported by SageMaker Canvas. To use a function that relies on this module, we need to approach the use of a custom function differently.
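If you want to confirm up front whether a given module is available in the Canvas runtime, a minimal check such as the following (an illustrative sketch, not part of the original workflow) can be run in the custom transform editor; it prints None when the module isn't installed:

import importlib.util

# Prints a ModuleSpec if mpmath is importable in this runtime, or None if it is not
print(importlib.util.find_spec("mpmath"))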

Zip the script and dependencies

To use a function that relies on a module that isn’t natively supported in Canvas, the custom script must be zipped with the module(s) it relies on. For this example, we used our local integrated development environment (IDE) to create a script.py that relies on the mpmath library.

The script.py file contains two functions: one function that is compatible with the Python (Pandas) runtime (function calculate_total_distance), and one that is compatible with the Python (Pyspark) runtime (function udf_total_distance).

def calculate_total_distance(df, x_col="XShippingDistance", y_col="YShippingDistance", new_col="TotalDistance"):
    from mpmath import sqrt  # Import sqrt from mpmath
    # Use mpmath's sqrt to calculate the total distance for each row
    df[new_col] = df.apply(lambda row: float(sqrt(row[x_col]**2 + row[y_col]**2)), axis=1)
    # Drop the original x and y columns
    df = df.drop(columns=[x_col, y_col])
    return df

def udf_total_distance(df, x_col="XShippingDistance", y_col="YShippingDistance", new_col="TotalDistance"):
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import udf
    from pyspark.sql.types import FloatType

    spark = SparkSession.builder \
        .master("local") \
        .appName("DistanceCalculation") \
        .getOrCreate()

    def calculate_distance(x, y):
        import sys
        # Add the path to the extracted mpmath package
        mpmath_path = "/tmp/maths"
        if mpmath_path not in sys.path:
            sys.path.insert(0, mpmath_path)
        from mpmath import sqrt
        return float(sqrt(x**2 + y**2))

    # Register and apply the UDF
    distance_udf = udf(calculate_distance, FloatType())
    df = df.withColumn(new_col, distance_udf(df[x_col], df[y_col]))
    df = df.drop(x_col, y_col)
    return df

To make sure the script can run, install mpmath into the same directory as script.py, for example by running pip install mpmath --target . from that directory.

Run zip -r my_project.zip . from that directory to create a .zip file containing the script and the mpmath installation. The current directory now contains the .zip file, the Python script, and the installed package the script depends on, as shown in the following screenshot.
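If you prefer to script the packaging step instead of running the shell commands manually, a rough Python equivalent looks like the following sketch (the my_project folder name is an assumption; it is expected to already contain script.py):

import shutil
import subprocess

# Install mpmath next to script.py inside the project folder
subprocess.run(["pip", "install", "mpmath", "--target", "my_project"], check=True)

# Create my_project.zip with script.py and the mpmath package at the archive root
shutil.make_archive("my_project", "zip", "my_project")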

Upload to Amazon S3

After creating the .zip file, upload it to an Amazon S3 bucket.
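You can upload it using the Amazon S3 console, the AWS CLI, or an SDK. For example, here is a minimal boto3 sketch; the bucket name and key match the values referenced later in the custom transform code, but yours will differ:

import boto3

# Upload the archive to S3 (replace the bucket name and key with your own values)
s3 = boto3.client("s3")
s3.upload_file("my_project.zip", "canvasdatabuckett", "functions/my_project.zip")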

After the zip file has been uploaded to Amazon S3, it’s accessible in SageMaker Canvas.

Run the custom script

Return to the data flow in SageMaker Canvas and replace the prior custom function code with the following code and choose Update.

import zipfile
import boto3
import sys
from pathlib import Path
import shutil
import importlib.util

def load_script_and_dependencies(bucket_name, zip_key, extract_to):
    """
    Downloads a zip file from S3, unzips it, and ensures dependencies are available.

    Args:
        bucket_name (str): Name of the S3 bucket.
        zip_key (str): Key for the .zip file in the bucket.
        extract_to (str): Directory to extract files to.

    Returns:
        str: Path to the extracted folder containing the script and dependencies.
    """
    s3_client = boto3.client("s3")

    # Local path for the zip file
    zip_local_path = '/tmp/dependencies.zip'

    # Download the .zip file from S3
    s3_client.download_file(bucket_name, zip_key, zip_local_path)
    print(f"Downloaded zip file from S3: {zip_key}")

    # Unzip the file
    try:
        with zipfile.ZipFile(zip_local_path, 'r') as zip_ref:
            zip_ref.extractall(extract_to)
            print(f"Extracted files to {extract_to}")
    except Exception as e:
        raise RuntimeError(f"Failed to extract zip file: {e}")

    # Add the extracted folder to the Python path
    if extract_to not in sys.path:
        sys.path.insert(0, extract_to)

    return extract_to

def call_function_from_script(script_path, function_name, df):
    """
    Dynamically loads a function from a Python script using importlib.
    """
    try:
        # Get the script name from the path
        module_name = script_path.split('/')[-1].replace('.py', '')

        # Load the module specification
        spec = importlib.util.spec_from_file_location(module_name, script_path)
        if spec is None:
            raise ImportError(f"Could not load specification for module {module_name}")

        # Create the module
        module = importlib.util.module_from_spec(spec)
        sys.modules[module_name] = module

        # Execute the module
        spec.loader.exec_module(module)

        # Get the function from the module
        if not hasattr(module, function_name):
            raise AttributeError(f"Function '{function_name}' not found in the script.")

        loaded_function = getattr(module, function_name)

        # Clean up: remove the module from sys.modules after execution
        del sys.modules[module_name]

        # Call the function
        return loaded_function(df)

    except Exception as e:
        raise RuntimeError(f"Error loading or executing function: {e}")

bucket_name = 'canvasdatabuckett'  # S3 bucket name
zip_key = 'functions/my_project.zip'  # S3 path to the zip file with our custom dependency
script_name = 'script.py'  # Name of the script in the zip file
function_name = 'calculate_total_distance'  # Name of the function to call from our script
extract_to = '/tmp/maths'  # Local path for our custom script and dependencies

# Step 1: Load the script and dependencies
extracted_path = load_script_and_dependencies(bucket_name, zip_key, extract_to)

# Step 2: Call the function from the script
script_path = f"{extracted_path}/{script_name}"
df = call_function_from_script(script_path, function_name, df)

This example code unzips the .zip file and adds the required dependencies to the local path so they’re available to the function at run time. Because mpmath was added to the local path, you can now call a function that relies on this external library.

The preceding code runs using the Python (Pandas) runtime and calculate_total_distance function. To use the Python (Pyspark) runtime, update the function_name variable to call the udf_total_distance function instead.
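For example, switching runtimes only requires changing that one assignment:

function_name = 'udf_total_distance'  # Name of the PySpark-compatible function in script.py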

Complete the data flow

As a last step, remove irrelevant columns before training the model. Follow these steps:

1. On the SageMaker Canvas console, select + Add transform. From the dropdown menu, select Manage columns.
2. Under Transform, choose Drop column.
3. Under Columns to drop, add ProductId_0, ProductId_1, and OrderID, as shown in the following screenshot.

The final dataset should contain 13 columns. The complete data flow is pictured in the following image.

Train the model

To train the model, follow these steps:

1. At the top right of the page, select Create model, and name your dataset and model.
2. Select Predictive analysis as the problem type and OnTimeDelivery as the target column, as shown in the following screenshot.

When building the model, you can choose to run a Quick build or a Standard build. A Quick build prioritizes speed over accuracy and produces a trained model in less than 20 minutes. A Standard build prioritizes accuracy over latency, but the model takes longer to train.

Results

After the model build is complete, you can view the model's accuracy, along with metrics such as F1 score, precision, and recall. In the case of a Standard build, the model achieved 94.5% accuracy.

After the model training is complete, there are four ways you can use your model:

    Deploy the model directly from SageMaker Canvas to an endpoint
    Add the model to the SageMaker Model Registry
    Export your model to a Jupyter Notebook
    Send your model to Amazon QuickSight for use in dashboard visualizations

Clean up

To manage costs and prevent additional workspace charges, choose Log out to sign out of SageMaker Canvas when you’re done using the application, as shown in the following screenshot. You can also configure SageMaker Canvas to automatically shut down when idle.

If you created an S3 bucket specifically for this example, you might also want to empty and delete your bucket.

Summary

In this post, we demonstrated how you can upload custom dependencies to Amazon S3 and integrate them into SageMaker Canvas workflows. By walking through a practical example of implementing a custom distance calculation function with the mpmath library, we showed how to:

    Package custom code and dependencies into a .zip file
    Store and access these dependencies from Amazon S3
    Implement custom data transformations in SageMaker Data Wrangler
    Train a predictive model using the transformed data

This approach means that data scientists and analysts can extend SageMaker Canvas capabilities beyond the more than 300 included functions.

To try custom transforms yourself, refer to the Amazon SageMaker Canvas documentation and sign in to SageMaker Canvas today. For additional insights into how you can optimize your SageMaker Canvas implementation, explore the related posts on the AWS Machine Learning Blog.


About the Author

Nadhya Polanco is an Associate Solutions Architect at AWS based in Brussels, Belgium. In this role, she supports organizations looking to incorporate AI and Machine Learning into their workloads. In her free time, Nadhya enjoys indulging in her passion for coffee and exploring new destinations.
