AWS Machine Learning Blog, August 16, 2024
Perform generative AI-powered data prep and no-code ML over any size of data using Amazon SageMaker Canvas

Amazon SageMaker Canvas supports large-scale datasets: with a few simple operations, you can complete data preparation, model creation, and inference, improving your organization's ability to get value from its data.

🎯 Amazon SageMaker Canvas now supports petabyte-scale datasets. You can interactively prepare large datasets, create end-to-end data flows, and run automated machine learning experiments. With more than 50 connectors, an intuitive Chat for data prep interface, and petabyte-scale support, it provides a scalable low-code/no-code ML solution for real-world enterprise use cases.

📊 For data preparation, you first import data from Amazon S3 and interact with it using SageMaker Data Wrangler, working on a sample before scaling the data flow to the full dataset, which saves time and improves performance. You can also review the Data Quality Insights Report to assess data quality and analyze its findings, which guide the subsequent data preparation steps.

💻 During preparation, you can use the Chat for data prep feature in SageMaker Canvas to simplify data preparation with natural language prompts, or apply LCNC transforms through the SageMaker Data Wrangler UI. Optionally, you can export the data to Amazon S3 and process the entire dataset with an EMR Serverless job.

🎁 After data preparation, you can create a model at the end of the flow and configure it during creation. Once the model is built, the Analyze page shows performance information on tabs such as Overview, Scoring, and Advanced metrics. Finally, you can run inference to generate batch predictions on the resulting dataset.

Amazon SageMaker Canvas now empowers enterprises to harness the full potential of their data by enabling support of petabyte-scale datasets. Starting today, you can interactively prepare large datasets, create end-to-end data flows, and invoke automated machine learning (AutoML) experiments on petabytes of data—a substantial leap from the previous 5 GB limit. With over 50 connectors, an intuitive Chat for data prep interface, and petabyte support, SageMaker Canvas provides a scalable, low-code/no-code (LCNC) ML solution for handling real-world, enterprise use cases.

Organizations often struggle to extract meaningful insights and value from their ever-growing volume of data. You need data engineering expertise and time to develop the proper scripts and pipelines to wrangle, clean, and transform data. Then you must experiment with numerous models and hyperparameters requiring domain expertise. Afterward, you need to manage complex clusters to process and train your ML models over these large-scale datasets.

Starting today, you can prepare your petabyte-scale data and explore many ML models with AutoML, through chat and with a few clicks. In this post, we show you how to complete all these steps without writing code, using the new integration between SageMaker Canvas and Amazon EMR Serverless.

Solution overview

For this post, we use a sample dataset: a 33 GB CSV file containing flight purchase transactions from Expedia between April 16, 2022, and October 5, 2022. We use the features to predict the base fare of a ticket from attributes such as the flight date, distance, and seat type.

In the following sections, we demonstrate how to import and prepare the data, optionally export the data, create a model, and run inference, all in SageMaker Canvas.

Prerequisites

You can follow along by completing the following prerequisites:

    Set up SageMaker Canvas.
    Download the dataset from Kaggle and upload it to an Amazon Simple Storage Service (Amazon S3) bucket.
    Add emr-serverless as a trusted entity to the SageMaker Canvas execution role to allow Amazon EMR processing jobs (a programmatic sketch follows this list).
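If you prefer to script the trust-policy change from the last prerequisite, the following is a minimal boto3 sketch. The role name is a placeholder for your actual SageMaker Canvas execution role, and the policy must keep the existing sagemaker.amazonaws.com statement:

```python
import json

import boto3

iam = boto3.client("iam")

# Placeholder: substitute your SageMaker Canvas execution role name.
role_name = "AmazonSageMaker-ExecutionRole-example"

# Trust policy allowing both SageMaker and EMR Serverless to assume the role.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "Service": [
                    "sagemaker.amazonaws.com",
                    "emr-serverless.amazonaws.com",
                ]
            },
            "Action": "sts:AssumeRole",
        }
    ],
}

iam.update_assume_role_policy(
    RoleName=role_name,
    PolicyDocument=json.dumps(trust_policy),
)
```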

Import data in SageMaker Canvas

We start by importing the data from Amazon S3 using Amazon SageMaker Data Wrangler in SageMaker Canvas. Complete the following steps:

    In SageMaker Canvas, choose Data Wrangler in the navigation pane.
    On the Data flows tab, choose Tabular on the Import and prepare dropdown menu.
    Enter the S3 URI for the file and choose Go, then choose Next.
    Give your dataset a name, choose Random for Sampling method, then choose Import.

Importing data from the SageMaker Data Wrangler flow allows you to interact with a sample of the data before scaling the data preparation flow to the full dataset. This saves time and improves performance because you don’t need to work with the entire dataset during preparation. You can later use EMR Serverless to handle the heavy lifting. When SageMaker Data Wrangler finishes importing, you can start transforming the dataset.
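For intuition, working on a random sample is conceptually similar to the following pandas sketch. The file name is illustrative, and Canvas performs the sampling for you when you choose Random as the sampling method:

```python
import random

import pandas as pd

# Keep roughly 1% of rows from a large CSV without loading the whole
# file into memory; the header row (index 0) is always kept.
sample = pd.read_csv(
    "flight_purchases.csv",  # hypothetical local copy of the dataset
    skiprows=lambda i: i > 0 and random.random() > 0.01,
)
print(sample.shape)
```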

After you import the dataset, you can first look at the Data Quality Insights Report to see recommendations from SageMaker Canvas on how to improve the data quality and therefore improve the model’s performance.

    In the flow, choose the options menu (three dots) for the node, then choose Get data insights.
    Give your analysis a name, select Regression for Problem type, choose baseFare for Target column, select Sampled dataset for Data Size, then choose Create.

Assessing the data quality and analyzing the report’s findings is often the first step because it can guide the subsequent data preparation steps. Within the report, you will find dataset statistics, high-priority warnings about target leakage, skewness, and anomalies, and a feature summary.
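For a sense of what the report automates, a few of its checks can be approximated in pandas. This is a hedged sketch, assuming the sampled data has been exported to a CSV file (the file name is illustrative):

```python
import pandas as pd

df = pd.read_csv("flight_purchases_sample.csv")  # hypothetical sampled export

missing = df.isna().mean().sort_values(ascending=False)  # missing-value ratios
skewness = df["baseFare"].skew()                         # target skewness
duplicates = df.duplicated().sum()                       # duplicate rows

# A near-perfect correlation between a feature and the target can hint at
# target leakage, which the report flags as a high-priority warning.
corr = df.corr(numeric_only=True)["baseFare"].abs().sort_values(ascending=False)

print(f"Target skewness: {skewness:.2f}, duplicate rows: {duplicates}")
print(corr.head())
```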

Prepare the data with SageMaker Canvas

Now that you understand your dataset characteristics and potential issues, you can use the Chat for data prep feature in SageMaker Canvas to simplify data preparation with natural language prompts. This generative artificial intelligence (AI)-powered capability reduces the time, effort, and expertise required for the often complex tasks of data preparation.

    Choose the .flow file on the top banner to go back to your flow canvas.
    Choose the options menu for the node, then choose Chat for data prep.

For our first example, converting searchDate and flightDate to datetime format might help us perform date manipulations and extract useful features such as year, month, day, and the difference in days between searchDate and flightDate. These features can capture temporal patterns in the data that can influence the baseFare.

    Provide a prompt like “Convert searchDate and flightDate to datetime format” to view the code and choose Add to steps.
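Chat for data prep shows you the generated code before you add the step. The following is a minimal pandas sketch of an equivalent transform, including the derived temporal features described above (the code Canvas generates may differ):

```python
import pandas as pd

df = pd.read_csv("flight_purchases_sample.csv")  # hypothetical sampled data

# Convert the string columns to datetime.
df["searchDate"] = pd.to_datetime(df["searchDate"])
df["flightDate"] = pd.to_datetime(df["flightDate"])

# Derived temporal features: booking lead time and calendar parts.
df["daysToFlight"] = (df["flightDate"] - df["searchDate"]).dt.days
df["flightMonth"] = df["flightDate"].dt.month
df["flightDayOfWeek"] = df["flightDate"].dt.dayofweek
```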

In addition to data preparation using the chat UI, you can use LCNC transforms with the SageMaker Data Wrangler UI to transform your data. For example, we use one-hot encoding as a technique to convert categorical data into numerical format using the LCNC interface.

    Add the transform Encode categorical. Choose One-hot encode for Transform and add the following columns: startingAirport, destinationAirport, fareBasisCode, segmentsArrivalAirportCode, segmentsDepartureAirportCode, segmentsAirlineName, segmentsAirlineCode, segmentsEquipmentDescription, and segmentsCabinCode.

You can use the advanced search and filter option in SageMaker Canvas to select columns that are of String data type to simplify the process.
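A hedged pandas sketch of the same two ideas: filter for String (object) columns, then one-hot encode the columns listed in the step above (the file name is illustrative):

```python
import pandas as pd

df = pd.read_csv("flight_purchases_sample.csv")  # hypothetical sampled data

# Mirror the String-type filter: find the categorical (object) columns.
string_cols = df.select_dtypes(include="object").columns.tolist()
print(string_cols)

# One-hot encode the categorical columns chosen in the step above.
categorical_cols = [
    "startingAirport", "destinationAirport", "fareBasisCode",
    "segmentsArrivalAirportCode", "segmentsDepartureAirportCode",
    "segmentsAirlineName", "segmentsAirlineCode",
    "segmentsEquipmentDescription", "segmentsCabinCode",
]
df = pd.get_dummies(df, columns=categorical_cols)
```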

Refer to the SageMaker Canvas blog for other examples using SageMaker Data Wrangler. For this post, we simplify our efforts with these two steps, but we encourage you to use both chat and transforms to add data preparation steps on your own. In our testing, we successfully ran all our data preparation steps through chat using natural language prompts similar to the example above.

When these steps are complete, you can move to the next step of processing the full dataset and creating a model.

(Optional) Export your data in Amazon S3 using an EMR Serverless job

You can process the entire 33 GB dataset by running the data flow using EMR Serverless for the data preparation job without worrying about the infrastructure.

    From the last node in the flow diagram, choose Export and Export data to Amazon S3.
    Provide a dataset name and output location. It’s recommended to keep Auto job configuration selected unless you want to change any of the Amazon EMR or SageMaker Processing configurations. (If your data is larger than 5 GB, data processing will run in EMR Serverless; otherwise, it will run within the SageMaker Canvas workspace.)
    Under EMR Serverless, provide a job name and choose Export.

You can view the job status in SageMaker Canvas on the Data Wrangler page on the Jobs tab.

You can also view the job status on the Amazon EMR Studio console by choosing Applications under Serverless in the navigation pane.
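You can also check the job programmatically through the EMR Serverless API. The following is a minimal boto3 sketch; the application ID is a placeholder you can read off the EMR Studio console:

```python
import boto3

emr = boto3.client("emr-serverless")

# List applications to find the one that ran the Canvas export job.
for app in emr.list_applications()["applications"]:
    print(app["id"], app.get("name"), app["state"])

# Inspect the job runs for a given application (placeholder ID).
app_id = "00example0example"
for run in emr.list_job_runs(applicationId=app_id)["jobRuns"]:
    print(run["id"], run.get("name"), run["state"])
```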

Create a model

You can also create a model at the end of your flow.

    Choose Create model from the node options. SageMaker Canvas will create a dataset and then navigate you to model creation.
    Provide a dataset and model name, select Predictive analysis for Problem type, choose baseFare as the target column, then choose Export and create model.

The model creation process will take a couple of minutes to complete.

    Choose My Models in the navigation pane.
    Choose the model you just exported and navigate to version 1.
    Under Model type, choose Configure model.
    Select Numeric model type, then choose Save.
    On the dropdown menu, choose Quick Build to start the build process.

When the build is complete, on the Analyze page, you can view the following tabs: Overview, Scoring, and Advanced metrics.
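For reference, the Advanced metrics tab reports standard regression metrics for this problem type. The following is a hedged scikit-learn sketch of how such metrics are computed; the arrays are illustrative stand-ins for held-out actual and predicted fares:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Illustrative values; in practice these come from a held-out test set.
y_true = np.array([120.0, 85.5, 240.0, 99.9])
y_pred = np.array([115.0, 90.0, 250.0, 105.0])

mae = mean_absolute_error(y_true, y_pred)
rmse = np.sqrt(mean_squared_error(y_true, y_pred))
r2 = r2_score(y_true, y_pred)

print(f"MAE: {mae:.2f}, RMSE: {rmse:.2f}, R2: {r2:.3f}")
```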

Run inference

In this section, we walk through the steps to run batch predictions against the generated dataset.

    On the Analyze page, choose Predict.
    To generate predictions on your test dataset, choose Manual.
    Select the test dataset you created and choose Generate predictions.
    When the predictions are ready, either choose View in the pop-up message at the bottom of the page or navigate to the Status column to choose Preview on the options menu (three dots).

You’re now able to review the predictions.
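If you download the prediction output as a CSV, you can inspect it programmatically as well. This is a minimal pandas sketch; the file name is illustrative, and the exact name of the prediction column Canvas appends may differ:

```python
import pandas as pd

# Hypothetical download of the batch prediction output.
preds = pd.read_csv("canvas_batch_predictions.csv")

# Canvas appends a column with the predicted target value; inspect the
# first rows to see the prediction alongside the input features.
print(preds.head())
```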

You have now used the generative AI data preparation capabilities in SageMaker Canvas to prepare a large dataset, trained a model using AutoML techniques, and run batch predictions at scale. All of this was done with a few clicks and using a natural language interface.

Clean up

To avoid incurring future session charges, log out of SageMaker Canvas. To log out, choose Log out in the navigation pane of the SageMaker Canvas application.

When you log out of SageMaker Canvas, your models and datasets aren’t affected, but SageMaker Canvas cancels any Quick build tasks. If you log out of SageMaker Canvas while running a Quick build, your build might be interrupted until you relaunch the application. When you relaunch, SageMaker Canvas automatically restarts the build. Standard builds continue even if you log out.

Conclusion

The introduction of petabyte-scale AutoML support within SageMaker Canvas marks a significant milestone in the democratization of ML. By combining the power of generative AI, AutoML, and the scalability of EMR Serverless, we’re empowering organizations of all sizes to unlock insights and drive business value from even the largest and most complex datasets.

The benefits of ML are no longer confined to the domain of highly specialized experts. SageMaker Canvas is revolutionizing the way businesses approach data and AI, putting the power of predictive analytics and data-driven decision-making into the hands of everyone. Explore the future of no-code ML with SageMaker Canvas today.


About the authors

Bret Pontillo is a Sr. Solutions Architect at AWS. He works closely with enterprise customers building data lakes and analytical applications on the AWS platform. In his free time, Bret enjoys traveling, watching sports, and trying new restaurants.

Polaris Jhandi is a Cloud Application Architect with AWS Professional Services. He has a background in AI/ML & big data. He is currently working with customers to migrate their legacy Mainframe applications to the Cloud.

Peter Chung is a Solutions Architect serving enterprise customers at AWS. He loves to help customers use technology to solve business problems on various topics like cutting costs and leveraging artificial intelligence. He wrote a book on AWS FinOps, and enjoys reading and building solutions.
