AWS Machine Learning Blog 2024年10月29日
Import data from Google Cloud Platform BigQuery for no-code machine learning with Amazon SageMaker Canvas
index_new5.html
../../../zaker_core/zaker_tpl_static/wap/tpl_guoji1.html

 

文章介绍了一种从GCP BigQuery提取数据到SageMaker Canvas进行机器学习的架构方法,包括设置Athena进行联邦查询、使用SageMaker Data Wrangler进行数据准备等步骤,还提到了该方案的优势、数据集、前提条件及具体操作步骤。

文章详细阐述了使用Amazon Athena Federated Query从GCP BigQuery提取数据的过程,Athena作为中间服务,通过AWS Lambda函数和Secrets Manager实现与BigQuery的通信和认证,获取数据并返回给SageMaker Canvas。

SageMaker Canvas是一个无代码的ML界面,允许业务分析师从超过50个来源访问和导入数据,使用自然语言和300多个内置转换进行数据准备,构建和训练高精度模型,生成预测并部署模型到生产环境。

该解决方案具有无缝集成、安全访问和可扩展性等优点,能够消除额外的数据移动和复杂集成需求,确保BigQuery凭据安全存储和访问,且Lambda函数的无服务器特性和Athena处理大数据集的能力使方案具有可扩展性。

文中以电信移动电话运营商的合成数据集为例,展示了如何通过Athena将数据从GCP BigQuery导入SageMaker Canvas,包括创建GCP服务账号、下载私钥JSON文件并存储在Secrets Manager等前提步骤。

In the modern, cloud-centric business landscape, data is often scattered across numerous clouds and on-site systems. This fragmentation can complicate efforts by organizations to consolidate and analyze data for their machine learning (ML) initiatives.

This post presents an architectural approach to extract data from different cloud environments, such as Google Cloud Platform (GCP) BigQuery, without the need for data movement. This minimizes the complexity and overhead associated with moving data between cloud environments, enabling organizations to access and utilize their disparate data assets for ML projects.

We highlight the process of using Amazon Athena Federated Query to extract data from GCP BigQuery, using Amazon SageMaker Data Wrangler to perform data preparation, and then using the prepared data to build ML models within Amazon SageMaker Canvas, a no-code ML interface.

SageMaker Canvas allows business analysts to access and import data from over 50 sources, prepare data using natural language and over 300 built-in transforms, build and train highly accurate models, generate predictions, and deploy models to production without requiring coding or extensive ML experience.

Solution overview

The solution outlines two main steps:

After the data is imported into SageMaker Canvas, you can use the no-code interface to build ML models and generate predictions based on the imported data.

You can use SageMaker Canvas to build the initial data preparation routine and generate accurate predictions without writing code. However, as your ML needs evolve or require more advanced customization, you may want to transition from a no-code environment to a code-first approach. The integration between SageMaker Canvas and Amazon SageMaker Studio allows you to operationalize the data preparation routine for production-scale deployments. For more details, refer to Seamlessly transition between no-code and code-first machine learning with Amazon SageMaker Canvas and Amazon SageMaker Studio

The overall architecture, as seen below, demonstrates how to use AWS services to seamlessly access and integrate data from a GCP BigQuery data warehouse into SageMaker Canvas for building and deploying ML models.

The workflow includes the following steps:

    Within the SageMaker Canvas interface, the user composes a SQL query to run against the GCP BigQuery data warehouse. SageMaker Canvas relays this query to Athena, which acts as an intermediary service, facilitating the communication between SageMaker Canvas and BigQuery. Athena uses the Athena Google BigQuery connector, which uses a pre-built AWS Lambda function to enable Athena federated query capabilities. This Lambda function retrieves the necessary BigQuery credentials (service account private key) from AWS Secrets Manager for authentication purposes. After authentication, the Lambda function uses the retrieved credentials to query BigQuery and obtain the desired result set. It parses this result set and sends it back to Athena. Athena returns the queried data from BigQuery to SageMaker Canvas, where you can use it for ML model training and development purposes within the no-code interface.

This solution offers the following benefits:

In the next sections, we dive deeper into the technical implementation details and walk through a step-by-step demonstration of this solution.

Dataset

The steps outlined in this post provide an example of how to import data into SageMaker Canvas for no-code ML. In this example, we demonstrate how to import data through Athena from GCP BigQuery.

For our dataset, we use a synthetic dataset from a telecommunications mobile phone carrier. This sample dataset contains 5,000 records, where each record uses 21 attributes to describe the customer profile. The Churn column in the dataset indicates whether the customer left service (true/false). This Churn attribute is the target variable that the ML model should aim to predict.

The following screenshot shows an example of the dataset on the BigQuery console.

Prerequisites

Complete the following prerequisite steps:

    Create a service account in GCP and a service account key. Download the private key JSON file. Store the JSON file in Secrets Manager:
      On the Secrets Manager console, choose Secrets in the navigation pane, then choose Store a new secret. For Secret type¸ select Other type of secret. Copy the contents of the JSON file and enter it under Key/value pairs on the Plaintext tab.

    If you don’t have a SageMaker domain already created, create it along with the user profile. For instructions, see Quick setup to Amazon SageMaker. Make sure the user profile has permission to invoke Athena by confirming that the AWS Identity and Access Management (IAM) role has glue:GetDatabase and athena:GetDataCatalog permission on the resource. See the following example:
    {"Version": "2012-10-17","Statement": [{"Sid": "VisualEditor0","Effect": "Allow","Action": ["glue:GetDatabase","athena:GetDataCatalog"],"Resource": ["arn:aws:glue:*:<AWS account id>:catalog","arn:aws:glue:*:<AWS account id>:database/*","arn:aws:athena:*:<AWS account id>:datacatalog/*"]}]}

Register the Athena data source connector

Complete the following steps to set up the Athena data source connector:

    On the Athena console, choose Data sources in the navigation pane. Choose Create data source. On the Choose a data source page, search for and select Google BigQuery, then choose Next.

    On the Enter data source details page, provide the following information:
      For Data source name¸ enter a name. For Description, enter an optional description. For Lambda function, choose Create Lambda function to configure the connection.

    Under Application settings¸ enter the following details:
      For SpillBucket, enter the name of the bucket where the function can spill data. For GCPProjectID, enter the project ID within GCP. For LambdaFunctionName, enter the name of the Lambda function that you’re creating. For SecretNamePrefix, enter the secret name stored in Secrets Manager that contains GCP credentials.

    Choose Deploy.

You’re returned to the Enter data source details page.

    In the Connection details section, choose the refresh icon under Lambda function. Choose the Lambda function you just created. The ARN of the Lambda function is displayed. Optionally, for Tags, add key-value pairs to associate with this data source.

For more information about tags, see Tagging Athena resources.

    Choose Next. On the Review and create page, review the data source details, then choose Create data source.

The Data source details section of the page for your data source shows information about your new connector. You can now use the connector in your Athena queries. For information about using data connectors in queries, see Running federated queries.

To query from Athena, launch the Athena SQL editor and choose the data source you created. You should be able to run live queries against the BigQuery database.

Connect to SageMaker Canvas with Athena as a data source

To import data from Athena, complete the following steps:

    On the SageMaker Canvas console, choose Data Wrangler in the navigation pane. Choose Import data and prepare. Select the Tabular Choose Athena as the data source.

SageMaker Data Wrangler in SageMaker Canvas allows you to prepare, featurize, and analyze your data. You can integrate a SageMaker Data Wrangler data preparation flow into your ML workflows to simplify and streamline data preprocessing and feature engineering using little to no coding.

    Choose an Athena table in the left pane from AwsDataCatalog and drag and drop the table into the right pane.

    Choose Edit in SQL and enter the following SQL query:
SELECT state,account_length,area_code,phone,intl_plan,vmail_plan,vmail_message,day_mins,day_calls,day_charge,eve_mins,eve_calls,eve_charge,night_mins,night_calls,night_charge,intl_mins,intl_calls,intl_charge,custserv_calls,churn FROM "bigquery"."athenabigquery"."customer_churn" order by random() limit 50 ;

In the preceding query, bigquery is the data source name created in Athena, athenabigquery is the database name, and customer_churn is the table name.

    Choose Run SQL to preview the dataset and when you’re satisfied with the data, choose Import.

When working with ML, it’s crucial to randomize or shuffle the dataset. This step is essential because you may have access to millions or billions of data points, but you don’t necessarily need to use the entire dataset for training the model. Instead, you can limit the data to a smaller subset specifically for training purposes. After you’ve shuffled and prepared the data, you can begin the iterative process of data preparation, feature evaluation, model training, and ultimately hosting the trained model.

    You can process or export your data to a location that is suitable for your ML workflows. For example, you can export the transformed data as a SageMaker Canvas dataset and create an ML model from it. After you export your data, choose Create model to create an ML model from your data.

The data is imported into SageMaker Canvas as a dataset from the specific table in Athena. You can now use this dataset to create a model.

Train a model

After your data is imported, it shows up on the Datasets page in SageMaker Canvas. At this stage, you can build a model. To do so, complete the following steps:

    Select your dataset and choose Create a model.

    For Model name, enter your model name (for this post, my_first_model).

SageMaker Canvas enables you to create models for predictive analysis, image analysis, and text analysis.

    Because we want to categorize customers, select Predictive analysis for Problem type. Choose Create.

On the Build page, you can see statistics about your dataset, such as the percentage of missing values and mode of the data.

    For Target column, choose a column that you want to predict (for this post, churn).

SageMaker Canvas offers two types of models that can generate predictions. Quick build prioritizes speed over accuracy, providing a model in 2–15 minutes. Standard build prioritizes accuracy over speed, providing a model in 30 minutes–2 hours.

    For this example, choose Quick build.

After the model is trained, you can analyze the model accuracy.

The Overview tab shows us the column impact, or the estimated importance of each column in predicting the target column. In this example, the Night_calls column has the most significant impact in predicting if a customer will churn. This information can help the marketing team gain insights that lead to taking actions to reduce customer churn. For example, we can see that both low and high CustServ_Calls increase the likelihood of churn. The marketing team can take actions to help prevent customer churn based on these learnings. Examples include creating a detailed FAQ on websites to reduce customer service calls, and running education campaigns with customers on the FAQ that can keep engagement up.

Generate predictions

On the Predict tab, you can generate both batch predictions and single predictions. Complete the following steps to generate a batch prediction:

    Download the following sample inference dataset for generating predictions. To test batch predictions, choose Batch prediction.

SageMaker Canvas allows you to generate batch predictions either manually or automatically on a schedule. To learn how to automate batch predictions on a schedule, refer to Manage automations.

    For this post, choose Manual. Upload the file you downloaded. Choose Generate predictions.

After a few seconds, the prediction is complete, and you can choose View to see the prediction.

Optionally, choose Download to download a CSV file containing the full output. SageMaker Canvas will return a prediction for each row of data and the probability of the prediction being correct.

Optionally, you can deploy your models to an endpoint to make predictions. For more information, refer to Deploy your models to an endpoint.

Clean up

To avoid future charges, log out of SageMaker Canvas.

Conclusion

In this post, we showcased a solution to extract the data from BigQuery using Athena federated queries and a sample dataset. We then used the extracted data to build an ML model using SageMaker Canvas to predict customers at risk of churning—without writing code. SageMaker Canvas enables business analysts to build and deploy ML models effortlessly through its no-code interface, democratizing ML across the organization. This enables you to harness the power of advanced analytics and ML to drive business insights and innovation, without the need for specialized technical skills.

For more information, see Query any data source with Amazon Athena’s new federated query and Import data from over 40 data sources for no-code machine learning with Amazon SageMaker Canvas. If you’re new to SageMaker Canvas, refer to Build, Share, Deploy: how business analysts and data scientists achieve faster time-to-market using no-code ML and Amazon SageMaker Canvas.


About the authors

Amit Gautam is an AWS senior solutions architect supporting enterprise customers in the UK on their cloud journeys, providing them with architectural advice and guidance that helps them achieve their business outcomes.

Sujata Singh is an AWS senior solutions architect supporting enterprise customers in the UK on their cloud journeys, providing them with architectural advice and guidance that helps them achieve their business outcomes.

Fish AI Reader

Fish AI Reader

AI辅助创作,多种专业模板,深度分析,高质量内容生成。从观点提取到深度思考,FishAI为您提供全方位的创作支持。新版本引入自定义参数,让您的创作更加个性化和精准。

FishAI

FishAI

鱼阅,AI 时代的下一个智能信息助手,助你摆脱信息焦虑

联系邮箱 441953276@qq.com

相关标签

GCP BigQuery SageMaker Canvas 机器学习 数据提取
相关文章