Import data from Google Cloud Platform BigQuery for no-code machine learning with Amazon SageMaker Canvas

In the modern, cloud-centric business landscape, data is often scattered across numerous clouds and on-site systems. This fragmentation can complicate efforts by organizations to consolidate and analyze data for their machine learning (ML) initiatives.

This post presents an architectural approach to extract data from different cloud environments, such as Google Cloud Platform (GCP) BigQuery, without the need for data movement. This minimizes the complexity and overhead associated with moving data between cloud environments, enabling organizations to access and utilize their disparate data assets for ML projects.

We highlight the process of using Amazon Athena Federated Query to extract data from GCP BigQuery, using Amazon SageMaker Data Wrangler to perform data preparation, and then using the prepared data to build ML models within Amazon SageMaker Canvas, a no-code ML interface.

SageMaker Canvas allows business analysts to access and import data from over 50 sources, prepare data using natural language and over 300 built-in transforms, build and train highly accurate models, generate predictions, and deploy models to production without requiring coding or extensive ML experience.

Solution overview

The solution outlines two main steps:

Amazon Athena

After the data is imported into SageMaker Canvas, you can use the no-code interface to build ML models and generate predictions based on the imported data.

You can use SageMaker Canvas to build the initial data preparation routine and generate accurate predictions without writing code. However, as your ML needs evolve or require more advanced customization, you may want to transition from a no-code environment to a code-first approach. The integration between SageMaker Canvas and Amazon SageMaker Studio allows you to operationalize the data preparation routine for production-scale deployments. For more details, refer to Seamlessly transition between no-code and code-first machine learning with Amazon SageMaker Canvas and Amazon SageMaker Studio

The overall architecture, as seen below, demonstrates how to use AWS services to seamlessly access and integrate data from a GCP BigQuery data warehouse into SageMaker Canvas for building and deploying ML models.

The workflow includes the following steps:

Athena Google BigQuery connector

AWS Lambda

Athena federated query

AWS Secrets Manager

This solution offers the following benefits:

Seamless integration

Secure access

Scalability

In the next sections, we dive deeper into the technical implementation details and walk through a step-by-step demonstration of this solution.

Dataset

The steps outlined in this post provide an example of how to import data into SageMaker Canvas for no-code ML. In this example, we demonstrate how to import data through Athena from GCP BigQuery.

For our dataset, we use a synthetic dataset from a telecommunications mobile phone carrier. This sample dataset contains 5,000 records, where each record uses 21 attributes to describe the customer profile. The Churn column in the dataset indicates whether the customer left service (true/false). This Churn attribute is the target variable that the ML model should aim to predict.

The following screenshot shows an example of the dataset on the BigQuery console.

Prerequisites

Complete the following prerequisite steps:

service account

service account key

Secrets

Store a new secret

Secret type

Other type of secret

Key/value pairs

Plaintext

Quick setup to Amazon SageMaker

AWS Identity and Access Management

glue:GetDatabase

athena:GetDataCatalog

{"Version": "2012-10-17","Statement": [{"Sid": "VisualEditor0","Effect": "Allow","Action": ["glue:GetDatabase","athena:GetDataCatalog"],"Resource": ["arn:aws:glue:*:<AWS account id>:catalog","arn:aws:glue:*:<AWS account id>:database/*","arn:aws:athena:*:<AWS account id>:datacatalog/*"]}]}

Register the Athena data source connector

Complete the following steps to set up the Athena data source connector:

Data sources

Create data source

Choose a data source

Google BigQuery

Next

Enter data source details

Data source name

Description

Lambda function

Create Lambda function

Application settings

SpillBucket

GCPProjectID

LambdaFunctionName

SecretNamePrefix

Deploy

You’re returned to the Enter data source details page.

Connection details

Lambda function

Tags

For more information about tags, see Tagging Athena resources.

Next

Review and create

Create data source

The Data source details section of the page for your data source shows information about your new connector. You can now use the connector in your Athena queries. For information about using data connectors in queries, see Running federated queries.

To query from Athena, launch the Athena SQL editor and choose the data source you created. You should be able to run live queries against the BigQuery database.

Connect to SageMaker Canvas with Athena as a data source

To import data from Athena, complete the following steps:

Data Wrangler

Import data and prepare

Tabular

Athena

SageMaker Data Wrangler in SageMaker Canvas allows you to prepare, featurize, and analyze your data. You can integrate a SageMaker Data Wrangler data preparation flow into your ML workflows to simplify and streamline data preprocessing and feature engineering using little to no coding.

Choose an Athena table in the left pane from AwsDataCatalog and drag and drop the table into the right pane.

Edit in SQL

SELECT state,account_length,area_code,phone,intl_plan,vmail_plan,vmail_message,day_mins,day_calls,day_charge,eve_mins,eve_calls,eve_charge,night_mins,night_calls,night_charge,intl_mins,intl_calls,intl_charge,custserv_calls,churn FROM "bigquery"."athenabigquery"."customer_churn" order by random() limit 50 ;

In the preceding query, bigquery is the data source name created in Athena, athenabigquery is the database name, and customer_churn is the table name.

Run SQL

Import

When working with ML, it’s crucial to randomize or shuffle the dataset. This step is essential because you may have access to millions or billions of data points, but you don’t necessarily need to use the entire dataset for training the model. Instead, you can limit the data to a smaller subset specifically for training purposes. After you’ve shuffled and prepared the data, you can begin the iterative process of data preparation, feature evaluation, model training, and ultimately hosting the trained model.

Create model

The data is imported into SageMaker Canvas as a dataset from the specific table in Athena. You can now use this dataset to create a model.

Train a model

After your data is imported, it shows up on the Datasets page in SageMaker Canvas. At this stage, you can build a model. To do so, complete the following steps:

Create a model

Model name

my_first_model

SageMaker Canvas enables you to create models for predictive analysis, image analysis, and text analysis.

Predictive analysis

Problem type

Create

On the Build page, you can see statistics about your dataset, such as the percentage of missing values and mode of the data.

Target column

churn

SageMaker Canvas offers two types of models that can generate predictions. Quick build prioritizes speed over accuracy, providing a model in 2–15 minutes. Standard build prioritizes accuracy over speed, providing a model in 30 minutes–2 hours.

Quick build

After the model is trained, you can analyze the model accuracy.

The Overview tab shows us the column impact, or the estimated importance of each column in predicting the target column. In this example, the Night_calls column has the most significant impact in predicting if a customer will churn. This information can help the marketing team gain insights that lead to taking actions to reduce customer churn. For example, we can see that both low and high CustServ_Calls increase the likelihood of churn. The marketing team can take actions to help prevent customer churn based on these learnings. Examples include creating a detailed FAQ on websites to reduce customer service calls, and running education campaigns with customers on the FAQ that can keep engagement up.

Generate predictions

On the Predict tab, you can generate both batch predictions and single predictions. Complete the following steps to generate a batch prediction:

inference dataset

Batch prediction

SageMaker Canvas allows you to generate batch predictions either manually or automatically on a schedule. To learn how to automate batch predictions on a schedule, refer to Manage automations.

Manual

Generate predictions

After a few seconds, the prediction is complete, and you can choose View to see the prediction.

Optionally, choose Download to download a CSV file containing the full output. SageMaker Canvas will return a prediction for each row of data and the probability of the prediction being correct.

Optionally, you can deploy your models to an endpoint to make predictions. For more information, refer to Deploy your models to an endpoint.

Clean up

To avoid future charges, log out of SageMaker Canvas.

Conclusion

In this post, we showcased a solution to extract the data from BigQuery using Athena federated queries and a sample dataset. We then used the extracted data to build an ML model using SageMaker Canvas to predict customers at risk of churning—without writing code. SageMaker Canvas enables business analysts to build and deploy ML models effortlessly through its no-code interface, democratizing ML across the organization. This enables you to harness the power of advanced analytics and ML to drive business insights and innovation, without the need for specialized technical skills.

For more information, see Query any data source with Amazon Athena’s new federated query and Import data from over 40 data sources for no-code machine learning with Amazon SageMaker Canvas. If you’re new to SageMaker Canvas, refer to Build, Share, Deploy: how business analysts and data scientists achieve faster time-to-market using no-code ML and Amazon SageMaker Canvas.

About the authors

Amit Gautam is an AWS senior solutions architect supporting enterprise customers in the UK on their cloud journeys, providing them with architectural advice and guidance that helps them achieve their business outcomes.

Sujata Singh is an AWS senior solutions architect supporting enterprise customers in the UK on their cloud journeys, providing them with architectural advice and guidance that helps them achieve their business outcomes.