Automate ETL jobs between Amazon RDS for SQL Server and Azure Managed SQL using AWS Glue Studio

Unknown source, October 2, 2024

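This article describes how, under a multi-cloud strategy, to use AWS Glue Studio to automate ETL jobs that move data between Amazon RDS for SQL Server and Azure SQL Managed Instances. It covers the required tools, a solution overview, the prerequisites, and the step-by-step procedure.

🎯 AWS Glue Studio is part of AWS Glue, a fully managed serverless integration service. Its graphical interface makes it easy to create, run, and monitor ETL jobs, and it can schedule jobs to run automatically at specific times.

📈 In this solution, AWS Glue Studio acts as the middleware: it pulls data from the source database (an Azure SQL Managed Instance), processes it with a pre-built transformation, and loads it into the target database (an RDS for SQL Server instance).

📋 Prerequisites include installing a client tool such as SQL Server Management Studio, setting up a VPN connection, creating a security group and an IAM role, opening firewall ports, and creating the source and target database tables.

🔗 Creating connections is the first step toward populating the AWS Glue Data Catalog with schema information from the source and target data sources. A connection object stores the connection information for a particular data store.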
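The source connection in this post uses a JDBC URL of the form jdbc:protocol://host:port;database=db_name. As a minimal sketch, the URL can be assembled from its four parts as shown below. The helper name build_jdbc_url is hypothetical and is not part of any AWS or Microsoft library; the example values are the ones listed in the post, with the host left masked exactly as published.

```python
def build_jdbc_url(host: str, port: int, database: str, protocol: str = "sqlserver") -> str:
    """Compose a JDBC URL in the form jdbc:protocol://host:port;database=db_name.

    Hypothetical helper for illustration only; it performs basic sanity
    checks and does not contact the database.
    """
    if not host or not database:
        raise ValueError("host and database are required")
    if not 1 <= port <= 65535:
        raise ValueError(f"port out of range: {port}")
    return f"jdbc:{protocol}://{host}:{port};database={database}"


if __name__ == "__main__":
    # Values taken from the post; the masked host segment is kept as published.
    url = build_jdbc_url(
        host="adi-qa-sql-managed-instance-test.public.xxxxxxxxxxxx.database.windows.net",
        port=3342,
        database="AdventureWorksLT2019",
    )
    print(url)
```

Building the string in one place makes it easier to spot a wrong port or a missing database name before pasting the value into the JDBC URL field of the connection form.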

<section class="blog-post-content"><p>Nowadays many customers are following a multi-cloud strategy. They might choose to use various cloud-managed services, such as <a class="c-link" href="https://aws.amazon.com/rds/sqlserver/" target="_blank" rel="noopener noreferrer" data-stringify-link="https://aws.amazon.com/rds/sqlserver/" data-sk="tooltip_parent" data-remove-tab-index="true">Amazon Relational Database Service (Amazon RDS) for SQL Server</a> and <a class="c-link" href="https://azure.microsoft.com/en-us/products/azure-sql/managed-instance/?&amp;ef_id=EAIaIQobChMI-5q2gdu0-AIVPxitBh3HNQT-EAAYASAAEgLr9PD_BwE:G:s&amp;OCID=AID2200277_SEM_EAIaIQobChMI-5q2gdu0-AIVPxitBh3HNQT-EAAYASAAEgLr9PD_BwE:G:s&amp;gclid=EAIaIQobChMI-5q2gdu0-AIVPxitBh3HNQT-EAAYASAAEgLr9PD_BwE" target="_blank" rel="noopener noreferrer" data-stringify-link="https://azure.microsoft.com/en-us/products/azure-sql/managed-instance/?&amp;ef_id=EAIaIQobChMI-5q2gdu0-AIVPxitBh3HNQT-EAAYASAAEgLr9PD_BwE:G:s&amp;OCID=AID2200277_SEM_EAIaIQobChMI-5q2gdu0-AIVPxitBh3HNQT-EAAYASAAEgLr9PD_BwE:G:s&amp;gclid=EAIaIQobChMI-5q2gdu0-AIVPxitBh3HNQT-EAAYASAAEgLr9PD_BwE" data-sk="tooltip_parent" data-remove-tab-index="true">Azure SQL Managed Instances</a>, to perform data analytics tasks, but still use traditional extract, transform, and load (ETL) tools to integrate and process the data. 
However, traditional ETL tools may require you to develop custom scripts, which makes ETL automation difficult.</p><p>In this post, I show you how to automate ETL jobs between Amazon RDS for SQL Server and Azure SQL Managed Instances using <a class="c-link" href="https://docs.aws.amazon.com/glue/latest/ug/what-is-glue-studio.html" target="_blank" rel="noopener noreferrer" data-stringify-link="https://docs.aws.amazon.com/glue/latest/ug/what-is-glue-studio.html" data-sk="tooltip_parent" data-remove-tab-index="true">AWS Glue Studio</a>, which is part of <a class="c-link" href="https://aws.amazon.com/glue/?whats-new-cards.sort-by=item.additionalFields.postDateTime&amp;whats-new-cards.sort-order=desc" target="_blank" rel="noopener noreferrer" data-stringify-link="https://aws.amazon.com/glue/?whats-new-cards.sort-by=item.additionalFields.postDateTime&amp;whats-new-cards.sort-order=desc" data-sk="tooltip_parent" data-remove-tab-index="true">AWS Glue</a>, a fully managed serverless integration service. AWS Glue Studio has a graphical interface that makes it easy to create, run, and monitor ETL jobs, and you can create a schedule to run your jobs at specific times.</p><h2>Solution overview</h2><p>To move data from one database to another, there are different services available either on-premise or in the cloud, varied by bandwidth limits, ongoing changes (CDC), schema and table modifications, and other features. Beyond that, we need to apply advanced data transformations, monitor, and automate the ETL jobs. This is where AWS Glue Studio can help us facilitate these activities.</p><p>As shown in the following diagram, we use AWS Glue Studio as the middleware to pull data from the source database (in this case an Azure SQL Managed Instance), then create and automate the ETL job using one of the pre-built transformations in AWS Glue Studio. 
Finally, we load the data to the target database (in this case an RDS for SQL Server instance).</p><p><img class="alignnone size-full wp-image-34327" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2022/09/15/Overview-Diagram-1.jpg" alt="" width="1808" height="1210" /></p><p>The solution workflow consists of the following steps:</p><ol><li>Create connections for the source and target databases.</li><li>Create and run AWS Glue crawlers.</li><li>Create and run an ETL job that transforms the data and loads it from source to target.</li><li>Schedule the ETL job to run automatically.</li><li>Monitor the ETL job.</li></ol><h2>Prerequisites</h2><p>Complete the following prerequisite steps:</p><ol><li>Install <a href="https://docs.microsoft.com/en-us/sql/ssms/sql-server-management-studio-ssms?view=sql-server-ver16" target="_blank" rel="noopener noreferrer">SQL Server Management Studio</a> (SSMS) or an equivalent client tool.</li><li>Set up a <a href="https://docs.aws.amazon.com/vpn/latest/s2svpn/SetUpVPNConnections.html" target="_blank" rel="noopener noreferrer">VPN connection</a> between <a href="http://aws.amazon.com/vpc" target="_blank" rel="noopener noreferrer">Amazon Virtual Private Cloud</a> (Amazon VPC) and the Azure private subnet.</li><li><a href="https://docs.aws.amazon.com/glue/latest/dg/setup-vpc-for-glue-access.html" target="_blank" rel="noopener noreferrer">Create a security group for AWS Glue ENI in your VPC</a>.</li><li>Create an <a href="http://aws.amazon.com/iam" target="_blank" rel="noopener noreferrer">AWS Identity and Access Management</a> (IAM) role for AWS Glue. 
For instructions, refer to <a href="https://docs.aws.amazon.com/glue/latest/dg/getting-started-access.html" target="_blank" rel="noopener noreferrer">Setting up IAM permissions for AWS Glue</a>.</li><li><a href="https://docs.microsoft.com/en-us/azure/virtual-network/tutorial-filter-network-traffic" target="_blank" rel="noopener noreferrer">Open the appropriate firewall ports in the Azure private subnet</a>.</li><li>Create a source database table (Azure SQL Managed Instance). You can deploy the Azure database instance using the following <a href="https://docs.microsoft.com/en-us/azure/azure-sql/managed-instance/instance-create-quickstart?view=azuresql" target="_blank" rel="noopener noreferrer">QuickStart</a>. For testing purposes, I import the public <em><a href="https://docs.microsoft.com/en-us/sql/samples/adventureworks-install-configure?view=sql-server-ver16&amp;tabs=ssms" target="_blank" rel="noopener noreferrer">AdventureWorks</a></em> sample database and use the <code>dbo.Employee</code> table. See the following code:<pre class="lang-sql">-- Query table
SELECT * FROM [AdventureWorksLT2019].[dbo].[Employee];</pre><p><img class="alignnone size-full wp-image-34347 c4" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2022/09/15/AzureSQLManaged_2.jpg" alt="" width="1122" height="622" /></p></li><li>Create the target database table (Amazon RDS for SQL Server). To deploy the RDS instance, refer to <a href="https://aws.amazon.com/getting-started/hands-on/create-microsoft-sql-db/" target="_blank" rel="noopener noreferrer">Create and Connect to a Microsoft SQL Server Database with Amazon RDS</a>. You can create an empty database and table with the following statements. 
This is the table where the data coming from Azure will be stored.</li></ol><pre class="lang-sql">-- Create database
CREATE DATABASE AdventureWorksonRDS;
GO

-- Create table in the new database
USE AdventureWorksonRDS;
GO
CREATE TABLE Employee (
    EmpID INT NOT NULL,
    EmpName VARCHAR(50) NOT NULL,
    Designation VARCHAR(50) NULL,
    Department VARCHAR(50) NULL,
    JoiningDate DATETIME NULL,
    CONSTRAINT [PK_Employee] PRIMARY KEY CLUSTERED (EmpID)
);

-- Query table
SELECT * FROM [AdventureWorksonRDS].[dbo].[Employee];</pre><p><img class="alignnone size-full wp-image-34355 c4" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2022/09/15/Monitor-Job-1.1.jpg" alt="" width="829" height="560" /></p><h2>Create connections</h2><p>The first step is to populate our AWS Glue Data Catalog with the schema information coming from our source and target data sources.</p><p>To do that, we first create <a href="https://docs.aws.amazon.com/glue/latest/dg/connection-using.html" target="_blank" rel="noopener noreferrer">connections</a>. A connection is a Data Catalog object that stores connection information for a particular data store. Connections store login credentials, URI strings, VPC information, and more. 
Creating connections in the Data Catalog saves the effort of having to specify the connection details every time you create a crawler or job.</p><h3>Create a connection for Azure SQL Managed Instance</h3><p>To create the connection to our source database, complete the following steps:</p><ol><li>On the AWS Glue console, choose <strong>AWS Glue Studio</strong>.<img class="alignnone size-full wp-image-34381 c4" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2022/09/15/Connection-Azure-1.2.jpg" alt="" width="1828" height="814" /></li><li>In the navigation pane of the AWS Glue Studio console, choose <strong>Connectors</strong>.</li><li>Choose <strong>Create connection</strong>.<img class="alignnone size-full wp-image-34392 c4" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2022/09/15/Connection-Azure-1.3.jpg" alt="" width="1782" height="1186" /></li><li>For <strong>Name</strong>, enter <code>AzureSQLManaged</code>.</li><li>For <strong>Connection type</strong>, choose <strong>JDBC</strong>.</li><li>For <strong>JDBC URL</strong>, use the SQL Server syntax <code>jdbc:protocol://host:port;database=db_name</code>.</li></ol><p>You can find the host and database name on the Azure SQL Managed Instance service console, on the <strong>Overview</strong> page.<img class="c4" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2022/09/15/AzureSQLManaged_1.1.jpg" alt="" width="1842" height="693" />For this specific example, we use the following information for our Azure SQL Managed Instance:</p><ul><li><strong>Protocol</strong> – <code>sqlserver</code></li><li><strong>Host</strong> – <code>adi-qa-sql-managed-instance-test.public.xxxxxxxxxxxx.database.windows.net</code></li><li><strong>Port</strong> – <code>3342</code></li><li><strong>Database name</strong> – <code>AdventureWorksLT2019</code></li></ul><p>Enter your user name and password, then choose <strong>Create connection</strong>.</p><p><a href="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2022/09/16/AzureSQLManaged_2.1-1.jpg"><img class="alignnone wp-image-34547 size-full c4" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2022/09/16/AzureSQLManaged_2.1-1.jpg" alt="" width="500" height="726" /></a></p><h3>Create a connection for Amazon RDS for SQL Server</h3><p>To create a connection for our target data source, complete the following steps:</p><ol><li>On the AWS Glue Studio console, choose <strong>Connectors</strong> in the navigation pane.</li><li>Choose <strong>Create connection</strong>.</li><li>For <strong>Name</strong>, enter <code>AWSRDSforSQL</code>.</li><li>For <strong>Connection type</strong>, choose <strong>Amazon RDS</strong>.</li><li>For <strong>Database engine</strong>, choose <strong>Microsoft SQL Server</strong>.</li><li>For <strong>Database instances</strong>, choose your RDS DB instance.</li><li>For <strong>Database name</strong>, enter <code>AdventureWorksonRDS</code>.</li><li>Enter your user name and password.</li><li>Choose <strong>Create connection</strong>.</li></ol><p><a href="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2022/09/16/AzureSQLManaged_2.2.1-1.jpg"><img class="alignnone wp-image-34548 size-full c4" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2022/09/16/AzureSQLManaged_2.2.1-1.jpg" alt="" width="500" height="729" /></a></p><p>You can now see the two connections created in the <strong>Connections</strong> section.</p><p><img class="alignnone size-full wp-image-34433 c4" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2022/09/15/AzureSQLManaged_2.3.jpg" alt="" width="1538" height="1322" /></p><h2>Create and run AWS Glue crawlers</h2><p>You can use a crawler to populate the AWS Glue Data Catalog with tables. 
This is the method most AWS Glue users rely on. A crawler can crawl multiple data stores in a single run. Upon completion, it updates the Data Catalog with the tables it found. The ETL jobs that you define in AWS Glue use these Data Catalog tables as sources and targets.</p><h3>Create a crawler for Azure SQL Managed Instance</h3><p>To create a crawler for our source database, complete the following steps:</p><ol><li>On the AWS Glue console, choose <strong>Crawlers</strong> in the navigation pane.</li><li>Choose <strong>Create crawler</strong>.<img class="alignnone size-full wp-image-34434 c4" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2022/09/15/Crawlers-1.1.jpg" alt="" width="1766" height="604" /></li><li>If the data hasn’t been mapped into an AWS Glue table, select <strong>Not yet</strong> and choose <strong>Add a data source</strong>.<img class="alignnone size-full wp-image-34435 c4" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2022/09/15/Crawlers-2.1.jpg" alt="" width="2392" height="862" /></li><li>For <strong>Data source</strong>, choose <strong>JDBC</strong>.</li><li>For <strong>Connection</strong>, choose <code>AzureSQLManaged</code>.</li><li>For <strong>Include path</strong>, specify the path of the database including the schema: <code>AdventureWorksLT2019/dbo/%</code>.</li><li>Choose <strong>Add a JDBC data source</strong>. 
<a href="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2022/09/16/Crawlers-2.2-1.jpg"><img class="alignnone wp-image-34545 size-full c4" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2022/09/16/Crawlers-2.2-1.jpg" alt="" width="500" height="506" /></a></li><li>Choose <strong>Next</strong>.<img class="alignnone size-full wp-image-34440 c4" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2022/09/15/Crawlers-2.3.jpg" alt="" width="2390" height="778" /></li><li>Choose the IAM role created as part of the prerequisites and choose <strong>Next</strong>.<img class="alignnone size-full wp-image-34441 c4" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2022/09/15/Crawlers-3.1.jpg" alt="" width="2394" height="614" /></li><li>Choose <strong>Add database</strong> to create the target database in the AWS Glue Data Catalog.<img class="alignnone size-full wp-image-34442 c4" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2022/09/15/Crawlers-3.2.jpg" alt="" width="2382" height="890" /></li><li>For <strong>Name</strong>, enter <code>azuresqlmanaged_db</code>.</li><li>Choose <strong>Create database</strong>.<img class="alignnone size-full wp-image-34443 c4" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2022/09/15/Crawlers-3.4.jpg" alt="" width="2386" height="804" /></li><li>For <strong>Target database</strong>, choose <code>azuresqlmanaged_db</code>.</li><li>Choose <strong>Next</strong>.<img class="alignnone size-full wp-image-34444 c4" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2022/09/15/Crawlers-3.5.jpg" alt="" width="2402" height="882" /></li><li>Review if everything looks correct and choose <strong>Create crawler</strong>.<img class="alignnone size-full wp-image-34445 c4" 
src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2022/09/15/Crawlers-3.6.jpg" alt="" width="2392" height="1268" /></li></ol><h3>Create a crawler for Amazon RDS for SQL Server</h3><p>Repeat the crawler creation steps to create the crawler for the target RDS for SQL Server database, using the following information:</p><ul><li><strong>Crawler name</strong> – <code>AmazonRDSSQL_Crawler</code></li><li><strong>Data source</strong> – JDBC</li><li><strong>Connection</strong> – <code>AWSRDSforSQL</code></li><li><strong>Include path</strong> – <code>AdventureWorksonRDS/dbo/%</code></li><li><strong>IAM role</strong> – <code>AWSGlueServiceRoleDefault</code></li><li><strong>Database name</strong> – <code>amazonrdssql_db</code></li></ul><h2>Run the crawlers</h2><p>Now it’s time to run the crawlers.</p><ol><li>On the AWS Glue console, choose <strong>Crawlers</strong> in the navigation pane.</li><li>Select the crawlers you created and choose <strong>Run</strong>.<img class="alignnone size-full wp-image-34446 c4" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2022/09/15/Crawlers-3.7.jpg" alt="" width="2386" height="504" /></li><li>When the crawler is complete, choose <strong>Databases</strong> in the navigation pane. 
Here you can find the databases discovered by the crawler.<img class="alignnone size-full wp-image-34447 c4" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2022/09/15/Crawlers-12.1.jpg" alt="" width="2762" height="582" /></li><li>Choose <strong>Tables</strong> in the navigation pane and explore the tables discovered by the crawler that correctly identified the data type as SQL Server.<img class="alignnone size-full wp-image-34448 c4" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2022/09/15/Crawlers-13.2.jpg" alt="" width="2190" height="1274" /></li><li>Choose the table <em>adventureworkslt2019_dbo_employee</em> and review the schema created for the data source.<img class="alignnone size-full wp-image-34449 c4" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2022/09/15/Crawlers-14.2.jpg" alt="" width="1540" height="1248" /></li></ol><h2>Create and run an ETL job</h2><p>Now that we have crawled our source and target databases, and we have the data in the AWS Glue Data Catalog, we can create an ETL job to load and transform this data.</p><ol><li>On the AWS Glue Studio console, choose <strong>Jobs</strong> in the navigation pane.</li><li>Select <strong>Visual with a blank canvas</strong> to use a visual interface to create our ETL jobs.</li><li>Choose <strong>Create</strong>.<img class="alignnone size-full wp-image-34450 c4" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2022/09/15/ETL-Jobs-2.jpg" alt="" width="2660" height="1056" /></li><li>On the <strong>Source</strong> menu, choose <strong>AWS Glue Data Catalog</strong>.<img class="alignnone size-full wp-image-34451 c4" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2022/09/15/ETL-Jobs-4.jpg" alt="" width="1956" height="1150" /></li><li>On the <strong>Data source properties</strong> tab, specify the database and table (for this post, 
<code>azuresqlmanaged_db</code> and <code>adventureworkslt2019_dbo_employee</code>).<img class="alignnone size-full wp-image-34452 c4" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2022/09/15/ETL-Jobs-5.1.jpg" alt="" width="2648" height="876" /></li><li>On the <strong>Transform</strong> menu, choose <strong>Apply mapping</strong> to map the source fields to the target database.<img class="alignnone size-full wp-image-34453 c4" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2022/09/15/ETL-Jobs-6.jpg" alt="" width="1956" height="1154" /></li><li>On the <strong>Transform</strong> tab, you can see the data fields to be loaded, and you can even drop some of them if needed.<img class="alignnone size-full wp-image-34454 c4" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2022/09/15/ETL-Jobs-7.1.jpg" alt="" width="2768" height="1048" /></li><li>On the <strong>Target</strong> menu, choose <strong>AWS Glue Data Catalog</strong>.<img class="alignnone size-full wp-image-34455 c4" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2022/09/15/ETL-Jobs-8.jpg" alt="" width="1620" height="990" /></li><li>On the <strong>Data target properties</strong> tab, choose the database and table where you want to load the transformed data (for this post, <code>amazonrdssql_db</code> and <code>adventureworksrds_dbo_employee</code>).<img class="alignnone size-full wp-image-34456 c4" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2022/09/15/ETL-Jobs-9.1.jpg" alt="" width="2658" height="1204" /></li><li>On the <strong>Job details</strong> tab, for <strong>Name</strong>, enter <code>ETL_Azure_to_AWS</code>.</li><li>For <strong>IAM Role</strong>, choose the appropriate role.</li><li>Choose <strong>Save</strong>. 
<img class="alignnone size-full wp-image-34457 c4" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2022/09/15/ETL-Jobs-11.1.jpg" alt="" width="2320" height="854" /></li><li>Choose <strong>Run</strong> to run the job.<img class="alignnone size-full wp-image-34458 c4" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2022/09/15/ETL-Jobs-11.jpg" alt="" width="2802" height="228" /></li></ol><p>If the ETL job ran successfully, it has copied the data from the source database (Azure SQL Managed Instance) to the target database (Amazon RDS for SQL Server). To confirm, connect to the target database using <a href="https://docs.microsoft.com/en-us/sql/ssms/sql-server-management-studio-ssms?view=sql-server-ver16" target="_blank" rel="noopener noreferrer">SQL Server Management Studio</a> (SSMS) and query the previously empty <code>dbo.Employee</code> table in the <code>AdventureWorksonRDS</code> database. It should now contain the data coming from the Azure SQL Managed Instance.</p><pre class="lang-sql">-- Query table
SELECT * FROM [AdventureWorksonRDS].[dbo].[Employee];</pre><p><img class="alignnone size-full wp-image-34459 c4" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2022/09/15/Monitor-Job-3.jpg" alt="" width="1183" height="620" /></p><h2>Schedule your ETL job</h2><p>In AWS Glue Studio, you can create a schedule to have your jobs run at specific times. Each scheduled run reimports the full dataset; for incremental loads, consider using AWS Glue job bookmarks. You can schedule your ETL jobs on an hourly, daily, weekly, monthly, or custom basis, depending on your needs. 
To schedule a job, complete the following steps:</p><ol><li>On the AWS Glue Studio console, navigate to the job you created.</li><li>On the <strong>Schedules</strong> tab, choose <strong>Create schedule</strong>.<img class="alignnone size-full wp-image-34462 c4" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2022/09/15/Monitor-Job-4.jpg" alt="" width="2788" height="592" /></li><li>For <strong>Name</strong>, enter a name (for example, <code>dbo_employee_daily_load</code>).</li><li>Choose your preferred frequency, start hour, and minute of the hour. For this post, we schedule the job to run daily at 03:00 UTC.</li><li>For <strong>Description</strong>, enter an optional description.</li><li>Choose <strong>Create schedule</strong>.<img class="alignnone size-full wp-image-34463 c4" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2022/09/15/Monitor-Job-5.jpg" alt="" width="1786" height="1298" /></li></ol><p>Confirm on the <strong>Schedules</strong> tab that the schedule was created and activated successfully.<img class="alignnone size-full wp-image-34464 c4" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2022/09/15/Monitor-Job-6.jpg" alt="" width="2780" height="632" /></p><p>You have now automated your ETL job to run at your desired frequency.</p><h2>Monitor your ETL job</h2><p>The job monitoring dashboard provides an overall summary of the job runs, with totals for the jobs with a status of <strong>Running</strong>, <strong>Canceled</strong>, <strong>Success</strong>, or <strong>Failed</strong>.<img class="alignnone size-full wp-image-34465 c4" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2022/09/15/Monitor-Job-7.jpg" alt="" width="2768" height="1314" /></p><p>The <strong>Runs</strong> tab shows the jobs for the specified date range and filters. 
You can filter the jobs on additional criteria, such as status, worker type, job type, and job name.<img class="alignnone size-full wp-image-34466 c4" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2022/09/15/Monitor-Job-2.jpg" alt="" width="2764" height="1188" /></p><h2>Conclusion</h2><p>In this post, I went through the steps to automate ETL jobs using AWS Glue Studio, which is a user-friendly graphical interface to perform data integration tasks such as discovering and extracting data from various sources; enriching, cleaning, normalizing, and combining data; and loading and organizing data in databases, data warehouses, and data lakes. You can easily find and access this data using the AWS Glue Data Catalog. Data engineers and ETL developers can visually create, run, and monitor ETL workflows with a few clicks in AWS Glue Studio.</p><h3>About the author</h3><p><strong><a href="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2022/09/15/Daniel-Maldonado-picture.jpeg"><img class="size-full wp-image-34467 alignleft" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2022/09/15/Daniel-Maldonado-picture.jpeg" alt="" width="100" height="133" /></a>Daniel Maldonado</strong> is an AWS Solutions Architect, specializing in Microsoft workloads and big data technologies, and focused on helping customers migrate their applications and data to AWS. Daniel has over 13 years of experience working with information technologies and enjoys helping clients reap the benefits of running their workloads in the cloud.</p></section>
