<section class="blog-post-content"><p>If you are currently running <a href="http://aws.amazon.com/emr" target="_blank" rel="noopener noreferrer">Amazon EMR</a> 5.X clusters, consider moving to Amazon EMR 6.X as it includes new features that helps you improve performance and optimize on cost. For instance, <a href="https://aws.amazon.com/cn/blogs/big-data/apache-hive-is-2x-faster-with-hive-llap-on-emr-6-0-0/" target="_blank" rel="noopener noreferrer">Apache Hive is two times faster with LLAP</a> on Amazon EMR 6.X, and <a href="https://medium.com/insiderengineering/spark-3-reduced-our-emr-cost-by-40-144f17987af5" target="_blank" rel="noopener noreferrer">Spark 3 reduces costs by 40%</a>. Additionally, Amazon EMR 6.x releases include Trino, a fast distributed SQL engine and Iceberg, high-performance open data format for petabyte scale tables.</p><p>To upgrade Amazon EMR clusters from 5.X to 6.X release, a Hive Metastore upgrade is the first step before applications such as Hive and Spark can be migrated. This post provides guidance on how to upgrade Amazon EMR Hive Metastore from 5.X to 6.X as well as migration of Hive Metastore to the <a href="https://aws.amazon.com/glue">AWS Glue</a> Data Catalog. As Hive 3 Metastore is compatible with Hive 2 applications, you can continue to use Amazon EMR 5.X with the upgraded Hive Metastore.</p><h2>Solution overview</h2><p>In the following section, we provide steps to upgrade the Hive Metastore schema using MySQL as the backend.. For any other backends (such as MariaDB, Oracle, or SQL Server), update the commands accordingly.</p><p>There are two options to upgrade the Amazon EMR Hive Metastore:</p><ul><li>Upgrade the Hive Metastore schema from 2.X to 3.X by using the Hive Schema Tool</li><li>Migrate the Hive Metastore to the AWS Glue Data Catalog</li></ul><p>We walk through the steps for both options.</p><h2>Pre-upgrade prerequisites</h2><p>Before upgrading the Hive Metastore, you must complete the following prerequisites steps:</p><ol><li>Verify the Hive Metastore database is running and accessible.You should be able to run Hive DDL and DML queries successfully. Any errors or issues must be fixed before proceeding with upgrade process. Use the following sample queries to test the database:</li><li>To get the Metastore schema version in the current EMR 5.X cluster, run the following command in the primary node:<p>The following code shows our sample output:</p></li><li>Stop the Metastore service and restrict access to the Metastore MySQL database.It’s very important that no one else accesses or modifies the contents of the Metastore database while you’re performing the schema upgrade.To stop the Metastore, use the following commands:<p>For Amazon EMR release 5.30 and 6.0 onwards (Amazon Linux 2 is the operating system for the Amazon EMR 5.30+ and 6.x release series), use the following commands:</p><p>You can also note the total number of databases and tables present in the Hive Metastore before the upgrade, and verify the number of databases and tables after the upgrade.</p></li><li>To get the total number of tables and databases before the upgrade, run the following commands after connecting to the external Metastore database (assuming the Hive Metadata DB name is hive):</li><li>Take a backup or snapshot of the Hive database.This allows you to revert any changes made during the upgrade process if something goes wrong. 
If you’re using <a href="http://aws.amazon.com/rds" target="_blank" rel="noopener noreferrer">Amazon Relational Database Service</a> (Amazon RDS), refer to <a href="https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/CHAP_CommonTasks.BackupRestore.html" target="_blank" rel="noopener noreferrer">Backing up and restoring an Amazon RDS instance</a> for instructions.</li><li>Take note of the Hive table storage location if data is stored in HDFS.</li></ol><p>If all the table data is on <a href="http://aws.amazon.com/s3" target="_blank" rel="noopener noreferrer">Amazon Simple Storage Service</a> (Amazon S3), then no action is needed. If HDFS is used as the storage layer for Hive databases and tables, then take note of them. You will need to copy the files from HDFS to a similar path on the new cluster, and then verify or update the location attribute for databases and tables on the new cluster accordingly.</p><h2>Upgrade the Amazon EMR Hive Metastore schema with the Hive Schema Tool</h2><p>In this approach, you use the persistent Hive Metastore on a remote database (<a href="https://aws.amazon.com/rds/mysql/" target="_blank" rel="noopener noreferrer">Amazon RDS for MySQL</a> or <a href="https://aws.amazon.com/rds/aurora/mysql-features/" target="_blank" rel="noopener noreferrer">Amazon Aurora MySQL-Compatible Edition</a>). The following diagram shows the upgrade procedure.</p><p><img class="alignnone size-full wp-image-34004 c4" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2022/09/05/BDB-2332-Hive-Metastore2Metastore-Upgrade.jpg" alt="EMR Hive Metastore Upgrade" width="711" height="391" /></p><p>To upgrade the Amazon EMR Hive Metastore from 5.X (Hive version 2.X) to 6.X (Hive version 3.X), we can use the <a href="https://cwiki.apache.org/confluence/display/Hive/Hive+Schema+Tool" target="_blank" rel="noopener noreferrer">Hive Schema Tool</a>. The Hive Schema Tool is an offline tool for Metastore schema manipulation. You can use it to initialize, upgrade, and validate the Metastore schema. Run the following command to show the available options for the Hive Schema Tool:</p><p>Be sure to complete the prerequisites mentioned earlier, including taking a backup or snapshot, before proceeding with the next steps.</p><ol><li>Note down the details of the existing Hive external Metastore to be upgraded. This includes the RDS for MySQL endpoint host name, database name (for this post, hive), user name, and password. You can do this through one of the following options:<ul><li><strong>Get the Hive Metastore DB information from the Hive configuration file</strong> – Log in to the EMR 5.X primary node, open the file <code>/etc/hive/conf/hive-site.xml</code>, and note the four JDBC connection properties (<code>javax.jdo.option.ConnectionURL</code>, <code>javax.jdo.option.ConnectionDriverName</code>, <code>javax.jdo.option.ConnectionUserName</code>, and <code>javax.jdo.option.ConnectionPassword</code>):</li><li><strong>Get the Hive Metastore DB information from the Amazon EMR console</strong> – Navigate to the EMR 5.X cluster, choose the <strong>Configurations</strong> tab, and note down the Metastore DB information.<img class="alignnone size-full wp-image-34008 c4" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2022/09/05/BDB-2332-EMR-Console-Configuration.jpg" alt="EMR Console for Configuration" width="3030" height="930" /></li></ul></li></ol><ol start="2"><li>Create a new EMR 6.X cluster. To use the Hive Schema Tool, we need to create an EMR 6.X cluster. You can create the new EMR 6.X cluster via the Amazon EMR console or the <a href="http://aws.amazon.com/cli" target="_blank" rel="noopener noreferrer">AWS Command Line Interface</a> (AWS CLI), without specifying external Hive Metastore details. This lets the EMR 6.X cluster launch successfully using the default Hive Metastore. For more information about EMR cluster management, refer to <a href="https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan.html" target="_blank" rel="noopener noreferrer">Plan and configure clusters</a>.
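<p>As an illustration, a minimal AWS CLI launch might look like the following sketch. The cluster name, release label, instance settings, key pair, and Region are placeholders; note that no external Metastore configuration is passed:</p><pre><code># Illustrative only -- adjust the name, release label, and instance settings
# to your environment. No external Metastore is configured here.
aws emr create-cluster \
  --name "emr-6x-metastore-upgrade" \
  --release-label emr-6.9.0 \
  --applications Name=Hive Name=Spark \
  --instance-type m5.xlarge \
  --instance-count 3 \
  --use-default-roles \
  --ec2-attributes KeyName=my-key-pair \
  --region us-west-2
</code></pre>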
</li><li>After your new EMR 6.X cluster is launched successfully and is in the waiting state, SSH to the EMR 6.X primary node and take a backup of <code>/etc/hive/conf/hive-site.xml</code>:</li><li>Stop Hive services:<p>Now you update the Hive configuration and point it to the old Hive Metastore database.</p></li><li>Modify <code>/etc/hive/conf/hive-site.xml</code> and update the properties with the values you collected earlier:</li><li>In the same or a new SSH session, run the Hive Schema Tool to check that the Metastore is pointing to the old Metastore database:<p>The output should look as follows (old-hostname, old-dbname, and old-username are the values you changed):</p><p>You can upgrade the Hive Metastore by passing the <code>-upgradeSchema</code> option to the Hive Schema Tool. The tool figures out the SQL scripts required to initialize or upgrade the schema and then runs those scripts against the backend database.</p></li><li>Run the <code>upgradeSchema</code> command with <code>-dryRun</code>, which only lists the SQL scripts needed during the actual run:<p>The output should look like the following code. It shows the Metastore upgrade path from the old version to the new version. You can find the upgrade order on the <a href="https://github.com/apache/hive/blob/master/standalone-metastore/metastore-server/src/main/sql/mysql/upgrade.order.mysql" target="_blank" rel="noopener noreferrer">GitHub repo</a>. In case of failure during the upgrade process, these scripts can be run manually in the same order.</p></li><li>To upgrade the Hive Metastore schema, run the Hive Schema Tool with <code>-upgradeSchema</code>:<p>The output should look like the following code:</p><p>In case of any issues or failures, you can run the preceding command with the verbose option. This prints all the queries being run, in order, along with their output.</p><p>If you encounter any failures during this process and you want to upgrade your Hive Metastore by running the SQL yourself, refer to <a href="https://aws.github.io/aws-emr-best-practices/applications/hive/best_practices/#bp-521-upgrading-hive-metastore" target="_blank" rel="noopener noreferrer">Upgrading Hive Metastore</a>.</p><p>If HDFS was used as storage for the Hive warehouse or any Hive database location, you need to update the <code>NameNode</code> alias or URI with the new cluster’s HDFS alias.</p></li><li>Use the following commands to update the HDFS <code>NameNode</code> alias (replace <code>&lt;new-loc&gt; &lt;old-loc&gt;</code> with the HDFS root locations of the new and old clusters, respectively):<p>You can run the following command on any EMR cluster node to get the HDFS <code>NameNode</code> alias:</p><p>At first, you can run it with the <code>dryRun</code> option, which displays all the changes without persisting them.</p>
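<p>The following is a sketch of this sequence; the HDFS URIs shown are placeholders for your new and old cluster aliases, so replace them with the values from your environment:</p><pre><code># One way to get the HDFS NameNode alias of a cluster
hdfs getconf -confKey fs.defaultFS

# Syntax: -updateLocation &lt;new-loc&gt; &lt;old-loc&gt; (new URI first, then old).
# Preview the changes first; -dryRun does not persist anything.
hive --service metatool -updateLocation \
  hdfs://ip-10-0-1-10.us-west-2.compute.internal:8020 \
  hdfs://ip-10-0-0-10.us-west-2.compute.internal:8020 \
  -dryRun

# Apply the change once the dry run output looks correct
hive --service metatool -updateLocation \
  hdfs://ip-10-0-1-10.us-west-2.compute.internal:8020 \
  hdfs://ip-10-0-0-10.us-west-2.compute.internal:8020
</code></pre>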
<p>However, if the new location needs to be changed to a different HDFS or S3 path, then use the following approach.</p><p>First, connect to the remote Hive Metastore database and run the following query to pull all the tables for a specific database and list their locations. Replace <code>HiveMetastore_DB</code> with the database name used for the Hive Metastore in the external database (for this post, hive), and replace the Hive database name as needed (for this post, default):</p><p>Identify the tables whose location needs to be updated, then run the <code>ALTER TABLE</code> command to update the table locations. You can prepare a script or a chain of <code>ALTER TABLE</code> commands to update the locations for multiple tables.</p></li><li>Start and check the status of the Hive Metastore and HiveServer2:</li></ol><h3>Post-upgrade validation</h3><p>Perform the following post-upgrade steps:</p><ol><li>Confirm the Hive Metastore schema is upgraded to the new version:<p>The output should look like the following code:</p></li><li>Run the following Hive Schema Tool command to query the Hive schema version and verify that it’s upgraded:</li><li>Run some DML queries against old tables and ensure they run successfully.</li><li>Verify the table and database counts using the same commands mentioned in the prerequisites section, and compare the counts.</li></ol><p>The Hive Metastore schema migration process is complete, and you can start working on your new EMR cluster. If for some reason you want to relaunch the EMR cluster, you just need to provide the Hive Metastore remote database that we upgraded in the previous steps using the options on the Amazon EMR <strong>Configurations</strong> tab.</p><h2>Migrate the Amazon EMR Hive Metastore to the AWS Glue Data Catalog</h2><p>The AWS Glue Data Catalog is flexible and reliable, and can reduce your operational costs. Moreover, the Data Catalog supports different versions of EMR clusters. Therefore, when you migrate your Amazon EMR 5.X Hive Metastore to the Data Catalog, you can use the same Data Catalog with any new EMR 5.8+ cluster, including Amazon EMR 6.X. There are some factors you should consider when using this approach; refer to <a href="https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hive-metastore-glue.html#emr-hive-glue-considerations-hive" target="_blank" rel="noopener noreferrer">Considerations when using AWS Glue Data Catalog</a> for more information. The following diagram shows the migration procedure.<img class="alignnone size-full wp-image-34005 c4" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2022/09/05/BDB-2332-Hive-Metastore2GlueDataCatalog.jpg" alt="EMR Hive Metastore Migrate to Glue Data Catalog" width="711" height="391" />To migrate your Hive Metastore to the Data Catalog, you can use the <a href="https://github.com/aws-samples/aws-glue-samples/tree/master/utilities/Hive_metastore_migration" target="_blank" rel="noopener noreferrer">Hive Metastore migration script</a> from GitHub. The following are the major steps for a direct migration.</p><p>Make sure all the table data is stored in Amazon S3 and not HDFS. Otherwise, tables migrated to the Data Catalog will have their table location pointing to HDFS, and you can’t query them. You can check your table data locations by connecting to the MySQL database and running a SQL query against the Metastore tables.</p>
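<p>One way to do this check, assuming the standard Hive Metastore schema, is to list any tables whose location still points to HDFS. The endpoint and user are placeholders, and hive is the Metastore database name used in this post; the query should ideally return no rows:</p><pre><code># Endpoint and user are placeholders; the password is prompted for.
mysql -h metastore-db.xxxxxxxxxxxx.us-west-2.rds.amazonaws.com -u hiveuser -p hive \
  -e "SELECT d.NAME AS hive_db, t.TBL_NAME, s.LOCATION
      FROM TBLS t
      JOIN DBS d ON t.DB_ID = d.DB_ID
      JOIN SDS s ON t.SD_ID = s.SD_ID
      WHERE s.LOCATION LIKE 'hdfs%';"
</code></pre>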
<p>Make sure to complete the prerequisite steps mentioned earlier before proceeding with the migration. Ensure the EMR 5.X cluster is in the waiting state and all of its components are in service.</p><ol><li>Note down the details of the existing EMR 5.X cluster Hive Metastore database to be migrated. As mentioned before, this includes the endpoint host name, database name, user name, and password. You can do this through one of the following options:<ul><li><strong>Get the Hive Metastore DB information from the Hive configuration file</strong> – Log in to the Amazon EMR 5.X primary node, open the file <code>/etc/hive/conf/hive-site.xml</code>, and note the four JDBC connection properties (<code>javax.jdo.option.ConnectionURL</code>, <code>javax.jdo.option.ConnectionDriverName</code>, <code>javax.jdo.option.ConnectionUserName</code>, and <code>javax.jdo.option.ConnectionPassword</code>):</li><li><strong>Get the Hive Metastore DB information from the Amazon EMR console</strong> – Navigate to the Amazon EMR 5.X cluster, choose the <strong>Configurations</strong> tab, and note down the Metastore DB information.</li></ul><p><img class="alignnone size-full wp-image-34008 c4" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2022/09/05/BDB-2332-EMR-Console-Configuration.jpg" alt="EMR Console for Configuration" width="3030" height="930" /></p></li><li>On the AWS Glue console, <a href="https://docs.aws.amazon.com/glue/latest/dg/populate-add-connection.html" target="_blank" rel="noopener noreferrer">create a connection</a> to the Hive Metastore as a JDBC data source. Use the connection JDBC URL, user name, and password you gathered in the previous step. Specify the VPC, subnet, and security group associated with your Hive Metastore. You can find these on the Amazon EMR console if the Hive Metastore is on the EMR primary node, or on the Amazon RDS console if the Metastore is an RDS instance.</li><li>Download the two extract, transform, and load (ETL) job scripts (<code>import_into_datacatalog.py</code> and <code>hive_metastore_migration.py</code>) from GitHub and upload them to an S3 bucket:<p>If you configured AWS Glue to access Amazon S3 from a VPC endpoint, you must upload the scripts to a bucket in the same AWS Region where your job runs.</p><p>Now you must create a job on the AWS Glue console to extract metadata from your Hive Metastore and migrate it to the Data Catalog.</p></li><li>On the AWS Glue console, choose <strong>Jobs</strong> in the navigation pane.</li><li>Choose <strong>Create job</strong>.</li><li>Select <strong>Spark script editor</strong>.</li><li>For <strong>Options</strong>, select <strong>Upload and edit an existing script</strong>.</li><li>Choose <strong>Choose file</strong> and upload the <code>import_into_datacatalog.py</code> script you downloaded earlier.</li><li>Choose <strong>Create</strong>.<img class="alignnone size-full wp-image-34014 c4" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2022/09/05/BDB-2332-Picture5.jpg" alt="Glue Job script editor" width="3054" height="1212" /></li><li>On the <strong>Job details</strong> tab, enter a job name (for example, <code>Import-Hive-Metastore-To-Glue</code>).</li><li>For <strong>IAM Role</strong>, choose a role.</li><li>For <strong>Type</strong>, choose <strong>Spark</strong>.<img class="alignnone size-full wp-image-34013 c4" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2022/09/05/BDB-2332-Picture6.jpg" alt="Glue ETL Job details" width="3058" height="1200" /></li><li>For <strong>Glue version</strong>, choose <strong>Glue 3.0</strong>.</li><li>For <strong>Language</strong>, choose <strong>Python 3</strong>.</li><li>For <strong>Worker type</strong>, choose <strong>G.1X</strong>.</li><li>For <strong>Requested number of workers</strong>, enter 2.<img class="alignnone size-full wp-image-34012 c4" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2022/09/05/BDB-2332-Picture7.jpg" alt="Glue ETL Job details" width="3046" height="1148" /></li><li>In the <strong>Advanced properties</strong> section, for <strong>Script filename</strong>, enter <code>import_into_datacatalog.py</code>.</li><li>For <strong>Script path</strong>, enter the S3 path you used earlier (just the parent folder).<img class="alignnone size-full wp-image-34011 c4" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2022/09/05/BDB-2332-Picture8.jpg" alt="Glue ETL Job details" width="3056" height="1104" /></li><li>Under <strong>Connections</strong>, choose the connection you created earlier.<img class="alignnone size-full wp-image-34010 c4" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2022/09/05/BDB-2332-Picture9.jpg" alt="Glue ETL Job details" width="3050" height="1166" /></li><li>For <strong>Python library path</strong>, enter the S3 path you used earlier for the file <code>hive_metastore_migration.py</code>.</li><li>Under <strong>Job parameters</strong>, enter the following key-value pairs:<ul><li><code>--mode: from-jdbc</code></li><li><code>--connection-name: EMR-Hive-Metastore</code></li><li><code>--region: us-west-2</code></li></ul><p><img class="alignnone size-full wp-image-34009 c4" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2022/09/05/BDB-2332-Picture10.jpg" alt="Glue ETL Job details" width="3048" height="1202" /></p></li><li>Choose <strong>Save</strong> to save the job.</li><li>Run the job on demand on the AWS Glue console.</li></ol>
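<p>If you prefer to script the job creation instead of clicking through the console, a roughly equivalent AWS CLI call is sketched below. The role name, S3 bucket, connection name, and Region are placeholders based on the values used in this walkthrough; adjust them to your account:</p><pre><code># Illustrative sketch of the same Glue job definition via the AWS CLI.
# The Python library path maps to the --extra-py-files default argument.
aws glue create-job \
  --name Import-Hive-Metastore-To-Glue \
  --role MyGlueServiceRole \
  --glue-version 3.0 \
  --worker-type G.1X \
  --number-of-workers 2 \
  --command Name=glueetl,ScriptLocation=s3://my-bucket/import_into_datacatalog.py,PythonVersion=3 \
  --connections Connections=EMR-Hive-Metastore \
  --default-arguments '{
      "--extra-py-files": "s3://my-bucket/hive_metastore_migration.py",
      "--mode": "from-jdbc",
      "--connection-name": "EMR-Hive-Metastore",
      "--region": "us-west-2"
  }' \
  --region us-west-2
</code></pre>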
src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2022/09/05/BDB-2332-Picture7.jpg" alt="Glue ETL Job details" width="3046" height="1148" /></li><li>In the <strong>Advanced properties</strong> section, for <strong>Script filename</strong>, enter <code>import_into_datacatalog.py</code>.</li><li>For <strong>Script path</strong>, enter the S3 path you used earlier (just the parent folder).<img class="alignnone size-full wp-image-34011 c4" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2022/09/05/BDB-2332-Picture8.jpg" alt="Glue ETL Job details" width="3056" height="1104" /></li><li>Under <strong>Connections</strong>, choose the connection you created earlier.<img class="alignnone size-full wp-image-34010 c4" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2022/09/05/BDB-2332-Picture9.jpg" alt="Glue ETL Job details" width="3050" height="1166" /></li><li>For <strong>Python library path</strong>, enter the S3 path you used earlier for the file <code>hive_metastore_migration.py</code>.</li><li>Under <strong>Job parameters</strong>, enter the following key-pair values:<ul><li><code>--mode: from-jdbc</code></li><li><code>--connection-name: EMR-Hive-Metastore</code></li><li><code>--region: us-west-2</code></li></ul><p><img class="alignnone size-full wp-image-34009 c4" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2022/09/05/BDB-2332-Picture10.jpg" alt="Glue ETL Job details" width="3048" height="1202" /></p></li><li>Choose <strong>Save</strong> to save the job.</li><li>Run the job on demand on the AWS Glue console.</li></ol><p>If the job runs successfully, <strong>Run status</strong> should show as Succeeded. When the job is finished, the metadata from the Hive Metastore is visible on the AWS Glue console. Check the databases and tables listed to verify that they were migrated correctly.</p><h2>Known issues</h2><p>In some cases where the Hive Metastore schema version is on a very old release or if some required metadata tables are missing, the upgrade process may fail. In this case, you can use the following steps to identify and fix the issue. Run the schemaTool <code>upgradeSchema</code> command with verbose as follows:</p><p>This prints all the queries being run in order and their output:</p><p>Note down the query and the error message, then take the required steps to address the issue. For example, depending on the error message, you may have to create the missing table or alter an existing table. Then you can either rerun the schemaTool <code>upgradeSchema</code> command, or you can manually run the remaining queries required for upgrade. You can get the complete script that schemaTool runs from the following path on the primary node <code>/usr/lib/hive/scripts/metastore/upgrade/mysql/</code> or from <a href="https://github.com/apache/hive/tree/master/standalone-metastore/metastore-server/src/main/sql/mysql" target="_blank" rel="noopener noreferrer">GitHub</a>.</p><h2>Clean up</h2><p>Running additional EMR clusters to perform the upgrade activity in your AWS account may incur additional charges. When you complete the Hive Metastore upgrade successfully, we recommend deleting the additional EMR clusters to save cost.</p><h2>Conclusion</h2><p>To upgrade Amazon EMR from 5.X to 6.X and take advantage of some features from Hive 3.X or Spark SQL 3.X, you have to upgrade the Hive Metastore first. 
<h2>Clean up</h2><p>Running additional EMR clusters to perform the upgrade activity in your AWS account may incur additional charges. When you complete the Hive Metastore upgrade successfully, we recommend deleting the additional EMR clusters to save cost.</p><h2>Conclusion</h2><p>To upgrade Amazon EMR from 5.X to 6.X and take advantage of features from Hive 3.X or Spark SQL 3.X, you have to upgrade the Hive Metastore first. If you’re using the AWS Glue Data Catalog as your Hive Metastore, you don’t need to do anything because the Data Catalog supports both Amazon EMR versions. If you’re using a MySQL database as the external Hive Metastore, you can upgrade by following the steps outlined in this post, or you can migrate your Hive Metastore to the Data Catalog.</p><p>There are some functional differences between the different versions of Hive, Spark, and Flink. If you have applications running on Amazon EMR 5.X, make sure to test them on Amazon EMR 6.X and validate their functional compatibility. We will cover application upgrades for Amazon EMR components in a future post.</p><h3>About the authors</h3><p class="c5"><strong><img class="size-full wp-image-34023 alignleft" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2022/09/05/JianweiLi.jpg" alt="" width="100" height="113" />Jianwei Li</strong> is a Senior Analytics Specialist TAM. He provides consulting services to AWS enterprise support customers to help them design and build modern data platforms. He has more than 10 years of experience in the big data and analytics domain. In his spare time, he likes running and hiking.</p><p class="c5"><strong><img class="size-full wp-image-34024 alignleft" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2022/09/05/Narayanan.jpg" alt="" width="100" height="100" />Narayanan Venkateswaran</strong> is an Engineer in the AWS EMR group. He works on developing Hive in EMR. He has over 17 years of work experience in the industry across several companies, including Sun Microsystems, Microsoft, Amazon, and Oracle. Narayanan also holds a PhD in databases with a focus on horizontal scalability in relational stores.</p><p class="c5"><strong><img class="size-full wp-image-34025 alignleft" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2022/09/05/Partha.jpg" alt="" width="100" height="153" />Partha Sarathi</strong> is an Analytics Specialist TAM at AWS, based in Sydney, Australia. He brings 15+ years of technology expertise and helps enterprise customers optimize analytics workloads. In his previous roles, he worked extensively on both on-premises and cloud big data workloads, along with various ETL platforms. He also conducts proactive operational reviews around analytics services such as Amazon EMR, Redshift, and OpenSearch.</p><p class="c5"><strong><img class="size-full wp-image-34036 alignleft" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2022/09/05/Krish.jpg" alt="" width="100" height="119" />Krish</strong> is an Enterprise Support Manager responsible for leading a team of specialists in EMEA focused on Big Data &amp; Analytics, Databases, Networking, and Security. He is also an expert in helping enterprise customers modernize their data platforms and inspiring them to implement operational best practices. In his spare time, he enjoys spending time with his family, travelling, and video games.</p></section>