Unknown data source · October 2, 2024
Get a quick start with Apache Hudi, Apache Iceberg, and Delta Lake with Amazon EMR on EKS

A data lake is a centralized repository that stores all your structured and unstructured data and supports a wide range of analytics, but it struggles with ACID transactions. This post explores three open-source transactional file formats, Apache Hudi, Apache Iceberg, and Delta Lake, that help overcome these challenges. Through a hands-on example, it demonstrates how to use these storage frameworks to handle incremental data changes in a data lake, and how to deploy the applications with Amazon EMR on EKS.

🤔 The data lake plays an important role in analytics as an immutable and agile data storage layer. It lets you store all structured and unstructured data and supports a wide range of analytics, including dashboards, visualizations, big data processing, real-time analytics, and machine learning (ML).

🤔 One of the challenges data lakes face is supporting ACID (Atomicity, Consistency, Isolation, Durability) transactions. For example, how do you run queries that return consistent and up-to-date results while new data is continuously added or existing data is modified?

🤔 This post explores three open-source transactional file formats: Apache Hudi, Apache Iceberg, and Delta Lake, which help overcome the ACID transaction challenges in data lakes. These frameworks provide ACID properties and support incremental data changes in a data lake, such as SCD2 (Slowly Changing Dimension Type 2).

🤔 Through a hands-on example, the post demonstrates how to use these storage frameworks to handle incremental data changes in a data lake and how to deploy the applications with Amazon EMR on EKS.

🤔 Amazon EMR on EKS is a managed deployment option that lets you run Amazon EMR big data workloads on Amazon EKS. It provides an easy way to deploy and manage applications such as Apache Spark, Apache Hive, and Apache Hadoop.

🤔 The post also shows how to query the data in the data lake with Amazon Athena. Amazon Athena is an interactive query service that lets you query data in Amazon S3 using standard SQL.

<section class="blog-post-content"><p>A data lake is a centralized repository that allows you to store all your structured and unstructured data at any scale. You can keep your data as is in your object store or file-based storage without having to first structure the data. Additionally, you can run different types of analytics against your loosely formatted data lake, from dashboards and visualizations to big data processing, real-time analytics, and machine learning (ML), to guide better decisions. Due to the flexibility and cost effectiveness that a data lake offers, it’s very popular with customers who are looking to implement data analytics and AI/ML use cases.</p><p>Due to the immutable nature of the underlying storage in the cloud, one of the challenges in data processing is updating or deleting a subset of identified records from a data lake. Another challenge is making concurrent changes to the data lake. Implementing these tasks is time consuming and costly.</p><p>In this post, we explore three open-source transactional file formats: Apache Hudi, Apache Iceberg, and Delta Lake, to help us overcome these data lake challenges. We focus on how to get started with these data storage frameworks via a real-world use case. As an example, we demonstrate how to handle incremental data change in a data lake by implementing a <a href="https://en.wikipedia.org/wiki/Slowly_changing_dimension" target="_blank" rel="noopener noreferrer">Slowly Changing Dimension</a> Type 2 (SCD2) solution with Hudi, Iceberg, and Delta Lake, then deploy the applications with <a href="https://docs.aws.amazon.com/emr/latest/EMR-on-EKS-DevelopmentGuide/emr-eks.html" target="_blank" rel="noopener noreferrer">Amazon EMR on EKS</a>.</p><h2>ACID challenge in data lakes</h2><p>In analytics, the data lake plays an important role as an immutable and agile data storage layer. 
Unlike traditional data warehouses or data mart implementations, we make no assumptions on the data schema in a data lake and can define whatever schemas are required by our use cases. It’s up to the downstream consumption layer to make sense of that data for its own purposes.</p><p>One of the most common challenges is supporting ACID (Atomicity, Consistency, Isolation, Durability) transactions in a data lake. For example, how do we run queries that return consistent and up-to-date results while new data is continuously being ingested or existing data is being modified?</p><p>Let’s try to understand the data problem with a real-world scenario. Assume we centralize customer contact datasets from multiple sources to an <a href="http://aws.amazon.com/s3" target="_blank" rel="noopener noreferrer">Amazon Simple Storage Service</a> (Amazon S3)-backed data lake, and we want to keep all the historical records for analysis and reporting. We face the following challenges:</p><ul><li>We keep creating append-only files in Amazon S3 to track the contact data changes (insert, update, delete) in near-real time.</li><li><strong>Consistency</strong> and <strong>atomicity</strong> aren’t guaranteed because we just dump data files from multiple sources without knowing whether the entire operation is successful or not.</li><li>We don’t have an <strong>isolation</strong> guarantee whenever multiple workloads are simultaneously reading and writing to the same target contact table.</li><li>We track every single activity at source, including duplicates caused by the retry mechanism and accidental data changes that are then reverted. This leads to the creation of a large volume of append-only files. 
The performance of extract, transform, and load (ETL) jobs decreases as all the data files are read each time.</li><li>We have to shorten the file retention period to reduce the amount of data scanned and keep read performance acceptable.</li></ul><p>In this post, we walk through a simple SCD2 ETL example designed for solving the ACID transaction problem with the help of Hudi, Iceberg, and Delta Lake. We also show how to deploy the ACID solution with EMR on EKS and query the results with <a href="http://aws.amazon.com/athena" target="_blank" rel="noopener noreferrer">Amazon Athena</a>.</p><h2>Custom library dependencies with EMR on EKS</h2><p>By default, Hudi and Iceberg are supported by <a href="http://aws.amazon.com/emr" target="_blank" rel="noopener noreferrer">Amazon EMR</a> as out-of-the-box features. For this demonstration, we use EMR on EKS release 6.8.0, which contains <a href="https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-iceberg.html" target="_blank" rel="noopener noreferrer">Apache Iceberg</a> 0.14.0-amzn-0 and <a href="https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hudi.html" target="_blank" rel="noopener noreferrer">Apache Hudi</a> 0.11.1-amzn-0. To find out the latest and past versions that Amazon EMR supports, check out the <a href="https://docs.aws.amazon.com/emr/latest/ReleaseGuide/Hudi-release-history.html" target="_blank" rel="noopener noreferrer">Hudi release history</a> and the <a href="https://docs.aws.amazon.com/emr/latest/ReleaseGuide/Iceberg-release-history.html" target="_blank" rel="noopener noreferrer">Iceberg release history</a> tables. The runtime binary files of these frameworks can be found on Spark’s class path within each EMR on EKS image. 
See <a href="https://docs.aws.amazon.com/emr/latest/EMR-on-EKS-DevelopmentGuide/emr-eks-releases.html" target="_blank" rel="noopener noreferrer">Amazon EMR on EKS release versions</a> for the list of supported versions and applications.</p><p>As of this writing, Amazon EMR does not include Delta Lake by default. There are two ways to make it available in EMR on EKS:</p><ul><li><strong>At the application level</strong> – You install the Delta libraries by setting the Spark configuration <a href="https://github.com/aws-samples/emr-on-eks-hudi-iceberg-delta/blob/df6dd8e58e34bea855bbbd6efb87e30a8d98234d/delta/delta_submit.sh#L32" target="_blank" rel="noopener noreferrer">spark.jars</a> or the <code>--jars</code> command-line argument in your submission script. The JAR files are downloaded and distributed to each Spark executor and driver pod when a job starts.</li><li><strong>At the Docker container level</strong> – You can <a href="https://docs.aws.amazon.com/emr/latest/EMR-on-EKS-DevelopmentGuide/docker-custom-images-steps.html" target="_blank" rel="noopener noreferrer">customize an EMR on EKS image</a> by <a href="https://github.com/aws-samples/emr-on-eks-hudi-iceberg-delta/blob/7886d041f14be093a88439033e10b10f70526098/delta/Dockerfile#L6" target="_blank" rel="noopener noreferrer">packaging the Delta dependencies</a> into a single Docker container, which promotes portability and simplifies dependency management for each workload.</li></ul><p>Other custom library dependencies can be managed the same way as for Delta Lake: pass a comma-separated list of JAR files in the Spark configuration at job submission, or package all the required libraries into a Docker image.</p><h2>Solution overview</h2><p>The solution provides two sample CSV files as the data source: <a href="https://github.com/aws-samples/emr-on-eks-hudi-iceberg-delta/blob/main/data/initial_contacts.csv" target="_blank" rel="noopener noreferrer">initial_contacts.csv</a> and <a 
href="https://github.com/aws-samples/emr-on-eks-hudi-iceberg-delta/blob/main/data/update_contacts.csv" target="_blank" rel="noopener noreferrer">update_contacts.csv</a>. They were generated by a Python script with the Faker package. For more details, check out the <a href="https://github.com/awslabs/sql-based-etl-with-apache-spark-on-amazon-eks#test-job-in-jupyter-notebook" target="_blank" rel="noopener noreferrer">tutorial on GitHub</a>.</p><p>The following diagram describes a high-level architecture of the solution and the different services being used.</p><p><a href="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2022/09/16/BDB-2360-image001.png" target="_blank" rel="noopener noreferrer"><img class="alignnone size-full wp-image-34571" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2022/09/16/BDB-2360-image001.png" alt="" width="1428" height="539" /></a></p><p>The workflow steps are as follows:</p><ol><li>Ingest the first CSV file from a source S3 bucket. The data is processed by running a Spark ETL job with EMR on EKS. The application contains either the Hudi, Iceberg, or Delta framework.</li><li>Store the initial table in Hudi, Iceberg, or Delta file format in a target S3 bucket (curated). We use the <a href="https://aws.amazon.com/glue" target="_blank" rel="noopener noreferrer">AWS Glue</a> Data Catalog as the Hive metastore. 
Optionally, you can configure <a href="https://aws.amazon.com/dynamodb/" target="_blank" rel="noopener noreferrer">Amazon DynamoDB</a> as a lock manager for the concurrency controls.</li><li>Ingest a second CSV file that contains new records and some changes to the existing ones.</li><li>Perform SCD2 via Hudi, Iceberg, or Delta in the Spark ETL job.</li><li>Query the Hudi, Iceberg, or Delta table stored in the target S3 bucket in Athena.</li></ol><p>To simplify the demo, we have consolidated steps 1–4 into a single Spark application.</p><h2>Prerequisites</h2><p>Install the following tools:</p><p>For a quick start, you can use <a href="https://aws.amazon.com/cloudshell/" target="_blank" rel="noopener noreferrer">AWS CloudShell</a>, which already includes the AWS CLI and <code>kubectl</code>.</p><h2>Clone the project</h2><p>Download the sample project either to your computer or the CloudShell console:</p><h2>Set up the environment</h2><p>Run the following <a href="https://github.com/aws-samples/emr-on-eks-hudi-iceberg-delta/blob/main/blog_provision.sh" target="_blank" rel="noopener noreferrer">blog_provision.sh</a> script to set up a test environment. 
The infrastructure deployment includes the following resources:</p><ul><li>A new S3 bucket to store sample data and job code.</li><li>An Amazon Elastic Kubernetes Service (Amazon EKS) cluster (version 1.21) in a new VPC across two Availability Zones.</li><li>An EMR virtual cluster in the same VPC, registered to the emr namespace in Amazon EKS.</li><li>An AWS Identity and Access Management (IAM) job execution role that contains DynamoDB access, because we use DynamoDB to provide concurrency controls that ensure atomic transactions with the Hudi and Iceberg tables.</li></ul><h2>Job execution role</h2><p>The provisioning includes an <a href="https://docs.aws.amazon.com/emr/latest/EMR-on-EKS-DevelopmentGuide/iam-execution-role.html" target="_blank" rel="noopener noreferrer">IAM job execution role</a> called <code>emr-on-eks-quickstart-execution-role</code> that allows your EMR on EKS jobs to access the required AWS services. It contains AWS Glue permissions because we use the Data Catalog as our metastore.</p><p>See the following code:</p><p>Additionally, the role contains DynamoDB permissions, because we use the service as the lock manager. It provides concurrency controls that ensure atomic transactions with our Hudi and Iceberg tables. If a DynamoDB table with the given name doesn’t exist, a new table is created with the billing mode set to pay-per-request. More details can be found in the following framework examples.</p><h2>Example 1: Run Apache Hudi with EMR on EKS</h2><p>The following steps provide a quick start for you to implement SCD Type 2 data processing with the Hudi framework. 
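At the heart of an SCD2 Hudi job is its write configuration: Glue Data Catalog sync plus a DynamoDB lock provider for concurrent writers. The following is a minimal sketch of such an options map, not the exact contents of hudi_scd_script.py; the record key, precombine field, and table names are illustrative assumptions:

```python
# Hudi write options for an SCD2 upsert job: Glue Data Catalog (Hive
# metastore) sync plus DynamoDB-based optimistic concurrency control.
# Record key, precombine field, and table names are illustrative.
def build_hudi_options(table_name, database, lock_table):
    """Return a Hudi options map for an upsert job with Glue sync and OCC."""
    return {
        "hoodie.table.name": table_name,
        "hoodie.datasource.write.operation": "upsert",
        # Upsert semantics: rows with the same record key are deduplicated,
        # keeping the one with the highest precombine value
        "hoodie.datasource.write.recordkey.field": "id",
        "hoodie.datasource.write.precombine.field": "ts",
        # Sync the table definition to the AWS Glue Data Catalog
        "hoodie.datasource.hive_sync.enable": "true",
        "hoodie.datasource.hive_sync.mode": "hms",
        "hoodie.datasource.hive_sync.database": database,
        "hoodie.datasource.hive_sync.table": table_name,
        # DynamoDB-based optimistic concurrency control for multiple writers
        "hoodie.write.concurrency.mode": "optimistic_concurrency_control",
        "hoodie.cleaner.policy.failed.writes": "LAZY",
        "hoodie.write.lock.provider":
            "org.apache.hudi.aws.transaction.lock.DynamoDBBasedLockProvider",
        "hoodie.write.lock.dynamodb.table": lock_table,
    }

hudi_options = build_hudi_options("hudi_contact", "default", "myHudiLockTable")
# In the Spark job, the map is passed to the DataFrame writer, e.g.:
# df.write.format("hudi").options(**hudi_options).mode("append").save(path)
```

If the named DynamoDB lock table does not exist, Hudi creates it on first use, as noted above.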
To learn more, refer to <a href="https://aws.amazon.com/blogs/big-data/build-slowly-changing-dimensions-type-2-scd2-with-apache-spark-and-apache-hudi-on-amazon-emr/" target="_blank" rel="noopener noreferrer">Build Slowly Changing Dimensions Type 2 (SCD2) with Apache Spark and Apache Hudi on Amazon EMR</a>.</p><p>The following code snippet demonstrates the SCD Type 2 implementation logic. It creates Hudi tables in the default database of the Glue Data Catalog. The full version is in the script <a href="https://github.com/aws-samples/emr-on-eks-hudi-iceberg-delta/blob/main/hudi/hudi_scd_script.py" target="_blank" rel="noopener noreferrer">hudi_scd_script.py</a>.</p><p>In the job script, the <code>hudiOptions</code> are set to use the AWS Glue Data Catalog and enable DynamoDB-based optimistic concurrency control (OCC). For more information about concurrency control and alternative lock providers, refer to <a href="https://hudi.apache.org/docs/0.10.1/concurrency_control" target="_blank" rel="noopener noreferrer">Concurrency Control</a>.</p><ol><li>Upload the job scripts to Amazon S3:</li><li>Submit Hudi jobs with EMR on EKS to create SCD2 tables:<p>Hudi supports <a href="https://hudi.apache.org/docs/next/table_types" target="_blank" rel="noopener noreferrer">two table types</a>: Copy on Write (CoW) and Merge on Read (MoR). The following is the code snippet to create a CoW table. 
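Such a submission can be sketched as a boto3 start_job_run payload (hudi_submit_cow.sh performs the equivalent call with the AWS CLI; the virtual cluster ID, role ARN, bucket path, and entry-point argument layout here are illustrative assumptions):

```python
# An EMR on EKS job submission expressed as a boto3 "emr-containers"
# start_job_run payload. The virtual cluster ID, role ARN, bucket, and
# entry-point argument layout are illustrative placeholders.
def build_cow_job_request(virtual_cluster_id, execution_role_arn, s3_bucket):
    """Build a start_job_run request for the Hudi Copy on Write job."""
    return {
        "name": "hudi-cow-scd2",
        "virtualClusterId": virtual_cluster_id,
        "executionRoleArn": execution_role_arn,
        "releaseLabel": "emr-6.8.0-latest",  # EMR on EKS release 6.8.0, per this post
        "jobDriver": {
            "sparkSubmitJobDriver": {
                "entryPoint": f"s3://{s3_bucket}/app_code/job/hudi_scd_script.py",
                # Hypothetical arguments: target bucket and table type;
                # see hudi_submit_cow.sh for the real layout
                "entryPointArguments": [s3_bucket, "COW"],
                "sparkSubmitParameters":
                    "--conf spark.executor.cores=2 --conf spark.executor.memory=2G",
            }
        },
        "configurationOverrides": {
            "applicationConfiguration": [{
                "classification": "spark-defaults",
                "properties": {
                    # Use the Glue Data Catalog as the Hive metastore
                    "spark.hadoop.hive.metastore.client.factory.class":
                        "com.amazonaws.glue.catalog.metastore."
                        "AWSGlueDataCatalogHiveClientFactory",
                },
            }]
        },
    }

# Submitting would then be:
# boto3.client("emr-containers").start_job_run(**build_cow_job_request(...))
```

The MoR variant differs only in the table-type argument passed to the script.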
For the complete job scripts for each table type, refer to <a href="https://github.com/aws-samples/emr-on-eks-hudi-iceberg-delta/blob/main/hudi/hudi_submit_cow.sh" target="_blank" rel="noopener noreferrer">hudi_submit_cow.sh</a> and <a href="https://github.com/aws-samples/emr-on-eks-hudi-iceberg-delta/blob/main/hudi/hudi_submit_mor.sh" target="_blank" rel="noopener noreferrer">hudi_submit_mor.sh</a>.</p></li><li>Check the job status on the EMR virtual cluster console.<a href="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2022/09/16/BDB-2360-image003.jpg" target="_blank" rel="noopener noreferrer"><img class="alignnone size-full wp-image-34572" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2022/09/16/BDB-2360-image003.jpg" alt="" width="1262" height="529" /></a></li><li>Query the output in Athena:</li></ol><h2>Example 2: Run Apache Iceberg with EMR on EKS</h2><p>Starting with Amazon EMR version 6.6.0, you can use Apache Spark 3 on EMR on EKS with the Iceberg table format. For more information on how Iceberg works in an immutable data lake, see <a href="https://aws.amazon.com/blogs/big-data/build-a-high-performance-acid-compliant-evolving-data-lake-using-apache-iceberg-on-amazon-emr/" target="_blank" rel="noopener noreferrer">Build a high-performance, ACID compliant, evolving data lake using Apache Iceberg on Amazon EMR</a>.</p><p>The sample job creates an Iceberg table <code>iceberg_contact</code> in the <code>default</code> database of AWS Glue. The full version is in the <a href="https://github.com/aws-samples/emr-on-eks-hudi-iceberg-delta/blob/main/iceberg/iceberg_scd_script.py" target="_blank" rel="noopener noreferrer">iceberg_scd_script.py</a> script. 
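The SCD2 MERGE that iceberg_scd_script.py performs can be sketched as a Spark SQL statement along these lines (the glue_catalog name and columns such as checksum and iscurrent are illustrative assumptions, not the script's exact schema):

```python
# The shape of an SCD2 MERGE against an Iceberg table, as a Spark SQL
# string. Catalog, table, and column names (checksum, iscurrent, dates)
# are illustrative; iceberg_scd_script.py defines the real schema.
MERGE_SQL = """
MERGE INTO glue_catalog.default.iceberg_contact AS t
USING (SELECT * FROM contact_updates) AS s
ON t.id = s.id AND t.iscurrent = true
WHEN MATCHED AND t.checksum <> s.checksum THEN
  -- SCD2: expire the current version of a changed record
  UPDATE SET iscurrent = false, end_date = s.effective_date
WHEN NOT MATCHED THEN
  -- Insert records with new keys as current rows
  INSERT *
"""
# A full SCD2 flow also re-inserts the new version of each changed record,
# typically by unioning it into the source side of the MERGE.
```

In the Spark job this string would be executed with spark.sql(MERGE_SQL).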
The following code snippet shows the SCD2 type of MERGE operation:</p><p>As demonstrated earlier when discussing the job execution role, the <code>emr-on-eks-quickstart-execution-role</code> role grants access to the required DynamoDB table <code>myIcebergLockTable</code>, which is used to obtain locks on Iceberg tables when multiple concurrent write operations target a single table. For more information on Iceberg’s lock manager, refer to <a href="https://iceberg.apache.org/docs/latest/aws/#dynamodb-lock-manager" target="_blank" rel="noopener noreferrer">DynamoDB Lock Manager</a>.</p><ol><li>Upload the application scripts to the example S3 bucket:</li><li>Submit the job with EMR on EKS to create an SCD2 Iceberg table:<p>The full version of the code is in the <a href="https://github.com/aws-samples/emr-on-eks-hudi-iceberg-delta/blob/main/iceberg/iceberg_submit.sh" target="_blank" rel="noopener noreferrer">iceberg_submit.sh</a> script. The code snippet is as follows:</p></li><li>Check the job status on the EMR on EKS console.<a href="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2022/09/16/BDB-2360-image009.jpg" target="_blank" rel="noopener noreferrer"><img class="alignnone size-full wp-image-34575" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2022/09/16/BDB-2360-image009.jpg" alt="" width="1288" height="369" /></a></li><li>When the job is complete, query the table in Athena:</li></ol><h2>Example 3: Run open-source Delta Lake with EMR on EKS</h2><p>Delta Lake 2.1.x is compatible with Apache Spark 3.3.x. Check out the <a href="https://docs.delta.io/latest/releases.html" target="_blank" rel="noopener noreferrer">compatibility list</a> for other versions of Delta Lake and Spark. 
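Because Amazon EMR does not bundle Delta Lake, a job must supply the Delta JARs and register Delta's Spark SQL extension itself. A sketch of the Spark properties involved (the Maven URLs are illustrative; check the compatibility list for the versions matching your Spark release):

```python
# Spark properties that make open-source Delta Lake available at the
# application level. The Maven coordinates are illustrative; Delta 2.1
# also requires the matching delta-storage jar alongside delta-core.
delta_spark_conf = {
    # Ship the Delta jars with the job instead of baking a custom image
    "spark.jars": ",".join([
        "https://repo1.maven.org/maven2/io/delta/delta-core_2.12/2.1.0/"
        "delta-core_2.12-2.1.0.jar",
        "https://repo1.maven.org/maven2/io/delta/delta-storage/2.1.0/"
        "delta-storage-2.1.0.jar",
    ]),
    # Register Delta's SQL extension and catalog implementation
    "spark.sql.extensions": "io.delta.sql.DeltaSparkSessionExtension",
    "spark.sql.catalog.spark_catalog":
        "org.apache.spark.sql.delta.catalog.DeltaCatalog",
}
```

Each entry maps to a --conf flag in the job submission, which is the application-level option described earlier; the Docker-image option bakes the same jars into a custom image instead.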
In this post, we use Amazon EMR release 6.8 (Spark 3.3.0) to demonstrate the SCD2 implementation in a data lake.</p><p>The following is the Delta code snippet to load the initial dataset; the incremental-load MERGE logic is very similar to the Iceberg example. As a one-off task, two tables should be set up on the same data:</p><ul><li><strong>The Delta table delta_table_contact</strong> – Defined at the <code>TABLE_LOCATION</code> of <code>s3://{S3_BUCKET_NAME}/delta/delta_contact</code>. The MERGE/UPSERT operation must be implemented on this Delta destination table. Athena can’t query this table directly; instead, it reads from a <em>manifest file</em> stored in the same location, which is a text file containing a list of data files to read for querying a table. It is described as an Athena table below.</li><li><strong>The Athena table delta_contact</strong> – Defined on the manifest location <code>s3://{S3_BUCKET_NAME}/delta/delta_contact/_symlink_format_manifest/</code>. All read operations from Athena must use this table.</li></ul><p>The full version of the code is in the <a href="https://github.com/aws-samples/emr-on-eks-hudi-iceberg-delta/blob/main/delta/delta_scd_script.py" target="_blank" rel="noopener noreferrer">delta_scd_script.py</a> script. The code snippet is as follows:</p><p>The SQL statement <code>GENERATE symlink_format_manifest FOR TABLE ...</code> is a required step to set up Athena for Delta Lake. Whenever the data in a Delta table is updated, you must regenerate the manifests. Therefore, we use <code>ALTER TABLE ... SET TBLPROPERTIES(delta.compatibility.symlinkFormatManifest.enabled=true)</code> to automate the manifest refresh as a one-off setup.</p><ol><li>Upload the Delta sample scripts to the S3 bucket:</li><li>Submit the job with EMR on EKS:<p>The full version of the code is in the <a href="https://github.com/aws-samples/emr-on-eks-hudi-iceberg-delta/blob/main/delta/delta_submit.sh" target="_blank" rel="noopener noreferrer">delta_submit.sh</a> script. The open-source Delta JAR files must be included in <code>spark.jars</code>. Alternatively, follow the instructions in <a href="https://docs.aws.amazon.com/emr/latest/EMR-on-EKS-DevelopmentGuide/docker-custom-images-steps.html" target="_blank" rel="noopener noreferrer">How to customize Docker images</a> and build a custom EMR on EKS image to accommodate the Delta dependencies.</p><p class="code">The code snippet is as follows:</p></li><li>Check the job status on the EMR on EKS console.<a href="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2022/09/16/BDB-2360-image013.png" target="_blank" rel="noopener noreferrer"><img class="alignnone size-full wp-image-34577" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2022/09/16/BDB-2360-image013.png" alt="" width="1323" height="366" /></a></li><li>When the job is complete, query the table in Athena:</li></ol><h2>Clean up</h2><p>To avoid incurring future charges, delete the generated resources if you no longer need the solution. Run the following cleanup script (change the Region if necessary):</p><h2>Conclusion</h2><p>Implementing an ACID-compliant data lake with EMR on EKS enables you to focus more on delivering business value, instead of worrying about managing complexity and reliability at the data storage layer.</p><p>This post presented three different transactional storage frameworks that can meet your ACID needs. They ensure you never read partial data (Atomicity). 
The read/write isolation allows you to see consistent snapshots of the data, even if an update occurs at the same time (Consistency and Isolation). All the transactions are stored directly in the underlying Amazon S3-backed data lake, which is designed for 11 9’s of durability (Durability).</p><p>For more information, check out the <a href="https://github.com/aws-samples/emr-on-eks-hudi-iceberg-delta" target="_blank" rel="noopener noreferrer">sample GitHub repository</a> used in this post and the <a href="https://catalog.us-east-1.prod.workshops.aws/workshops/1f91e1d4-5587-40ff-8d5d-54fc86e0ddc1/en-US" target="_blank" rel="noopener noreferrer">EMR on EKS Workshop</a>. They will get you started with running your familiar transactional framework with EMR on EKS. If you want to dive deep into each storage format, check out the following posts:</p><h3>About the authors</h3><p class="c5"><strong><a href="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2021/12/03/Amir-Shenavandeh.png" target="_blank" rel="noopener noreferrer"><img class="size-full wp-image-24490 alignleft" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2021/12/03/Amir-Shenavandeh.png" alt="" width="100" height="150" /></a> Amir Shenavandeh</strong> is a Sr Analytics Specialist Solutions Architect and Amazon EMR subject matter expert at Amazon Web Services. He helps customers with architectural guidance and optimisation. 
He leverages his experience to help people bring their ideas to life, focusing on distributed processing and big data architectures.</p><p class="c5"><strong><a href="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2021/06/24/Melody-Yang.jpg" target="_blank" rel="noopener noreferrer"><img class="size-full wp-image-19144 alignleft" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2021/06/24/Melody-Yang.jpg" alt="" width="100" height="133" /></a>Melody Yang</strong> is a Senior Big Data Solutions Architect for Amazon EMR at AWS. She is an experienced analytics leader working with AWS customers to provide best practice guidance and technical advice to help them succeed in their data transformation. Her areas of interest are open-source frameworks and automation, data engineering, and DataOps.</p><p class="c5"><a href="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2022/09/16/Amit-Maindola.png" target="_blank" rel="noopener noreferrer"><img class="size-full wp-image-34586 alignleft" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2022/09/16/Amit-Maindola.png" alt="" width="100" height="133" /></a><strong>Amit Maindola</strong> is a Data Architect focused on big data and analytics at Amazon Web Services. He helps customers in their digital transformation journey and enables them to build highly scalable, robust, and secure cloud-based analytical solutions on AWS to gain timely insights and make critical business decisions.</p></section>
