Unknown data source, October 2, 2024
Run a data processing job on Amazon EMR Serverless with AWS Step Functions

 

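This article describes how to build a data processing pipeline using infrastructure as code (IaC) frameworks and AWS services. It covers defining the infrastructure with Terraform, processing data with Amazon EMR Serverless and AWS Step Functions, and the related design decisions and prerequisites.

🌐 Terraform is an IaC tool similar to AWS CloudFormation. It offers a friendly syntax and features such as planning, graphing, and templates, and can be used to create, update, and version AWS infrastructure.

🎯 The article shows how to build and orchestrate a Scala Spark application using Amazon EMR Serverless, AWS Step Functions, and Terraform to process sample clickstream data and store the results.

📄 The solution uses several AWS services, such as AWS Lambda functions, an Amazon Kinesis Data Firehose delivery stream, and the AWS Glue Data Catalog, and the article explains the role of each service and the overall flow.

💡 The article discusses the solution's design decisions, such as choosing the right tool for real-world workloads and developing the code in different languages, and lists the prerequisites for using the solution.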

<section class="blog-post-content"><p>Several infrastructure as code (IaC) frameworks are available today to help you define your infrastructure, such as the <a href="https://aws.amazon.com/cdk/" target="_blank" rel="noopener noreferrer">AWS Cloud Development Kit</a> (AWS CDK) or <a href="https://www.terraform.io/" target="_blank" rel="noopener noreferrer">Terraform by HashiCorp</a>. Terraform, an AWS Partner Network (APN) Advanced Technology Partner and member of the AWS DevOps Competency, is an IaC tool similar to <a href="http://aws.amazon.com/cloudformation" target="_blank" rel="noopener noreferrer">AWS CloudFormation</a> that allows you to create, update, and version your AWS infrastructure. Terraform provides friendly syntax (similar to AWS CloudFormation) along with other features like planning (a preview of the changes before they are applied), graphing, and the ability to create templates that break infrastructure configurations into smaller chunks, which improves maintainability and reusability. We use the capabilities and features of Terraform to build an API-based ingestion process into AWS. Let’s get started!</p><p>In this post, we showcase how to build and orchestrate a <a href="https://www.scala-lang.org/" target="_blank" rel="noopener noreferrer">Scala</a> Spark application using <a href="https://aws.amazon.com/emr/serverless/" target="_blank" rel="noopener noreferrer">Amazon EMR Serverless</a>, <a href="http://aws.amazon.com/step-functions" target="_blank" rel="noopener noreferrer">AWS Step Functions</a>, and Terraform. 
In this end-to-end solution, we run a Spark job on EMR Serverless that processes sample clickstream data in an <a href="http://aws.amazon.com/s3" target="_blank" rel="noopener noreferrer">Amazon Simple Storage Service</a> (Amazon S3) bucket and stores the aggregation results in Amazon S3.</p><p>With EMR Serverless, you don’t have to configure, optimize, secure, or operate clusters to run applications. You continue to get the benefits of <a href="https://aws.amazon.com/emr/" target="_blank" rel="noopener noreferrer">Amazon EMR</a>, such as open source compatibility, concurrency, and optimized runtime performance for popular data frameworks. EMR Serverless is suitable for customers who want ease in operating applications using open-source frameworks. It offers quick job startup, automatic capacity management, and straightforward cost controls.</p><h2 id="solution-overview">Solution overview</h2><p>We provide the Terraform infrastructure definition and the source code for an <a href="http://aws.amazon.com/lambda" target="_blank" rel="noopener noreferrer">AWS Lambda</a> function that takes sample user clicks on a website as input and ingests them into an <a href="https://aws.amazon.com/kinesis/data-firehose/" target="_blank" rel="noopener noreferrer">Amazon Kinesis Data Firehose</a> delivery stream. The solution uses Kinesis Data Firehose to convert the incoming data into Parquet files (an open-source file format for Hadoop), using the <a href="https://aws.amazon.com/glue/" target="_blank" rel="noopener noreferrer">AWS Glue</a> Data Catalog, before pushing them to Amazon S3. The resulting Parquet log files in Amazon S3 are then processed by an EMR Serverless job, which writes a report detailing aggregate clickstream statistics to an S3 bucket. The EMR Serverless job is triggered using Step Functions. 
The sample architecture and code are spun up as shown in the following diagram.</p><p><a href="https://d2908q01vomqb2.cloudfront.net/0716d9708d321ffb6a00818614779e779925365c/2022/09/08/emr-serverless-click-logs-from-web-application.drawio.png"><img class="aligncenter size-large wp-image-10067" src="https://d2908q01vomqb2.cloudfront.net/0716d9708d321ffb6a00818614779e779925365c/2022/09/08/emr-serverless-click-logs-from-web-application.drawio-1024x768.png" alt="emr serverless application" width="1024" height="768" /></a></p><p>The provided samples have the source code for building the infrastructure using Terraform for running the Amazon EMR application. Setup scripts are provided to create the sample ingestion using Lambda for the incoming application logs. For a similar ingestion pattern sample, refer to <a href="https://aws.amazon.com/blogs/developer/provision-aws-infrastructure-using-terraform-by-hashicorp-an-example-of-web-application-logging-customer-data/" target="_blank" rel="noopener noreferrer">Provision AWS infrastructure using Terraform (By HashiCorp): an example of web application logging customer data</a>.</p><p>The following are the high-level steps and AWS services used in this solution:</p><ul><li>The provided application code is packaged and built using Apache Maven.</li><li>Terraform commands are used to deploy the infrastructure in AWS.</li><li>The EMR Serverless application provides the option to submit a Spark job.</li><li>The solution uses two Lambda functions:<ul><li><strong>Ingestion</strong> – This function processes the incoming request and pushes the data into the Kinesis Data Firehose delivery stream.</li><li><strong>EMR Start Job</strong> – This function starts the EMR Serverless application. 
The EMR job processes the ingested user click logs and writes the output to another S3 bucket.</li></ul></li><li>Step Functions triggers the EMR Start Job Lambda function, which submits the application to EMR Serverless for processing of the ingested log files.</li><li>The solution uses four S3 buckets:<ul><li><strong>Kinesis Data Firehose delivery bucket</strong> – Stores the ingested application logs in Parquet file format.</li><li><strong>Loggregator source bucket</strong> – Stores the Scala code and JAR for running the EMR job.</li><li><strong>Loggregator output bucket</strong> – Stores the EMR processed output.</li><li><strong>EMR Serverless logs bucket</strong> – Stores the EMR process application logs.</li></ul></li><li>Sample invoke commands (run as part of the initial setup process) insert the data using the ingestion Lambda function. The Kinesis Data Firehose delivery stream converts the incoming stream into a Parquet file and stores it in an S3 bucket.</li></ul><p>For this solution, we made the following design decisions:</p><ul><li>We use Step Functions and Lambda in this use case to trigger the EMR Serverless application. In a real-world use case, the data processing application could be long running and may exceed Lambda’s timeout limits. In this case, you can use tools like <a href="https://aws.amazon.com/managed-workflows-for-apache-airflow/" target="_blank" rel="noopener noreferrer">Amazon Managed Workflows for Apache Airflow</a> (Amazon MWAA). Amazon MWAA is a managed orchestration service that makes it easier to set up and operate end-to-end data pipelines in the cloud at scale.</li><li>The Lambda code and EMR Serverless log aggregation code are developed using Java and Scala, respectively. 
You can use any supported language in these use cases.</li><li>The <a href="http://aws.amazon.com/cli" target="_blank" rel="noopener noreferrer">AWS Command Line Interface</a> (AWS CLI) V2 is required for querying EMR Serverless applications from the command line. You can also view these from the <a href="http://aws.amazon.com/console" target="_blank" rel="noopener noreferrer">AWS Management Console</a>. We provide a sample AWS CLI command to test the solution later in this post.</li></ul><h2 id="prerequisites">Prerequisites</h2><p>To use this solution, you must complete the following prerequisites:</p><ul><li><a href="https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html" target="_blank" rel="noopener noreferrer">Install the AWS CLI</a>. For this post, we used version 2.7.18. This is required in order to run the <code>aws emr-serverless</code> AWS CLI commands from your local machine. Optionally, all the AWS services used in this post can be viewed and operated via the console.</li><li>Make sure <a href="https://www.java.com/en/download/" target="_blank" rel="noopener noreferrer">Java</a> is installed and JDK/JRE 8 is set in the environment path of your machine. For instructions, see the <a href="https://www.java.com/en/download/" target="_blank" rel="noopener noreferrer">Java Development Kit</a>.</li><li>Install <a href="https://maven.apache.org/download.cgi" target="_blank" rel="noopener noreferrer">Apache Maven</a>. The Java Lambda functions are built using mvn packages and are deployed using Terraform into AWS.</li><li>Install the <a href="https://www.scala-sbt.org/download.html" target="_blank" rel="noopener noreferrer">Scala Build Tool</a>. For this post, we used version 1.4.7. Download and install the version for your operating system.</li><li>Set up <a href="https://www.terraform.io/downloads" target="_blank" rel="noopener noreferrer">Terraform</a>. 
For steps, see <a href="https://www.terraform.io/downloads" target="_blank" rel="noopener noreferrer">Terraform downloads</a>. We use version 1.2.5 for this post.</li><li>Have an <a href="https://aws.amazon.com/free/" target="_blank" rel="noopener noreferrer">AWS account</a>.</li></ul><h2 id="configure-the-solution">Configure the solution</h2><p>To spin up the infrastructure and the application, complete the following steps:</p><ol><li>Clone the following <a href="https://github.com/aws-samples/aws-emr-serverless-using-terraform" target="_blank" rel="noopener noreferrer">GitHub repository</a>. The provided <code>exec.sh</code> shell script builds the Java application JAR (for the Lambda ingestion function) and the Scala application JAR (for the EMR processing) and deploys the AWS infrastructure that is needed for this use case.</li><li>Run the following commands:<p>To run the commands individually, set the application deployment Region and account number, as shown in the following example:</p><p>The following builds the Lambda application JAR with Maven and packages the Scala application:</p></li><li>Deploy the AWS infrastructure using Terraform:</li></ol><h2>Test the solution</h2><p>After you build and deploy the application, you can insert sample data for Amazon EMR processing. We use the following code as an example. The <code>exec.sh</code> script has multiple sample insertions for Lambda. 
The ingested logs are used by the EMR Serverless application job.</p><p>The sample AWS CLI invoke command inserts sample data for the application logs:</p><p>To validate the deployments, complete the following steps:</p><ol><li>On the Amazon S3 console, navigate to the bucket created as part of the infrastructure setup.</li><li>Choose the bucket to view the files. You should see that data from the ingested stream was converted into a Parquet file.</li><li>Choose the file to view the data. The following screenshot shows an example of our bucket contents.<a href="https://d2908q01vomqb2.cloudfront.net/0716d9708d321ffb6a00818614779e779925365c/2022/09/08/s3_source_parquet_files.png"><img class="aligncenter size-full wp-image-10069" src="https://d2908q01vomqb2.cloudfront.net/0716d9708d321ffb6a00818614779e779925365c/2022/09/08/s3_source_parquet_files.png" alt="" width="977" height="422" /></a>Now you can run Step Functions to validate the EMR Serverless application.</li><li>On the Step Functions console, open <code>clicklogger-dev-state-machine</code>. The state machine shows the steps that trigger the Lambda function and the EMR Serverless application, as shown in the following diagram.<a href="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2022/09/09/step_function_success.png"><img class="alignnone wp-image-34220 size-full" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2022/09/09/step_function_success.png" alt="" width="500" height="390" /></a></li><li>Run the state machine.</li><li>After the state machine runs successfully, navigate to the <code>clicklogger-dev-output-</code> bucket on the Amazon S3 console to see the output files.<a href="https://d2908q01vomqb2.cloudfront.net/0716d9708d321ffb6a00818614779e779925365c/2022/09/08/s3_output_response_file.png"><img class="aligncenter size-full wp-image-10068" src="https://d2908q01vomqb2.cloudfront.net/0716d9708d321ffb6a00818614779e779925365c/2022/09/08/s3_output_response_file.png" alt="" width="977" height="331" /></a></li>
<li>Use the AWS CLI to check the deployed EMR Serverless application:</li><li>On the Amazon EMR console, choose <strong>Serverless</strong> in the navigation pane.</li><li>Select <code>clicklogger-dev-studio</code> and choose <strong>Manage applications</strong>.</li><li>The application created by the stack is named <code>clicklogger-dev-loggregator-emr-&lt;Your-Account-Number&gt;</code>, as shown in the following screenshots.<a href="https://d2908q01vomqb2.cloudfront.net/0716d9708d321ffb6a00818614779e779925365c/2022/09/08/EMRStudioApplications.png"><img class="aligncenter size-large wp-image-10072" src="https://d2908q01vomqb2.cloudfront.net/0716d9708d321ffb6a00818614779e779925365c/2022/09/08/EMRStudioApplications-1024x481.png" alt="" width="1024" height="481" /></a><a href="https://d2908q01vomqb2.cloudfront.net/0716d9708d321ffb6a00818614779e779925365c/2022/09/08/EMRServerlessApplication.png"><img class="aligncenter size-large wp-image-10071" src="https://d2908q01vomqb2.cloudfront.net/0716d9708d321ffb6a00818614779e779925365c/2022/09/08/EMRServerlessApplication-1024x343.png" alt="" width="1024" height="343" /></a>Now you can review the EMR Serverless application output.</li><li>On the Amazon S3 console, open the output bucket (<code>us-east-1-clicklogger-dev-loggregator-output-</code>). The EMR Serverless application writes the output based on the date partition, such as <code>2022/07/28/response.md</code>. The following code shows an example of the file output:</li></ol><h2>Clean up</h2><p>The provided <code>./cleanup.sh</code> script has the required steps to delete all the files from the S3 buckets that were created as part of this post. The <code>terraform destroy</code> command cleans up the AWS infrastructure that you created earlier. 
See the following code:</p><p>To do the steps manually, you can also delete the resources via the AWS CLI:</p><h2>Conclusion</h2><p>In this post, we built, deployed, and ran a data processing Spark job in EMR Serverless that interacts with various AWS services. We walked through deploying a Java Lambda function packaged with Maven and the Scala application code for the EMR Serverless application, triggered with Step Functions and deployed with infrastructure as code. You can use any combination of applicable programming languages to build your Lambda functions and EMR job application. EMR Serverless jobs can be triggered manually, automated, or orchestrated using AWS services like Step Functions and Amazon MWAA.</p><p>We encourage you to test this example and see for yourself how this overall application design works within AWS. Then, it’s just a matter of replacing your individual code base, packaging it, and letting EMR Serverless handle the process efficiently.</p><p>If you implement this example and run into any issues, or have any questions or feedback about this post, please leave a comment!</p><h3><strong>About the Authors</strong></h3><p class="c4"><a href="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2022/09/09/ramani.png"><img class="size-full wp-image-34218 alignleft" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2022/09/09/ramani.png" alt="" width="100" height="100" /></a><strong>Sivasubramanian Ramani (Siva Ramani)</strong> is a Sr Cloud Application Architect at Amazon Web Services. 
His expertise is in application optimization &amp; modernization, serverless solutions, and using Microsoft application workloads with AWS.</p><p class="c4"><strong><a href="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2022/09/09/nbbalara.jpg"><img class="size-full wp-image-34217 alignleft" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2022/09/09/nbbalara.jpg" alt="" width="100" height="125" /></a>Naveen Balaraman</strong> is a Sr Cloud Application Architect at Amazon Web Services. He is passionate about containers, serverless applications, architecting microservices, and helping customers leverage the power of the AWS Cloud.</p></section>
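
The EMR Start Job step described in the post can be sketched in Python (the post's own Lambda code is written in Java). This is a minimal sketch under stated assumptions, not the repository's implementation: the function names, the application ID, the role ARN, the bucket paths, and the class name are hypothetical placeholders. It builds the keyword arguments for the boto3 `emr-serverless` client's `start_job_run` call and keeps the AWS call in a separate function, so the request can be inspected without credentials.

```python
from typing import Any, Dict, List


def build_start_job_request(
    application_id: str,
    execution_role_arn: str,
    jar_uri: str,
    main_class: str,
    job_args: List[str],
    log_uri: str,
) -> Dict[str, Any]:
    """Build keyword arguments for the EMR Serverless StartJobRun API."""
    return {
        "applicationId": application_id,
        "executionRoleArn": execution_role_arn,
        "jobDriver": {
            "sparkSubmit": {
                # S3 location of the Spark application JAR
                "entryPoint": jar_uri,
                # Arguments passed to the Spark application's main method
                "entryPointArguments": job_args,
                "sparkSubmitParameters": f"--class {main_class}",
            }
        },
        "configurationOverrides": {
            "monitoringConfiguration": {
                # Where EMR Serverless writes the job's application logs
                "s3MonitoringConfiguration": {"logUri": log_uri}
            }
        },
    }


def start_job(request: Dict[str, Any]) -> str:
    """Submit the job and return its run ID (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the builder above works without boto3

    client = boto3.client("emr-serverless")
    response = client.start_job_run(**request)
    return response["jobRunId"]


# All values below are hypothetical and for illustration only.
example_request = build_start_job_request(
    application_id="00example1234567",
    execution_role_arn="arn:aws:iam::111122223333:role/example-emr-role",
    jar_uri="s3://example-source-bucket/loggregator.jar",
    main_class="com.example.Loggregator",
    job_args=["s3://example-firehose-bucket/", "s3://example-output-bucket/"],
    log_uri="s3://example-emr-logs-bucket/",
)
```

A Lambda handler would call `start_job(example_request)` and return the run ID to Step Functions. Because the handler only submits the job and does not wait for it, the Lambda timeout concern raised in the design decisions applies to any polling logic added on top, not to the submission itself.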

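
The date-partitioned output layout mentioned in the post (for example, `2022/07/28/response.md`) can be reproduced with a small helper when you want to locate a given day's report. This is a sketch that assumes the partition is zero-padded year/month/day and that the file name is fixed; the function name is hypothetical and not part of the sample repository.

```python
from datetime import date


def output_key(run_date: date, filename: str = "response.md") -> str:
    """Return the S3 object key for the report written on run_date."""
    # date supports strftime-style format specs inside f-strings
    return f"{run_date:%Y/%m/%d}/{filename}"


print(output_key(date(2022, 7, 28)))  # prints 2022/07/28/response.md
```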