AWS Machine Learning Blog 2024年09月04日
Use LangChain with PySpark to process documents at massive scale with Amazon SageMaker Studio and Amazon EMR Serverless


Harnessing the power of big data has become increasingly critical for businesses looking to gain a competitive edge. From deriving insights to powering generative artificial intelligence (AI)-driven applications, the ability to efficiently process and analyze large datasets is a vital capability. However, managing the complex infrastructure required for big data workloads has traditionally been a significant challenge, often requiring specialized expertise. That’s where the new Amazon EMR Serverless application integration in Amazon SageMaker Studio can help.

With the introduction of EMR Serverless support for Apache Livy endpoints, SageMaker Studio users can now seamlessly integrate their Jupyter notebooks running sparkmagic kernels with the powerful data processing capabilities of EMR Serverless. This allows SageMaker Studio users to perform petabyte-scale interactive data preparation, exploration, and machine learning (ML) directly within their familiar Studio notebooks, without the need to manage the underlying compute infrastructure. By using the Livy REST APIs, SageMaker Studio users can also extend their interactive analytics workflows beyond just notebook-based scenarios, enabling a more comprehensive and streamlined data science experience within the Amazon SageMaker ecosystem.

In this post, we demonstrate how to leverage the new EMR Serverless integration with SageMaker Studio to streamline your data processing and machine learning workflows.

Benefits of integrating EMR Serverless with SageMaker Studio

The EMR Serverless application integration in SageMaker Studio offers several key benefits that can transform the way your organization approaches big data:

Solution overview

SageMaker Studio is a fully integrated development environment (IDE) for ML that enables data scientists and developers to build, train, debug, deploy, and monitor models within a single web-based interface. SageMaker Studio runs inside an AWS managed virtual private cloud (VPC), with network access for SageMaker Studio domains, in this setup configured as VPC-only. SageMaker Studio automatically creates an elastic network interface within your VPC’s private subnet, which connects to the required AWS services through VPC endpoints. This same interface is also used for provisioning EMR clusters. The following diagram illustrates this solution.

An ML platform administrator can manage permissioning for the EMR Serverless integration in SageMaker Studio. The administrator can configure the appropriate privileges by updating the runtime role with an inline policy, allowing SageMaker Studio users to interactively create, update, list, start, stop, and delete EMR Serverless clusters. SageMaker Studio users are presented with built-in forms within the SageMaker Studio UI that don’t require additional configuration to interact with both EMR Serverless and Amazon Elastic Compute Cloud (Amazon EC2) based clusters.

Apache Spark and its Python API, PySpark, empower users to process massive datasets effortlessly by using distributed computing across multiple nodes. These powerful frameworks simplify the complexities of parallel processing, enabling you to write code in a familiar syntax while the underlying engine manages data partitioning, task distribution, and fault tolerance. With scalability as a core strength, Spark and PySpark allow you to handle datasets of virtually any size, eliminating the constraints of a single machine.

Empowering knowledge retrieval and generation with scalable Retrieval Augmented Generation (RAG) architecture is increasingly important in today’s era of ever-growing information. Effectively using data to provide contextual and informative responses has become a crucial challenge. This is where RAG systems excel, combining the strengths of information retrieval and text generation to deliver comprehensive and accurate results. In this post, we explore how to build a scalable and efficient RAG system using the new EMR Serverless integration, Spark’s distributed processing, and an Amazon OpenSearch Service vector database powered by the LangChain orchestration framework. This solution enables you to process massive volumes of textual data, generate relevant embeddings, and store them in a powerful vector database for seamless retrieval and generation.

Authentication mechanism

When integrating EMR Serverless in SageMaker Studio, you can use runtime roles. Runtime roles are AWS Identity and Access Management (IAM) roles that you can specify when submitting a job or query to an EMR Serverless application. These runtime roles provide the necessary permissions for your workloads to access AWS resources, such as Amazon Simple Storage Service (Amazon S3) buckets. When integrating EMR Serverless in SageMaker Studio, you can configure the IAM role to be used by SageMaker Studio. By using EMR runtime roles, you can make sure your workloads have the minimum set of permissions required to access the necessary resources, following the principle of least privilege. This enhances the overall security of your data processing pipelines and helps you maintain better control over the access to your AWS resources.
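As an illustrative sketch of how a runtime role is attached to a job (the application ID, role ARN, and script location below are placeholders, not values from this post), the EMR Serverless StartJobRun API takes the role in its executionRoleArn parameter:

```python
def build_job_run_request(application_id, runtime_role_arn, script_s3_uri):
    """Assemble StartJobRun parameters that attach a runtime role to the job.

    The job run assumes runtime_role_arn when it calls other AWS services,
    so the role's policies bound what the workload can access.
    """
    return {
        "applicationId": application_id,
        "executionRoleArn": runtime_role_arn,  # runtime role assumed by the job
        "jobDriver": {
            "sparkSubmit": {"entryPoint": script_s3_uri}
        },
    }

request = build_job_run_request(
    "00example1234",  # hypothetical application ID
    "arn:aws:iam::111122223333:role/SM-EMRServerless-RunTime-role",
    "s3://amzn-s3-demo-bucket/scripts/process.py",  # hypothetical script location
)

# With AWS credentials configured, the request would be submitted as:
# import boto3
# boto3.client("emr-serverless").start_job_run(**request)
```

Because the role is supplied per job run, different teams can share one application while each workload keeps its own least-privilege permissions.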

Cost attribution of EMR Serverless clusters

EMR Serverless clusters created within SageMaker Studio are automatically tagged with system default tags, specifically the domain-arn and user-profile-arn tags. These system-generated tags simplify cost allocation and attribution of Amazon EMR resources. See the following code:

# domain tag
sagemaker:domain-arn: arn:aws:sagemaker:<region>:<account-id>:domain/<domain-id>
# user profile tag
sagemaker:user-profile-arn: arn:aws:sagemaker:<region>:<account-id>:user-profile/<domain-id>/<user-profile-name>

To learn more about enterprise-level cost allocation for ML environments, refer to Set up enterprise-level cost allocation for ML environments and workloads using resource tagging in Amazon SageMaker.
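These tags can also drive per-team reporting outside the AWS Billing console. As a sketch (the ARN values below are made up, following the tag format shown above), a small helper can pull the domain ID and user profile name out of the tag set returned for an application, for example by the EMR Serverless ListTagsForResource API:

```python
def cost_allocation_keys(tags):
    """Extract domain and user-profile identifiers from the system default
    tags that SageMaker Studio applies to EMR Serverless applications.

    The identifier is the last path segment of each ARN.
    """
    domain_arn = tags.get("sagemaker:domain-arn", "")
    profile_arn = tags.get("sagemaker:user-profile-arn", "")
    return {
        "domain_id": domain_arn.rsplit("/", 1)[-1],
        "user_profile": profile_arn.rsplit("/", 1)[-1],
    }

tags = {  # shape follows the tag format above; values are invented
    "sagemaker:domain-arn": "arn:aws:sagemaker:us-east-1:111122223333:domain/d-abc123",
    "sagemaker:user-profile-arn": "arn:aws:sagemaker:us-east-1:111122223333:user-profile/d-abc123/alice",
}
print(cost_allocation_keys(tags))  # {'domain_id': 'd-abc123', 'user_profile': 'alice'}
```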

Prerequisites

Before you get started, complete the prerequisite steps in this section.

Create a SageMaker Studio domain

This post walks you through the integration between SageMaker Studio and EMR Serverless using an interactive SageMaker Studio notebook. We assume you already have a SageMaker Studio domain provisioned with a UserProfile and an ExecutionRole. If you don’t have a SageMaker Studio domain available, refer to Quick setup to Amazon SageMaker to provision one.

Create an EMR Serverless job runtime role

EMR Serverless allows you to specify IAM role permissions that an EMR Serverless job run can assume when calling other services on your behalf. This includes access to Amazon S3 for data sources and targets, as well as other AWS resources like Amazon Redshift clusters and Amazon DynamoDB tables. To learn more about creating a role, refer to Create a job runtime role.

The following sample IAM inline policy, attached to a runtime role, provides access to an S3 bucket and AWS Glue. You can modify the role to include any additional services that EMR Serverless needs to access at runtime. Additionally, make sure you scope down the resources in the runtime policies to adhere to the principle of least privilege.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "ReadAccessForEMRSamples",
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:ListBucket"
      ],
      "Resource": [
        "arn:aws:s3:::*.elasticmapreduce",
        "arn:aws:s3:::*.elasticmapreduce/*"
      ]
    },
    {
      "Sid": "FullAccessToOutputBucket",
      "Effect": "Allow",
      "Action": [
        "s3:PutObject",
        "s3:GetObject",
        "s3:ListBucket",
        "s3:DeleteObject"
      ],
      "Resource": [
        "arn:aws:s3:::<emrs-sample-s3-bucket-name>",
        "arn:aws:s3:::<emrs-sample-s3-bucket-name>/*"
      ]
    },
    {
      "Sid": "GlueCreateAndReadDataCatalog",
      "Effect": "Allow",
      "Action": [
        "glue:GetDatabase",
        "glue:CreateDatabase",
        "glue:GetDatabases",
        "glue:CreateTable",
        "glue:GetTable",
        "glue:UpdateTable",
        "glue:DeleteTable",
        "glue:GetTables",
        "glue:GetPartition",
        "glue:GetPartitions",
        "glue:CreatePartition",
        "glue:BatchCreatePartition",
        "glue:GetUserDefinedFunctions"
      ],
      "Resource": ["*"]
    }
  ]
}

Lastly, make sure your role has a trust relationship with EMR Serverless:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "Service": "emr-serverless.amazonaws.com"
            },
            "Action": "sts:AssumeRole"
        }
    ]
}

Optionally, you can create a runtime role and policy using infrastructure as code (IaC), such as with AWS CloudFormation or Terraform, or using the AWS Command Line Interface (AWS CLI).
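For example, the same trust policy shown earlier can be reused when creating the role programmatically. The sketch below builds the CreateRole parameters (the role name is a placeholder); the actual IAM call is commented out because it requires AWS credentials:

```python
import json

# Trust policy from the preceding snippet, allowing EMR Serverless to assume the role.
TRUST_POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": "emr-serverless.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }
    ],
}

def create_role_request(role_name):
    """Build CreateRole parameters; IAM expects the trust policy as a JSON string."""
    return {
        "RoleName": role_name,
        "AssumeRolePolicyDocument": json.dumps(TRUST_POLICY),
    }

params = create_role_request("SM-EMRServerless-RunTime-role")
# With AWS credentials configured:
# import boto3
# boto3.client("iam").create_role(**params)
```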

Update the SageMaker role to allow EMR Serverless access

This one-time task enables SageMaker Studio users to create, update, list, start, stop, and delete EMR Serverless clusters. We begin by creating an inline policy that grants the necessary permissions for these actions on EMR Serverless clusters, then attach the policy to the Studio domain or user profile role:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "EMRServerlessUnTaggedActions",
      "Effect": "Allow",
      "Action": [
        "emr-serverless:ListApplications"
      ],
      "Resource": "arn:aws:emr-serverless:<region>:<aws-account-id>:/*"
    },
    {
      "Sid": "EMRServerlessPassRole",
      "Effect": "Allow",
      "Action": "iam:PassRole",
      "Resource": "arn:aws:iam::<aws-account-id>:role/SM-EMRServerless-RunTime-role",
      "Condition": {
        "StringLike": {
          "iam:PassedToService": "emr-serverless.amazonaws.com"
        }
      }
    },
    {
      "Sid": "EMRServerlessCreateApplicationAction",
      "Effect": "Allow",
      "Action": [
        "emr-serverless:CreateApplication",
        "emr-serverless:TagResource"
      ],
      "Resource": "arn:aws:emr-serverless:<region>:<aws-account-id>:/*",
      "Condition": {
        "ForAllValues:StringEquals": {
          "aws:TagKeys": [
            "sagemaker:domain-arn",
            "sagemaker:user-profile-arn",
            "sagemaker:space-arn"
          ]
        },
        "Null": {
          "aws:RequestTag/sagemaker:domain-arn": "false",
          "aws:RequestTag/sagemaker:user-profile-arn": "false",
          "aws:RequestTag/sagemaker:space-arn": "false"
        }
      }
    },
    {
      "Sid": "EMRServerlessDenyPermissiveTaggingAction",
      "Effect": "Deny",
      "Action": [
        "emr-serverless:TagResource",
        "emr-serverless:UntagResource"
      ],
      "Resource": "arn:aws:emr-serverless:<region>:<aws-account-id>:/*",
      "Condition": {
        "Null": {
          "aws:ResourceTag/sagemaker:domain-arn": "true",
          "aws:ResourceTag/sagemaker:user-profile-arn": "true",
          "aws:ResourceTag/sagemaker:space-arn": "true"
        }
      }
    },
    {
      "Sid": "EMRServerlessActions",
      "Effect": "Allow",
      "Action": [
        "emr-serverless:StartApplication",
        "emr-serverless:StopApplication",
        "emr-serverless:GetApplication",
        "emr-serverless:DeleteApplication",
        "emr-serverless:AccessLivyEndpoints",
        "emr-serverless:GetDashboardForJobRun"
      ],
      "Resource": "arn:aws:emr-serverless:<region>:<aws-account-id>:/applications/*",
      "Condition": {
        "Null": {
          "aws:ResourceTag/sagemaker:domain-arn": "false",
          "aws:ResourceTag/sagemaker:user-profile-arn": "false",
          "aws:ResourceTag/sagemaker:space-arn": "false"
        }
      }
    }
  ]
}

Update the domain with EMR Serverless runtime roles

SageMaker Studio supports access to EMR Serverless clusters in two ways: in the same account as the SageMaker Studio domain or across accounts.

To interact with EMR Serverless clusters created in the same account as the SageMaker Studio domain, create a file named same-account-update-domain.json:

{
    "DomainId": "<emr-s-sm-studio-domain-id>",
    "DefaultUserSettings": {
        "JupyterLabAppSettings": {
            "EmrSettings": {
                "ExecutionRoleArns": [
                    "arn:aws:iam::<aws-account-id>:role/<same-account-emr-runtime-role>"
                ]
            }
        }
    }
}

Then run an update-domain command to allow all users in the domain to use the runtime role:

aws --region <region> \
sagemaker update-domain \
--cli-input-json file://same-account-update-domain.json

For EMR Serverless clusters created in a different account, create a file named cross-account-update-domain.json:

{
    "DomainId": "<emr-s-sm-studio-domain-id>",
    "DefaultUserSettings": {
        "JupyterLabAppSettings": {
            "EmrSettings": {
                "AssumableRoleArns": [
                    "arn:aws:iam::<aws-account-id>:role/<cross-account-emr-runtime-role>"
                ]
            }
        }
    }
}

Then run an update-domain command to allow all users in the domain to use the runtime role:

aws --region <region> \
sagemaker update-domain \
--cli-input-json file://cross-account-update-domain.json

Update the user profile with EMR Serverless runtime roles

Optionally, this update can be applied more granularly at the user profile level instead of the domain level. Similar to domain update, to interact with EMR Serverless clusters created in the same account as the SageMaker Studio domain, create a file named same-account-update-user-profile.json:

{
    "DomainId": "<emr-s-sm-studio-domain-id>",
    "UserProfileName": "<emr-s-sm-studio-user-profile-name>",
    "UserSettings": {
        "JupyterLabAppSettings": {
            "EmrSettings": {
                "ExecutionRoleArns": [
                    "arn:aws:iam::<aws-account-id>:role/<same-account-emr-runtime-role>"
                ]
            }
        }
    }
}

Then run an update-user-profile command to allow this user profile to use the runtime role:

aws --region <region> \
sagemaker update-user-profile \
--cli-input-json file://same-account-update-user-profile.json

For EMR Serverless clusters created in a different account, create a file named cross-account-update-user-profile.json:

{
    "DomainId": "<emr-s-sm-studio-domain-id>",
    "UserProfileName": "<emr-s-sm-studio-user-profile-name>",
    "UserSettings": {
        "JupyterLabAppSettings": {
            "EmrSettings": {
                "AssumableRoleArns": [
                    "arn:aws:iam::<aws-account-id>:role/<cross-account-emr-runtime-role>"
                ]
            }
        }
    }
}

Then run an update-user-profile command to allow this user profile to use the runtime role:

aws --region <region> \
sagemaker update-user-profile \
--cli-input-json file://cross-account-update-user-profile.json

Grant access to the Amazon ECR repository

The recommended way to customize environments within EMR Serverless clusters is by using custom Docker images.

Make sure you have an Amazon ECR repository in the same AWS Region where you launch EMR Serverless applications. To create an ECR private repository, refer to Creating an Amazon ECR private repository to store images.
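Image references later in this post follow the standard private registry URI format, account.dkr.ecr.region.amazonaws.com/repository:tag. As a small sketch (the account ID and repository name are examples matching the placeholders used below), a helper can compose that URI:

```python
def ecr_image_uri(account_id, region, repository, tag="latest"):
    """Compose a private Amazon ECR image URI from its parts.

    The repository's Region must match the Region where the EMR Serverless
    application is launched.
    """
    return f"{account_id}.dkr.ecr.{region}.amazonaws.com/{repository}:{tag}"

uri = ecr_image_uri("123456789012", "us-east-1", "emr-serverless-langchain")
print(uri)  # 123456789012.dkr.ecr.us-east-1.amazonaws.com/emr-serverless-langchain:latest
```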

To grant users access to your ECR repository, add the following policies to the users and roles that create or update EMR Serverless applications using images from this repository:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "ECRRepositoryListGetPolicy",
            "Effect": "Allow",
            "Action": [
                "ecr:GetDownloadUrlForLayer",
                "ecr:BatchGetImage",
                "ecr:DescribeImages"
            ],
            "Resource": "ecr-repository-arn"
        }
    ]
}

Customize the runtime environment in EMR Serverless clusters

Customizing cluster runtimes in advance is crucial for a seamless experience. As mentioned earlier, we use custom-built Docker images from an ECR repository to optimize our cluster environment, including the necessary packages and binaries. The simplest way to build these images is by using the SageMaker Studio built-in Docker functionality, as discussed in Accelerate ML workflows with Amazon SageMaker Studio Local Mode and Docker support. In this post, we build a Docker image that includes the Python 3.11 runtime and essential packages for a typical RAG workflow, such as langchain, sagemaker, opensearch-py, PyPDF2, and more.

Complete the following steps:

    Start by launching a SageMaker Studio JupyterLab notebook.
    Install Docker in your JupyterLab environment. For instructions, refer to Accelerate ML workflows with Amazon SageMaker Studio Local Mode and Docker support.
    Open a new terminal within your JupyterLab environment and verify the Docker installation by running the following:
    docker --version
    # OR
    docker info

    Create a Dockerfile (refer to Using custom images with EMR Serverless) and publish the image to an ECR repository:
    # example Dockerfile for EMR Serverless
    FROM --platform=linux/amd64 public.ecr.aws/emr-serverless/spark/emr-7.0.0:latest
    USER root
    RUN dnf install -y python3.11 python3.11-pip
    WORKDIR /tmp
    RUN jar xf /usr/lib/livy/repl_2.12-jars/livy-repl_2.12-0.7.1-incubating.jar fake_shell.py && \
        sed -ie 's/version < \"3\.8\"/version_info < \(3,8\)/' fake_shell.py && \
        jar uvf /usr/lib/livy/repl_2.12-jars/livy-repl_2.12-0.7.1-incubating.jar fake_shell.py
    WORKDIR /home/hadoop
    ENV PYSPARK_PYTHON=/usr/bin/python3.11
    RUN python3.11 -m pip install cython numpy matplotlib requests boto3 pandas PyPDF2 pikepdf pycryptodome langchain==0.0.310 opensearch-py seaborn plotly dash
    USER hadoop:hadoop
    From your JupyterLab terminal, run the following command to log in to the ECR repository:
    aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin 123456789012.dkr.ecr.us-east-1.amazonaws.com
    Run the following set of Docker commands to build, tag, and push the Docker image to the ECR repository:
    docker build --network sagemaker -t emr-serverless-langchain .
    docker tag emr-serverless-langchain:latest 123456789012.dkr.ecr.us-east-1.amazonaws.com/emr-serverless-langchain:latest
    docker push 123456789012.dkr.ecr.us-east-1.amazonaws.com/emr-serverless-langchain:latest

Use the EMR Serverless integration with SageMaker Studio

In this section, we demonstrate the integration of EMR Serverless into SageMaker Studio and how you can effortlessly interact with your clusters, whether they are in the same account or across different accounts. To access SageMaker Studio, complete the following steps:

    On the SageMaker console, open SageMaker Studio. Depending on your organization’s setup, you can log in to Studio either through the IAM console or using AWS IAM Identity Center.

The new Studio experience is a serverless web UI, which makes sure any updates occur seamlessly and asynchronously, without interrupting your development experience.

    Under Data in the navigation pane, choose EMR Clusters.

You can navigate to two different tabs: EMR Serverless Applications or EMR Clusters (on Amazon EC2). For this post, we focus on EMR Serverless.

Create an EMR Serverless cluster

To create a new EMR Serverless cluster, complete the following steps:

    On the EMR Serverless Applications tab, choose Create.
    In the Network connections section, you can optionally select Connect to your VPC to nest your EMR Serverless cluster within a VPC and private subnet.
    To customize your cluster runtime, choose a compatible custom image from your ECR repository and make sure your user profile role has the necessary permissions to pull from this repository.

Interact with EMR Serverless clusters

EMR Serverless clusters can automatically scale down to zero when not in use, eliminating costs associated with idling resources. This feature makes EMR Serverless clusters highly flexible and cost-effective. You can list, view, create, start, stop, and delete all your EMR Serverless clusters directly within SageMaker Studio.
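The same lifecycle operations are also available programmatically through the EMR Serverless API. As a hedged sketch (the application names and states below are invented), a helper can pick out applications that have scaled down, given a response shaped like the ListApplications API output:

```python
def stopped_applications(list_applications_response):
    """Return the names of applications in STOPPED state, given a response
    shaped like the EMR Serverless ListApplications API output."""
    return [
        app["name"]
        for app in list_applications_response.get("applications", [])
        if app.get("state") == "STOPPED"
    ]

# With AWS credentials configured, the real response would come from:
# import boto3
# response = boto3.client("emr-serverless").list_applications()
response = {  # hypothetical sample mirroring the API's response shape
    "applications": [
        {"name": "rag-cluster", "state": "STOPPED"},
        {"name": "etl-cluster", "state": "STARTED"},
    ]
}
print(stopped_applications(response))  # ['rag-cluster']
```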

You can also interactively attach an existing cluster to a notebook by choosing Attach to new notebook.

Build a RAG document processing engine using PySpark

In this section, we use the SageMaker Studio cluster integration to parallelize data processing at a massive scale. A typical RAG framework consists of two main components: an offline process that generates embeddings for a document corpus and ingests them into a vector store, and an online process that retrieves relevant context and generates responses.

In the following sections, we focus on the offline document embedding generation process and explore how to use PySpark on EMR Serverless using an interactive SageMaker Studio JupyterLab notebook to efficiently parallel process PDF documents.

Deploy an embeddings model

For this use case, we use the Hugging Face All MiniLM L6 v2 embeddings model from Amazon SageMaker JumpStart. To quickly deploy this embedding model, complete the following steps:

    In SageMaker Studio, choose JumpStart in the navigation pane.
    Search for and choose All MiniLM L6 v2.
    On the model card, choose Deploy.

Your model will be ready within a few minutes. Alternatively, you can choose any other embedding models from SageMaker JumpStart by filtering Task type to Text embedding.

Interactively build an offline document embedding generator

In this section, we use code from the following GitHub repo and interactively build a document processing engine using LangChain and PySpark. Complete the following steps:

    Create a SageMaker Studio JupyterLab development environment. For more details, see Boost productivity on Amazon SageMaker Studio: Introducing JupyterLab Spaces and generative AI tools.
    Choose an appropriate instance type and EBS storage volume for your development environment.

You can change the instance type at any time by stopping and restarting the space.

    Clone the sample code from the following GitHub repository and use the notebook available under use-cases/pyspark-langchain-rag-processor/Offline_RAG_Processor_on_SageMaker_Studio_using_EMR-Serverless.ipynb.
    In SageMaker Studio, under Data in the navigation pane, choose EMR Clusters.
    On the EMR Serverless Applications tab, choose Create to create a cluster.
    Select your cluster and choose Attach to new notebook.
    Attach this cluster to a JupyterLab notebook running inside a space.

Alternatively, you can attach your cluster to any notebook within your JupyterLab space by choosing Cluster and selecting the EMR Serverless cluster you want to attach to the notebook.

Make sure you choose the SparkMagic PySpark kernel when interactively running PySpark workloads.

A successful cluster connection to a notebook results in a usable Spark session and links to the Spark UI and driver logs.

When a notebook cell is run within a SparkMagic PySpark kernel, the operations are, by default, run inside a Spark cluster. However, if you decorate the cell with %%local, it allows the code to be run on the local compute where the JupyterLab notebook is hosted. We begin by reading a list of PDF documents from Amazon S3 directly into the cluster memory, as illustrated in the following diagram.

    Use the following code to read the documents:
    default_bucket = sess.default_bucket()
    destination_prefix = "test/raw-pdfs"

    # send default bucket context to Spark using the send_to_spark command
    %%send_to_spark -i default_bucket -t str -n SRC_BUCKET_NAME
    %%send_to_spark -i destination_prefix -t str -n SRC_FILE_PREFIX
    ...
    def list_files_in_s3_bucket_prefix(bucket_name, prefix):
        s3 = boto3.client('s3')
        # Paginate through the objects in the specified bucket and prefix, and collect all keys (file paths)
        paginator = s3.get_paginator('list_objects_v2')
        page_iterator = paginator.paginate(Bucket=bucket_name, Prefix=prefix)
        file_paths = []
        for page in page_iterator:
            if "Contents" in page:
                for obj in page["Contents"]:
                    if os.path.basename(obj["Key"]):
                        file_paths.append(obj["Key"])
        return file_paths

    def load_pdf_from_s3_into_memory(row):
        """
        Load a PDF file from an S3 bucket directly into memory.
        """
        try:
            src_bucket_name, src_file_key = row
            s3 = boto3.client('s3')
            pdf_file = io.BytesIO()
            s3.download_fileobj(src_bucket_name, src_file_key, pdf_file)
            pdf_file.seek(0)
            pdf_reader = PdfReader(pdf_file)
            return (src_file_key, pdf_reader, len(pdf_reader.pages))
        except Exception as e:
            return (os.path.basename(src_file_key), str(e))

    # create a list of file references in S3
    all_pdf_files = list_files_in_s3_bucket_prefix(
        bucket_name=SRC_BUCKET_NAME,
        prefix=SRC_FILE_PREFIX
    )
    print(f"Found {len(all_pdf_files)} files ---> {all_pdf_files}")
    # Found 3 files ---> ['Lab03/raw-pdfs/AmazonSageMakerDeveloperGuide.pdf', 'Lab03/raw-pdfs/EC2DeveloperGuide.pdf', 'Lab03/raw-pdfs/S3DeveloperGuide.pdf']

    # load documents into memory and return a single list of text documents - map-reduce op
    pdfs_in_memory = pdfs_rdd.map(load_pdf_from_s3_into_memory).collect()

Next, you can visualize the size of each document to understand the volume of data you’re processing.

    You can generate charts and visualize your data within your PySpark notebook cell using static visualization tools like matplotlib and seaborn. See the following code:
    import numpy as np
    import matplotlib.pyplot as plt

    x_labels = [pdfx.split('/')[-1] for pdfx, _, _ in pdfs_in_memory]
    y_values = [pages_count for _, _, pages_count in pdfs_in_memory]
    x = range(len(y_values))
    ...
    # Adjust the layout
    plt.tight_layout()
    # Show the plot
    plt.show()
    %matplot plt

Every PDF document contains multiple pages to process, and this task can be run in parallel using Spark. Each document is split page by page, with each page referencing the global in-memory PDFs. We achieve parallelism at the page level by creating a list of pages and processing each one in parallel. The following diagram provides a visual representation of this process.
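Stripped of the Spark machinery, the page-level fan-out described above can be sketched in plain Python (the document names and page counts below are invented). In the notebook, the resulting task list would instead be distributed with spark.sparkContext.parallelize and processed with map:

```python
def fan_out_pages(pdfs_in_memory):
    """Turn (doc_key, reader, page_count) records into one task per page.

    Each task references its source document and a page number, so pages
    from all documents can be processed independently and in parallel.
    """
    return [
        (doc_key, page_num)
        for doc_key, _reader, page_count in pdfs_in_memory
        for page_num in range(page_count)
    ]

docs = [("a.pdf", None, 3), ("b.pdf", None, 2)]  # invented documents
tasks = fan_out_pages(docs)
print(len(tasks))  # 5 page-level tasks for 2 documents
print(tasks[:3])   # [('a.pdf', 0), ('a.pdf', 1), ('a.pdf', 2)]
```

Because each task carries only a key and a page index, Spark can balance pages across workers even when a few documents are much longer than the rest.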

The extracted text from each page of multiple documents is converted into a LangChain-friendly Document class.

    The CustomDocument class, shown in the following code, is a custom implementation of the Document class that allows you to convert custom text blobs into a format recognized by LangChain. After conversion, the documents are split into chunks and prepared for embedding.
    class CustomDocument:
        def __init__(self, text, path, number):
            ...

    documents_custom = [
        CustomDocument(text=text, path=doc_source, number=page_num)
        for text, doc_source, page_num in documents
    ]

    global_text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=500,
        chunk_overlap=50
    )
    docs = global_text_splitter.split_documents(documents_custom)
    print(f"Total number of docs pre-split {len(documents_custom)} | after split {len(docs)}")
    Next, you can use LangChain’s built-in OpenSearchVectorSearch to create text embeddings. However, we use a custom EmbeddingsGenerator class that parallelizes (using PySpark) the embeddings generation process using a load-balanced SageMaker hosted embeddings model endpoint:
    import time
    from langchain.vectorstores import OpenSearchVectorSearch

    endpoint_name = 'jumpstart-all-MiniLM-L6-v2-endpoint'
    interface_component = 'jumpstart-all-MiniLM-L6-v2-endpoint-comp'
    client = boto3.client('runtime.sagemaker', region_name=REGION)

    def generate_embeddings(input):
        body = input.encode('utf-8')
        response = client.invoke_endpoint(
            ...

    class EmbeddingsGenerator:
        @staticmethod
        def embed_documents(input_text, normalize=True):
            assert isinstance(input_text, list), "Input type must be list to embed_documents function"
            input_text_rdd = spark.sparkContext.parallelize(input_text)
            embeddings_generated = input_text_rdd.map(generate_embeddings).collect()
            ...

        @staticmethod
        def embed_query(input_text):
            status_code, embedding = generate_embeddings(input_text)
            if status_code == 200:
                return embedding
            else:
                return None

    start = time.time()
    docsearch = OpenSearchVectorSearch.from_documents(
        docs,
        EmbeddingsGenerator,
        opensearch_url=OPENSEARCH_DOMAIN_URL,
        bulk_size=len(docs),
        http_auth=(user, pwd),
        index_name=INDEX_NAME_OSE,
        engine="faiss"
    )
    end = time.time()
    print(f"Total Time for ingestion: {round(end - start, 2)} secs")

The custom EmbeddingsGenerator class can generate embeddings for approximately 2,500 pages (12,000 chunks) of documents in under 180 seconds using just two concurrent load-balanced SageMaker embedding model endpoints and 10 PySpark worker nodes. This process can be further accelerated by increasing the number of load-balanced embedding endpoints and worker nodes in the cluster.

Conclusion

The integration of EMR Serverless with SageMaker Studio represents a significant leap forward in simplifying and enhancing big data processing and ML workflows. By eliminating the complexities of infrastructure management, enabling seamless scalability, and optimizing costs, this powerful combination empowers organizations to use petabyte-scale data processing without the overhead typically associated with managing Spark clusters. The streamlined experience within SageMaker Studio enables data scientists and engineers to focus on what truly matters—driving insights and innovation from their data. Whether you’re processing massive datasets, building RAG systems, or exploring other advanced analytics, this integration opens up new possibilities for efficiency and scale, all within the familiar and user-friendly environment of SageMaker Studio.

As data continues to grow in volume and complexity, adopting tools like EMR Serverless and SageMaker Studio will be key to maintaining a competitive edge in the ever-evolving landscape of data-driven decision-making. We encourage you to try this feature today by setting up SageMaker Studio using the SageMaker quick setup guide. To learn more about the EMR Serverless integration with SageMaker Studio, refer to Prepare data using EMR Serverless. You can explore more generative AI samples and use cases in the GitHub repository.


About the authors

Raj Ramasubbu is a Senior Analytics Specialist Solutions Architect focused on big data and analytics and AI/ML with Amazon Web Services. He helps customers architect and build highly scalable, performant, and secure cloud-based solutions on AWS. Raj provided technical expertise and leadership in building data engineering, big data analytics, business intelligence, and data science solutions for over 18 years prior to joining AWS. He helped customers in various industry verticals like healthcare, medical devices, life science, retail, asset management, car insurance, residential REIT, agriculture, title insurance, supply chain, document management, and real estate.

Pranav Murthy is an AI/ML Specialist Solutions Architect at AWS. He focuses on helping customers build, train, deploy and migrate machine learning (ML) workloads to SageMaker. He previously worked in the semiconductor industry developing large computer vision (CV) and natural language processing (NLP) models to improve semiconductor processes using state of the art ML techniques. In his free time, he enjoys playing chess and traveling. You can find Pranav on LinkedIn.

Naufal Mir is a Senior GenAI/ML Specialist Solutions Architect at AWS. He focuses on helping customers build, train, deploy and migrate machine learning (ML) workloads to SageMaker. He previously worked at financial services institutes developing and operating systems at scale. He enjoys ultra endurance running and cycling.

Kunal Jha is a Senior Product Manager at AWS. He is focused on building Amazon SageMaker Studio as the best-in-class choice for end-to-end ML development. In his spare time, Kunal enjoys skiing and exploring the Pacific Northwest. You can find him on LinkedIn.

Ashwin Krishna is a Senior SDE working for SageMaker Studio at Amazon Web Services. He is focused on building interactive ML solutions for AWS enterprise customers to achieve their business needs. He is a big supporter of Arsenal football club and spends spare time playing and watching soccer.

Harini Narayanan is a software engineer at AWS, where she’s excited to build cutting-edge data preparation technology for machine learning at SageMaker Studio. With a keen interest in sustainability, interior design, and a love for all things green, Harini brings a thoughtful approach to innovation, blending technology with her diverse passions.
