AWS Machine Learning Blog, June 7, 02:15
Build a serverless audio summarization solution with Amazon Bedrock and Whisper

 

This article describes a solution that uses generative AI to automate recording transcription, summary generation, and redaction of sensitive information. The solution builds on Amazon Bedrock, using the OpenAI Whisper model for transcription, the Anthropic Claude model for summarization, and Guardrails for automatic redaction of PII (personally identifiable information). With a React frontend, AWS Lambda, Step Functions, and other components, it forms a complete audio/video processing pipeline for efficient, secure management of recorded content.

🎧 The core of the solution is combining Amazon Bedrock with multiple AI models to automate transcription, summarization, and redaction of sensitive information.

🎙️ Users upload recordings through a React frontend; the files are stored in an S3 bucket, which triggers a Step Functions state machine that starts the AI processing workflow.

📝 The state machine uses Whisper for transcription and Claude for summarization, and applies Guardrails to automatically redact sensitive information and keep data secure.

⚙️ The solution spans multiple AWS services, including S3, API Gateway, EventBridge, Lambda, Step Functions, and CloudFront, forming a complete audio/video processing pipeline.

🛡️ Users create a Guardrail, configure PII detection and handling, and deploy the Whisper model. The frontend and backend infrastructure are deployed with the AWS CDK to deliver an end-to-end solution.

Recordings of business meetings, interviews, and customer interactions have become essential for preserving important information. However, transcribing and summarizing these recordings manually is often time-consuming and labor-intensive. With the progress in generative AI and automatic speech recognition (ASR), automated solutions have emerged to make this process faster and more efficient.

Protecting personally identifiable information (PII) is a vital aspect of data security, driven by both ethical responsibilities and legal requirements. In this post, we demonstrate how to use the OpenAI Whisper Large V3 Turbo foundation model (FM), available in Amazon Bedrock Marketplace (a dedicated offering that provides access to over 140 models), to produce near real-time transcription. These transcriptions are then processed by Amazon Bedrock for summarization and redaction of sensitive information.

Amazon Bedrock is a fully managed service that offers a choice of high-performing FMs from leading AI companies like AI21 Labs, Anthropic, Cohere, DeepSeek, Luma, Meta, Mistral AI, poolside (coming soon), Stability AI, and Amazon Nova through a single API, along with a broad set of capabilities to build generative AI applications with security, privacy, and responsible AI. Additionally, you can use Amazon Bedrock Guardrails to automatically redact sensitive information, including PII, from the transcription summaries to support compliance and data protection needs.

In this post, we walk through an end-to-end architecture that combines a React-based frontend with Amazon Bedrock, AWS Lambda, and AWS Step Functions to orchestrate the workflow, facilitating seamless integration and processing.

Solution overview

The solution highlights the power of integrating serverless technologies with generative AI to automate and scale content processing workflows. The user journey begins with uploading a recording through a React frontend application, hosted on Amazon CloudFront and backed by Amazon Simple Storage Service (Amazon S3) and Amazon API Gateway. When the file is uploaded, it triggers a Step Functions state machine that orchestrates the core processing steps, using AI models and Lambda functions for seamless data flow and transformation. The following diagram illustrates the solution architecture.

The workflow consists of the following steps:

    1. The React application is hosted in an S3 bucket and served to users through CloudFront for fast, global access. API Gateway handles interactions between the frontend and backend services.
    2. Users upload audio or video files directly from the app. These recordings are stored in a designated S3 bucket for processing.
    3. An Amazon EventBridge rule detects the S3 upload event and triggers the Step Functions state machine, initiating the AI-powered processing pipeline.
    4. The state machine performs audio transcription, summarization, and redaction by orchestrating multiple Amazon Bedrock models in sequence. It uses Whisper for transcription, Claude for summarization, and Guardrails to redact sensitive data.
    5. The redacted summary is returned to the frontend application and displayed to the user.
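To make the triggering step concrete, the EventBridge rule matches S3 "Object Created" events for the recordings bucket. The sketch below builds such an event pattern in Python; it is illustrative only, since the actual rule is created by the deployment stack and the bucket name is a placeholder.

```python
def build_upload_event_pattern(bucket_name):
    """Event pattern an EventBridge rule could use to detect uploads to
    the recordings bucket (illustrative; the real rule is defined by the
    CDK stack, and the bucket name is a placeholder)."""
    return {
        "source": ["aws.s3"],
        "detail-type": ["Object Created"],
        "detail": {"bucket": {"name": [bucket_name]}},
    }
```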

The following diagram illustrates the state machine workflow.

The Step Functions state machine orchestrates a series of tasks to transcribe, summarize, and redact sensitive information from uploaded audio/video recordings:

    1. A Lambda function is triggered to gather input details (for example, the Amazon S3 object path and metadata) and prepare the payload for transcription.
    2. The payload is sent to the OpenAI Whisper Large V3 Turbo model through the Amazon Bedrock Marketplace to generate a near real-time transcription of the recording.
    3. The raw transcript is passed to Anthropic's Claude 3.5 Sonnet through Amazon Bedrock, which produces a concise and coherent summary of the conversation or content.
    4. A second Lambda function validates and forwards the summary to the redaction step.
    5. The summary is processed through Amazon Bedrock Guardrails, which automatically redacts PII and other sensitive data.
    6. The redacted summary is stored or returned to the frontend application through an API, where it is displayed to the user.
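The first step above can be sketched as a small handler that pulls the object location out of the triggering event. The field names follow the EventBridge "Object Created" event shape; the media-format guess from the file extension is an assumption for illustration, not part of the original code.

```python
def prepare_transcription_input(event):
    # Extract the S3 location from an EventBridge "Object Created" event
    detail = event["detail"]
    bucket = detail["bucket"]["name"]
    key = detail["object"]["key"]
    # Guess the media format from the file extension (illustrative assumption)
    media_format = key.rsplit(".", 1)[-1].lower() if "." in key else "unknown"
    return {"bucket": bucket, "key": key, "media_format": media_format}
```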

Prerequisites

Before you start, make sure that you have the following prerequisites in place:

Create a guardrail in the Amazon Bedrock console

For instructions on creating a guardrail in Amazon Bedrock, refer to Create a guardrail. For details on detecting and redacting PII, see Remove PII from conversations by using sensitive information filters. Configure your guardrail with the following key settings:

After you deploy the guardrail, note its Amazon Resource Name (ARN); you will use this ARN when deploying the model.

Deploy the Whisper model

Complete the following steps to deploy the Whisper Large V3 Turbo model:

    1. On the Amazon Bedrock console, choose Model catalog under Foundation models in the navigation pane.
    2. Search for and choose Whisper Large V3 Turbo.
    3. On the options menu (three dots), choose Deploy.
    4. Modify the endpoint name, number of instances, and instance type to suit your specific use case. For this post, we use the default settings.
    5. Modify the Advanced settings section to suit your use case. For this post, we use the default settings.
    6. Choose Deploy.

This creates a new AWS Identity and Access Management (IAM) role and deploys the model.

Choose Marketplace deployments in the navigation pane; in the Managed deployments section, the endpoint status shows as Creating. Wait for the endpoint to finish deploying and the status to change to In Service, then copy the endpoint name; you will use it when deploying the solution infrastructure.

Deploy the solution infrastructure

In the GitHub repo, follow the instructions in the README file to clone the repository, then deploy the frontend and backend infrastructure.

We use the AWS Cloud Development Kit (AWS CDK) to define and deploy the infrastructure. The AWS CDK code deploys the following resources:

Implementation deep dive

The backend is composed of a sequence of Lambda functions, each handling a specific stage of the audio processing pipeline:

Let’s examine some of the key components:

The transcription Lambda function uses the Whisper model to convert audio files to text:

import json

import boto3

sagemaker_runtime = boto3.client("sagemaker-runtime")


def transcribe_with_whisper(audio_chunk, endpoint_name):
    # Convert audio to hex string format
    hex_audio = audio_chunk.hex()

    # Create payload for Whisper model
    payload = {
        "audio_input": hex_audio,
        "language": "english",
        "task": "transcribe",
        "top_p": 0.9
    }

    # Invoke the SageMaker endpoint running Whisper
    response = sagemaker_runtime.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType='application/json',
        Body=json.dumps(payload)
    )

    # Parse the transcription response
    response_body = json.loads(response['Body'].read().decode('utf-8'))
    transcription_text = response_body['text']

    return transcription_text
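The function takes an audio_chunk rather than the whole file because real-time endpoints cap the request size, and hex encoding roughly doubles the on-the-wire payload. A minimal chunking sketch is shown below; the 4 MB default and the injected transcribe_fn parameter (standing in for transcribe_with_whisper) are assumptions for illustration.

```python
def chunk_audio(audio_bytes, max_chunk_bytes=4 * 1024 * 1024):
    """Split raw audio into chunks below the endpoint's request-size limit.
    The 4 MB default is an assumption: SageMaker real-time endpoints cap
    payloads at roughly 6 MB, and hex encoding doubles the payload size."""
    return [audio_bytes[i:i + max_chunk_bytes]
            for i in range(0, len(audio_bytes), max_chunk_bytes)]


def transcribe_full_recording(audio_bytes, endpoint_name, transcribe_fn):
    """Transcribe each chunk (transcribe_fn stands in for
    transcribe_with_whisper) and join the partial transcripts."""
    return " ".join(transcribe_fn(chunk, endpoint_name)
                    for chunk in chunk_audio(audio_bytes))
```

Note that naive byte-boundary splitting can cut a word in half; in practice you would split on silence or at container frame boundaries.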

We use Amazon Bedrock to generate concise summaries from the transcriptions:

import json

import boto3

bedrock_runtime = boto3.client("bedrock-runtime")


def generate_summary(transcription):
    # Format the prompt with the transcription
    prompt = f"{transcription}\n\nGive me the summary, speakers, key discussions, and action items with owners"

    # Call Bedrock for summarization (Claude 3 models require the Messages API)
    response = bedrock_runtime.invoke_model(
        modelId="anthropic.claude-3-5-sonnet-20240620-v1:0",
        body=json.dumps({
            "anthropic_version": "bedrock-2023-05-31",
            "max_tokens": 4096,
            "temperature": 0.7,
            "top_p": 0.9,
            "messages": [{"role": "user", "content": prompt}],
        })
    )

    # Extract and return the summary text
    result = json.loads(response['body'].read())
    return result['content'][0]['text']

A critical component of our solution is the automatic redaction of PII. We implemented this using Amazon Bedrock Guardrails to support compliance with privacy regulations:

def apply_guardrail(bedrock_runtime, content, guardrail_id):
    # Format content according to API requirements
    formatted_content = [{"text": {"text": content}}]

    # Call the guardrail API
    response = bedrock_runtime.apply_guardrail(
        guardrailIdentifier=guardrail_id,
        guardrailVersion="DRAFT",
        source="OUTPUT",  # Using OUTPUT parameter for proper flow
        content=formatted_content
    )

    # Extract redacted text from response
    if 'action' in response and response['action'] == 'GUARDRAIL_INTERVENED':
        if len(response['outputs']) > 0:
            output = response['outputs'][0]
            if 'text' in output and isinstance(output['text'], str):
                return output['text']

    # Return original content if redaction fails
    return content

When PII is detected, it’s replaced with type indicators (for example, {PHONE} or {EMAIL}), making sure that summaries remain informative while protecting sensitive data.
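To illustrate locally what those placeholders look like, the toy redactor below substitutes {TYPE} markers using regular expressions. This is purely illustrative: the actual detection is performed server-side by Amazon Bedrock Guardrails, and these two patterns are assumptions that cover far less than the service does.

```python
import re

# Minimal local illustration of {TYPE} placeholders; real detection is
# done by Amazon Bedrock Guardrails, not by these narrow patterns.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "PHONE": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
}


def redact_locally(text):
    # Replace each detected span with its {TYPE} placeholder
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub("{" + label + "}", text)
    return text
```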

To manage the complex processing pipeline, we use Step Functions to orchestrate the Lambda functions:

{
  "Comment": "Audio Summarization Workflow",
  "StartAt": "TranscribeAudio",
  "States": {
    "TranscribeAudio": {
      "Type": "Task",
      "Resource": "arn:aws:states:::lambda:invoke",
      "Parameters": {
        "FunctionName": "WhisperTranscriptionFunction",
        "Payload": {
          "bucket.$": "$.bucket",
          "key.$": "$.key"
        }
      },
      "Next": "IdentifySpeakers"
    },
    "IdentifySpeakers": {
      "Type": "Task",
      "Resource": "arn:aws:states:::lambda:invoke",
      "Parameters": {
        "FunctionName": "SpeakerIdentificationFunction",
        "Payload": {
          "Transcription.$": "$.Payload"
        }
      },
      "Next": "GenerateSummary"
    },
    "GenerateSummary": {
      "Type": "Task",
      "Resource": "arn:aws:states:::lambda:invoke",
      "Parameters": {
        "FunctionName": "BedrockSummaryFunction",
        "Payload": {
          "SpeakerIdentification.$": "$.Payload"
        }
      },
      "End": true
    }
  }
}

This workflow makes sure each step completes successfully before proceeding to the next, with automatic error handling and retry logic built in.
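As a sketch of what that retry logic can look like in the Amazon States Language, a Retry/Catch block can be attached to each task state. The fragment below is illustrative: the error names and backoff values are assumptions, HandleFailure is a hypothetical recovery state, and the Parameters block is omitted for brevity.

```json
"TranscribeAudio": {
  "Type": "Task",
  "Resource": "arn:aws:states:::lambda:invoke",
  "Retry": [
    {
      "ErrorEquals": ["Lambda.ServiceException", "States.Timeout"],
      "IntervalSeconds": 2,
      "MaxAttempts": 3,
      "BackoffRate": 2.0
    }
  ],
  "Catch": [
    {
      "ErrorEquals": ["States.ALL"],
      "Next": "HandleFailure"
    }
  ],
  "Next": "IdentifySpeakers"
}
```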

Test the solution

After you have successfully completed the deployment, you can use the CloudFront URL to test the solution functionality.

Security considerations

Security is a critical aspect of this solution, and we’ve implemented several best practices to support data protection and compliance:

Clean up

To prevent unnecessary charges, make sure to delete the resources provisioned for this solution when you’re done:

    Delete the Amazon Bedrock guardrail:
      On the Amazon Bedrock console, in the navigation menu, choose Guardrails. Choose your guardrail, then choose Delete.
    Delete the Whisper Large V3 Turbo model deployed through the Amazon Bedrock Marketplace:
      On the Amazon Bedrock console, choose Marketplace deployments in the navigation pane. In the Managed deployments section, select the deployed endpoint and choose Delete.
    Delete the AWS CDK stack by running the command cdk destroy, which deletes the AWS infrastructure.

Conclusion

This serverless audio summarization solution demonstrates the benefits of combining AWS services to create a sophisticated, secure, and scalable application. By using Amazon Bedrock for AI capabilities, Lambda for serverless processing, and CloudFront for content delivery, we’ve built a solution that can handle large volumes of audio content efficiently while helping you align with security best practices.

The automatic PII redaction feature supports compliance with privacy regulations, making this solution well-suited for regulated industries such as healthcare, finance, and legal services where data security is paramount. To get started, deploy this architecture within your AWS environment to accelerate your audio processing workflows.


About the Authors

Kaiyin Hu is a Senior Solutions Architect for Strategic Accounts at Amazon Web Services, with years of experience across enterprises, startups, and professional services. Currently, she helps customers build cloud solutions and drives GenAI adoption to cloud. Previously, Kaiyin worked in the Smart Home domain, assisting customers in integrating voice and IoT technologies.

Sid Vantair is a Solutions Architect with AWS covering Strategic accounts.  He thrives on resolving complex technical issues to overcome customer hurdles. Outside of work, he cherishes spending time with his family and fostering inquisitiveness in his children.
