AWS Machine Learning Blog 2024年08月27日
GenASL: Generative AI-powered American Sign Language avatars
index_new5.html
../../../zaker_core/zaker_tpl_static/wap/tpl_guoji1.html

 

GenASL是一种利用AWS生成式AI能力的解决方案,能将语音或文本转化为ASL虚拟形象动画,弥合了口语、书面语和手语之间的差距。

🎈GenASL由多个AWS服务协同工作,用户可输入音频、视频或文本,系统会生成解释数据的ASL虚拟形象视频。它使用了多种AI和机器学习服务,包括Amazon Transcribe、Amazon SageMaker、Amazon Bedrock等。

💻ASL Lexicon Video Dataset包含丰富的手语信息,通过S3存储并利用RTMPose和PyTorch工具生成ASL虚拟形象视频,相关信息存储在DynamoDB表中。

📱后端处理包括将音频转为英语文本,再将英语文本译为ASL注释,最后从ASL注释生成虚拟形象视频。此过程使用了Amazon Transcribe、Amazon Bedrock等服务,并通过API Gateway进行前端交互。

🌐前端应用通过Amplify构建,可添加认证功能。在音频文件上传到Amazon S3时,前端利用Amazon Cognito身份池提供的临时身份进行连接。

In today’s world, effective communication is essential for fostering inclusivity and breaking down barriers. However, for individuals who rely on visual communication methods like American Sign Language (ASL), traditional communication tools often fall short. That’s where GenASL comes in. GenASL is a generative artificial intelligence (AI)-powered solution that translates speech or text into expressive ASL avatar animations, bridging the gap between spoken and written language and sign language.

The rise of foundation models (FMs), and the fascinating world of generative AI that we live in, is incredibly exciting and opens doors to imagine and build what wasn’t previously possible. AWS makes it possible for organizations of all sizes and developers of all skill levels to build and scale generative AI applications with security, privacy, and responsible AI.

In this post, we dive into the architecture and implementation details of GenASL, which uses AWS generative AI capabilities to create human-like ASL avatar videos.

Solution overview

The GenASL solution comprises several AWS services working together to enable seamless translation from speech or text to ASL avatar animations. Users can input audio, video, or text into GenASL, which generates an ASL avatar video that interprets the provided data. The solution uses AWS AI and machine learning (AI/ML) services, including Amazon Transcribe, Amazon SageMaker, Amazon Bedrock, and FMs.

The following diagram shows a high-level overview of the architecture.

The workflow includes the following steps:

    An Amazon Elastic Compute Cloud (Amazon EC2) instance initiates a batch process to create ASL avatars from a video dataset consisting of over 8,000 poses using RTMPose, a real-time multi-person pose estimation toolkit based on MMPose. AWS Amplify distributes the GenASL web app consisting of HTML, JavaScript, and CSS to users’ mobile devices. An Amazon Cognito identity pool grants temporary access to the Amazon Simple Storage Service (Amazon S3) bucket. Users upload audio, video, or text to the S3 bucket using the AWS SDK through the web app. The GenASL web app invokes the backend services by sending the S3 object key in the payload to an API hosted on Amazon API Gateway. API Gateway instantiates an AWS Step Functions The state machine orchestrates the AI/ML services Amazon Transcribe and Amazon Bedrock and the NoSQL data store Amazon DynamoDB using AWS Lambda functions. The Step Functions workflow generates a pre-signed URL of the ASL avatar video for the corresponding audio file. A pre-signed URL for the video file stored in Amazon S3 is sent back to the user’s browser through API Gateway asynchronously through polling. The user’s mobile device plays the video file using the pre-signed URL.

As shown in the following figure, speech or text is converted to an ASL gloss, which is then used to produce an ASL video.

Let’s dive into the implementation details of each component.

Batch process

The ASL Lexicon Video Dataset (ASLLVD) consists of multiple synchronized videos showing the signing from different angles of more than 3,300 ASL signs in citation form, each produced by 1–6 native ASL signers. Linguistic annotations include gloss labels, sign start and end time codes, start and end handshape labels for both hands, and morphological and articulatory classifications of sign type. For compound signs, the dataset includes annotations for each morpheme. To facilitate computer vision-based sign language recognition, the dataset also includes numeric ID labels for sign variants, video sequences in uncompressed raw format, and camera calibration sequences.

We store the input dataset in an S3 bucket (video dataset) and use RTMPose and a PyTorch-based pose estimation open source toolkit to generate the ASL avatar videos. MMPose is a member of the OpenMMLab Project and contains a rich set of algorithms for 2D multi-person human pose estimation, 2D hand pose estimation, 2D face landmark detection, and 133 keypoint whole-body human pose estimations.

The EC2 instance initiates the batch process that stores the ASL avatar videos in another S3 bucket (ASL avatars) for every ASL gloss and stores the ASL gloss and its corresponding ASL avatar video’s S3 key in the DynamoDB table.

Backend

The backend process has three steps: process the input audio to English text, translate the English text to an ASL gloss, and generate an ASL avatar video from the ASL gloss. This API layer is fronted by API Gateway, which allows the user to authenticate, monitor, and throttle the API request. Because API Gateway has a timeout of 29 seconds, this asynchronous solution uses polling. Whenever the API gets a request to generate the sign video, it invokes a Step Functions workflow and then returns the Step Functions runtime URL back to the frontend application. The Step Functions workflow has three steps:

    Convert the audio input to English text using Amazon Transcribe, an automatic speech-to-text AI service that uses deep learning for speech recognition. Amazon Transcribe is a fully managed and continuously training service designed to handle a wide range of speech and acoustic characteristics, including variations in volume, pitch, and speaking rate. Translate the English text to an ASL gloss using Amazon Bedrock, which is used to build and scale generative AI applications using FMs. Amazon Bedrock is a fully managed service that offers a choice of high-performing FMs from leading AI companies like AI21 Labs, Anthropic, Cohere, Meta, Mistral AI, Stability AI, and Amazon through a single API, along with a broad set of capabilities to build generative AI applications with security, privacy, and responsible AI. We used Anthropic Claude v3 Sonnet on AWS Bedrock to create an ASL gloss. Generate the ASL avatar video from the ASL gloss. Using the ASL gloss created in the translation layer, we look up the corresponding ASL sign from the DynamoDB table. If the gloss is not available in the GenASL database, the logic falls back to fingerspelling each alphabet letter. The Lookup ASL Avatar Lambda function stitches the videos together, generates a temporary video, uploads that to the S3 bucket, creates a pre-signed URL, and sends the pre-signed URL for both the sign video and the avatar video back to the frontend through polling. The frontend plays the video in a loop.

Frontend

The frontend application is built using Amplify, a framework that allows you to build, develop, and deploy full stack applications, including mobile and web applications. You can add the authentication to a frontend Amplify app using the Amplify command Add Auth, which generates the sign-up and sign-in pages, as well as the backend and the Amazon Cognito identity pools. During the audio file upload to Amazon S3, the frontend connects with Amazon S3 using the temporary identity provided by the Amazon Cognito identity pool.

Best practices

The following are best practices for creating the ASL avatar video application.

API design

API Gateway supports a maximum timeout of 29 seconds. Additionally, it’s a best practice to not build synchronous APIs for long-running processes. Therefore, we built an asynchronous API consisting of two stages by allowing the client to poll a REST resource to check the status of its request. We implemented this pattern using API Gateway and Step Functions. In the first stage, the S3 key and bucket name are sent to an API endpoint that delegates the request to a Step Functions workflow and sends a response back with the execution ARN. In the second stage, the API checks the status of the workflow run based on the ARN provided as an input to this API endpoint. If the ASL avatar is successfully created, this API returns the pre-signed URL. Otherwise, it sends a RUNNING status and the frontend waits for a couple of seconds, and then calls the second API endpoint again. This step is repeated until the API returns the pre-signed URL to the caller.

Step Functions supports direct optimized integration with Amazon Bedrock, so we don’t need to have a Lambda function in the middle to create the ASL gloss. We can call the Amazon Bedrock API directly from the Step Functions workflow to save on Lambda compute cost.

DevOps

From a DevOps perspective, the frontend uses Amplify to build and deploy, and the backend is uses AWS Serverless Application Model (AWS SAM) to build, package, and deploy the serverless applications. We used Amazon CloudWatch to build a dashboard to capture the metrics, including the number of API invocations (number of ASL avatar videos generated), average response time to create the video, and error metrics, to create a good user experience by tracking if there is a failure and alerting the DevOps team appropriately.

Prompt engineering

We provided a prompt to convert English text to an ASL gloss along with the input text message to the Amazon Bedrock API to invoke Anthropic Claude. We use the few-shot prompting technique by providing a few examples to produce an accurate ASL gloss.

The code sample is available in the accompanying GitHub repository.

Prerequisites

Before you begin, make sure you have the following set up:

Set up the solution

This section provides steps to deploy an ASL avatar generator using AWS services. We outline the steps for cloning the repository, processing data, deploying the backend, and setting up the frontend.

    Clone the GitHub repository using the following command:
    git clone https://github.com/aws-samples/genai-asl-avatar-generator.git
    Follow the instructions in the dataprep folder to initialize the database:
      Modify genai-asl-avatar-generator/dataprep/config.ini with information specific to your environment:
      [DEFAULT]s3_bucket= <your S3 bucket>s3_prefix= <files will be generated in this prefix within your S3 bucket>table_name=<dynamodb table name>region=<your preferred AWS region> 
      Set up your environment by installing the required Python packages:
      cd genai-asl-avatar-generator/dataprepchmod +x env_setup.cmd./env_setup.cmd
      Prepare the sign video annotation file for each processing run:
      python prep_metadata.py
      Download the sign videos, segment them, and store them in Amazon S3:
      python create_sign_videos.py
      Generate avatar videos:
      python create_pose_videos.py
    Use the following command to deploy the backend application:
    cd genai-asl-avatar-generator/backendsam deploy --guided
    Set up the frontend:
      Initialize your Amplify environment:
      amplify init
      Modify the frontend configuration to point to the backend API:
        Open frontend/amplify/backend/function/Audio2Sign/index.py. Modify the stateMachineArn variable to have the state machine ARN shown in the output generated from the backend deployment.
      Add hosting to the Amplify project:
      amplify add hosting
      In the prompt, choose Amazon CloudFront and S3 and choose the bucket to host the GenASL application. Install the relevant packages by running the following command:
      npm install --force
    Deploy the Amplify project:
    amplify publish

Run the solution

After you deploy the Amplify project using the amplify publish command, an Amazon CloudFront URL will be returned. You can use this URL to access the GenASL demo application. With the application open, you can register a new user and test the ASL avatar generation functionality.

Clean up

To avoid incurring costs, clean up the resources you created for this application when you no longer need them.

    Remove all the frontend resources created by Amplify using the following command:
    amplify delete
    Remove all the backend resources created by AWS SAM using the following command:
    sam delete
    Clean up resources used by the batch process.
      If you created a new EC2 instance for running the batch process, delete the instance using the Amazon EC2 console. If you reused an existing EC2 instance, delete the project folder recursively to clean up all the resources:
      rm -rf genai-asl-avatar-generator
    Empty and delete the S3 bucket using the following commands:
    aws s3 rm s3://<bucket-name> --recursiveaws s3 rb s3://<bucket-name> --force  

Next steps

Although GenASL has achieved its initial goals, we’re working to expand its capabilities with advancements like 3D pose estimation, blending techniques, and bi-directional translation between ASL and spoken languages:

Conclusion

By combining speech-to-text, machine translation, text-to-video generation, and AWS AI/ML services, the GenASL solution creates expressive ASL avatar animations, fostering inclusive and effective communication. This post provided an overview of the GenASL architecture and implementation details. As generative AI continues to evolve, we can create groundbreaking applications that enhance accessibility and inclusivity for all.


About the Authors

Alak Eswaradass is a Senior Solutions Architect at AWS based in Chicago, Illinois. She is passionate about helping customers architect solutions utilizing AWS cloud technologies to solve business challenges. She is enthusiastic about leveraging cutting-edge technologies like Generative AI to drive innovation in cloud architectures. When she’s not working, Alak enjoys spending time with her daughters and exploring the outdoors with her dogs.

Suresh Poopandi is a Principal Solutions Architect at AWS, based in Chicago, Illinois, helping Healthcare Life Science customers with their cloud journey by providing architectures utilizing AWS services to achieve their business goals. He is passionate about building home automation and AI/ML solutions.

Rob Koch is a tech enthusiast who thrives on steering projects from their initial spark to successful fruition, Rob Koch is Principal at Slalom Build in Seattle, an AWS Data Hero, and Co-chair of the CNCF Deaf and Hard of Hearing Working Group. His expertise in architecting event-driven systems is firmly rooted in the belief that data should be harnessed in real time. Rob relishes the challenge of examining existing systems and mapping the journey towards an event-driven architecture.

Fish AI Reader

Fish AI Reader

AI辅助创作,多种专业模板,深度分析,高质量内容生成。从观点提取到深度思考,FishAI为您提供全方位的创作支持。新版本引入自定义参数,让您的创作更加个性化和精准。

FishAI

FishAI

鱼阅,AI 时代的下一个智能信息助手,助你摆脱信息焦虑

联系邮箱 441953276@qq.com

相关标签

GenASL AWS ASL转换 语言交流
相关文章