AWS Machine Learning Blog, November 22, 2024
Enhance speech synthesis and video generation models with RLHF using audio and video segmentation in Amazon SageMaker


As generative AI models advance in creating multimedia content, the difference between good and great output often lies in the details that only human feedback can capture. Audio and video segmentation provides a structured way to gather this detailed feedback, allowing models to learn through reinforcement learning from human feedback (RLHF) and supervised fine-tuning (SFT). Annotators can precisely mark and evaluate specific moments in audio or video content, helping models understand what makes content feel authentic to human viewers and listeners.

Take, for instance, text-to-video generation, where models need to learn not just what to generate but how to maintain consistency and natural flow across time. When creating a scene of a person performing a sequence of actions, factors like the timing of movements, visual consistency, and smoothness of transitions contribute to the quality. Through precise segmentation and annotation, human annotators can provide detailed feedback on each of these aspects, helping models learn what makes a generated video sequence feel natural rather than artificial. Similarly, in text-to-speech applications, understanding the subtle nuances of human speech—from the length of pauses between phrases to changes in emotional tone—requires detailed human feedback at a segment level. This granular input helps models learn how to produce speech that sounds natural, with appropriate pacing and emotional consistency. As large language models (LLMs) increasingly integrate more multimedia capabilities, human feedback becomes even more critical in training them to generate rich, multi-modal content that aligns with human quality standards.

The path to creating effective AI models for audio and video generation presents several distinct challenges. Annotators need to identify precise moments where generated content matches or deviates from natural human expectations. For speech generation, this means marking exact points where intonation changes, where pauses feel unnatural, or where emotional tone shifts unexpectedly. In video generation, annotators must pinpoint frames where motion becomes jerky, where object consistency breaks, or where lighting changes appear artificial. Traditional annotation tools, with basic playback and marking capabilities, often fall short in capturing these nuanced details.

Amazon SageMaker Ground Truth enables RLHF by allowing teams to integrate detailed human feedback directly into model training. Through custom human annotation workflows, organizations can equip annotators with tools for high-precision segmentation. This setup enables the model to learn from human-labeled data, refining its ability to produce content that aligns with natural human expectations.

In this post, we show you how to implement the audio and video segmentation solution from the accompanying GitHub repository using SageMaker Ground Truth. We guide you through deploying the necessary infrastructure using AWS CloudFormation, creating an internal labeling workforce, and setting up your first labeling job. We demonstrate how to use Wavesurfer.js for precise audio visualization and segmentation, configure both segment-level and full-content annotations, and customize the interface for your specific needs. We cover both console-based and programmatic approaches to creating labeling jobs, and provide guidance on extending the solution to meet your own annotation needs. By the end of this post, you will have a fully functional audio/video segmentation workflow that you can adapt for various use cases, from training speech synthesis models to improving video generation capabilities.

Feature overview

The integration of Wavesurfer.js in our UI provides a detailed waveform visualization where annotators can instantly see patterns in speech, silence, and audio intensity. For instance, when working on speech synthesis, annotators can visually identify unnatural gaps between words or abrupt changes in volume that might make generated speech sound robotic. The ability to zoom into these waveform patterns means they can work with millisecond precision—marking exactly where a pause is too long or where an emotional transition happens too abruptly.

In this snapshot of audio segmentation, we are capturing a customer-representative conversation, annotating speaker segments, emotions, and transcribing the dialogue. The UI allows for playback speed adjustment and zoom functionality for precise audio analysis.

The multi-track feature lets annotators create separate tracks for evaluating different aspects of the content. In a text-to-speech task, one track might focus on pronunciation accuracy, another on emotional consistency, and a third on natural pacing. For video generation tasks, annotators can mark segments where motion flows naturally, where object consistency is maintained, and where scene transitions work well. They can adjust playback speed to catch subtle details and use the visual timeline to set precise start and end points for each marked segment.

In this snapshot of video segmentation, we’re annotating a scene with dogs, tracking individual animals, their colors, emotions, and gaits. The UI also enables overall video quality assessment, scene change detection, and object presence classification.

Annotation process

Annotators begin by choosing Add New Track and selecting appropriate categories and tags for their annotation task. After creating the track, they choose Begin Recording at the point where they want a segment to start. As the content plays, they monitor the audio waveform or video frames until they reach the desired end point, then choose Stop Recording. The newly created segment appears in the right pane, where they can add classifications, transcriptions, or other relevant labels. This process can be repeated for as many segments as needed, with the ability to adjust segment boundaries, delete incorrect segments, or create new tracks for different annotation purposes.

Importance of high-quality data and reducing labeling errors

High-quality data is essential for training generative AI models that can produce natural, human-like audio and video content. The performance of these models depends directly on the accuracy and detail of human feedback, which stems from the precision and completeness of the annotation process. For audio and video content, this means capturing not just what sounds or looks unnatural, but exactly when and how these issues occur.

Our purpose-built UI in SageMaker Ground Truth addresses common challenges in audio and video annotation that often lead to inconsistent or imprecise feedback. When annotators work with long audio or video files, they need to mark precise moments where generated content deviates from natural human expectations. For example, in speech generation, an unnatural pause might last only a fraction of a second, but its impact on perceived quality is significant. The tool’s zoom functionality allows annotators to expand these brief moments across their screen, making it possible to mark the exact start and end points of these subtle issues. This precision helps models learn the fine details that separate natural from artificial-sounding speech.

Solution overview

This audio/video segmentation solution combines several AWS services to create a robust annotation workflow. At its core, Amazon Simple Storage Service (Amazon S3) serves as the secure storage for input files, manifest files, annotation outputs, and the web UI components. SageMaker Ground Truth provides annotators with a web portal to access their labeling jobs and manages the overall annotation workflow. The following diagram illustrates the solution architecture.

The UI template, which includes our specialized audio/video segmentation interface built with Wavesurfer.js, requires specific JavaScript and CSS files. These files are hosted through an Amazon CloudFront distribution, providing reliable and efficient delivery to annotators’ browsers. By using CloudFront with an origin access identity (OAI) and appropriate bucket policies, we allow the UI components to be served to annotators. This setup follows AWS best practices for least-privilege access, making sure CloudFront can only access the specific UI files needed for the annotation interface.
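For reference, the following sketch shows what such a least-privilege bucket policy can look like when applied with boto3. The bucket name, OAI ID, and ui-assets/ prefix are placeholders, and the CloudFormation template in the repository configures the equivalent policy for you, so treat this only as an illustration of the pattern.

import json
import boto3

s3 = boto3.client("s3")

# Placeholder values; the CloudFormation template wires these up automatically.
BUCKET = "your-ui-assets-bucket"
OAI_ID = "E1EXAMPLE12345"

policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowCloudFrontOAIReadOnly",
            "Effect": "Allow",
            "Principal": {
                "AWS": f"arn:aws:iam::cloudfront:user/CloudFront Origin Access Identity {OAI_ID}"
            },
            "Action": "s3:GetObject",
            # Restrict read access to the UI asset prefix only (least privilege)
            "Resource": f"arn:aws:s3:::{BUCKET}/ui-assets/*",
        }
    ],
}

s3.put_bucket_policy(Bucket=BUCKET, Policy=json.dumps(policy))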

Pre-annotation and post-annotation AWS Lambda functions are optional components that can enhance the workflow. The pre-annotation Lambda function can process the input manifest file before data is presented to annotators, enabling any necessary formatting or modifications. Similarly, the post-annotation Lambda function can transform the annotation outputs into specific formats required for model training. These functions provide flexibility to adapt the workflow to specific needs without requiring changes to the core annotation process.
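As an illustration, a pre-annotation Lambda function follows the request/response contract that Ground Truth documents for custom workflows: it receives each manifest line in dataObject and returns a taskInput object for the UI template. The following is a minimal sketch assuming the manifest structure used later in this post; the taskInput field names are illustrative, not the solution's actual implementation.

def lambda_handler(event, context):
    """Pre-annotation Lambda: shape one manifest line into the task input for the UI."""
    # Ground Truth passes the current manifest line in event["dataObject"]
    data_object = event["dataObject"]

    task_input = {
        "taskObject": data_object.get("source"),           # S3 URI of the audio/video file
        "callId": data_object.get("call-id", ""),           # example metadata field from the manifest
        "transcription": data_object.get("transcription", ""),
    }

    return {
        "taskInput": task_input,
        "isHumanAnnotationRequired": "true",
    }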

The solution uses AWS Identity and Access Management (IAM) roles to manage permissions between these services.

Let’s walk through the process of setting up your annotation workflow. We start with a simple scenario: you have an audio file stored in Amazon S3, along with some metadata like a call ID and its transcription. By the end of this walkthrough, you will have a fully functional annotation system where your team can segment and classify this audio content.

Prerequisites

For this walkthrough, make sure you have the following:

Create your internal workforce

Before we dive into the technical setup, let’s create a private workforce in SageMaker Ground Truth. This allows you to test the annotation workflow with your internal team before scaling to a larger operation.

1. On the SageMaker console, choose Labeling workforces.
2. Choose Private for the workforce type and create a new private team.
3. Add team members using their email addresses—they will receive instructions to set up their accounts.

Deploy the infrastructure

Although this post demonstrates a CloudFormation template for quick deployment, you can also set up the components manually. The assets (JavaScript and CSS files) are available in our GitHub repository. Complete the following steps for manual deployment:

1. Download these assets directly from the GitHub repository.
2. Host them in your own S3 bucket.
3. Set up your own CloudFront distribution to serve these files.
4. Configure the necessary permissions and CORS settings (a minimal sketch follows this list).
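The following boto3 sketch covers steps 2 and 4 under assumed bucket and file names; creating the CloudFront distribution itself (step 3) is typically done on the console or through infrastructure as code and is not shown here.

import boto3

s3 = boto3.client("s3")
BUCKET = "your-ui-assets-bucket"  # placeholder

# Step 2: host the UI assets downloaded from the GitHub repository
for filename in ["audiovideo-wavesufer.js", "audiovideo-stylesheet.css"]:
    s3.upload_file(filename, BUCKET, f"ui-assets/{filename}")

# Step 4: allow the annotation UI served by Ground Truth to fetch the assets cross-origin
s3.put_bucket_cors(
    Bucket=BUCKET,
    CORSConfiguration={
        "CORSRules": [
            {
                "AllowedMethods": ["GET"],
                "AllowedOrigins": ["*"],   # tighten to your worker portal domain if preferred
                "AllowedHeaders": ["*"],
                "MaxAgeSeconds": 3000,
            }
        ]
    },
)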

This manual approach gives you more control over infrastructure setup and might be preferred if you have existing CloudFront distributions or a need to customize security controls and assets.

The rest of this post will focus on the CloudFormation deployment approach, but the labeling job configuration steps remain the same regardless of how you choose to host the UI assets.

This CloudFormation template creates and configures the AWS resources needed for the solution, including the S3 bucket that hosts the UI assets, the CloudFront distribution and origin access identity that serve them, the optional pre- and post-annotation Lambda functions, and the related IAM roles.

After the CloudFormation stack deployment is complete, you can find the CloudFront URLs for accessing the JavaScript and CSS files on the AWS CloudFormation console. Note these values—you will need them to update your UI template when creating the labeling job.
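You can also read these outputs programmatically instead of copying them from the console. A minimal sketch, assuming a stack name and output key names that may differ from those in the actual template:

import boto3

cloudformation = boto3.client("cloudformation")

# Stack name and output keys are assumptions; check your actual stack outputs.
stack = cloudformation.describe_stacks(StackName="audio-video-segmentation")["Stacks"][0]
outputs = {o["OutputKey"]: o["OutputValue"] for o in stack["Outputs"]}

js_url = outputs.get("JavascriptFileUrl")   # CloudFront URL of the JS asset
css_url = outputs.get("StylesheetFileUrl")  # CloudFront URL of the CSS asset
print(js_url, css_url)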

Prepare your input manifest

Before you create the labeling job, you need to prepare an input manifest file that tells SageMaker Ground Truth what data to present to annotators. The manifest structure is flexible and can be customized based on your needs. For this post, we use a simple structure:

{ "source": "s3://YOUR-BUCKET/audio/sample1.mp3", "call-id": "call-123", "transcription": "Customer: I'm really happy with your smart home security system. However, I have feature request that would make it better\nRepresentative: We're always eager to hear from our customers. What feature would you like to see added ? " }

You can adapt this structure to include additional metadata that your annotation workflow requires. For example, you might want to add speaker information, timestamps, or other contextual data. The key is making sure your UI template is designed to process and display these attributes appropriately.
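For instance, the sketch below builds a small manifest with extra metadata fields and uploads it to Amazon S3. The fields beyond source, call-id, and transcription are hypothetical and only need to match what your UI template expects; note that a Ground Truth manifest is JSON Lines, with one JSON object per line.

import json
import boto3

records = [
    {
        "source": "s3://YOUR-BUCKET/audio/sample1.mp3",
        "call-id": "call-123",
        "transcription": "Customer: I'm really happy with your smart home security system...",
        # Hypothetical extra metadata your UI template could display
        "speaker-count": 2,
        "recording-date": "2024-11-01",
    },
]

# One JSON object per line (JSON Lines)
manifest_body = "\n".join(json.dumps(r) for r in records)

boto3.client("s3").put_object(
    Bucket="YOUR-BUCKET",
    Key="manifests/input.manifest",
    Body=manifest_body.encode("utf-8"),
)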

Create your labeling job

With the infrastructure deployed, let’s create the labeling job in SageMaker Ground Truth. For full instructions, refer to Accelerate custom labeling workflows in Amazon SageMaker Ground Truth without using AWS Lambda.

1. On the SageMaker console, choose Create labeling job.
2. Give your job a name.
3. Specify your input data location in Amazon S3.
4. Specify an output bucket where annotations will be stored.
5. For the task type, select Custom labeling task.
6. In the UI template field, locate the placeholder values for the JavaScript and CSS files and update them as follows:
   - Replace audiovideo-wavesufer.js with your CloudFront JavaScript URL from the CloudFormation stack outputs.
   - Replace audiovideo-stylesheet.css with your CloudFront CSS URL from the CloudFormation stack outputs.
<!-- Custom Javascript and Stylesheet -->
<script src="audiovideo-wavesufer.js"></script>
<link rel="stylesheet" href="audiovideo-stylesheet.css">
7. Before you launch the job, use the Preview feature to verify your interface.

You should see the Wavesurfer.js interface load correctly with all controls working properly. This preview step is crucial—it confirms that your CloudFront URLs are correctly specified and the interface is properly configured.
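If you keep the UI template in a file, you can make the same substitution with a short script before pasting the template into the console (or uploading it to Amazon S3 for the programmatic flow in the next section). The template file name and URLs below are placeholders:

import boto3

# Placeholders: use your actual template file name and the URLs from the stack outputs
TEMPLATE_FILE = "audiovideo-template.liquid.html"
js_url = "https://dxxxxxxxxxxxx.cloudfront.net/audiovideo-wavesufer.js"
css_url = "https://dxxxxxxxxxxxx.cloudfront.net/audiovideo-stylesheet.css"

with open(TEMPLATE_FILE) as f:
    template = f.read()

# Swap the placeholder asset names for the CloudFront URLs
template = template.replace("audiovideo-wavesufer.js", js_url)
template = template.replace("audiovideo-stylesheet.css", css_url)

# Upload the finished template for use with UiTemplateS3Uri
boto3.client("s3").put_object(
    Bucket="your-bucket-name",
    Key="templates/audiovideo-template.liquid.html",
    Body=template.encode("utf-8"),
)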

Programmatic setup

Alternatively, you can create your labeling job programmatically using the CreateLabelingJob API. This is particularly useful for automation or when you need to create multiple jobs. See the following code:

import boto3

# Create the SageMaker client used to submit the labeling job
sagemaker = boto3.client("sagemaker")

response = sagemaker.create_labeling_job(
    LabelingJobName="audio-segmentation-job-demo",
    LabelAttributeName="label",
    InputConfig={
        "DataSource": {
            "S3DataSource": {
                "ManifestS3Uri": "s3://your-bucket-name/path-to-manifest"
            }
        }
    },
    OutputConfig={
        "S3OutputPath": "s3://your-bucket-name/path-to-output-file"
    },
    RoleArn="arn:aws:iam::012345678910:role/SagemakerExecutionRole",
    # Optionally add PreHumanTaskLambdaArn or AnnotationConsolidationConfig
    HumanTaskConfig={
        "TaskAvailabilityLifetimeInSeconds": 21600,
        "TaskTimeLimitInSeconds": 3600,
        "WorkteamArn": "arn:aws:sagemaker:us-east-1:012345678910:workteam/private-crowd/work-team-name",
        "TaskDescription": "Segment and annotate the audio or video content presented in the UI.",
        "MaxConcurrentTaskCount": 1000,
        "TaskTitle": "Audio/Video Segmentation and Annotation",
        "NumberOfHumanWorkersPerDataObject": 1,
        "UiConfig": {
            "UiTemplateS3Uri": "s3://your-bucket-name/path-to-ui-template"
        }
    }
)

The API approach offers the same functionality as the SageMaker console, but allows for automation and integration with existing workflows. Whether you choose the SageMaker console or API approach, the result is the same: a fully configured labeling job ready for your annotation team.
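After the job is created, you can track its progress and locate the output with the DescribeLabelingJob API. A brief sketch, reusing the sagemaker client from the preceding snippet:

import time

while True:
    desc = sagemaker.describe_labeling_job(LabelingJobName="audio-segmentation-job-demo")
    status = desc["LabelingJobStatus"]
    print(f"Status: {status}, labeled objects: {desc['LabelCounters']['TotalLabeled']}")
    if status in ("Completed", "Failed", "Stopped"):
        break
    time.sleep(60)

if status == "Completed":
    # S3 location of the output manifest produced by the job
    print(desc["LabelingJobOutput"]["OutputDatasetS3Uri"])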

Understanding the output

After your annotators complete their work, SageMaker Ground Truth will generate an output manifest in your specified S3 bucket. This manifest contains rich information at two levels: segment-level annotations for each marked region and full-content classifications that apply to the entire file.

Let’s look at a sample output to understand its structure:

{  "answers": [    {      "acceptanceTime": "2024-11-04T18:33:38.658Z",      "answerContent": {        "annotations": {          "categories": {            "language": [              "English",              "Hindi",              "Spanish",              "French",              "German",              "Dutch"            ],            "speaker": [              "Customer",              "Representative"            ]          },          "startTimestamp": 1730745219028,          "startUTCTime": "Mon, 04 Nov 2024 18:33:39 GMT",          "streams": {            "language": [              {                "id": "English",                "start": 0,                "end": 334.808635,                "text": "Sample text in English",                "emotion": "happy"              },              {                "id": "Spanish",                "start": 334.808635,                "end": 550.348471,                "text": "Texto de ejemplo en español",                "emotion": "neutral"              }            ]          },          "endTimestamp": 1730745269602,          "endUTCTime": "Mon, 04 Nov 2024 18:34:29 GMT",          "elapsedTime": 50574        },        "backgroundNoise": {          "ambient": false,          "music": true,          "traffic": false        },        "emotiontag": "Neutral",        "environmentalSounds": {          "birdsChirping": false,          "doorbell": true,          "footsteps": false        },        "rate": {          "1": false,          "2": false,          "3": false,          "4": false,          "5": true        },        "textTranslationFinal": "sample text for transcription"      }    }  ]} 

This two-level annotation structure provides valuable training data for your AI models, capturing both fine-grained details and overall content assessment.
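As a sketch of how you might consume this output downstream, the following reads a worker-response file with the structure shown above and extracts both the segment-level annotations and the overall classifications; the local file name is a placeholder.

import json

# Placeholder path to a worker-response file downloaded from the output S3 location
with open("worker-response.json") as f:
    response = json.load(f)

for answer in response["answers"]:
    content = answer["answerContent"]

    # Segment-level annotations: one list of segments per track ("stream")
    for track_name, segments in content["annotations"]["streams"].items():
        for seg in segments:
            print(
                f"[{track_name}] {seg['start']:.2f}s - {seg['end']:.2f}s "
                f"label={seg['id']} emotion={seg.get('emotion')} text={seg.get('text')}"
            )

    # Full-content classifications, for example the overall rating and emotion tag
    rating = next((k for k, v in content["rate"].items() if v), None)
    print("Overall rating:", rating, "| emotion:", content["emotiontag"])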

Customizing the solution

Our audio/video segmentation solution is designed to be highly customizable. Let’s walk through how you can adapt the interface to match your specific annotation requirements.

Customize segment-level annotations

The segment-level annotations are controlled in the report() function of the JavaScript code. The following code snippet shows how you can modify the annotation options for each segment:

ranges.forEach(function (r) {
    // ... existing code ...

    // Example: Adding a custom dropdown for speaker identification
    var speakerDropdown = $('<select>').attr({
        name: 'speaker',
        class: 'custom-dropdown-width'
    });
    var speakerOptions = ['Speaker A', 'Speaker B', 'Multiple Speakers', 'Background Noise'];
    speakerOptions.forEach(function(option) {
        speakerDropdown.append($('<option>').val(option).text(option));
    });

    // Example: Adding a checkbox for quality issues
    var qualityCheck = $('<input>').attr({
        type: 'checkbox',
        name: 'quality_issue'
    });
    var qualityLabel = $('<label>').text('Contains Quality Issues');
    tr.append($('<TD>').append(speakerDropdown));
    tr.append($('<TD>').append(qualityCheck).append(qualityLabel));

    // Add event listeners for your new fields
    speakerDropdown.on('change', function() {
        r.speaker = $(this).val();
        updateTrackListData(r);
    });

    qualityCheck.on('change', function() {
        r.hasQualityIssues = $(this).is(':checked');
        updateTrackListData(r);
    });
});

You can remove existing fields or add new ones based on your needs. Make sure you update the data model (the updateTrackListData function) to handle your custom fields.

Modify full-content classifications

For classifications that apply to the entire audio/video file, you can modify the HTML template. The following code is an example of adding custom classification options:

<div class="row">    <div class="col-6">        <p><strong>Audio Quality Assessment:</strong></p>        <label class="radio">            <input type="radio" name="audioQuality" value="excellent" style="width: 20px;">            Excellent        </label>        <label class="radio">            <input type="radio" name="audioQuality" value="good" style="width: 20px;">            Good        </label>        <label class="radio">            <input type="radio" name="audioQuality" value="poor" style="width: 20px;">            Poor        </label>    </div>    <div class="col-6">        <p><strong>Content Type:</strong></p>        <label class="checkbox">            <input type="checkbox" name="contentType" value="interview" style="width: 20px;">            Interview        </label>        <label class="checkbox">            <input type="checkbox" name="contentType" value="presentation" style="width: 20px;">            Presentation        </label>    </div></div>

The classifications you add here will be included in your output manifest, allowing you to capture both segment-level and full-content annotations.

Extending Wavesurfer.js functionality

Our solution uses Wavesurfer.js, an open source audio visualization library. Although we’ve implemented core functionality for segmentation and annotation, you can extend this further using Wavesurfer.js’s rich feature set, for example by adding capabilities such as spectrogram views or timeline navigation from its plugin ecosystem.

For these customizations, we recommend consulting the Wavesurfer.js documentation. When implementing additional Wavesurfer.js features, remember to test thoroughly in the SageMaker Ground Truth preview to verify compatibility with the labeling workflow.

Wavesurfer.js is distributed under the BSD-3-Clause license. Although we’ve tested the integration thoroughly, modifications you make to the Wavesurfer.js implementation should be tested in your environment. The Wavesurfer.js community provides excellent documentation and support for implementing additional features.

Clean up

To clean up the resources created during this tutorial, follow these steps:

1. Stop the SageMaker Ground Truth labeling job if it’s still running and you no longer need it. This will halt ongoing labeling tasks and stop additional charges from accruing.
2. Empty the S3 buckets by deleting all objects within them. S3 buckets must be emptied before they can be deleted, so removing all stored files facilitates a smooth cleanup process.
3. Delete the CloudFormation stack to remove all the AWS resources provisioned by the template. This action automatically deletes associated resources such as the S3 buckets, CloudFront distribution, Lambda functions, and related IAM roles (a programmatic version of these steps follows this list).
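A minimal boto3 sketch of the same cleanup, assuming the job, bucket, and stack names used earlier in this post:

import boto3

# 1. Stop the labeling job if it is still running
sagemaker = boto3.client("sagemaker")
sagemaker.stop_labeling_job(LabelingJobName="audio-segmentation-job-demo")

# 2. Empty the S3 bucket (required before the stack can delete it)
bucket = boto3.resource("s3").Bucket("your-bucket-name")
bucket.objects.all().delete()
bucket.object_versions.all().delete()  # only needed if versioning is enabled

# 3. Delete the CloudFormation stack and the resources it provisioned
boto3.client("cloudformation").delete_stack(StackName="audio-video-segmentation")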

Conclusion

In this post, we walked through implementing an audio and video segmentation solution using SageMaker Ground Truth. We saw how to deploy the necessary infrastructure, configure the annotation interface, and create labeling jobs both through the SageMaker console and programmatically. The solution’s ability to capture precise segment-level annotations along with overall content classifications makes it particularly valuable for generating high-quality training data for generative AI models, whether you’re working on speech synthesis, video generation, or other multimedia AI applications. As you develop your AI models for audio and video generation, remember that the quality of human feedback directly impacts your model’s performance—whether you’re training models to generate more natural-sounding speech, create coherent video sequences, or understand complex audio patterns.

We encourage you to visit our GitHub repository to explore the solution further and adapt it to your specific needs. You can enhance your annotation workflows by customizing the interface, adding new classification categories, or implementing additional Wavesurfer.js features. To learn more about creating custom labeling workflows in SageMaker Ground Truth, visit Accelerate custom labeling workflows in Amazon SageMaker Ground Truth without using AWS Lambda and Custom labeling workflows.

If you’re looking for a turnkey data labeling solution, consider Amazon SageMaker Ground Truth Plus, which provides access to an expert workforce trained in various machine learning tasks. With SageMaker Ground Truth Plus, you can quickly receive high-quality annotations without the need to build and manage your own labeling workflows, reducing costs by up to 40% and accelerating the delivery of labeled data at scale.

Start building your annotation workflow today and contribute to the next generation of AI models that push the boundaries of what’s possible in audio and video generation.


About the Authors

Sundar Raghavan is an AI/ML Specialist Solutions Architect at AWS, helping customers leverage SageMaker and Bedrock to build scalable and cost-efficient pipelines for computer vision applications, natural language processing, and generative AI. In his free time, Sundar loves exploring new places, sampling local eateries and embracing the great outdoors.

Vineet Agarwal is a Senior Manager of Customer Delivery on the Amazon Bedrock team, responsible for Human in the Loop services. He has been at AWS for over 2 years, managing go-to-market activities and business and technical operations. Prior to AWS, he worked in the SaaS, fintech, and telecommunications industries in services leadership roles. He has an MBA from the Indian School of Business and a B. Tech in Electronics and Communications Engineering from the National Institute of Technology, Calicut (India). In his free time, Vineet loves playing racquetball and enjoying outdoor activities with his family.
