AWS Machine Learning Blog
Build a domain‐aware data preprocessing pipeline: A multi‐agent collaboration approach

This post introduces a multi-agent collaboration pipeline designed to address the challenges the insurance industry faces when processing large volumes of unstructured data. The solution uses a set of specialized agents, responsible for classification, conversion, metadata extraction, and domain-specific tasks, to automate the ingestion and transformation of diverse multimodal unstructured data, improving accuracy and enabling end-to-end insights. At its core, the solution uses Amazon Bedrock Agents to create intelligent, domain-aware agents that retrieve context from Amazon Bedrock Knowledge Bases, call APIs, and orchestrate multi-step tasks, ultimately storing the enriched data and metadata in a data lake as the foundation for fraud detection, advanced analytics, and 360-degree customer views.

📁 **Multi-agent collaboration pipeline**: The pipeline consists of multiple specialized agents, each handling a distinct function such as classification, conversion, metadata extraction, and domain-specific analysis. This modular design improves scalability, maintainability, and reusability.

🤖 **Unstructured Data Hub Supervisor Agent**: The Supervisor Agent orchestrates the workflow, delegates tasks, and invokes specialized downstream agents. It receives multimodal data and processing instructions from the user portal and forwards each unstructured data type to the Classification Collaborator Agent.

🗂️ **Classification Collaborator Agent**: This agent determines each file's type using domain-specific rules and makes sure it is converted when needed or classified directly. It identifies the file extension and, if the file is DOCX, PPT, or XLS, routes it to the Document Conversion Agent first.

🔑 **Metadata extraction with human-in-the-loop**: Automated extraction of key data, staged review, and validation and correction by domain experts provide accurate and complete metadata that supports downstream analytics.

🚀 **Amazon Bedrock Agents**: The solution uses Amazon Bedrock Agents to build intelligent, domain-aware agents that retrieve context from Amazon Bedrock Knowledge Bases, call APIs, and orchestrate multi-step tasks.

Enterprises—especially in the insurance industry—face increasing challenges in processing vast amounts of unstructured data from diverse formats, including PDFs, spreadsheets, images, videos, and audio files. These might include claims document packages, crash event videos, chat transcripts, or policy documents. All contain critical information across the claims processing lifecycle.

Traditional data preprocessing methods, though functional, might have limitations in accuracy and consistency. This might affect metadata extraction completeness, workflow velocity, and the extent of data utilization for AI-driven insights (such as fraud detection or risk analysis). To address these challenges, this post introduces a multi‐agent collaboration pipeline: a set of specialized agents for classification, conversion, metadata extraction, and domain‐specific tasks. By orchestrating these agents, you can automate the ingestion and transformation of a wide range of multimodal unstructured data—boosting accuracy and enabling end‐to‐end insights.

For teams processing a small volume of uniform documents, a single-agent setup might be more straightforward to implement and sufficient for basic automation. However, if your data spans diverse domains and formats—such as claims document packages, collision footage, chat transcripts, or audio files—a multi-agent architecture offers distinct advantages. Specialized agents allow for targeted prompt engineering, better debugging, and more accurate extraction, each tuned to a specific data type.

As volume and variety grow, this modular design scales more gracefully, allowing you to plug in new domain-aware agents or refine individual prompts and business logic—without disrupting the broader pipeline. Feedback from domain experts in the human-in-the-loop phase can also be mapped back to specific agents, supporting continuous improvement.

To support this adaptive architecture, you can use Amazon Bedrock, a fully managed service that makes it straightforward to build and scale generative AI applications using foundation models (FMs) from leading AI companies like AI21 Labs, Anthropic, Cohere, DeepSeek, Luma, Meta, Mistral AI, poolside (coming soon), Stability AI, and Amazon through a single API. A powerful feature of Amazon Bedrock—Amazon Bedrock Agents—enables the creation of intelligent, domain-aware agents that can retrieve context from Amazon Bedrock Knowledge Bases, call APIs, and orchestrate multi-step tasks. These agents provide the flexibility and adaptability needed to process unstructured data at scale, and can evolve alongside your organization’s data and business workflows.
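
As a concrete illustration, the following minimal Python sketch shows how an application can invoke a Bedrock agent through the bedrock-agent-runtime API. The agent ID, alias ID, and prompt are placeholder assumptions; in this solution, the real values come from the CloudFormation stack outputs.

```python
import uuid

import boto3

# Runtime client for invoking Amazon Bedrock agents.
client = boto3.client("bedrock-agent-runtime", region_name="us-west-2")

response = client.invoke_agent(
    agentId="AGENT_ID",             # placeholder: Supervisor Agent ID
    agentAliasId="AGENT_ALIAS_ID",  # placeholder: published agent alias
    sessionId=str(uuid.uuid4()),    # ties multi-step interactions together
    inputText="Process the uploaded claims document package.",
)

# The response is an event stream; concatenate the returned text chunks.
completion = "".join(
    event["chunk"]["bytes"].decode("utf-8")
    for event in response["completion"]
    if "chunk" in event
)
print(completion)
```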

Solution overview

Our pipeline functions as an insurance unstructured data preprocessing hub. Enriched outputs and associated metadata ultimately land in a metadata-rich unstructured data lake, forming the foundation for fraud detection, advanced analytics, and 360-degree customer views.

The following diagram illustrates the solution architecture.

The end-to-end workflow features a supervisor agent at the center, classification and conversion agents branching off, a human‐in‐the‐loop step, and Amazon Simple Storage Service (Amazon S3) as the final unstructured data lake destination.

Multi‐agent collaboration pipeline

This pipeline is composed of multiple specialized agents, each handling a distinct function such as classification, conversion, metadata extraction, and domain-specific analysis. Unlike a single monolithic agent that attempts to manage all tasks, this modular design promotes scalability, maintainability, and reuse. Individual agents can be independently updated, swapped, or extended to accommodate new document types or evolving business rules without impacting the overall system. This separation of concerns improves fault tolerance and enables parallel processing, resulting in faster and more reliable data transformation workflows.

Beyond this modularity, multi-agent collaboration yields concrete efficiency gains: prompts can be tuned per data type, failures are easier to isolate and debug, and independent agents can run in parallel.

Unstructured Data Hub Supervisor Agent

The Supervisor Agent orchestrates the workflow, delegates tasks, and invokes specialized downstream agents. It has the following key responsibilities:

- Receive incoming multimodal data and processing instructions from the user portal (multimodal claims document packages, vehicle damage images, audio transcripts, or repair estimates).
- Forward each unstructured data type to the Classification Collaborator Agent to determine whether a conversion step is needed or direct classification is possible.
- Coordinate specialized domain processing by invoking the appropriate agent for each data type—for example, a claims document package is handled by the Claims Documentation Package Processing Agent, and repair estimates go to the Vehicle Repair Estimate Processing Agent (a minimal dispatch sketch follows this list).
- Make sure that every incoming data item eventually lands, along with its metadata, in the S3 data lake.
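
To make the delegation step concrete, here is an illustrative sketch of a supervisor-style dispatch table. The category names and agent ID placeholders are assumptions based on the agents described in this post, not the solution's actual implementation.

```python
# Hypothetical dispatch table: each classified data type maps to the
# (placeholder) ID of its specialized downstream agent.
DOWNSTREAM_AGENTS = {
    "claims_document_package": "CLAIMS_PKG_AGENT_ID",
    "vehicle_repair_estimate": "REPAIR_ESTIMATE_AGENT_ID",
    "vehicle_damage_video": "DAMAGE_ANALYSIS_AGENT_ID",
    "audio_transcript": "AV_TRANSCRIPT_AGENT_ID",
    "policy_document": "DOC_CLASSIFICATION_AGENT_ID",
}

def delegate(classification: dict) -> str:
    """Pick the downstream agent for a classified item; unknown types
    fall back to human review rather than failing silently."""
    return DOWNSTREAM_AGENTS.get(
        classification["category"], "HUMAN_REVIEW_QUEUE"
    )
```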

Classification Collaborator Agent

The Classification Collaborator Agent determines each file’s type using domain‐specific rules and makes sure it’s either converted (if needed) or directly classified. This includes the following steps:

- Identify the file extension. If it's DOCX, PPT, or XLS, route the file to the Document Conversion Agent first.
- Output a unified classification result for each standardized document, specifying the category, confidence, extracted metadata, and next steps (a sketch of this result shape follows).
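
The following sketch illustrates this routing contract. The category label, confidence value, and result fields are assumptions modeled on the description above, not the agent's actual output schema.

```python
from pathlib import Path

# Extensions that must be standardized to PDF before classification.
CONVERT_FIRST = {".docx", ".ppt", ".pptx", ".xls", ".xlsx"}

def classify(file_path: str) -> dict:
    """Return a unified classification result for one file.

    The category/confidence values here are hard-coded stand-ins; in
    the actual pipeline they come from the agent's LLM call.
    """
    suffix = Path(file_path).suffix.lower()
    if suffix in CONVERT_FIRST:
        return {"next_step": "document_conversion_agent", "file": file_path}
    return {
        "file": file_path,
        "category": "claims_document_package",  # stand-in value
        "confidence": 0.97,                     # stand-in value
        "metadata": {"claim_number": None, "policy_number": None},
        "next_step": "claims_documentation_package_processing_agent",
    }
```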

Document Conversion Agent

The Document Conversion Agent converts non‐PDF files into PDF and extracts initial metadata (creation date, file size, and so on). This includes the following steps:

- Transform DOCX, PPT, XLS, and XLSX into PDF.
- Capture embedded metadata.
- Return the new PDF to the Classification Collaborator Agent for final classification.
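
As a rough sketch of the conversion step, the following assumes LibreOffice is available in the runtime environment (for example, in a container image); the actual agent may use a different conversion mechanism.

```python
import subprocess
from datetime import datetime, timezone
from pathlib import Path

def convert_to_pdf(src: str, out_dir: str = "/tmp/converted") -> dict:
    """Convert an Office document to PDF and capture basic file metadata.

    Assumes the `libreoffice` binary is on PATH (on some systems it is
    named `soffice`).
    """
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(
        ["libreoffice", "--headless", "--convert-to", "pdf",
         "--outdir", out_dir, src],
        check=True,
    )
    pdf_path = Path(out_dir) / (Path(src).stem + ".pdf")
    stat = Path(src).stat()
    return {
        "pdf": str(pdf_path),
        "metadata": {
            "source_file": src,
            "file_size_bytes": stat.st_size,
            "modified_utc": datetime.fromtimestamp(
                stat.st_mtime, tz=timezone.utc
            ).isoformat(),
        },
    }
```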

Specialized classification agents

Each classification agent handles a specific modality of data, such as the Document Classification Agent for documents, the Audio/Video Transcript Agent for audio and video, and the Vehicle Damage Analysis Agent for collision imagery.

Additionally, we have defined specialized downstream agents, such as the Claims Documentation Package Processing Agent and the Vehicle Repair Estimate Processing Agent.

After the high‐level classification identifies a file as, for example, a claims document package or repair estimate, the Supervisor Agent invokes the appropriate specialized agent to perform deeper domain‐specific transformation and extraction.

Metadata extraction and human-in-the-loop

Metadata is essential for automated workflows. Without accurate metadata fields—like claim numbers, policy numbers, coverage dates, loss dates, or claimant names—downstream analytics lack context. This part of the solution handles data extraction, error handling, and recovery through automated extraction of key fields, staged review of the results, and human-in-the-loop validation and correction by domain experts.
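
To illustrate the staged review, here is a minimal validation sketch that flags records for human review. The field names follow the metadata fields discussed in this post, but the regular expressions are illustrative assumptions, not the solution's actual rules.

```python
import re

# Required fields and lightweight format checks; patterns are
# illustrative assumptions modeled on the sample data in this post.
REQUIRED_FIELDS = {
    "claim_number": re.compile(r"^\d{10}$"),
    "policy_number": re.compile(r"^[A-Z]{2}\d+$"),
    "date_of_loss": re.compile(r"^\d{4}-\d{2}-\d{2}$"),
    "claimant_name": re.compile(r"\S+"),
}

def review_metadata(metadata: dict) -> list[str]:
    """Return a list of issues; an empty list means the record can
    proceed, otherwise it is staged for human-in-the-loop review."""
    issues = []
    for field, pattern in REQUIRED_FIELDS.items():
        value = metadata.get(field)
        if not value:
            issues.append(f"missing: {field}")
        elif not pattern.match(str(value)):
            issues.append(f"suspect format: {field}={value}")
    return issues
```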

Eventually, automated issue resolver agents can be introduced iteratively to handle an increasing share of data fixes, further reducing the need for manual review. Several strategies can be introduced to enable this progression and improve resilience and adaptability over time.

By combining these strategies, the pipeline becomes increasingly adaptive—continually improving data quality and enabling scalable, metadata-driven insights across the enterprise.

Metadata‐rich unstructured data lake

After each unstructured data type is converted and classified, both the standardized content and metadata JSON files are stored in an unstructured data lake (Amazon S3). This repository unifies different data types (images, transcripts, documents) through shared metadata, enabling cross-document search, fraud detection, advanced analytics, and 360-degree customer views.
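
A minimal sketch of this storage layout follows; the bucket name, key prefix, and file names are assumptions chosen to show how content and metadata can be joined by a shared key.

```python
import json

import boto3

s3 = boto3.client("s3")

def store_artifacts(bucket: str, doc_id: str, pdf_bytes: bytes,
                    metadata: dict) -> None:
    """Write the standardized content and its metadata JSON under a
    shared prefix so downstream consumers can join them by key."""
    prefix = f"processed/{doc_id}"  # illustrative key layout
    s3.put_object(
        Bucket=bucket, Key=f"{prefix}/content.pdf", Body=pdf_bytes
    )
    s3.put_object(
        Bucket=bucket,
        Key=f"{prefix}/metadata.json",
        Body=json.dumps(metadata).encode("utf-8"),
        ContentType="application/json",
    )
```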

Multi‐modal, multi‐agentic pattern

In our AWS CloudFormation template, each multimodal data type follows a specialized flow, as demonstrated in the use cases later in this post.

Human-in-the-loop and future improvements

The human-in-the-loop component is key for verifying and adding missing metadata and for fixing incorrect categorization of data. However, the pipeline is designed to evolve toward full automation, minimizing human oversight except for the most complex cases.

Prerequisites

Before deploying this solution, make sure that you have an AWS account with permissions to create the required IAM roles and deploy CloudFormation stacks, and that Amazon Bedrock model access is enabled in your chosen Region for the Anthropic Claude models used by the agents.

Deploy the solution with AWS CloudFormation

Complete the following steps to set up the solution resources:

1. Sign in to the AWS Management Console as an IAM administrator or appropriate IAM user.
2. Choose Launch Stack to deploy the CloudFormation template.
3. Provide the necessary parameters and create the stack.

For this setup, we use us-west-2 as our Region, Anthropic’s Claude 3.5 Haiku model for orchestrating the flow between the different agents, and Anthropic’s Claude 3.5 Sonnet V2 model for conversion, categorization, and processing of multimodal data.

If you want to use other models on Amazon Bedrock, you can do so by making the appropriate changes in the CloudFormation template. Verify that your chosen models are supported in your Region and provide the features your workflow needs.

It will take about 30 minutes to deploy the solution. After the stack is deployed, you can view the various outputs of the CloudFormation stack on the Outputs tab, as shown in the following screenshot.

The provided CloudFormation template creates multiple S3 buckets (such as DocumentUploadBucket, SampleDataBucket, and KnowledgeBaseDataBucket) for raw uploads, sample files, Amazon Bedrock Knowledge Bases references, and more. Each specialized Amazon Bedrock agent or Lambda function uses these buckets to store intermediate or final artifacts.

The following screenshot is an illustration of the Amazon Bedrock agents that are deployed in the AWS account.

The next section outlines how to test the unstructured data processing workflow.

Test the unstructured data processing workflow

In this section, we present different use cases to demonstrate the solution. Before you begin, complete the following steps:

1. Locate the APIGatewayInvokeURL value from the CloudFormation stack's outputs. This URL launches the Insurance Unstructured Data Preprocessing Hub in your browser.
2. Download the sample data files from the designated S3 bucket (SampleDataBucketName) to your local machine. The following screenshots show the bucket details from the CloudFormation stack's outputs and the contents of the sample data bucket.

With these details, you can now test the pipeline by uploading the sample multimodal files through the Insurance Unstructured Data Preprocessing Hub Portal: a claims document package (PDF), a collision center workbook (XLSX), a collision video (MP4), a recorded call between a claimant and a customer service associate (MP4), and an auto insurance policy document (DOCX).

Each multimodal data type is processed through a series of agents for classification, conversion where needed, and domain-specific extraction.

Finally, the processed files, along with their enriched metadata, are stored in the S3 data lake. Now, let’s proceed to the actual use cases.

Use Case 1: Claims document package

This use case demonstrates the complete workflow for processing a multimodal claims document package. By uploading a PDF document to the pipeline, the system automatically classifies the document type, extracts essential metadata, and categorizes each page into specific components.

1. Choose Upload File in the UI and choose the PDF file.

The file upload might take some time depending on the document size.

2. When the upload is complete, you can confirm that the extracted metadata values are as follows:
   - Claim Number: 0112233445
   - Policy Number: SF9988776655
   - Date of Loss: 2025-01-01
   - Claimant Name: Jane Doe

The Classification Collaborator Agent identifies the document as a Claims Document Package. Metadata (such as claim ID and incident date) is automatically extracted and displayed for review.

3. For this use case, no changes are made—simply choose Continue Preprocessing to proceed.

The processing stage might take up to 15 minutes to complete. Rather than manually checking the S3 bucket (identified in the CloudFormation stack outputs as KnowledgeBaseDataBucket) to verify that 72 files—one for each page and its corresponding metadata JSON—have been generated, you can monitor the progress by periodically choosing Check Queue Status. This lets you view the current state of the processing queue in real time.
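
If you prefer to script this check, the following sketch counts the objects generated so far. The bucket name is the KnowledgeBaseDataBucket value from the stack outputs; the key prefix is an assumption.

```python
import boto3

def count_processed_files(bucket: str, prefix: str = "processed/") -> int:
    """Count objects under the output prefix to track progress; a rough
    stand-in for the portal's Check Queue Status button."""
    paginator = boto3.client("s3").get_paginator("list_objects_v2")
    total = 0
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        total += page.get("KeyCount", 0)
    return total

# Example: poll until all 72 expected files have landed.
# while count_processed_files("my-kb-data-bucket") < 72: ...
```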

The pipeline further categorizes each page into specific types (for example, lawyer letter, police report, medical bills, doctor’s report, health forms, x-rays). It also generates corresponding markup text files and metadata JSON files.

Finally, the processed text and metadata JSON files are stored in the unstructured S3 data lake.

The following diagram illustrates the complete workflow.

Use Case 2: Collision center workbook for vehicle repair estimate

In this use case, we upload a collision center workbook to trigger the workflow that converts the file, extracts repair estimate details, and stages the data for review before final storage.

1. Choose Upload File and choose the XLSX workbook.
2. Wait for the upload to complete and confirm that the extracted metadata is accurate:
   - Claim Number: CLM20250215
   - Policy Number: SF9988776655
   - Claimant Name: John Smith
   - Vehicle: Truck

The Document Conversion Agent converts the file to PDF if needed, and the Classification Collaborator Agent identifies it as a repair estimate. The Vehicle Repair Estimate Processing Agent extracts cost lines, part numbers, and labor hours.

3. Review and update the displayed metadata as necessary, then choose Continue Preprocessing to trigger final storage.

The finalized file and metadata are stored in Amazon S3.

The following diagram illustrates this workflow.

Use Case 3: Collision video with audio transcript

For this use case, we upload a video showing the accident scene to trigger a workflow that analyzes both visual and audio data, extracts key frames for collision severity, and stages metadata for review before final storage.

1. Choose Upload File and choose the MP4 video.
2. Wait until the upload is complete, then review the collision scenario and adjust the displayed metadata to correct omissions or inaccuracies as follows:
   - Claim Number: 0112233445
   - Policy Number: SF9988776655
   - Date of Loss: 01-01-2025
   - Claimant Name: Jane Doe
   - Policy Holder Name: John Smith

The Classification Collaborator Agent directs the video to either the Audio/Video Transcript or Vehicle Damage Analysis agent. Key frames are analyzed to determine collision severity.

3. Review and update the displayed metadata (for example, policy number, location), then choose Continue Preprocessing to initiate final storage.

Final transcripts and metadata are stored in Amazon S3, ready for advanced analytics such as verifying story consistency.

The following diagram illustrates this workflow.

Use Case 4: Audio transcript between claimant and customer service associate

Next, we upload a video that captures the claimant reporting an accident to trigger the workflow that extracts an audio transcript and identifies key metadata for review before final storage.

1. Choose Upload File and choose the MP4 file.
2. Wait until the upload is complete, then review the call scenario and adjust the displayed metadata to correct any omissions or inaccuracies as follows:
   - Claim Number: Not Assigned Yet
   - Policy Number: SF9988776655
   - Claimant Name: Jane Doe
   - Policy Holder Name: John Smith
   - Date of Loss: January 1, 2025 8:30 AM

The Classification Collaborator Agent routes the file to the Audio/Video Transcript Agent for processing. Key metadata attributes are automatically identified from the call.

3. Review and correct any incomplete metadata, then choose Continue Preprocessing to proceed.

Final transcripts and metadata are stored in Amazon S3, ready for advanced analytics (for example, verifying story consistency).

The following diagram illustrates this workflow.

Use Case 5: Auto insurance policy document

For our final use case, we upload an insurance policy document to trigger the workflow that converts and classifies the document, extracts key metadata for review, and stores the finalized output in Amazon S3.

1. Choose Upload File and choose the DOCX file.
2. Wait until the upload is complete, and confirm that the extracted metadata values are as follows:
   - Policy Number: SF9988776655
   - Policy Type: Auto Insurance
   - Effective Date: 12/12/2024
   - Policy Holder Name: John Smith

The Document Conversion Agent transforms the document into a standardized PDF format if required. The Classification Collaborator Agent then routes it to the Document Classification Agent for categorization as an Auto Insurance Policy Document. Key metadata attributes are automatically identified and presented for user review.

3. Review and correct incomplete metadata, then choose Continue Preprocessing to trigger final storage.

The finalized policy document in markup format, along with its metadata, is stored in Amazon S3—ready for advanced analytics such as verifying story consistency.

The following diagram illustrates this workflow.

Similar workflows can be applied to other types of insurance multimodal data and documents by uploading them on the Data Preprocessing Hub Portal. Whenever needed, this process can be enhanced by introducing specialized downstream Amazon Bedrock agents that collaborate with the existing Supervisor Agent, Classification Agent, and Conversion Agents.

Amazon Bedrock Knowledge Bases integration

To use the newly processed data in the data lake, complete the following steps to ingest the data in Amazon Bedrock Knowledge Bases and interact with the data lake using a structured workflow. This integration allows for dynamic querying across different document types, enabling deeper insights from multimodal data.

1. Choose Chat with Your Documents to open the chat interface.
2. Choose Sync Knowledge Base to initiate the job that ingests and indexes the newly processed files and the available metadata into the Amazon Bedrock knowledge base.
3. After the sync is complete (which might take a couple of minutes), enter your queries in the text box. For example, set Policy Number to SF9988776655 and try asking:
   - "Retrieve details of all claims filed against the policy number by multiple claimants."
   - "What is the nature of Jane Doe's claim, and what documents were submitted?"
   - "Has the policyholder John Smith submitted any claims for vehicle repairs, and are there any estimates on file?"
4. Choose Send and review the system's response.
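
The same sync-and-query flow can be scripted with boto3, as in the following sketch. The knowledge base ID, data source ID, and model identifier are placeholders (some Regions require a full model ARN or inference profile); the portal remains the supported path in this solution.

```python
import boto3

agent_client = boto3.client("bedrock-agent", region_name="us-west-2")
runtime = boto3.client("bedrock-agent-runtime", region_name="us-west-2")

# Equivalent of the portal's Sync Knowledge Base button: index newly
# processed files and metadata from the data source.
agent_client.start_ingestion_job(
    knowledgeBaseId="KB_ID", dataSourceId="DS_ID"  # placeholders
)

# After the sync completes, ask a cross-document question.
answer = runtime.retrieve_and_generate(
    input={"text": "What is the nature of Jane Doe's claim, and what "
                   "documents were submitted?"},
    retrieveAndGenerateConfiguration={
        "type": "KNOWLEDGE_BASE",
        "knowledgeBaseConfiguration": {
            "knowledgeBaseId": "KB_ID",
            # Model ID shown; some Regions need a full ARN.
            "modelArn": "anthropic.claude-3-5-sonnet-20241022-v2:0",
        },
    },
)
print(answer["output"]["text"])
```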

This integration enables cross-document analysis, so you can query across multimodal data types like transcripts, images, claims document packages, repair estimates, and claim records to reveal customer 360-degree insights from your domain-aware multi-agent pipeline. By synthesizing data from multiple sources, the system can correlate information, uncover hidden patterns, and identify relationships that might not have been evident in isolated documents.

A key enabler of this intelligence is the rich metadata layer generated during preprocessing. Domain experts actively validate and refine this metadata, providing accuracy and consistency across diverse document types. By reviewing key attributes—such as claim numbers, policyholder details, and event timelines—domain experts enhance the metadata foundation, making it more reliable for downstream AI-driven analysis.

With rich metadata in place, the system can now infer relationships between documents more effectively, enabling use cases such as correlating claims across documents, checking story consistency between transcripts and claims, and building 360-degree customer views.
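
One way such scoping can work is metadata filtering at retrieval time, sketched below. The metadata key name and knowledge base ID are assumptions; the filter syntax follows the Bedrock Knowledge Bases retrieve API.

```python
import boto3

runtime = boto3.client("bedrock-agent-runtime", region_name="us-west-2")

# Retrieve only chunks whose metadata matches a given policy number,
# so answers stay scoped to one customer.
results = runtime.retrieve(
    knowledgeBaseId="KB_ID",  # placeholder
    retrievalQuery={"text": "claims filed against this policy"},
    retrievalConfiguration={
        "vectorSearchConfiguration": {
            "filter": {
                # "policy_number" is an assumed metadata key.
                "equals": {"key": "policy_number", "value": "SF9988776655"}
            }
        }
    },
)
for item in results["retrievalResults"]:
    print(item["content"]["text"][:120])
```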

By continuously improving metadata through human validation, the system becomes more adaptive, paving the way for future automation, where issue resolver agents can proactively identify and self-correct missing and inconsistent metadata with minimal manual intervention during the data ingestion process.

Clean up

To avoid unexpected charges, complete the following steps to clean up your resources:

1. Delete the contents from the S3 buckets mentioned in the outputs of the CloudFormation stack.
2. Delete the deployed stack using the AWS CloudFormation console.
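
If you prefer to script the cleanup, the following sketch empties the buckets and deletes the stack; the bucket and stack names are placeholders to replace with your stack's actual outputs.

```python
import boto3

# Empty each bucket before deleting the stack; CloudFormation cannot
# delete non-empty buckets. For versioned buckets, delete
# object_versions instead of objects.
s3 = boto3.resource("s3")
for name in ["document-upload-bucket", "sample-data-bucket",
             "knowledge-base-data-bucket"]:  # placeholder names
    s3.Bucket(name).objects.all().delete()

boto3.client("cloudformation").delete_stack(StackName="my-stack-name")
```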

Conclusion

By transforming unstructured insurance data into metadata-rich outputs, you lay the groundwork for fraud detection, advanced analytics, and 360-degree customer views across the claims processing lifecycle.

As this multi‐agent collaboration pipeline matures, specialized issue resolver agents and refined LLM prompts can further reduce human involvement—unlocking end‐to‐end automation and improved decision‐making. Ultimately, this domain‐aware approach future‐proofs your claims processing workflows by harnessing raw, unstructured data as actionable business intelligence.

To get started with this solution, take the following next steps:

1. Deploy the CloudFormation stack and experiment with the sample data.
2. Refine domain rules or agent prompts based on your team's feedback.
3. Use the metadata in your S3 data lake for advanced analytics like real-time risk assessment or fraud detection.
4. Connect an Amazon Bedrock knowledge base to KnowledgeBaseDataBucket for advanced Q&A and RAG.

With a multi‐agent architecture in place, your insurance data ceases to be a scattered liability, becoming instead a unified source of high‐value insights.


About the Author

Piyali Kamra is a seasoned enterprise architect and hands-on technologist with over two decades of experience building and executing large-scale enterprise IT projects across geographies. She believes that building large-scale enterprise systems is not an exact science but more of an art: you can't always choose the best technology that comes to mind; instead, tools and technologies must be carefully selected based on the team's culture, strengths, weaknesses, and risks, in tandem with a futuristic vision of how you want to shape your product a few years down the road.
