Build an AI-powered document processing platform with open source NER model and LLM on Amazon SageMaker

Archival data in research institutions and national laboratories represents a vast repository of historical knowledge, yet much of it remains inaccessible due to factors like limited metadata and inconsistent labeling. Traditional keyword-based search mechanisms are often insufficient for locating relevant documents efficiently, requiring extensive manual review to extract meaningful insights.

To address these challenges, a U.S. National Laboratory has implemented an AI-driven document processing platform that integrates named entity recognition (NER) and large language models (LLMs) on Amazon SageMaker AI. This solution improves the findability and accessibility of archival records by automating metadata enrichment, document classification, and summarization. By using Mixtral-8x7B for abstractive summarization and title generation, alongside a BERT-based NER model for structured metadata extraction, the system significantly improves the organization and retrieval of scanned documents.

Designed with a serverless, cost-optimized architecture, the platform provisions SageMaker endpoints dynamically, providing efficient resource utilization while maintaining scalability. The integration of modern natural language processing (NLP) and LLM technologies enhances metadata accuracy, enabling more precise search functionality and streamlined document management. This approach supports the broader goal of digital transformation, making sure that archival data can be effectively used for research, policy development, and institutional knowledge retention.

In this post, we discuss how you can build an AI-powered document processing platform with open source NER and LLMs on SageMaker.

Solution overview

The NER & LLM Gen AI Application is a document processing solution built on AWS that combines NER and LLMs to automate document analysis at scale. The system addresses the challenges of processing large volumes of textual data by using two key models: Mixtral-8x7B for text generation and summarization, and a BERT NER model for entity recognition.

The following diagram illustrates the solution architecture.

The architecture implements a serverless design with dynamically managed SageMaker endpoints that are created on demand and destroyed after use, optimizing performance and cost-efficiency. The application follows a modular structure with distinct components handling different aspects of document processing, including extractive summarization, abstractive summarization, title generation, and author extraction. These modular components can be removed, replaced, duplicated, or used as patterns for new capabilities, maximizing reusability.

The processing workflow begins when documents are detected in the Extracts Bucket, triggering a comparison against existing processed files to prevent redundant operations. The system then orchestrates the creation of the necessary model endpoints, processes documents in batches for efficiency, and automatically cleans up resources upon completion. Multiple specialized Amazon Simple Storage Service (Amazon S3) buckets store the different types of outputs.


Solution components

Storage architecture

The application uses a multi-bucket Amazon S3 storage architecture designed for clarity, efficient processing tracking, and clear separation of document processing stages. Each bucket serves a specific purpose in the pipeline, providing organized data management and simplified access control. Amazon DynamoDB is used to track the processing of each document.

The bucket types are as follows: the Extracts Bucket holds the source documents that trigger processing, and separate output buckets store the generated extractive summaries, abstractive summaries, titles, and extracted author information.
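The processing-status tracking in DynamoDB can be sketched as follows. This is a minimal illustration, assuming a table keyed on the document's S3 object key; the actual table name and schema are defined by the solution's infrastructure code.

```python
import boto3

dynamodb = boto3.resource("dynamodb")
# Hypothetical table name and key schema, for illustration only.
table = dynamodb.Table("document-processing-status")

def already_processed(document_key: str) -> bool:
    """Check whether a document from the Extracts Bucket was already processed."""
    response = table.get_item(Key={"document_key": document_key})
    return "Item" in response

def mark_processed(document_key: str, stage: str) -> None:
    """Record completion of a processing stage for a document."""
    table.put_item(Item={"document_key": document_key, "stage": stage})
```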

SageMaker endpoints

The SageMaker endpoints in this application represent a dynamic, cost-optimized approach to machine learning (ML) model deployment. Rather than maintaining constantly running endpoints, the system creates them on demand when document processing begins and automatically stops them upon completion. Two primary endpoints are managed: one for the Mixtral-8x7B LLM, which handles text generation tasks including abstractive summarization and title generation, and another for the BERT-based NER model responsible for author extraction. This endpoint-based architecture decouples model inference from the rest of the processing pipeline, allowing independent scaling, versioning, and maintenance of each component. The decoupled nature of the endpoints also provides the flexibility to update or replace individual models without impacting the broader system architecture.

The endpoint lifecycle is orchestrated through dedicated AWS Lambda functions that handle creation and deletion. When processing is triggered, endpoints are automatically initialized and model artifacts are downloaded from Amazon S3. The LLM endpoint is provisioned on ml.p4d.24xlarge (GPU) instances to provide sufficient computational power for the LLM operations. The NER endpoint is deployed on an ml.c5.9xlarge (CPU) instance, which is sufficient to support this language model. To maximize cost-efficiency, the system processes documents in batches while the endpoints are active, allowing multiple documents to be processed during a single endpoint deployment cycle and maximizing endpoint utilization.
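A minimal sketch of these lifecycle handlers with the boto3 SageMaker client follows. The endpoint, model, and configuration names are hypothetical placeholders, and the model is assumed to be registered in SageMaker already:

```python
import boto3

sm = boto3.client("sagemaker")

LLM_ENDPOINT = "mixtral-8x7b-endpoint"  # hypothetical name

def create_llm_endpoint(event, context):
    """Lambda handler sketch: provision the Mixtral-8x7B endpoint on demand."""
    sm.create_endpoint_config(
        EndpointConfigName=f"{LLM_ENDPOINT}-config",
        ProductionVariants=[{
            "VariantName": "AllTraffic",
            "ModelName": "mixtral-8x7b-model",  # assumed to exist already
            "InstanceType": "ml.p4d.24xlarge",
            "InitialInstanceCount": 1,
        }],
    )
    sm.create_endpoint(
        EndpointName=LLM_ENDPOINT,
        EndpointConfigName=f"{LLM_ENDPOINT}-config",
    )

def delete_llm_endpoint(event, context):
    """Lambda handler sketch: tear down the endpoint after the batch completes."""
    sm.delete_endpoint(EndpointName=LLM_ENDPOINT)
    sm.delete_endpoint_config(EndpointConfigName=f"{LLM_ENDPOINT}-config")
```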

For usage awareness, the endpoint management system includes notification mechanisms through Amazon Simple Notification Service (Amazon SNS). Users receive notifications when endpoints are destroyed, confirming that large instances have been shut down and are not idling. The entire endpoint lifecycle is integrated into the broader workflow through AWS Step Functions, providing coordinated processing across all components of the application.

Step Functions workflow

The following figure illustrates the Step Functions workflow.

The application implements a processing pipeline through AWS Step Functions, orchestrating a series of Lambda functions that handle distinct aspects of document analysis. Multiple documents are processed in batches while endpoints are active, maximizing resource utilization. When processing is complete, the workflow automatically triggers endpoint deletion, preventing unnecessary resource consumption.
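A pared-down sketch of such a state machine, written in Amazon States Language and expressed here as a Python dictionary, might look like the following. The state names and Lambda ARNs are placeholders; the real workflow includes more stages plus error handling and batching logic.

```python
import json

definition = {
    "StartAt": "CreateEndpoints",
    "States": {
        "CreateEndpoints": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:123456789012:function:create-endpoints",
            "Next": "ProcessDocumentBatch",
        },
        "ProcessDocumentBatch": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:123456789012:function:process-documents",
            "Next": "DeleteEndpoints",
        },
        "DeleteEndpoints": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:123456789012:function:delete-endpoints",
            "End": True,
        },
    },
}
print(json.dumps(definition, indent=2))
```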

The highly modular Lambda functions are designed for flexibility and extensibility, enabling their adaptation to diverse use cases beyond their default implementations. For example, the abstractive summarization function can be reused for question answering or other forms of text generation, and the NER model can be extended to recognize other entity types such as organizations or locations.

Logical flow

The document processing workflow orchestrates multiple stages of analysis that operate in both parallel and sequential patterns. The Step Functions state machine coordinates the movement of documents through extractive summarization, abstractive summarization, title generation, and author extraction processes. Each stage is managed as a discrete step, with clear input and output specifications, as illustrated in the following figure.

In the following sections, we look at each step of the logical flow in more detail.

Extractive summarization

The extractive summarization process employs the TextRank algorithm, powered by sumy and NLTK libraries, to identify and extract the most significant sentences from source documents. This approach treats sentences as nodes within a graph structure, where the importance of each sentence is determined by its relationships and connections to other sentences. The algorithm analyzes these interconnections to identify key sentences that best represent the document’s core content, functioning similarly to how an editor would select the most important passages from a text. This method preserves the original wording while reducing the document to its most essential components.
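A minimal sketch of this step with the sumy and NLTK libraries the post names (the sentence count and other parameters are assumptions, since the solution's code isn't reproduced here):

```python
import nltk
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.text_rank import TextRankSummarizer

nltk.download("punkt", quiet=True)  # tokenizer data that sumy relies on

def extractive_summary(text: str, sentence_count: int = 10) -> str:
    """Rank sentences as graph nodes and keep the top-ranked ones verbatim."""
    parser = PlaintextParser.from_string(text, Tokenizer("english"))
    summarizer = TextRankSummarizer()
    top_sentences = summarizer(parser.document, sentence_count)
    return " ".join(str(sentence) for sentence in top_sentences)
```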

Generate title

The title generation process uses the Mixtral-8x7B model, focusing on creating concise, descriptive titles that capture the document's main theme. It takes the extractive summary as input, which improves efficiency and keeps the model focused on key content. The LLM is prompted to analyze the main topics and themes present in the summary and generate an appropriate title that effectively represents the document's content. This approach makes sure that generated titles are both relevant and informative, giving users a quick understanding of the document's subject matter without needing to read the full text.
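A hedged sketch of this call against the SageMaker endpoint follows; the endpoint name, payload schema, and response shape assume a Hugging Face text-generation container and are not taken from the post:

```python
import json
import boto3

runtime = boto3.client("sagemaker-runtime")

def generate_title(extractive_summary: str) -> str:
    """Prompt the Mixtral-8x7B endpoint for a concise, descriptive title."""
    prompt = (
        "Analyze the main topics and themes in the following summary and "
        "generate a concise, descriptive title for the document.\n\n"
        f"{extractive_summary}\n\nTitle:"
    )
    response = runtime.invoke_endpoint(
        EndpointName="mixtral-8x7b-endpoint",  # hypothetical endpoint name
        ContentType="application/json",
        Body=json.dumps({"inputs": prompt, "parameters": {"max_new_tokens": 32}}),
    )
    result = json.loads(response["Body"].read())
    return result[0]["generated_text"].strip()  # assumed response shape
```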

Abstractive summarization

Abstractive summarization also uses the Mixtral-8x7B LLM to generate entirely new text that captures the essence of the document. Unlike extractive summarization, this method doesn’t simply select existing sentences, but creates new content that paraphrases and restructures the information. The process takes the extractive summary as input, which helps reduce computation time and costs by focusing on the most relevant content. This approach results in summaries that read more naturally and can effectively condense complex information into concise, readable text.
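Because the same endpoint serves both tasks, only the prompt changes from the title-generation sketch above. A possible prompt template (an assumption; the solution's exact wording isn't published):

```python
def build_summary_prompt(extractive_summary: str) -> str:
    """Prompt sketch for abstractive summarization over the extractive summary."""
    return (
        "Rewrite the following key sentences from a document as a fluent, "
        "concise summary that paraphrases and restructures the information.\n\n"
        f"{extractive_summary}\n\nSummary:"
    )
```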

Extract author

Author extraction employs a BERT NER model to identify and classify author names within documents. The process specifically focuses on the first 1,500 characters of each document, where author information typically appears. The system follows a three-stage process: first, it detects potential name tokens with confidence scoring; second, it assembles related tokens into complete names; and finally, it validates the assembled names to provide proper formatting and eliminate false positives. The model can recognize various entity types (PER, ORG, LOC, MISC) but is specifically tuned to identify person names in the context of document authorship.
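In the solution this model runs behind the SageMaker NER endpoint; for illustration, a local sketch with the Hugging Face transformers library might look like the following (the specific checkpoint is an assumption, since the post doesn't name it, but dslim/bert-base-NER is a public BERT model fine-tuned for the same PER/ORG/LOC/MISC tags):

```python
from transformers import pipeline

# Aggregation merges word-piece tokens back into complete names.
ner = pipeline("ner", model="dslim/bert-base-NER", aggregation_strategy="simple")

def extract_authors(text: str, min_score: float = 0.9) -> list[str]:
    """Return high-confidence person entities from the first 1,500 characters,
    the region where author information typically appears."""
    entities = ner(text[:1500])
    return [
        entity["word"]
        for entity in entities
        if entity["entity_group"] == "PER" and entity["score"] >= min_score
    ]
```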

Cost and performance

The solution achieves remarkable throughput, processing 100,000 documents within a 12-hour window. Key architectural decisions drive both performance and cost optimization. By implementing extractive summarization as an initial step, the system reduces input tokens by 75-90% (depending on the size of the document), substantially decreasing the workload for downstream LLM processing. The use of a dedicated NER model for author extraction yields an additional 33% reduction in LLM calls by bypassing the more resource-intensive language model. Together, these optimizations accelerate processing while reducing operational costs, establishing the platform as an efficient, cost-effective solution for enterprise-scale document processing. To estimate the cost of processing 100,000 documents, multiply 12 by the hourly cost of the ml.p4d.24xlarge instance in your AWS Region. Instance costs vary by Region and may change over time, so consult current pricing for accurate projections.
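As a worked example with a purely hypothetical rate: if ml.p4d.24xlarge costs $30 per hour in your Region, the LLM endpoint time for a 100,000-document run comes to roughly 12 hours × $30/hour = $360, plus the comparatively small charges for the CPU-based NER endpoint and the serverless components.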

Deploy the solution

To deploy the solution, follow the instructions in the GitHub repo.

Clean up

Cleanup instructions can be found in the GitHub repo.

Conclusion

The NER & LLM Gen AI Application represents a significant advancement in automated document processing, using powerful language models in an efficient serverless architecture. Through its implementation of extractive and abstractive summarization, named entity recognition, and title generation, the system demonstrates the practical application of modern AI technologies to complex document analysis tasks. The application's modular design and flexible architecture enable organizations to adapt and extend its capabilities to meet their specific needs, while the careful management of AWS resources through dynamic endpoint creation and deletion maintains cost-effectiveness. As organizations continue to face growing demands for efficient document processing, this solution provides a scalable, maintainable, and customizable framework for automating and streamlining these workflows.

About the Authors

Nick Biso is a Machine Learning Engineer at AWS Professional Services. He solves complex organizational and technical challenges using data science and engineering. In addition, he builds and deploys AI/ML models on the AWS Cloud. His passion extends to his proclivity for travel and diverse cultural experiences.

Dr. Ian Lunsford is an Aerospace Cloud Consultant at AWS Professional Services. He integrates cloud services into aerospace applications. Additionally, Ian focuses on building AI/ML solutions using AWS services.

Max Rathmann is a Senior DevOps Consultant at Amazon Web Services, where she specializes in architecting cloud-native, serverless applications. She has a background in operationalizing AI/ML solutions and designing MLOps solutions with AWS services.

Michael Massey is a Cloud Application Architect at Amazon Web Services, where he specializes in building frontend and backend cloud-native applications. He designs and implements scalable and highly available solutions and architectures that help customers achieve their business goals.

Jeff Ryan is a DevOps Consultant at AWS Professional Services, specializing in AI/ML, automation, and cloud security implementations. He focuses on helping organizations leverage AWS services such as Amazon Bedrock, Amazon Q, and SageMaker to build innovative solutions. His expertise spans MLOps, generative AI, serverless architectures, and infrastructure as code (IaC).

Dr. Brian Weston is a research manager at the Center for Applied Scientific Computing, where he is the AI/ML Lead for the Digital Twins for Additive Manufacturing Strategic Initiative, a project focused on building digital twins for certification and qualification of 3D printed components. He also holds a program liaison role between scientists and IT staff, where Weston champions the integration of cloud computing with digital engineering transformation, driving efficiency and innovation for mission science projects at the laboratory.

Ian Thompson is a Data Engineer at Enterprise Knowledge, specializing in graph application development and data catalog solutions. His experience includes designing and implementing graph architectures that improve data discovery and analytics across organizations. He is also the #1 Square Off player in the world.

Anna D’Angela is a Data Engineer at Enterprise Knowledge within the Semantic Engineering and Enterprise AI practice. She specializes in the design and implementation of knowledge graphs.
