AWS Machine Learning Blog, August 9, 2024
How Deltek uses Amazon Bedrock for question and answering on government solicitation documents

This post describes a custom Retrieval Augmented Generation (RAG) solution that the AWS Generative AI Innovation Center (GenAIIC) developed for Deltek to enable question answering over single and multiple government solicitation documents. The solution uses AWS services including Amazon Textract, Amazon OpenSearch Service, and Amazon Bedrock.

📑 **Data ingestion**: The solution first preprocesses the PDF documents, using Amazon Textract to extract text and tables and a text embedding model to generate an embedding vector for each text chunk. The vectors, together with related metadata, are indexed in OpenSearch Service.

📊 **Question answering**: The user asks a question about the ingested documents, and the application retrieves an embedding representation of the question. A semantic search finds the relevant text chunks (also called context) from the documents, and the application then passes the retrieved data and the query to Amazon Bedrock to generate a natural language response.

💻 **Text embedding models**: Text embedding models map words or phrases from text to dense vector representations. They are used for document embedding (encoding document content and mapping it into the embedding space) and query embedding (embedding the user query into a vector so it can be matched against document chunks through semantic search).

📝 **Amazon Bedrock**: Amazon Bedrock provides ready-to-use foundation models (FMs) from top AI companies such as AI21 Labs, Anthropic, Cohere, Meta, Stability AI, and Amazon. It offers a single interface to access these models and build generative AI applications while maintaining privacy and security.

📗 **OpenSearch Service**: OpenSearch is an open source, distributed search and analytics suite derived from Elasticsearch. It uses a vector database structure to efficiently store and query large volumes of data.

📖 **Amazon Textract**: Amazon Textract converts PDFs, PNGs, JPEGs, and TIFFs into machine-readable text. It also formats complex structures such as tables for easier analysis.

📕 **Temporal aspects handling**: When using question answering on documents that evolve over time, the chronological order of the documents must be considered, especially when the question concerns a concept that has changed over time.

📔 **Document layout and structure**: Some classes of documents follow a standard layout or format, and that structure can be used to optimize data ingestion. For example, RFP documents tend to have a specific layout with defined sections.

📓 **Text chunking**: For long documents, the extracted text may exceed the LLM's input size limit. In such cases, the text can be divided into smaller, overlapping chunks.

📒 **Table parsing**: To help the language model answer questions about tables, a parser was created to convert tables in the Amazon Textract output into CSV format.

📑 **Document metadata**: In addition to the embedding vectors, the text chunks and document metadata (such as document name, document section name, or document release date) are added to the index as text fields.

This post is co-written by Kevin Plexico and Shakun Vohra from Deltek.

Question and answering (Q&A) using documents is a commonly used application in various use cases like customer support chatbots, legal research assistants, and healthcare advisors. Retrieval Augmented Generation (RAG) has emerged as a leading method for using the power of large language models (LLMs) to interact with documents in natural language.

This post provides an overview of a custom solution developed by the AWS Generative AI Innovation Center (GenAIIC) for Deltek, a globally recognized standard for project-based businesses in both government contracting and professional services. Deltek serves over 30,000 clients with industry-specific software and information solutions.

In this collaboration, the AWS GenAIIC team created a RAG-based solution for Deltek to enable Q&A on single and multiple government solicitation documents. The solution uses AWS services including Amazon Textract, Amazon OpenSearch Service, and Amazon Bedrock. Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models (FMs) and LLMs from leading artificial intelligence (AI) companies like AI21 Labs, Anthropic, Cohere, Meta, Stability AI, and Amazon through a single API, along with a broad set of capabilities to build generative AI applications with security, privacy, and responsible AI.

Deltek is continuously working on enhancing this solution to better align it with their specific requirements, such as supporting file formats beyond PDF and implementing more cost-effective approaches for their data ingestion pipeline.

What is RAG?

RAG is a process that optimizes the output of LLMs by allowing them to reference authoritative knowledge bases outside of their training data sources before generating a response. This approach addresses some of the challenges associated with LLMs, such as presenting false, outdated, or generic information, or creating inaccurate responses due to terminology confusion. RAG enables LLMs to generate more relevant, accurate, and contextual responses by cross-referencing an organization’s internal knowledge base or specific domains, without the need to retrain the model. It provides organizations with greater control over the generated text output and offers users insights into how the LLM generates the response, making it a cost-effective approach to improve the capabilities of LLMs in various contexts.

The main challenge

Applying RAG for Q&A on a single document is straightforward, but applying the same across multiple related documents poses some unique challenges. For example, when using question answering on documents that evolve over time, it is essential to consider the chronological sequence of the documents if the question is about a concept that has transformed over time. Not considering the order could result in providing an answer that was accurate at a past point but is now outdated based on more recent information across the collection of temporally aligned documents. Properly handling temporal aspects is a key challenge when extending question answering from single documents to sets of interlinked documents that progress over the course of time.

Solution overview

As an example use case, we describe Q&A on two temporally related documents: a long draft request-for-proposal (RFP) document, and a related subsequent government response to a request-for-information (RFI response), providing additional and revised information.

The solution develops a RAG approach in two steps.

The first step is data ingestion, as shown in the following diagram. This includes a one-time processing of PDF documents. The application component here is a user interface with minor processing such as splitting text and calling the services in the background. The steps are as follows:

    1. The user uploads documents to the application.
    2. The application uses Amazon Textract to get the text and tables from the input documents.
    3. The text embedding model processes the text chunks and generates embedding vectors for each text chunk.
    4. The embedding representations of the text chunks, along with related metadata, are indexed in OpenSearch Service.

The second step is Q&A, as shown in the following diagram. In this step, the user asks a question about the ingested documents and expects a response in natural language. The application component here is a user interface with minor processing such as calling different services in the background. The steps are as follows:

    1. The user asks a question about the documents.
    2. The application retrieves an embedding representation of the input question. The embedding vector maps the question from text to a space of numeric representations.
    3. The model performs a semantic search to find relevant text chunks from the documents (also called context).
    4. The application passes the retrieved data from OpenSearch Service and the query to Amazon Bedrock. The question and context are combined and fed as a prompt to the LLM, which generates a natural language response to the user's question.

We used Amazon Textract in our solution, which can convert PDFs, PNGs, JPEGs, and TIFFs into machine-readable text. It also formats complex structures like tables for easier analysis. In the following sections, we provide an example to demonstrate Amazon Textract’s capabilities.
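
As a concrete illustration, the following is a minimal sketch of how a multi-page PDF stored in Amazon S3 could be analyzed with the boto3 Textract client; the bucket, key, and helper name are placeholders for this example, not part of Deltek's pipeline.

import time
import boto3

textract = boto3.client("textract")

def extract_blocks(bucket: str, key: str) -> list[dict]:
    """Run asynchronous Textract analysis on a multi-page PDF in S3 and return all blocks."""
    job = textract.start_document_analysis(
        DocumentLocation={"S3Object": {"Bucket": bucket, "Name": key}},
        FeatureTypes=["TABLES"],  # detect table structure in addition to raw text
    )
    job_id = job["JobId"]

    # Poll until the asynchronous job finishes
    while True:
        result = textract.get_document_analysis(JobId=job_id)
        if result["JobStatus"] in ("SUCCEEDED", "FAILED"):
            break
        time.sleep(5)
    if result["JobStatus"] == "FAILED":
        raise RuntimeError("Textract analysis failed")

    # Collect blocks across paginated responses
    blocks = result["Blocks"]
    while "NextToken" in result:
        result = textract.get_document_analysis(JobId=job_id, NextToken=result["NextToken"])
        blocks.extend(result["Blocks"])
    return blocks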

OpenSearch is an open source, distributed search and analytics suite derived from Elasticsearch. It uses a vector database structure to efficiently store and query large volumes of data. OpenSearch Service currently has tens of thousands of active customers with hundreds of thousands of clusters under management, processing hundreds of trillions of requests per month. We used OpenSearch Service and its underlying vector database to store the embedding vector and metadata for each text chunk and to retrieve the most relevant chunks at query time.

The vector database inside OpenSearch Service enabled efficient storage and fast retrieval of related data chunks to power our question answering system. By modeling documents as vectors, we could find relevant passages even without explicit keyword matches.
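
As an illustration of how such an index could be defined, the following is a minimal sketch using the opensearch-py client; the domain endpoint, credentials, index name, and field mapping are assumptions for this example, with the vector dimension matching the Titan embedding model described next.

from opensearchpy import OpenSearch, RequestsHttpConnection

client = OpenSearch(
    hosts=[{"host": "<domain-endpoint>", "port": 443}],
    http_auth=("<user>", "<password>"),  # or SigV4 signing for an Amazon OpenSearch Service domain
    use_ssl=True,
    connection_class=RequestsHttpConnection,
)

index_definition = {
    "settings": {"index": {"knn": True}},  # enable k-NN vector search on this index
    "mappings": {
        "properties": {
            "embedding_vector": {"type": "knn_vector", "dimension": 1536},
            "text_chunk": {"type": "text"},
            "document_name": {"type": "keyword"},
            "section_name": {"type": "keyword"},
            "release_date": {"type": "keyword"},
        }
    },
}

client.indices.create(index="solicitation-docs", body=index_definition)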

Text embedding models are machine learning (ML) models that map words or phrases from text to dense vector representations. Text embeddings are commonly used in information retrieval systems like RAG for two purposes: document embedding, which encodes the document content and maps it into the embedding space during ingestion, and query embedding, which embeds the user query into a vector so it can be matched against document chunks through semantic search.

For this post, we used the Amazon Titan model, Amazon Titan Embeddings G1 – Text v1.2, which accepts up to 8,000 tokens and outputs a numerical vector of 1,536 dimensions. The model is available through Amazon Bedrock.
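
The following minimal sketch shows how an embedding could be retrieved from this model through the Bedrock runtime API; the embed_text helper name is illustrative and not part of the original solution.

import json
import boto3

bedrock_runtime = boto3.client("bedrock-runtime")

def embed_text(text: str) -> list[float]:
    """Return the 1,536-dimensional Titan embedding for a text chunk or query."""
    response = bedrock_runtime.invoke_model(
        modelId="amazon.titan-embed-text-v1",  # Amazon Titan Embeddings G1 - Text
        body=json.dumps({"inputText": text}),
        contentType="application/json",
        accept="application/json",
    )
    payload = json.loads(response["body"].read())
    return payload["embedding"]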

Amazon Bedrock provides ready-to-use FMs from top AI companies like AI21 Labs, Anthropic, Cohere, Meta, Stability AI, and Amazon. It offers a single interface to access these models and build generative AI applications while maintaining privacy and security. We used Anthropic Claude v2 on Amazon Bedrock to generate natural language answers given a question and a context.

In the following sections, we look at the two stages of the solution in more detail.

Data ingestion

First, the draft RFP and RFI response documents are processed to be used at the Q&A time. Data ingestion includes the following steps:

    1. Documents are passed to Amazon Textract to be converted into text. To better enable our language model to answer questions about tables, we created a parser that converts tables from the Amazon Textract output into CSV format; transforming tables into CSV improves the model's comprehension (a minimal sketch of such a parser appears after the index body snippet at the end of this list). For instance, the following figures show part of an RFI response document in PDF format, followed by its corresponding extracted text. In the extracted text, the table has been converted to CSV format and sits among the rest of the text.
    2. For long documents, the extracted text may exceed the LLM's input size limitation. In these cases, we can divide the text into smaller, overlapping chunks. The chunk sizes and overlap proportions may vary depending on the use case. We apply section-aware chunking (chunking is performed independently on each document section), which we discuss in our example use case later in this post.
    3. Some classes of documents may follow a standard layout or format. This structure can be used to optimize data ingestion. For example, RFP documents tend to have a certain layout with defined sections. Using the layout, each document section can be processed independently. Also, if a table of contents exists but is not relevant, it can potentially be removed. We provide a demonstration of detecting and using document structure later in this post.
    4. The embedding vector for each text chunk is retrieved from an embedding model.
    5. At the last step, the embedding vectors are indexed into an OpenSearch Service database. In addition to the embedding vector, the text chunk and document metadata such as document name, document section name, or document release date are also added to the index as text fields. The document release date is useful metadata when documents are related chronologically, so that the LLM can identify the most up-to-date information. The following code snippet shows the index body:
index_body = {
    "embedding_vector": <embedding vector of a text chunk>,
    "text_chunk": <text chunk>,
    "document_name": <document name>,
    "section_name": <document section name>,
    "release_date": <document release date>,
    # more metadata can be added
}
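
Deltek's production table parser is not shown in the post; the following is a minimal sketch of how TABLE and CELL blocks in a Textract response could be turned into CSV, assuming a list of blocks such as the one returned by the extraction sketch earlier.

import csv
import io

def textract_tables_to_csv(blocks: list[dict]) -> list[str]:
    """Convert TABLE blocks from a Textract response into CSV strings (one per table)."""
    block_by_id = {b["Id"]: b for b in blocks}
    csv_tables = []

    for table in (b for b in blocks if b["BlockType"] == "TABLE"):
        # Collect the table's CELL blocks through CHILD relationships
        cell_ids = [
            cid
            for rel in table.get("Relationships", [])
            if rel["Type"] == "CHILD"
            for cid in rel["Ids"]
        ]
        cells = [block_by_id[cid] for cid in cell_ids if block_by_id[cid]["BlockType"] == "CELL"]
        if not cells:
            continue

        # Build a row/column grid from the cell indices
        n_rows = max(c["RowIndex"] for c in cells)
        n_cols = max(c["ColumnIndex"] for c in cells)
        grid = [["" for _ in range(n_cols)] for _ in range(n_rows)]
        for cell in cells:
            words = [
                block_by_id[cid]["Text"]
                for rel in cell.get("Relationships", [])
                if rel["Type"] == "CHILD"
                for cid in rel["Ids"]
                if block_by_id[cid]["BlockType"] == "WORD"
            ]
            grid[cell["RowIndex"] - 1][cell["ColumnIndex"] - 1] = " ".join(words)

        # Serialize the grid as CSV text that can sit among the extracted prose
        buffer = io.StringIO()
        csv.writer(buffer).writerows(grid)
        csv_tables.append(buffer.getvalue())

    return csv_tables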

Q&A

In the Q&A phase, users can submit a natural language question about the draft RFP and RFI response documents ingested in the previous step. First, semantic search is used to retrieve relevant text chunks for the user's question. Then, the question is augmented with the retrieved context to create a prompt. Finally, the prompt is sent to Amazon Bedrock for an LLM to generate a natural language response. The detailed steps are as follows:

    An embedding representation of the input question is retrieved from the Amazon Titan embedding model on Amazon Bedrock. The question’s embedding vector is used to perform semantic search on OpenSearch Service and find the top K relevant text chunks. The following is an example of a search body passed to OpenSearch Service. For more details see the OpenSearch documentation on structuring a search query.
search_body = {
    "size": top_K,
    "query": {
        "script_score": {
            "query": {
                "match_all": {},  # skip full text search
            },
            "script": {
                "lang": "knn",
                "source": "knn_score",
                "params": {
                    "field": "embedding_vector",  # matches the field name in index_body above
                    "query_value": question_embedding,
                    "space_type": "cosinesimil"
                }
            }
        }
    }
}
    Any retrieved metadata, such as section name or document release date, is used to enrich the text chunks and provide more information to the LLM, such as the following:
def opensearch_result_to_context(os_res: dict) -> str:
    """
    Convert OpenSearch results to context for the LLM prompt.

    Args:
        os_res (dict): Amazon OpenSearch results

    Returns:
        context (str): Context to be included in the LLM's prompt
    """
    data = os_res["hits"]["hits"]
    context = []
    for item in data:
        text = item["_source"]["text_chunk"]
        doc_name = item["_source"]["document_name"]
        section_name = item["_source"]["section_name"]
        release_date = item["_source"]["release_date"]
        context.append(
            f"<<Context>>: [Document name: {doc_name}, Section name: {section_name}, Release date: {release_date}] {text}"
        )
    context = "\n \n ------ \n \n".join(context)
    return context
    The input question is combined with retrieved context to create a prompt. In some cases, depending on the complexity or specificity of the question, an additional chain-of-thought (CoT) prompt may need to be added to the initial prompt in order to provide further clarification and guidance to the LLM. The CoT prompt is designed to walk the LLM through the logical steps of reasoning and thinking that are required to properly understand the question and formulate a response. It lays out a type of internal monologue or cognitive path for the LLM to follow in order to comprehend the key information within the question, determine what kind of response is needed, and construct that response in an appropriate and accurate way. We use the following CoT prompt for this use case:
"""Context below includes a few paragraphs from draft RFP and RFI response documents:Context: {context}Question: {question}Think step by step:1- Find all the paragraphs in the context that are relevant to the question.2- Sort the paragraphs by release date.3- Use the paragraphs to answer the question.Note: Pay attention to the updated information based on the release dates."""
    The prompt is passed to an LLM on Amazon Bedrock to generate a response in natural language. We use the following inference configuration for the Anthropic Claude v2 model on Amazon Bedrock. The temperature parameter is usually set to zero for reproducibility and also to prevent LLM hallucination. For regular RAG applications, top_k and top_p are usually set to 250 and 1, respectively. Set max_tokens_to_sample to the maximum number of tokens expected to be generated (1 token is approximately 3/4 of a word). See Inference parameters for more details.
{    "temperature": 0,    "top_k": 250,    "top_p": 1,    "max_tokens_to_sample": 300,    "stop_sequences": [“\n\nHuman:\n\n”]}

Example use case

As a demonstration, we describe an example of Q&A on two related documents: a draft RFP document in PDF format with 167 pages, and an RFI response document in PDF format with 6 pages released later, which includes additional information and updates to the draft RFP.

The following is an example question asking if the project size requirements have changed, given the draft RFP and RFI response documents:

Have the original scoring evaluations changed? If yes, what are the new project sizes?

The following figure shows the relevant sections of the draft RFP document that contain the answers.

The following figure shows the relevant sections of the RFI response document that contain the answers.

For the LLM to generate the correct response, the retrieved context from OpenSearch Service should contain the tables shown in the preceding figures, and the LLM should be able to infer the order of the retrieved contents from metadata, such as release dates, and generate a readable response in natural language.

The following are the data ingestion steps:

    1. The draft RFP and RFI response documents are passed to Amazon Textract to extract the text and tables as content. Additionally, we used regular expressions to identify document sections and the table of contents (see the following figures). The table of contents can be removed for this use case because it doesn't contain any relevant information.
    2. We split each document section independently into smaller chunks with some overlap. For this use case, we used a chunk size of 500 tokens with an overlap of 100 tokens (1 token is approximately 3/4 of a word). We used a BPE tokenizer, where each token corresponds to about 4 bytes; a chunking sketch follows this list.
    3. An embedding representation of each text chunk is obtained using the Amazon Titan Embeddings G1 – Text v1.2 model on Amazon Bedrock.
    4. Each text chunk is stored in an OpenSearch Service index along with metadata such as section name and document release date.
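
The post does not name the exact BPE tokenizer used; the following sketch uses tiktoken's cl100k_base encoding as one possible stand-in to illustrate splitting a section into 500-token chunks with a 100-token overlap.

import tiktoken

tokenizer = tiktoken.get_encoding("cl100k_base")  # a BPE tokenizer; the original tokenizer is not specified

def chunk_section(section_text: str, chunk_size: int = 500, overlap: int = 100) -> list[str]:
    """Split one document section into overlapping chunks of roughly chunk_size tokens."""
    tokens = tokenizer.encode(section_text)
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(tokens), step):
        window = tokens[start : start + chunk_size]
        chunks.append(tokenizer.decode(window))
        if start + chunk_size >= len(tokens):
            break
    return chunks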

The Q&A steps are as follows:

    1. The input question is first transformed into a numeric vector using the embedding model. The vector representation is used for semantic search and retrieval of relevant context in the next step.
    2. The top K relevant text chunks and their metadata are retrieved from OpenSearch Service.
    3. The opensearch_result_to_context function and the prompt template (defined earlier) are used to create the prompt from the input question and retrieved context.
    4. The prompt is sent to the LLM on Amazon Bedrock to generate a response in natural language.

Anthropic Claude v2 generated a response that matched the information presented in the draft RFP and RFI response documents. The question was "Have the original scoring evaluations changed? If yes, what are the new project sizes?" Using CoT prompting, the model can correctly answer the question. A minimal end-to-end sketch that ties these steps together follows.
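
The following end-to-end sketch reuses the prompt template and the opensearch_result_to_context function shown earlier; embed_text, generate_answer, the client object, and the index name are the illustrative helpers and placeholders from the earlier sketches, not Deltek's actual code.

PROMPT_TEMPLATE = """Context below includes a few paragraphs from draft RFP and RFI response documents:

Context: {context}

Question: {question}

Think step by step:
1- Find all the paragraphs in the context that are relevant to the question.
2- Sort the paragraphs by release date.
3- Use the paragraphs to answer the question.

Note: Pay attention to the updated information based on the release dates."""

def answer_question(question: str, index_name: str = "solicitation-docs", top_k: int = 5) -> str:
    """End-to-end RAG flow: embed the question, retrieve context, and prompt the LLM."""
    question_embedding = embed_text(question)
    search_body = {
        "size": top_k,
        "query": {
            "script_score": {
                "query": {"match_all": {}},
                "script": {
                    "lang": "knn",
                    "source": "knn_score",
                    "params": {
                        "field": "embedding_vector",
                        "query_value": question_embedding,
                        "space_type": "cosinesimil",
                    },
                },
            }
        },
    }
    os_res = client.search(index=index_name, body=search_body)
    context = opensearch_result_to_context(os_res)
    prompt = PROMPT_TEMPLATE.format(context=context, question=question)
    return generate_answer(prompt)

print(answer_question("Have the original scoring evaluations changed? If yes, what are the new project sizes?"))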

Key features

The solution contains the following key features:

    Document structure detection: regular expressions identify document sections and the table of contents so that each section can be chunked and processed independently.
    Table parsing: tables in the Amazon Textract output are converted into CSV format to improve the model's comprehension of tabular content.
    Metadata enrichment: document name, section name, and release date are indexed alongside each chunk so the LLM can reconcile chronologically related documents.
    Chain-of-thought prompting: a CoT prompt guides the LLM to sort retrieved paragraphs by release date and favor the most recent information.

These contributions helped improve the accuracy and capabilities of the solution for answering questions about documents. In fact, based on Deltek’s subject matter experts’ evaluations of LLM-generated responses, the solution achieved a 96% overall accuracy rate.

Conclusion

This post outlined an application of generative AI for question answering across multiple government solicitation documents. The solution discussed was a simplified presentation of a pipeline developed by the AWS GenAIIC team in collaboration with Deltek. We described an approach to enable Q&A on lengthy documents published separately over time. Using Amazon Bedrock and OpenSearch Service, this RAG architecture can scale for enterprise-level document volumes. Additionally, a prompt template was shared that uses CoT logic to guide the LLM in producing accurate responses to user questions. Although this solution is simplified, this post aimed to provide a high-level overview of a real-world generative AI solution for streamlining review of complex proposal documents and their iterations.

Deltek is actively refining and optimizing this solution to ensure it meets their unique needs. This includes expanding support for file formats other than PDF, as well as adopting more cost-efficient strategies for their data ingestion pipeline.

Learn more about prompt engineering and generative AI-powered Q&A in the Amazon Bedrock Workshop. For technical support or to contact AWS generative AI specialists, visit the GenAIIC webpage.

Resources

To learn more about Amazon Bedrock, see the following resources:

To learn more about OpenSearch Service, see the following resources:

See the following links for RAG resources on AWS:


About the Authors

Kevin Plexico is Senior Vice President of Information Solutions at Deltek, where he oversees research, analysis, and specification creation for clients in the Government Contracting and AEC industries. He leads the delivery of GovWin IQ, providing essential government market intelligence to over 5,000 clients, and manages the industry’s largest team of analysts in this sector. Kevin also heads Deltek’s Specification Solutions products, producing premier construction specification content including MasterSpec® for the AIA and SpecText.

Shakun Vohra is a distinguished technology leader with over 20 years of expertise in Software Engineering, AI/ML, Business Transformation, and Data Optimization. At Deltek, he has driven significant growth, leading diverse, high-performing teams across multiple continents. Shakun excels in aligning technology strategies with corporate goals, collaborating with executives to shape organizational direction. Renowned for his strategic vision and mentorship, he has consistently fostered the development of next-generation leaders and transformative technological solutions.

Amin Tajgardoon is an Applied Scientist at the AWS Generative AI Innovation Center. He has an extensive background in computer science and machine learning. In particular, Amin’s focus has been on deep learning and forecasting, prediction explanation methods, model drift detection, probabilistic generative models, and applications of AI in the healthcare domain.

Anila Joshi has more than a decade of experience building AI solutions. As an Applied Science Manager at AWS Generative AI Innovation Center, Anila pioneers innovative applications of AI that push the boundaries of possibility and accelerate the adoption of AWS services with customers by helping customers ideate, identify, and implement secure generative AI solutions.

Yash Shah and his team of scientists, specialists, and engineers at the AWS Generative AI Innovation Center work with some of AWS' most strategic customers, helping them realize the art of the possible with generative AI and drive business value. Yash has been with Amazon for more than 7.5 years and has worked with customers across healthcare, sports, manufacturing, and software in multiple geographic regions.

Jordan Cook is an accomplished AWS Sr. Account Manager with nearly two decades of experience in the technology industry, specializing in sales and data center strategy. Jordan leverages his extensive knowledge of Amazon Web Services and deep understanding of cloud computing to provide tailored solutions that enable businesses to optimize their cloud infrastructure, enhance operational efficiency, and drive innovation.
