AWS Machine Learning Blog, September 6, 2024
Build powerful RAG pipelines with LlamaIndex and Amazon Bedrock

This post describes how to use LlamaIndex with Amazon Bedrock to build advanced Retrieval Augmented Generation (RAG) pipelines that extend the capabilities of large language models (LLMs). By combining knowledge from external data sources with the generative power of LLMs, RAG can tackle complex tasks that require both knowledge and creativity. The post covers a simple RAG pipeline as well as advanced patterns such as router queries, sub-question queries, and agentic RAG, and shows how to use LlamaIndex and Amazon Bedrock to build production-grade RAG systems.


This post was co-written with Jerry Liu from LlamaIndex.

Retrieval Augmented Generation (RAG) has emerged as a powerful technique for enhancing the capabilities of large language models (LLMs). By combining the vast knowledge stored in external data sources with the generative power of LLMs, RAG enables you to tackle complex tasks that require both knowledge and creativity. Today, RAG techniques are used in every enterprise, small and large, where generative artificial intelligence (AI) is used as an enabler for solving document-based question answering and other types of analysis.

Although building a simple RAG system is straightforward, building production RAG systems using advanced patterns is challenging. A production RAG pipeline typically operates over a larger data volume and greater data complexity, and must meet a higher quality bar compared to a proof of concept. A broad challenge that developers face is low response quality: the RAG pipeline is not able to sufficiently answer a large number of questions. This can be due to a variety of reasons.

This necessitates more advanced RAG techniques on the query understanding, retrieval, and generation components in order to handle these failure modes.

This is where LlamaIndex comes in. LlamaIndex is an open source library with both simple and advanced techniques that enables developers to build production RAG pipelines. It provides a flexible and modular framework for building and querying document indexes, integrating with various LLMs, and implementing advanced RAG patterns.

Amazon Bedrock is a managed service providing access to high-performing foundation models (FMs) from leading AI providers through a unified API. It offers a wide range of large models to choose from, along with capabilities to securely build and customize generative AI applications. Key advanced features include model customization with fine-tuning and continued pre-training using your own data, as well as RAG to augment model outputs by retrieving context from configured knowledge bases containing your private data sources. You can also create intelligent agents that orchestrate FMs with enterprise systems and data. Other enterprise capabilities include provisioned throughput for guaranteed low-latency inference at scale, model evaluation to compare performance, and AI guardrails to implement safeguards. Amazon Bedrock abstracts away infrastructure management through a fully managed, serverless experience.

In this post, we explore how to use LlamaIndex to build advanced RAG pipelines with Amazon Bedrock. We discuss how to set up a simple RAG pipeline, router queries, sub-question queries, agentic RAG, and the LlamaCloud and LlamaParse integration.

Simple RAG pipeline

At its core, RAG involves retrieving relevant information from external data sources and using it to augment the prompts fed to an LLM. This allows the LLM to generate responses that are grounded in factual knowledge and tailored to the specific query.

For RAG workflows in Amazon Bedrock, documents from configured knowledge bases go through preprocessing, where they are split into chunks, embedded into vectors, and indexed in a vector database. This allows efficient retrieval of relevant information at runtime. When a user query comes in, the same embedding model is used to convert the query text into a vector representation. This query vector is compared against the indexed document vectors to identify the most semantically similar chunks from the knowledge base. The retrieved chunks provide additional context related to the user’s query. This contextual information is appended to the original user prompt before being passed to the FM to generate a response. By augmenting the prompt with relevant data pulled from the knowledge base, the model’s output is able to use and be informed by an organization’s proprietary information sources. This RAG process can also be orchestrated by agents, which use the FM to determine when to query the knowledge base and how to incorporate the retrieved context into the workflow.

The following diagram illustrates this workflow.
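For reference, this managed flow can also be invoked directly through the AWS SDK. The following is a minimal sketch, not taken from the post, assuming an existing knowledge base; the knowledge base ID, model ARN, and question are placeholders for your own resources.

import boto3

# Placeholders: replace with your knowledge base ID and preferred model ARN
kb_id = "<YOUR_KNOWLEDGE_BASE_ID>"
model_arn = "arn:aws:bedrock:us-east-1::foundation-model/anthropic.claude-v2"

client = boto3.client("bedrock-agent-runtime")

# Retrieve relevant chunks from the knowledge base and generate a grounded answer
response = client.retrieve_and_generate(
    input={"text": "What is the capital of France?"},
    retrieveAndGenerateConfiguration={
        "type": "KNOWLEDGE_BASE",
        "knowledgeBaseConfiguration": {
            "knowledgeBaseId": kb_id,
            "modelArn": model_arn,
        },
    },
)

print(response["output"]["text"])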

The following is a simplified example of a RAG pipeline using LlamaIndex:

from llama_index import SimpleDirectoryReader, VectorStoreIndex

# Load documents
documents = SimpleDirectoryReader("data/").load_data()

# Create a vector store index
index = VectorStoreIndex.from_documents(documents)

# Query the index
query_engine = index.as_query_engine()
response = query_engine.query("What is the capital of France?")

# Print the response
print(response)

The pipeline includes the following steps:

1. Use the SimpleDirectoryReader to load documents from the “data/” directory.
2. Create a VectorStoreIndex from the loaded documents. This type of index converts documents into numerical representations (vectors) that capture their semantic meaning.
3. Query the index with the question “What is the capital of France?” The index uses similarity measures to identify the documents most relevant to the query.
4. The retrieved documents are then used to augment the prompt for the LLM, which generates a response based on the combined information.
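Note that LlamaIndex uses its default LLM and embedding model unless told otherwise. To run this simple pipeline against Amazon Bedrock models instead, one option (a minimal sketch, assuming the llama-index Bedrock integration packages are installed) is to set them globally:

from llama_index.core import Settings
from llama_index.llms.bedrock import Bedrock
from llama_index.embeddings.bedrock import BedrockEmbedding

# Use Amazon Bedrock models for generation and embeddings throughout LlamaIndex
Settings.llm = Bedrock(model="anthropic.claude-v2")
Settings.embed_model = BedrockEmbedding(model="amazon.titan-embed-text-v1")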

LlamaIndex goes beyond simple RAG and enables the implementation of more sophisticated patterns, which we discuss in the following sections.

Router query

RouterQueryEngine allows you to route queries to different indexes or query engines based on the nature of the query. For example, you could route summarization questions to a summary index and factual questions to a vector store index.

The following is a code snippet from the example notebooks demonstrating RouterQueryEngine:

from llama_index import SummaryIndex, VectorStoreIndex
from llama_index.core.query_engine import RouterQueryEngine

# Create summary and vector indices
summary_index = SummaryIndex.from_documents(documents)
vector_index = VectorStoreIndex.from_documents(documents)

# Define query engines
summary_query_engine = summary_index.as_query_engine()
vector_query_engine = vector_index.as_query_engine()

# Create router query engine
query_engine = RouterQueryEngine(
    # Define logic for routing queries
    # ...
    query_engine_tools=[
        summary_query_engine,
        vector_query_engine,
    ],
)

# Query the engine
response = query_engine.query("What is the main idea of the document?")
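The routing logic is elided in the snippet above. One way to complete it (a minimal sketch, assuming an llm object such as the Bedrock LLM configured later in this post) is to wrap each query engine in a QueryEngineTool with a description and let an LLM-based selector pick the right one:

from llama_index.core.query_engine import RouterQueryEngine
from llama_index.core.selectors import LLMSingleSelector
from llama_index.core.tools import QueryEngineTool

query_engine = RouterQueryEngine(
    # The selector asks the LLM to pick the best tool based on the descriptions
    selector=LLMSingleSelector.from_defaults(llm=llm),
    query_engine_tools=[
        QueryEngineTool.from_defaults(
            query_engine=summary_query_engine,
            description="Useful for summarization questions about the documents",
        ),
        QueryEngineTool.from_defaults(
            query_engine=vector_query_engine,
            description="Useful for retrieving specific facts from the documents",
        ),
    ],
)

response = query_engine.query("What is the main idea of the document?")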

Sub-question query

SubQuestionQueryEngine breaks down complex queries into simpler sub-queries and then combines the answers from each sub-query to generate a comprehensive response. This is particularly useful for queries that span across multiple documents. It first breaks down the complex query into sub-questions for each relevant data source, then gathers the intermediate responses and synthesizes a final response that integrates the relevant information from each sub-query. For example, if the original query was “What is the population of the capital city of the country with the highest GDP in Europe,” the engine would first break it down into sub-queries like “What is the highest GDP country in Europe,” “What is the capital city of that country,” and “What is the population of that capital city,” and then combine the answers to those sub-queries into a final comprehensive response.

The following is an example of using SubQuestionQueryEngine:

from llama_index.core.query_engine import SubQuestionQueryEngine

# Create sub-question query engine
sub_question_query_engine = SubQuestionQueryEngine.from_defaults(
    # Define tools for generating sub-questions and answering them
    # ...
)

# Query the engine
response = sub_question_query_engine.query(
    "Compare the revenue growth of Uber and Lyft from 2020 to 2021"
)
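The tool definitions are elided above. A minimal sketch of one way to supply them, assuming hypothetical uber_query_engine and lyft_query_engine objects built from separate indexes over the two 10-K filings, plus an llm object, could look like this:

from llama_index.core.query_engine import SubQuestionQueryEngine
from llama_index.core.tools import QueryEngineTool, ToolMetadata

# Hypothetical per-company query engines built from separate document indexes
query_engine_tools = [
    QueryEngineTool(
        query_engine=uber_query_engine,
        metadata=ToolMetadata(name="uber_10k", description="Uber financials for 2021"),
    ),
    QueryEngineTool(
        query_engine=lyft_query_engine,
        metadata=ToolMetadata(name="lyft_10k", description="Lyft financials for 2021"),
    ),
]

sub_question_query_engine = SubQuestionQueryEngine.from_defaults(
    query_engine_tools=query_engine_tools,
    llm=llm,  # for example, the Bedrock LLM configured later in this post
)

response = sub_question_query_engine.query(
    "Compare the revenue growth of Uber and Lyft from 2020 to 2021"
)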

Agentic RAG

An agentic approach to RAG uses an LLM to reason about the query and determine which tools (such as indexes or query engines) to use and in what sequence. This allows for a more dynamic and adaptive RAG pipeline. The following architecture diagram shows how agentic RAG works on Amazon Bedrock.

Agentic RAG in Amazon Bedrock combines the capabilities of agents and knowledge bases to enable RAG workflows. Agents act as intelligent orchestrators that can query knowledge bases during their workflow to retrieve relevant information and context to augment the responses generated by the FM.

After the initial preprocessing of the user input, the agent enters an orchestration loop. In this loop, the agent invokes the FM, which generates a rationale outlining the next step the agent should take. One potential step is to query an attached knowledge base to retrieve supplemental context from the indexed documents and data sources.

If a knowledge base query is deemed beneficial, the agent makes an InvokeModel call specifically for knowledge base response generation. This fetches relevant document chunks from the knowledge base based on semantic similarity to the current context. These retrieved chunks provide additional information that is included in the prompt sent back to the FM. The model then generates an observation response that is parsed and can trigger further orchestration steps, such as calling external APIs (through action group AWS Lambda functions) or providing a final response to the user. This agentic orchestration, augmented by knowledge base retrieval, continues until the request is fully handled.
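At the API level, this loop is triggered with a single call to the agent runtime; Amazon Bedrock handles the orchestration internally. The following is a minimal sketch, assuming an existing agent with a knowledge base attached; the agent ID and alias ID are placeholders.

import uuid

import boto3

client = boto3.client("bedrock-agent-runtime")

# Placeholders: replace with the IDs of an agent that has a knowledge base attached
response = client.invoke_agent(
    agentId="<YOUR_AGENT_ID>",
    agentAliasId="<YOUR_AGENT_ALIAS_ID>",
    sessionId=str(uuid.uuid4()),
    inputText="What was Lyft's revenue growth in 2021?",
)

# The answer streams back as chunks in an event stream
answer = "".join(
    event["chunk"]["bytes"].decode("utf-8")
    for event in response["completion"]
    if "chunk" in event
)
print(answer)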

One example of an agent orchestration loop is the ReAct agent, which was initially introduced by Yao et al. ReAct interleaves chain-of-thought and tool use. At every stage, the agent takes in the input task along with the previous conversation history and decides whether to invoke a tool (such as querying a knowledge base) with the appropriate input or not.

The following is an example of using the ReAct agent with the LlamaIndex SDK:

from llama_index.core.agent import ReActAgent

# Create ReAct agent with defined tools
agent = ReActAgent.from_tools(
    query_engine_tools,
    llm=llm,
)

# Chat with the agent
response = agent.chat("What was Lyft's revenue growth in 2021?")

The ReAct agent will analyze the query and decide whether to use the Lyft 10K tool or another tool to answer the question. To try out agentic RAG, refer to the GitHub repo.

LlamaCloud and LlamaParse

LlamaCloud represents a significant advancement in the LlamaIndex landscape, offering a comprehensive suite of managed services tailored for enterprise-grade context augmentation within LLM and RAG applications. This service empowers AI engineers to concentrate on developing core business logic by streamlining the intricate process of data wrangling.

One key component is LlamaParse, a proprietary parsing engine adept at handling complex, semi-structured documents replete with embedded objects like tables and figures, seamlessly integrating with LlamaIndex’s ingestion and retrieval pipelines. Another key component is the Managed Ingestion and Retrieval API, which facilitates effortless loading, processing, and storage of data from diverse sources, including LlamaParse outputs and LlamaHub’s centralized data repository, while accommodating various data storage integrations.

Collectively, these features enable the processing of vast production data volumes, culminating in enhanced response quality and unlocking unprecedented capabilities in context-aware question answering for RAG applications. To learn more about these features, refer to Introducing LlamaCloud and LlamaParse.

For this post, we use LlamaParse to showcase the integration with Amazon Bedrock. LlamaParse is an API created by LlamaIndex to efficiently parse and represent files for efficient retrieval and context augmentation using LlamaIndex frameworks. What is unique about LlamaParse is that it is the world’s first generative AI native document parsing service, which allows users to submit documents along with parsing instructions. The key insight behind parsing instructions is that you know what kind of documents you have, so you already know what kind of output you want. The following figure shows a comparison of parsing a complex PDF with LlamaParse vs. two popular open source PDF parsers.

A green highlight in a cell means that the RAG pipeline correctly returned the cell value as the answer to a question over that cell. A red highlight means that the question was answered incorrectly.
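Parsing instructions are passed when the parser is constructed. The following is a minimal sketch, assuming the LLAMA_CLOUD_API_KEY environment variable is set and using a hypothetical file path; the instruction text is illustrative only.

from llama_parse import LlamaParse

# Hypothetical instruction tailored to an earnings slide deck
parser = LlamaParse(
    result_type="markdown",
    parsing_instruction=(
        "This is a bank earnings presentation. "
        "Preserve every table and report numeric values exactly as shown."
    ),
)

# Placeholder path to a complex PDF
documents = parser.load_data("data/bac-q3-2023-presentation.pdf")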

Integrate Amazon Bedrock and LlamaIndex to build an advanced RAG pipeline

In this section, we show you how to build an advanced RAG stack that combines LlamaParse and LlamaIndex with Amazon Bedrock services: LLMs, embedding models, and knowledge bases.

To use LlamaParse with Amazon Bedrock, you can follow these high-level steps:

1. Download your source documents.
2. Send the documents to LlamaParse using the Python SDK:
import os

from llama_parse import LlamaParse
from llama_index.core import SimpleDirectoryReader

parser = LlamaParse(
    api_key=os.environ.get('LLAMA_CLOUD_API_KEY'),  # set via api_key param or in your env as LLAMA_CLOUD_API_KEY
    result_type="markdown",  # "markdown" and "text" are available
    num_workers=4,  # if multiple files passed, split in `num_workers` API calls
    verbose=True,
    language="en",  # optionally you can define a language, default=en
)

file_extractor = {".pdf": parser}
reader = SimpleDirectoryReader(
    input_dir='data/10k/',
    file_extractor=file_extractor,
)
3. Wait for the parsing job to finish and upload the resulting Markdown documents to Amazon Simple Storage Service (Amazon S3).
4. Create an Amazon Bedrock knowledge base using the source documents.
5. Choose your preferred embedding and generation model from Amazon Bedrock using the LlamaIndex SDK:
# Imports assume the llama-index-llms-bedrock and llama-index-embeddings-bedrock integration packages
from llama_index.llms.bedrock import Bedrock
from llama_index.embeddings.bedrock import BedrockEmbedding

llm = Bedrock(model="anthropic.claude-v2")
embed_model = BedrockEmbedding(model="amazon.titan-embed-text-v1")
6. Implement an advanced RAG pattern using LlamaIndex. In the following example, we use SubQuestionQueryEngine and a retriever specially created for Amazon Bedrock knowledge bases (a fuller sketch of this wiring follows the list):
    from llama_index.retrievers.bedrock import AmazonKnowledgeBasesRetriever
7. Finally, query the index with your question:
    response = await query_engine.aquery('Compare revenue growth of Uber and Lyft from 2020 to 2021')
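The snippets above leave out how the query_engine used in the last step is assembled from the Amazon Bedrock knowledge base retriever. The following is a minimal sketch of one way to wire it together, assuming a placeholder knowledge base ID and a hypothetical tool name, and reusing the llm object defined earlier:

from llama_index.core.query_engine import RetrieverQueryEngine, SubQuestionQueryEngine
from llama_index.core.tools import QueryEngineTool, ToolMetadata
from llama_index.retrievers.bedrock import AmazonKnowledgeBasesRetriever

# Retrieve chunks from the Amazon Bedrock knowledge base created earlier
retriever = AmazonKnowledgeBasesRetriever(
    knowledge_base_id="<YOUR_KNOWLEDGE_BASE_ID>",  # placeholder
    retrieval_config={"vectorSearchConfiguration": {"numberOfResults": 4}},
)

# Wrap the retriever and the Bedrock LLM in a query engine
kb_query_engine = RetrieverQueryEngine.from_args(retriever, llm=llm)

# Expose the knowledge base as a tool so sub-questions can be answered against it
query_engine = SubQuestionQueryEngine.from_defaults(
    query_engine_tools=[
        QueryEngineTool(
            query_engine=kb_query_engine,
            metadata=ToolMetadata(
                name="uber_lyft_10k",  # hypothetical tool name
                description="10-K filings for Uber and Lyft",
            ),
        ),
    ],
    llm=llm,
)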

We tested LlamaParse on a real-world, challenging example of asking questions about a document containing Bank of America Q3 2023 financial results. An example slide from the full slide deck (48 complex slides!) is shown below.

Using the procedure outlined above, we asked “What is the trend in digital households/relationships from 3Q20 to 3Q23?”; take a look at the answer generated using LlamaIndex tools vs. the reference answer from human annotation.

LlamaIndex + LlamaParse answer: The trend in digital households/relationships shows a steady increase from 3Q20 to 3Q23. In 3Q20, the number of digital households/relationships was 550K, which increased to 645K in 3Q21, then to 672K in 3Q22, and further to 716K in 3Q23. This indicates consistent growth in the adoption of digital services among households and relationships over the reported quarters.

Reference answer: The trend shows a steady increase in digital households/relationships from 645,000 in 3Q20 to 716,000 in 3Q23. The digital adoption percentage also increased from 76% to 83% over the same period.

The following are example notebooks to try out these steps on your own examples. Note the prerequisite steps, and clean up resources after testing them.

Conclusion

In this post, we explored various advanced RAG patterns with LlamaIndex and Amazon Bedrock. To delve deeper into the capabilities of LlamaIndex and its integration with Amazon Bedrock, refer to the LlamaIndex documentation and the Amazon Bedrock documentation.

By combining the power of LlamaIndex and Amazon Bedrock, you can build robust and sophisticated RAG pipelines that unlock the full potential of LLMs for knowledge-intensive tasks.


About the Author

Shreyas Subramanian is a Principal Data Scientist at AWS who helps customers solve their business challenges using machine learning on the AWS platform. Shreyas has a background in large-scale optimization and machine learning, and in the use of machine learning and reinforcement learning to accelerate optimization tasks.

Jerry Liu is the co-founder and CEO of LlamaIndex, a data framework for building LLM applications. Before this, he spent his career at the intersection of ML, research, and startups. He led the ML monitoring team at Robust Intelligence, did self-driving AI research at Uber ATG, and worked on recommendation systems at Quora.
