AWS Machine Learning Blog | July 17, 20:32
Building enterprise-scale RAG applications with Amazon S3 Vectors and DeepSeek R1 on Amazon SageMaker AI

This article describes how Amazon S3 Vectors, combined with Amazon SageMaker AI, transforms the development and deployment of enterprise-scale Retrieval Augmented Generation (RAG) applications. Traditional RAG deployments face challenges in cost, operational complexity, scalability, and integration. Amazon S3 Vectors, the first object storage service with native support for storing and querying vectors, offers a cost-effective, scalable, and easy-to-manage alternative. It stores vector data together with metadata, simplifying data management, and can deliver cost savings of up to 90%. Combined with SageMaker AI's LLM deployment, experiment tracking, and governance capabilities, developers can build, test, and scale AI-powered RAG applications more efficiently and make better use of unstructured enterprise data.

📝 **RAG challenges and the S3 Vectors answer**: Enterprises putting LLMs and RAG into practice often face model hallucinations, outdated knowledge, and no access to proprietary data. Retrieval Augmented Generation (RAG) addresses these pain points by combining semantic search with generative AI, but when RAG systems are deployed at scale, traditional vector databases present significant obstacles in cost, operations, scalability, and integration. Amazon S3 Vectors, a new object storage capability with native support for storing and querying vectors, is designed to manage large-scale vector data at lower cost (savings of up to 90%) and with simpler operations, addressing those pain points.

🚀 **SageMaker AI and S3 Vectors together**: Amazon SageMaker AI, through its integration with MLflow, provides an end-to-end foundation for enterprise-scale RAG applications. It accelerates LLM experimentation and governance, supports rapid model deployment (such as GTE-Qwen2-7B for embeddings and DeepSeek R1 Distill Qwen 7B for generation), and offers optimized infrastructure recommendations and scalable inference endpoints. SageMaker managed MLflow tracks experiments, evaluates model performance (such as correctness and relevance), and ensures reproducibility and compliance, so developers can focus on building high-quality RAG applications.

💡 **Cost efficiency and flexibility of S3 Vectors**: Amazon S3 Vectors brings the proven economics and simplicity of Amazon S3 to vector storage. It suits RAG scenarios that can tolerate latency slightly above the millisecond level in exchange for substantial cost savings and simplified operations. S3 Vectors scales elastically from gigabytes to petabytes and scales down to zero when not in use. It also stores vectors together with metadata (such as document type and year), up to 40 KB per vector, for more flexible data management and better retrieval while reducing the need for a separate database.

🛠️ **Practical steps for building an enterprise-scale RAG application**: The article details how to use Amazon S3 Vectors, SageMaker AI, and LangGraph to build a cost-effective, scalable enterprise-scale RAG application. The process covers document ingestion, text chunking, generating vector embeddings with an LLM deployed on SageMaker, storing the vectors with metadata in S3 Vectors, implementing retrieval based on vector similarity, and generating responses with a SageMaker-deployed LLM. A LangGraph graph automates the retrieval and generation flow, and MLflow evaluates performance so that responses meet enterprise standards.

Organizations are adopting large language models (LLMs), such as DeepSeek R1, to transform business processes, enhance customer experiences, and drive innovation at unprecedented speed. However, standalone LLMs have key limitations such as hallucinations, outdated knowledge, and no access to proprietary data. Retrieval Augmented Generation (RAG) addresses these gaps by combining semantic search with generative AI, enabling models to retrieve relevant information from enterprise knowledge bases before responding. This approach grounds outputs in accurate, up-to-date context, making applications more reliable, transparent, and capable of using domain expertise without retraining. Yet, as the usage of RAG solutions grows, so do the operational and technical hurdles associated with scaling these solutions in production. Such hurdles include the costs and infrastructure complexity of the vector databases that enterprises need to seamlessly store, search, and manage high-dimensional embeddings at scale.

Although RAG unlocks remarkable capabilities, organizations building production-grade applications often encounter four major obstacles with existing vector databases: unpredictable and rising costs, the operational complexity of managing separate vector database infrastructure, capacity planning and scaling challenges as vector collections grow, and the effort of integrating vector stores with existing data and security frameworks.

With the launch of Amazon Simple Storage Service (Amazon S3) Vectors, the first cloud object storage service with native support to store and query vectors, we have a new approach to cost-effectively manage vector data at scale. In this post, we explore how combining S3 Vectors and Amazon SageMaker AI redefines the developer experience for RAG, making it easier than ever to experiment, build, and scale AI-powered applications without the traditional trade-offs.

Amazon SageMaker AI: Streamlining LLM experimentation and governance

Enterprise-scale RAG applications involve high data volumes (often multimillion document knowledge bases, including unstructured data), high query throughput, mission-critical reliability, complex integration, and continuous evaluation and improvement. To realize these enterprise-scale RAG applications, you need more than powerful LLM model deployment. These applications also demand rigorous experimentation, governance, and performance tracking. Amazon SageMaker AI, with its native integration with managed MLflow, offers a unified system for deploying, monitoring, and optimizing LLMs at scale.

Amazon SageMaker JumpStart accelerates embedding and text generation model deployment, helping teams rapidly prototype and deliver value. SageMaker JumpStart accelerates the development of RAG solutions with rapid deployment of popular open source models, optimized infrastructure recommendations, and scalable inference endpoints.

Developing effective RAG systems requires evaluating prompts, chunking strategies, retrieval methods, and model settings. SageMaker managed MLflow supports this in several ways:

    Track experiments by logging and comparing chunking, retrieval, and generation configurations.
    Assess output quality with built-in LLM-based generative AI metrics such as correctness and relevance.
    Monitor performance by tracking latency and throughput alongside output quality to verify production-readiness.
    Improve reproducibility of experiments by capturing parameters for reliable auditing and experiment replication.
    Visualize results, manage model versions, and control approvals through the intuitive governance dashboards.

As enterprises scale their RAG implementations, they face significant challenges for managing their vector stores. Traditional vector databases have unpredictable and rising costs because they incur charges based on compute, storage, and API usage, which scale with data volume. Teams spend significant time on the operational complexity that results from having to manage separate vector database infrastructure instead of focusing on application development. Capacity planning becomes increasingly complex as vector collections grow and diversify, making scaling challenging. Connecting vector stores with existing data infrastructure and security frameworks adds additional complexity.

Introducing Amazon S3 Vectors

Amazon S3 Vectors delivers purpose-built vector storage so you can harness the semantic power of your organization’s unstructured data at scale. Designed for the cost-optimized and durable storage of large vector datasets with sub-second query performance, S3 Vectors is ideal for infrequent query workloads and can help you reduce the overall cost of uploading, storing, and querying vectors by up to 90% compared to alternative solutions. With S3 Vectors, you only pay for what you use without the need for infrastructure provisioning and management. Whether you’re developing semantic search engines, RAG systems, or recommendation services, you can focus on innovation rather than cost constraints and data management complexities.

Amazon S3 Vectors brings the proven economics and simplicity of Amazon S3 to vectors. S3 Vectors is ideal for RAG applications where slightly higher latency than millisecond-level vector databases is acceptable, in exchange for substantial cost savings and simplified operations. You also have elasticity because you can scale vector search applications seamlessly from gigabytes to petabytes, and scale down to zero when these resources are not in use. Moreover, data management becomes more flexible. You can store vectors alongside metadata, minimizing the need for separate databases while enhancing retrieval performance through consolidated data access. S3 Vectors supports up to 40 KB of (filterable and nonfilterable) metadata per vector with schema-less filtering capabilities, using separate vector indexes for streamlined organization.

Solution overview

In this post, we show how to build a cost-effective, enterprise-scale RAG application using Amazon S3 Vectors, SageMaker AI for scalable inference, and SageMaker managed MLflow for experiment tracking and evaluation, making sure the responses meet enterprise standards. We demonstrate this by building a RAG system that answers questions about Amazon financials using annual reports, shareholder letters, and 10-K filings as the knowledge base. Our solution consists of the following components:

    Document ingestion – Process PDF documents using LangChain’s document loaders.
    Text chunking – Experiment with different chunking strategies (for example, fixed-size or recursive) and configurations (such as chunk size and overlap).
    Vector embedding – Generate embeddings using SageMaker deployed embedding LLM models.
    Vector storage – Store vectors in Amazon S3 Vectors with associated metadata (such as the type of document). You can also store the entire text chunk in the vector metadata for simple retrieval.
    Retrieval logic – Implement semantic search using vector similarity.
    Response generation – Create responses using retrieved context and a SageMaker deployed LLM.
    Evaluation – Assess performance using ground truth datasets and SageMaker managed MLflow metrics.

The following diagram is the solution architecture.

You can follow and run the full example code from the repository. In the rest of this post, we use code snippets from this GitHub repository to illustrate the RAG solution with S3 Vectors and the associated tracking approaches.

Prerequisites

To implement this solution, you need the following prerequisites:

Walkthrough

The following steps walk you through this solution:

    1. Deploy LLMs on SageMaker AI
    2. Create Amazon S3 Vectors buckets and indexes
    3. Process documents and generate embeddings
    4. Implement the RAG pipeline with LangGraph
    5. Evaluate RAG performance with MLflow

Step 1: Deploy LLMs on SageMaker AI

We use SageMaker JumpStart to deploy state-of-the-art models in minutes using a few lines of code. A RAG application requires:

    An embedding model to convert text into vector representations
    A text generation model to produce responses based on retrieved context

Use the following code:

# Deploy embedding model with optimized configurations
EMBEDDING_ENDPOINT_NAME = deploy_jumpstart_model(
    model_id="huggingface-textembedding-gte-qwen2-7b-instruct",
    instance_type="ml.g5.xlarge",
    endpoint_name_base="s3-vectors-demo-embedding-model",
    model_version="1.0.8"
)

# Deploy text generation model
TEXT_GENERATION_ENDPOINT_NAME = deploy_jumpstart_model(
    model_id="deepseek-llm-r1-distill-qwen-7b",
    instance_type="ml.g6.12xlarge",
    endpoint_name_base="s3-vectors-demo-generation-model",
    model_version="2.0.2"
)
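The deploy_jumpstart_model helper comes from the sample repository and isn’t reproduced here. As a hedged sketch (not the repository’s exact implementation), it can be approximated with the SageMaker Python SDK’s JumpStartModel class:

from sagemaker.jumpstart.model import JumpStartModel
from sagemaker.utils import name_from_base

def deploy_jumpstart_model(model_id, instance_type, endpoint_name_base, model_version):
    # Approximation of the repository helper: deploy a JumpStart model to a
    # real-time endpoint and return the endpoint name.
    endpoint_name = name_from_base(endpoint_name_base)  # appends a unique timestamp suffix
    model = JumpStartModel(model_id=model_id, model_version=model_version)
    predictor = model.deploy(
        instance_type=instance_type,
        endpoint_name=endpoint_name,
        accept_eula=True,  # some JumpStart models require explicit EULA acceptance
    )
    return predictor.endpoint_name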

In this post, we use GTE-Qwen2 7B Instruct as the embedding model and DeepSeek R1 Distill Qwen 7B as the text generation model because they’re highly capable open source models that rank among the top entries on embedding and LLM leaderboards. You can choose an embedding and text generation model from the more than 300 models available in SageMaker JumpStart.

Step 2: Create Amazon S3 Vectors buckets and indexes

Amazon S3 Vectors provides a seamless yet powerful way to store and search vector embeddings. Vectors are stored in vector indexes, which logically group related vectors. Write and read operations are directed to a single vector index. To start using Amazon S3 Vectors, you need to:

    Create an S3 Vector bucket indicating its name:
import boto3

s3vectors_client = boto3.client("s3vectors")
s3vectors_client.create_vector_bucket(vectorBucketName=vectors_bucket_name)
    Create vector indexes by giving them a name and defining the number of dimensions and the distance metric to use. Vector indexes can store up to 50,000,000 vectors with a maximum dimension of 4,096 and use either cosine or euclidean as their distance metrics. Define your dimensions and distance metric:
for index in index_names:
    s3vectors_client.create_index(
        vectorBucketName=vectors_bucket_name,
        indexName=index,
        dimension=num_dimensions,
        distanceMetric="cosine",
        dataType="float32",
        metadataConfiguration={"nonFilterableMetadataKeys": ["chunk"]}
    )
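Before ingesting data, you can optionally confirm that the indexes were created. The following is a small, hedged sketch using the list_indexes operation of the s3vectors client; the response field names shown here are assumptions based on the S3 Vectors API reference and may differ slightly:

# Hedged sanity check: list the vector indexes in the bucket before ingestion
response = s3vectors_client.list_indexes(vectorBucketName=vectors_bucket_name)
for index_summary in response.get("indexes", []):
    print(index_summary.get("indexName"))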

Step 3: Process documents and generate embeddings

You can calculate the embedding vectors for each text chunk in the source documents and store them in your S3 vector bucket using the put_vectors API, which supports up to 500 vectors per call. When putting vectors in the vector index, you can include up to 10 fields in the vector metadata to facilitate the retrieval and generation stages of RAG. In the following example, we add the domain (such as a financial document or shareholder letter) and the year for targeted semantic queries, and the text chunk so we can use its content when generating an answer:

import uuid

import boto3

s3vectors_client = boto3.client("s3vectors")

for chunk in chunks:
    unique_id = str(uuid.uuid4())
    key = f"{unique_id}"
    embedding_response = embeddings_model_endpoint.predict({'inputs': chunk["chunk"]})
    s3vectors_client.put_vectors(
        vectorBucketName=vector_bucket_name,
        indexName=index_name,
        vectors=[
            {
                "key": key,
                "vector": embedding_response[0],
                "metadata": {
                    "domain": domain_name,
                    "year": year,
                    "chunk": chunk["chunk"]
                }
            }
        ]
    )
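Because put_vectors accepts up to 500 vectors per call, ingestion can also be batched instead of writing one vector at a time. The following is a minimal sketch under the same assumptions as the previous snippet (the chunks list, embedding endpoint, and metadata fields are reused; the batching itself is not part of the original example):

import uuid

BATCH_SIZE = 500  # put_vectors supports up to 500 vectors per call

# Build the vector records first (same metadata layout as the previous snippet)
vector_records = [
    {
        "key": str(uuid.uuid4()),
        "vector": embeddings_model_endpoint.predict({"inputs": chunk["chunk"]})[0],
        "metadata": {"domain": domain_name, "year": year, "chunk": chunk["chunk"]},
    }
    for chunk in chunks
]

# Write the records to the vector index in batches
for start in range(0, len(vector_records), BATCH_SIZE):
    s3vectors_client.put_vectors(
        vectorBucketName=vector_bucket_name,
        indexName=index_name,
        vectors=vector_records[start:start + BATCH_SIZE],
    )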

Step 4: Implement the RAG pipeline with LangGraph

LangGraph is a framework for building stateful, multi-step applications with LLMs using a graph-based architecture. The first step in the definition of a RAG application with LangGraph is to create Python functions for the retrieval and generation steps.

You can use the query_vectors API in S3 Vectors to run a semantic search on a vector index. In this query, you have to define the query vector (that is, the vector that represents the query or question), search parameters (for example, topK for the number of similar vectors to retrieve), and possibly filter conditions for metadata filtering:

import boto3

s3vectors_client = boto3.client("s3vectors")

query_args = {
    "vectorBucketName": vector_bucket_name,
    "indexName": index_name,
    "vector": query_vector,
    "topK": 5,
    "returnDistance": True,
    "returnMetadata": True
}
if metadata_filter is not None:
    query_args["filter"] = metadata_filter

vectors = s3vectors_client.query_vectors(**query_args).get("vectors", [])

Metadata filters can be used to focus the search on a subset of documents. For example, if you’re asking about business and industry risks from an annual report, you can call the query_vectors function with metadata_filter equal to “Amazon Annual Report” so that only those documents are considered: ({"domain": {"$eq": "Amazon Annual Report"}}). For greater specificity, you could add numerical operators to indicate the years of the annual reports to consult ({"year": {"$gt": 2023}}). The choice of operators and filters depends on the use case and on the logic that ensures only the relevant documents are consulted.
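For illustration, these two conditions can be combined into a single filter. The snippet below is a hedged sketch that assumes the $and logical operator described in the S3 Vectors metadata filtering documentation; verify the supported operators for your use case:

# Hypothetical combined filter: only annual reports published after 2023
metadata_filter = {
    "$and": [
        {"domain": {"$eq": "Amazon Annual Report"}},
        {"year": {"$gt": 2023}},
    ]
}

vectors = s3vectors_client.query_vectors(
    vectorBucketName=vector_bucket_name,
    indexName=index_name,
    vector=query_vector,
    topK=5,
    filter=metadata_filter,
    returnDistance=True,
    returnMetadata=True,
).get("vectors", [])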

You can automate the retrieval and generation steps of a RAG application using a LangGraph StateGraph. To define a LangGraph graph for RAG, we take the following steps:

    Define functions for retrieval and generation. A snippet of this implementation is shown in the following example. For a complete overview of these functions, visit our GitHub repository.
from typing import List, TypedDict

import boto3
from langchain import hub
from langchain_core.documents import Document

# Load RAG prompt from LangChain Hub
prompt = hub.pull("rlm/rag-prompt")

# Define RAG application state structure
class State(TypedDict):
    # Keeping question, context, and answer in the state
    question: str
    context: List[Document]
    answer: str

def retrieve(state: State, index_name: str):
    # Compute query vector
    query_vector = EMBEDDINGS_MODEL_ENDPOINT.predict({'inputs': state["question"]})[0]

    # Retrieve chunks with semantic search
    s3vectors_client = boto3.client("s3vectors")
    query_args = {
        "vectorBucketName": VECTOR_BUCKET_NAME,
        "indexName": index_name,
        "vector": query_vector,
        "topK": 5,
        "returnDistance": True,
        "returnMetadata": True
    }
    vectors = s3vectors_client.query_vectors(**query_args).get("vectors", [])
    chunks = [v.get("metadata", {}).get("chunk") for v in vectors]

    # Assemble retrieved chunks into a list of LangChain Documents
    document_chunks = [Document(page_content=chunk) for chunk in chunks]
    return {"context": document_chunks}

def generate(state: State):
    # Join text chunks into a context string
    docs_content = "\n\n".join(doc.page_content for doc in state["context"])

    # Generate LangChain messages and convert to OpenAI format
    # (langchain_to_openai_messages is a helper defined in the repository)
    lc_messages = prompt.invoke({
        "question": state["question"],
        "context": docs_content
    }).to_messages()
    openai_messages = langchain_to_openai_messages(lc_messages)

    # Invoke the text generation model and return "answer"
    request = {
        "messages": openai_messages,
        "temperature": 0.2,
        "max_new_tokens": 512,
    }
    response = text_generation_predictor.predict(request)
    return {"answer": response["choices"][0]["message"]["content"]}
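The graph assembly itself lives in the repository; a minimal sketch of wiring these functions into a compiled StateGraph (the index name and variable names below are illustrative assumptions, not the repository's exact code) could look like the following:

from functools import partial
from langgraph.graph import START, StateGraph

# Bind the retrieval function to a specific vector index (illustrative name)
retrieve_recursive = partial(retrieve, index_name="recursive-chunking-index")

# Wire retrieve -> generate into a compiled graph
graph_builder = StateGraph(State)
graph_builder.add_node("retrieve", retrieve_recursive)
graph_builder.add_node("generate", generate)
graph_builder.add_edge(START, "retrieve")
graph_builder.add_edge("retrieve", "generate")
recursive_graph = graph_builder.compile()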
    After the LangGraph graph has been built and compiled, it can be invoked with sample questions:
response = recursive_graph.invoke({
    "question": "What are the names of the people in Amazon's board of directors?"
})

The response from the LangGraph graph is as follows:

The names of the people in Amazon’s board of directors are Jeffrey P. Bezos, Andrew R. Jassy, Keith B. Alexander, Edith W. Cooper, Jamie S. Gorelick, Daniel P. Huttenlocher, and Judith A. McGrath. There are seven members on the board, including Amazon’s CEO.

Step 5: Evaluate RAG performance with MLflow

To validate that our RAG system performs well, we use MLflow to track experiments and evaluate performance against a ground truth dataset of questions and answers. You can set up an MLflow evaluation and run it using the following script. Notice that we can complement metrics from the evaluation (for example, latency and answer correctness using an LLM as a judge) with parameters from the chunking and embedding stages to provide full visibility into the RAG application:

import mlflow

with mlflow.start_run():
    results = mlflow.evaluate(
        model=recursive_model,
        data=eval_df,
        targets="answer",
        extra_metrics=[
            mlflow.metrics.genai.answer_correctness(
                model="bedrock:/anthropic.claude-3-sonnet-20240229-v1:0",
                parameters={
                    "temperature": 0.2,
                    "max_tokens": 256,
                    "anthropic_version": "bedrock-2023-05-31",
                },
            ),
            mlflow.metrics.latency(),
        ],
        evaluator_config={
            "col_mapping": {
                "inputs": "question",
            }
        }
    )

    # Store metrics from the evaluation
    mlflow.log_metrics(results.metrics)

    # Store other parameters
    mlflow.log_param("Chunking strategy", "recursive")
    mlflow.log_param("Chunk size", CHUNK_SIZE)
    mlflow.log_param("Chunk overlap", CHUNK_OVERLAP)
    mlflow.log_param("Number of chunks", NUM_CHUNKS_RECURSIVE)
    mlflow.log_param("Embedding model ID", EMBEDDING_MODEL_ID)
    mlflow.log_param("Embedding model version", EMBEDDING_MODEL_VERSION)
    mlflow.log_param("Text generation model ID", TEXT_GENERATION_MODEL_ID)
    mlflow.log_param("Text generation model version", TEXT_GENERATION_MODEL_VERSION)
    mlflow.log_param("Chunking time seconds", CHUNKING_TIME_RECURSIVE)

This evaluation tracks answer correctness (scored by an LLM judge), response latency, and the chunking and model parameters logged for each run.

The following figure shows the metrics and parameters from one experiment. Using this view, a machine learning (ML) engineer can compare the performance of different RAG applications and pick the combination of retrieval and generation parameters that provides the best user experience. This could be, for example, finding the best balance between high answer correctness and low latency. The ML engineer can then identify the models and chunking parameters that produced that performance and implement them in the final application.

By running multiple experiments with different chunking strategies, embedding models, or retrieval configurations, you can identify the optimal setup for your use case.
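After logging several runs, you can also pull them back programmatically to compare configurations. The following is a hedged sketch using mlflow.search_runs; the tracking server ARN variable, the experiment name, and the exact metric column names (which depend on the MLflow version and metric definitions) are assumptions:

import mlflow

# Point the client at the SageMaker managed MLflow tracking server (ARN is an assumption)
mlflow.set_tracking_uri(MLFLOW_TRACKING_SERVER_ARN)

# Fetch all runs from the evaluation experiment as a pandas DataFrame
runs = mlflow.search_runs(experiment_names=["s3-vectors-rag-evaluation"])

# Compare chunking configurations against correctness and latency
columns = [
    "params.Chunking strategy",
    "params.Chunk size",
    "metrics.answer_correctness/v1/mean",
    "metrics.latency/mean",
]
print(runs[columns].sort_values("metrics.answer_correctness/v1/mean", ascending=False))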

Key benefits of Amazon S3 Vectors for RAG

Amazon S3 Vectors brings scalable, cost-effective vector search to your RAG applications: pay-for-what-you-use pricing that can reduce vector upload, storage, and query costs by up to 90%, elastic scaling from gigabytes to petabytes (with scale down to zero when idle), and the ability to store vectors alongside filterable metadata without managing separate vector database infrastructure.

Amazon S3 vector store is ideal for use cases where ultra-low latency isn’t required, such as batch processing, periodic reporting, and agent-based workflows. It offers the durability, scalability, and operational simplicity of Amazon S3. Industry-specific workloads that fit this profile can benefit from the S3 vector store.

Performance considerations

When building RAG applications with Amazon S3 Vectors, consider performance optimization strategies such as tuning topK for your retrieval needs, applying metadata filters so queries only consider relevant documents, and batching put_vectors calls (up to 500 vectors per call) during ingestion.

To summarize, the key advantages demonstrated in using S3 Vectors with SageMaker AI are cost-effective vector storage at scale (with savings of up to 90%), elastic scaling without infrastructure management, rapid LLM deployment through SageMaker JumpStart, and measurable quality and governance through SageMaker managed MLflow.

In this post, we’ve shown how to replace traditional vector database complexity with a streamlined AWS-based approach that scales with your data and evolves with your generative AI strategy. The SageMaker managed MLflow integration means that every architectural decision is guided by quantifiable metrics, from answer correctness to latency profiles, turning experimentation into actionable insights. As you implement RAG solutions, use these tools to validate retrieval strategies against domain-specific datasets, benchmark embedding models for accuracy and storage tradeoffs, and enforce governance through version-controlled deployments.

Cleanup

To avoid unnecessary costs, delete resources such as the SageMaker-managed MLflow tracking server, S3 Vectors indexes and buckets, and SageMaker endpoints when your RAG experimentation is complete.

Conclusion

In this post, we’ve demonstrated how to build a complete RAG solution using Amazon S3 Vectors and SageMaker AI. The combination of S3 vector buckets, SageMaker AI LLMs, and SageMaker managed MLflow provides a transformative solution for organizations building enterprise-scale RAG applications. This approach shows how S3 Vectors can manage vector data effectively at scale, without the cost and scalability challenges that come with conventional vector databases.

We encourage you to explore Amazon S3 Vectors documentation and experiment with the SageMaker AI LLM models and SageMaker managed MLflow evaluation templates shown in this post. Now it’s your turn to build enterprise-scale AI solutions with serverless, observable, and relentlessly optimized generative AI strategies.

About the Authors

Sandeep Raveesh is a GenAI Specialist Solutions Architect at AWS. He works with customers through their AIOps journey across model training, Retrieval Augmented Generation (RAG), GenAI agents, and scaling GenAI use cases. He also focuses on go-to-market strategies, helping AWS build and align products to solve industry challenges in the generative AI space. You can find Sandeep on LinkedIn.

Felipe Lopez is a Senior AI/ML Specialist Solutions Architect at AWS. Prior to joining AWS, Felipe worked with GE Digital and SLB, where he focused on modeling and optimization products for industrial applications.

Indrajit Ghosalkar is a Sr. Solutions Architect at Amazon Web Services based in Singapore. He loves helping customers achieve their business outcomes through cloud adoption and realize their data analytics and ML goals through adoption of DataOps / MLOps practices and solutions. In his spare time, he enjoys playing with his son, traveling and meeting new people.

Biswanath Hore is a Sr. Solutions Architect at Amazon Web Services. He works with customers early in their AWS journey, helping them adopt cloud solutions to address their business needs. He is passionate about Machine Learning and, outside of work, loves spending time with his family.
