AWS Machine Learning Blog | July 17, 20:32
Building enterprise-scale RAG applications with Amazon S3 Vectors and DeepSeek R1 on Amazon SageMaker AI

This article describes how Amazon S3 Vectors, combined with Amazon SageMaker AI, transforms the development and deployment of enterprise-scale Retrieval Augmented Generation (RAG) applications. Traditional RAG deployments face challenges in cost, operational complexity, scalability, and integration. Amazon S3 Vectors, the first object storage service with native support for storing and querying vectors, offers a cost-effective, scalable, and easy-to-manage alternative. It stores vector data together with metadata, simplifying data management, and can deliver cost savings of up to 90%. Combined with SageMaker AI's LLM deployment, experiment tracking, and governance capabilities, developers can build, test, and scale AI-powered RAG applications more efficiently and make better use of unstructured enterprise data.

📝 **RAG challenges and the S3 Vectors answer**: Enterprises putting LLMs and RAG into practice often face model hallucinations, outdated knowledge, and no access to proprietary data. Retrieval Augmented Generation (RAG) addresses these pain points by combining semantic search with generative AI, but when RAG systems are deployed at scale, traditional vector databases present significant obstacles in cost, operations, scalability, and integration. Amazon S3 Vectors, a new object storage capability with native support for storing and querying vectors, is designed to manage large-scale vector data at lower cost (savings of up to 90%) and with simpler operations, addressing those pain points.

🚀 **SageMaker AI and S3 Vectors together**: Amazon SageMaker AI, through its integration with MLflow, provides an end-to-end foundation for enterprise-scale RAG applications. It accelerates LLM experimentation and governance, supports rapid model deployment (such as GTE-Qwen2-7B for embeddings and DeepSeek R1 Distill Qwen 7B for generation), and offers optimized infrastructure recommendations and scalable inference endpoints. SageMaker managed MLflow tracks experiments, evaluates model performance (such as correctness and relevance), and ensures reproducibility and compliance, so developers can focus on building high-quality RAG applications.

💡 **Cost efficiency and flexibility of S3 Vectors**: Amazon S3 Vectors brings the proven economics and simplicity of Amazon S3 to vector storage. It suits RAG scenarios that can tolerate latency slightly above the millisecond level in exchange for substantial cost savings and simplified operations. S3 Vectors scales elastically from gigabytes to petabytes and scales down to zero when not in use. It also stores vectors together with metadata (such as document type and year), up to 40 KB per vector, for more flexible data management and better retrieval while reducing the need for a separate database.

🛠️ **Practical steps for building an enterprise-scale RAG application**: The article details how to use Amazon S3 Vectors, SageMaker AI, and LangGraph to build a cost-effective, scalable enterprise-scale RAG application. The process covers document ingestion, text chunking, generating vector embeddings with an LLM deployed on SageMaker, storing the vectors with metadata in S3 Vectors, implementing retrieval based on vector similarity, and generating responses with a SageMaker-deployed LLM. A LangGraph graph automates the retrieval and generation flow, and MLflow evaluates performance so that responses meet enterprise standards.

Organizations are adopting large language models (LLMs), such as DeepSeek R1, to transform business processes, enhance customer experiences, and drive innovation at unprecedented speed. However, standalone LLMs have key limitations such as hallucinations, outdated knowledge, and no access to proprietary data. Retrieval Augmented Generation (RAG) addresses these gaps by combining semantic search with generative AI, enabling models to retrieve relevant information from enterprise knowledge bases before responding. This approach grounds outputs in accurate, up-to-date context, making applications more reliable, transparent, and capable of using domain expertise without retraining. Yet, as the usage of RAG solutions grows, so do the operational and technical hurdles associated with scaling these solutions in production. Such hurdles include the costs and infrastructure complexity of the vector databases that enterprises need to seamlessly store, search, and manage high-dimensional embeddings at scale.

Although RAG unlocks remarkable capabilities, organizations building production-grade applications often encounter four major obstacles with existing vector databases: unpredictable and rising costs, the operational complexity of managing separate vector database infrastructure, capacity planning and scaling challenges as vector collections grow, and the effort of integrating vector stores with existing data and security frameworks.

With the launch of Amazon Simple Storage Service (Amazon S3) Vectors, the first cloud object storage service with native support to store and query vectors, we have a new approach to cost-effectively manage vector data at scale. In this post, we explore how combining S3 Vectors and Amazon SageMaker AI redefines the developer experience for RAG, making it easier than ever to experiment, build, and scale AI-powered applications without the traditional trade-offs.

Amazon SageMaker AI: Streamlining LLM experimentation and governance

Enterprise-scale RAG applications involve high data volumes (often multimillion document knowledge bases, including unstructured data), high query throughput, mission-critical reliability, complex integration, and continuous evaluation and improvement. To realize these enterprise-scale RAG applications, you need more than powerful LLM model deployment. These applications also demand rigorous experimentation, governance, and performance tracking. Amazon SageMaker AI, with its native integration with managed MLflow, offers a unified system for deploying, monitoring, and optimizing LLMs at scale.

Amazon SageMaker JumpStart accelerates embedding and text generation model deployment, helping teams rapidly prototype and deliver value. SageMaker JumpStart accelerates the development of RAG solutions with rapid deployment of popular open source models, optimized infrastructure recommendations, and scalable inference endpoints.

Developing effective RAG systems requires evaluating prompts, chunking strategies, retrieval methods, and model settings. SageMaker managed MLflow supports this in several ways:

    Track experiments by logging and comparing chunking, retrieval, and generation configurations.
    Assess output quality with built-in LLM-based generative AI metrics such as correctness and relevance.
    Monitor performance by tracking latency and throughput alongside output quality to verify production-readiness.
    Improve reproducibility of experiments by capturing parameters for reliable auditing and experiment replication.
    Visualize results, manage model versions, and control approvals through the intuitive governance dashboards.

As enterprises scale their RAG implementations, they face significant challenges for managing their vector stores. Traditional vector databases have unpredictable and rising costs because they incur charges based on compute, storage, and API usage, which scale with data volume. Teams spend significant time on the operational complexity that results from having to manage separate vector database infrastructure instead of focusing on application development. Capacity planning becomes increasingly complex as vector collections grow and diversify, making scaling challenging. Connecting vector stores with existing data infrastructure and security frameworks adds additional complexity.

Introducing Amazon S3 Vectors

Amazon S3 Vectors delivers purpose-built vector storage so you can harness the semantic power of your organization’s unstructured data at scale. Designed for the cost-optimized and durable storage of large vector datasets with sub-second query performance, S3 Vectors is ideal for infrequent query workloads and can help you reduce the overall cost of uploading, storing, and querying vectors by up to 90% compared to alternative solutions. With S3 Vectors, you only pay for what you use without the need for infrastructure provisioning and management. Whether you’re developing semantic search engines, RAG systems, or recommendation services, you can focus on innovation rather than cost constraints and data management complexities.

Amazon S3 Vectors brings the proven economics and simplicity of Amazon S3 to vectors. S3 Vectors is ideal for RAG applications where slightly higher latency than millisecond-level vector databases is acceptable, in exchange for substantial cost savings and simplified operations. You also have elasticity because you can scale vector search applications seamlessly from gigabytes to petabytes, and scale down to zero when these resources are not in use. Moreover, data management becomes more flexible. You can store vectors alongside metadata, minimizing the need for separate databases while enhancing retrieval performance through consolidated data access. S3 Vectors supports up to 40 KB of (filterable and nonfilterable) metadata per vector with schema-less filtering capabilities, using separate vector indexes for streamlined organization.

Solution overview

In this post, we show how to build a cost-effective, enterprise-scale RAG application using Amazon S3 Vectors, SageMaker AI for scalable inference, and SageMaker managed MLflow for experiment tracking and evaluation, making sure the responses meet enterprise standards. We demonstrate this by building a RAG system that answers questions about Amazon financials using annual reports, shareholder letters, and 10-K filings as the knowledge base. Our solution consists of the following components:

    Document ingestion – Process PDF documents using LangChain’s document loaders.
    Text chunking – Experiment with different chunking strategies (for example, fixed-size or recursive) and configurations (such as chunk size and overlap).
    Vector embedding – Generate embeddings using SageMaker deployed embedding LLM models.
    Vector storage – Store vectors in Amazon S3 Vectors with associated metadata (such as the type of document). You can also store the entire text chunk in the vector metadata for simple retrieval.
    Retrieval logic – Implement semantic search using vector similarity.
    Response generation – Create responses using retrieved context and a SageMaker deployed LLM.
    Evaluation – Assess performance using ground truth datasets and SageMaker managed MLflow metrics.

The following diagram is the solution architecture.

You can follow and run the full example code from the repository. In the rest of this post, we use code snippets from this GitHub repository to illustrate the RAG solution with S3 Vectors and the associated tracking approaches.

Prerequisites

To implement this solution, you need the following prerequisites:

Walkthrough

The following steps walk you through this solution:

    1. Deploy LLMs on SageMaker AI
    2. Create Amazon S3 Vectors buckets and indexes
    3. Process documents and generate embeddings
    4. Implement the RAG pipeline with LangGraph
    5. Evaluate RAG performance with MLflow

Step 1: Deploy LLMs on SageMaker AI

We use SageMaker JumpStart to deploy state-of-the-art models in minutes using a few lines of code. A RAG application requires:

    An embedding model to convert text into vector representations
    A text generation model to produce responses based on retrieved context

Use the following code:

# Deploy embedding model with optimized configurations
EMBEDDING_ENDPOINT_NAME = deploy_jumpstart_model(
    model_id="huggingface-textembedding-gte-qwen2-7b-instruct",
    instance_type="ml.g5.xlarge",
    endpoint_name_base="s3-vectors-demo-embedding-model",
    model_version="1.0.8"
)

# Deploy text generation model
TEXT_GENERATION_ENDPOINT_NAME = deploy_jumpstart_model(
    model_id="deepseek-llm-r1-distill-qwen-7b",
    instance_type="ml.g6.12xlarge",
    endpoint_name_base="s3-vectors-demo-generation-model",
    model_version="2.0.2"
)
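The deploy_jumpstart_model helper comes from the sample repository and isn’t reproduced here. As a hedged sketch (not the repository’s exact implementation), it can be approximated with the SageMaker Python SDK’s JumpStartModel class:

from sagemaker.jumpstart.model import JumpStartModel
from sagemaker.utils import name_from_base

def deploy_jumpstart_model(model_id, instance_type, endpoint_name_base, model_version):
    # Approximation of the repository helper: deploy a JumpStart model to a
    # real-time endpoint and return the endpoint name.
    endpoint_name = name_from_base(endpoint_name_base)  # appends a unique timestamp suffix
    model = JumpStartModel(model_id=model_id, model_version=model_version)
    predictor = model.deploy(
        instance_type=instance_type,
        endpoint_name=endpoint_name,
        accept_eula=True,  # some JumpStart models require explicit EULA acceptance
    )
    return predictor.endpoint_name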

In this post, we use GTE-Qwen2 7B Instruct as the embedding model and DeepSeek R1 Distill Qwen 7B as the text generation model because they’re highly capable open source models that rank among the top entries on embedding and LLM leaderboards. You can choose an embedding and text generation model from the more than 300 models available in SageMaker JumpStart.

Step 2: Create Amazon S3 Vectors buckets and indexes

Amazon S3 Vectors provides a seamless yet powerful way to store and search vector embeddings. Vectors are stored in vector indexes, which logically group related vectors. Write and read operations are directed to a single vector index. To start using Amazon S3 Vectors, you need to:

    Create an S3 Vector bucket indicating its name:
import boto3

s3vectors_client = boto3.client("s3vectors")
s3vectors_client.create_vector_bucket(vectorBucketName=vectors_bucket_name)
    Create vector indexes by giving them a name and defining the number of dimensions and the distance metric to use. Vector indexes can store up to 50,000,000 vectors with a maximum dimension of 4,096 and use either cosine or euclidean as their distance metrics. Define your dimensions and distance metric:
for index in index_names:
    s3vectors_client.create_index(
        vectorBucketName=vectors_bucket_name,
        indexName=index,
        dimension=num_dimensions,
        distanceMetric="cosine",
        dataType="float32",
        metadataConfiguration={"nonFilterableMetadataKeys": ["chunk"]}
    )
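Before ingesting data, you can optionally confirm that the indexes were created. The following is a small, hedged sketch using the list_indexes operation of the s3vectors client; the response field names shown here are assumptions based on the S3 Vectors API reference and may differ slightly:

# Hedged sanity check: list the vector indexes in the bucket before ingestion
response = s3vectors_client.list_indexes(vectorBucketName=vectors_bucket_name)
for index_summary in response.get("indexes", []):
    print(index_summary.get("indexName"))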

Step 3: Process documents and generate embeddings

You can calculate the embedding vectors for each text chunk in the source documents and store them in your S3 vector bucket using the put_vectors API, which supports up to 500 vectors per call. When putting vectors in the vector index, you can include up to 10 fields in the vector metadata to facilitate the retrieval and generation stages of RAG. In the following example, we add the domain (such as a financial document or shareholder letter) and the year for targeted semantic queries, and the text chunk so we can use its content when generating an answer:

import uuid

import boto3

s3vectors_client = boto3.client("s3vectors")

for chunk in chunks:
    unique_id = str(uuid.uuid4())
    key = f"{unique_id}"
    embedding_response = embeddings_model_endpoint.predict({'inputs': chunk["chunk"]})
    s3vectors_client.put_vectors(
        vectorBucketName=vector_bucket_name,
        indexName=index_name,
        vectors=[
            {
                "key": key,
                "vector": embedding_response[0],
                "metadata": {
                    "domain": domain_name,
                    "year": year,
                    "chunk": chunk["chunk"]
                }
            }
        ]
    )
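Because put_vectors accepts up to 500 vectors per call, ingestion can also be batched instead of writing one vector at a time. The following is a minimal sketch under the same assumptions as the previous snippet (the chunks list, embedding endpoint, and metadata fields are reused; the batching itself is not part of the original example):

import uuid

BATCH_SIZE = 500  # put_vectors supports up to 500 vectors per call

# Build the vector records first (same metadata layout as the previous snippet)
vector_records = [
    {
        "key": str(uuid.uuid4()),
        "vector": embeddings_model_endpoint.predict({"inputs": chunk["chunk"]})[0],
        "metadata": {"domain": domain_name, "year": year, "chunk": chunk["chunk"]},
    }
    for chunk in chunks
]

# Write the records to the vector index in batches
for start in range(0, len(vector_records), BATCH_SIZE):
    s3vectors_client.put_vectors(
        vectorBucketName=vector_bucket_name,
        indexName=index_name,
        vectors=vector_records[start:start + BATCH_SIZE],
    )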

Step 4: Implement the RAG pipeline with LangGraph

LangGraph is a framework for building stateful, multi-step applications with LLMs using a graph-based architecture. The first step in the definition of a RAG application with LangGraph is to create Python functions for the retrieval and generation steps.

You can use the query_vectors API in S3 Vectors to run a semantic search on a vector index. In this query, you have to define the query vector (that is, the vector that represents the query or question), search parameters (for example, topK for the number of similar vectors to retrieve), and possibly filter conditions for metadata filtering:

import boto3

s3vectors_client = boto3.client("s3vectors")

query_args = {
    "vectorBucketName": vector_bucket_name,
    "indexName": index_name,
    "vector": query_vector,
    "topK": 5,
    "returnDistance": True,
    "returnMetadata": True
}
if metadata_filter is not None:
    query_args["filter"] = metadata_filter

vectors = s3vectors_client.query_vectors(**query_args).get("vectors", [])

Metadata filters can be used to focus the search on a subset of documents. For example, if you’re asking about business and industry risks from an annual report, you can call the query_vectors function with metadata_filter equal to “Amazon Annual Report” so that only those documents are considered: ({"domain": {"$eq": "Amazon Annual Report"}}). For greater specificity, you could add numerical operators to indicate the years of the annual reports to consult ({"year": {"$gt": 2023}}). The choice of operators and filters depends on the use case and on the logic that ensures only the relevant documents are consulted.
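For illustration, these two conditions can be combined into a single filter. The snippet below is a hedged sketch that assumes the $and logical operator described in the S3 Vectors metadata filtering documentation; verify the supported operators for your use case:

# Hypothetical combined filter: only annual reports published after 2023
metadata_filter = {
    "$and": [
        {"domain": {"$eq": "Amazon Annual Report"}},
        {"year": {"$gt": 2023}},
    ]
}

vectors = s3vectors_client.query_vectors(
    vectorBucketName=vector_bucket_name,
    indexName=index_name,
    vector=query_vector,
    topK=5,
    filter=metadata_filter,
    returnDistance=True,
    returnMetadata=True,
).get("vectors", [])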

You can automate the retrieval and generation steps of a RAG application using a LangGraph StateGraph. To define a LangGraph graph for RAG, we take the following steps:

    Define functions for retrieval and generation. A snippet of this implementation is shown in the following example. For a complete overview of these functions, visit our GitHub repository.
from typing import List, TypedDict

import boto3
from langchain import hub
from langchain_core.documents import Document

# Load RAG prompt from LangChain Hub
prompt = hub.pull("rlm/rag-prompt")

# Define RAG application state structure
class State(TypedDict):
    # Keeping question, context, and answer in the state
    question: str
    context: List[Document]
    answer: str

def retrieve(state: State, index_name: str):
    # Compute query vector
    query_vector = EMBEDDINGS_MODEL_ENDPOINT.predict({'inputs': state["question"]})[0]

    # Retrieve chunks with semantic search
    s3vectors_client = boto3.client("s3vectors")
    query_args = {
        "vectorBucketName": VECTOR_BUCKET_NAME,
        "indexName": index_name,
        "vector": query_vector,
        "topK": 5,
        "returnDistance": True,
        "returnMetadata": True
    }
    vectors = s3vectors_client.query_vectors(**query_args).get("vectors", [])
    chunks = [v.get("metadata", {}).get("chunk") for v in vectors]

    # Assemble retrieved chunks into a list of LangChain Documents
    document_chunks = [Document(page_content=chunk) for chunk in chunks]
    return {"context": document_chunks}

def generate(state: State):
    # Join text chunks into a context string
    docs_content = "\n\n".join(doc.page_content for doc in state["context"])

    # Generate LangChain messages and convert to OpenAI format
    # (langchain_to_openai_messages is a helper defined in the repository)
    lc_messages = prompt.invoke({
        "question": state["question"],
        "context": docs_content
    }).to_messages()
    openai_messages = langchain_to_openai_messages(lc_messages)

    # Invoke the text generation model and return "answer"
    request = {
        "messages": openai_messages,
        "temperature": 0.2,
        "max_new_tokens": 512,
    }
    response = text_generation_predictor.predict(request)
    return {"answer": response["choices"][0]["message"]["content"]}
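The graph assembly itself lives in the repository; a minimal sketch of wiring these functions into a compiled StateGraph (the index name and variable names below are illustrative assumptions, not the repository's exact code) could look like the following:

from functools import partial
from langgraph.graph import START, StateGraph

# Bind the retrieval function to a specific vector index (illustrative name)
retrieve_recursive = partial(retrieve, index_name="recursive-chunking-index")

# Wire retrieve -> generate into a compiled graph
graph_builder = StateGraph(State)
graph_builder.add_node("retrieve", retrieve_recursive)
graph_builder.add_node("generate", generate)
graph_builder.add_edge(START, "retrieve")
graph_builder.add_edge("retrieve", "generate")
recursive_graph = graph_builder.compile()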
    After the LangGraph graph has been built and compiled, it can be invoked with sample questions:
response = recursive_graph.invoke({
    "question": "What are the names of the people in Amazon's board of directors?"
})

The response from the LangGraph graph is as follows:

The names of the people in Amazon’s board of directors are Jeffrey P. Bezos, Andrew R. Jassy, Keith B. Alexander, Edith W. Cooper, Jamie S. Gorelick, Daniel P. Huttenlocher, and Judith A. McGrath. There are seven members on the board, including Amazon’s CEO.

Step 5: Evaluate RAG performance with MLflow

To validate that our RAG system performs well, we use MLflow to track experiments and evaluate performance against a ground truth dataset of questions and answers. You can set up an MLflow evaluation and run it using the following script. Notice that we can complement metrics from the evaluation (for example, latency and answer correctness using an LLM as a judge) with parameters from the chunking and embedding stages to provide full visibility into the RAG application:

import mlflow

with mlflow.start_run():
    results = mlflow.evaluate(
        model=recursive_model,
        data=eval_df,
        targets="answer",
        extra_metrics=[
            mlflow.metrics.genai.answer_correctness(
                model="bedrock:/anthropic.claude-3-sonnet-20240229-v1:0",
                parameters={
                    "temperature": 0.2,
                    "max_tokens": 256,
                    "anthropic_version": "bedrock-2023-05-31",
                },
            ),
            mlflow.metrics.latency(),
        ],
        evaluator_config={
            "col_mapping": {
                "inputs": "question",
            }
        }
    )

    # Store metrics from the evaluation
    mlflow.log_metrics(results.metrics)

    # Store other parameters
    mlflow.log_param("Chunking strategy", "recursive")
    mlflow.log_param("Chunk size", CHUNK_SIZE)
    mlflow.log_param("Chunk overlap", CHUNK_OVERLAP)
    mlflow.log_param("Number of chunks", NUM_CHUNKS_RECURSIVE)
    mlflow.log_param("Embedding model ID", EMBEDDING_MODEL_ID)
    mlflow.log_param("Embedding model version", EMBEDDING_MODEL_VERSION)
    mlflow.log_param("Text generation model ID", TEXT_GENERATION_MODEL_ID)
    mlflow.log_param("Text generation model version", TEXT_GENERATION_MODEL_VERSION)
    mlflow.log_param("Chunking time seconds", CHUNKING_TIME_RECURSIVE)

This evaluation tracks answer correctness (scored by an LLM judge), response latency, and the chunking and model parameters logged for each run.

The following figure shows the metrics and parameters from one experiment. Using this view, a machine learning (ML) engineer can compare the performance of different RAG applications and pick the combination of retrieval and generation parameters that provides the best user experience. This could be, for example, finding the best balance between high answer correctness and low latency. The ML engineer can then identify the models and chunking parameters that produced that performance and implement them in the final application.

By running multiple experiments with different chunking strategies, embedding models, or retrieval configurations, you can identify the optimal setup for your use case.
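After logging several runs, you can also pull them back programmatically to compare configurations. The following is a hedged sketch using mlflow.search_runs; the tracking server ARN variable, the experiment name, and the exact metric column names (which depend on the MLflow version and metric definitions) are assumptions:

import mlflow

# Point the client at the SageMaker managed MLflow tracking server (ARN is an assumption)
mlflow.set_tracking_uri(MLFLOW_TRACKING_SERVER_ARN)

# Fetch all runs from the evaluation experiment as a pandas DataFrame
runs = mlflow.search_runs(experiment_names=["s3-vectors-rag-evaluation"])

# Compare chunking configurations against correctness and latency
columns = [
    "params.Chunking strategy",
    "params.Chunk size",
    "metrics.answer_correctness/v1/mean",
    "metrics.latency/mean",
]
print(runs[columns].sort_values("metrics.answer_correctness/v1/mean", ascending=False))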

Key benefits of Amazon S3 Vectors for RAG

Amazon S3 Vectors brings scalable, cost-effective vector search to your RAG applications: pay-for-what-you-use pricing that can reduce vector upload, storage, and query costs by up to 90%, elastic scaling from gigabytes to petabytes (with scale down to zero when idle), and the ability to store vectors alongside filterable metadata without managing separate vector database infrastructure.

Amazon S3 vector store is ideal for use cases where ultra-low latency isn’t required, such as batch processing, periodic reporting, and agent-based workflows. It offers the durability, scalability, and operational simplicity of Amazon S3. Industry-specific workloads that fit this profile can benefit from the S3 vector store.

Performance considerations

When building RAG applications with Amazon S3 Vectors, consider performance optimization strategies such as tuning topK for your retrieval needs, applying metadata filters so queries only consider relevant documents, and batching put_vectors calls (up to 500 vectors per call) during ingestion.

To summarize, the key advantages demonstrated in using S3 Vectors with SageMaker AI are cost-effective vector storage at scale (with savings of up to 90%), elastic scaling without infrastructure management, rapid LLM deployment through SageMaker JumpStart, and measurable quality and governance through SageMaker managed MLflow.

In this post, we’ve shown how to replace traditional vector database complexity with a streamlined AWS-based approach that scales with your data and evolves with your generative AI strategy. The SageMaker managed MLflow integration means that every architectural decision is guided by quantifiable metrics, from answer correctness to latency profiles, turning experimentation into actionable insights. As you implement RAG solutions, use these tools to validate retrieval strategies against domain-specific datasets, benchmark embedding models for accuracy and storage tradeoffs, and enforce governance through version-controlled deployments.

Cleanup

To avoid unnecessary costs, delete resources such as the SageMaker-managed MLflow tracking server, S3 Vectors indexes and buckets, and SageMaker endpoints when your RAG experimentation is complete.

Conclusion

In this post, we’ve demonstrated how to build a complete RAG solution using Amazon S3 Vectors and SageMaker AI. The combination of S3 vector buckets, SageMaker AI LLMs, and SageMaker managed MLflow provides a transformative solution for organizations building enterprise-scale RAG applications. This approach shows how S3 Vectors can manage vector data effectively at scale, without the cost and scalability challenges that come with conventional vector databases.

We encourage you to explore Amazon S3 Vectors documentation and experiment with the SageMaker AI LLM models and SageMaker managed MLflow evaluation templates shown in this post. Now it’s your turn to build enterprise-scale AI solutions with serverless, observable, and relentlessly optimized generative AI strategies.

About the Authors

Sandeep Raveesh is a GenAI Specialist Solutions Architect at AWS. He works with customers through their AIOps journey across model training, Retrieval Augmented Generation (RAG), GenAI agents, and scaling GenAI use cases. He also focuses on go-to-market strategies, helping AWS build and align products to solve industry challenges in the generative AI space. You can find Sandeep on LinkedIn.

Felipe Lopez is a Senior AI/ML Specialist Solutions Architect at AWS. Prior to joining AWS, Felipe worked with GE Digital and SLB, where he focused on modeling and optimization products for industrial applications.

Indrajit Ghosalkar is a Sr. Solutions Architect at Amazon Web Services based in Singapore. He loves helping customers achieve their business outcomes through cloud adoption and realize their data analytics and ML goals through adoption of DataOps / MLOps practices and solutions. In his spare time, he enjoys playing with his son, traveling and meeting new people.

Biswanath Hore is a Sr. Solutions Architect at Amazon Web Services. He works with customers early in their AWS journey, helping them adopt cloud solutions to address their business needs. He is passionate about Machine Learning and, outside of work, loves spending time with his family.
