Organizations manage extensive structured data in databases and data warehouses. Large language models (LLMs) have transformed natural language processing (NLP), yet converting conversational queries into structured data analysis remains complex. Data analysts must translate business questions into SQL queries, creating workflow bottlenecks.
Amazon Bedrock Knowledge Bases enables direct natural language interactions with structured data sources. The system interprets database schemas and context, converting natural language questions into accurate queries while maintaining data reliability standards. You can chat with your structured data by setting up structured data ingestion from AWS Glue Data Catalog tables and Amazon Redshift clusters in a few steps, using the power of Amazon Bedrock Knowledge Bases structured data retrieval.
This post provides instructions to configure a structured data retrieval solution, with practical code examples and templates. It covers implementation samples and additional considerations, empowering you to quickly build and scale your conversational data interfaces. Through clear examples and proven methodologies, organizations can transform their data access capabilities and accelerate decision-making processes.
Solution overview
The solution demonstrates how to build a conversational application using Amazon Bedrock Knowledge Bases structured data retrieval. Developers often face challenges integrating structured data into generative AI applications. This includes difficulties training LLMs to convert natural language queries to SQL queries based on complex database schemas, as well as making sure appropriate data governance and security controls are in place. Amazon Bedrock Knowledge Bases alleviates these complexities by providing a managed natural language to SQL (NL2SQL) module. Amazon Bedrock Knowledge Bases offers an end-to-end managed workflow for you to build custom generative AI applications that can access and incorporate contextual information from a variety of structured and unstructured data sources. Using advanced NLP, Amazon Bedrock Knowledge Bases can transform natural language queries into SQL queries, so you can retrieve data directly from the source without the need to move or preprocess the data.
This solution includes Amazon Bedrock Knowledge Bases, Amazon Redshift, AWS Glue, and Amazon Simple Storage Service (Amazon S3). The solution architecture consists of two parts: a data ingestion pipeline, and a structured data retrieval application using Amazon Bedrock Knowledge Bases.
Amazon Bedrock Knowledge Bases structured data retrieval supports Amazon Redshift as the query engine and multiple data ingestion options. The data ingestion pipeline is a one-time setup, and supports multiple ingestion options. In this post, we discuss a common data ingestion use case using Amazon S3, AWS Glue, and Amazon Redshift.
You can configure Amazon Bedrock Knowledge Bases structured data retrieval to retrieve data from AWS Glue databases and S3 datasets. This setup uses automatic mounting of the Data Catalog in Amazon Redshift. With this ingestion option, you can seamlessly integrate existing S3 datasets and Data Catalog tables into your Retrieval Augmented Generation (RAG) applications with the access permissions configured through Lake Formation. The following diagram illustrates this pipeline.
The following screenshot shows the configuration options on the Amazon Bedrock console.
After the data ingestion is configured and the knowledge bases data source sync job is complete, users can ask natural language questions, and Amazon Bedrock Knowledge Bases will generate the SQL, execute the SQL against the query engine, and process it through the LLM to provide a user-friendly response. The following diagram illustrates a sample architecture of the structured data retrieval workflow.
The data retrieval workflow consists of the following steps:
- In a RAG application, the user can ask a natural language data analytics question through the chat interface, such as “What is the sales revenue for the Month of February 2025?” The natural language query is sent to Amazon Bedrock Knowledge Bases for data retrieval and processing. Amazon Bedrock Knowledge Bases generates a SQL query based on the underlying data schema configured during the knowledge base creation. The SQL query is executed against the query engine (Amazon Redshift) to retrieve data from a structured data store (AWS Glue tables). The query can include multiple joins and aggregation. The generated SQL response is sent to an LLM along with additional context to generate a response in natural language. The response is sent back to the user. The user can ask follow-up questions based on the retrieved response, such as “What is the product that generated highest revenue in this period?”
Amazon Bedrock Knowledge Bases structured data retrieval supports three different APIs to meet your data retrieval requirements:
- Retrieval and response generation – The retrieval and response generation API, similar to the solution workflow we’ve discussed, generates a SQL query, retrieves data through the query engine, and processes it through the LLM to generate a natural language response Retrieval only – The retrieval only API generates a SQL query, retrieves data through the query engine, and returns the data without processing it through an LLM Generate SQL queries – The generate SQL query API returns the raw SQL query that was generated by Amazon Bedrock Knowledge Bases, which can be used for review and further processing by applications
The following screenshot shows the configuration options on the Amazon Bedrock console.
Code resources and templates
The solution uses the following notebooks:
- Data ingestion notebook – Structured-rag-s3-glue-ingestion includes the step-by-step guide to ingest an open dataset to Amazon S3, configure AWS Glue tables using crawlers, and set up the Amazon Redshift Serverless query engine. Structured data retrieval notebook – Structured-rag-s3-glue-retrieval walks through the implementation steps and provides sample code for configuring Amazon Bedrock Knowledge Bases structured data retrieval using Amazon S3, AWS Glue, and the Amazon Redshift query engine.
For more details, refer to the GitHub repo.
Prerequisites
To implement the solution provided in this post, you must have an AWS account. Additionally, access to the required foundation models must be enabled in Amazon Bedrock.
Set up the data ingestion pipeline
To set up the data ingestion pipeline, we load the sample dataset in an S3 bucket and configure AWS Glue as data storage and a Redshift Serverless workgroup as the query engine. Complete the following steps in data ingestion notebook:
- For data ingestion, download the following sample ecommerce dataset, convert it to a pandas data frame, and upload it to an S3 bucket using Amazon SageMaker Data Wrangler. Create an AWS Glue database and table using an AWS Glue crawler by crawling the source S3 bucket with the dataset. You can update this step to crawl your own S3 bucket or use your existing Data Catalog tables as storage metadata. Use the data ingestion notebook to create a Redshift Serverless namespace and workgroup in the default VPC. If you plan to use your own Redshift Serverless workgroup or Amazon Redshift provisioned cluster, you can skip this step.
Set up the structured data retrieval solution
In this section, we detail the steps to set up the structured data retrieval component of the solution.
Amazon Bedrock Knowledge Bases supports multiple data access patterns, including AWS Identity and Access Management (IAM), AWS Secrets Manager, and database users. For this post, we demonstrate the setup option with IAM access. You can use IAM access with the Redshift Serverless workgroup configured as part of the ingestion workflow or an existing Redshift Serverless or provisioned cluster to compete these steps.
Complete the following steps in structured data retrieval notebook:
- Create an execution role with the necessary policies for accessing data from Amazon Redshift, AWS Glue, and the S3 bucket. Invoke the CreateKnowledgeBase API to create the knowledge base with the execution role and knowledge base configurations. In the knowledge base configuration, the AWS Glue database and tables are used as storage metadata with Amazon Redshift as the query engine. After you create the knowledge base, you must complete additional steps to make sure the IAM execution role has the necessary permissions to execute the query in Amazon Redshift and retrieve data from AWS Glue. The notebook includes the necessary instructions to create and grant database access to the execution role, and grant AWS Lake Formation permissions. The ingestion job will sync the data store schema metadata about AWS Glue database and tables with the NL2SQL module. This schema metadata will be used while generating the SQL query during structured data retrieval. After the knowledge base sync job is complete, you can use the three data retrieval APIs – retrieve and generate response, retrieval only, and generate SQL query – to query and validate the structured data retrieval solution.
For more details, refer to Create a knowledge base by connecting to a structured data store.
Clean up
We have included cleanup instructions in both the data ingestion and structured data retrieval notebooks to clean up resources after the end-to-end solution is implemented and validated.
Conclusion
Amazon Bedrock Knowledge Bases simplifies data analysis by converting natural language questions into SQL queries, eliminating the need for specialized database expertise. The service integrates with Amazon Redshift, AWS Glue, and Amazon S3, allowing business analysts, data scientists, and operations teams to query data directly using conversation-like questions. It maintains data security through built-in governance controls and access permissions. Customers can deploy this managed service to enable users to analyze data using natural language questions, while maintaining data integrity and security standards.
To learn more, refer to Build a knowledge base by connecting to a structured data store and Amazon Bedrock Knowledge Bases now supports structured data retrieval.
About the authors
George Belsian is a Senior Cloud Application Architect at Amazon Web Services, helping organizations navigate the complexities of cloud adoption, AI integration, and data-driven innovation. By transforming legacy systems into cloud-based platforms and incorporating AI/ML capabilities, he helps businesses create new opportunities for growth, optimize their processes, and deliver scalable solutions.
Sandeep Singh is a Senior Generative AI Data Scientist at Amazon Web Services, helping businesses innovate with generative AI. He specializes in generative AI, machine learning, and system design. He has successfully delivered state-of-the-art AI/ML-powered solutions to solve complex business problems for diverse industries, optimizing efficiency and scalability.
Mani Khanuja is a Principal Generative AI Specialist SA and author of the book Applied Machine Learning and High-Performance Computing on AWS. She leads machine learning projects in various domains such as computer vision, natural language processing, and generative AI. She speaks at internal and external conferences such AWS re:Invent, Women in Manufacturing West, YouTube webinars, and GHC 23. In her free time, she likes to go for long runs along the beach.
Gopikrishnan Anilkumar is a Principal Technical Product Manager in AWS Agentic AI organization. He has over 10 years of product management experience across a variety of domains and is passionate about AI/ML.