AWS Machine Learning Blog | June 7
Build a Text-to-SQL solution for data consistency in generative AI using Amazon Nova

This article explores how Text-to-SQL technology lets non-technical users access organizational data by generating precise, schema-specific queries, accelerating decision-making and fostering a data-driven culture. It compares three approaches to querying data with generative AI (RAG, generative BI, and Text-to-SQL), highlights the advantages of Text-to-SQL for structured data queries, and walks through building a Text-to-SQL solution with Amazon Nova models, covering dynamic schema retrieval, SQL query generation, and result formatting. It closes with example queries and the code execution flow.

💡Text-to-SQL bridges the gap between non-technical users and organizational data by translating natural language into SQL queries, empowering users to make decisions faster.

🔍Compared with RAG (Retrieval Augmented Generation) and generative BI, Text-to-SQL performs better on structured data, delivering precise, reproducible results, and is particularly well suited to operations such as filtering, counting, or aggregating data.

🤖The solution uses the Amazon Nova Pro model to translate natural language into SQL queries and the Amazon Nova Lite model to format the query results, enabling efficient data processing and user-friendly output.

⚙️The solution's core capabilities include dynamic schema context, SQL query generation, query execution, and formatted responses, which ensure query accuracy and ease of use.

🛠️Using the solution requires installing the AWS CLI, configuring Amazon Bedrock, obtaining access to Amazon Nova Lite and Amazon Nova Pro, installing Python and the related libraries, and creating a database.

Businesses rely on precise, real-time insights to make critical decisions. However, enabling non-technical users to access proprietary or organizational data without technical expertise remains a challenge. Text-to-SQL bridges this gap by generating precise, schema-specific queries that empower faster decision-making and foster a data-driven culture. The problem lies in obtaining deterministic answers—precise, consistent results needed for operations such as generating exact counts or detailed reports—from proprietary or organizational data. Generative AI offers several approaches to query data, but selecting the right method is critical to achieve accuracy and reliability.

This post evaluates the key options for querying data using generative AI, discusses their strengths and limitations, and demonstrates why Text-to-SQL is the best choice for deterministic, schema-specific tasks. We show how to effectively use Text-to-SQL using Amazon Nova, a foundation model (FM) available in Amazon Bedrock, to derive precise and reliable answers from your data.

Options for querying data

Organizations have multiple options for querying data, and the choice depends on the nature of the data and the required outcomes. This section evaluates three approaches: Retrieval Augmented Generation (RAG), generative business intelligence (BI), and Text-to-SQL. The goal is to clarify when to use each and why Text-to-SQL is optimal for deterministic, schema-based tasks.

In scenarios demanding precision and consistency, Text-to-SQL outshines RAG and generative BI by delivering accurate, schema-driven results. These characteristics make it the ideal solution for operational and structured data queries.

Solution overview

This solution uses the Amazon Nova Lite and Amazon Nova Pro large language models (LLMs) to simplify querying proprietary data with natural language, making it accessible to non-technical users.

Amazon Bedrock is a fully managed service that simplifies building and scaling generative AI applications by providing access to leading FMs through a single API. It allows developers to experiment with and customize these models securely and privately, integrating generative AI capabilities into their applications without managing infrastructure.

Within this system, Amazon Nova represents a new generation of FMs delivering advanced intelligence and industry-leading price-performance. These models, including Amazon Nova Lite and Amazon Nova Pro, are designed to handle various tasks such as text, image, and video understanding, making them versatile tools for diverse applications.
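
As a quick illustration of how an application calls these models, the following minimal sketch invokes Amazon Nova Lite through the Amazon Bedrock Converse API with boto3. It is not part of the solution code, and the prompt is only a placeholder.

import boto3

# Create a Bedrock Runtime client (assumes AWS credentials and a Region are configured)
bedrock_runtime = boto3.client("bedrock-runtime")

# Send a single user message to Amazon Nova Lite
response = bedrock_runtime.converse(
    modelId="amazon.nova-lite-v1:0",
    messages=[{"role": "user", "content": [{"text": "Explain Text-to-SQL in one sentence."}]}],
    inferenceConfig={"maxTokens": 256, "temperature": 0.2},
)

# The generated text is nested under output -> message -> content
print(response["output"]["message"]["content"][0]["text"])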

You can find the deployment code and detailed instructions in our GitHub repo.

The solution consists of the following key features: dynamic schema context retrieval, SQL query generation, query execution, and formatted responses.

The following diagram illustrates the solution architecture.

In this solution, we use Amazon Nova Pro and Amazon Nova Lite to take advantage of their respective strengths, facilitating efficient and effective processing at each stage: Amazon Nova Pro converts the natural language question into a SQL query, and Amazon Nova Lite formats the query results into a human-readable response.

By strategically deploying Amazon Nova Pro and Amazon Nova Lite in this manner, the solution makes sure that each component operates optimally, balancing performance, accuracy, and cost-effectiveness.

Prerequisites

Complete the following prerequisite steps:

1. Install the AWS Command Line Interface (AWS CLI). For instructions, refer to Installing or updating to the latest version of the AWS CLI.
2. Configure the basic settings that the AWS CLI uses to interact with AWS. For more information, see Configuration and credential file settings in the AWS CLI.
3. Make sure Amazon Bedrock is enabled in your AWS account.
4. Obtain access to Amazon Nova Lite and Amazon Nova Pro.
5. Install Python 3.9 or later, along with the required libraries (Streamlit version 1.8.0 or later, Boto3, pymssql, and environment management packages).
6. Create a Microsoft SQL Server (version 2016 or later) database with credentials to connect.
7. Create a secret in AWS Secrets Manager for the database credentials and name it mssql_secrets. For instructions, see Create an AWS Secrets Manager secret. A minimal example of creating this secret programmatically follows this list.
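
The following is a minimal sketch, not taken from the repo, of creating that secret with boto3. The JSON key names (username, password, host, database) are assumptions for illustration; match them to the keys that app.py and load_data.py actually read.

import json
import boto3

# Create the database-credentials secret used by the application
# (key names below are assumptions; align them with what the application code expects)
secrets_client = boto3.client("secretsmanager")

secrets_client.create_secret(
    Name="mssql_secrets",
    SecretString=json.dumps({
        "username": "sqladmin",
        "password": "REPLACE_ME",
        "host": "your-sql-server-endpoint",
        "database": "Sales",
    }),
)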

Our sample code uses a Microsoft SQL Server database, but the solution also works with other SQL-based database services; see the GitHub repo for the supported options.

For more information about prerequisites, refer to the GitHub repo.

Set up the development environment

In the command prompt, navigate to the folder where the code exists and run the following command:

python3.9 -m pip install -r requirements.txt --upgrade

This command installs the required libraries to run the application.

Load the sample dataset in the database

Make sure you have created a secret in Secrets Manager named mssql_secrets as mentioned in the prerequisites. If you named your secret something else, update the code in app.py (line 29) and load_data.py (line 22).

After you create the secret, run the following command from the code folder:

python load_data.py

This command creates a database named Sales with tables Products, Customers, and Orders and loads the sample data in these tables.
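
The authoritative schema and sample data live in load_data.py in the repo. Purely as an illustration, a simplified version of what that script does might look like the following; the column names are inferred from the example queries later in this post, and the connection details are placeholders (the real script reads them from the mssql_secrets secret).

import pymssql

# Connect to the SQL Server instance (placeholder credentials; the real script uses the mssql_secrets secret)
conn = pymssql.connect(server="your-sql-server-endpoint", user="sqladmin", password="REPLACE_ME")
conn.autocommit(True)
cursor = conn.cursor()

# Create the sample database and a minimal set of tables
cursor.execute("IF DB_ID('Sales') IS NULL CREATE DATABASE Sales")
cursor.execute("USE Sales")
cursor.execute("""
CREATE TABLE Products (ProductID INT PRIMARY KEY, ProductName VARCHAR(100), StockQuantity INT);
CREATE TABLE Customers (CustomerID INT PRIMARY KEY, CustomerName VARCHAR(100));
CREATE TABLE Orders (OrderID INT PRIMARY KEY, CustomerID INT, ProductID INT, Quantity INT, OrderDate DATE);
""")

# Load a few sample rows that match the example answers shown later in this post
cursor.execute("INSERT INTO Products VALUES (1, 'Smartphone', 100), (2, 'Mouse', 50)")
cursor.execute("INSERT INTO Customers VALUES (1, 'Alice Johnson'), (2, 'Ivy Martinez')")
cursor.execute("INSERT INTO Orders VALUES (1, 1, 1, 1, '2023-10-14'), (2, 2, 1, 2, '2023-10-15')")

# The queries in this post run against a vw_sales view; a simplified version could join the tables like this
cursor.execute("""
CREATE VIEW vw_sales AS
SELECT c.CustomerName, p.ProductName, p.StockQuantity, o.Quantity, o.OrderDate
FROM Orders o
JOIN Customers c ON o.CustomerID = c.CustomerID
JOIN Products p ON o.ProductID = p.ProductID
""")
conn.close()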

Run the application

To run the application, execute the following command:

streamlit run app.py

Example queries

In this section, we explore some sample queries.

For our first query, we ask “Who are the customers who bought smartphones?” This generates the following SQL:

SELECT DISTINCT CustomerName, ProductName, SUM(Quantity) AS TotalSoldQuantity FROM vw_sales WHERE ProductName LIKE '%smartphone%' GROUP BY CustomerName, ProductName, OrderDate; 

We get the following formatted response:

    1. Alice Johnson, who bought 1 smartphone on October 14th, 2023.
    2. Ivy Martinez, who bought 2 smartphones on October 15th, 2023.

Next, we ask “How many smartphones are in stock?” This generates the following SQL:

SELECT DISTINCT ProductName, StockQuantity AS AvailableQuantity FROM vw_sales WHERE ProductName LIKE '%smartphone%'; 

We get the response “There are 100 smartphones currently in stock.”

Code execution flow

In this section, we explore the code execution flow. The code reference is from the GitHub repo. Do not run the different parts of the code individually.

Retrieve schema dynamically

Use INFORMATION_SCHEMA views to extract schema details dynamically (code reference from app.py):

def get_schema_context(db_name, db_view_name):
    conn = connect_to_db()
    cursor = conn.cursor()
    cursor.execute(f"USE {db_name}")
    query = f"SELECT COLUMN_NAME, DATA_TYPE FROM INFORMATION_SCHEMA.COLUMNS WHERE TABLE_NAME = '{db_view_name}'"
    cursor.execute(query)
    schema = cursor.fetchall()
    print("Schema:", schema)
    return '\n'.join([f"- {row[0]}: {row[1]}" for row in schema])

Dynamic schema retrieval adapts automatically to changes by querying metadata tables for updated schema details, such as table names and column types. This facilitates seamless integration of schema updates into the Text-to-SQL system, reducing manual effort and improving scalability.

Test this function to verify it adapts automatically when schema changes occur.

Before generating SQL, fetch schema details for the relevant tables to facilitate accurate query construction.
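
The connect_to_db helper called in get_schema_context is defined in the repo rather than shown here. A minimal sketch, assuming the mssql_secrets secret stores the server, user name, and password under the key names used below, could look like this:

import json
import boto3
import pymssql

def connect_to_db():
    # Fetch database credentials from the mssql_secrets secret
    # (the key names are assumptions; align them with the secret you created)
    secret = boto3.client("secretsmanager").get_secret_value(SecretId="mssql_secrets")
    creds = json.loads(secret["SecretString"])

    # Open a connection to the SQL Server instance
    return pymssql.connect(
        server=creds["host"],
        user=creds["username"],
        password=creds["password"],
    )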

Generate a SQL query using Amazon Nova Pro

Send the user query and schema context to Amazon Nova Pro (code reference from sql_generator.py):

def generate_sql_query(question: str, schema_context: str, db_name: str, db_view_name: str = None) -> str:
    nova_client = NovaClient()

    # Base prompt with SQL generation rules
    base_prompt = """
    MS SQL DB {db_name} has one view named '{db_view_name}'.
    Always use '{db_view_name}' as the table name to generate your query.
    Create a MS SQL query by carefully understanding the question and generate the query between tags <begin sql> and </end sql>.
    The MS SQL query should select all columns from a view named '{db_view_name}'.
    In your SQL query always use a LIKE condition in the WHERE clause.
    If a question is asked about product stock then always use 'distinct' in your SQL query.
    Never generate an SQL query which gives an error upon execution.

    Question: {question}

    Database Schema : {schema_context}

    Generate SQL query:
    """

    # Format the prompt with the question and schema context
    formatted_prompt = base_prompt.format(
        question=question,
        db_name=db_name,
        db_view_name=db_view_name if db_view_name else "No view name provided",
        schema_context=schema_context if schema_context else "No additional context provided"
    )

    # Invoke Nova model
    response = nova_client.invoke_model(
        model_id='amazon.nova-pro-v1:0',
        prompt=formatted_prompt,
        temperature=0.1  # Lower temperature for more deterministic SQL generation
    )

    # Extract the SQL query from the response
    sql_match = extract_sql_from_nova_response(response)
    if sql_match:
        return sql_match
    else:
        raise ValueError("No SQL query found in the response")


def extract_sql_from_nova_response(response):
    try:
        # Navigate the nested dictionary structure
        content = response['output']['message']['content']
        # Get the text from the first content item
        text = content[0]['text']

        # Find the positions of begin and end tags
        begin_tag = "<begin sql>"
        end_tag = "</end sql>"
        start_pos = text.find(begin_tag)
        end_pos = text.find(end_tag)

        # If both tags are found, extract the SQL between them
        if start_pos != -1 and end_pos != -1:
            # Add length of begin tag to start position to skip the tag itself
            sql_query = text[start_pos + len(begin_tag):end_pos].strip()
            return sql_query

        return None

    except (KeyError, IndexError):
        # Return None if the expected structure is not found
        return None

This code establishes a structured context for a text-to-SQL use case, guiding Amazon Nova Pro to generate SQL queries based on a predefined database schema. It provides consistency by defining a static database context that clarifies table names, columns, and relationships, helping prevent ambiguity in query formation. Queries are required to reference the vw_sales view, standardizing data extraction for analytics and reporting. Additionally, whenever applicable, the generated queries must include quantity-related fields, making sure that business users receive key insights on product sales, stock levels, or transactional counts. To enhance search flexibility, the LLM is instructed to use the LIKE operator in WHERE conditions instead of exact matches, allowing for partial matches and accommodating variations in user input. By enforcing these constraints, the code optimizes Text-to-SQL interactions, providing structured, relevant, and business-aligned query generation for sales data analysis.
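
The NovaClient class used in this snippet (and again in the formatting step later) is part of the repo and not reproduced in the post. The sketch below shows one way such a wrapper could be built on the Amazon Bedrock Converse API so that the returned dictionary keeps the output.message.content structure that extract_sql_from_nova_response expects; the implementation details are assumptions, and only the call signature mirrors the code above.

import boto3

class NovaClient:
    """Illustrative wrapper around the Amazon Bedrock Converse API (not the repo's actual implementation)."""

    def __init__(self, region_name=None):
        # bedrock-runtime is the service endpoint used to invoke foundation models
        self.client = boto3.client("bedrock-runtime", region_name=region_name)

    def invoke_model(self, model_id, prompt, temperature=0.5, max_tokens=2048):
        # Send the prompt as a single user message and return the raw Converse response,
        # which exposes the generated text under output -> message -> content
        return self.client.converse(
            modelId=model_id,
            messages=[{"role": "user", "content": [{"text": prompt}]}],
            inferenceConfig={"temperature": temperature, "maxTokens": max_tokens},
        )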

Execute a SQL query

Run the SQL query on the database and capture the result (code reference from app.py):

cursor.execute(sql_command)
result = cursor.fetchall()
print(result)

Format the query results using Amazon Nova Lite

Send the database result from the SQL query to Amazon Nova Lite to format it in a human-readable format and print it on the Streamlit UI (code reference from app.py):

def interact_with_nova(user_input, llm_query, query_response, model="nova"):
    session = boto3.session.Session()
    region = session.region_name

    nova_client = NovaClient(region_name=region)

    final_prompt = f"""Human: You are an expert chatbot who is happy to assist the users. The user question is given in the <Question> tag and the query results in the <query_response> tag. Understand the question and use the information from <query_response> to generate an answer. If there is more than one entry, give a numbered list. Never return <Question> and <query_response> in your response.
    for example : question - "How many mouse were sold?"
                  llm response :
                                 " There were 3 mouse sold in total.
                                 - 1 mouse sold to Mia Perez on October 2nd, 2023.
                                 - 2 mouse sold to Jack Hernandez on October 1st 2023."
    <Question>
    {user_input}
    </Question>
    <query_response>
    {query_response}
    </query_response>"""

    try:
        response = nova_client.invoke_model(
            model_id='amazon.nova-lite-v1:0',
            prompt=final_prompt,
            max_tokens=4096,
            temperature=0.7
        )

        content = response['output']['message']['content']
        text = content[0]['text']
        return text

    except Exception as e:
        print(f"Error in LLM interaction: {str(e)}")
        return "Sorry, an error occurred while processing your request."

Clean up

Follow these steps to clean up resources in your AWS environment and avoid incurring future costs:

1. Clean up database resources: delete the Microsoft SQL Server database (including the sample Sales database and its tables) created for this walkthrough.
2. Clean up security resources: delete the mssql_secrets secret in AWS Secrets Manager.
3. Clean up the frontend (only if hosting the Streamlit application on Amazon EC2):
   - Stop the EC2 instance hosting the Streamlit application.
   - Delete associated storage volumes.
4. Clean up additional resources (if applicable):
   - Remove Elastic Load Balancers.
   - Delete virtual private cloud (VPC) configurations.
5. Check the AWS Management Console to confirm all resources have been deleted.

Conclusion

Text-to-SQL with Amazon Bedrock and Amazon Nova LLMs provides a scalable solution for deterministic, schema-based querying. By delivering consistent and precise results, it empowers organizations to make informed decisions, improve operational efficiency, and reduce reliance on technical resources.

For a more comprehensive example of a Text-to-SQL solution built on Amazon Bedrock, explore the GitHub repo Setup Amazon Bedrock Agent for Text-to-SQL Using Amazon Athena with Streamlit. This open source project demonstrates how to use Amazon Bedrock and Amazon Nova LLMs to build a robust Text-to-SQL agent that can generate complex queries, self-correct, and query diverse data sources.

Start experimenting with Text-to-SQL use cases today by getting started with Amazon Bedrock.


About the authors

Mansi Sharma is a Solutions Architect for Amazon Web Services. Mansi is a trusted technical advisor helping enterprise customers architect and implement cloud solutions at scale. She drives customer success through technical leadership, architectural guidance, and innovative problem-solving while working with cutting-edge cloud technologies. Mansi specializes in generative AI application development and serverless technologies.

Marie Yap is a Principal Solutions Architect for Amazon Web Services.  In this role, she helps various organizations begin their journey to the cloud. She also specializes in analytics and modern data architectures.
