AWS Machine Learning Blog 2024年08月20日
Unlock the power of structured data for enterprises using natural language with Amazon Q Business
index_new5.html
../../../zaker_core/zaker_tpl_static/wap/tpl_guoji1.html

 

探讨在企业环境中利用亚马逊Q Business查询结构化数据的架构,包括解决自然语言与结构化数据的连接问题,以及详细的工作流程和各个组件的作用。

🎯亚马逊Q Business作为中介,将自然语言转化为精确的SQL查询,简化非技术用户的数据访问并优化专业人员的工作流程。它可根据用户提问分析意图、访问数据源并生成SQL查询。

💻在该架构中,用户通过Streamlit应用发起交互,应用通过多种步骤实现用户认证并与亚马逊Q Business进行通信,最终运行Athena查询并在应用界面展示结果。

🧩查询流程包括用户意图、提示构建器、SQL查询生成器、运行查询和查询结果等组件。提示构建器通过自然语言处理、实体识别和意图映射等子任务,将用户问题转化为结构化表示。

📊SQL查询生成基于提示构建器生成的提示和原始问题,涉及将意图和实体映射到SQL子句,处理复杂查询并融入领域知识和业务规则。

One of the most common applications of generative artificial intelligence (AI) and large language models (LLMs) in an enterprise environment is answering questions based on the enterprise’s knowledge corpus. Pre-trained foundation models (FMs) excel at natural language understanding (NLU) tasks, including summarization, text generation, and question answering across a wide range of topics. However, they often struggle to provide accurate answers without hallucinations and fall short when addressing questions about content that wasn’t included in their training data. Furthermore, FMs are trained with a point-in-time snapshot of data and have no inherent ability to access fresh data at inference time; therefore, they might provide responses that are incorrect or inadequate.

We face a fundamental challenge with enterprise data—overcoming the disconnect between natural language and structured data. Natural language is ambiguous and imprecise, whereas data adheres to rigid schemas. For example, SQL queries can be complex and unintuitive for non-technical users. Handling complex queries involving multiple tables, joins, and aggregations makes it difficult to interpret user intent and translate it into correct SQL operations. Domain-specific terminology further complicates the mapping process. Another challenge is accommodating the linguistic variations users employ to express the same requirement. Effectively managing synonyms, paraphrases, and alternative phrasings is important. The inherent ambiguity of natural language can also result in multiple interpretations of a single query, making it difficult to accurately understand the user’s precise intent.

To bridge this gap, you need advanced natural language processing (NLP) to map user queries to database schema, tables, and operations. In this architecture, Amazon Q Business acts as an intermediary, translating natural language into precise SQL queries. You can simply ask questions like “What were the sales for outdoor gear in Q3 2023?” Amazon Q Business analyzes intent, accesses data sources, and generates the SQL query. This simplifies data access for your non-technical users and streamlines workflows for professionals, allowing them to focus on higher-level tasks.

In this post, we discuss an architecture to query structured data using Amazon Q Business, and build out an application to query cost and usage data in Amazon Athena with Amazon Q Business. Amazon Q Business can create SQL queries to your data sources when provided with the database schema, additional metadata describing the columns and tables, and prompting instructions. You can extend this architecture to use additional data sources, query validation, and prompting techniques to cover a wider range of use cases.

Solution overview

The following figure represents the high-level architecture of the proposed solution. Steps 3 and 4 augment the AWS IAM Identity Center integration with Amazon Q Business for an authorization flow. In this architecture, we use Amazon Cognito for user authentication as well as a trusted token issuer to IAM Identity Center. You can also use your own identity provider as a trusted token issuer as long as it supports OpenID Connect (OIDC).

The workflow includes the following steps:

    The user initiates the interaction with the Streamlit application, which is accessible through an Application Load Balancer, acting as the entry point. The application prompts the user to authenticate using their Amazon Cognito credentials, maintaining secure access. The application exchanges the token obtained from Amazon Cognito for an IAM Identity Center token, granting the necessary scope to interact with Amazon Q Business. Using the IAM Identity Center token, the application assumes an AWS Identity and Access Management (IAM) role and retrieves an AWS session from AWS Security Token Service (AWS STS), enabling authorized communication with Amazon Q Business. Based on the user’s natural language query, the application formulates relevant prompts and metadata, which are then submitted to the chat_sync API of Amazon Q Business. In response, Amazon Q Business provides an appropriate Athena query to run. The application runs the Athena query received from Amazon Q Business, and the resulting data is displayed on the web application’s UI.

Querying Amazon Q Business LLMs directly

As explained in the response settings for Amazon Q Business, there are different options to generate responses that allow you to either use your enterprise data, use LLMs directly, or fall back on the LLMs if the answer is not found in your enterprise data. Along with the global controls for response settings, you need to specify which chatMode you want to use based on your specific use case. If you want to bypass Retrieval Augmented Generation (RAG) and use plain text in the context window, you should use CREATOR_MODE. Alternatively, RAG is also bypassed when you upload files directly in the context window.

If you just use text in the context window and call Amazon Q Business APIs without switching to CREATOR_MODE, that may break your use case in the future if you add content to the index (RAG). In this use case, because we’re not indexing any data and using schemas as attachments in the API call to Amazon Q Business, RAG is automatically bypassed and the response is generated directly from the LLMs. Another reason to use attachments for this use case is that for the chatSync API, userMessage has a maximum length of 7,000, which can be surpassed depending on how large your text is in the context window.

Data query workflow

Let’s look at the prompts, query generation, and Athena query in detail. We use Athena as the data store in this post. Users enter natural language questions into a web application built with Streamlit. Amazon Q Business converts the natural language questions to valid SQL for Athena using the prompting instructions, the database schema, and data dictionary that are provided as context to the LLM. The generated SQL is sent to Athena to run as a query, and the returned data is displayed to the user in the Streamlit application. The following diagram illustrates this workflow.

These are the various components to this data flow, as numbered in the diagram:

    User intent Prompt builder SQL query generator Running the query Query results

In the following sections, we look at each component in more detail.

User intent

The user intent or your inquiry is the starting point of the process. It can be in natural language, such as “What was the total spend for ElasticSearch last year?” The user’s input serves as the basis for the subsequent steps in the workflow.

Prompt builder

The prompt builder component plays a crucial role in bridging the gap between your natural language input and the structured data format required for SQL querying. It augments your question with relevant information from the table schema and data dictionary to provide context for the query generation process. This step involves the following sub-tasks:

The following is an example prompt for “What was the total spend for ElasticSearch last year?”

You will not respond to gibberish, random character sequences, or prompts that do not make logical sense. If the input the input does not make sense or is outside the scope of the provided context, do not respond with SQL but respond with - I do not know about this. Please fix your input.You are an expert SQL developer. Only return the sql query. Do not include any verbiage. You are required to return SQL queries based on the provided schema and the service mappings for common services and their synonyms. The table with the provided schema is the only source of data. Do not use joins. Assume product, service are synonyms for product_servicecode and price,cost,spend are synonymns for line_item_unblended_cost. Use the column names from the provided schema while creating queries. Do not use preceding zeroes for the column month when creating the query. Only use predicates when asked. For your reference, current date is June 01, 2024. write a sql query for this task - What was the total spend for ElasticSearch last year?

SQL query generation

Based on the prompt generated from the prompt builder and your original question, Amazon Q Business generates the corresponding SQL query. The SQL query is tailored to retrieve the relevant data and perform the desired analysis or calculations to accurately answer the user’s question. This step may involve techniques such as:

Running the query

In this step, the generated SQL query is run against the chosen data store, which could be a relational database, data warehouse, NoSQL database, or an object store like Amazon Simple Storage Service (Amazon S3). The data store serves as the repository for the data required to answer the user’s question. Depending on the architecture and requirements, the data store query may involve additional components or processes, such as:

Query results

The query engine runs the generated SQL query against the data store and returns the query results. These results contain the insights or answers to the original user question. The presentation of the query results can take various forms, depending on the requirements of the application or UI:

In the following sections, we walk through the steps to deploy the web application and test the solution.

Prerequisites

Complete the following prerequisite steps:

    Set up IAM Identity Center and add users that you intend to give access to in your Amazon Q Business application. Have an existing, working Amazon Q Business application and give access to the users created in the previous step to the application. AWS Cost and Usage Reports (AWS CUR) data is available in Athena. If you have CUR data, you can skip the following steps for CUR data setup. If not, you have a few options to set up CUR data:
      To set up sample CUR data, refer to the following lab and follow the instructions. You also need to set up an AWS Glue crawler to make the data available in Athena.
    If you already have an SSL certificate, you can skip this step; otherwise, generate a private certificate. Import the certificate into AWS Certificate Manager (ACM). For more details, refer to Importing a certificate.

Set up the application

Complete the following steps to set up the application:

    From your terminal, clone the GitHub repository:
git clone https://github.com/aws-samples/data-insights-with-amazon-q-business.git
    Go to the project directory:
cd data-insights-with-amazon-q-business
    Based on your CUR table, update the CUR schema under app/schemas/cur_schema.txt. Review the prompts under app/qb_config.py. The schema looks similar to the following code:

    Review the data dictionary under app/schemas/service_mappings.csv. You can modify the mappings according to your dataset. A sample data dictionary for CUR might look like the following screenshot.

    Zip up the code repository and upload it to an S3 bucket. Follow the steps in the GitHub repo to deploy the Streamlit application.

Access the web application

As part of the deployment steps, you launched an AWS CloudFormation stack. On the AWS CloudFormation console, navigate to the Outputs tab for the stack and find the URL to access the Streamlit application. When you open the URL in a browser, you’ll see a login screen like the following screenshot. Sign up to create a user in the Amazon Cognito user pool. After you’re validated, you can use the same credentials to log in to the web application.

Query your cost and usage data

Start with a simple query like “What was the total spend for ElasticSearch this year?” A relevant prompt will be created and sent to Amazon Q Business. It will respond back with the corresponding SQL query. Notice the predicate where product_servicecode = ‘AmazonES’. Amazon Q Business is able to formulate the query because it has the schema and the data dictionary in context. It understands that ElasticSearch is an AWS service represented by a column named product_servicecode in the CUR data schema and its corresponding value of ‘AmazonES’. Next, the query is run against Athena and you get the results back.

The sample dataset used in this post is from 2023. If you’re using the sample dataset, natural language queries referring to current year will give not return results. Modify your queries to 2023 or mention the year in the user intent.

The following figure highlights the steps as explained in the data flow.

You can also try complex queries likeGive me a list of the top 3 products by total spend last year. For each of these products, what percentage of the overall spend is from this product?” Because the prompt builder has schema and product (AWS services) information in its context, Amazon Q Business creates the corresponding query. In this case, you’ll see a query similar to the following:

SELECT product_servicecode,SUM(line_item_unblended_cost) AS total_spend,ROUND(SUM(line_item_unblended_cost) * 100.0 / (SELECT SUM(line_item_unblended_cost)FROM cur_daily WHERE year = '2023'), 2) AS percentage_of_totalFROM cur_dailyWHERE year = '2023'GROUP BY product_servicecodeORDER BY total_spend DESCLIMIT 3;

When the query is run against Athena, you’ll see similar results corresponding to your data.

Along with the data, you can also see a summary and trend analysis of your data on the Description tab of your Streamlit app.

The prompts used in the application are open domain and you’re free to update them in the code. For example, the following is a prompt used for a summary task:

You are an AI assistant. You are required to return a summary based on the provided data in attachment. Use atleast 100 words. The spend is in dollars. The unit of measurement is dollars. Give trend analysis too. Start your response with - Here is your summary..

The following screenshot shows the results.

Feedback loop

You also have the option of capturing feedback for the generated queries with the thumbs up/down icon on the web application. Currently, the feedback is captured in a local file under /app/feedback. You can change this implementation to write to a database of your choice and have it serve as a query validation mechanism after your testing, to allow only validated queries to run.

Clean up

To clean up your resources, delete the CloudFormation stack, Amazon Q Business application, and Athena tables.

Conclusion

In this post, we demonstrated how Amazon Q Business can effectively bridge the gap between users and data, enabling you to extract valuable insights from various data stores using natural language queries, without the need for extensive technical knowledge or SQL expertise. The natural language understanding capabilities of Amazon Q Business can accurately interpret user intent, extract relevant entities, and generate SQL to translate the user’s query into executable data operations. You can now empower a wider range of enterprise users to unlock the full value of your organization’s data assets. By democratizing data access and analysis using natural language queries, you can foster data-driven decision-making, drive innovation, and unlock new opportunities for growth and success.

In Part 2 of this series, we demonstrate how to integrate this architecture with LangChain using Amazon Q Business as a custom model. We also cover query validation and accuracy measurement.


About the Authors

Vishal Karlupia is a Senior Technical Account Manager/Lead at Amazon Web Services, Toronto. He specializes in generative AI applications and helps customers build and scale their AI/ML workloads on AWS. Outside of work, he enjoys being outdoors and keeping bonfires alive.

Srinivas Ganapathi is a Principal Technical Account Manager at Amazon Web Services. He is based in Toronto, Canada, and works with games customers to run efficient workloads on AWS.

Fish AI Reader

Fish AI Reader

AI辅助创作,多种专业模板,深度分析,高质量内容生成。从观点提取到深度思考,FishAI为您提供全方位的创作支持。新版本引入自定义参数,让您的创作更加个性化和精准。

FishAI

FishAI

鱼阅,AI 时代的下一个智能信息助手,助你摆脱信息焦虑

联系邮箱 441953276@qq.com

相关标签

亚马逊Q Business 结构化数据查询 自然语言处理 SQL查询 工作流程
相关文章