Accuracy evaluation framework for Amazon Q Business

In the first post of this series, we introduced a comprehensive evaluation framework for Amazon Q Business, a fully managed Retrieval Augmented Generation (RAG) solution that uses your company’s proprietary data without the complexity of managing large language models (LLMs). The first post focused on selecting appropriate use cases, preparing data, and implementing metrics to support a human-in-the-loop evaluation process.

In this post, we dive into the solution architecture necessary to implement this evaluation framework for your Amazon Q Business application. We explore two distinct evaluation solutions:

Comprehensive evaluation workflow

AWS CloudFormation

Lightweight AWS Lambda based evaluation

AWS Lambda

By the end of this post, you will understand how to implement an evaluation framework that fits your specific needs, with a detailed walkthrough to help make sure your Amazon Q Business application delivers accurate and reliable results.

Challenges in evaluating Amazon Q Business

Evaluating the performance of Amazon Q Business, which uses a RAG model, presents several challenges because it integrates retrieval and generation components. It’s crucial to identify which aspects of the solution need evaluation. For Amazon Q Business, both the retrieval accuracy and the quality of the answer output are important factors to assess. In this section, we discuss the key metrics to include when evaluating a RAG-based generative AI solution.

Context recall

Context recall measures the extent to which all relevant content is retrieved. High recall provides comprehensive information gathering but might introduce extraneous data.
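To make the definition concrete, the following is a minimal sketch that treats context recall as the fraction of ground-truth statements supported by the retrieved context. The function names, the keyword-based judge, and the sample statements are illustrative assumptions; an LLM-aided evaluator such as Ragas would use a model, not keyword matching, to decide whether a statement is supported.

```python
from typing import Callable, Sequence


def context_recall(
    ground_truth_statements: Sequence[str],
    retrieved_contexts: Sequence[str],
    is_supported: Callable[[str, Sequence[str]], bool],
) -> float:
    """Fraction of ground-truth statements that the retrieved context supports."""
    if not ground_truth_statements:
        raise ValueError("at least one ground-truth statement is required")
    supported = sum(
        1 for s in ground_truth_statements if is_supported(s, retrieved_contexts)
    )
    return supported / len(ground_truth_statements)


def keyword_judge(statement: str, contexts: Sequence[str]) -> bool:
    """Toy stand-in for an LLM judge (assumption): a statement counts as
    supported when every one of its words occurs in a single context."""
    words = statement.lower().split()
    return any(all(w in c.lower() for w in words) for c in contexts)


# Illustrative data only: two of the three statements appear in the context.
statements = ["50 states", "borders canada", "borders mexico"]
contexts = ["The United States has 50 states and borders Canada to the north."]
score = context_recall(statements, contexts, keyword_judge)
print(round(score, 2))
```

A score near 1.0 means the retriever surfaced most of what the ground truth needs; a low score points at missing documents or chunks rather than at the generation step.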

For example, a user might ask the question “What can you tell me about the geography of the United States?” They could get the following responses:

Expected:

High context recall:

Low context recall:

The following diagram illustrates the context recall workflow.

Context precision

Context precision assesses the relevance and conciseness of retrieved information. High precision indicates that the retrieved information closely matches the query intent, reducing irrelevant data.
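One common way to turn this idea into a number, and as far as we understand the approach Ragas takes, is to average precision@k over the ranks that hold a relevant chunk, so relevant chunks ranked near the top raise the score. The sketch below assumes that a judge (human or LLM) has already labeled each retrieved chunk as relevant or not; the function name is hypothetical.

```python
from typing import Sequence


def context_precision(relevance: Sequence[bool]) -> float:
    """Average of precision@k taken at each rank k that holds a relevant chunk.

    `relevance[i]` is True when the chunk at rank i + 1 is relevant to the query.
    Returns 0.0 when no retrieved chunk is relevant.
    """
    hits = 0
    precision_sum = 0.0
    for k, relevant in enumerate(relevance, start=1):
        if relevant:
            hits += 1
            precision_sum += hits / k  # precision@k at this relevant rank
    return precision_sum / hits if hits else 0.0


# The same two relevant chunks score higher when they are ranked first.
print(context_precision([True, True, False]))
print(context_precision([False, True, True]))
```

Because the score depends on rank order, it rewards retrievers that place the most useful chunks first, not only those that include them somewhere in the results.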

For example, the question “Why is Silicon Valley great for tech startups?” might give the following answers:

Ground truth answer:

High precision context:

Low precision context:

The following diagram illustrates the context precision workflow.

Answer relevancy

Answer relevancy evaluates whether responses fully address the query without unnecessary details. Relevant answers enhance user satisfaction and trust in the system.
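A frequently used formulation, which we believe is close to what Ragas does, asks an LLM to generate questions that the answer would plausibly respond to, embeds them, and averages their cosine similarity to the original question. The sketch below covers only the similarity arithmetic; the embeddings are hand-written toy vectors, and in practice they would come from an embedding model.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def answer_relevancy(
    question_embedding: Vector,
    generated_question_embeddings: Sequence[Vector],
) -> float:
    """Mean similarity between the user question and questions derived from the answer."""
    if not generated_question_embeddings:
        raise ValueError("at least one generated question embedding is required")
    similarities = [
        cosine_similarity(question_embedding, e)
        for e in generated_question_embeddings
    ]
    return sum(similarities) / len(similarities)


# Toy vectors (assumption): one derived question matches the original, one is unrelated.
original = [1.0, 0.0]
derived = [[1.0, 0.0], [0.0, 1.0]]
print(answer_relevancy(original, derived))
```

An answer padded with unrelated details tends to yield derived questions that drift away from the original, which lowers the average.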

For example, a user might ask the question “What are the key features of Amazon Q Business Service, and how can it benefit enterprise customers?” They could get the following answers:

High relevance answer:

Low relevance answer:

The following diagram illustrates the answer relevancy workflow.

Truthfulness

Truthfulness verifies factual accuracy by comparing responses to verified sources. Truthfulness is crucial to maintain the system’s credibility and reliability.
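As a rough illustration, truthfulness can be scored as the share of claims in the answer that the verified context supports. Ragas exposes a comparable metric under the name faithfulness. The `ClaimVerdict` type and the sample claims below are hypothetical; in an LLM-aided setup, a model would extract the claims and produce the verdicts.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class ClaimVerdict:
    claim: str
    supported: bool  # True when the verified context backs the claim


def truthfulness(verdicts: Sequence[ClaimVerdict]) -> float:
    """Share of the answer's claims that are supported by the context."""
    if not verdicts:
        raise ValueError("the answer must contain at least one claim")
    return sum(v.supported for v in verdicts) / len(verdicts)


# Illustrative verdicts: one supported claim and one unsupported claim.
verdicts = [
    ClaimVerdict("The capital of Canada is Ottawa.", True),
    ClaimVerdict("Ottawa is in the province of Quebec.", False),
]
print(truthfulness(verdicts))
```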

For example, a user might ask “What is the capital of Canada?” They could get the following responses:

Context:

High truthfulness answer:

Low truthfulness answer:

The following diagram illustrates the truthfulness workflow.

Evaluation methods

Deciding who conducts the evaluation can significantly affect the results. Options include:

Human-in-the-Loop (HITL)

LLM-aided evaluation

Ragas framework

Each of these preparatory and evaluative steps contributes to a structured approach to evaluating the accuracy and effectiveness of Amazon Q Business in supporting enterprise needs.

Solution overview

In this post, we explore two solutions in enough detail that you can use the evaluation framework and adapt it for your own use case.

Solution 1: End-to-end evaluation solution

For a quick start evaluation framework, this solution uses a hybrid approach with Ragas (automated scoring) and HITL evaluation for robust accuracy and reliability. The architecture includes the following components:

User access and UI

Evaluation solution infrastructure

Amazon DynamoDB

Amazon Simple Queue Service

Ragas scoring

HITL review

By integrating a metric-based approach with human validation, this architecture makes sure Amazon Q Business delivers accurate, relevant, and trustworthy responses for enterprise users. This solution further enhances the evaluation process by incorporating HITL reviews, enabling human feedback to refine automated scores for higher precision.

A quick video demo of this solution is shown below:

Solution architecture

The solution architecture is designed with the following core functionalities to support an evaluation framework for Amazon Q Business:

User access and UI

Amazon Cognito

Amazon Simple Storage Service

AWS IAM Identity Center

infrastructure

Application Load Balancer

Amazon Elastic Compute Cloud

Upload dataset and trigger evaluation

Consuming DynamoDB stream

Ragas scoring

HITL review – Authenticated users can review and refine Ragas scores directly through the UI, providing nuanced and accurate evaluations by incorporating human insights into the process.

This architecture uses AWS services to deliver a scalable, secure, and efficient evaluation solution for Amazon Q Business, combining automated and human-driven evaluations.
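The stream-consumption step in this architecture can be sketched as a Lambda handler that picks newly inserted rows out of a DynamoDB stream event. This is a minimal sketch under stated assumptions: the attribute names `prompt` and `ground_truth` mirror the dataset columns described in this post, but the actual table schema and handler in the sample repository may differ, and the function names are hypothetical.

```python
from typing import Any, Dict, List


def extract_new_prompts(event: Dict[str, Any]) -> List[Dict[str, str]]:
    """Return prompt/ground-truth pairs from INSERT records in a DynamoDB stream event."""
    rows: List[Dict[str, str]] = []
    for record in event.get("Records", []):
        if record.get("eventName") != "INSERT":
            continue  # skip MODIFY and REMOVE events, such as later score updates
        image = record.get("dynamodb", {}).get("NewImage", {})
        # Stream images use DynamoDB typed JSON, for example {"S": "text"} for strings.
        prompt = image.get("prompt", {}).get("S")
        ground_truth = image.get("ground_truth", {}).get("S")
        if prompt and ground_truth:
            rows.append({"prompt": prompt, "ground_truth": ground_truth})
    return rows


def handler(event: Dict[str, Any], context: Any) -> Dict[str, int]:
    """Hypothetical Lambda entry point that counts the rows ready for scoring."""
    rows = extract_new_prompts(event)
    # A real deployment would send each row to Amazon Q Business and Ragas here.
    return {"rows_received": len(rows)}
```

Filtering on `INSERT` keeps the function from re-scoring a row when a later update, such as a human reviewer changing a score, writes to the same table.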

Prerequisites

For this walkthrough, you should have the following prerequisites:

run a Linux command line

An AWS account

follow the instructions

A user role with administrator access

AWS Command Line Interface

installed and configured

Amazon Bedrock

Anthropic’s Claude 3 Sonnet model

Amazon Titan Embeddings G1 – Text model

Additionally, make sure that all the resources you deploy are in the same AWS Region.

Deploy the CloudFormation stack

Complete the following steps to deploy the CloudFormation stack:

repository

files

./sample-code-for-evaluating-amazon-q-business-applications-using-ragas-main/end-to-end-solution

./deploy.sh

chmod 755 ./deploy.sh

./deploy.sh -s [CNF_STACK_NAME] -r [AWS_REGION]

You can follow the deployment progress on the AWS CloudFormation console. The deployment takes approximately 15 minutes, after which you will see a page similar to the following screenshot.

Add users to Amazon Q Business

You need to provision users for the pre-created Amazon Q Business application. Refer to Setting up for Amazon Q Business for instructions to add users.

Upload the evaluation dataset through the UI

In this section, you review and upload the following CSV file containing an evaluation dataset through the deployed custom UI.

This CSV file contains two columns: prompt and ground_truth. There are four prompts and their associated ground truth in this dataset:

What are the index types of Amazon Q Business and the features of each?

I want to use Q Apps, which subscription tier is required to use Q Apps?

What is the file size limit for Amazon Q Business via file upload?

What data encryption does Amazon Q Business support?
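Before uploading your own dataset, it can help to check that the file has the two expected columns and no empty cells. The following is a minimal sketch using only the Python standard library; the function name is hypothetical and the ground truth value in the sample is a placeholder, not the content of the actual dataset. Note that a prompt containing a comma must be quoted in the CSV file.

```python
import csv
import io
from typing import Dict, List

REQUIRED_COLUMNS = ("prompt", "ground_truth")


def load_evaluation_dataset(csv_text: str) -> List[Dict[str, str]]:
    """Parse the dataset and fail fast on missing columns or empty cells."""
    reader = csv.DictReader(io.StringIO(csv_text))
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing required column(s): {', '.join(missing)}")
    rows: List[Dict[str, str]] = []
    for row_number, row in enumerate(reader, start=1):
        prompt = (row["prompt"] or "").strip()
        ground_truth = (row["ground_truth"] or "").strip()
        if not prompt or not ground_truth:
            raise ValueError(f"data row {row_number} has an empty prompt or ground_truth")
        rows.append({"prompt": prompt, "ground_truth": ground_truth})
    return rows


# The prompt contains a comma, so it is wrapped in double quotes.
sample = (
    "prompt,ground_truth\n"
    '"I want to use Q Apps, which subscription tier is required to use Q Apps?",'
    "placeholder ground truth\n"
)
rows = load_evaluation_dataset(sample)
print(len(rows), rows[0]["prompt"])
```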

To upload the evaluation dataset, complete the following steps:

Stacks

evals

Outputs

The custom UI will redirect you to the Amazon Cognito login page for authentication.

The UI application authenticates the user with Amazon Cognito, and initiates the token exchange workflow to implement a secure ChatSync API call with Amazon Q Business.

Use the credentials you noted earlier to log in.

For more information about the token exchange flow between IAM Identity Center and the identity provider (IdP), refer to Building a Custom UI for Amazon Q Business.

Upload Dataset

dataset CSV file

After the file is uploaded, the evaluation framework sends each prompt to Amazon Q Business to generate an answer, and then sends the prompt, ground truth, and answer to Ragas for evaluation. During this process, you can also review the uploaded dataset (including the four questions and associated ground truth) on the Amazon Q Business console, as shown in the following screenshot.
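The flow described above can be sketched as a function that asks Amazon Q Business one question and shapes the response for scoring. This is a sketch under stated assumptions, not the code from the sample repository: the client is expected to behave like the boto3 `qbusiness` client and its `chat_sync` method, the record keys (`question`, `answer`, `contexts`, `ground_truth`) follow a column convention used by some Ragas versions and may need to be renamed for yours, and a stub client is included so the example runs without AWS access.

```python
from typing import Any, Dict


def build_evaluation_record(
    qbusiness_client: Any,
    application_id: str,
    prompt: str,
    ground_truth: str,
) -> Dict[str, Any]:
    """Ask Amazon Q Business one question and shape the result for scoring."""
    response = qbusiness_client.chat_sync(
        applicationId=application_id,
        userMessage=prompt,
    )
    # Each source attribution may carry the text snippet the answer relied on.
    contexts = [
        source["snippet"]
        for source in response.get("sourceAttributions", [])
        if source.get("snippet")
    ]
    return {
        "question": prompt,
        "answer": response.get("systemMessage", ""),
        "contexts": contexts,
        "ground_truth": ground_truth,
    }


class _StubQBusinessClient:
    """Stand-in for a real client so the sketch runs without AWS credentials."""

    def chat_sync(self, **kwargs: Any) -> Dict[str, Any]:
        return {
            "systemMessage": "stub answer",
            "sourceAttributions": [{"title": "doc", "snippet": "stub context"}],
        }


record = build_evaluation_record(
    _StubQBusinessClient(),
    "example-app-id",  # hypothetical application ID
    "What data encryption does Amazon Q Business support?",
    "placeholder ground truth",
)
print(record["answer"], record["contexts"])
```

In a real deployment, the stub would be replaced by a client created with the identity-aware credentials obtained through the token exchange workflow.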

After about 7 minutes, the workflow will finish, and you should see the evaluation result for the first question.

Perform HITL evaluation

After the Lambda function has completed its execution, the Ragas scores are shown in the custom UI. You can now review the metric scores generated by Ragas (an LLM-aided evaluation method) and provide human feedback as an evaluator for further calibration. This human-in-the-loop calibration can further improve evaluation accuracy, because the HITL process is particularly valuable in fields where human judgment, expertise, or ethical considerations are crucial.

Let’s review the first question: “What are the index types of Amazon Q Business and the features of each?” You can read the question, the answer generated by Amazon Q Business, the ground truth, and the context.

Next, review the evaluation metrics scored by using Ragas. As discussed earlier, there are four metrics:

Answer relevancy

Truthfulness

Context precision

Context recall

For this question, all four metrics indicate that Amazon Q Business produced a high-quality response. It’s worthwhile to compare your own evaluation with these scores generated by Ragas.

Next, let’s review a question that returned with a low answer relevancy score. For example: “I want to use Q Apps, which subscription tier is required to use Q Apps?”

Analyzing both the question and the answer, we can consider the answer relevant and aligned with the user question, but the answer relevancy score from Ragas doesn’t reflect this human analysis and is lower than expected. This is where human-in-the-loop calibration of the Ragas judgment matters. Read the question and answer carefully, and adjust the metric score as needed to reflect your HITL analysis. Finally, the results will be updated in DynamoDB.
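One way to persist a reviewer's adjustment is to write the human score next to the Ragas score instead of overwriting it. The sketch below builds the keyword arguments for a boto3 DynamoDB `Table.update_item` call. The key name `id`, the `hitl_` attribute prefix, and the metric attribute names are assumptions for illustration; the deployed solution's table schema may differ.

```python
from decimal import Decimal
from typing import Any, Dict

METRICS = ("answer_relevancy", "truthfulness", "context_precision", "context_recall")


def build_score_update(item_id: str, human_scores: Dict[str, float]) -> Dict[str, Any]:
    """Build keyword arguments for a DynamoDB Table.update_item call."""
    if not human_scores:
        raise ValueError("at least one human score is required")
    names: Dict[str, str] = {}
    values: Dict[str, Decimal] = {}
    assignments = []
    for i, (metric, score) in enumerate(sorted(human_scores.items())):
        if metric not in METRICS:
            raise ValueError(f"unknown metric: {metric}")
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"{metric} must be between 0 and 1, got {score}")
        names[f"#m{i}"] = f"hitl_{metric}"  # hypothetical attribute name
        values[f":v{i}"] = Decimal(str(score))  # boto3 tables store numbers as Decimal
        assignments.append(f"#m{i} = :v{i}")
    return {
        "Key": {"id": item_id},  # hypothetical partition key
        "UpdateExpression": "SET " + ", ".join(assignments),
        "ExpressionAttributeNames": names,
        "ExpressionAttributeValues": values,
    }


update = build_score_update("question-2", {"answer_relevancy": 0.9})
print(update["UpdateExpression"])
```

Keeping the human score in a separate attribute preserves the original Ragas value, which makes it possible to measure later how often, and by how much, reviewers disagree with the automated score.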

Lastly, save the metric scores to the CSV file, which you can then download to review the final metric scores.

Solution 2: Lambda based evaluation

If you’re already using Amazon Q Business, AmazonQEvaluationLambda allows for quick integration of evaluation methods into your application without setting up a custom UI application. It offers the following key features:

Evaluates responses from Amazon Q Business using Ragas against a predefined test set of questions and ground truth data

Outputs evaluation metrics that can be visualized directly in Amazon CloudWatch

Both solutions provide you results based on the input dataset and the responses from the Amazon Q Business application, using Ragas to evaluate four key evaluation metrics (context recall, context precision, answer relevancy, and truthfulness).

This solution provides sample code to evaluate the Amazon Q Business application response. To use it, you need to have or create a working Amazon Q Business application integrated with IAM Identity Center or Amazon Cognito as an IdP. This Lambda function works in the same way as the Lambda function in the end-to-end evaluation solution, using Ragas against a test set of questions and ground truth. This lightweight solution doesn’t have a custom UI, but it can provide result metrics (context recall, context precision, answer relevancy, and truthfulness) for visualization in CloudWatch. For deployment instructions, refer to the following GitHub repo.
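To illustrate how scores could reach CloudWatch, the sketch below converts a dictionary of metric scores into the `MetricData` list accepted by the CloudWatch `PutMetricData` API. The function name, the `ApplicationId` dimension, and the sample scores are assumptions for illustration and are not taken from the AmazonQEvaluationLambda code.

```python
from typing import Any, Dict, List


def build_metric_data(
    scores: Dict[str, float], application_id: str
) -> List[Dict[str, Any]]:
    """Convert metric scores into a MetricData list for CloudWatch PutMetricData."""
    return [
        {
            "MetricName": name,
            # Hypothetical dimension so several applications can share a namespace.
            "Dimensions": [{"Name": "ApplicationId", "Value": application_id}],
            "Value": float(value),
            "Unit": "None",  # scores are unitless ratios between 0 and 1
        }
        for name, value in sorted(scores.items())
    ]


# Made-up scores for illustration only.
metric_data = build_metric_data(
    {"context_recall": 0.75, "answer_relevancy": 0.5}, "example-app-id"
)
print([m["MetricName"] for m in metric_data])
```

With boto3, the list could then be published with `boto3.client("cloudwatch").put_metric_data(Namespace="AmazonQEvaluation", MetricData=metric_data)`, where the namespace is a hypothetical name you choose.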

Using evaluation results to improve Amazon Q Business application accuracy

This section outlines strategies to improve the key evaluation metrics (context recall, context precision, answer relevancy, and truthfulness) for a RAG solution in the context of Amazon Q Business.