Index website contents using the Amazon Q Web Crawler connector for Amazon Q Business

Amazon Q Business is a fully managed service that lets you build interactive chat applications using your enterprise data. These applications can generate answers based on your data or on the knowledge of a large language model (LLM). Your data is not used for training purposes, and the answers provided by Amazon Q Business are based solely on the data users have access to.

Enterprise data is often distributed across different sources, such as documents in Amazon Simple Storage Service (Amazon S3) buckets, database engines, websites, and more. In this post, we demonstrate how to create an Amazon Q Business application and index website contents using the Amazon Q Web Crawler connector for Amazon Q Business.

For this example, we use two data sources (websites). The first data source is an employee onboarding guide from a fictitious company, which requires basic authentication. We demonstrate how to set up authentication for the Web Crawler. The second data source is the official documentation for Amazon Q Business. For this data source, we demonstrate how to apply advanced settings, such as regular expressions, to instruct the Web Crawler to crawl only pages and links related to Amazon Q Business, ignoring pages related to other AWS services.

Overview of the Amazon Q Web Crawler connector

The Amazon Q Web Crawler connector makes it possible to crawl websites that use HTTPS and index their contents so you can build a generative artificial intelligence (AI) experience for your users based on the indexed data. This connector relies on the Selenium Web Crawler Package and a Chromium driver. The connector is fully managed and updates to these components are applied automatically without your intervention.

This connector crawls and indexes the contents of webpages and attachments. Amazon Q Business supports multiple connectors, and each connector has its own properties and entities that it considers documents. In the context of the Web Crawler connector, a document is the contents of a single webpage or attachment. Separately, an index is commonly described as a corpus of documents; think of it as the place where you add and sync your documents for Amazon Q Business to use for generating answers to user requests.

Each document has its own attributes, also known as metadata. Metadata can be mapped to fields in your Amazon Q Business index. By creating index fields, you can boost results based on document attributes. For example, there might be use cases where you want to give more relevance to results from a specific category, department, or creation date.

Amazon Q Business data source connectors are designed to crawl the default attributes in your data source automatically. You can also add custom document attributes and map them to custom fields in your index. To learn more, see Mapping document attributes in Amazon Q Business.

For a better understanding of what is indexed by the Web Crawler connector, we present a list of metadata indexed from webpages and attachments.

The following table lists webpage metadata indexed by the Amazon Q Web Crawler connector.

Field	Data Source Field	Amazon Q Business Index Field (reserved)	Field Type
Category	category	_category	String
URL	sourceUrl	_source_uri	String
Title	title	_document_title	String
Meta Tags	metaTags	wc_meta_tags	String List
File Size	htmlSize	wc_html_size	Long (numeric)

The following table lists attachments metadata indexed by the Amazon Q Web Crawler connector.

Field	Data Source Field	Amazon Q Business Index Field (reserved)	Field Type
Category	category	_category	String
URL	sourceUrl	_source_uri	String
File Name	fileName	wc_file_name	String
File Type	fileType	wc_file_type	String
File Size	fileSize	wc_file_size	Long (numeric)

When configuring the data source for your website, you can use URLs or sitemaps, which can be defined either manually or using a text file stored in Amazon S3.

To enforce secure access to protected websites, the Amazon Q Web Crawler supports the following authentication types and standards:

Basic authentication
NTLM/Kerberos authentication
Form-based authentication
SAML authentication

Unlike other data source connectors, the Amazon Q Web Crawler connector doesn’t support access control list (ACL) crawling or identity crawling.

Lastly, you have a range of options for configuring how and what data is synchronized. For example, you can choose to synchronize website domains only, website domains with subdomains only, or website domains with subdomains and the webpages included in links. Additionally, you can use regular expressions to filter which URLs to include or exclude in the crawling process.
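To make the difference between the first two sync scopes more concrete, the following Python sketch shows one plausible reading of them: a link is in scope if its host equals the starting host, or (when subdomains are included) if its host ends with the starting host. This is only an illustration. The function name in_scope and the example.com URLs are hypothetical, and the connector's actual matching rules may differ.

```python
from urllib.parse import urlsplit


def in_scope(seed_url: str, candidate_url: str, include_subdomains: bool) -> bool:
    """Illustrative scope check; not the connector's actual implementation."""
    seed_host = (urlsplit(seed_url).hostname or "").lower()
    candidate = urlsplit(candidate_url)
    candidate_host = (candidate.hostname or "").lower()

    # The connector only works with HTTPS, so reject everything else.
    if candidate.scheme != "https" or not seed_host or not candidate_host:
        return False

    # "Sync domains only": the host must match the starting host exactly.
    if candidate_host == seed_host:
        return True

    # "Sync domains with subdomains only": also accept hosts under the seed host.
    return include_subdomains and candidate_host.endswith("." + seed_host)


if __name__ == "__main__":
    seed = "https://example.com/"
    for url in ("https://example.com/guide", "https://docs.example.com/guide"):
        print(url, in_scope(seed, url, include_subdomains=False))
```

The third option, which also follows webpages included in links, would accept hosts outside the starting domain and is not modeled here.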

Overview of solution

At a high level, this solution consists of an Amazon Q Business application that uses two data sources: a website hosting documents related to an employee onboarding guide, and the Amazon Q Business official documentation website. This solution demonstrates how to configure both websites as data sources for the Amazon Q Business application. The following steps will be performed:

AWS CloudFormation

You can follow along using one or both data sources provided in this post or try your own URLs.

Prerequisites

To follow along with this demo, you should have the following prerequisites:

AWS Identity and Access Management

IAM Identity Center instance

user

groups

Deploy a CloudFormation template for the employee onboarding website secured with basic authentication

Deploying this CloudFormation template is optional, but we recommend using it so you can learn more about how the Web Crawler connector works with websites that require authentication.

We start by deploying a CloudFormation template. This template will create a simple static website secured with basic authentication.

Create stack

With new resources (standard)

Choose an existing template

Specify template

Amazon S3 URL

https://aws-blogs-artifacts-public.s3.amazonaws.com/artifacts/ML-16532/template-website.yml

Next

Stack name

onboarding-website-for-q-business-sample

Next

Configure stack options

Next

Review and create

I acknowledge that AWS CloudFormation might create IAM resources

Submit

The deployment process will take a few minutes to complete. You can move to the next section of this post while it’s in process. Keep this tab open—you’ll need to refer to the Outputs tab later.
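If you prefer to script the deployment instead of using the console, the following is a minimal sketch with the AWS SDK for Python (Boto3). It reuses the template URL and stack name from the steps above. The helper names are our own, and we assume that CAPABILITY_IAM corresponds to the IAM acknowledgment check box; if the template creates named IAM resources, CAPABILITY_NAMED_IAM would be needed instead. The sketch also assumes Boto3 is installed and your AWS credentials and Region are configured.

```python
TEMPLATE_URL = (
    "https://aws-blogs-artifacts-public.s3.amazonaws.com/"
    "artifacts/ML-16532/template-website.yml"
)
STACK_NAME = "onboarding-website-for-q-business-sample"


def create_stack_params(stack_name=STACK_NAME, template_url=TEMPLATE_URL):
    """Build the keyword arguments for CloudFormation create_stack."""
    return {
        "StackName": stack_name,
        "TemplateURL": template_url,
        # Assumption: equivalent to acknowledging IAM resource creation.
        "Capabilities": ["CAPABILITY_IAM"],
    }


def find_output(describe_stacks_response, key):
    """Return the value of a stack output, or None if it is not present."""
    for stack in describe_stacks_response.get("Stacks", []):
        for output in stack.get("Outputs", []):
            if output.get("OutputKey") == key:
                return output.get("OutputValue")
    return None


def deploy_and_get_website_url():
    """Create the stack, wait for completion, and return the WebsiteURL output."""
    import boto3  # third-party; imported here so the helpers above work without it

    cfn = boto3.client("cloudformation")
    cfn.create_stack(**create_stack_params())
    cfn.get_waiter("stack_create_complete").wait(StackName=STACK_NAME)
    return find_output(cfn.describe_stacks(StackName=STACK_NAME), "WebsiteURL")


if __name__ == "__main__":
    print(deploy_and_get_website_url())
```

The WebsiteURL output is the same value you would otherwise copy from the Outputs tab.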

Create an Amazon Q Business application

Before you start creating Amazon Q Business applications, you are required to enable and configure an IAM Identity Center instance. This step is mandatory because Amazon Q Business integrates with IAM Identity Center to manage user access to your Amazon Q Business applications. If you don’t have an IAM Identity Center instance set up when trying to create your first application, you will see the option to create one, as shown in the following screenshot.

If you already have an IAM Identity Center instance set up, you’re ready to start creating your first application by following these steps:

Get started

Create application

Application name

my-q-business-app

Service access

Create and use a new service-linked role (SLR)

Create

Retrievers

Use native retriever

Index provisioning

1

Number of units

Next

Create a Web Crawler data source for the Amazon Q Business documentation

After you complete the steps in the previous section, you should see the Connect data sources page, as shown in the following screenshot.

If you closed the tab by accident, you can get to this page by navigating to the Amazon Q Business console, choosing your application name, and then choosing Add data source.

Let’s create the data source for the Amazon Q Business documentation website:

Connect data sources

Web crawler

Data source name

q-business-documentation

Description

Source

Source URLs

https://docs.aws.amazon.com/amazonq/

Starting point URLs can be added directly in this UI (up to 10), or you could use a file hosted in Amazon S3 to list up to 100 starting point URLs. Likewise, sitemap URLs can be added in this UI (up to three), or you could add up to three sitemap XML files hosted in Amazon S3.

We refer to source URLs as starting point URLs; later in this post, you’ll have the opportunity to define what gets crawled, for example, domains and subdomains that the webpages might link to. It’s worth mentioning that the Web Crawler connector can only work with HTTPS.
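Before entering your own starting point URLs, it can help to check them against the constraints described above: HTTPS only, up to 10 URLs in the UI, or up to 100 in a file hosted in Amazon S3. The following Python sketch does that locally. The function name and constants are our own and are not part of any AWS API.

```python
from urllib.parse import urlsplit

MAX_UI_URLS = 10        # limit when URLs are entered directly in the UI
MAX_S3_FILE_URLS = 100  # limit when URLs are listed in a file hosted in Amazon S3


def validate_starting_urls(urls, from_s3_file=False):
    """Return a list of human-readable problems; an empty list means no issues."""
    limit = MAX_S3_FILE_URLS if from_s3_file else MAX_UI_URLS
    problems = []

    if len(urls) > limit:
        problems.append(f"{len(urls)} URLs given, limit is {limit}")

    for url in urls:
        parts = urlsplit(url)
        # The Web Crawler connector can only work with HTTPS.
        if parts.scheme != "https" or not parts.hostname:
            problems.append(f"not an HTTPS URL: {url}")

    return problems


if __name__ == "__main__":
    print(validate_starting_urls(["https://docs.aws.amazon.com/amazonq/"]))
    print(validate_starting_urls(["http://example.com/"]))
```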

No authentication

Authentication

Web proxy

Configure VPC and security group

No VPC

IAM role

Create a new service role

Sync scope

Sync domain range

Sync domains with subdomains only

Maximum file size

50 MB.

Additional configuration

Scope settings

Crawl depth

Maximum links per page

999

Maximum throttling

300

If you open the Amazon Q official documentation, you’ll see that there are links to Amazon Q Developer documentation and other AWS services. Because we’re only interested in crawling Amazon Q Business, we need to instruct the crawler to focus only on relevant links and pages related to Amazon Q Business. To achieve this, we use regular expressions to define exactly what URLs the crawler should crawl.

Crawl URL Patterns

Add

^https:\/\/docs\.aws\.amazon\.com\/amazonq\/$

^https:\/\/docs\.aws\.amazon\.com\/amazonq\/latest\/qbusiness-ug\/.*\.html$

^https:\/\/docs\.aws\.amazon\.com\/amazonq\/latest\/business-use-dg\/.*\.html$
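To get a feel for what these three patterns include and exclude, you can try them locally before running a sync. The following Python sketch uses the standard re module. The sample URLs are our own, and the connector's regular expression engine might not behave identically to Python's in every edge case, so treat this as a quick sanity check only.

```python
import re

# The three crawl URL patterns from the steps above, copied verbatim.
CRAWL_PATTERNS = [
    r"^https:\/\/docs\.aws\.amazon\.com\/amazonq\/$",
    r"^https:\/\/docs\.aws\.amazon\.com\/amazonq\/latest\/qbusiness-ug\/.*\.html$",
    r"^https:\/\/docs\.aws\.amazon\.com\/amazonq\/latest\/business-use-dg\/.*\.html$",
]
COMPILED_PATTERNS = [re.compile(pattern) for pattern in CRAWL_PATTERNS]


def should_crawl(url: str) -> bool:
    """Return True if the URL matches at least one crawl pattern."""
    return any(pattern.search(url) for pattern in COMPILED_PATTERNS)


if __name__ == "__main__":
    samples = [
        "https://docs.aws.amazon.com/amazonq/",
        "https://docs.aws.amazon.com/amazonq/latest/qbusiness-ug/what-is.html",
        "https://docs.aws.amazon.com/amazonq/latest/qdeveloper-ug/what-is.html",
        "https://docs.aws.amazon.com/s3/",
    ]
    for url in samples:
        print(should_crawl(url), url)
```

The first pattern matches only the landing page, and the other two restrict the crawl to HTML pages under the two Amazon Q Business guides, which is why the Amazon Q Developer and Amazon S3 sample URLs are rejected.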

Sync mode

Full sync

Sync run schedule

Frequency

Run on demand

Choosing this option means you must manually run the sync operation; this option is suitable given the simplicity of this example. For production workloads, you’ll want to define a schedule tailored to your needs, for example, hourly, daily, or weekly, or you could define your own schedule using a cron expression.

Tags

The default values in the Field mappings section can’t be changed at this point. They can only be modified after the application and retriever have been created.

Add data source

After the data source is created, you will be shown the same interface you saw at the beginning of this section, with the note that one Web Crawler data source has been added. Keep this tab open, because you’ll create a second data source for the employee onboarding guide in the next section.

Create a Web Crawler data source for the employee onboarding guide

Complete the following steps to create your second data source:

Connect data sources

Web crawler

CREATE_COMPLETE

Outputs

WebsiteURL

Although unlikely, if the URL isn’t working, it might be because Amazon CloudFront hasn’t finished replicating the website. In that case, you should wait a couple of minutes and try again.

You should now be able to browse the employee onboarding guide. Take a few minutes to get familiar with the contents of the website, because you’ll be asking your Amazon Q Business application questions about this content in a later step.

Data source name

onboarding-guide

Source

URLs

Authentication

Basic authentication

Authentication credentials

AWS Secrets Manager secret

Create and add new secret

Secret name

User name

Password

Save

These credentials will be stored as a secret in AWS Secrets Manager.

Depending on the type of authentication you use, you’ll need certain fields present in your secret, as shown in the following table.

Authentication Type	Fields present in secret
Form based	username, password, userNameFieldXpath, passwordFieldXpath, passwordButtonXpath, loginPageUrl
NTLM	username, password
Basic auth	username, password
No Authentication	NA
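The table above can be turned into a small check that a secret contains every field its authentication type needs. The following Python sketch builds the JSON string for a secret and raises an error if a required field is missing. The function name and placeholder credentials are hypothetical, and we assume the secret value is stored as a flat JSON object keyed by the field names in the table.

```python
import json

# Required secret fields per authentication type, taken from the table above.
REQUIRED_SECRET_FIELDS = {
    "Form based": [
        "username",
        "password",
        "userNameFieldXpath",
        "passwordFieldXpath",
        "passwordButtonXpath",
        "loginPageUrl",
    ],
    "NTLM": ["username", "password"],
    "Basic auth": ["username", "password"],
    "No Authentication": [],
}


def build_secret_string(auth_type, **fields):
    """Return the secret as a JSON string, or raise ValueError if fields are missing."""
    required = REQUIRED_SECRET_FIELDS[auth_type]
    missing = [name for name in required if not fields.get(name)]
    if missing:
        raise ValueError(f"{auth_type} secret is missing: {', '.join(missing)}")
    # Keep only the fields this authentication type uses.
    return json.dumps({name: fields[name] for name in required})


if __name__ == "__main__":
    # Placeholder credentials; replace them with your own values.
    print(build_secret_string("Basic auth", username="my-user", password="my-password"))
```

For this post, only the Basic auth row applies, because the console creates the secret for you from the user name and password you enter.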

Web proxy

No VPC

Configure VPC and security group

IAM role

Create a new service role

Sync domains with subdomains only

Sync scope

Full sync

Sync mode

Sync run schedule

Run on demand

Tags

Field mappings

Add data source

After changes are applied, the Connect data sources page shows two Web Crawler data sources have been added.

Next

We have added our two data sources. In the next section, we add groups and users to our Amazon Q Business application.

Add groups and users to the Amazon Q Business application

Complete the following steps to add groups and users:

Add groups and users

Assign existing users and groups

Next

If you’ve completed the prerequisite of setting up IAM Identity Center, you’ve likely added at least one user. Although it’s not mandatory, we recommend creating multiple users and groups. This will enable you to fully explore and understand all the features of Amazon Q Business beyond what’s covered in this post.

If you haven’t added any users to your Identity Center directory, you can create them here by choosing Add new users. However, you’ll need to complete additional steps, such as setting up their passwords on the IAM Identity Center console. To fully benefit from this tutorial, we recommend having active users and groups by the time you reach this step.

In the search bar, enter either the display name or group name you want to add to the application.

Assign

If you added a group, you’ll see it on the Groups tab. If you added a user, you’ll see it on the Users tab.

The next step is choosing a subscription for your groups or users.

Current subscription

Q Business Pro

This is a good time to get familiar with the Amazon Q Business subscription tiers and pricing. For this example, we use Q Business Pro, but you could also use a Q Business Lite subscription.

Web experience service access

Create and use a new service role

A web experience is the chat interface that your users will utilize to ask questions and perform tasks.

Create application

After the application is created successfully, you’ll be redirected to the Amazon Q Business console, where you can see your new application. Your application is ready, but the data sources haven’t synced any data yet. We’ll do that in the next steps.

Application Details

Data sources

Sync now

You will see the Current sync state for both data sources as Syncing. This process might take several minutes.

After the data sources are synced, you will see their Last sync status as Completed.

You’re now ready to test your application! Keep this page open because you’ll need it for next steps.

Run sample queries to test the solution

At this point, you have created an Amazon Q Business application, added two data sources using the Amazon Q Web Crawler connector, added users to the application, and synchronized all data sources.

The next step is to go through the full user experience of logging in to the application and running a few test queries.

Application Details

Web experience settings

Deployed URL

You’ll be redirected to the AWS access portal URL, which is set up by IAM Identity Center.

Next

You’re now on your Amazon Q Business app and ready to start asking questions!

Enter a prompt

Enter

For this example, we start by asking questions related to the employee onboarding website.

Amazon Q Business uses the onboarding guide data source you created earlier. If you choose Sources, you’ll see a list of in-text source citations in the form of a numbered list.

Now we ask questions related to the Amazon Q Business documentation.

Try it out with your own prompts!

Troubleshooting

In this section, we discuss several common issues and how to troubleshoot them:

Amazon Q Business isn’t answering your questions

The Web Crawler is unable to sync

Configuring a robots.txt file for Amazon Q Business Web Crawler

Amazon Q Business answers questions using old data

Run on demand

Sync now

Sync run schedule

Amazon Q Business provides an inaccurate answer or no answer at all
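When the Web Crawler is unable to sync, one thing you can rule out locally is whether the website's robots.txt blocks the crawler. The following Python sketch uses the standard urllib.robotparser module. The user agent string amazon-QBusiness is our assumption about how the crawler identifies itself, so verify it against the robots.txt documentation referenced above. The sample rules and URLs are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Assumption: the user agent the Amazon Q Business Web Crawler announces.
CRAWLER_USER_AGENT = "amazon-QBusiness"


def is_allowed(robots_txt: str, url: str, user_agent: str = CRAWLER_USER_AGENT) -> bool:
    """Return True if the given robots.txt content permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical robots.txt that lets the crawler read /docs/ and nothing else.
SAMPLE_ROBOTS_TXT = "\n".join(
    [
        "User-agent: amazon-QBusiness",
        "Allow: /docs/",
        "Disallow: /",
        "",
        "User-agent: *",
        "Disallow: /private/",
    ]
)

if __name__ == "__main__":
    for url in ("https://example.com/docs/page.html", "https://example.com/other.html"):
        print(is_allowed(SAMPLE_ROBOTS_TXT, url), url)
```

Python applies the rules of a matching group in the order they appear, so the Allow line for /docs/ is listed before the broader Disallow line. Other parsers may resolve conflicts differently.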

Document enrichment

Although not covered in this post, we recommend exploring document enrichment. This functionality allows you to manipulate and enrich document attributes prior to being added to an index. The following are a couple of ideas for advanced applications of document enrichment:

Clean up

After you finish testing the solution, clean up the resources you created to avoid incurring extra costs.

Let’s start by deleting the Amazon Q Business application.

Actions

Delete

You might be asked to complete an optional survey on your reasons for application deletion. You can select multiple reasons (or none), then choose Submit.

The next step is to delete the CloudFormation stack responsible for deploying the employee onboarding website we used as a data source.

Delete

The stack deletion might take a few minutes. When the deletion is complete, you’ll see the stack has been removed from your list of stacks.

Optionally, if you enabled IAM Identity Center only for this tutorial and want to delete your IAM Identity Center instance, follow these steps:

Settings

Management

Delete

Confirm

Conclusion

The Amazon Q Business Web Crawler allows you to connect websites to your Amazon Q Business applications. This connector supports multiple forms of authentication (if required by your website) and can run sync jobs on a defined schedule.

To learn more about Amazon Q Business and its features, refer to the Amazon Q Business Developer Guide. For a comprehensive list of what can be done with this connector, refer to Connecting Web Crawler to Amazon Q Business.

About the Author

Guillermo Mansilla is a Senior Solutions Architect based in Orlando, Florida. He has had the opportunity to collaborate with startups and enterprise customers in the USA and Canada, assisting them in building and architecting their applications on AWS. Guillermo has developed a keen interest in serverless architectures and generative AI applications. Prior to his current role, he gained over a decade of experience working as a software developer. Away from work, Guillermo enjoys participating in chess tournaments at his local chess club, a pursuit that allows him to exercise his analytical skills in a different context.
