AWS Machine Learning Blog, August 28, 2024
Index website contents using the Amazon Q Web Crawler connector for Amazon Q Business

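Amazon Q Business is a fully managed service that lets you build interactive chat applications using your enterprise data. These applications can generate answers based on your data or on large language model (LLM) knowledge. Your data is not used for training, and the answers Amazon Q Business provides are based solely on the data users have access to. This post demonstrates how to create an Amazon Q Business application and index website contents using the Amazon Q Web Crawler connector for Amazon Q Business.

The Amazon Q Business Web Crawler connector lets you crawl websites that use HTTPS and index their contents so you can build generative artificial intelligence (AI) experiences for your users based on the indexed data. The connector relies on the Selenium Web Crawler Package and a Chromium driver. It is fully managed, and updates to these components are applied automatically without your intervention.

The connector crawls and indexes the contents of webpages and attachments. Amazon Q Business supports multiple connectors, each with its own properties and entities that it treats as documents. For the Web Crawler connector, a document is the content of a single page or attachment. An index, in turn, is commonly described as a corpus of documents: the place where documents are added and synced for Amazon Q Business to use when answering user requests.

Each document has its own attributes, also known as metadata, which can be mapped to fields in your Amazon Q Business index. By creating index fields, you can boost results based on document attributes, for example to give more relevance to results from a specific category, department, or creation date.

Amazon Q Business data source connectors are designed to crawl the default attributes in your data source automatically. You can also add custom document attributes and map them to custom fields in your index.

To clarify what the Web Crawler connector indexes, the post lists the metadata indexed from webpages and attachments.

When configuring the data source for your website, you can use URLs or sitemaps, defined either manually or in a text file stored in Amazon S3.

To enforce secure access to protected websites, the Amazon Q Web Crawler supports the following authentication types and standards: basic authentication, NTLM/Kerberos authentication, form-based authentication, and SAML authentication.

Finally, you have a range of options for configuring how and what data is synchronized. For example, you can synchronize website domains only, website domains with subdomains only, or website domains with subdomains and the webpages included in links. You can also use regular expressions to filter which URLs to include or exclude during crawling.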

Amazon Q Business is a fully managed service that lets you build interactive chat applications using your enterprise data. These applications can generate answers based on your data or on large language model (LLM) knowledge. Your data is not used for training purposes, and the answers provided by Amazon Q Business are based solely on the data users have access to.

Enterprise data is often distributed across different sources, such as documents in Amazon Simple Storage Service (Amazon S3) buckets, database engines, websites, and more. In this post, we demonstrate how to create an Amazon Q Business application and index website contents using the Amazon Q Web Crawler connector for Amazon Q Business.

For this example, we use two data sources (websites). The first data source is an employee onboarding guide from a fictitious company, which requires basic authentication. We demonstrate how to set up authentication for the Web Crawler. The second data source is the official documentation for Amazon Q Business. For this data source, we demonstrate how to apply advanced settings, such as regular expressions, to instruct the Web Crawler to crawl only pages and links related to Amazon Q Business, ignoring pages related to other AWS services.

Overview of the Amazon Q Web Crawler connector

The Amazon Q Web Crawler connector makes it possible to crawl websites that use HTTPS and index their contents so you can build a generative artificial intelligence (AI) experience for your users based on the indexed data. This connector relies on the Selenium Web Crawler Package and a Chromium driver. The connector is fully managed and updates to these components are applied automatically without your intervention.

This connector crawls and indexes the contents of webpages and attachments. Amazon Q Business supports multiple connectors, and each connector has its own properties and entities that it considers documents. In the context of the Web Crawler connector, a document refers to a single page or attachment contents. Separately, an index is commonly referred to as a corpus of documents; think of it as the place where you add and sync your documents for Amazon Q Business to use for generating answers to user requests.

Each document has its own attributes, also known as metadata. Metadata can be mapped to fields in your Amazon Q Business index. By creating index fields, you can boost results based on document attributes. For example, there might be use cases where you want to give more relevance to results from a specific category, department, or creation date.

Amazon Q Business data source connectors are designed to crawl the default attributes in your data source automatically. You can also add custom document attributes and map them to custom fields in your index. To learn more, see Mapping document attributes in Amazon Q Business.

For a better understanding of what is indexed by the Web Crawler connector, we present a list of metadata indexed from webpages and attachments.

The following table lists webpage metadata indexed by the Amazon Q Web Crawler connector.

Field | Data Source Field | Amazon Q Business Index Field (reserved) | Field Type
Category | category | _category | String
URL | sourceUrl | _source_uri | String
Title | title | _document_title | String
Meta Tags | metaTags | wc_meta_tags | String List
File Size | htmlSize | wc_html_size | Long (numeric)

The following table lists attachment metadata indexed by the Amazon Q Web Crawler connector.

Field | Data Source Field | Amazon Q Business Index Field (reserved) | Field Type
Category | category | _category | String
URL | sourceUrl | _source_uri | String
File Name | fileName | wc_file_name | String
File Type | fileType | wc_file_type | String
File Size | fileSize | wc_file_size | Long (numeric)
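As a quick reference, the two metadata tables can also be written as lookup dictionaries. The following Python sketch is only an illustration: the dictionary names and the index_field_for helper are hypothetical and are not part of any Amazon Q Business API. The field names themselves are copied from the tables.

```python
# Data source field -> (index field, field type), copied from the tables above.
WEBPAGE_FIELDS = {
    "category": ("_category", "String"),
    "sourceUrl": ("_source_uri", "String"),
    "title": ("_document_title", "String"),
    "metaTags": ("wc_meta_tags", "String List"),
    "htmlSize": ("wc_html_size", "Long"),
}

ATTACHMENT_FIELDS = {
    "category": ("_category", "String"),
    "sourceUrl": ("_source_uri", "String"),
    "fileName": ("wc_file_name", "String"),
    "fileType": ("wc_file_type", "String"),
    "fileSize": ("wc_file_size", "Long"),
}


def index_field_for(source_field: str, attachment: bool = False) -> str:
    """Return the index field that a data source field maps to (hypothetical helper)."""
    table = ATTACHMENT_FIELDS if attachment else WEBPAGE_FIELDS
    index_field, _field_type = table[source_field]
    return index_field


if __name__ == "__main__":
    print(index_field_for("title"))
    print(index_field_for("fileName", attachment=True))
```

A lookup like this can be handy when deciding which index fields to boost, because it makes explicit which fields exist for webpages and which exist only for attachments.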

When configuring the data source for your website, you can use URLs or sitemaps, which can be defined either manually or using a text file stored in Amazon S3.

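To enforce secure access to protected websites, the Amazon Q Web Crawler supports the following authentication types and standards: basic authentication, NTLM/Kerberos authentication, form-based authentication, and SAML authentication.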

Unlike other data source connectors, the Amazon Q Web Crawler connector doesn’t support access control list (ACL) crawling or identity crawling.

Lastly, you have a range of options for configuring how and what data is synchronized. For example, you can choose to synchronize website domains only, website domains with subdomains only, or website domains with subdomains and the webpages included in links. Additionally, you can use regular expressions to filter which URLs to include or exclude in the crawling process.

Overview of solution

At a high level, this solution consists of an Amazon Q Business application that uses two data sources: a website hosting documents related to an employee onboarding guide, and the Amazon Q Business official documentation website. This solution demonstrates how to configure both websites as data sources for the Amazon Q Business application. The walkthrough covers the following steps:

    Deploy an AWS CloudFormation template containing a static website secured with basic authentication.
    Create an Amazon Q Business application.
    Create a Web Crawler data source for the Amazon Q Business documentation.
    Create a Web Crawler data source for the employee onboarding guide.
    Add groups and users to the Amazon Q Business application.
    Run sample queries to test the solution.

You can follow along using one or both data sources provided in this post or try your own URLs.

Prerequisites

To follow along with this demo, you should have the following prerequisites:

Deploy a CloudFormation template for the employee onboarding website secured with basic authentication

Deploying this CloudFormation template is optional, but we recommend using it so you can learn more about how the Web Crawler connector works with websites that require authentication.

We start by deploying a CloudFormation template. This template will create a simple static website secured with basic authentication.

    On the AWS CloudFormation console, choose Create stack and choose With new resources (standard).
    Select Choose an existing template.
    For Specify template, select Amazon S3 URL.
    For Amazon S3 URL, enter the URL https://aws-blogs-artifacts-public.s3.amazonaws.com/artifacts/ML-16532/template-website.yml
    Choose Next.
    For Stack name, enter a name, for example, onboarding-website-for-q-business-sample.
    Choose Next.
    Leave all options in Configure stack options as default and choose Next.
    On the Review and create page, select I acknowledge that AWS CloudFormation might create IAM resources, then choose Submit.

The deployment process will take a few minutes to complete. You can move to the next section of this post while it’s in process. Keep this tab open—you’ll need to refer to the Outputs tab later.

Create an Amazon Q Business application

Before you start creating Amazon Q Business applications, you are required to enable and configure an IAM Identity Center instance. This step is mandatory because Amazon Q Business integrates with IAM Identity Center to manage user access to your Amazon Q Business applications. If you don’t have an IAM Identity Center instance set up when trying to create your first application, you will see the option to create one, as shown in the following screenshot.

If you already have an IAM Identity Center instance set up, you’re ready to start creating your first application by following these steps:

    On a new tab in your browser, open the Amazon Q Business console.
    Choose Get started or Create application (options will vary based on whether it’s your first time trying the service).
    For Application name, enter a name for your application, for example, my-q-business-app.
    For Service access, select Create and use a new service-linked role (SLR).
    Choose Create.
    For Retrievers, select Use native retriever.
    For Index provisioning, enter 1 for Number of units. One unit can index 20,000 documents (a document in this context is either a single page of content or a single attachment).
    Choose Next.
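To estimate how many index units a larger site would need, you can work from the figure of 20,000 documents per unit given in the steps above. The following Python sketch shows the arithmetic; the function name is hypothetical, and the minimum of one unit is an assumption made for this illustration.

```python
DOCS_PER_UNIT = 20_000  # capacity of one index unit, as stated in the steps above


def units_needed(document_count: int) -> int:
    """Smallest number of index units able to hold document_count documents.

    Assumption for this sketch: an index always has at least one unit.
    """
    if document_count < 0:
        raise ValueError("document_count must be non-negative")
    # Integer ceiling division avoids floating-point rounding.
    return max(1, (document_count + DOCS_PER_UNIT - 1) // DOCS_PER_UNIT)


if __name__ == "__main__":
    for pages in (500, 20_000, 20_001, 45_000):
        print(pages, "documents ->", units_needed(pages), "unit(s)")
```

Remember that each webpage and each attachment counts as a separate document, so the estimate should include both.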

Create a Web Crawler data source for the Amazon Q Business documentation

After you complete the steps in the previous section, you should see the Connect data sources page, as shown in the following screenshot.

If you closed the tab by accident, you can get to this page by navigating to the Amazon Q Business console, choosing your application name, and then choosing Add data source.

Let’s create the data source for the Amazon Q Business documentation website:

    On the Connect data sources page, choose Web crawler.
    For Data source name, enter a name, for example, q-business-documentation.
    For Description, enter a description.
    For Source, you have the option to provide either URLs or sitemaps. For this example, select Source URLs and enter the URL of the official documentation of Amazon Q: https://docs.aws.amazon.com/amazonq/

Starting point URLs can be added directly in this UI (up to 10), or you could use a file hosted in Amazon S3 to list up to 100 starting point URLs. Likewise, sitemap URLs can be added in this UI (up to three), or you could add up to three sitemap XML files hosted in Amazon S3.

We refer to source URLs as starting point URLs; later in this post, you’ll have the opportunity to define what gets crawled, for example, domains and subdomains that the webpages might link to. It’s worth mentioning that the Web Crawler connector can only work with HTTPS.
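Before entering starting point URLs, it can help to check them against the constraints described above: HTTPS only, up to 10 URLs in the console, and up to 100 in a file hosted in Amazon S3. The following Python sketch performs that check locally. The function and constant names are hypothetical, and the check is an illustration rather than the validation the connector itself performs.

```python
from urllib.parse import urlparse

MAX_URLS_IN_CONSOLE = 10   # limit for URLs entered directly in the UI
MAX_URLS_IN_S3_FILE = 100  # limit for URLs listed in a file on Amazon S3


def check_starting_urls(urls: list[str], from_s3_file: bool = False) -> list[str]:
    """Return a list of problems found; an empty list means the input looks fine."""
    problems = []
    limit = MAX_URLS_IN_S3_FILE if from_s3_file else MAX_URLS_IN_CONSOLE
    if len(urls) > limit:
        problems.append(f"{len(urls)} URLs given, but the limit is {limit}")
    for url in urls:
        parsed = urlparse(url)
        # The Web Crawler connector only works with HTTPS.
        if parsed.scheme != "https" or not parsed.netloc:
            problems.append(f"not an HTTPS URL: {url}")
    return problems


if __name__ == "__main__":
    print(check_starting_urls(["https://docs.aws.amazon.com/amazonq/"]))
    print(check_starting_urls(["http://example.com/"]))
```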

    Select No authentication in the Authentication section because this is a public website.
    The Web proxy section is optional, so we leave it empty.
    For Configure VPC and security group, select No VPC.
    In the IAM role section, choose Create a new service role.
    In the Sync scope section, for Sync domain range, select Sync domains with subdomains only.
    For Maximum file size, you can keep the default value of 50 MB.
    Under Additional configuration, expand Scope settings.
    Leave Crawl depth set to 2, Maximum links per page set to 999, and Maximum throttling set to 300.

If you open the Amazon Q official documentation, you’ll see that there are links to Amazon Q Developer documentation and other AWS services. Because we’re only interested in crawling Amazon Q Business, we need to instruct the crawler to focus only on relevant links and pages related to Amazon Q Business. To achieve this, we use regular expressions to define exactly what URLs the crawler should crawl.

    Under Crawl URL Patterns, enter the following expressions one by one, and choose Add:
      ^https:\/\/docs\.aws\.amazon\.com\/amazonq\/$
      ^https:\/\/docs\.aws\.amazon\.com\/amazonq\/latest\/qbusiness-ug\/.*\.html$
      ^https:\/\/docs\.aws\.amazon\.com\/amazonq\/latest\/business-use-dg\/.*\.html$
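To see what these expressions accept, you can try them locally against a few URLs before running a sync. The following Python sketch uses the standard re module. It assumes the crawler interprets the patterns the same way Python does, which holds for simple expressions like these but is not guaranteed in general. The page names in the sample URLs are made up for illustration.

```python
import re

# The three Crawl URL Patterns from the step above.
CRAWL_PATTERNS = [
    r"^https:\/\/docs\.aws\.amazon\.com\/amazonq\/$",
    r"^https:\/\/docs\.aws\.amazon\.com\/amazonq\/latest\/qbusiness-ug\/.*\.html$",
    r"^https:\/\/docs\.aws\.amazon\.com\/amazonq\/latest\/business-use-dg\/.*\.html$",
]
COMPILED = [re.compile(pattern) for pattern in CRAWL_PATTERNS]


def would_be_crawled(url: str) -> bool:
    """True if the URL matches at least one crawl pattern."""
    return any(pattern.search(url) for pattern in COMPILED)


if __name__ == "__main__":
    samples = [
        "https://docs.aws.amazon.com/amazonq/",
        "https://docs.aws.amazon.com/amazonq/latest/qbusiness-ug/example-page.html",
        "https://docs.aws.amazon.com/amazonq/latest/qdeveloper-ug/example-page.html",
    ]
    for url in samples:
        print(would_be_crawled(url), url)
```

The first two sample URLs match, and the Amazon Q Developer URL does not, which is the filtering behavior the patterns are meant to produce.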

    In the Sync mode section, select Full sync. This option makes it possible to sync all contents regardless of their previous status.
    In the Sync run schedule section, you define how often Amazon Q Business should sync this data source. For Frequency, select Run on demand.

Choosing this option means you must manually run the sync operation; this option is suitable given the simplicity of this example. For production workloads, you’ll want to define a schedule tailored to your needs, for example, hourly, daily, or weekly, or you could define your own schedule using a cron expression.

    The Tags section is optional, so we leave it empty.

The default values in the Field mappings section can’t be changed at this point. They can only be modified after the application and retriever have been created.

    Choose Add data source and wait a couple of seconds while changes are applied.

After the data source is created, you will be shown the same interface you saw at the beginning of this section, with the note that one Web Crawler data source has been added. Keep this tab open, because you’ll create a second data source for the employee onboarding guide in the next section.

Create a Web Crawler data source for the employee onboarding guide

Complete the following steps to create your second data source:

    On the Connect data sources page, choose Web crawler.
    Keep this tab open, navigate back to the AWS CloudFormation console tab, and verify the stack’s status is CREATE_COMPLETE.
    If the status of the stack is CREATE_COMPLETE, choose the Outputs tab of the stack you deployed.
    Note the URL, user name, and password (the following screenshot shows sample values).

    Choose the link for WebsiteURL.

Although unlikely, if the URL isn’t working, it might be because Amazon CloudFront hasn’t finished replicating the website. In that case, you should wait a couple of minutes and try again.

    Sign in with your user name and password.

You should now be able to browse the employee onboarding guide. Take a few minutes to get familiar with the contents of the website, because you’ll be asking your Amazon Q Business application questions about this content in a later step.

    Return to the browser tab where you’re creating the new data source.
    For Data source name, enter a name, for example, onboarding-guide.
    For Source, select Source URLs and enter the website URL you saved earlier.
    For Authentication, select Basic authentication.
    Under Authentication credentials, for AWS Secrets Manager secret, choose Create and add new secret.

    For Secret name, enter a secret name of your preference.
    For User name and Password, use the values you saved earlier and make sure there is no extra whitespace.
    Choose Save.

These credentials will be stored as a secret in AWS Secrets Manager.

Depending on the type of authentication you use, you’ll need certain fields present in your secret, as shown in the following table.

Authentication Type | Fields present in secret
Form based | username, password, userNameFieldXpath, passwordFieldXpath, passwordButtonXpath, loginPageUrl
NTLM | username, password
Basic auth | username, password
No Authentication | NA
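The fields in the table can be checked before a secret is created. The following Python sketch builds the secret value as a JSON string and verifies that every field required for the chosen authentication type is present. It assumes the secret is stored as a JSON object of key-value pairs. The short keys ("form", "ntlm", "basic", "none"), the function name, and the sample credentials are hypothetical and used only for this illustration.

```python
import json

# Fields each authentication type expects in the secret, per the table above.
REQUIRED_SECRET_FIELDS = {
    "form": ["username", "password", "userNameFieldXpath",
             "passwordFieldXpath", "passwordButtonXpath", "loginPageUrl"],
    "ntlm": ["username", "password"],
    "basic": ["username", "password"],
    "none": [],
}


def build_secret_string(auth_type: str, **values: str) -> str:
    """Return a JSON string holding exactly the fields required for auth_type."""
    required = REQUIRED_SECRET_FIELDS[auth_type]
    missing = [name for name in required if name not in values]
    if missing:
        raise ValueError(f"missing secret fields: {', '.join(missing)}")
    # Strip stray whitespace, which the walkthrough warns against.
    return json.dumps({name: values[name].strip() for name in required})


if __name__ == "__main__":
    # Sample values only; use the user name and password from the stack outputs.
    print(build_secret_string("basic", username=" admin ", password="example-password"))
```

The resulting string could then be used as the secret value in AWS Secrets Manager, whether you create the secret from the data source page as described here or by other means.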
    Leave the Web proxy section empty.
    Select No VPC in the Configure VPC and security group section.
    For IAM role, choose Create a new service role.
    Select Sync domains with subdomains only in the Sync scope section.
    Select Full sync in the Sync mode section.
    For Sync run schedule, choose Run on demand.
    Leave the sections Tags and Field mappings with their default values.
    Choose Add data source and wait a couple of seconds while changes are applied.

After changes are applied, the Connect data sources page shows two Web Crawler data sources have been added.

    Scroll down to the end of the page and choose Next.

We have added our two data sources. In the next section, we add groups and users to our Amazon Q Business application.

Add groups and users to the Amazon Q Business application

Complete the following steps to add groups and users:

    On the Add groups and users page, choose Add groups and users.
    Select Assign existing users and groups and choose Next.

If you’ve completed the prerequisite of setting up IAM Identity Center, you’ve likely added at least one user. Although it’s not mandatory, we recommend creating multiple users and groups. This will enable you to fully explore and understand all the features of Amazon Q Business beyond what’s covered in this post.

If you haven’t added any users to your Identity Center directory, you can create them here by choosing Add new users. However, you’ll need to complete additional steps, such as setting up their passwords on the IAM Identity Center console. To fully benefit from this tutorial, we recommend having active users and groups by the time you reach this step.

    In the search bar, enter either the display name or group name you want to add to the application.

    Choose the user (or group) and choose Assign.

If you added a group, you’ll see it on the Groups tab. If you added a user, you’ll see it on the Users tab.

The next step is choosing a subscription for your groups or users.

    Select the user (or group) you just added, and on the Current subscription dropdown menu, choose your subscription tier. For this example, we choose Q Business Pro.

This is a good time to get familiar with the Amazon Q Business subscription tiers and pricing. For this example, we use Q Business Pro, but you could also use a Q Business Lite subscription.

    In the Web experience service access section, select Create and use a new service role.

A web experience is the chat interface that your users will utilize to ask questions and perform tasks.

    Choose Create application.

After the application is created successfully, you’ll be redirected to the Amazon Q Business console, where you can see your new application. Your application is ready, but the data sources haven’t synced any data yet. We’ll do that in the next steps.

    Choose the name of your new application to open the Application Details.

    In the Data sources section, select each data source and choose Sync now.

You will see the Current sync state for both data sources as Syncing. This process might take several minutes.

After the data sources are synced, you will see their Last sync status as Completed.

You’re now ready to test your application! Keep this page open because you’ll need it for next steps.

Run sample queries to test the solution

At this point, you have created an Amazon Q Business application, added two data sources using the Amazon Q Web Crawler connector, added users to the application, and synchronized all data sources.

The next step is to go through the full user experience: logging in to the application and running a few queries to test it.

    On the Application Details page, navigate to the Web experience settings section.
    Choose the link under Deployed URL.

You’ll be redirected to the AWS access portal URL, which is set up by IAM Identity Center.

    Enter the user name of a user previously added to your Amazon Q Business application and choose Next.

You’re now on your Amazon Q Business app and ready to start asking questions!

    Enter your question (prompt) in the Enter a prompt text field and press Enter.

For this example, we start by asking questions related to the employee onboarding website.

Amazon Q Business uses the onboarding guide data source you created earlier. If you choose Sources, you’ll see the in-text source citations as a numbered list.

Now we ask questions related to the Amazon Q Business documentation.

Try it out with your own prompts!

Troubleshooting

In this section, we discuss several common issues and how to troubleshoot:

Document enrichment

Although not covered in this post, we recommend exploring document enrichment. This functionality allows you to manipulate and enrich document attributes prior to being added to an index. The following are a couple of ideas for advanced applications of document enrichment:

Clean up

After you finish testing the solution, clean up the resources you created as part of it to avoid incurring extra costs.

Let’s start by deleting the Amazon Q Business application.

    On the Amazon Q Business console, select your application from the application list and on the Actions menu, choose Delete.

    Confirm its deletion by entering Delete, then choose Delete.

You might be asked to complete an optional survey on your reasons for application deletion. You can select multiple reasons (or none), then choose Submit.

The next step is to delete the CloudFormation stack responsible for deploying the employee onboarding website we used as a data source.

    On the CloudFormation console, select the stack you created at the beginning of this walkthrough and choose Delete.

    Choose Delete to confirm the stack deletion.

The stack deletion might take a few minutes. When the deletion is complete, you’ll see the stack has been removed from your list of stacks.

Optionally, if you enabled IAM Identity Center only for this tutorial and want to delete your IAM Identity Center instance, follow these steps:

    On the IAM Identity Center console, choose Settings in the navigation pane.

    Choose the Management tab.

    Choose Delete.
    Select the acknowledgement check boxes, enter your instance, and choose Confirm.

Conclusion

The Amazon Q Business Web Crawler allows you to connect websites to your Amazon Q Business applications. This connector supports multiple forms of authentication (if required by your website) and can run sync jobs on a defined schedule.

To learn more about Amazon Q Business and its features, refer to the Amazon Q Business Developer Guide. For a comprehensive list of what can be done with this connector, refer to Connecting Web Crawler to Amazon Q Business.


About the Author

Guillermo Mansilla is a Senior Solutions Architect based in Orlando, Florida. He has had the opportunity to collaborate with startups and enterprise customers in the USA and Canada, assisting them in building and architecting their applications on AWS. Guillermo has developed a keen interest in serverless architectures and generative AI applications. Prior to his current role, he gained over a decade of experience working as a software developer. Away from work, Guillermo enjoys participating in chess tournaments at his local chess club, a pursuit that allows him to exercise his analytical skills in a different context.
