AWS Machine Learning Blog, September 21, 2024
Introducing document-level sync reports: Enhanced data sync visibility in Amazon Kendra

 

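Amazon Kendra now offers a new feature that significantly improves visibility into data source sync operations. The new release introduces a comprehensive document-level report in the sync history, giving administrators granular indexing status, metadata, and ACL details for every document processed during a data source sync job. This enhancement lets administrators quickly investigate and resolve ingestion or access issues encountered while setting up an Amazon Kendra index. The detailed document reports are persisted in the new SYNC_RUN_HISTORY_REPORT log stream under the Amazon Kendra index log group, so key sync job details are available on demand when troubleshooting.

📈 **Data source sync lifecycle** A data source sync in Amazon Kendra comprises three key stages: crawling, syncing, and indexing. Crawling involves the connector connecting to the data source and extracting the documents that meet the defined sync scope (according to the data source configuration). These documents are then synced to the Amazon Kendra index during the syncing stage. Finally, indexing makes the synced documents searchable within the Amazon Kendra environment.

📅 **Key features and benefits of document-level reports** The new document-level report provides the following key features and benefits:
- **Enhanced sync run history page** - A new Actions column has been added to the sync run history page, providing access to the document-level report for each sync run.
- **Dedicated log stream** - A new log stream named SYNC_RUN_HISTORY_REPORT, containing the document-level report, is created in the Amazon Kendra CloudWatch log group.
- **Comprehensive document information** - The document-level report includes the following information for each document:
  - Document ID
  - Document title
  - Consolidated document status (success, failed, or skipped)
  - Error message (if the document failed)
  - Crawl status
  - Sync status
  - Index status
  - ACLs
  - Metadata
  - Hashed document ID (for troubleshooting assistance)
  - Timestamp

📊 **Using document-level reports to determine the optimal boosting duration for recent documents** You can use the document-level report to obtain the _last_updated_at metadata field for your documents, which can help you determine an appropriate boosting period. To do this, you can use a CloudWatch Logs Insights query to retrieve the _last_updated_at metadata attribute for machine learning documents from the SYNC_RUN_HISTORY_REPORT log stream.

📆 **Common document indexing observability issues** You can use the document-level report to resolve common issues, such as:
- Determining whether a specific document was indexed successfully
- Identifying why a document failed to index
- Reviewing a document's ACLs and metadata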

Amazon Kendra is an intelligent search service powered by machine learning (ML). Amazon Kendra helps you aggregate content from a variety of content repositories into a centralized index that lets you quickly search all your enterprise data and find the most accurate answer.

Amazon Kendra securely connects to over 40 data sources. When using your data source, you might want better visibility into the document processing lifecycle during data source sync jobs. This could include knowing the status of each document you attempted to crawl and index, as well as being able to troubleshoot why certain documents were not returned with the expected answers. Additionally, you might need access to metadata, timestamps, and access control lists (ACLs) for the indexed documents.

We are pleased to announce a new feature now available in Amazon Kendra that significantly improves visibility into data source sync operations. The latest release introduces a comprehensive document-level report incorporated into the sync history, providing administrators with granular indexing status, metadata, and ACL details for every document processed during a data source sync job. This enhancement to sync job observability enables administrators to quickly investigate and resolve ingestion or access issues encountered while setting up Amazon Kendra indexes. The detailed document reports are persisted in the new SYNC_RUN_HISTORY_REPORT log stream under the Amazon Kendra index log group, so critical sync job details are available on-demand when troubleshooting.

In this post, we discuss the benefits of this new feature and how it offers enhanced data sync visibility in Amazon Kendra.

Lifecycle of a document in a data source sync run job

In this section, we examine the lifecycle of a document within a data source sync in Amazon Kendra. This provides valuable insight into the sync process. The data source sync comprises three key stages: crawling, syncing, and indexing. Crawling involves the connector connecting to the data source and extracting documents meeting the defined sync scope according to the data source configuration. These documents are then synced to the Amazon Kendra index during the syncing phase. Finally, indexing makes the synced documents searchable within the Amazon Kendra environment.

The following diagram shows a flowchart of a sync run job.

Crawling stage

The first stage is the crawling stage, where the connector crawls all documents and their metadata from the data source. During this stage, the connector also compares the checksum of the document against the Amazon Kendra index to determine if a particular document needs to be added, modified, or deleted from the index. This operation corresponds to the CrawlAction field in the sync run history report.

If the document is unmodified, it is marked as UNMODIFIED and skipped in the remaining stages. If a document fails in the crawling stage, for example because of throttling errors, broken content, or an excessive document size, its CrawlStatus in the sync run history report is marked as FAILED. If a document is skipped because of validation errors, its CrawlStatus is marked as SKIPPED. Neither failed nor skipped documents are sent to the next stage. All successfully crawled documents are marked as SUCCESS and sent forward.

We also capture the ACLs and metadata of each document in this stage so that they can be added to the sync run history report.
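The checksum comparison in the crawling stage can be pictured with a small sketch. This is an illustration only, not the connector's actual implementation: the `CrawlAction` values other than UNMODIFIED, the use of SHA-256, and the `plan_crawl_actions` helper are all assumptions made for the example.

```python
import hashlib
from enum import Enum


class CrawlAction(Enum):
    # Only UNMODIFIED is named in the post; the other values are assumed labels.
    ADD = "ADD"
    MODIFY = "MODIFY"
    DELETE = "DELETE"
    UNMODIFIED = "UNMODIFIED"


def checksum(content: bytes) -> str:
    """Return a hex digest of the document content (SHA-256 is an assumption)."""
    return hashlib.sha256(content).hexdigest()


def plan_crawl_actions(source_docs: dict, indexed_checksums: dict) -> dict:
    """Compare crawled documents against checksums already stored for the index.

    source_docs maps document ID -> raw content found in the data source.
    indexed_checksums maps document ID -> checksum recorded at the last sync.
    """
    actions = {}
    for doc_id, content in source_docs.items():
        known = indexed_checksums.get(doc_id)
        if known is None:
            actions[doc_id] = CrawlAction.ADD         # new in the data source
        elif known != checksum(content):
            actions[doc_id] = CrawlAction.MODIFY      # content changed
        else:
            actions[doc_id] = CrawlAction.UNMODIFIED  # skipped in later stages
    # Anything indexed earlier but no longer present in the source is deleted.
    for doc_id in indexed_checksums.keys() - source_docs.keys():
        actions[doc_id] = CrawlAction.DELETE
    return actions
```

Comparing checksums rather than full content keeps the decision cheap, because only one short digest per document has to be stored and compared.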

Syncing stage

During the syncing stage, the document is sent to Amazon Kendra ingestion service APIs like BatchPutDocument and BatchDeleteDocument. After a document is submitted to these APIs, Amazon Kendra runs validation checks on the submitted documents. If any document fails these checks, its SyncStatus is marked as FAILED. If there is an irrecoverable error for a particular document, it is marked as SKIPPED and other documents are sent forward.

Indexing stage

In this step, Amazon Kendra parses the document, processes it according to its content type, and persists it in the index. If the document fails to be persisted, its IndexStatus is marked as FAILED; otherwise, it is marked as SUCCESS.

After the statuses of all the stages have been captured, we emit these statuses as an Amazon CloudWatch event to the customer’s AWS account.
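One way to reason about how the three stage statuses relate to a single per-document outcome is sketched below. The post does not describe how the consolidated status is actually derived, so the rule used here (the first stage that did not succeed decides the outcome) and the `consolidate_status` helper are assumptions for illustration.

```python
# Stage fields in the order a document passes through them.
STAGES = ("CrawlStatus", "SyncStatus", "IndexStatus")


def consolidate_status(record: dict):
    """Derive one overall status from the per-stage statuses of a document.

    Assumed rule: a FAILED or SKIPPED stage stops the document, so that status
    becomes the overall result. Returns None when a stage status is missing or
    unrecognized, because the outcome cannot be decided from the record alone.
    """
    for stage in STAGES:
        status = record.get(stage)
        if status in ("FAILED", "SKIPPED"):
            return status
        if status != "SUCCESS":
            return None
    return "SUCCESS"
```

For example, a record with a successful crawl and a failed sync would be reported as FAILED under this rule, without needing an IndexStatus at all.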

Key features and benefits of document-level reports

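The following are the key features and benefits of the new document-level report in Amazon Kendra indexes:

- Enhanced sync run history page: A new Actions column has been added to the sync run history page, providing access to the document-level report for each sync run.
- Dedicated log stream: A new log stream named SYNC_RUN_HISTORY_REPORT, containing the document-level report, is created in the Amazon Kendra CloudWatch log group.
- Comprehensive document information: For each document, the report includes the document ID, document title, consolidated document status (success, failed, or skipped), error message (if the document failed), crawl status, sync status, index status, ACLs, metadata, hashed document ID (for troubleshooting assistance), and timestamp.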

In the following sections, we explore different use cases for the logging feature.

Determine the optimal boosting duration for recent documents using document-level reporting

When it comes to generating accurate answers, you may want to fine-tune the way Amazon Kendra prioritizes its content. For instance, you may prefer to boost recent documents over older ones to make sure the most up-to-date passages are used to generate an answer. To achieve this, you can use the relevance tuning feature in Amazon Kendra to boost documents based on the last update date attribute, with a specified boosting duration. However, determining the optimal boosting period can be challenging when dealing with a large number of frequently changing documents.

You can now use the document-level report to obtain the _last_updated_at metadata field for your documents, which can help you determine the appropriate boosting period. To do this, use the following CloudWatch Logs Insights query to retrieve the _last_updated_at metadata attribute for machine learning documents from the SYNC_RUN_HISTORY_REPORT log stream.

filter @logStream like 'SYNC_RUN_HISTORY_REPORT/' and Metadata like 'Machine Learning'
| parse Metadata '{"key":"_last_updated_at","value":{"dateValue":"*"}}' as @last_updated_at
| sort @last_updated_at desc, @timestamp desc
| dedup DocumentTitle

With the preceding query, you can gain insights into the last updated timestamps of your documents, enabling you to make informed decisions about the optimal boosting period. This approach makes sure your chat responses are generated using the most recent and relevant information, enhancing the overall accuracy and effectiveness of your Amazon Kendra implementation.

The following screenshot shows an example result.
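Once the _last_updated_at values have been collected, they still have to be turned into a boosting period. The following sketch shows one possible heuristic, not an AWS recommendation: pick the smallest number of days that covers a chosen share of the documents. The `suggest_boost_days` helper and the `coverage` parameter are hypothetical, and the timestamps are assumed to have been parsed into timezone-aware `datetime` objects already.

```python
import math
from datetime import datetime, timedelta, timezone


def suggest_boost_days(last_updated, now, coverage=0.8):
    """Return the document age in days that covers `coverage` of the documents.

    last_updated: iterable of timezone-aware datetimes (one per document).
    now: reference time used to compute document ages.
    coverage: fraction of documents the boosting window should include.
    """
    ages = sorted(
        max((now - ts).total_seconds() / 86400, 0.0) for ts in last_updated
    )
    if not ages:
        raise ValueError("no timestamps supplied")
    # Nearest-rank percentile: the age of the document at the coverage boundary.
    rank = max(math.ceil(coverage * len(ages)), 1)
    return math.ceil(ages[rank - 1])


# Usage with made-up timestamps: documents updated 1, 2, 3, and 100 days ago.
now = datetime(2024, 9, 21, tzinfo=timezone.utc)
updates = [now - timedelta(days=d) for d in (1, 2, 3, 100)]
window = suggest_boost_days(updates, now, coverage=0.75)
```

A heuristic like this keeps one very old document from stretching the boosting window, which is why it uses a percentile rather than the maximum age.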

Common document indexing observability and troubleshooting methods

In this section, we explore some common admin tasks for observing and troubleshooting document indexing using the new document-level reporting feature.

List all successfully indexed documents from a data source

To retrieve a list of all documents that have been successfully indexed from a specific data source, you can use the following CloudWatch Logs Insights query:

fields DocumentTitle, DocumentId, @timestamp
| filter @logStream like 'SYNC_RUN_HISTORY_REPORT/your-data-source-id/' and ConnectorDocumentStatus.Status = "SUCCESS"
| sort @timestamp desc
| dedup DocumentTitle, DocumentId

The following screenshot shows an example result.
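Queries like the ones in this post can also be run outside the console. The following is a minimal sketch, assuming a CloudWatch Logs client such as the one returned by `boto3.client("logs")` is passed in; it relies on the `start_query` and `get_query_results` operations of that client. The `run_insights_query` helper, its polling defaults, and the log group name you supply are assumptions for illustration.

```python
import time


def run_insights_query(logs_client, log_group, query, start_time, end_time,
                       poll_seconds=1.0, max_polls=60):
    """Run a CloudWatch Logs Insights query and return the rows as dicts.

    logs_client: a CloudWatch Logs client, for example boto3.client("logs").
    log_group: name of the Amazon Kendra index log group (supplied by you).
    start_time, end_time: query window as epoch seconds.
    """
    query_id = logs_client.start_query(
        logGroupName=log_group,
        startTime=start_time,
        endTime=end_time,
        queryString=query,
    )["queryId"]

    for _ in range(max_polls):
        response = logs_client.get_query_results(queryId=query_id)
        status = response["status"]
        if status == "Complete":
            # Each row is a list of {"field": ..., "value": ...} cells.
            return [
                {cell["field"]: cell["value"] for cell in row}
                for row in response["results"]
            ]
        if status in ("Failed", "Cancelled", "Timeout"):
            raise RuntimeError(f"query {query_id} ended with status {status}")
        time.sleep(poll_seconds)  # still Scheduled or Running
    raise TimeoutError(f"query {query_id} did not complete in time")
```

Passing the client in as an argument keeps the helper free of credentials and Region handling, and makes it straightforward to exercise with a stub client.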

List all successfully indexed documents from a data source sync job

To retrieve a list of all documents that have been successfully indexed during a specific sync job, you can use the following CloudWatch Logs Insights query:

fields DocumentTitle, DocumentId, ConnectorDocumentStatus.Status AS IndexStatus, @timestamp
| filter @logStream like 'SYNC_RUN_HISTORY_REPORT/your-data-source-id/run-id' and ConnectorDocumentStatus.Status = "SUCCESS"
| sort DocumentTitle

The following screenshot shows an example result.

List all failed indexed documents from a data source sync job

To retrieve a list of all documents that failed to index during a specific sync job, along with the error messages, you can use the following CloudWatch Logs Insights query:

fields DocumentTitle, DocumentId, ConnectorDocumentStatus.Status AS IndexStatus, ErrorMsg, @timestamp
| filter @logStream like 'SYNC_RUN_HISTORY_REPORT/your-data-source-id/run-id' and ConnectorDocumentStatus.Status = "FAILED"
| sort @timestamp desc

The following screenshot shows an example result.
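The success and failure queries above differ only in the log stream prefix and the status filter, so they can be generated from a small template. The sketch below is one way to do that; the `build_status_query` helper and the character set it accepts for IDs are assumptions, while the field names and log stream layout are taken from the queries in this post.

```python
import re

# Assumed ID format: letters, digits, hyphens, and underscores only.
_ID_PATTERN = re.compile(r"^[A-Za-z0-9_-]+$")
_STATUSES = ("SUCCESS", "FAILED", "SKIPPED")


def build_status_query(data_source_id, status, run_id=None):
    """Build a Logs Insights query listing documents with the given status.

    With run_id set, the query is limited to one sync job; otherwise it
    covers every sync run of the data source.
    """
    if status not in _STATUSES:
        raise ValueError(f"status must be one of {_STATUSES}")
    for value in (data_source_id, run_id):
        if value is not None and not _ID_PATTERN.match(value):
            raise ValueError(f"unexpected characters in ID: {value!r}")

    stream = f"SYNC_RUN_HISTORY_REPORT/{data_source_id}/" + (run_id or "")
    fields = ["DocumentTitle", "DocumentId",
              "ConnectorDocumentStatus.Status AS IndexStatus"]
    if status == "FAILED":
        fields.append("ErrorMsg")  # error text is only useful for failures
    fields.append("@timestamp")

    return "\n".join([
        "fields " + ", ".join(fields),
        f"| filter @logStream like '{stream}' "
        f'and ConnectorDocumentStatus.Status = "{status}"',
        "| sort @timestamp desc",
    ])
```

Validating the IDs before interpolating them avoids producing a malformed query if a value containing quotes is passed in by mistake.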

List all documents that contain a user’s ACL permission from an Amazon Kendra index

To retrieve a list of documents that have a specific user's ACL permission, you can use the following CloudWatch Logs Insights query:

filter @logStream like 'SYNC_RUN_HISTORY_REPORT/' and Acl like 'aneesh@mydemoaws.onmicrosoft.com'
| display DocumentTitle, SourceUri

The following screenshot shows an example result.

List the ACL of an indexed document from a data source sync job

To retrieve the ACL information for a specific indexed document from a sync job, you can use the following CloudWatch Logs Insights query:

filter @logStream like 'SYNC_RUN_HISTORY_REPORT/data-source-id/run-id' and DocumentTitle = "your-document-title"
| display DocumentTitle, Acl

The following screenshot shows an example result.

List metadata of an indexed document from a data source sync job

To retrieve the metadata information for a specific indexed document from a sync job, you can use the following CloudWatch Logs Insights query:

filter @logStream like 'SYNC_RUN_HISTORY_REPORT/data-source-id/run-id' and DocumentTitle = "your-document-title"
| display DocumentTitle, Metadata

The following screenshot shows an example result.
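After a query returns the Metadata field, it is often easier to work with as a dictionary. The sketch below assumes the field is a JSON array of objects shaped like `{"key": ..., "value": {"dateValue": ...}}`, which is inferred from the parse pattern used earlier in this post rather than from a documented schema. The `metadata_to_dict` helper and the sample value are hypothetical.

```python
import json


def metadata_to_dict(metadata_json: str) -> dict:
    """Flatten a Metadata field into a {key: value} dictionary.

    Assumes each entry carries exactly one typed value, such as dateValue or
    stringValue; the type name is dropped and only the value is kept.
    """
    result = {}
    for attribute in json.loads(metadata_json):
        typed_value = attribute.get("value") or {}
        result[attribute["key"]] = next(iter(typed_value.values()), None)
    return result


# Usage with a made-up Metadata value (not real report output).
sample = (
    '[{"key":"_last_updated_at","value":{"dateValue":"2024-09-01T00:00:00Z"}},'
    '{"key":"_category","value":{"stringValue":"Machine Learning"}}]'
)
attributes = metadata_to_dict(sample)
```

If the actual report uses a different layout, only the loop body needs to change; the callers can keep working with a plain dictionary.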

Conclusion

The newly introduced document-level report in Amazon Kendra provides enhanced visibility and observability into the document processing lifecycle during data source sync jobs. This feature addresses a critical need expressed by customers for better troubleshooting capabilities and access to detailed information about the indexing status, metadata, and ACLs of individual documents.

The document-level report is stored in a log stream named SYNC_RUN_HISTORY_REPORT within the Amazon Kendra index CloudWatch log group. This report contains comprehensive information for each document, including the document ID, title, overall document sync status, and error messages (if any), along with the ACLs and metadata retrieved from the data sources. The data source sync run history page now includes an Actions column, providing access to the document-level report for each sync run. This feature significantly improves the ability to troubleshoot issues related to document ingestion, access control, and metadata relevance, and provides better visibility into the documents synced with an Amazon Kendra index.

To get started with Amazon Kendra, explore the Getting started guide. To learn more about data source connectors and best practices, see Creating a data source connector.


About the Authors

Aneesh Mohan is a Senior Solutions Architect at Amazon Web Services (AWS), with over 20 years of experience in architecting and delivering high-impact solutions for mission-critical workloads. His expertise spans across the financial services industry, AI/ML, security, and data technologies. Driven by a deep passion for technology, Aneesh is dedicated to partnering with customers to design and implement well-architected, innovative solutions that address their unique business needs.

Ashwin Shukla is a Software Development Engineer II on the Amazon Q for Business and Amazon Kendra engineering team, with 6 years of experience in developing enterprise software. In this role, he works on designing and developing foundational features for Amazon Q for Business.
