AWS Machine Learning Blog, October 12, 2024
Boost productivity by using AI in cloud operational health management


Modern organizations increasingly depend on robust cloud infrastructure to provide business continuity and operational efficiency. Operational health events – including operational issues, software lifecycle notifications, and more – serve as critical inputs to cloud operations management. Inefficiencies in handling these events can lead to unplanned downtime, unnecessary costs, and revenue loss for organizations.

However, managing cloud operational events presents significant challenges, particularly in complex organizational structures. With a vast array of services and resource footprints spanning hundreds of accounts, organizations can face an overwhelming volume of operational events occurring daily, making manual administration impractical. Although traditional programmatic approaches offer automation capabilities, they often come with significant development and maintenance overhead, in addition to increasingly complex mapping rules and inflexible triage logic.

This post shows you how to create an AI-powered, event-driven operations assistant that automatically responds to operational events. It uses Amazon Bedrock, AWS Health, AWS Step Functions, and other AWS services. The assistant can filter out irrelevant events (based on your organization’s policies), recommend actions, create and manage issue tickets in integrated IT service management (ITSM) tools to track actions, and query knowledge bases for insights related to operational events. By orchestrating a group of AI endpoints, the agentic AI design of this solution enables the automation of complex tasks, streamlining the remediation processes for cloud operational events. This approach helps organizations overcome the challenges of managing the volume of operational events in complex, cloud-driven environments with minimal human supervision, ultimately improving business continuity and operational efficiency.

Event-driven operations management

Operational events refer to occurrences within your organization’s cloud environment that might impact the performance, resilience, security, or cost of your workloads. Some examples of AWS-sourced operational events include:

    AWS Health events – Notifications related to AWS service availability, operational issues, or scheduled maintenance that might affect your AWS resources.
    AWS Security Hub findings – Alerts about potential security vulnerabilities or misconfigurations identified within your AWS environment.
    AWS Cost Anomaly Detection alerts – Notifications about unusual spending patterns or cost spikes.
    AWS Trusted Advisor findings – Opportunities for optimizing your AWS resources, improving security, and reducing costs.

However, operational events aren’t limited to AWS-sourced events. They can also originate from your own workloads or on-premises environments. In principle, any event that can integrate with your operations management and is of importance to your workload health qualifies as an operational event.
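For orientation, routing AWS-sourced events into a central operations hub is typically a matter of an Amazon EventBridge rule. The following AWS CDK (TypeScript) fragment is a minimal sketch of such a rule for AWS Health events; the construct IDs and the event bus ARN are illustrative placeholders, not values from this solution's code.

import { Stack, StackProps } from 'aws-cdk-lib';
import * as events from 'aws-cdk-lib/aws-events';
import * as targets from 'aws-cdk-lib/aws-events-targets';
import { Construct } from 'constructs';

// Illustrative stack: subscribe to AWS Health events in this account/Region
// and forward them to a central operations event bus (ARN is a placeholder).
export class OpsEventRoutingStack extends Stack {
  constructor(scope: Construct, id: string, props?: StackProps) {
    super(scope, id, props);

    // Central hub that the operations assistant listens to.
    const opsEventBus = events.EventBus.fromEventBusArn(
      this,
      'OpsEventBus',
      'arn:aws:events:us-east-1:111122223333:event-bus/OpsEventBus'
    );

    // Match AWS Health events on the default bus and route them to the hub.
    new events.Rule(this, 'HealthEventsRule', {
      eventPattern: {
        source: ['aws.health'],
        detailType: ['AWS Health Event'],
      },
      targets: [new targets.EventBus(opsEventBus)],
    });
  }
}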

Operational event management is a comprehensive process that provides efficient handling of events from start to finish. It involves notification, triage, progress tracking, action, and archiving and reporting at a large scale. The following is a breakdown of the typical tasks included in each step:

    Notification of events:
      Format notifications in a standardized, user-friendly way.
      Dispatch notifications through instant messaging tools or emails.
    Triage of events:
      Filter out irrelevant or noise events based on predefined company policies.
      Analyze the events' impact by examining their metadata and textual descriptions.
      Convert events into actionable tasks and assign responsible owners based on roles and responsibilities.
      Log tickets or page the appropriate personnel in the chosen ITSM tools.
    Status tracking of events and actions:
      Group related events into threads for straightforward management.
      Update ticket statuses based on the progress of event threads and action owner updates.
    Insights and reporting:
      Query and consolidate knowledge across various event sources and tickets.
      Create business intelligence (BI) dashboards for visual representation and analysis of event data.

A streamlined process should include steps to ensure that events are promptly detected, prioritized, acted upon, and documented for future reference and compliance purposes, enabling efficient operational event management at scale. However, traditional programmatic automation has limitations when handling multiple tasks. For instance, programmatic rules for event attribute-based noise filtering lack flexibility when faced with organizational changes, expansion of the service footprint, or new data source formats, leading to growing complexity.

In a traditional setup, automating impact analysis through keyword matching on free-text descriptions is impractical. Converting events to tickets requires manual effort to generate action hints and lacks correlation to the originating events. Extracting event storylines from long, complex threads of event updates is challenging.
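To illustrate why attribute- and keyword-based rules grow brittle, the following hypothetical TypeScript filter hard-codes service names and noise keywords; every new service, account, or change in event wording requires another rule. This is not the solution's code, only a sketch of the traditional approach.

// Hypothetical example of programmatic noise filtering; not part of the solution.
interface OpsEvent {
  service: string;
  accountId: string;
  description: string;
}

const IGNORED_SERVICES = ['CLOUDFRONT', 'ROUTE53']; // must grow as the footprint grows
const NOISE_KEYWORDS = ['scheduled maintenance', 'informational'];

function isRelevant(event: OpsEvent): boolean {
  if (IGNORED_SERVICES.includes(event.service.toUpperCase())) {
    return false;
  }
  // Keyword matching on free text is fragile: a change in wording breaks the rule.
  return !NOISE_KEYWORDS.some((kw) =>
    event.description.toLowerCase().includes(kw)
  );
}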

Let’s explore an AI-based solution to see how it can help address these challenges and improve productivity.

Solution overview

The solution uses AWS Health and AWS Security Hub findings as sources of operational events to demonstrate the workflow. It can be extended to incorporate additional types of operational events—from AWS or non-AWS sources—by following an event-driven architecture (EDA) approach.

The solution is designed to be fully serverless on AWS and can be deployed as infrastructure as code (IaC) by using the AWS Cloud Development Kit (AWS CDK).

Slack is used as the primary UI, but you can implement the solution using other messaging tools such as Microsoft Teams.

The cost of running and hosting the solution depends on the actual consumption of queries and the size of the vector store and the Amazon Kendra document libraries. See Amazon Bedrock pricing, Amazon OpenSearch pricing, and Amazon Kendra pricing for details.

The full code repository is available in the accompanying GitHub repo.

The following diagram illustrates the solution architecture.

Figure – solution architecture diagram

Solution walk-through

The solution consists of three microservice layers, which we discuss in the following sections.

Event processing layer

The event processing layer manages notifications, acknowledgments, and triage of actions. Its main logic is controlled by two key workflows implemented using Step Functions. The event orchestration workflow subscribes to and processes operational events delivered to the main Amazon EventBridge hub, and emits HealthEventAdded or SecHubEventAdded control events back to the hub. The event notification workflow formats the notifications exchanged between Slack chat and the backend microservices; it listens to control events such as HealthEventAdded and SecHubEventAdded.

Figure – Event orchestration workflow

Figure – Event notification workflow
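As a concrete illustration of one step in the event orchestration workflow, the following AWS CDK (TypeScript) fragment sketches a Step Functions task that publishes a HealthEventAdded control event back to the event hub. The source name, payload mapping, and construct ID are assumptions for illustration, not the solution's actual definitions.

import * as sfn from 'aws-cdk-lib/aws-stepfunctions';
import * as tasks from 'aws-cdk-lib/aws-stepfunctions-tasks';
import * as events from 'aws-cdk-lib/aws-events';
import { Construct } from 'constructs';

// Illustrative fragment: a Step Functions task that emits a HealthEventAdded
// control event to the main event bus (source name and payload are hypothetical).
export function healthEventAddedTask(scope: Construct, bus: events.IEventBus): sfn.TaskStateBase {
  return new tasks.EventBridgePutEvents(scope, 'EmitHealthEventAdded', {
    entries: [
      {
        eventBus: bus,
        source: 'ops.orchestration', // hypothetical custom source
        detailType: 'HealthEventAdded',
        detail: sfn.TaskInput.fromJsonPathAt('$.detail'), // pass the original event payload through
      },
    ],
  });
}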

AI layer

The AI layer handles the interactions between Agents for Amazon Bedrock, Knowledge Bases for Amazon Bedrock, and the UI (Slack chat). It has several key components.

OpsAgent is an operations assistant powered by Anthropic Claude 3 Haiku on Amazon Bedrock. It reacts to operational events based on the event type and text descriptions. OpsAgent is supported by two other AI model endpoints on Amazon Bedrock with different knowledge domains. An action group is defined and attached to OpsAgent, allowing it to solve more complex problems by orchestrating the work of AI endpoints and taking actions such as creating tickets without human supervision.

OpsAgent is pre-prompted with required company policies and guidelines to perform event filtering, triage, and ITSM actions based on your requirements. See the sample escalation policy in the GitHub repo (between escalation_runbook tags).
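For orientation, the following is a minimal sketch of how an agent with an action group can be declared with the AWS CDK L1 CfnAgent construct. The construct IDs, instruction text, role, Lambda function, and OpenAPI schema are placeholders rather than the solution's actual definitions.

import * as bedrock from 'aws-cdk-lib/aws-bedrock';
import * as iam from 'aws-cdk-lib/aws-iam';
import * as lambda from 'aws-cdk-lib/aws-lambda';
import { Construct } from 'constructs';

// Illustrative only: declares a Bedrock agent with one action group whose actions
// (for example, creating ITSM tickets) are executed by a Lambda function.
// The instruction text and the OpenAPI schema string are placeholders.
export function defineOpsAgent(
  scope: Construct,
  agentRole: iam.IRole,
  actionsFn: lambda.IFunction,
  openApiSchemaJson: string
): bedrock.CfnAgent {
  return new bedrock.CfnAgent(scope, 'OpsAgent', {
    agentName: 'OpsAgent',
    agentResourceRoleArn: agentRole.roleArn,
    foundationModel: 'anthropic.claude-3-haiku-20240307-v1:0',
    // Pre-prompt the agent with company policies and triage guidelines.
    instruction: 'You are an operations assistant. Filter, triage, and act on operational events according to the escalation policy.',
    actionGroups: [
      {
        actionGroupName: 'OpsActions',
        actionGroupExecutor: { lambda: actionsFn.functionArn },
        apiSchema: { payload: openApiSchemaJson }, // OpenAPI schema describing the available actions
      },
    ],
  });
}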

OpsAgent uses two supporting AI model endpoints:

    The events expert endpoint uses the Amazon Titan foundation model (FM) in Amazon Bedrock and Amazon OpenSearch Serverless to answer questions about operational events using Retrieval Augmented Generation (RAG).
    The ask-aws endpoint uses the Amazon Titan model and Amazon Kendra as the RAG source. It contains the latest AWS documentation on selected topics. You must synchronize the Amazon Kendra data sources to ensure the underlying AI model is using the latest documentation. You can do this using the AWS Management Console after the solution is deployed.

These dedicated endpoints with specialized RAG data sources help break down complex tasks, improve accuracy, and make sure the correct model is used.
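To illustrate the RAG pattern behind these endpoints, here is a minimal sketch of querying a knowledge base with the AWS SDK for JavaScript RetrieveAndGenerate API. The knowledge base ID and model ARN are placeholders, and the deployed solution may invoke its endpoints differently (for example, from within the agent's action group).

import {
  BedrockAgentRuntimeClient,
  RetrieveAndGenerateCommand,
} from '@aws-sdk/client-bedrock-agent-runtime';

// Illustrative sketch of a RAG query against a knowledge base; IDs are placeholders.
const client = new BedrockAgentRuntimeClient({ region: 'us-east-1' });

export async function askEventsExpert(question: string): Promise<string | undefined> {
  const response = await client.send(
    new RetrieveAndGenerateCommand({
      input: { text: question },
      retrieveAndGenerateConfiguration: {
        type: 'KNOWLEDGE_BASE',
        knowledgeBaseConfiguration: {
          knowledgeBaseId: 'KB_ID_PLACEHOLDER',
          // Placeholder model ARN; the solution uses an Amazon Titan text model.
          modelArn: 'arn:aws:bedrock:us-east-1::foundation-model/amazon.titan-text-premier-v1:0',
        },
      },
    })
  );
  return response.output?.text;
}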

The AI layer also includes two AI orchestration Step Functions workflows. The workflows manage the AI agent, AI model endpoints, and the interaction with the user (through Slack chat):

Figure – AI integration workflow

Figure – AI chatbot workflow
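As a sketch of what the chatbot workflow does conceptually, the following code relays a chat message to a Bedrock agent and collects the streamed reply. The agent and alias IDs are placeholders, and the actual solution implements this orchestration in Step Functions rather than in a single function.

import {
  BedrockAgentRuntimeClient,
  InvokeAgentCommand,
} from '@aws-sdk/client-bedrock-agent-runtime';

// Illustrative sketch: forward a chat message to the agent and gather the streamed answer.
const client = new BedrockAgentRuntimeClient({ region: 'us-east-1' });

export async function chatWithOpsAgent(sessionId: string, message: string): Promise<string> {
  const response = await client.send(
    new InvokeAgentCommand({
      agentId: 'AGENT_ID_PLACEHOLDER',
      agentAliasId: 'ALIAS_ID_PLACEHOLDER',
      sessionId, // reusing the same session ID preserves conversation context
      inputText: message,
    })
  );

  // The completion is returned as a stream of chunks; concatenate the text.
  let completion = '';
  for await (const event of response.completion ?? []) {
    if (event.chunk?.bytes) {
      completion += new TextDecoder().decode(event.chunk.bytes);
    }
  }
  return completion;
}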

Archiving and reporting layer

The archiving and reporting layer handles streaming, storing, and extracting, transforming, and loading (ETL) operational event data. It also prepares a data lake for BI dashboards and reporting analysis. However, this solution doesn’t include an actual dashboard implementation; it prepares an operational event data lake for later development.
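A minimal sketch of one way to land events in such a data lake with the AWS CDK is shown below: a Firehose delivery stream writes to Amazon S3, and an EventBridge rule archives every event that reaches the operations bus. The bucket, role, prefix, and rule pattern are assumptions for illustration, not the solution's actual resources.

import { Stack } from 'aws-cdk-lib';
import * as events from 'aws-cdk-lib/aws-events';
import * as targets from 'aws-cdk-lib/aws-events-targets';
import * as firehose from 'aws-cdk-lib/aws-kinesisfirehose';
import * as iam from 'aws-cdk-lib/aws-iam';
import * as s3 from 'aws-cdk-lib/aws-s3';
import { Construct } from 'constructs';

// Illustrative sketch: archive operational events into an S3-based data lake.
export function addEventArchive(scope: Construct, opsEventBus: events.IEventBus): void {
  const lakeBucket = new s3.Bucket(scope, 'OpsEventLakeBucket');

  const deliveryRole = new iam.Role(scope, 'OpsEventFirehoseRole', {
    assumedBy: new iam.ServicePrincipal('firehose.amazonaws.com'),
  });
  lakeBucket.grantWrite(deliveryRole);

  const deliveryStream = new firehose.CfnDeliveryStream(scope, 'OpsEventDeliveryStream', {
    deliveryStreamType: 'DirectPut',
    extendedS3DestinationConfiguration: {
      bucketArn: lakeBucket.bucketArn,
      roleArn: deliveryRole.roleArn,
      prefix: 'ops-events/', // keep raw events under a common prefix for later ETL
    },
  });

  // Match every event in this account on the bus and stream it to the data lake.
  new events.Rule(scope, 'ArchiveOpsEventsRule', {
    eventBus: opsEventBus,
    eventPattern: { account: [Stack.of(scope).account] },
    targets: [new targets.KinesisFirehoseStream(deliveryStream)],
  });
}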

Use case examples

You can use this solution for automated event notification, autonomous event acknowledgement, and action triage by setting up a virtual supervisor or operator that follows your organization’s policies. The virtual operator is equipped with multiple AI capabilities—each of which is specialized in a specific knowledge domain—such as generating recommended actions or taking actions to issue tickets in ITSM tools, as shown in the following figure.

Figure – use case example 1

The virtual event supervisor filters out noise based on your policies, as illustrated in the following figure.

Figure – use case example 2

AI can use the tickets that are related to a specific AWS Health event to provide the latest status updates on those tickets, as shown in the following figure.

Figure – use case example 3

The following figure shows how the assistant evaluates complex threads of operational events to provide valuable insights.

Figure – use case example 4

The following figure shows a more sophisticated use case.

Figure – use case example 5

Prerequisites

To deploy this solution, complete the prerequisite steps described in the following sections.

Create a Slack app and set up a channel

Set up Slack:

    Create a Slack app from the manifest template, using the content of the slack-app-manifest.json file from the GitHub repository.
    Install your app into your workspace, and take note of the Bot User OAuth Token value to be used in later steps.
    Take note of the Verification Token value under Basic Information of your app; you will need it in later steps.
    In your Slack desktop app, go to your workspace and add the newly created app.
    Create a Slack channel and add the newly created app as an integrated app to the channel.
    Find and take note of the channel ID by choosing (right-clicking) the channel name, choosing Additional options to access the More menu, and choosing Open details to see the channel details.
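For background, the verification token is what the backend uses to confirm that incoming requests really come from your Slack app. The following is a minimal sketch of the kind of Lambda handler that could sit behind the Slack Event Subscriptions URL; it answers Slack's url_verification challenge and rejects requests with the wrong token. It assumes the token is provided through the SLACK_APP_VERIFICATION_TOKEN environment variable and is not the solution's actual handler.

import { APIGatewayProxyEvent, APIGatewayProxyResult } from 'aws-lambda';

// Illustrative Slack Events API handler sketch; the real solution's handler may differ.
export const handler = async (event: APIGatewayProxyEvent): Promise<APIGatewayProxyResult> => {
  const body = JSON.parse(event.body ?? '{}');

  // Reject requests that don't carry the expected verification token.
  if (body.token !== process.env.SLACK_APP_VERIFICATION_TOKEN) {
    return { statusCode: 403, body: 'invalid token' };
  }

  // Slack sends a one-time challenge when you save the Request URL.
  if (body.type === 'url_verification') {
    return { statusCode: 200, body: JSON.stringify({ challenge: body.challenge }) };
  }

  // Other event callbacks (messages, mentions) would be forwarded to the
  // notification and chatbot workflows here.
  return { statusCode: 200, body: 'ok' };
};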

Prepare your deployment environment

Use the following commands to ready your deployment environment for the worker account. Make sure you aren’t running the command under an existing AWS CDK project root directory. This step is required only if you chose a worker account that’s different from the administration account:

# Make sure your shell session environment is configured to access the worker
# account of your choice, for detailed guidance on how to configure, refer to
# https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-configure.html

# Note that in this step you are bootstrapping your worker account in such a way
# that your administration account is trusted to execute CloudFormation deployment in
# your worker account, the following command uses an example execution role policy of 'AdministratorAccess',
# you can swap it for other policies of your own for least privilege best practice,
# for more information on the topic, refer to https://repost.aws/knowledge-center/cdk-customize-bootstrap-cfntoolkit

cdk bootstrap aws://<replace with your AWS account id of the worker account>/<replace with the region where your worker services is> --trust <replace with your AWS account id of the administration account> --cloudformation-execution-policies 'arn:aws:iam::aws:policy/AdministratorAccess' --trust-for-lookup <replace with your AWS account id of the administration account>

Use the following commands to ready your deployment environment for the administration account. Make sure you aren’t running the commands under an existing AWS CDK project root directory:

# Make sure your shell session environment is configured to access the administration
# account of your choice, for detailed guidance on how to configure, refer to
# https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-configure.html

# Note 'us-east-1' region is required for receiving AWS Health events associated with
# services that operate in AWS global region.
cdk bootstrap <replace with your AWS account id of the administration account>/us-east-1

# Optional, if you have your cloud infrastructures hosted in other AWS regions than 'us-east-1',
# repeat the below commands for each region
cdk bootstrap <replace with your AWS account id of the administration account>/<replace with the region name, e.g. us-west-2>

Copy the GitHub repo to your local directory

Use the following code to copy the GitHub repo to your local directory:

git clone https://github.com/aws-samples/ops-health-ai.git
cd ops-health-ai
npm install
cd lambda/src
# Depending on your build environment, you might want to change the arch type to 'x86'
# or 'arm' in lambda/src/template.yaml file before build
sam build --use-container
cd ../..

Create an .env file

Create an .env file containing the following code under the project root directory. Replace the variable placeholders with your account information:

CDK_ADMIN_ACCOUNT=<replace with your 12 digits administration AWS account id>
CDK_PROCESSING_ACCOUNT=<replace with your 12 digits worker AWS account id. This account id is the same as the admin account id if using single account setup>
EVENT_REGIONS=us-east-1,<region 1 of where your infrastructures are hosted>,<region 2 of where your infrastructures are hosted>
CDK_PROCESSING_REGION=<replace with the region where you want the worker services to be, e.g. us-east-1>
EVENT_HUB_ARN=arn:aws:events:<replace with the worker service region>:<replace with the worker service account id>:event-bus/AiOpsStatefulStackAiOpsEventBus
SLACK_CHANNEL_ID=<your Slack channel ID noted down from earlier step>
SLACK_APP_VERIFICATION_TOKEN=<replace with your Slack app verification token>
SLACK_ACCESS_TOKEN=<replace with your Slack Bot User OAuth Token value>

Deploy the solution using the AWS CDK

Deploy the processing microservice to your worker account (the worker account can be the same as your administrator account):

    In the project root directory, run the following command: cdk deploy --all --require-approval never
    Capture the HandleSlackCommApiUrl stack output URL.
    Go to your Slack app, navigate to Event Subscriptions, choose Request URL Change, update the URL value with the stack output URL, and save your settings.

Test the solution

Test the solution by sending a mock operational event to your administration account. Run the following AWS Command Line Interface (AWS CLI) command:
aws events put-events --entries file://test-events/mockup-events.json

You will receive Slack messages notifying you about the mock event, followed by automatic updates from the AI assistant reporting the actions it took and the reasons for each action. You don’t need to manually choose Accept or Discharge for each event.

Try creating more mock events based on your past operational events and test them with the use cases described in the Use case examples section.
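If you prefer to send mock events programmatically, the following AWS SDK for JavaScript sketch shows the general shape of a PutEvents call. The source, detail type, and payload are hypothetical and do not reproduce the repository's mockup-events.json; they must match whatever patterns the deployed rules subscribe to.

import { EventBridgeClient, PutEventsCommand } from '@aws-sdk/client-eventbridge';

// Illustrative only: send a hand-crafted mock event to the default event bus
// of the administration account.
const client = new EventBridgeClient({ region: 'us-east-1' });

export async function sendMockEvent(): Promise<void> {
  await client.send(
    new PutEventsCommand({
      Entries: [
        {
          Source: 'mock.operations',            // hypothetical custom source
          DetailType: 'Mock Operational Event', // hypothetical detail type
          Detail: JSON.stringify({
            service: 'EC2',
            eventDescription: 'Example maintenance notification used for testing.',
          }),
        },
      ],
    })
  );
}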

If you have just enabled AWS Security Hub in your administrator account, you might need to wait for up to 24 hours for any findings to be reported and acted on by the solution. AWS Health events, on the other hand, will be reported whenever applicable.

Clean up

To clean up your resources, run the following command in the CDK project directory: cdk destroy --all

Conclusion

This solution uses AI to help you automate complex tasks in cloud operational event management, creating new opportunities to further streamline cloud operations management at scale with improved productivity and operational resilience.

To learn more about the AWS services used in this solution, see:


About the author

Sean Xiaohai Wang is a Senior Technical Account Manager at Amazon Web Services. He helps enterprise customers build and operate efficiently on AWS.
