AWS Machine Learning Blog 03月21日
Streamline AWS resource troubleshooting with Amazon Bedrock Agents and AWS Support Automation Workflows
index_new5.html
../../../zaker_core/zaker_tpl_static/wap/tpl_guoji1.html

 

本文介绍了如何利用Amazon Bedrock Agents和AWS Support Automation Workflows来创建一个智能代理,以解决AWS资源的问题。该方案通过自然语言处理用户的问题,并调用相应的AWS Support Automation Workflows,自动诊断和修复常见问题。文章详细阐述了方案的核心组件,包括Amazon Bedrock Agents、Lambda函数、IAM角色和AWS Support Automation Workflows,并提供了EKS worker node无法加入集群的故障排除示例,以及OpenAPI schema和Lambda函数代码示例。

🤖 Amazon Bedrock Agents充当用户与AWS Support Automation Workflows之间的智能接口,通过自然语言处理用户查询,理解问题上下文,并管理对话流程以收集所需信息。该代理使用Anthropic的Claude 3.5 Sonnet模型进行高级推理和响应生成,实现自然交互。

⚙️ Amazon Bedrock代理操作组使用OpenAPI规范定义了代理可以调用的结构化API操作,定义了代理与AWS Lambda函数之间的接口,包括可用操作、所需参数和预期响应。每个操作组都包含API模式,指导代理如何正确格式化请求和解释与Lambda函数交互时的响应。

⚡ Lambda函数充当Amazon Bedrock代理与AWS Support Automation Workflows之间的集成层,验证来自代理的输入参数,并启动相应的SAW运行手册执行。它监视自动化进度,并将技术输出处理为结构化格式。工作流程完成后,它将格式化的结果返回给代理以供用户呈现。

🛠️ AWS Support Automation Workflows是由AWS Support Engineering开发的预构建诊断运行手册,这些工作流程基于AWS最佳实践执行全面的系统检查,并以标准化、可重复的方式进行。它们涵盖了广泛的AWS服务和常见问题,封装了AWS Support的广泛故障排除专业知识。

As AWS environments grow in complexity, troubleshooting issues with resources can become a daunting task. Manually investigating and resolving problems can be time-consuming and error-prone, especially when dealing with intricate systems. Fortunately, AWS provides a powerful tool called AWS Support Automation Workflows, which is a collection of curated AWS Systems Manager self-service automation runbooks. These runbooks are created by AWS Support Engineering with best practices learned from solving customer issues. They enable AWS customers to troubleshoot, diagnose, and remediate common issues with their AWS resources.

Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models (FMs) from leading AI companies like AI21 Labs, Anthropic, Cohere, Meta, Mistral AI, Stability AI, and Amazon through a single API, along with a broad set of capabilities to build generative AI applications with security, privacy, and responsible AI. Using Amazon Bedrock, you can experiment with and evaluate top FMs for your use case, privately customize them with your data using techniques such as fine-tuning and Retrieval Augmented Generation (RAG), and build agents that execute tasks using your enterprise systems and data sources. Because Amazon Bedrock is serverless, you don’t have to manage infrastructure, and you can securely integrate and deploy generative AI capabilities into your applications using the AWS services you are already familiar with.

In this post, we explore how to use the power of Amazon Bedrock Agents and AWS Support Automation Workflows to create an intelligent agent capable of troubleshooting issues with AWS resources.

Solution overview

Although the solution is versatile and can be adapted to use a variety of AWS Support Automation Workflows, we focus on a specific example: troubleshooting an Amazon Elastic Kubernetes Service (Amazon EKS) worker node that failed to join a cluster. The following diagram provides a high-level overview of troubleshooting agents with Amazon Bedrock.

Our solution is built around the following key components that work together to provide a seamless and efficient troubleshooting experience:

The following steps outline the workflow of our solution:

    Users start by describing their AWS resource issue in natural language through the Amazon Bedrock chat console. For example, “Why isn’t my EKS worker node joining the cluster?” The Amazon Bedrock agent analyzes the user’s question and matches it to the appropriate action defined in its OpenAPI schema. If essential information is missing, such as a cluster name or instance ID, the agent engages in a natural conversation to gather the required parameters. This makes sure that necessary data is collected before proceeding with the troubleshooting workflow. The Lambda function receives the validated request and triggers the corresponding AWS Support Automation Workflow. These SAW runbooks contain comprehensive diagnostic checks developed by AWS Support Engineering to identify common issues and their root causes. The checks run automatically without requiring user intervention. The SAW runbook systematically executes its diagnostic checks and compiles the findings. These results, including identified issues and configuration problems, are structured in JSON format and returned to the Lambda function. The Amazon Bedrock agent processes the diagnostic results using chain of thought (CoT) reasoning, based on the ReAct (synergizing reasoning and acting) technique. This enables the agent to analyze the technical findings, identify root causes, generate clear explanations, and provide step-by-step remediation guidance.

During the reasoning phase of the agent, the user is able to view the reasoning steps.

Troubleshooting examples

Let’s take a closer look at a common issue we mentioned earlier and how our agent can assist in troubleshooting it.

EKS worker node failed to join EKS cluster

When an EKS worker node fails to join an EKS cluster, our Amazon Bedrock agent can be invoked with the relevant information: cluster name and worker node ID. The agent will execute the corresponding AWS Support Automation Workflow, which will perform checks like verifying the worker node’s IAM role permissions and verifying the necessary network connectivity.

The automation workflow will run all the checks. Then Amazon Bedrock agent will ingest the troubleshooting, explain the root cause of the issue to the user, and suggest remediation steps based on the AWSSupport-TroubleshootEKSWorkerNode output, such as updating the worker node’s IAM role or resolving network configuration issues, enabling them to take the necessary actions to resolve the problem.

OpenAPI example

When you create an action group in Amazon Bedrock, you must define the parameters that the agent needs to invoke from the user. You can also define API operations that the agent can invoke using these parameters. To define the API operations, we will create an OpenAPI schema in JSON:

"Body_troubleshoot_eks_worker_node_troubleshoot_eks_worker_node_post": {        "properties": {          "cluster_name": {            "type": "string",            "title": "Cluster Name",            "description": "The name of the EKS cluster"          },          "worker_id": {            "type": "string",            "title": "Worker Id",            "description": "The ID of the worker node"          }        },        "type": "object",        "required": [          "cluster_name",          "worker_id"        ],        "title": "Body_troubleshoot_eks_worker_node_troubleshoot_eks_worker_node_post"      }

The schema consists of the following components:

The OpenAPI schema defines the structure of the request body. To learn more, see Define OpenAPI schemas for your agent’s action groups in Amazon Bedrock and OpenAPI specification.

Lambda function code

Now let’s explore the Lambda function code:

@app.post("/troubleshoot-eks-worker-node")@tracer.capture_methoddef troubleshoot_eks_worker_node(    cluster_name: Annotated[str, Body(description="The name of the EKS cluster")],    worker_id: Annotated[str, Body(description="The ID of the worker node")]) -> dict:    """    Troubleshoot EKS worker node that failed to join the cluster.    Args:        cluster_name (str): The name of the EKS cluster.        worker_id (str): The ID of the worker node.    Returns:        dict: The output of the Automation execution.    """    return execute_automation(        automation_name='AWSSupport-TroubleshootEKSWorkerNode',        parameters={            'ClusterName': [cluster_name],            'WorkerID': [worker_id]        },        execution_mode='TroubleshootWorkerNode'    )

The code consists of the following components

To link a new SAW runbook in the Lambda function, you can follow the same template.

Prerequisites

Make sure you have the following prerequisites:

Deploy the solution

Complete the following steps to deploy the solution:

    Clone the GitHub repository and go to the root of your downloaded repository folder:
$ git clone https://github.com/aws-samples/sample-bedrock-agent-for-troubleshooting-aws-resources.git
$ cd bedrock-agent-for-troubleshooting-aws-resources
    Install local dependencies:
$ npm install
    Sign in to your AWS account using the AWS CLI by configuring your credential file (replace <PROFILE_NAME> with the profile name of your deployment AWS account):
$ export AWS_PROFILE=PROFILE_NAME
    Bootstrap the AWS CDK environment (this is a one-time activity and is not needed if your AWS account is already bootstrapped):
$ cdk bootstrap
    Run the script to replace the placeholders for your AWS account and AWS Region in the config files:
$ cdk deploy --all

Test the agent

Navigate to the Amazon Bedrock Agents console in your Region and find your deployed agent. You will find the agent ID in the cdk deploy command output.

You can now interact with the agent and test troubleshooting a worker node not joining an EKS cluster. The following are some example questions:

The following screenshot shows the console view of the agent.

The agent understood the question and mapped it with the right action group. It also spotted that the parameters needed are missing in the user prompt. It came back with a follow-up question to require the Amazon Elastic Compute Cloud (Amazon EC2) instance ID and EKS cluster name.

We can see the agent’s thought process in the trace step 1. The agent assesses the next step as ready to call the right Lambda function and right API path.

With the results coming back from the runbook, the agent now reviews the troubleshooting outcome. It goes through the information and will start writing the solution where it provides the instructions for the user to follow.

In the answer provided, the agent was able to spot all the issues and transform that into solution steps. We can also see the agent mentioning the right information like IAM policy and the required tag.

Clean up

When implementing Amazon Bedrock Agents, there are no additional charges for resource construction. However, costs are incurred for embedding model and text model invocations on Amazon Bedrock, with charges based on the pricing of each FM used. In this use case, you will also incur costs for Lambda invocations.

To avoid incurring future charges, delete the created resources by the AWS CDK. From the root of your repository folder, run the following command:

$ npm run cdk destroy --all

Conclusion

Amazon Bedrock Agents and AWS Support Automation Workflows are powerful tools that, when combined, can revolutionize AWS resource troubleshooting. In this post, we explored a serverless application built with the AWS CDK that demonstrates how these technologies can be integrated to create an intelligent troubleshooting agent. By defining action groups within the Amazon Bedrock agent and associating them with specific scenarios and automation workflows, we’ve developed a highly efficient process for diagnosing and resolving issues such as Amazon EKS worker node failures.

Our solution showcases the potential for automating complex troubleshooting tasks, saving time and streamlining operations. Powered by Anthropic’s Claude 3.5 Sonnet, the agent demonstrates improved understanding and responding in languages other than English, such as French, Japanese, and Spanish, making it accessible to global teams while maintaining its technical accuracy and effectiveness. The intelligent agent quickly identifies root causes and provides actionable insights, while automatically executing relevant AWS Support Automation Workflows. This approach not only minimizes downtime, but also scales effectively to accommodate various AWS services and use cases, making it a versatile foundation for organizations looking to enhance their AWS infrastructure management.

Explore the AWS Support Automation Workflow for additional use cases and consider using this solution as a starting point for building more comprehensive troubleshooting agents tailored to your organization’s needs. To learn more about using agents to orchestrate workflows, see Automate tasks in your application using conversational agents. For details about using guardrails to safeguard your generative AI applications, refer to Stop harmful content in models using Amazon Bedrock Guardrails.

Happy coding!

Acknowledgements

The authors thank all the reviewers for their valuable feedback.


About the Authors

Wael Dimassi is a Technical Account Manager at AWS, building on his 7-year background as a Machine Learning specialist. He enjoys learning about AWS AI/ML services and helping customers meet their business outcomes by building solutions for them.

Marwen Benzarti is a Senior Cloud Support Engineer at AWS Support where he specializes in Infrastructure as Code. With over 4 years at AWS and 2 years of previous experience as a DevOps engineer, Marwen works closely with customers to implement AWS best practices and troubleshoot complex technical challenges. Outside of work, he enjoys playing both competitive multiplayer and immersive story-driven video games.

Fish AI Reader

Fish AI Reader

AI辅助创作,多种专业模板,深度分析,高质量内容生成。从观点提取到深度思考,FishAI为您提供全方位的创作支持。新版本引入自定义参数,让您的创作更加个性化和精准。

FishAI

FishAI

鱼阅,AI 时代的下一个智能信息助手,助你摆脱信息焦虑

联系邮箱 441953276@qq.com

相关标签

Amazon Bedrock AWS Support 故障排除 智能代理
相关文章