Streamline AWS resource troubleshooting with Amazon Bedrock Agents and AWS Support Automation Workflows

As AWS environments grow in complexity, troubleshooting issues with resources can become a daunting task. Manually investigating and resolving problems can be time-consuming and error-prone, especially when dealing with intricate systems. Fortunately, AWS provides a powerful tool called AWS Support Automation Workflows, which is a collection of curated AWS Systems Manager self-service automation runbooks. These runbooks are created by AWS Support Engineering with best practices learned from solving customer issues. They enable AWS customers to troubleshoot, diagnose, and remediate common issues with their AWS resources.

Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models (FMs) from leading AI companies like AI21 Labs, Anthropic, Cohere, Meta, Mistral AI, Stability AI, and Amazon through a single API, along with a broad set of capabilities to build generative AI applications with security, privacy, and responsible AI. Using Amazon Bedrock, you can experiment with and evaluate top FMs for your use case, privately customize them with your data using techniques such as fine-tuning and Retrieval Augmented Generation (RAG), and build agents that execute tasks using your enterprise systems and data sources. Because Amazon Bedrock is serverless, you don’t have to manage infrastructure, and you can securely integrate and deploy generative AI capabilities into your applications using the AWS services you are already familiar with.

In this post, we explore how to use the power of Amazon Bedrock Agents and AWS Support Automation Workflows to create an intelligent agent capable of troubleshooting issues with AWS resources.

Solution overview

Although the solution is versatile and can be adapted to use a variety of AWS Support Automation Workflows, we focus on a specific example: troubleshooting an Amazon Elastic Kubernetes Service (Amazon EKS) worker node that failed to join a cluster. The following diagram provides a high-level overview of troubleshooting agents with Amazon Bedrock.

Our solution is built around the following key components that work together to provide a seamless and efficient troubleshooting experience:

Amazon Bedrock Agents

Amazon Bedrock agent action groups

AWS Lambda function

AWS Identity and Access Management (IAM) role

AWS Support Automation Workflows

The workflow of our solution follows the ReAct approach, in which the agent alternates between reasoning and acting. During the reasoning phase, the user can view the agent’s reasoning steps.

Troubleshooting examples

Let’s take a closer look at a common issue we mentioned earlier and how our agent can assist in troubleshooting it.

EKS worker node failed to join EKS cluster

When an EKS worker node fails to join an EKS cluster, our Amazon Bedrock agent can be invoked with the relevant information: the cluster name and the worker node ID. The agent runs the corresponding AWS Support Automation Workflow, which performs checks such as verifying the worker node’s IAM role permissions and the necessary network connectivity.

The automation workflow runs all the checks. The Amazon Bedrock agent then ingests the troubleshooting output, explains the root cause of the issue to the user, and suggests remediation steps based on the AWSSupport-TroubleshootEKSWorkerNode output, such as updating the worker node’s IAM role or resolving network configuration issues, so that the user can take the necessary actions to resolve the problem.
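As a rough illustration of what running the workflow involves, the following sketch starts a Systems Manager Automation runbook and polls until it leaves an in-progress state. It assumes an object that behaves like a boto3 SSM client is passed in. The function name `run_runbook`, the polling interval, and the set of statuses treated as in-progress are assumptions made for this example, and the solution’s own code may work differently.

```python
import time

# Assumption: these statuses mean the execution is still running.
IN_PROGRESS = {"Pending", "InProgress", "Waiting", "Cancelling"}


def run_runbook(ssm, document_name: str, parameters: dict,
                poll_seconds: float = 5.0, max_polls: int = 120) -> dict:
    """Start an Automation runbook and return its final status and outputs.

    Hypothetical helper: `ssm` is expected to behave like boto3.client("ssm").
    """
    started = ssm.start_automation_execution(
        DocumentName=document_name,
        Parameters=parameters,  # each value is a list of strings
    )
    execution_id = started["AutomationExecutionId"]

    for _ in range(max_polls):
        execution = ssm.get_automation_execution(
            AutomationExecutionId=execution_id
        )["AutomationExecution"]
        status = execution["AutomationExecutionStatus"]
        if status not in IN_PROGRESS:
            return {
                "execution_id": execution_id,
                "status": status,
                "outputs": execution.get("Outputs", {}),
            }
        time.sleep(poll_seconds)
    raise TimeoutError(f"Automation {execution_id} did not finish in time")


# Possible usage (requires boto3 and AWS credentials; values are placeholders):
#   import boto3
#   result = run_runbook(
#       boto3.client("ssm"),
#       "AWSSupport-TroubleshootEKSWorkerNode",
#       {"ClusterName": ["<cluster_name>"], "WorkerID": ["<instance_ID>"]},
#   )
```

Passing the client in as an argument keeps the function easy to exercise with a stub, which is useful because a real runbook execution can take several minutes.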

OpenAPI example

When you create an action group in Amazon Bedrock, you must define the parameters that the agent needs to elicit from the user. You can also define API operations that the agent can invoke using these parameters. To define the API operations, we create an OpenAPI schema in JSON:

"Body_troubleshoot_eks_worker_node_troubleshoot_eks_worker_node_post": {
  "properties": {
    "cluster_name": {
      "type": "string",
      "title": "Cluster Name",
      "description": "The name of the EKS cluster"
    },
    "worker_id": {
      "type": "string",
      "title": "Worker Id",
      "description": "The ID of the worker node"
    }
  },
  "type": "object",
  "required": [
    "cluster_name",
    "worker_id"
  ],
  "title": "Body_troubleshoot_eks_worker_node_troubleshoot_eks_worker_node_post"
}

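The schema consists of the following components:

Body_troubleshoot_eks_worker_node_troubleshoot_eks_worker_node_post: The name of the schema, which describes the request body for the troubleshoot-eks-worker-node POST operation.

Properties: The fields of the request body.

"cluster_name": The name of the EKS cluster. It is a string with a title and a description.

"worker_id": The ID of the worker node. It is a string with a title and a description.

Type: Specifies that the schema is an object.

Required: Lists the fields that must be present, in this case cluster_name and worker_id.

Title: A human-readable title for the schema.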

The OpenAPI schema defines the structure of the request body. To learn more, see Define OpenAPI schemas for your agent’s action groups in Amazon Bedrock and OpenAPI specification.
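To make the role of the required list concrete, the following minimal sketch checks a candidate request body against the schema fragment shown earlier. It is an illustration in plain Python, not a description of how Amazon Bedrock validates requests internally, and the helper name `validate_body` is hypothetical.

```python
# Schema fragment copied from the OpenAPI example above (title fields omitted).
REQUEST_SCHEMA = {
    "type": "object",
    "properties": {
        "cluster_name": {"type": "string", "description": "The name of the EKS cluster"},
        "worker_id": {"type": "string", "description": "The ID of the worker node"},
    },
    "required": ["cluster_name", "worker_id"],
}


def validate_body(body: dict, schema: dict = REQUEST_SCHEMA) -> list:
    """Return a list of problems; an empty list means the body is acceptable.

    Hypothetical helper: only 'required' and string-typed properties are checked.
    """
    problems = []
    # Every name in "required" must be present in the body.
    for name in schema["required"]:
        if name not in body:
            problems.append(f"missing required field: {name}")
    # Fields declared as strings must actually be strings.
    for name, spec in schema["properties"].items():
        if name in body and spec["type"] == "string" and not isinstance(body[name], str):
            problems.append(f"field {name} must be a string")
    return problems


if __name__ == "__main__":
    # The placeholder cluster name is illustrative only.
    print(validate_body({"cluster_name": "demo-cluster"}))
```

A body that omits worker_id yields a single problem for that field, which is the kind of gap the agent closes by asking the user a follow-up question.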

Lambda function code

Now let’s explore the Lambda function code:

@app.post("/troubleshoot-eks-worker-node")
@tracer.capture_method
def troubleshoot_eks_worker_node(
    cluster_name: Annotated[str, Body(description="The name of the EKS cluster")],
    worker_id: Annotated[str, Body(description="The ID of the worker node")]
) -> dict:
    """
    Troubleshoot EKS worker node that failed to join the cluster.

    Args:
        cluster_name (str): The name of the EKS cluster.
        worker_id (str): The ID of the worker node.

    Returns:
        dict: The output of the Automation execution.
    """
    return execute_automation(
        automation_name='AWSSupport-TroubleshootEKSWorkerNode',
        parameters={
            'ClusterName': [cluster_name],
            'WorkerID': [worker_id]
        },
        execution_mode='TroubleshootWorkerNode'
    )

The code consists of the following components:

The @app.post("/troubleshoot-eks-worker-node") decorator registers the function as the handler for POST requests to the /troubleshoot-eks-worker-node path.

The @tracer.capture_method decorator traces the execution of the function.

The cluster_name parameter, annotated with Body(description="The name of the EKS cluster"), is a string that is read from the request body.

The worker_id parameter, annotated with Body(description="The ID of the worker node"), is a string that is read from the request body.

The return type is dict. The function returns the output of the Automation execution.

To link a new AWS Support Automation Workflows (SAW) runbook to the Lambda function, follow the same template.
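For context, the route decorator in the function above sits on top of the event that Amazon Bedrock Agents sends to the Lambda function for an action group defined with an OpenAPI schema. The following sketch shows how the two parameters could be extracted by hand and how a response envelope could be built, assuming the request body arrives as a list of name, type, and value entries under the application/json content type. The function names `extract_parameters` and `build_response` are hypothetical, and the values in the sample event are placeholders.

```python
import json


def extract_parameters(event: dict) -> dict:
    """Collect request-body properties from an action group event into a dict."""
    content = event.get("requestBody", {}).get("content", {})
    properties = content.get("application/json", {}).get("properties", [])
    return {prop["name"]: prop["value"] for prop in properties}


def build_response(event: dict, body: dict, status_code: int = 200) -> dict:
    """Wrap a result in the response envelope for an API-schema action group."""
    return {
        "messageVersion": "1.0",
        "response": {
            "actionGroup": event["actionGroup"],
            "apiPath": event["apiPath"],
            "httpMethod": event["httpMethod"],
            "httpStatusCode": status_code,
            # The body is passed back as a JSON-formatted string.
            "responseBody": {"application/json": {"body": json.dumps(body)}},
        },
    }


# Illustrative event; the action group name and identifiers are made up.
sample_event = {
    "messageVersion": "1.0",
    "actionGroup": "troubleshooting",
    "apiPath": "/troubleshoot-eks-worker-node",
    "httpMethod": "POST",
    "requestBody": {"content": {"application/json": {"properties": [
        {"name": "cluster_name", "type": "string", "value": "demo-cluster"},
        {"name": "worker_id", "type": "string", "value": "i-0123456789abcdef0"},
    ]}}},
}

if __name__ == "__main__":
    params = extract_parameters(sample_event)
    print(params)
    print(build_response(sample_event, {"received": params}))
```

In the function shown in this post, this plumbing is handled by the app object and its decorators, so a new route only needs to declare its parameters and call the runbook.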

Prerequisites

Make sure you have the following prerequisites:

Anthropic’s Claude 3.5 Sonnet enabled in Amazon Bedrock

AWS credentials configured

AWS Command Line Interface (AWS CLI)

AWS Cloud Development Kit (AWS CDK), version …143.0

Deploy the solution

Complete the following steps to deploy the solution:

Clone the GitHub repository and go to the root of your downloaded repository folder:

$ git clone https://github.com/aws-samples/sample-bedrock-agent-for-troubleshooting-aws-resources.git

$ cd sample-bedrock-agent-for-troubleshooting-aws-resources

Install local dependencies:

$ npm install

After configuring your credentials, set the AWS profile to use, replacing <PROFILE_NAME> with the name of your profile:

$ export AWS_PROFILE=<PROFILE_NAME>

Bootstrap the AWS CDK environment (this is a one-time activity and is not needed if your AWS account is already bootstrapped):

$ cdk bootstrap

Deploy all the stacks of the application to your AWS account and AWS Region:

$ cdk deploy --all

Test the agent

Navigate to the Amazon Bedrock Agents console in your Region and find your deployed agent. You will find the agent ID in the cdk deploy command output.

You can now interact with the agent and test troubleshooting a worker node not joining an EKS cluster. The following are some example questions:

<instance_ID>

The following screenshot shows the console view of the agent.

The agent understood the question and mapped it to the right action group. It also noticed that the required parameters were missing from the user prompt, so it came back with a follow-up question to request the Amazon Elastic Compute Cloud (Amazon EC2) instance ID and the EKS cluster name.

We can see the agent’s thought process in trace step 1. The agent determines that it is ready to call the right Lambda function and the right API path.

With the results coming back from the runbook, the agent reviews the troubleshooting outcome. It goes through the information and writes a solution with instructions for the user to follow.

In the answer provided, the agent spotted all the issues and turned them into solution steps. The agent also mentions the relevant details, such as the IAM policy and the required tag.
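The same conversation can also be driven from code rather than the console. The following sketch assumes an object that behaves like a boto3 bedrock-agent-runtime client is passed in, and it collects the streamed answer along with any trace events. The function name `ask_agent` is hypothetical, and the agent ID, alias ID, and question in the usage comment are placeholders.

```python
import uuid


def ask_agent(client, agent_id: str, agent_alias_id: str, question: str,
              session_id=None) -> dict:
    """Send one question to an agent and gather the answer text and trace events.

    Hypothetical helper: `client` is expected to behave like
    boto3.client("bedrock-agent-runtime").
    """
    response = client.invoke_agent(
        agentId=agent_id,
        agentAliasId=agent_alias_id,
        # Reuse a session ID across calls to keep the conversation context.
        sessionId=session_id or str(uuid.uuid4()),
        inputText=question,
        enableTrace=True,  # ask for the reasoning steps in the event stream
    )
    answer_bytes, traces = [], []
    for event in response["completion"]:
        if "chunk" in event:
            answer_bytes.append(event["chunk"]["bytes"])
        elif "trace" in event:
            traces.append(event["trace"])
    # Decode once so that multi-byte characters split across chunks stay intact.
    return {"answer": b"".join(answer_bytes).decode("utf-8"), "traces": traces}


# Possible usage (requires boto3 and AWS credentials; values are placeholders):
#   import boto3
#   client = boto3.client("bedrock-agent-runtime")
#   result = ask_agent(client, "<agent_ID>", "<agent_alias_ID>",
#                      "Why did my EKS worker node fail to join the cluster?")
#   print(result["answer"])
```

Passing the same session ID on a second call lets the user answer the agent’s follow-up question about the instance ID and cluster name within one conversation.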

Clean up

When implementing Amazon Bedrock Agents, there are no additional charges for resource construction. However, costs are incurred for embedding model and text model invocations on Amazon Bedrock, with charges based on the pricing of each FM used. In this use case, you will also incur costs for Lambda invocations.

To avoid incurring future charges, delete the resources created by the AWS CDK. From the root of your repository folder, run the following command:

$ cdk destroy --all

Conclusion

Amazon Bedrock Agents and AWS Support Automation Workflows are powerful tools that, when combined, can revolutionize AWS resource troubleshooting. In this post, we explored a serverless application built with the AWS CDK that demonstrates how these technologies can be integrated to create an intelligent troubleshooting agent. By defining action groups within the Amazon Bedrock agent and associating them with specific scenarios and automation workflows, we’ve developed a highly efficient process for diagnosing and resolving issues such as Amazon EKS worker node failures.

Our solution showcases the potential for automating complex troubleshooting tasks, saving time and streamlining operations. Powered by Anthropic’s Claude 3.5 Sonnet, the agent can understand and respond in languages other than English, such as French, Japanese, and Spanish, which makes it accessible to global teams while maintaining its technical accuracy and effectiveness. The agent identifies root causes and provides actionable insights, while automatically running the relevant AWS Support Automation Workflows. This approach can reduce downtime and can be extended to other AWS services and use cases, making it a versatile foundation for organizations looking to enhance their AWS infrastructure management.

Explore AWS Support Automation Workflows for additional use cases, and consider using this solution as a starting point for building more comprehensive troubleshooting agents tailored to your organization’s needs. To learn more about using agents to orchestrate workflows, see Automate tasks in your application using conversational agents. For details about using guardrails to safeguard your generative AI applications, refer to Stop harmful content in models using Amazon Bedrock Guardrails.

Happy coding!

Acknowledgements

The authors thank all the reviewers for their valuable feedback.

About the Authors

Wael Dimassi is a Technical Account Manager at AWS, building on his 7-year background as a Machine Learning specialist. He enjoys learning about AWS AI/ML services and helping customers meet their business outcomes by building solutions for them.

Marwen Benzarti is a Senior Cloud Support Engineer at AWS Support where he specializes in Infrastructure as Code. With over 4 years at AWS and 2 years of previous experience as a DevOps engineer, Marwen works closely with customers to implement AWS best practices and troubleshoot complex technical challenges. Outside of work, he enjoys playing both competitive multiplayer and immersive story-driven video games.