
As organizations scale their Amazon Elastic Kubernetes Service (Amazon EKS) deployments, platform administrators face growing challenges in efficiently managing multi-tenant clusters. Tasks such as investigating pod failures, addressing resource constraints, and resolving misconfigurations can consume significant time and effort. Instead of spending valuable engineering hours manually parsing logs, tracking metrics, and implementing fixes, teams should focus on driving innovation. Now, with the power of generative AI, you can transform your Kubernetes operations. By implementing intelligent cluster monitoring, pattern analysis, and automated remediation, you can dramatically reduce both mean time to identify (MTTI) and mean time to resolve (MTTR) for common cluster issues.

At AWS re:Invent 2024, we announced the multi-agent collaboration capability for Amazon Bedrock (preview). With multi-agent collaboration, you can build, deploy, and manage multiple AI agents working together on complex multi-step tasks that require specialized skills. Because troubleshooting an EKS cluster involves deriving insights from multiple observability signals and applying fixes using continuous integration and delivery (CI/CD) pipelines, a multi-agent workflow can help an operations team streamline the management of EKS clusters. A workflow manager agent can integrate with individual agents that interact with specific observability signals and CI/CD workflows, orchestrating and performing tasks based on user prompts.

In this post, we demonstrate how to orchestrate multiple Amazon Bedrock agents to create a sophisticated Amazon EKS troubleshooting system. By enabling collaboration between specialized agents that derive insights from K8sGPT and perform actions through the ArgoCD framework, you can build a comprehensive automation system that identifies, analyzes, and resolves cluster issues with minimal human intervention.

Solution overview

The architecture consists of the following core components:

  • Amazon Bedrock collaborator agent – Orchestrates the workflow and maintains context, routing user prompts to the specialized agents and managing multi-step operations and agent interactions
  • Amazon Bedrock agent for K8sGPT – Evaluates cluster and pod events for security issues, misconfigurations, and performance problems through K8sGPT's Analyze API, providing remediation suggestions in natural language
  • Amazon Bedrock agent for ArgoCD – Manages GitOps-based remediation through ArgoCD, handling rollbacks, resource optimization, and configuration updates

The following diagram illustrates the solution architecture.

Prerequisites

You need to have the following prerequisites in place:

Set up the Amazon EKS cluster with K8sGPT and ArgoCD

We start by installing and configuring the K8sGPT operator and the ArgoCD controller on the EKS cluster.

The K8sGPT operator helps enable AI-powered analysis and troubleshooting of cluster issues. For example, it can automatically detect and suggest fixes for misconfigured deployments, such as identifying and resolving resource constraint problems in pods.

ArgoCD is a declarative GitOps continuous delivery tool for Kubernetes that automates the deployment of applications by keeping the desired application state in sync with what’s defined in a Git repository.

The Amazon Bedrock agent serves as the intelligent decision-maker in our architecture, analyzing cluster issues detected by K8sGPT. After the root cause is identified, the agent orchestrates corrective actions through ArgoCD’s GitOps engine. This powerful integration means that when problems are detected (whether it’s a misconfigured deployment, resource constraints, or scaling issue), the agent can automatically integrate with ArgoCD to provide the necessary fixes. ArgoCD then picks up these changes and synchronizes them with your EKS cluster, creating a truly self-healing infrastructure.
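The orchestration described above is driven through the Amazon Bedrock agent runtime API. As a rough illustration, the following Python sketch sends a prompt to the collaborator agent and collects the streamed reply. The helper names are ours, and the sketch assumes boto3 credentials with permission to invoke the agent; the agent and alias IDs come from the stack outputs covered later in this post.

```python
def collect_completion(event_stream):
    # invoke_agent streams the reply as events carrying raw byte chunks;
    # join them into a single string
    parts = []
    for event in event_stream:
        chunk = event.get("chunk")
        if chunk and "bytes" in chunk:
            parts.append(chunk["bytes"].decode("utf-8"))
    return "".join(parts)


def ask_collaborator(agent_id, agent_alias_id, prompt):
    # Requires boto3 and AWS credentials allowing bedrock:InvokeAgent
    import boto3

    client = boto3.client("bedrock-agent-runtime", region_name="us-east-1")
    response = client.invoke_agent(
        agentId=agent_id,
        agentAliasId=agent_alias_id,
        sessionId="eks-troubleshooting-session",  # reuse to keep conversation context
        inputText=prompt,
    )
    return collect_completion(response["completion"])
```

For example, `ask_collaborator(collab_id, collab_alias_id, "We got a down alert for the memory-demo app. Help us with the root cause.")` would route the prompt through the collaborator to the appropriate specialized agent.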

  1. Create the necessary namespaces in Amazon EKS:

    kubectl create ns helm-guestbook
    kubectl create ns k8sgpt-operator-system
  2. Add the k8sgpt Helm repository and install the operator:

    helm repo add k8sgpt https://charts.k8sgpt.ai/
    helm repo update
    helm install k8sgpt-operator k8sgpt/k8sgpt-operator \
      --namespace k8sgpt-operator-system
  3. You can verify the installation by entering the following command:

    kubectl get pods -n k8sgpt-operator-system
    
    NAME                                                          READY   STATUS    RESTARTS  AGE
    release-k8sgpt-operator-controller-manager-5b749ffd7f-7sgnd   2/2     Running   0         1d

After the operator is deployed, you can configure a K8sGPT resource. This Custom Resource Definition (CRD) holds the large language model (LLM) configuration that supports AI-powered analysis and troubleshooting of cluster issues. K8sGPT supports various backends for this analysis. For this post, we use Amazon Bedrock as the backend and Anthropic's Claude V3 as the LLM.

  1. Create a Pod Identity association to give the K8sGPT workload on the EKS cluster access to Amazon Bedrock:

    eksctl create podidentityassociation \
      --cluster PetSite \
      --namespace k8sgpt-operator-system \
      --service-account-name k8sgpt \
      --role-name k8sgpt-app-eks-pod-identity-role \
      --permission-policy-arns arn:aws:iam::aws:policy/AmazonBedrockFullAccess \
      --region $AWS_REGION
  2. Configure the K8sGPT CRD:

    cat << EOF > k8sgpt.yaml
    apiVersion: core.k8sgpt.ai/v1alpha1
    kind: K8sGPT
    metadata:
      name: k8sgpt-bedrock
      namespace: k8sgpt-operator-system
    spec:
      ai:
        enabled: true
        model: anthropic.claude-v3
        backend: amazonbedrock
        region: us-east-1
        credentials:
          secretRef:
            name: k8sgpt-secret
            namespace: k8sgpt-operator-system
      noCache: false
      repository: ghcr.io/k8sgpt-ai/k8sgpt
      version: v0.3.48
    EOF
    
    kubectl apply -f k8sgpt.yaml
    
  3. Validate the settings to confirm the k8sgpt-bedrock pod is running successfully:

    kubectl get pods -n k8sgpt-operator-system
    NAME                                                          READY   STATUS    RESTARTS      AGE
    k8sgpt-bedrock-5b655cbb9b-sn897                               1/1     Running   9 (22d ago)   22d
    release-k8sgpt-operator-controller-manager-5b749ffd7f-7sgnd   2/2     Running   3 (10h ago)   22d
  4. Now you can configure the ArgoCD controller:

    helm repo add argo https://argoproj.github.io/argo-helm
    helm repo update
    kubectl create namespace argocd
    helm install argocd argo/argo-cd \
      --namespace argocd \
      --create-namespace
  5. Verify the ArgoCD installation:

    kubectl get pods -n argocd
    NAME                                                READY   STATUS    RESTARTS   AGE
    argocd-application-controller-0                     1/1     Running   0          43d
    argocd-applicationset-controller-5c787df94f-7jpvp   1/1     Running   0          43d
    argocd-dex-server-55d5769f46-58dwx                  1/1     Running   0          43d
    argocd-notifications-controller-7ccbd7fb6-9pptz     1/1     Running   0          43d
    argocd-redis-587d59bbc-rndkp                        1/1     Running   0          43d
    argocd-repo-server-76f6c7686b-rhjkg                 1/1     Running   0          43d
    argocd-server-64fcc786c-bd2t8                       1/1     Running   0          43d
  6. Patch the argocd-server service to expose it through an external load balancer:

    kubectl patch svc argocd-server -n argocd -p '{"spec": {"type": "LoadBalancer"}}'
  7. You can now access the ArgoCD UI with the following load balancer endpoint and the credentials for the admin user:

    kubectl get svc argocd-server -n argocd
    NAME            TYPE           CLUSTER-IP       EXTERNAL-IP                                                              PORT(S)                      AGE
    argocd-server   LoadBalancer   10.100.168.229   a91a6fd4292ed420d92a1a5c748f43bc-653186012.us-east-1.elb.amazonaws.com   80:32334/TCP,443:32261/TCP   43d
  8. Retrieve the credentials for the ArgoCD UI:

    export argocdpassword=`kubectl -n argocd get secret argocd-initial-admin-secret \
    -o jsonpath="{.data.password}" | base64 -d`
    
    echo ArgoCD admin password - $argocdpassword
  9. Push the credentials to AWS Secrets Manager:

    aws secretsmanager create-secret \
    --name argocdcreds \
    --description "Credentials for argocd" \
    --secret-string "{\"USERNAME\":\"admin\",\"PASSWORD\":\"$argocdpassword\"}"
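The ArgoCD action Lambdas can later read this secret at runtime to authenticate against the ArgoCD API. The following is a rough sketch of fetching and parsing the credentials; the helper names are ours, and it assumes the secret name above plus boto3 credentials allowing secretsmanager:GetSecretValue.

```python
import json


def parse_argocd_creds(secret_string):
    # The secret was stored as a JSON object with USERNAME and PASSWORD keys
    creds = json.loads(secret_string)
    return creds["USERNAME"], creds["PASSWORD"]


def get_argocd_creds(secret_name="argocdcreds", region="us-east-1"):
    # Requires boto3 and IAM permission secretsmanager:GetSecretValue
    import boto3

    client = boto3.client("secretsmanager", region_name=region)
    resp = client.get_secret_value(SecretId=secret_name)
    return parse_argocd_creds(resp["SecretString"])
```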
  10. Configure a sample application in ArgoCD:

    cat << EOF > argocd-application.yaml
    apiVersion: argoproj.io/v1alpha1
    kind: Application
    metadata:
      name: helm-guestbook
      namespace: argocd
    spec:
      project: default
      source:
        repoURL: https://github.com/awsvikram/argocd-example-apps
        targetRevision: HEAD
        path: helm-guestbook
      destination:
        server: https://kubernetes.default.svc
        namespace: helm-guestbook
      syncPolicy:
        automated:
          prune: true
          selfHeal: true
    EOF
  11. Apply the configuration and verify it from the ArgoCD UI by logging in as the admin user:

    kubectl apply -f argocd-application.yaml

    ArgoCD Application

  12. It takes some time for K8sGPT to analyze newly created pods. To trigger the analysis immediately, restart the pods in the k8sgpt-operator-system namespace by entering the following command:

    kubectl -n k8sgpt-operator-system rollout restart deploy
    
    deployment.apps/k8sgpt-bedrock restarted
    deployment.apps/k8sgpt-operator-controller-manager restarted

Set up the Amazon Bedrock agents for K8sGPT and ArgoCD

We use a CloudFormation stack to deploy the individual agents into the US East (N. Virginia) Region. Deploying the CloudFormation template creates several resources (you will incur costs for the AWS resources used).

Use the following parameters for the CloudFormation template:

  • EnvironmentName: The name for the deployment (EKSBlogSetup)
  • ArgoCD_LoadBalancer_URL: The ArgoCD load balancer URL, which you can retrieve with the following command:

    kubectl get service argocd-server -n argocd -o jsonpath="{.status.loadBalancer.ingress[0].hostname}"
  • AWSSecretName: The Secrets Manager secret name that was created to store ArgoCD credentials

The stack creates the following AWS Lambda functions:

  • <Stack name>-LambdaK8sGPTAgent-<auto-generated>
  • <Stack name>-RestartRollBackApplicationArgoCD-<auto-generated>
  • <Stack name>-ArgocdIncreaseMemory-<auto-generated>
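These Lambda functions all follow the same Amazon Bedrock action-group contract: the agent invokes the function with a structured event, and the function returns a response body the agent can reason over. The following is a minimal sketch of that handler shape; the function and parameter names here are illustrative, not the stack's actual ones, and the real handlers would call ArgoCD where the comments indicate.

```python
def lambda_handler(event, context):
    # Bedrock action-group events carry the invoked function name and a
    # list of {"name": ..., "value": ...} parameters
    function = event.get("function", "")
    params = {p["name"]: p["value"] for p in event.get("parameters", [])}
    app = params.get("appName", "unknown")

    if function == "restart_application":
        body = f"Restarted application {app}"    # real handler would call ArgoCD
    elif function == "rollback_application":
        body = f"Rolled back application {app}"  # real handler would call ArgoCD
    else:
        body = f"Unsupported function: {function}"

    # Response shape Bedrock expects for function-details action groups
    return {
        "messageVersion": "1.0",
        "response": {
            "actionGroup": event.get("actionGroup"),
            "function": function,
            "functionResponse": {
                "responseBody": {"TEXT": {"body": body}}
            },
        },
    }
```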

The stack creates the following Amazon Bedrock agents:

  • ArgoCDAgent, with the following action groups:
    1. argocd-rollback
    2. argocd-restart
    3. argocd-memory-management
  • K8sGPTAgent, with the following action group:
    1. k8s-cluster-operations
  • CollaboratorAgent

The CollaboratorAgent has the following agents associated to it:

  1. ArgoCDAgent
  2. K8sGPTAgent

The stack outputs the following:

  • LambdaK8sGPTAgentRole, the AWS Identity and Access Management (IAM) role Amazon Resource Name (ARN) associated with the Lambda function handling interactions with the K8sGPT agent on the EKS cluster. This role ARN will be needed at a later stage of the configuration process.
  • K8sGPTAgentAliasId, ID of the K8sGPT Amazon Bedrock agent alias
  • ArgoCDAgentAliasId, ID of the ArgoCD Amazon Bedrock agent alias
  • CollaboratorAgentAliasId, ID of the collaborator Amazon Bedrock agent alias

Assign appropriate permissions to enable K8sGPT Amazon Bedrock agent to access the EKS cluster

To enable the K8sGPT Amazon Bedrock agent to access the EKS cluster, you need to configure the appropriate IAM permissions using Amazon EKS access management APIs. This is a two-step process: first, you create an access entry for the Lambda function’s execution role (which you can find in the CloudFormation template output section), and then you associate the AmazonEKSViewPolicy to grant read-only access to the cluster. This configuration makes sure that the K8sGPT agent has the necessary permissions to monitor and analyze the EKS cluster resources while maintaining the principle of least privilege.

  1. Create an access entry for the Lambda function's execution role:

    export CFN_STACK_NAME=EKS-Troubleshooter
    export EKS_CLUSTER=PetSite
    
    export K8SGPT_LAMBDA_ROLE=`aws cloudformation describe-stacks --stack-name $CFN_STACK_NAME --query "Stacks[0].Outputs[?OutputKey=='LambdaK8sGPTAgentRole'].OutputValue" --output text`
    
    aws eks create-access-entry \
        --cluster-name $EKS_CLUSTER \
        --principal-arn $K8SGPT_LAMBDA_ROLE
  2. Associate the EKS view policy with the access entry:

    aws eks associate-access-policy \
        --cluster-name $EKS_CLUSTER \
        --principal-arn $K8SGPT_LAMBDA_ROLE \
        --policy-arn arn:aws:eks::aws:cluster-access-policy/AmazonEKSViewPolicy \
        --access-scope type=cluster
  3. Verify the Amazon Bedrock agents. The CloudFormation template adds all three required agents. To view the agents, on the Amazon Bedrock console, under Builder tools in the navigation pane, select Agents, as shown in the following screenshot.

Bedrock agents

Perform Amazon EKS troubleshooting using the Amazon Bedrock agentic workflow

Now, test the solution. We explore the following two scenarios:

  1. The agent coordinates with the K8sGPT agent to provide insights into the root cause of a pod failure
  2. The collaborator agent coordinates with the ArgoCD agent to provide a response

Agent coordinates with K8sGPT agent to provide insights into the root cause of a pod failure

In this section, we examine a down alert for a sample application called memory-demo. We’re interested in the root cause of the issue. We use the following prompt: “We got a down alert for the memory-demo app. Help us with the root cause of the issue.”

The agent not only identified the root cause, but also went a step further and suggested a potential fix, which in this case is increasing the memory resources allocated to the application.
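Under the hood, the K8sGPT agent's Lambda can surface findings like this by reading the operator's Result custom resources. The following is a hedged Python sketch; the helper names are ours, and fetching requires the kubernetes client package plus kubeconfig or in-cluster credentials.

```python
def summarize_results(result_items):
    # Condense K8sGPT Result objects into one line per finding
    lines = []
    for item in result_items:
        name = item.get("metadata", {}).get("name", "unknown")
        spec = item.get("spec", {})
        details = spec.get("details", "")
        lines.append(f"{name}: {spec.get('kind', '?')} - {details[:120]}")
    return "\n".join(lines)


def fetch_results(namespace="k8sgpt-operator-system"):
    # Requires the `kubernetes` package and kubeconfig/in-cluster auth
    from kubernetes import client, config

    config.load_kube_config()
    api = client.CustomObjectsApi()
    resp = api.list_namespaced_custom_object(
        group="core.k8sgpt.ai", version="v1alpha1",
        namespace=namespace, plural="results",
    )
    return resp.get("items", [])
```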

K8sgpt agent finding

Collaborator agent coordinates with ArgoCD agent to provide a response

For this scenario, we continue from the previous prompt. The application appears not to have been allocated enough memory, and the allocation should be increased to fix the issue permanently. We can also see that the application is in an unhealthy state in the ArgoCD UI, as shown in the following screenshot.

ArgoUI

Let’s now proceed to increase the memory, as shown in the following screenshot.

Interacting with agent to increase memory

The agent interacted with the argocd_operations Amazon Bedrock agent and successfully increased the memory. The same can be seen in the ArgoCD UI.
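In GitOps terms, the memory increase the agent performed amounts to raising the container memory values in the Git-tracked Deployment manifest so that ArgoCD syncs the change to the cluster. A minimal sketch of that transformation (the helper name is ours, not part of the solution's code):

```python
import copy


def bump_memory(manifest, new_request, new_limit=None):
    # Return a copy of a Deployment manifest with container memory raised;
    # the original dict is left untouched so the diff can be reviewed
    patched = copy.deepcopy(manifest)
    for container in patched["spec"]["template"]["spec"]["containers"]:
        resources = container.setdefault("resources", {})
        resources.setdefault("requests", {})["memory"] = new_request
        if new_limit:
            resources.setdefault("limits", {})["memory"] = new_limit
    return patched
```

Committing the patched manifest back to the repository is what lets ArgoCD's automated sync policy (prune and selfHeal, as configured earlier) roll the change out.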

ArgoUI showing memory increase

Clean up

If you decide to stop using the solution, complete the following steps:

  1. To delete the associated resources deployed using AWS CloudFormation:
    1. On the AWS CloudFormation console, choose Stacks in the navigation pane.
    2. Locate the stack you created during deployment (you assigned it a name).
    3. Select the stack and choose Delete.
  2. If you created the EKS cluster specifically for this implementation, delete it.

Conclusion

By orchestrating multiple Amazon Bedrock agents, we demonstrated how to build an AI-powered Amazon EKS troubleshooting system that simplifies Kubernetes operations. The integration of K8sGPT analysis and ArgoCD deployment automation shows the powerful possibilities of combining specialized AI agents with existing DevOps tools. Although this solution represents an advancement in automated Kubernetes operations, it's important to remember that human oversight remains valuable, particularly for complex scenarios and strategic decisions.

As Amazon Bedrock and its agent capabilities continue to evolve, we can expect even more sophisticated orchestration possibilities. You can extend this solution to incorporate additional tools, metrics, and automated workflows to meet your organization's specific needs.

To learn more about Amazon Bedrock, refer to the following resources:
