
As organizations scale their Amazon Elastic Kubernetes Service (Amazon EKS) deployments, platform administrators face growing challenges in efficiently managing multi-tenant clusters. Tasks such as investigating pod failures, addressing resource constraints, and resolving misconfigurations can consume significant time and effort. Instead of spending valuable engineering hours manually parsing logs, tracking metrics, and implementing fixes, teams should focus on driving innovation. Now, with the power of generative AI, you can transform your Kubernetes operations. By implementing intelligent cluster monitoring, pattern analysis, and automated remediation, you can dramatically reduce both mean time to identify (MTTI) and mean time to resolve (MTTR) for common cluster issues.

At AWS re:Invent 2024, we announced the multi-agent collaboration capability for Amazon Bedrock (preview). With multi-agent collaboration, you can build, deploy, and manage multiple AI agents working together on complex multi-step tasks that require specialized skills. Because troubleshooting an EKS cluster involves deriving insights from multiple observability signals and applying fixes using continuous integration and delivery (CI/CD) pipelines, a multi-agent workflow can help an operations team streamline the management of EKS clusters. A workflow manager agent can integrate with individual agents that interact with specific observability signals and CI/CD workflows, orchestrating and performing tasks based on user prompts.

In this post, we demonstrate how to orchestrate multiple Amazon Bedrock agents to create a sophisticated Amazon EKS troubleshooting system. By enabling collaboration between specialized agents that derive insights from K8sGPT and perform actions through the ArgoCD framework, you can build a comprehensive automation system that identifies, analyzes, and resolves cluster issues with minimal human intervention.

Solution overview

The architecture consists of the following core components:

  • Amazon Bedrock collaborator agent – Orchestrates the workflow and maintains context, routing user prompts to the specialized agents and managing multi-step operations and agent interactions
  • Amazon Bedrock agent for K8sGPT – Evaluates cluster and pod events for security issues, misconfigurations, and performance problems through K8sGPT's Analyze API, providing remediation suggestions in natural language
  • Amazon Bedrock agent for ArgoCD – Manages GitOps-based remediation through ArgoCD, handling rollbacks, resource optimization, and configuration updates

The following diagram illustrates the solution architecture.

Prerequisites

You need to have the following prerequisites in place:

Set up the Amazon EKS cluster with K8sGPT and ArgoCD

We start by installing and configuring the K8sGPT operator and the ArgoCD controller on the EKS cluster.

The K8sGPT operator helps enable AI-powered analysis and troubleshooting of cluster issues. For example, it can automatically detect and suggest fixes for misconfigured deployments, such as identifying and resolving resource constraint problems in pods.

ArgoCD is a declarative GitOps continuous delivery tool for Kubernetes that automates the deployment of applications by keeping the desired application state in sync with what’s defined in a Git repository.

The Amazon Bedrock agent serves as the intelligent decision-maker in our architecture, analyzing cluster issues detected by K8sGPT. After the root cause is identified, the agent orchestrates corrective actions through ArgoCD’s GitOps engine. This powerful integration means that when problems are detected (whether it’s a misconfigured deployment, resource constraints, or scaling issue), the agent can automatically integrate with ArgoCD to provide the necessary fixes. ArgoCD then picks up these changes and synchronizes them with your EKS cluster, creating a truly self-healing infrastructure.
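The orchestration described above is driven through the Amazon Bedrock agent runtime API. As a rough illustration, the following Python sketch sends a prompt to the collaborator agent and collects the streamed reply. The helper names are ours, and the sketch assumes boto3 credentials with permission to invoke the agent; the agent and alias IDs come from the stack outputs covered later in this post.

```python
def collect_completion(event_stream):
    # invoke_agent streams the reply as events carrying raw byte chunks;
    # join them into a single string
    parts = []
    for event in event_stream:
        chunk = event.get("chunk")
        if chunk and "bytes" in chunk:
            parts.append(chunk["bytes"].decode("utf-8"))
    return "".join(parts)


def ask_collaborator(agent_id, agent_alias_id, prompt):
    # Requires boto3 and AWS credentials allowing bedrock:InvokeAgent
    import boto3

    client = boto3.client("bedrock-agent-runtime", region_name="us-east-1")
    response = client.invoke_agent(
        agentId=agent_id,
        agentAliasId=agent_alias_id,
        sessionId="eks-troubleshooting-session",  # reuse to keep conversation context
        inputText=prompt,
    )
    return collect_completion(response["completion"])
```

For example, `ask_collaborator(collab_id, collab_alias_id, "We got a down alert for the memory-demo app. Help us with the root cause.")` would route the prompt through the collaborator to the appropriate specialized agent.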

  1. Create the necessary namespaces in Amazon EKS:

    kubectl create ns helm-guestbook
    kubectl create ns k8sgpt-operator-system
  2. Add the k8sgpt Helm repository and install the operator:

    helm repo add k8sgpt https://charts.k8sgpt.ai/
    helm repo update
    helm install k8sgpt-operator k8sgpt/k8sgpt-operator \
      --namespace k8sgpt-operator-system
  3. You can verify the installation by entering the following command:

    kubectl get pods -n k8sgpt-operator-system
    
    NAME                                                          READY   STATUS    RESTARTS  AGE
    release-k8sgpt-operator-controller-manager-5b749ffd7f-7sgnd   2/2     Running   0         1d

After the operator is deployed, you can configure a K8sGPT resource. This Custom Resource Definition (CRD) holds the large language model (LLM) configuration that supports AI-powered analysis and troubleshooting of cluster issues. K8sGPT supports various backends for this analysis. For this post, we use Amazon Bedrock as the backend and Anthropic's Claude V3 as the LLM.

  1. Create a Pod Identity association to give the K8sGPT workload on the EKS cluster access to Amazon Bedrock:

    eksctl create podidentityassociation \
      --cluster PetSite \
      --namespace k8sgpt-operator-system \
      --service-account-name k8sgpt \
      --role-name k8sgpt-app-eks-pod-identity-role \
      --permission-policy-arns arn:aws:iam::aws:policy/AmazonBedrockFullAccess \
      --region $AWS_REGION
  2. Configure the K8sGPT CRD:

    cat << EOF > k8sgpt.yaml
    apiVersion: core.k8sgpt.ai/v1alpha1
    kind: K8sGPT
    metadata:
      name: k8sgpt-bedrock
      namespace: k8sgpt-operator-system
    spec:
      ai:
        enabled: true
        model: anthropic.claude-v3
        backend: amazonbedrock
        region: us-east-1
        credentials:
          secretRef:
            name: k8sgpt-secret
            namespace: k8sgpt-operator-system
      noCache: false
      repository: ghcr.io/k8sgpt-ai/k8sgpt
      version: v0.3.48
    EOF
    
    kubectl apply -f k8sgpt.yaml
    
  3. Validate the settings to confirm the k8sgpt-bedrock pod is running successfully:

    kubectl get pods -n k8sgpt-operator-system
    NAME                                                          READY   STATUS    RESTARTS      AGE
    k8sgpt-bedrock-5b655cbb9b-sn897                               1/1     Running   9 (22d ago)   22d
    release-k8sgpt-operator-controller-manager-5b749ffd7f-7sgnd   2/2     Running   3 (10h ago)   22d
  4. Now you can configure the ArgoCD controller:

    helm repo add argo https://argoproj.github.io/argo-helm
    helm repo update
    kubectl create namespace argocd
    helm install argocd argo/argo-cd \
      --namespace argocd \
      --create-namespace
  5. Verify the ArgoCD installation:

    kubectl get pods -n argocd
    NAME                                                READY   STATUS    RESTARTS   AGE
    argocd-application-controller-0                     1/1     Running   0          43d
    argocd-applicationset-controller-5c787df94f-7jpvp   1/1     Running   0          43d
    argocd-dex-server-55d5769f46-58dwx                  1/1     Running   0          43d
    argocd-notifications-controller-7ccbd7fb6-9pptz     1/1     Running   0          43d
    argocd-redis-587d59bbc-rndkp                        1/1     Running   0          43d
    argocd-repo-server-76f6c7686b-rhjkg                 1/1     Running   0          43d
    argocd-server-64fcc786c-bd2t8                       1/1     Running   0          43d
  6. Patch the argocd-server service to expose it through an external load balancer:

    kubectl patch svc argocd-server -n argocd -p '{"spec": {"type": "LoadBalancer"}}'
  7. You can now access the ArgoCD UI with the following load balancer endpoint and the credentials for the admin user:

    kubectl get svc argocd-server -n argocd
    NAME            TYPE           CLUSTER-IP       EXTERNAL-IP                                                              PORT(S)                      AGE
    argocd-server   LoadBalancer   10.100.168.229   a91a6fd4292ed420d92a1a5c748f43bc-653186012.us-east-1.elb.amazonaws.com   80:32334/TCP,443:32261/TCP   43d
  8. Retrieve the credentials for the ArgoCD UI:

    export argocdpassword=`kubectl -n argocd get secret argocd-initial-admin-secret \
    -o jsonpath="{.data.password}" | base64 -d`
    
    echo ArgoCD admin password - $argocdpassword
  9. Push the credentials to AWS Secrets Manager:

    aws secretsmanager create-secret \
    --name argocdcreds \
    --description "Credentials for argocd" \
    --secret-string "{\"USERNAME\":\"admin\",\"PASSWORD\":\"$argocdpassword\"}"
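The ArgoCD action Lambdas can later read this secret at runtime to authenticate against the ArgoCD API. The following is a rough sketch of fetching and parsing the credentials; the helper names are ours, and it assumes the secret name above plus boto3 credentials allowing secretsmanager:GetSecretValue.

```python
import json


def parse_argocd_creds(secret_string):
    # The secret was stored as a JSON object with USERNAME and PASSWORD keys
    creds = json.loads(secret_string)
    return creds["USERNAME"], creds["PASSWORD"]


def get_argocd_creds(secret_name="argocdcreds", region="us-east-1"):
    # Requires boto3 and IAM permission secretsmanager:GetSecretValue
    import boto3

    client = boto3.client("secretsmanager", region_name=region)
    resp = client.get_secret_value(SecretId=secret_name)
    return parse_argocd_creds(resp["SecretString"])
```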
  10. Configure a sample application in ArgoCD:

    cat << EOF > argocd-application.yaml
    apiVersion: argoproj.io/v1alpha1
    kind: Application
    metadata:
      name: helm-guestbook
      namespace: argocd
    spec:
      project: default
      source:
        repoURL: https://github.com/awsvikram/argocd-example-apps
        targetRevision: HEAD
        path: helm-guestbook
      destination:
        server: https://kubernetes.default.svc
        namespace: helm-guestbook
      syncPolicy:
        automated:
          prune: true
          selfHeal: true
    EOF
  11. Apply the configuration and verify it from the ArgoCD UI by logging in as the admin user:

    kubectl apply -f argocd-application.yaml

    ArgoCD Application

  12. It takes some time for K8sGPT to analyze newly created pods. To trigger the analysis immediately, restart the pods in the k8sgpt-operator-system namespace by entering the following command:

    kubectl -n k8sgpt-operator-system rollout restart deploy
    
    deployment.apps/k8sgpt-bedrock restarted
    deployment.apps/k8sgpt-operator-controller-manager restarted

Set up the Amazon Bedrock agents for K8sGPT and ArgoCD

We use a CloudFormation stack to deploy the individual agents into the US East (N. Virginia) Region. Deploying the CloudFormation template creates several resources (you will incur costs for the AWS resources used).

Use the following parameters for the CloudFormation template:

  • EnvironmentName: The name for the deployment (EKSBlogSetup)
  • ArgoCD_LoadBalancer_URL: The ArgoCD load balancer URL, which you can retrieve with the following command:

    kubectl get service argocd-server -n argocd -o jsonpath="{.status.loadBalancer.ingress[0].hostname}"
  • AWSSecretName: The Secrets Manager secret name that was created to store ArgoCD credentials

The stack creates the following AWS Lambda functions:

  • <Stack name>-LambdaK8sGPTAgent-<auto-generated>
  • <Stack name>-RestartRollBackApplicationArgoCD-<auto-generated>
  • <Stack name>-ArgocdIncreaseMemory-<auto-generated>
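These Lambda functions all follow the same Amazon Bedrock action-group contract: the agent invokes the function with a structured event, and the function returns a response body the agent can reason over. The following is a minimal sketch of that handler shape; the function and parameter names here are illustrative, not the stack's actual ones, and the real handlers would call ArgoCD where the comments indicate.

```python
def lambda_handler(event, context):
    # Bedrock action-group events carry the invoked function name and a
    # list of {"name": ..., "value": ...} parameters
    function = event.get("function", "")
    params = {p["name"]: p["value"] for p in event.get("parameters", [])}
    app = params.get("appName", "unknown")

    if function == "restart_application":
        body = f"Restarted application {app}"    # real handler would call ArgoCD
    elif function == "rollback_application":
        body = f"Rolled back application {app}"  # real handler would call ArgoCD
    else:
        body = f"Unsupported function: {function}"

    # Response shape Bedrock expects for function-details action groups
    return {
        "messageVersion": "1.0",
        "response": {
            "actionGroup": event.get("actionGroup"),
            "function": function,
            "functionResponse": {
                "responseBody": {"TEXT": {"body": body}}
            },
        },
    }
```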

The stack creates the following Amazon Bedrock agents:

  • ArgoCDAgent, with the following action groups:
    1. argocd-rollback
    2. argocd-restart
    3. argocd-memory-management
  • K8sGPTAgent, with the following action group:
    1. k8s-cluster-operations
  • CollaboratorAgent

The CollaboratorAgent has the following agents associated to it:

  1. ArgoCDAgent
  2. K8sGPTAgent

The stack outputs the following:

  • LambdaK8sGPTAgentRole, the AWS Identity and Access Management (IAM) role Amazon Resource Name (ARN) associated with the Lambda function handling interactions with the K8sGPT agent on the EKS cluster. This role ARN will be needed at a later stage of the configuration process.
  • K8sGPTAgentAliasId, ID of the K8sGPT Amazon Bedrock agent alias
  • ArgoCDAgentAliasId, ID of the ArgoCD Amazon Bedrock agent alias
  • CollaboratorAgentAliasId, ID of the collaborator Amazon Bedrock agent alias

Assign appropriate permissions to enable K8sGPT Amazon Bedrock agent to access the EKS cluster

To enable the K8sGPT Amazon Bedrock agent to access the EKS cluster, you need to configure the appropriate IAM permissions using Amazon EKS access management APIs. This is a two-step process: first, you create an access entry for the Lambda function’s execution role (which you can find in the CloudFormation template output section), and then you associate the AmazonEKSViewPolicy to grant read-only access to the cluster. This configuration makes sure that the K8sGPT agent has the necessary permissions to monitor and analyze the EKS cluster resources while maintaining the principle of least privilege.

  1. Create an access entry for the Lambda function's execution role:

    export CFN_STACK_NAME=EKS-Troubleshooter
    export EKS_CLUSTER=PetSite
    
    export K8SGPT_LAMBDA_ROLE=`aws cloudformation describe-stacks --stack-name $CFN_STACK_NAME --query "Stacks[0].Outputs[?OutputKey=='LambdaK8sGPTAgentRole'].OutputValue" --output text`
    
    aws eks create-access-entry \
        --cluster-name $EKS_CLUSTER \
        --principal-arn $K8SGPT_LAMBDA_ROLE
  2. Associate the EKS view policy with the access entry:

    aws eks associate-access-policy \
        --cluster-name $EKS_CLUSTER \
        --principal-arn $K8SGPT_LAMBDA_ROLE \
        --policy-arn arn:aws:eks::aws:cluster-access-policy/AmazonEKSViewPolicy \
        --access-scope type=cluster
  3. Verify the Amazon Bedrock agents. The CloudFormation template adds all three required agents. To view the agents, on the Amazon Bedrock console, under Builder tools in the navigation pane, select Agents, as shown in the following screenshot.

Bedrock agents

Perform Amazon EKS troubleshooting using the Amazon Bedrock agentic workflow

Now, test the solution. We explore the following two scenarios:

  1. The agent coordinates with the K8sGPT agent to provide insights into the root cause of a pod failure
  2. The collaborator agent coordinates with the ArgoCD agent to provide a response

Agent coordinates with K8sGPT agent to provide insights into the root cause of a pod failure

In this section, we examine a down alert for a sample application called memory-demo. We’re interested in the root cause of the issue. We use the following prompt: “We got a down alert for the memory-demo app. Help us with the root cause of the issue.”

The agent not only identified the root cause, but also went a step further and suggested a potential fix, which in this case is increasing the memory resources allocated to the application.
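Under the hood, the K8sGPT agent's Lambda can surface findings like this by reading the operator's Result custom resources. The following is a hedged Python sketch; the helper names are ours, and fetching requires the kubernetes client package plus kubeconfig or in-cluster credentials.

```python
def summarize_results(result_items):
    # Condense K8sGPT Result objects into one line per finding
    lines = []
    for item in result_items:
        name = item.get("metadata", {}).get("name", "unknown")
        spec = item.get("spec", {})
        details = spec.get("details", "")
        lines.append(f"{name}: {spec.get('kind', '?')} - {details[:120]}")
    return "\n".join(lines)


def fetch_results(namespace="k8sgpt-operator-system"):
    # Requires the `kubernetes` package and kubeconfig/in-cluster auth
    from kubernetes import client, config

    config.load_kube_config()
    api = client.CustomObjectsApi()
    resp = api.list_namespaced_custom_object(
        group="core.k8sgpt.ai", version="v1alpha1",
        namespace=namespace, plural="results",
    )
    return resp.get("items", [])
```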

K8sgpt agent finding

Collaborator agent coordinates with ArgoCD agent to provide a response

For this scenario, we continue from the previous prompt. The application appears not to have been allocated enough memory, and the allocation should be increased to fix the issue permanently. We can also see that the application is in an unhealthy state in the ArgoCD UI, as shown in the following screenshot.

ArgoUI

Let’s now proceed to increase the memory, as shown in the following screenshot.

Interacting with agent to increase memory

The agent interacted with the argocd_operations Amazon Bedrock agent and successfully increased the memory. The same can be seen in the ArgoCD UI.
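In GitOps terms, the memory increase the agent performed amounts to raising the container memory values in the Git-tracked Deployment manifest so that ArgoCD syncs the change to the cluster. A minimal sketch of that transformation (the helper name is ours, not part of the solution's code):

```python
import copy


def bump_memory(manifest, new_request, new_limit=None):
    # Return a copy of a Deployment manifest with container memory raised;
    # the original dict is left untouched so the diff can be reviewed
    patched = copy.deepcopy(manifest)
    for container in patched["spec"]["template"]["spec"]["containers"]:
        resources = container.setdefault("resources", {})
        resources.setdefault("requests", {})["memory"] = new_request
        if new_limit:
            resources.setdefault("limits", {})["memory"] = new_limit
    return patched
```

Committing the patched manifest back to the repository is what lets ArgoCD's automated sync policy (prune and selfHeal, as configured earlier) roll the change out.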

ArgoUI showing memory increase

Clean up

If you decide to stop using the solution, complete the following steps:

  1. To delete the associated resources deployed using AWS CloudFormation:
    1. On the AWS CloudFormation console, choose Stacks in the navigation pane.
    2. Locate the stack you created during deployment (you assigned it a name).
    3. Select the stack and choose Delete.
  2. If you created the EKS cluster specifically for this implementation, delete it.

Conclusion

By orchestrating multiple Amazon Bedrock agents, we demonstrated how to build an AI-powered Amazon EKS troubleshooting system that simplifies Kubernetes operations. The integration of K8sGPT analysis and ArgoCD deployment automation shows the powerful possibilities of combining specialized AI agents with existing DevOps tools. Although this solution represents an advancement in automated Kubernetes operations, it's important to remember that human oversight remains valuable, particularly for complex scenarios and strategic decisions.

As Amazon Bedrock and its agent capabilities continue to evolve, we can expect even more sophisticated orchestration possibilities. You can extend this solution to incorporate additional tools, metrics, and automated workflows to meet your organization's specific needs.

To learn more about Amazon Bedrock, refer to the following resources:
