Innovating at speed: BMW’s generative AI solution for cloud incident analysis

The challenge of root cause analysis

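Digital services are often implemented by chaining multiple software components together, and those components might be built and run by different teams. For example, consider the service of remotely opening and locking vehicle doors. There might be one development team building and running the iOS app, another team for the Android app, a team building and running the backend-for-frontend used by both the iOS and Android apps, and so on. Moreover, these teams might be geographically dispersed and run their workloads in different locations and regions; many are hosted on AWS, some elsewhere.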

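Now consider a (fictional) scenario in which vehicle owners complain that remotely locking doors with the app no longer works. Is the iOS app responsible for the outage, or the backend-for-frontend? Did a firewall rule change somewhere? Did an internal TLS certificate expire? Is the MQTT system experiencing delays? Was there an inadvertent breaking change in a recent API change, and when was it actually deployed? Or was the database password for the central subscription service rotated again?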

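In situations like this, it is difficult to determine the root cause of the problem. Doing so requires checking many systems and teams, many of which might be failing at the same time because they are interdependent. Developers need to reason about the system architecture, form hypotheses, and follow the chain of components until they find the culprit. They often have to backtrack, reassess their hypotheses, and continue the investigation in another chain of components.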

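Understanding the challenges in such complex systems highlights the need for a robust and efficient approach to root cause analysis (RCA). With this context in mind, let us explore how BMW and AWS collaborated to develop a solution using Amazon Bedrock Agents to streamline and enhance the RCA process.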

Solution overview

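At a high level, the solution uses an Amazon Bedrock agent to perform automated RCA. This agent has several custom-built tools at its disposal to do its job. These tools are implemented as AWS Lambda functions and use services such as Amazon CloudWatch and AWS CloudTrail to analyze system logs and metrics. The following diagram illustrates the solution architecture.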

When an incident occurs, an on-call engineer gives a description of the issue at hand to the Amazon Bedrock agent. The agent then starts investigating the root cause of the issue, using its tools to do tasks that the on-call engineer would otherwise do manually, such as searching through logs. Based on the clues it uncovers, the agent proposes several likely hypotheses to the on-call engineer. The engineer can then resolve the issue, or give pointers to the agent to direct the investigation further. In the following section, we take a closer look at the tools the agent uses.
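The hand-off from engineer to agent described above could look roughly like the following sketch. It assumes the agent is reached through the `invoke_agent` operation of the boto3 `bedrock-agent-runtime` client; the helper name `ask_agent` and the agent and alias IDs are placeholders that do not come from the article. The client is passed in as a parameter so the function can be exercised without AWS access.

```python
import uuid


def ask_agent(client, agent_id, agent_alias_id, incident_description, session_id=None):
    """Send an incident description to a Bedrock agent and return (answer, session_id)."""
    # Reusing the same session ID lets the engineer ask follow-up questions later.
    session_id = session_id or str(uuid.uuid4())
    response = client.invoke_agent(
        agentId=agent_id,
        agentAliasId=agent_alias_id,
        sessionId=session_id,
        inputText=incident_description,
    )
    # The answer arrives as an event stream; the text is carried in "chunk" events.
    parts = []
    for event in response["completion"]:
        chunk = event.get("chunk")
        if chunk and "bytes" in chunk:
            parts.append(chunk["bytes"].decode("utf-8"))
    return "".join(parts), session_id


# Possible usage (placeholder IDs, requires AWS credentials):
# import boto3
# client = boto3.client("bedrock-agent-runtime")
# answer, session = ask_agent(
#     client, "AGENT_ID", "ALIAS_ID",
#     "Users of the iOS app cannot unlock car doors remotely.",
# )
```

Returning the session ID alongside the answer is a design choice of this sketch: it makes it easy to send the engineer's pointers back into the same conversation.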

Amazon Bedrock agent tools

The Amazon Bedrock agent’s effectiveness in performing RCA lies in its ability to seamlessly integrate with custom tools. These tools, designed as Lambda functions, use AWS services like CloudWatch and CloudTrail to automate tasks that are typically manual and time-intensive for engineers. By organizing its capabilities into specialized tools, the Amazon Bedrock agent makes sure that RCA is both efficient and precise.
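To make the idea of a Lambda-backed tool concrete, here is a minimal sketch of a handler. It assumes the action group is defined with function details (rather than an OpenAPI schema) and follows the request and response shape documented for that mode; verify the field names against the current Amazon Bedrock documentation before relying on them. The tool logic in `lookup_logs` is a hypothetical placeholder, not BMW's implementation.

```python
import json


def lookup_logs(component: str) -> dict:
    """Placeholder tool logic; a real tool would query CloudWatch here."""
    return {"component": component, "note": "placeholder result"}


def lambda_handler(event, context):
    # Agent-supplied parameters arrive as a list of {"name", "type", "value"} items.
    params = {p["name"]: p["value"] for p in event.get("parameters", [])}
    result = lookup_logs(params.get("component", "unknown"))
    # Echo the action group and function names so the agent can match the reply.
    return {
        "messageVersion": "1.0",
        "response": {
            "actionGroup": event["actionGroup"],
            "function": event["function"],
            "functionResponse": {
                "responseBody": {"TEXT": {"body": json.dumps(result)}}
            },
        },
    }
```

Keeping the tool logic in its own function, separate from the handler, makes each tool easy to test locally without deploying it.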

Architecture Tool

The Architecture Tool uses C4 diagrams to provide a comprehensive view of the system’s architecture. These diagrams, enhanced through Structurizr, give the agent a hierarchical understanding of component relationships, dependencies, and workflows. This allows the agent to target the most relevant areas during its RCA process, effectively narrowing down potential causes of failure based on how different systems interact.

For instance, if an issue affects a specific service, the Architecture Tool can identify upstream or downstream dependencies and suggest hypotheses focused on those systems. This accelerates diagnostics by enabling the agent to reason contextually about the architecture instead of blindly searching through logs or metrics.
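The dependency lookup described here can be pictured as a graph traversal. The sketch below is an illustration only: the component names are borrowed loosely from the case study later in this post, the edges are assumptions, and it does not parse Structurizr or C4 models. Here "downstream" means the services a component calls, and "upstream" means the services that call it.

```python
from collections import deque

# Hypothetical dependency edges: caller -> callees.
DEPENDS_ON = {
    "ios-app": ["backend-for-frontend"],
    "android-app": ["backend-for-frontend"],
    "backend-for-frontend": ["remote-vehicle-management-api"],
    "remote-vehicle-management-api": ["mqtt-broker"],
    "mqtt-broker": [],
}


def downstream(component, graph=DEPENDS_ON):
    """Return every component reachable from `component`, in breadth-first order."""
    seen, order, queue = {component}, [], deque([component])
    while queue:
        for dep in graph.get(queue.popleft(), []):
            if dep not in seen:
                seen.add(dep)
                order.append(dep)
                queue.append(dep)
    return order


def upstream(component, graph=DEPENDS_ON):
    """Return every component that directly or transitively depends on `component`."""
    # Reverse the edges, then reuse the same traversal.
    reverse = {}
    for caller, callees in graph.items():
        for callee in callees:
            reverse.setdefault(callee, []).append(caller)
    return downstream(component, reverse)
```

With a structure like this, a report about the iOS app immediately yields the short list of services worth inspecting, instead of every log group in every account.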

Logs Tool

The Logs Tool uses CloudWatch Logs Insights to analyze log data in real time. By searching for patterns, errors, or anomalies, as well as comparing the trend to the previous period, it helps the agent pinpoint issues related to specific events, such as failed authentications or system crashes.

For example, in a scenario involving database access failures, the Logs Tool might identify a new spike in the number of error messages such as “FATAL: password authentication failed” compared to the previous hour. This insight allows the agent to quickly associate the failure with potential root causes, such as an improperly rotated database password.
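A comparison against the previous period, as in the example above, might be sketched as follows. The code assumes a boto3 `logs` client (its `start_query` and `get_query_results` operations) passed in by the caller; the function names, the one-hour window, and the polling interval are choices made for this illustration and are not taken from BMW's implementation.

```python
import time


def build_error_query(pattern: str) -> str:
    """Logs Insights query that counts log lines containing `pattern`."""
    if '"' in pattern:
        raise ValueError("pattern must not contain double quotes in this sketch")
    return f'filter @message like "{pattern}" | stats count(*) as error_count'


def count_matches(client, log_group, pattern, start, end, poll_seconds=1.0):
    """Run the query over [start, end] (epoch seconds) and return the match count."""
    query_id = client.start_query(
        logGroupName=log_group,
        startTime=start,
        endTime=end,
        queryString=build_error_query(pattern),
    )["queryId"]
    # Logs Insights queries are asynchronous, so poll until the query finishes.
    while True:
        result = client.get_query_results(queryId=query_id)
        if result["status"] == "Complete":
            break
        if result["status"] in ("Failed", "Cancelled", "Timeout"):
            raise RuntimeError(f"query ended with status {result['status']}")
        time.sleep(poll_seconds)
    # Each result row is a list of {"field", "value"} cells with string values.
    for row in result["results"]:
        for cell in row:
            if cell["field"] == "error_count":
                return int(cell["value"])
    return 0


def compare_to_previous_period(client, log_group, pattern, end, window=3600):
    """Count matches in the last `window` seconds and in the window before it."""
    current = count_matches(client, log_group, pattern, end - window, end)
    previous = count_matches(client, log_group, pattern, end - 2 * window, end - window)
    return {"current": current, "previous": previous, "increase": current - previous}
```

Returning both counts, rather than a single yes/no verdict, leaves it to the agent to judge whether the increase is meaningful in context.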

Metrics Tool

The Metrics Tool provides the agent with real-time insights into the system’s health by monitoring key metrics through CloudWatch. This tool identifies statistical anomalies in critical performance indicators such as latency, error rates, resource utilization, or unusual spikes in usage patterns, which can often signal potential issues or deviations from normal behavior.

For instance, in a Kubernetes memory overload scenario, the Metrics Tool might detect a sharp increase in memory consumption or unusual resource allocation prior to the failure. By surfacing CloudWatch metric alarms for such anomalies, the tool enables the agent to prioritize hypotheses related to resource mismanagement, misconfigured thresholds, or unexpected system load, guiding the investigation more effectively toward resolving the issue.
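The article does not say how the Metrics Tool decides that a value is anomalous, beyond surfacing CloudWatch metric alarms. As a stand-in, the following sketch flags data points that deviate strongly from a baseline using a simple z-score; the baseline size and threshold are arbitrary assumptions, and this is not CloudWatch's own anomaly detection.

```python
from statistics import mean, stdev


def find_anomalies(values, baseline_size=12, threshold=3.0):
    """Return indices of points more than `threshold` standard deviations from the baseline mean."""
    baseline = values[:baseline_size]
    if len(baseline) < 2:
        return []  # not enough data to estimate spread
    mu, sigma = mean(baseline), stdev(baseline)
    later = list(enumerate(values[baseline_size:], baseline_size))
    if sigma == 0:
        # A perfectly flat baseline: any change at all is a deviation.
        return [i for i, v in later if v != mu]
    return [i for i, v in later if abs(v - mu) / sigma > threshold]


# Example: memory utilization (percent) sampled at regular intervals.
memory = [50, 52, 51, 49, 50, 51, 50, 52, 49, 51, 50, 51, 52, 90, 95]
spikes = find_anomalies(memory)
```

In the example, only the last two samples stand out from the first twelve, which is the kind of signal that would point the agent toward a memory overload hypothesis.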

Infrastructure Tool

The Infrastructure Tool uses CloudTrail data to analyze critical control-plane events, such as configuration changes, security group updates, or API calls. This tool is particularly effective in identifying misconfigurations or breaking changes that might trigger cascading failures.

Consider a case where a security group ingress rule is inadvertently removed, causing connectivity issues between services. The Infrastructure Tool can detect and correlate this event with the reported incident, providing the agent with actionable insights to guide its RCA process.
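Correlating control-plane events with an incident can be reduced to a filter over event name and time. The sketch below assumes events shaped like the entries returned by CloudTrail's `lookup_events` operation in boto3 (with `EventName` and a datetime `EventTime`); the two-hour lookback and the function name are assumptions made for illustration.

```python
from datetime import timedelta

# EC2 API calls that alter security group rules.
SECURITY_GROUP_CHANGES = {
    "AuthorizeSecurityGroupIngress",
    "AuthorizeSecurityGroupEgress",
    "RevokeSecurityGroupIngress",
    "RevokeSecurityGroupEgress",
    "ModifySecurityGroupRules",
}


def suspicious_changes(events, incident_start, lookback=timedelta(hours=2)):
    """Return security group changes made within `lookback` before the incident, newest first."""
    window_start = incident_start - lookback
    hits = [
        e for e in events
        if e["EventName"] in SECURITY_GROUP_CHANGES
        and window_start <= e["EventTime"] <= incident_start
    ]
    # Changes closest to the incident are usually the most relevant, so list them first.
    return sorted(hits, key=lambda e: e["EventTime"], reverse=True)
```

The same pattern extends to other change types, such as IAM policy edits or route table updates, by widening the set of event names.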

By combining these tools, the Amazon Bedrock agent mimics the step-by-step reasoning of an experienced engineer while executing tasks at machine speed. The modular nature of the tools allows for flexibility and customization, making sure that RCA is tailored to the unique needs of BMW’s complex, multi-regional cloud infrastructure.

In the next section, we discuss how these tools work together within the agent’s workflow.

Amazon Bedrock agents: The ReAct framework in action

At the heart of BMW’s rapid RCA lies the ReAct (Reasoning and Acting) agent framework, an approach that dynamically combines logical reasoning with task execution. By integrating ReAct with Amazon Bedrock, BMW gains a flexible solution for diagnosing and resolving complex cloud-based incidents. Unlike traditional methods, which rely on predefined workflows, ReAct agents use real-time inputs and iterative decision-making to adapt to the specific circumstances of an incident.

The ReAct agent in BMW’s RCA solution uses a structured yet adaptive workflow to diagnose and resolve issues. First, it interprets the textual description of an incident (for example, “Vehicle doors cannot be locked via the app”) to identify which parts of the system are most likely impacted. Guided by the ReAct framework’s iterative reasoning, the agent then gathers evidence by calling specialized tools, using data centrally aggregated in a cross-account observability setup. By continuously reevaluating the results of each tool invocation, the agent zeros in on potential causes—whether an expired certificate, a revoked firewall rule, or a spike in traffic—until it isolates the root cause. The following diagram illustrates this workflow.

The ReAct framework offers the following benefits:

Dynamic and adaptive – The ReAct agent tailors its approach to the specific incident, rather than a one-size-fits-all methodology. This adaptability is especially critical in BMW’s multi-regional, multi-service architecture.
Efficient tool utilization – By reasoning about which tools to invoke and when, the ReAct agent minimizes redundant queries, providing faster diagnostics without overloading AWS services like CloudWatch or CloudTrail.
Human-like reasoning – The ReAct agent mimics the logical thought process of a seasoned engineer, iteratively exploring hypotheses until it identifies the root cause. This capability bridges the gap between automation and human expertise.

By employing Amazon Bedrock ReAct agents, BMW achieves significantly lower diagnosis times. These agents not only enhance operational efficiency but also free engineers to focus on strategic improvements rather than labor-intensive diagnostics.
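The reason-then-act cycle behind these benefits can be illustrated with a toy loop. Amazon Bedrock agents handle this orchestration internally, so the code below is only a conceptual sketch: `scripted_reasoner` is a hard-coded stand-in for the model, and the tool names and outputs are invented for the example.

```python
def react_loop(incident, reason, tools, max_steps=5):
    """Alternate reasoning and tool calls until `reason` returns a final answer."""
    observations = []
    for _ in range(max_steps):
        step = reason(incident, observations)  # stands in for the model call
        if step["type"] == "final":
            return step["answer"], observations
        # Act: run the chosen tool and record what it returned.
        tool_output = tools[step["tool"]](**step.get("args", {}))
        observations.append((step["tool"], tool_output))
    return "no conclusion within step budget", observations


def scripted_reasoner(incident, observations):
    """Scripted stand-in for the model, for illustration only."""
    if not observations:
        return {"type": "action", "tool": "architecture", "args": {}}
    if len(observations) == 1:
        return {"type": "action", "tool": "infrastructure", "args": {}}
    return {"type": "final", "answer": "security group change is the likely cause"}


# Hypothetical tools returning canned observations.
TOOLS = {
    "architecture": lambda: "ios-app -> backend-for-frontend -> remote-vehicle-management-api",
    "infrastructure": lambda: "RevokeSecurityGroupIngress seen before the incident",
}

answer, trail = react_loop("Vehicle doors cannot be locked via the app", scripted_reasoner, TOOLS)
```

The step budget matters in practice: it bounds cost and latency when the evidence is inconclusive, and the recorded observations give the engineer a trail to review.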

Case study: Root cause analysis “Unlocking vehicles via the iOS app”

To illustrate the power of Amazon Bedrock agents in action, let us explore a possible real-world scenario involving the interplay between BMW’s connected fleet and the digital services running in the cloud backend.

We deliberately change the security group for the central networking account in a test environment. This has the effect that requests from the fleet are (correctly) blocked by the changed security group and do not reach the services hosted in the backend. Hence, a test user cannot lock or unlock her vehicle door remotely.

Incident details

BMW engineers received a report from a tester indicating the remote lock/unlock functionality on the mobile app does not work.

This report raised immediate questions: was the issue in the app itself, the backend-for-frontend service, or deeper within the system, such as in the MQTT connectivity or authentication mechanisms?

How the ReAct agent addresses the problem

The problem is described to the Amazon Bedrock ReAct agent: “Users of the iOS app cannot unlock car doors remotely.” The agent immediately begins its analysis:

The agent begins by understanding the overall system architecture, calling the Architecture Tool. The output of the Architecture Tool reveals that the iOS app, like the Android app, is connected to a backend-for-frontend API, and that the backend-for-frontend API itself is connected to several other internal APIs, such as the Remote Vehicle Management API. The Remote Vehicle Management API is responsible for sending commands to cars by using MQTT messaging.
The agent uses the other tools at its disposal in a targeted way: it scans the logs, metrics, and control plane activities of only those components that are involved in remotely unlocking car doors: iOS app remote logs, backend-for-frontend API logs, and so on. The agent finds several clues:
1. Anomalous logs that indicate connectivity issues (network timeouts).
2. A sharp decrease in the number of successful invocations of the Remote Vehicle Management API.
3. Control plane activities: several security groups in the central networking account hosted on the testing environment were changed.
Based on those findings, the agent infers and defines several hypotheses and presents these to the user, ordered by their likelihood. In this case, the first hypothesis is the actual root cause: a security group was inadvertently changed in the central networking account, which meant that network traffic between the backend-for-frontend and the Remote Vehicle Management API was now blocked. The agent correctly correlated logs (“fetch timeout error”), metrics (decrease in invocations) and control plane changes (security group ingress rule removed) to come to this conclusion.
If the on-call engineer wants further information, they can now ask follow-up questions to the agent, or instruct the agent to investigate elsewhere as well.

The entire process, from incident detection to resolution, took minutes, compared to the hours it could have taken with traditional RCA methods. The ReAct agent’s ability to dynamically reason, access cross-account observability data, and iterate on its hypotheses reduced the need for tedious manual investigations.

Conclusion

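By using Amazon Bedrock ReAct agents, BMW has shown how to improve its approach to root cause analysis, turning a complex manual process into an efficient, automated workflow. The tools integrated into the ReAct framework significantly narrow the potential reasoning space and enable dynamic hypothesis generation and targeted diagnostics, mimicking the reasoning process of a seasoned engineer at machine speed. This innovation has reduced the time required to identify and resolve service disruptions, further enhancing the reliability of BMW’s connected services and improving the experience for millions of customers worldwide.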

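The solution has demonstrated measurable success: the agent identified the root cause in 85% of test cases and provided detailed insights in the remainder, greatly speeding up engineers’ investigations. By lowering the barrier to entry for junior engineers, it enables less experienced team members to diagnose issues effectively, maintaining the reliability and scalability of BMW’s operations.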

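Incorporating generative AI into the RCA process showcases the transformative potential of AI in modern cloud-based operations. The ability to adapt dynamically, reason contextually, and handle complex, multi-regional infrastructure makes Amazon Bedrock Agents a game changer for organizations that aim to maintain high availability of their digital services.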

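As BMW continues to expand its connected fleet and digital offerings, adopting generative AI-driven solutions such as Amazon Bedrock will play an important role in maintaining operational excellence and delivering seamless experiences to customers. Following BMW’s example, your organization can also benefit from Amazon Bedrock Agents for root cause analysis to improve service reliability.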

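To get started, explore Amazon Bedrock Agents to optimize your incident diagnostics, or use CloudWatch Logs Insights to identify anomalies in your system logs. If you want a hands-on introduction to creating your own Amazon Bedrock agents, complete with code examples and best practices, check out the following GitHub repo. These tools are setting a new industry standard for efficient RCA and operational excellence.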
