Exploration and Practice of RCA Agent in Complex Business Scenarios
As AI coding tools mature, code generation is widely regarded as nearly conquered territory. The broader challenges of software engineering, however, remain far from resolved. This article is compiled from a talk by Guo Yongliang, a senior server-side architect at Kuaishou, at the QCon Global Software Development Conference 2026 in Beijing, titled "Exploration and Practice of RCA Agent in Complex Business Scenarios".
In the talk, Guo Yongliang described in detail a business troubleshooting system built on large models. He dissected four core challenges encountered in practice: how to make AI understand the business, how to combat alarm noise, how to measure uncertainty, and how to suppress model hallucinations. He also presented the Agent architecture, the evaluation system, and the ideas for continuous evolution built around these challenges.
The following is the transcript of the speech (edited by InfoQ without changing the original meaning).
Background and Pain Points: Why Do We Need RCA Agent?
Boris Cherny, who leads Claude Code, once argued on a podcast that coding work has largely been conquered by AI. That claim raises a deeper question: has software engineering really been solved?
Judging from two research reports, the answer is no. The 2025 DORA report, which measured efficiency changes after the adoption of AI Coding, shows a significant improvement in individual efficiency but only a limited improvement in organizational efficiency. An internal Microsoft survey points in a similar direction. It collected about six thousand samples of how workdays are spent. Setting aside meetings, communication, learning, and administrative work, development and troubleshooting still account for the largest share of R&D engineers' time. A natural inference is that if the dividends from AI Coding have stabilized, troubleshooting is the next productivity bottleneck to tackle.
Another development supports this judgment. OpenClaw shipped a major refactoring release in March this year. After the release went live, many users reported that plugins stopped working or functions failed. Notably, most of OpenClaw's code was generated by AI Coding. What does this mean? As human control over code declines in the AI era, AI-based troubleshooting may shift from an option to a necessity. When people can no longer fully understand their own systems, an AI-driven diagnostic system is needed as an equivalent safeguard.
The overall technical stack can be roughly divided into three layers. The infrastructure layer involves container, node, and network failures. The middleware layer covers anomalies in Cache, DB, and MQ. The business layer, which is our focus, involves failures tied directly to declines in core indicators, alarm storms, and cross-system propagation. The business layer has three prominent characteristics. First, it directly reflects user experience and revenue. Second, business code iterates extremely fast and is highly volatile. Third, the troubleshooting steps for a business problem cannot be predicted in advance. For example, when video duration declines, the root cause could be a slow Redis query, GC in the service itself, or a bug introduced by a downstream service. This uncertainty in the troubleshooting path is the biggest difficulty in business-layer troubleshooting.
Challenges in Business Scenario Implementation
In actual implementation, we face four core challenges, each building on the previous one. The first is how to make AI understand the business. In a typical four-quadrant diagram, the factors that cause business indicators to fluctuate can be internal or external, active or passive, with signal and noise heavily mixed. A natural traffic change caused by the start of term at primary and secondary schools is mixed in with an abnormal decline caused by a code defect, forming a huge state space. The second challenge is combating noise: in a system where alarm noise may exceed 75%, how do we keep the Agent from exhausting its compute on invalid signals? The third is how to measure the uncertainty of AI troubleshooting itself, that is, how to establish a repeatable and quantifiable evaluation system. The fourth is to directly counter the hallucinations of large models in numerical calculation and trend recognition.
Challenge 1: How to Make AI Understand Business
Consider an example. The main site suddenly saw a rise in user Feed-stream requests that exceeded the alarm threshold. Entry service A is the core service that directly serves the Feed stream, and the availability of all its downstream calls looked normal. However, service A's downstream dependencies are extremely extensive, spanning hundreds of services and multiple departments. The on-duty engineer faces two levels of questions. First, is this indicator anomaly really a problem? Is it caused by an internal fault or purely by an external trending event? Second, even if it is treated as a problem, it is clearly unrealistic to pull in colleagues from every downstream team to troubleshoot one by one.
In fact, the root cause was a decline in recommendation quality: the lower quality of the feed led users to keep swiping through videos, which drove the abnormal increase in requests. The fault propagation chain was complex. Entry service A calls downstream service B. Service B showed no anomalies because it has internal fallback and degradation logic. However, service E, which B depends on and which belongs to a downstream department, had a Core Dump. E crashed because a field was missing from the interface of another service, F, that it calls, and this was ultimately traced to a configuration change in service F that triggered a logic path never exercised before.
Several aspects of this case are counterintuitive. A decline in recommendation quality would normally reduce the number of user requests, but here it unexpectedly increased them. The anomaly propagation chain was broken at two nodes: A calling B, and E calling F. Everything looked normal at the indicator level, so no connection could be established from Metrics. Cross-department collaboration made things harder still, because the main-site engineers were unaware of change events inside the downstream departments. The problem ultimately consumed a great deal of manpower, and the troubleshooting group at one point had more than a hundred people.
With the traditional three pillars of monitoring (Trace, Metrics, and Log), this case has at least two obvious breakpoints. The first is at A calling B. The requests succeed, so Metrics cannot establish a connection. We have to rely on business experience, and the main-site engineers have to confirm manually with the colleagues responsible for B. The second breakpoint is more hidden: the fault at E calling F is caused by a missing interface field, and the requests also succeed. Because this logic path had never been exercised before, it is very likely that no Log was recorded at all. Finding this breakpoint also depended on manual communication among the engineers involved.
The conclusion is very clear: If we want the Agent to handle this, it must understand the business beyond technical indicators; otherwise, it will never be able to cross these two breakpoints.
How do we achieve this? In addition to the conventional Trace, Metrics, Log, and change events, we brought in the Git repositories of the business code, because code is the only true documentation and every system is built on code. The initial approach was straightforward: we introduced a Coding Agent to analyze the code in real time. At first we used the Claude Agent SDK, and analyzing one repository took about thirty minutes, which is clearly unacceptable in a troubleshooting scenario. After switching to the PI Coding Agent, the analysis time for a single-repository task dropped to about five minutes. Even so, there is still an efficiency bottleneck in practice. A complete business troubleshooting task usually involves multiple services along a call chain, and in a Java system there are also many underlying SDK dependencies to sort out. Typically three to five repositories need to be analyzed for one task, which at about five minutes each adds up to fifteen to twenty-five minutes. That is still too long for fault response.
The root of the problem is that although code is the only true documentation, it sits at a very low level of abstraction, and low abstraction inevitably means low efficiency. When people troubleshoot, even a service's maintainers never remember every line of code; they hold an abstracted version of the business code in their heads. If we want AI to understand the business, we must reduce its cognitive cost in the same way.
Our approach is to build an abstraction layer over the code, which we call "business assets". For example, we annotate error codes with their business semantics, describe the meaning of Metrics in business terms, and establish the topological relationships between indicators. Taking the Feed-stream scenario as an example, a drop in the availability of the downstream recommendation service may change the fallback rate of the upstream service, which ultimately changes the Feed delivery volume. Some switch configurations also directly affect business logic, and we build a map of their impact as well. These assets are built in two modes. Some are accumulated offline: the Coding Agent generates descriptions of the relationships in the core code offline and stores them in the knowledge base as Markdown documents. Others are generated on demand during troubleshooting: after the Agent analyzes a task in real time, the result is captured as a Skill and added to the knowledge base. Together, these two modes keep the business assets growing.
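To make the idea of an indicator topology more concrete, here is a minimal sketch of how such assets might be represented and queried. Everything in it is an assumption for illustration: the class name `BusinessAssets`, its fields, the metric names, and the error code `E1001` are hypothetical and do not come from the talk.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class BusinessAssets:
    """Hypothetical container for a 'business assets' abstraction layer."""
    # error code -> business meaning (illustrative only)
    error_semantics: dict[str, str] = field(default_factory=dict)
    # cause metric -> metrics it may influence
    influences: dict[str, list[str]] = field(default_factory=dict)

    def add_influence(self, cause: str, effect: str) -> None:
        self.influences.setdefault(cause, []).append(effect)

    def candidate_causes(self, metric: str) -> list[str]:
        """Walk the topology backwards from a metric, nearest causes first."""
        reverse: dict[str, list[str]] = {}
        for cause, effects in self.influences.items():
            for effect in effects:
                reverse.setdefault(effect, []).append(cause)

        seen = {metric}
        queue = deque([metric])
        ordered: list[str] = []
        while queue:
            current = queue.popleft()
            for cause in reverse.get(current, []):
                if cause not in seen:
                    seen.add(cause)
                    ordered.append(cause)
                    queue.append(cause)
        return ordered


assets = BusinessAssets()
# Hypothetical error code and description.
assets.error_semantics["E1001"] = "recommendation result empty, fallback feed served"
# The Feed-stream chain described in the text, with assumed metric names.
assets.add_influence("recommendation_availability", "feed_fallback_rate")
assets.add_influence("feed_fallback_rate", "feed_delivery_volume")

print(assets.candidate_causes("feed_delivery_volume"))
```

With a structure like this, an Agent could start from the alarming indicator and look up upstream candidates directly instead of re-deriving the relationships from source code each time.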
In summary, the essence of "making AI understand the business" is eliminating the context gap between humans and AI. Traditional monitoring data is relatively easy for AI to obtain, but a great deal of other information is running through engineers' minds at the same time: code logic, indicator relationships, external events, and business common sense, such as the fact that a rise in gift-giving requests may be caused by a livestreamer going live. As with code logic, if we want AI to troubleshoot, we must provide all of this information to it.
Challenge 2: How to Combat Noise
In actual implementation, alarm noise is an extremely draining problem. Statistically, most alarms in the system are useless: the proportion of alarm noise may exceed 75%, and less than a quarter of alarms really need attention.
The harm caused by alarm noise is real. We once ran a retrospective on an internal fault rated P2 or above. About ten minutes after the fault occurred, an indicator fluctuated and raised an alarm, but the on-duty engineer silenced it immediately. Within the next fifteen minutes the indicator deteriorated rapidly to a serious level, yet no one noticed. The reason was that this alarm had fired more than fifteen times in the previous seven days, so the on-duty engineers had developed alarm fatigue and silenced it without even looking.
However, letting AI process every alarm creates new problems. In our internal experiments, one complete round of reasoning by the Agent in a ReAct loop consumes roughly six hundred thousand to over one million Tokens. The Kuaishou main site generates about twenty to thirty thousand alarm events per month. If AI processed all of them, monthly Token consumption would be on the order of ten billion, and the annual cost would reach millions of RMB. Beyond cost, the number of iterations in a ReAct loop is uncontrollable, so latency cannot be guaranteed.
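As a rough sanity check on these figures, the ranges quoted above can simply be multiplied out. This is back-of-the-envelope arithmetic only: the constant and function names are made up for illustration, and no per-Token price is assumed, so the sketch stops at Token volume.

```python
# Ranges quoted in the text: Tokens per full ReAct run, alarm events per month.
TOKENS_PER_RUN = (600_000, 1_000_000)
ALARMS_PER_MONTH = (20_000, 30_000)


def monthly_token_range(tokens_per_run, alarms_per_month):
    """Return (low, high) monthly Token consumption if every alarm is processed."""
    low = tokens_per_run[0] * alarms_per_month[0]
    high = tokens_per_run[1] * alarms_per_month[1]
    return low, high


low, high = monthly_token_range(TOKENS_PER_RUN, ALARMS_PER_MONTH)
print(f"{low / 1e9:.0f} to {high / 1e9:.0f} billion Tokens per month")
```

Taken at face value, the products land in the low tens of billions of Tokens per month, the same order of magnitude as the monthly figure quoted above, which is why filtering alarms before any full reasoning run matters so much.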
Our solution has two layers. The first layer is a very lightweight alarm confidence assessment Agent or Workflow. Its task is to extract a "portrait" of the alarm, including its periodic pattern, how far the metric deviates from the threshold each time it fires, the recovery time, the distribution across services, and the clustering of the curve, and to evaluate these as statistical features. An example illustrates the value of deviation analysis. Suppose an availability alarm dips below four nines to 98% every day. If one day it suddenly drops to 60%, that clearly needs attention; if it only reaches 98% as usual, it may not. The same applies to periodicity: if an alarm fires in the early hours every morning, its confidence is relatively low; if one day it suddenly fires in the afternoon, that is likely a clear anomaly signal.
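The deviation and periodicity examples above can be sketched as a small scoring function. This is a simplified illustration, not the system described in the talk: the `AlarmFiring` fields, the three-spread scaling, and combining the two signals with `max` are all assumptions, and it assumes a metric where lower values are worse (such as availability).

```python
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass
class AlarmFiring:
    hour: int      # hour of day (0-23) at which the alarm fired
    worst: float   # worst metric value during the firing, e.g. availability in percent


def alarm_confidence(history: list[AlarmFiring], current: AlarmFiring) -> float:
    """Score in [0, 1]; higher means the firing looks less like routine noise."""
    if not history:
        return 1.0  # no portrait yet, so treat the alarm as novel

    # Periodicity: how unusual is this hour compared with past firings?
    same_hour = sum(1 for f in history if f.hour == current.hour)
    time_novelty = 1.0 - same_hour / len(history)

    # Deviation: how much worse than the usual trough is this firing?
    troughs = [f.worst for f in history]
    usual = mean(troughs)
    spread = pstdev(troughs) or 1.0  # avoid dividing by zero on a flat history
    z = (usual - current.worst) / spread
    depth_novelty = min(max(z / 3.0, 0.0), 1.0)  # assumed: 3 spreads = fully novel

    return max(time_novelty, depth_novelty)


# Seven routine firings: every day around 05:00, dipping to 98%.
history = [AlarmFiring(hour=5, worst=98.0) for _ in range(7)]

routine = alarm_confidence(history, AlarmFiring(hour=5, worst=98.0))
deep_drop = alarm_confidence(history, AlarmFiring(hour=5, worst=60.0))
odd_time = alarm_confidence(history, AlarmFiring(hour=15, worst=98.0))
print(routine, deep_drop, odd_time)
```

In this toy setup the routine early-morning dip scores lowest, while the drop to 60% and the afternoon firing both score at the top of the range. Because the check is purely statistical, it can run on every alarm without invoking a model.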
After confidence assessment screens out some of the noise, the next step is troubleshooting reasoning on the remaining alarms. Even after this initial filtering, however, a great deal of noise remains in the reasoning stage. The system is full of fluctuations in technical indicators. For example, a service may happen to have a GC that is not severe enough to move core business indicators. Such fluctuations can all lead the Agent to misjudge. Another typical source of noise is change events: during peak release periods, more than five hundred changes may be associated with a core service's call chain within an hour. Most of these changes will not cause failures, but when a failure does occur, the Agent inevitably pulls in all of them and has to judge which, if any, are related to the alarm. Every one of them is a potential source of misjudgment.
Our response is to introduce an evidence pyramid, similar to the one used in evidence-based medicine, and to establish an evidence grading system. The idea comes from how hospitals work. Doctors see a large number of patients every day, and a large proportion of them may come in purely out of anxiety without any real illness, so doctors first need to filter out the noise. For patients who really are ill, they then need to investigate the cause further. Medicine has rigorous methods and mature best practices for this, and we can learn from them directly.
In this pyramid, the bottom layer is the raw signal. The next layer is background context, such as external trending events, static service dependencies, and engineers' experience. Above that is single-point observation data, such as an anomaly in a single Metric or in a single service's indicator. When these single-point anomalies are linked through Trace or topology, they form multi-dimensional fused evidence, such as correlated changes along the call chain or a match with a historical fault pattern, which makes up a more solid layer. The top layer is direct causal inference: there is a clear directed-graph topological relationship between the indicators, or the cause has been confirmed at the source-code level, or the faulty service has a direct change within the time window. All of these count as direct causal inference.
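One way to picture the grading is as an ordered enumeration, with each root-cause hypothesis ranked by the strongest evidence supporting it. The sketch below is an assumption-laden illustration: the names `EvidenceLevel`, `Evidence`, and `rank_hypotheses`, the "strongest evidence wins" rule, and the sample hypotheses are all hypothetical rather than taken from the talk.

```python
from dataclasses import dataclass
from enum import IntEnum


class EvidenceLevel(IntEnum):
    """Pyramid layers from the text, bottom (1) to top (5)."""
    RAW_SIGNAL = 1
    BACKGROUND_CONTEXT = 2
    SINGLE_POINT_OBSERVATION = 3
    FUSED_EVIDENCE = 4
    DIRECT_CAUSAL_INFERENCE = 5


@dataclass
class Evidence:
    hypothesis: str        # candidate root cause this evidence supports
    level: EvidenceLevel   # where the evidence sits in the pyramid
    note: str              # free-text description of the observation


def rank_hypotheses(items: list[Evidence]) -> list[tuple[str, EvidenceLevel]]:
    """Rank hypotheses by their strongest supporting evidence (assumed rule)."""
    best: dict[str, EvidenceLevel] = {}
    for item in items:
        if item.hypothesis not in best or item.level > best[item.hypothesis]:
            best[item.hypothesis] = item.level
    return sorted(best.items(), key=lambda pair: pair[1], reverse=True)


collected = [
    Evidence("GC pause in a downstream service",
             EvidenceLevel.SINGLE_POINT_OBSERVATION,
             "GC spike on one service, no link to the business indicator"),
    Evidence("configuration change in service F",
             EvidenceLevel.FUSED_EVIDENCE,
             "change sits on the call chain of the alarming service"),
    Evidence("configuration change in service F",
             EvidenceLevel.DIRECT_CAUSAL_INFERENCE,
             "change falls inside the fault time window of the faulty service"),
]

ranking = rank_hypotheses(collected)
for hypothesis, level in ranking:
    print(f"{level.name:<26} {hypothesis}")
```

Under this rule, an isolated GC fluctuation cannot outrank a hypothesis backed by a change inside the fault window, which is the kind of noise suppression the pyramid is meant to provide.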
Challenge 3: How to Measure Uncertainty
There is currently a consensus about production-grade AI systems: it is easy to get a Good Case running and show a few Demos, but in a real production environment it is extremely difficult to eliminate Bad Cases. There are large numbers of Corner Cases and Silent Errors. The CTO of an AI startup once wrote in an article that for a Demo you only need to find the one correct path, whereas in production ninety percent of situations are bad. Why does this happen? Because what was deterministic in traditional programs becomes uncertain with AI: when the same problem is submitted for reasoning multiple times, different reasoning paths may form, and the conclusions may differ as well. Moreover, in a very large business system there are many influencing factors, and a change in any one variable can cause a large deviation in the result, much like the butterfly effect.
We have a very concrete case. When we first developed the Agent, we wanted it to recall the "single-point jitter" problem, because a large share of RPC availability alarms may be caused by jitter on a single downstream Pod that moves the overall indicator. The approach was relatively simple: we took a traditional anomaly analysis algorithm, added a drill-down dimension, and gave this tool to the Agent. After adding the tool, single-point problems were successfully recalled, but the overall accuracy across cases deteriorated. The reason is that single-point problems are extremely frequent and are very likely to be present somewhere in a cluster of thousands of Pods. Whenever core business indicators fluctuate, there is often a single-point problem at the same time. The Agent found a single-point problem when troubleshooting a decline in access volume and found one again when troubleshooting a decline in search volume, and so wrongly established a causal relationship. The root