When three leading internet companies meet head-on, has the boundary of observability in the AI era come into view?
The reasoning ability of large language models (LLMs) and the data-understanding ability of generative AI have opened up new ideas for the evolution of observability technology. In turn, observability technology is also feeding back into the field of AI. So how do AI and observability empower each other? And what is the path for AIOps to move from experimentation to production, from slogans to implementation?
Recently, on the eve of QCon Global Software Development Conference 2025 Shanghai, InfoQ's live show "Geek Gathering", produced in collaboration with AICon, invited Zhang Cheng, head of observability technology architecture and senior technical expert at Alibaba Cloud, to host. Together with Dr. Li Ye, algorithm expert at Alibaba Cloud, Dr. Dong Shandong, head of algorithms for the Dev-Infra observability platform at ByteDance, and Wang Yapu, head of the observability team at Xiaohongshu, they explored the new boundaries of observability in the AI era.
Highlights from the discussion include:
- Traditional observability mainly focuses on "seeing", while the new generation of operation and maintenance is expected to close the full loop of "discovery - analysis - resolution - review". In this process, the observability system is evolving from a mere "eye" into a role with both a "brain" and "hands".
- Only when we have evaluation criteria that are close to real-world scenarios, verify the model's performance on a large number of real cases, and confirm that it can honestly say "I don't know" when it should, rather than fabricating or hallucinating, can we truly establish a trust mechanism for AI.
- The law of "garbage in, garbage out" not only remains valid in the AI era but is significantly amplified due to the high dependence of LLMs on data scale and quality.
- Achieving "semi-autonomous" operation and maintenance within three to five years is feasible, and closed-loop automation can even be achieved in some scenarios. However, there is still a long way to go to full autonomy and the so-called "coffee-style operation and maintenance".
The following content is based on the live-stream transcript and has been abridged by InfoQ.
The full replay can be viewed at: https://www.infoq.cn/video/YOTeVHta0A3Xqq2l4Bbp
Zhang Cheng: In your opinion, what fundamental and unprecedented changes is AI bringing to the matter of observability itself?
Li Ye: Firstly, "AI for observability". In the past, we had to write SQL by hand to extract and analyze data. Now, as long as we give the large model clear context and the data format, it can generate SQL, configure dashboards, and set up scheduled tasks automatically and very well. Our internal evaluation shows that with sufficient context its accuracy on such tasks reaches 80%-90%, even surpassing engineers who are not familiar with SQL. This completely changes the way data is extracted.
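As a rough illustration of this "natural language to SQL" pattern, here is a minimal sketch. The `call_llm` function, the table schema, and the example question are all hypothetical placeholders, not details from the discussion; the point is that the schema and data format are packed into the context before asking for a query.

```python
# Minimal sketch of NL-to-SQL for observability data (illustrative only).
# `call_llm` stands in for whatever model endpoint is actually used.

def build_nl2sql_prompt(question: str, table_schema: str) -> str:
    """Give the model the schema and data format so it has enough context."""
    return (
        "You are an assistant that writes SQL for an observability store.\n"
        f"Table schema:\n{table_schema}\n\n"
        f"Question: {question}\n"
        "Return only the SQL statement."
    )

def nl_to_sql(question: str, table_schema: str, call_llm) -> str:
    return call_llm(build_nl2sql_prompt(question, table_schema)).strip()

# Example usage (schema and question are made up for illustration):
schema = "app_logs(ts TIMESTAMP, service STRING, level STRING, latency_ms DOUBLE)"
# sql = nl_to_sql("error rate of the checkout service in the last hour", schema, call_llm)
```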
AI can also help with more complex exploratory and correlational analyses. For example, when a screenshot of a complex system scenario is given to the large model, its analysis is sometimes better than that of a novice engineer. Although it cannot yet replace experts in root-cause analysis, it can significantly improve every engineer's efficiency. The focus is shifting from "for humans to view" to "for AI to view": in the future the key will not just be beautiful visualization, but how to organize data in a structured way so that large models can understand and use it efficiently.
Secondly, "observability for AI". The emergence of AI systems has brought new observability requirements. Each call of a large model incurs costs, so all the generated trace data will be retained, which significantly increases the storage demand. At the same time, the analysis and diagnosis of AI systems are also more complex. When a problem occurs in the execution of a large model in a workflow or an agent, we need to be able to diagnose the cause and evaluate its performance. For example, did it retrieve the correct document in the RAG stage? At which stage did the hallucination occur? These all pose higher requirements for the new - generation observability system. Another example is achieving efficient observability and fault self - healing in large - scale GPU clusters, which has also become a new challenge.
Dong Shandong: LLMs provide a general "brain base" for the observability field, significantly changing the implementation method of traditional AIOps.
In the past, implementing AIOps algorithms meant starting from scratch: defining scenario goals, collecting and cleaning data, then modeling, training, and tuning. LLMs give us a natural baseline capability of "sixty or seventy points", letting us build usable demonstration prototypes in specific observability scenarios faster and better. As many experts have put it, LLMs are like equipping every industry with a college graduate of general ability; the deep, domain-specific optimization still has to be done by the industry itself.
LLMs excel at multi-modal understanding and fusion, and their improvement and feedback loops are also more efficient. One of the keys is the use of multi-modal context: our work has shifted to providing LLMs with more comprehensive, higher-quality context, while the hardest part, fusing and understanding multi-source information, is borne by the LLM. Take anomaly detection as an example: traditional methods are mostly limited to single indicators, whereas LLMs can integrate indicators, logs, traces, and other data types to make a more comprehensive judgment. Better context reliably brings better detection results.
In addition, traditional methods usually require retraining the model to incorporate manual feedback, whereas LLMs, with their strong text-understanding ability, can quickly apply manual feedback to the next detection task.
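A minimal sketch of these two points, assembling multi-modal context plus prior operator feedback into a single anomaly-judgment prompt, might look like the following. All field names are illustrative assumptions, and the resulting prompt would be passed to whatever model is in use.

```python
# Sketch: "better context -> better detection". Pack metric, log, and trace
# summaries plus prior human feedback into one prompt (no retraining needed).

def build_anomaly_context(metrics_summary: str,
                          log_excerpt: str,
                          trace_summary: str,
                          human_feedback: list[str]) -> str:
    feedback_block = "\n".join(f"- {f}" for f in human_feedback) or "- none"
    return (
        "Judge whether the service is anomalous and explain why.\n\n"
        f"Metrics (last 30 min):\n{metrics_summary}\n\n"
        f"Recent error logs:\n{log_excerpt}\n\n"
        f"Trace summary:\n{trace_summary}\n\n"
        "Operator feedback on previous detections (apply it directly):\n"
        f"{feedback_block}\n"
    )
```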
Compared with traditional AIOps, which often optimizes single-point scenarios, LLMs make it possible to optimize across the full life cycle of an alarm: problem discovery, analysis, handling, review, prevention, and even system self-healing. On top of the existing observability data platform and various small models, we can connect the whole process through an Agent architecture: the LLM and domain knowledge together form the decision-making "brain", while observability data and small models serve as the "hands", enabling the Agent to work through alarms one by one in collaboration with humans. In the future it may even take on SRE responsibilities like a digital lifeform.
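To make the "brain plus hands" split concrete, here is a stripped-down agent loop under stated assumptions: the tool names, their stubbed implementations, and the `decide_next_step` LLM call are all hypothetical, standing in for real queries and small models.

```python
# Sketch: the LLM (brain) picks the next tool; existing queries and small
# models (hands) execute it; observations are fed back until done.

TOOLS = {
    "query_metrics": lambda alarm: f"metrics for {alarm['service']}",
    "query_logs":    lambda alarm: f"logs for {alarm['service']}",
    "run_detector":  lambda alarm: "small-model anomaly score: 0.93",
}

def handle_alarm(alarm: dict, decide_next_step, max_steps: int = 5) -> list[str]:
    """Iterate: ask the LLM what to do next, run the tool, record the result."""
    history = []
    for _ in range(max_steps):
        action = decide_next_step(alarm, history)  # returns a tool name or "done"
        if action == "done" or action not in TOOLS:
            break
        history.append(f"{action}: {TOOLS[action](alarm)}")
    return history
```

In practice the loop would end with the human in the loop reviewing the collected evidence before any remediation is executed.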
Wang Yapu: When problems occur during AI training, the whole system often gets "stuck", which places much higher demands on system stability and makes the system more complex. In the past, observability mainly relied on rule- and threshold-based alarms for known problems; with AI, the system gains a degree of semantic understanding and reasoning, and can produce interpretable, verifiable analyses of unknown problems. Troubleshooting a performance degradation used to take us several hours of manual work; with AI we can automatically analyze the relationships among indicators, traces, and changes, shift from passive response to proactive observability, and even go further toward reasoning and insight.
In the past, operations or R&D engineers had to master complex query languages and understand the monitoring platform's many concepts. Now AI makes observability conversational: an engineer only needs to type a natural-language request such as "Help me check the log success rate", and the large model completes the analysis. Historically, observability platforms were support systems and struggled to meet the customized needs of different business lines. With AI, self-service and personalized orchestration become possible: the platform can focus on underlying capabilities and abstract outputs, and business teams can freely combine tools to get a "customized" operations experience.
The third aspect is a closed loop of intelligent decision-making. Traditional observability mainly focuses on "seeing", while the new generation of operation and maintenance is expected to close the full loop of "discovery - analysis - resolution - review". In this process, the observability system is evolving from a mere "eye" into a role with both a "brain" and "hands".
Zhang Cheng: How should we measure the "intelligence" of an AI Agent? Which matters more: its score on laboratory evaluation sets, or its hands-on ability to solve real problems in complex production environments?
Dong Shandong: When measuring the intelligence of an AI Agent, we should consider two aspects: general ability and specific ability.
For general abilities, current LLM benchmarks such as MMLU and MATH, and Agent evaluations such as AgentBench and SWE-bench, are all useful references; they measure various aspects of an LLM's general understanding, reasoning, and planning.
For specific abilities, we have to look at the hands-on ability to solve real problems. This is especially true in the observability field. Although the AIOps community has built some demos and corresponding data sets around observation and troubleshooting, every company has many relatively more complex, non-standardized problems, which place higher demands on an AI Agent's hands-on ability.
Taking the RCA scenario in the observability field as an example, I would like to provide a rough classification of AI Agents for reference:
L1+: Single-point enhancement. For a specific problem the analysis process stays the same as before, but the AI Agent assists by enhancing the analysis of certain steps.
L2: Autonomous problem-solving. RCA is fully Agent-driven. When a problem occurs with a customized indicator, the AI Agent can plan and execute according to the preset SOP and the actual situation until the problem is resolved.
L3: Learning. Given a monitoring goal and task set by humans, it can independently read the team's documents and materials and extract and learn knowledge from them. When a user asks about a general troubleshooting process, it can assess whether it can execute that process correctly; if some tools are missing, it can generate and add them according to a given protocol and format, then carry out the troubleshooting correctly and output the results.
Li Ye: Hands-on ability matters more, and laboratory evaluations should be as close to real scenarios as possible. There is currently a degree of "gaming the leaderboard" in some large-model rankings. Take SWE-bench Verified, which contains only about 500 problems: if an algorithm engineer fixes one failing case every day for a year, they can practically "memorize" the entire data set and achieve a high score through artificial overfitting. As a result, laboratory scores often fail to reflect a model's real-world level.
Similar problems exist in other fields. In the microservice scenario, for example, laboratory benchmarks usually involve only a dozen or so services, whereas a real production system may have hundreds of services, each with a large number of operations, at a completely different level of complexity. The fault types injected via chaos engineering in the laboratory are relatively limited, while real faults are diverse. If we verify only with known problems, the algorithm may perform no better than a rule-based system, and the evaluation says nothing about the generalization ability of large models in unknown scenarios.
Evaluating hands-on ability also requires a sensible division of task difficulty. We can't "ask a first-grader to answer college entrance exam questions". Likewise, if we ask today's large models to handle complex L3-level tasks directly, they may all fail, but that doesn't mean AI is useless; it only means it is not yet suited to such high-level scenarios. Conversely, in tasks with high certainty, such as converting natural language into SQL or PromQL, large models have proven reliable. Evaluation this close to real work is what genuinely builds our confidence in putting AI into production.
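As a toy sketch of "evaluation close to real scenarios", one could score an NL-to-PromQL generator against cases mined from production tickets rather than a public leaderboard. The case list, the `generate_query` function, and the exact-match scoring below are all hypothetical simplifications; a real harness would likely execute the generated query and compare results.

```python
# Toy evaluation harness: compare generated PromQL against expected answers
# collected from real on-call requests (contents here are illustrative).

CASES = [
    {"question": "p99 latency of the checkout service over 5m",
     "expected": 'histogram_quantile(0.99, sum(rate('
                 'http_request_duration_seconds_bucket{service="checkout"}[5m])) by (le))'},
    # ... more cases mined from real tickets
]

def evaluate(generate_query, normalize=lambda q: " ".join(q.split())) -> float:
    """Exact string match after whitespace normalization; a simplification."""
    hits = sum(
        normalize(generate_query(c["question"])) == normalize(c["expected"])
        for c in CASES
    )
    return hits / len(CASES)
```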
Zhang Cheng: Does the emergence of large models mean that the very sophisticated traditional algorithms we relied on in the past have reached a ceiling? What "qualitative" differences does it bring when processing observability data?
Wang Yapu: Traditional algorithms have not hit a ceiling, and their greatest advantage is certainty. In many scenarios they remain irreplaceable. Take time-series anomaly detection: production systems at many companies still use such algorithms at scale. They respond fast, consume few resources, are controllable, and are stable. For mature small-model algorithms, as long as the scenario is clear, accuracy can be very high and latency can be kept within milliseconds, which current large models can hardly match.
However, the emergence of large models has brought qualitative changes, mainly reflected in learning and efficiency-improvement abilities. Traditional algorithms are very efficient at processing single data sources but are overwhelmed by complex multi-modal and cross-domain problems. Large models can understand multiple types of information simultaneously, including indicator curves, log texts, user feedback, and code changes, and establish associations between them. This "comprehensive understanding" ability is difficult to achieve with traditional algorithms.
The second advantage is programmability and interpretability. Traditional algorithms often require data collection, manual annotation, parameter tuning, and training, which is a huge amount of work. Large models can automatically assemble a fault-diagnosis process through reasoning chains and tool calls: for example, deciding which business line to check first, then analyzing the changes within the last 24 hours, and finally deciding whether deeper analysis is needed, all in a logical order. This automated reasoning greatly shortens localization time and improves human efficiency.
The third advantage is generalization ability. Although traditional algorithms perform well in specific scenarios, once they are migrated to a new environment, they need to be retrained and tuned, with high costs and poor stability. Large models have good transferability and adaptability and can quickly respond to new application scenarios. This generalization ability is another qualitative change brought by large models.
Zhang Cheng: In the future technology stack of the observability platform, what will be the relationship between large models and traditional algorithms? Will it be "replacement", "complementation", or some new "collaborative" model?
Wang Yapu: The relationship between traditional algorithms and large models is not one of opposition but of division of labor, cooperation, and complementary strengths, much like the two cognitive systems in the human brain. One is a fast, automatic reaction system: when driving, you see a red light and brake immediately, or you instinctively become alert when you hear an alarm; this kind of reaction needs no deep thought and is highly efficient. The other requires slow thinking, knowledge integration, and deep analysis, such as diagnosing complex problems or making key decisions. The two do not conflict; they work in collaboration.
From this perspective, traditional algorithms are more like the former, trained for specific scenarios and able to make fast, accurate, and stable reactions within the known scope, similar to "muscle memory"; large models are more like the latter, with extensive knowledge reserves and complex reasoning abilities, capable of handling cross-domain and complex information problems, but with a slower response speed, higher resource consumption, and sometimes even "over-thinking".
In the past, traditional algorithms were the only choice; now large models have become the new protagonist, while traditional algorithms have become the supporting role, yet their value has not decreased. On the contrary, they have found a more suitable position. We should not view the two as an "either-or" choice but let them complement each other through a collaboration mechanism to achieve "1 + 1 > 2".
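One simple form such a collaboration mechanism could take is routing: a cheap deterministic rule handles clear-cut cases at millisecond latency, and only ambiguous or cross-domain cases are escalated to the LLM. The thresholds and the `escalate_to_llm` callback below are illustrative assumptions, not a description of any speaker's system.

```python
# Sketch of "fast system / slow system" routing between rules and an LLM.

def handle_signal(metric_name: str, value: float, baseline: float, escalate_to_llm):
    # Fast path: deterministic threshold rule, no LLM cost.
    if value > 3 * baseline:
        return {"verdict": "anomaly", "by": "rule", "action": "page on-call"}
    if value <= 1.2 * baseline:
        return {"verdict": "normal", "by": "rule"}
    # Slow path: borderline case, hand richer context to the model for reasoning.
    return {"verdict": escalate_to_llm(metric_name, value, baseline), "by": "llm"}
```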
Li Ye: To answer the "replacement" question, I'd like to use elimination. First, we can rule out large models replacing traditional algorithms. Traditional algorithms and CPU-based operators already handle roughly 80%-90% of online scenarios well. In Alibaba Cloud's practice, rule-based methods can intercept or self-heal more than 60%-70% of system anomalies. These methods are efficient, consume few resources, and are highly interpretable. There is no need to use a sledgehammer to crack a nut, so there is no reason to throw large models, with their huge computational overhead, at these problems.
Besides, even setting aside efficiency and cost, large models are technically unsuited to processing raw observability data directly, because the data volume is enormous. One minute of trace data can reach several gigabytes; feeding it all to a large model would immediately overflow the context window, and log volumes are larger still. Current natural-language large models therefore cannot directly process such a huge amount of raw data.
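This is why raw spans are typically aggregated or summarized before anything reaches the model. The sketch below is one possible reduction, grouping spans by service and operation and keeping only the worst offenders; the field names are illustrative assumptions.

```python
# Sketch: compress raw trace data into a prompt-sized summary before the LLM.
from collections import defaultdict

def summarize_spans(spans: list[dict], top_k: int = 20) -> str:
    stats = defaultdict(lambda: {"count": 0, "errors": 0, "total_ms": 0.0})
    for s in spans:
        key = (s["service"], s["operation"])
        stats[key]["count"] += 1
        stats[key]["errors"] += int(s.get("error", False))
        stats[key]["total_ms"] += s["duration_ms"]
    # Keep the top_k (service, operation) pairs by error count, then avg latency.
    rows = sorted(
        stats.items(),
        key=lambda kv: (kv[1]["errors"], kv[1]["total_ms"] / kv[1]["count"]),
        reverse=True,
    )[:top_k]
    return "\n".join(
        f"{svc}/{op}: n={v['count']}, errors={v['errors']}, "
        f"avg={v['total_ms'] / v['count']:.1f}ms"
        for (svc, op), v in rows
    )
```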
Even if we compress the data before feeding it to the model, these models are trained mainly on natural-language text and lack prior knowledge of machine-generated data. Observability data consists mostly of time-series numerical values and machine logs, while the corpora large models learn from are human languages; there is a natural gap between the two. Large models therefore need domain fine-tuning or reinforcement learning in the observability field to be practically useful. In the root-cause ranking task, for example, a general open-source large model used as-is often reaches only 30%-40% accuracy, or even lower; after specialized fine-tuning or reinforcement learning in the observability domain, accuracy can rise to 80%-90% or higher.
In conclusion, large models cannot replace traditional algorithms, and general large models are not "omnipotent" in the observability field; in specific vertical scenarios we still need "fast and accurate" domain models. At the same time, we should not cling to the old rule system and should collaborate where necessary. In the past we wrote large numbers of rules by hand; now large models can help summarize rules, generate annotations, and automatically learn and extract rules from data.