Employees complain that "cleaning up after AI" is even more tiring? Unveiling the "paradox" and the real inflection point of AI efficiency gains in enterprises
As the digital wave sweeps across the globe, AI is being woven into enterprise development with unprecedented depth and breadth, becoming a core engine of innovation and growth. So how can companies discover new business and growth opportunities through AI? And how can they use AI to acquire, retain, and convert users more efficiently?
Recently, InfoQ's live-streamed program "Geek Gathering," in collaboration with AICon, invited Wang Yunfeng, CTO of Smzdm.com, as the host. On the eve of the AICon Global Artificial Intelligence Development and Application Conference 2025 (Beijing), he was joined by Liang Xiaowu, a senior technical expert at Alibaba, and Zou Panxiang, General Manager of the AI R&D Department at CaiXun Co., Ltd., for a practical review of AI-driven efficiency gains in enterprises.
Some highlights from the conversation:
- Without hallucination, there is no creativity. We must accept this deviation and, at the same time, use engineering means to keep it within a reasonable range to get the results we need.
- In all types of Agents, data governance is always a key prerequisite in large-model engineering.
- A project lead must be able to package underlying AI capabilities into product features that users can perceive, understand, and use. They must also be able to communicate with the business side, be familiar with customer scenarios and business processes, and derive AI requirements from them.
The following content is based on the live-stream transcript and has been abridged by InfoQ.
When a "Pupil" Meets a "Ph.D. Student"
Wang Yunfeng: Let's start with "model usage". Wang Xiaochuan of Baichuan once made an analogy: today's top-tier models (like GPT-4 and Gemini 3) have reached the intelligence of a "Ph.D. student". But in my experience, while the model may be a Ph.D. student, the engineering environment we build for it, and even the Prompts we write for it, may still be at the "pupil" level. This capability mismatch often makes us feel that AI doesn't follow instructions. In this year's practice, how did you actually harness this "Ph.D. student's" capabilities in engineering? Are there experiences worth sharing?
Liang Xiaowu: Based on my long background in GUI automation and my current work using AI to upgrade GUI capabilities, I've distilled three lessons.
First, select the base model for the specific scenario. GUI operations are essentially closer to RPA, and their visual grounding and reasoning differ significantly from voice or text tasks. We therefore explored many domestic and foreign models, alone and in combination, and found that Qwen3 performs outstandingly on reasoning in our GUI scenarios.
Second is Agent architecture design. An AI Agent's architecture differs from a traditional microservices system: it must gradually converge from uncertainty rather than follow a fixed process. The core of the design is therefore using engineering means to keep the model's output within a controllable range, so that this "Ph.D. student" interacts with our system in a controlled, effective way.
In the GUI Agent, we introduced a "referee" role: the referee passes judgment on every step of an operation. This is a key mechanism outside the model, as the sketch below illustrates.
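For illustration only, a minimal sketch of what such a step-level referee loop might look like. All names here (`Action`, `Verdict`, `run_step`) are hypothetical; the actual mechanism is not public.

```python
# Illustrative sketch of a step-level "referee" for a GUI Agent.
# All names are hypothetical, not the real implementation.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Action:
    kind: str    # e.g. "click", "drag", "type"
    target: str  # element description or coordinates

@dataclass
class Verdict:
    ok: bool
    reason: str

def run_step(propose: Callable[[str], Action],
             execute: Callable[[Action], str],
             referee: Callable[[str, Action, str], Verdict],
             state: str, max_retries: int = 2) -> str:
    """Propose an action, execute it, and let the referee judge the result.
    On a failed verdict, retry with the referee's feedback added to context."""
    feedback = ""
    for _ in range(max_retries + 1):
        action = propose(state + feedback)
        new_state = execute(action)
        verdict = referee(state, action, new_state)
        if verdict.ok:
            return new_state  # step accepted; uncertainty converges here
        feedback = f"\n[referee] step rejected: {verdict.reason}"
    raise RuntimeError("step did not pass the referee; escalate to a human")
```

The point of the loop is exactly what Liang describes: the judgment lives outside the model, so a bad step is caught and retried rather than silently propagated.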
Third is context engineering, formerly known as Prompt engineering. Since we don't build base models, and at most train some small vertical models (such as small analysis models or local image-recognition models), context engineering has become the core of our AI engineering capability.
Wang Yunfeng: Our company held a hackathon a while ago. Many colleagues reported that until you have built an Agent end to end, you don't really grasp the point of context engineering. We can't stuff all knowledge into the model at once, yet tasks often need a large amount of information. If context engineering is done poorly, the model can exercise only a fraction of its capability.
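As an illustration of what "not stuffing everything in" can look like, here is a minimal context-assembly sketch under a token budget. The names (`build_context`, `retrieve`) and the rough 4-characters-per-token estimate are illustrative assumptions, not anyone's actual implementation.

```python
# Minimal sketch of context assembly under a token budget (hypothetical names).
def build_context(task: str, retrieve, tools_doc: str, budget: int = 8000) -> str:
    """Assemble only the knowledge a task needs instead of stuffing everything in."""
    parts = [f"# Task\n{task}", f"# Available tools\n{tools_doc}"]
    for doc in retrieve(task):  # ranked, task-relevant snippets
        # crude 4-chars-per-token estimate; stop before the budget overflows
        if sum(len(p) // 4 for p in parts) + len(doc) // 4 > budget:
            break
        parts.append(f"# Reference\n{doc}")
    return "\n\n".join(parts)
```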
Teacher Zou, B-end customers' requirements are often very rigid. How do you use engineering means (such as Chain of Thought, CoT) to make the model show a "Ph.D.'s" intelligence while keeping a "pupil's" discipline?
Zou Panxiang: There are significant differences between C-end and B-end AI applications. Let's start with the much-discussed "hallucination" problem of large models. Many people see hallucination as purely negative, but it is precisely because of hallucination that a model has generative capability. A model with no hallucination at all could only recite memorized knowledge and never generate anything new.
So we decide, scenario by scenario, whether hallucination is needed. It can't be eliminated completely, but we can choose when to suppress it. In creative scenarios, such as video marketing aimed at different user groups, hallucination helps generate more imaginative and diverse content. In B-end business scenarios, we usually need to reduce it as much as possible.
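One common engineering knob for trading diversity against control, though by no means the only one, is the sampling temperature. A hedged sketch against the OpenAI-compatible chat API; the speakers don't name their stack, the model name is a placeholder, and `OPENAI_API_KEY` is assumed to be set in the environment.

```python
# Sketch: looser sampling for creative tasks, tighter for rigid B-end flows.
from openai import OpenAI

def ask(prompt: str, creative: bool) -> str:
    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
    resp = client.chat.completions.create(
        model="gpt-4o",                        # placeholder model name
        messages=[{"role": "user", "content": prompt}],
        temperature=1.0 if creative else 0.1,  # diversity vs. control
    )
    return resp.choices[0].message.content
```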
Reducing hallucination requires clear methods: when to reduce it, how to reduce it, and how to preserve the model's "Ph.D." level while doing so. Context engineering is crucial here: we inject expert experience, the execution results of tool APIs, plug-in outputs, reasoning chains, and so on into the model to reduce deviation. But context engineering alone is not enough, because hallucination can originate in multiple links: knowledge retrieval, reasoning and planning, tool invocation, or intent recognition. On the B-end we can't accept a black-box process, so we proposed an approach that is observable and controllable across the whole pipeline.
Observability has three stages. The first is intent understanding: since the clarity of users' questions varies, we need an intent-clarification step to capture the real requirement. After clarification comes task planning. The thinking behind each step, including the knowledge used and the tools invoked, is printed out so users can see the model's reasoning path. In itinerary planning, for instance, the model must repeatedly ask about time, party size, transportation, and accommodation preferences to ensure the plan meets the requirements.
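To make "printing out the reasoning path" concrete, here is a toy trace structure. The stage names mirror the stages Zou describes, but the schema itself is an illustrative assumption.

```python
# Sketch of an observable reasoning trace: every stage is recorded and
# printable so users can inspect the model's path. Field names are illustrative.
import json
import time

class Trace:
    def __init__(self):
        self.steps = []

    def log(self, stage: str, detail: dict):
        self.steps.append({"t": time.time(), "stage": stage, **detail})

    def dump(self) -> str:
        return json.dumps(self.steps, ensure_ascii=False, indent=2)

trace = Trace()
trace.log("intent_clarification",
          {"question": "Plan a trip", "asked": ["dates?", "party size?"]})
trace.log("task_planning",
          {"plan": ["search flights", "book hotel"], "knowledge": ["visa rules"]})
trace.log("tool_call",
          {"tool": "flight_search", "args": {"from": "PEK", "to": "SHA"}})
print(trace.dump())  # the full reasoning path a user can see
```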
Before going live, we make the model behave as expected as far as possible by making the process observable, printing each reasoning step, and injecting context. After going live, the biggest difference between AI applications and traditional IT applications is the unclear boundary: an IT system's output is predictable, but AI can only make "estimates". We can lift performance from 60 to 80 points at launch, but never to 100. After launch we keep iterating from 80 toward 90, 95, even 99 points, yet perfection remains out of reach.
Therefore, we must use engineering means to inject content in advance, intervene manually, or add supplements to handle the parts the model can't cover.
The AI model is only part of the system, not the whole. So we add many supporting modules and management tools. In some scenarios we even shrink the model to reduce its generalization and improve controllability. Sometimes we have the model generate a plan first, have a human verify it, and then convert the plan into a controllable path-search process, as sketched below. Having hit many pitfalls building Agents in this industry, we concluded that AI implementation must be observable, iterable, controllable, credible, and integrable. These requirements in turn drive the delivery and R&D teams to build the tools and engineering capabilities that support implementation.
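A minimal sketch of the "generate a plan, verify it manually, then execute a fixed path" pattern Zou mentions; `generate_plan` and `execute_step` are hypothetical placeholders for the real model call and executors.

```python
# Sketch: the model proposes, a human approves, execution is then deterministic.
def run_with_approval(goal: str, generate_plan, execute_step):
    plan = generate_plan(goal)  # model proposes; nothing runs yet
    print("Proposed plan:")
    for i, step in enumerate(plan, 1):
        print(f"  {i}. {step}")
    if input("Approve this plan? [y/N] ").strip().lower() != "y":
        return None  # human gate: nothing executes without approval
    results = []
    for step in plan:  # after approval, execution follows a fixed,
        results.append(execute_step(step))  # auditable path
    return results
```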
Wang Yunfeng: Whether on the C-end or the B-end, everyone converges on the same understanding: large models are already highly "intelligent", but a high-IQ brain alone can't solve problems. It still needs a great deal of knowledge and engineering capability, and much of that knowledge is not in the model itself. So we can't expect the model to produce the final answer in one shot; we must use engineering means to keep its output controllable.
The model's creativity and emergent capabilities come from diversity, which today we call "hallucination". Without hallucination, there is no creativity. We must accept this deviation and, at the same time, use engineering means to keep it within a reasonable range to get the results we need.
The Lack of "Context" in Data
Wang Yunfeng: After models, let's talk about "data", the fuel for AI. At Smzdm.com, for AI to make good consumption decisions, it must see a large amount of user behavior and community content. That is also my core motivation for pushing MCP (Model Context Protocol): I want to solve the "context" connection problem between AI and enterprise private data. Honestly, though, it has been harder than expected. In your practice, what is the biggest obstacle to making AI "understand" an enterprise's internal business logic? Teacher Zou, when delivering to B-end customers, do you spend a lot of time helping them "clean data"? And if a customer's data is in terrible shape, should we push AI through anyway, or advise them to redo digitalization first?
Zou Panxiang: During implementation, many customers ask: if the model is already so powerful, why do we still need data? What kinds of data are needed? And how should data governance be done? To answer these questions, we have to start from why we do data governance at all.
First, the model often can't understand an enterprise's own business scenarios, processes, and vertical-domain terms. In the telecom-operator scenario, for example, a "package" means a phone-plan package, but if we ask the model directly to "order me a package", it may think of McDonald's or KFC. So in specific business scenarios we must transfer the enterprise's private knowledge to the model so it understands what the terms really mean.
Data governance is also about making expert experience explicit and transferable. Expert experience usually lives in problem-analysis methods and handling processes, while the model relies on general knowledge. Without professional-knowledge input, the model can't solve many scenario-specific problems.
For example, in an operator's 10086 customer-service scenario, users may ask about packages, number portability, and so on. Number portability has strict processes and eligibility conditions and can't be handled casually. Online customer service often failed to resolve it for a long time, while it took only a few minutes offline, because it involves business-system calls and condition checks. If this content isn't provided to the model, the model can't really understand the business.
We need to clarify which data to govern; it falls into two broad categories. The first is knowledge-type data: expert experience, documents, solution write-ups, and structured analysis data. This data must be explicitly captured and consolidated, usually as PPT or Word documents. Knowledge-type data can be combined with the model in two ways: put it into a knowledge base, or use it for model training. If it is used for training, we must pin down the model type (multimodal or language model) and the training stage (SFT, reinforcement fine-tuning, or alignment training); the required data format, data volume, and tooling differ accordingly, and we must do cleaning, deduplication, annotation, desensitization, and so on, as the sketch below illustrates.
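To make the cleaning, deduplication, and desensitization steps concrete, a toy sketch follows. The phone-number regex and MD5 fingerprinting are illustrative choices, not a prescribed pipeline.

```python
# Toy sketch of a cleaning / dedup / desensitization pass over a text corpus.
import hashlib
import re

def desensitize(text: str) -> str:
    """Mask 11-digit phone-number-like strings before data leaves the enterprise."""
    return re.sub(r"\b\d{11}\b", "[PHONE]", text)

def clean_corpus(docs):
    seen, out = set(), []
    for doc in docs:
        doc = desensitize(doc.strip())
        h = hashlib.md5(doc.encode("utf-8")).hexdigest()  # exact-dup fingerprint
        if doc and h not in seen:
            seen.add(h)
            out.append(doc)
    return out
```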
If the knowledge goes into a knowledge base, we must consider data sources, types, update mechanisms, conflict handling, and freshness management. Governance then concentrates on the parsing, storage, indexing, and recall stages, ensuring the knowledge is consistent, valid, and accurate when retrieved.
The other category is data from the production process: API call records, system logs, task-execution traces, and the like. This data is sometimes used as material for model reinforcement learning, but more often it is fed to the model as context during real-time reasoning. Strict constraints must be set here; not all data can be exposed to the model directly. In a multi-Agent environment especially, data may be wrongly cached by the model and passed between agents, creating a risk of cross-permission data leakage.
For example, if a finance Agent obtains the enterprise's financial data and the model caches it, and a recruiting Agent then reads that cache without authorization, that is a serious security risk. So in production-data governance we focus on the account system, permission control, privacy protection, data desensitization, and defense against external prompt poisoning.
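A minimal sketch of the permission-scoped context retrieval this implies; the role tags and records are invented for illustration.

```python
# Sketch: data is tagged with the roles allowed to read it, and every agent's
# fetch is checked, so nothing leaks across agents via a shared cache.
RECORDS = [
    {"text": "Q3 revenue: ...", "roles": {"finance"}},
    {"text": "Open headcount: ...", "roles": {"finance", "recruiting"}},
]

def fetch_context(agent_role: str):
    """Return only records this agent's role may see; never cache across agents."""
    return [r["text"] for r in RECORDS if agent_role in r["roles"]]

assert "Q3 revenue: ..." not in fetch_context("recruiting")  # no cross-role leak
```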
Once these governance measures are in place, we evaluate the model's effectiveness along two lines: technical indicators and business indicators. Technical evaluation covers accuracy, recall, answer consistency, and relevance; business evaluation focuses on key metrics such as user growth rate, customer-service difficulty rate, sales conversion rate, and user activity. Observing and analyzing this business data is the basis for the Agent's continuous iteration and optimization.
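As a toy illustration of the technical side, here is an accuracy-style evaluation loop over a labeled test set; the substring-match criterion is a deliberate simplification, and real harnesses would also score recall, consistency, and relevance.

```python
# Minimal technical-evaluation sketch: accuracy over labeled question/answer pairs.
def evaluate(agent, test_cases):
    hits = 0
    for question, expected in test_cases:
        answer = agent(question)
        hits += int(expected.lower() in answer.lower())  # crude match criterion
    return hits / len(test_cases)

# evaluate(my_agent, [("Which plan includes 30GB?", "Plan B"), ...])
```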
The core driver of the Agent's continuous iteration is ROI. We must rely on post-launch data to evaluate the benefits and to locate whether a problem is technical or business-side. This kind of data governance belongs to continuous operations after launch, not the pre-governance of the AI-ready stage.
Wang Yunfeng: Teacher Liang, Fliggy's GUI Agent looks at the screen directly, which seems to bypass the "data interface" governance problem. Is that a kind of cutting corners? Is the depth of data understanding sufficient this way?
Liang Xiaowu: In the GUI Agent, besides knowledge-type data, there is another significant particularity: graphical data. If the GUI image data is inaccurate, the model can't locate interface elements. The quality of image data is therefore crucial to the GUI Agent's effect.
We've done a lot of work here. For example, when providing GUI data we must design a format the large model can understand, which is itself part of context engineering. GUI operations involve an "action space", events such as clicks and drags, whereas ordinary non-GUI scenarios have no such concept. We therefore define these actions explicitly for the model so it can correctly predict the next action; a sketch follows.
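For illustration, a declared action space might look like the following; the exact actions and schema are assumptions, not the production definition.

```python
# Sketch of an explicit GUI action space handed to the model, so the model can
# only predict actions we know how to execute. The schema is illustrative.
ACTION_SPACE = {
    "click":  {"args": ["x", "y"]},
    "drag":   {"args": ["x1", "y1", "x2", "y2"]},
    "type":   {"args": ["text"]},
    "scroll": {"args": ["direction", "amount"]},
}

def validate_action(action: dict) -> bool:
    """Reject any model output that falls outside the declared action space."""
    spec = ACTION_SPACE.get(action.get("kind"))
    return spec is not None and set(spec["args"]) <= set(action.get("args", {}))

assert validate_action({"kind": "click", "args": {"x": 10, "y": 20}})
```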
Besides, in practice the model's accuracy can never reach 100%, especially in enterprises with many non-standard, customized UIs. The model easily recognizes standard buttons but often struggles with enterprise-customized buttons or special components. So we must help it understand through data injection and teaching by example. When dealing with certain vertically scrolling components, for instance, we explicitly tell the model that this is a scrolling structure and provide examples for it to learn from, similar to the training logic of other AI models.
Meanwhile, in the GUI Agent we also apply techniques such as CAG (Cache-Augmented Generation) to high-frequency operation scenarios, specially handling hot graphical data to improve the stability of recognition and operation (see the sketch below). Overall, the accuracy of image data directly determines the GUI Agent's execution quality: if the model can't recognize graphics or correct its errors, high-precision operation is impossible. In all types of Agents, data governance is always a key prerequisite in large-model engineering.
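A hedged sketch of the caching idea for hot screens: reuse a verified layout when the same screenshot has been seen before, rather than re-asking the model every time. `recognize` stands in for the real vision model; the hashing scheme is an assumption.

```python
# Sketch of cache-augmented handling for high-frequency screens.
import hashlib

_layout_cache: dict[str, dict] = {}

def locate_elements(screenshot: bytes, recognize):
    key = hashlib.sha256(screenshot).hexdigest()
    if key in _layout_cache:        # hot screen: skip model inference entirely
        return _layout_cache[key]
    layout = recognize(screenshot)  # cold path: call the vision model
    _layout_cache[key] = layout
    return layout
```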
Is it "Real Efficiency Improvement" or "More Tired"?
Wang Yunfeng: Bosses are all shouting about efficiency gains, but front-line employees sometimes report being more tired: they used to write the code or copy themselves; now they have to write Prompts for the AI, review its output, and take the blame for mistakes. After your projects went live, did you run into this "the more you use it, the more tired you get" situation? Where do you think the real inflection point of efficiency improvement lies?
Liang Xiaowu: Whether employees feel less tired mainly depends on accuracy. When building an AI Agent, we always face the accuracy problem. Unlike traditional deterministic, process-based engineering, AI is a process of continuously converging uncertainty. With low accuracy, employees lack confidence and find using AI tiring. Take our GUI Agent: in April this year, accuracy was only about 40%, and employees could hardly rely on it.
By September, after we raised accuracy to 90%-95%, the team's confidence in AI rose markedly, and they truly saw its value for efficiency. Our C-end business is now fully taken over by AI, and the efficiency gains are obvious.
The second lesson concerns the impact of the engineering and tooling system on efficiency. When building AI engineering, we need to provide infrastructure such as debugging tools, incubation tools, scaffolding, and Prompt template libraries, so that employees can use them directly instead of spending time on technical details unrelated to AI itself, much like the tool-building of the earlier process-based architecture era. A toy sketch of such a template library follows.
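As a toy example of this kind of shared infrastructure, a minimal Prompt template library; the template text and names are invented for illustration.

```python
# Sketch of a tiny shared Prompt template library, so engineers reuse vetted
# prompts instead of hand-writing them per task. Contents are illustrative.
from string import Template

TEMPLATES = {
    "gui_step": Template(
        "You control a GUI. Current screen: $screen\n"
        "Allowed actions: $actions\nGoal: $goal\n"
        "Reply with exactly one action as JSON."
    ),
}

def render(name: str, **kwargs) -> str:
    return TEMPLATES[name].substitute(**kwargs)

# render("gui_step", screen="login page", actions="click, type", goal="sign in")
```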
If the infrastructure is complete, overall development efficiency improves significantly. Employees then only need to focus on the core AI parts, such as context engineering, reasoning ability, image-recognition quality, or language quality. Once there is both high enough accuracy and a mature engineering tool system, employees can genuinely feel the efficiency gains rather than the burden.