Zhipu's GLM-5V-Turbo lights the fuse, and a war among domestic multimodal agents is about to break out.
In the fierce competition among domestic large models, Zhipu's GLM series has always held a commercially valuable trump card: exceptionally strong coding ability.
As AI's dominant form shifts from large language models to agents, industry competition has entered its second half, and developers and the developer ecosystem remain the groups most willing to pay.
Yet the industry giants clearly expect AI to be more than an "outsourced programmer." Only by becoming an all-around agent that can truly take over system workflows can AI enter the lives of ordinary people.
So a powerful AI cannot get by on typing alone. It needs "eyes": to examine web page layouts, understand posters and charts, and make sense of all the non-text information on a GUI.
A few days ago, DeepSeek fired the first shot with a gradual ("gray-scale") rollout of its image recognition mode.
Now Zhipu is following close behind with an official push into multimodality. The technical report for its latest model, GLM-5V-Turbo, makes clear that this is Zhipu's new charge toward a native multimodal agent, and a candid account full of technical brute force, engineering compromises, and business calculations.
01
The Brute-Force Aesthetics and Fine-Grained Craft of the Visual Foundation
Adding visual capabilities to large language models has been attempted many times over the past few years.
But the resulting vision-language models (VLMs) are often just bolted-together products: the language model is the undisputed "brain," while the visual module is little more than an external camera.
In other words, the model cannot grasp the logic carried by an image. Forcing two-dimensional visual signals into a one-dimensional token sequence leaves the model misreading images, ignoring key details, and even producing serious hallucinations. Naturally, it cannot serve as an agent.
GLM-5V-Turbo therefore sets the tone from the outset:
Multimodal perception must not be a mere auxiliary interface. It must become a native core component of the model's reasoning, planning, tool invocation, and task execution.
To achieve true "nativeness," Zhipu has made three major changes to the underlying architecture:
1. Rebuilding the visual foundation: CogViT, designed specifically for agents
Agents need to operate users' computers. In a graphical user interface, the model must not only know what is in the picture but also attend to easily overlooked details, down to a button only a few pixels across.
To this end, Zhipu built its own parameter-efficient visual encoder, CogViT, trained with a two-stage pre-training scheme:
In the first stage, feature reconstruction, two teacher models are used: SigLIP2 teaches the model semantics, while DINOv3 teaches it texture. Masked image modeling then strengthens the model's visual feature representations.
In the second stage, image-text alignment, the NaFlex scheme is introduced to handle dynamic resolution, and the global batch size is pushed up to 64K.
This design maximizes the new model's spatial perception and geometric understanding, laying the groundwork for later operations on web and mobile UIs.
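To make the two-stage recipe concrete, here is a minimal PyTorch sketch of what the stage-1 dual-teacher objective might look like. Only the teacher names (SigLIP2, DINOv3) and the use of masked image modeling come from the report; the loss weighting, the MIM target, and every function signature here are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def stage1_loss(student, siglip2, dinov3, images, mask_ratio=0.4):
    """Illustrative stage-1 objective: distill semantics from SigLIP2,
    texture from DINOv3, and add a masked-image-modeling term.
    Signatures, loss weights, and the MIM target are assumptions."""
    with torch.no_grad():
        sem_t = siglip2(images)   # (B, N, D) semantic teacher features
        tex_t = dinov3(images)    # (B, N, D) texture/geometry teacher features

    # Mask a random subset of patch tokens before the student forward pass
    B, N = sem_t.shape[:2]
    mask = torch.rand(B, N, device=images.device) < mask_ratio
    feats = student(images, patch_mask=mask)      # (B, N, D) student features

    # Align the student with both teachers on all positions...
    distill = F.mse_loss(feats, sem_t) + F.mse_loss(feats, tex_t)
    # ...and reconstruct teacher features at masked positions (MIM term)
    mim = F.mse_loss(feats[mask], sem_t[mask])
    return distill + mim
```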
2. Balancing engineering against algorithms: Multimodal Multi-Token Prediction (MMTP)
Introducing multimodal capabilities inevitably brings a steep rise in GPU memory and compute consumption.
Developers who follow the field know that Zhipu's compute reserves have not been abundant over the past six months; its earlier, controversial price adjustment indirectly confirmed that at inference scale, compute cost is a black hole.
Introducing Multi-Token Prediction (MTP) to speed up inference is standard industry practice. But in adopting MTP, Zhipu made a textbook engineering decision:
Since directly passing information-dense visual features to the MTP prediction head is infeasible, a shared special token, "<|image|>", serves as the placeholder for visual input.
This seemingly trivial change is engineering pragmatism at its best. It significantly reduces communication complexity under pipeline parallelism and sidesteps the headache of GPU memory blow-ups.
Better still, while preserving stable convergence, this trick sharply cuts the compute cost of both training and inference.
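A rough sketch of how such a placeholder might work: before hidden states reach the MTP head, every visual position is overwritten with a single shared learned embedding for "<|image|>". Only the shared-token idea comes from the report; the module name and interface below are hypothetical.

```python
import torch
import torch.nn as nn

class VisualPlaceholderForMTP(nn.Module):
    """Replace per-patch visual features with one shared learned embedding
    for the "<|image|>" token before the MTP prediction head. Class name
    and interface are assumptions; only the idea is from the report."""

    def __init__(self, hidden_dim: int):
        super().__init__()
        self.placeholder = nn.Parameter(torch.zeros(hidden_dim))

    def forward(self, hidden_states: torch.Tensor, is_visual: torch.Tensor):
        # hidden_states: (B, T, D); is_visual: (B, T) bool mask over positions
        out = hidden_states.clone()
        out[is_visual] = self.placeholder.to(out.dtype)  # broadcast one vector
        return out
```

Since every image position now carries the same D-dimensional vector, a pipeline stage no longer has to ship full visual features downstream to the MTP head, which is plausibly where the communication and memory savings come from.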
3. Breaking the long-tail curse: an ultra-large-scale multimodal reinforcement learning system
Today's recipe for training agents is, at its core, no different from that of large language models: reinforcement learning.
In agent training, however, single-task reinforcement learning easily sends the model into oscillation.
Zhipu's research team found that multi-task joint reinforcement learning exposes the model to a more diverse distribution of strategies and can even transfer thinking patterns across tasks.
So Zhipu ran joint reinforcement learning over more than 30 task categories and fully decoupled the pipeline into asynchronous stages at the infrastructure level: visual segmentation was moved from the forward pass up into the data-loading stage, and communication between GPUs was given aggressive memory management.
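As an illustration of "moving visual segmentation into data loading," the sketch below patchifies images inside DataLoader workers so the GPU receives ready-made patch tensors; the patch size and dataset shape are assumptions, not details from the report.

```python
import torch
from torch.utils.data import Dataset, DataLoader

class PrePatchifiedDataset(Dataset):
    """Patchify images on CPU workers at load time so segmentation work
    overlaps with GPU compute instead of running in the forward pass."""

    def __init__(self, images, patch_size: int = 14):
        self.images = images      # list of (C, H, W) tensors
        self.p = patch_size

    def __len__(self):
        return len(self.images)

    def __getitem__(self, i):
        img = self.images[i]
        C, H, W = img.shape
        p = self.p
        # (C, H, W) -> (num_patches, C * p * p), computed off the GPU
        patches = (img.unfold(1, p, p).unfold(2, p, p)
                      .permute(1, 2, 0, 3, 4)
                      .reshape(-1, C * p * p))
        return patches

# num_workers > 0 makes the patchification asynchronous with training
loader = DataLoader(PrePatchifiedDataset([torch.randn(3, 224, 224)] * 8),
                    batch_size=4, num_workers=2)
```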
02
The Paradigm Shift from API Distribution to Workflow Takeover
The reconstruction of the technical foundation ultimately points to a leap in commercialization logic.
The deep multimodal research capability GLM-5V-Turbo demonstrates signals two commercial shifts in Zhipu's AI applications.
First, deep multimodal research breaks through the barriers of traditional text-only SaaS.
Most earlier AI assistants could only read plain text. Even when users could upload attachments such as images, videos, and PDFs, recognition quality fell off a cliff once the non-text content grew too large.
GLM-5V-Turbo, by contrast, can autonomously loop through a "planning → multimodal reading → status update" workflow, parse the high-value visual information in charts, documents, and PPTs, and deliver Markdown business reports and highly structured slides directly.
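In pseudocode terms, that loop might look like the sketch below. Every method called on `model`, as well as `step.done` and `step.fetch`, is a hypothetical interface invented for illustration, not Zhipu's actual API.

```python
def deep_research(model, task: str, max_steps: int = 20) -> str:
    """Hedged sketch of the "planning -> multimodal reading -> status
    update" loop the article describes. All interfaces are hypothetical."""
    state = {"task": task, "findings": []}
    for _ in range(max_steps):
        step = model.plan(state)            # planning: decide what to read next
        if step.done:
            break
        document = step.fetch()             # a chart, PDF page, slide, ...
        state["findings"].append(model.read(document))  # multimodal reading
        # status update: findings feed back into the next planning round
    return model.write_report(state)        # Markdown deliverable
```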
Here, Zhipu's approach closely mirrors Anthropic's, which released Claude for Microsoft 365 yesterday and moved straight into the Microsoft ecosystem.
Traditional information-retrieval tools will inevitably take a hit. Once AI can deliver finished, data-visualized reports end to end, token-based billing will gradually give way to billing by delivered project.
Second, the ultimate form of an agent is the symbiosis of model and harness.
Zhipu's technical report offers a genuinely thought-provoking view:
The capability boundary of the system is no longer determined by the model alone; it is jointly shaped by the model and the framework (harness) around it.
As one of the leading domestic model makers, Zhipu also keeps shipping a more complete official toolchain (Official Skills) and has integrated seamlessly with the industry-standard frameworks Claude Code and AutoClaw.
Zhipu has clearly recognized that no single AI startup can build an ecosystem as powerful as Google's. Rather than going it alone, better to let global general-purpose tools like Claude Code and AutoClaw, which excel at terminal and file logic, serve as its dexterous hands for operating computers.
The once-popular myth of the "all-powerful large model" has all but shattered. Even OpenAI, for all its strength, cannot reach AGI with a large language model alone. The moat of the future will lie in the deep coupling of model capabilities with external tools.
After all, enterprise customers, the main payers, have never needed a robot that can chat about everything; they need a cognition-driven engine that slots directly and seamlessly into their existing systems.
03
Hard-Earned Lessons: Three Laws of Agent R&D
What sets this technical report apart is that, at its end, the research team did something rare: they candidly shared the design lessons they distilled during development.
This pitfall-avoidance guide, paid for in vast amounts of compute and all-night overtime, is arguably even more valuable than the open-sourced models and techniques, and it matters to the entire AI industry.
First, don't aim too high: low-level perception is the cornerstone that sets the model's ceiling.
Over the past year a trend has taken hold in the AI industry: every product launch comes wrapped in labels like "deep thinking," "self-reflection," and "long-horizon planning," as if only products bearing them count as advanced AI.
Yet user feedback makes plain that these lofty labels rarely survive contact with concrete application scenarios.
Zhipu found in practice that many seemingly advanced planning failures stem not from an accumulation of small errors along the way, but from the model groping blindly, like the blind men and the elephant, from the very first step: it may fail to see a subtle UI element clearly, or misjudge a button's position on screen.
Agents operate on a completely different logic from large language models. Visual perception is by no means a low-level module to be discarded after preprocessing; it continuously caps the model's higher-order reasoning ability.
Second, in agent training, abandon the superstition of "end-to-end" and actively embrace hierarchical optimization.
This does not contradict the claim that agents should be trained with agent-style (rather than language-model-style) reinforcement learning. But AI companies must also face reality: training agents is currently expensive, high-quality trajectory data is scarce, and industry-standard evaluation criteria are lacking.
Force a model to learn extremely complex long-horizon tasks from the start and it will either mimic the form without grasping the substance, or simply collapse.
Zhipu's approach is to carve tasks apart like a master butcher dissecting an ox and optimize hierarchically: from recognizing icons at the lowest level, to single-step action prediction, up to planning entire behavior trajectories (a sketch of such a curriculum follows below). Experience shows this is not merely a compromise forced by limited compute but one of the best ways to guarantee stable convergence.
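As a sketch of that hierarchy, training could be staged roughly as follows; the stage names, objectives, and the `model.fit` interface are illustrative assumptions, not figures from the report.

```python
# Hierarchical curriculum sketch: each stage warm-starts from the previous
# one instead of learning long-horizon trajectories from scratch.
CURRICULUM = [
    {"stage": "grounding",   "objective": "locate icons and UI elements"},
    {"stage": "single_step", "objective": "predict one action from one screen"},
    {"stage": "trajectory",  "objective": "plan a full multi-step trajectory"},
]

def train_hierarchically(model, stages=CURRICULUM):
    for s in stages:
        # the model carries its weights forward into the next, harder stage
        model.fit(objective=s["objective"])  # hypothetical training interface
```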
Finally, a task that cannot be evaluated accurately has no reference value.
For today's multimodal agents, the hardest part is not getting them to work but knowing how to "score" them objectively.
Unlike a dialog box on a web page, a real computer environment is open-ended and full of uncertainty. Zhipu realized that only an evaluation process with strict step-level control, one able to isolate signals along different dimensions, can make end-to-end evaluation meaningful and feed back into model iteration.
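One way to picture such an evaluator, with per-step checks and per-dimension failure signals, is sketched below; the structure of the checks is entirely an assumption for illustration.

```python
def evaluate_trajectory(trajectory, checks):
    """Score an agent trajectory step by step, attributing failures to an
    isolated signal (e.g. perception vs. action) rather than issuing one
    end-of-task pass/fail. The `checks` structure is an assumption."""
    report = {"passed_steps": 0, "failed_at": None}
    for i, (step, check) in enumerate(zip(trajectory, checks)):
        for signal, verifier in check.items():   # e.g. "perception", "action"
            if not verifier(step):
                report["failed_at"] = (i, signal)  # which step, which dimension
                return report
        report["passed_steps"] += 1
    return report
```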
04
Conclusion
Read in full, Zhipu's technical report feels less like a demonstration of model capabilities than an open symposium between the research team and its users.
The report does not paint the model as flawless. Instead, it closes with several soul-searching, unsolved industry puzzles:
Videos and images alike are memory-devouring monsters; how should compressed contextual memory be built for long-horizon tasks?
When will models shed human-fed reference answers and evolve smarter interaction strategies on their own?
No one can answer these questions yet.
What we can see is a domestic model evolving rapidly, and an entire AI industry wading into difficult deep waters.
Going multimodal is the inevitable path in Zhipu's march toward a full-stack agent, but the compute bills pile up at every step. Under the hard reality of scarce compute, Zhipu has still fought an admirable breakout battle with clever architectural design, extreme GPU-memory optimization, and a hierarchical training strategy.
GLM-5V-Turbo has proven it can take over a user's screen. The next test is whether the market is ready to pay for the productivity of "native multimodality."
This article is from the WeChat official account "Silicon-based Starlight". Author: Siqi. Republished by 36Kr with permission.