Zhipu Launches a Battle for the Right to Interpret the Digital World

SCAIL-2, more than SCAIL-2

2026 marks the third year of rapid development since generative AI emerged.

Over the past three years, the AI industry has undergone drastic upheaval, moving from large language model training to multimodal understanding and then to the paradigm shift of video generation.

Capital and public attention have stayed fixed on the market, visuals, and coherence of video generation. A long-neglected pain point in the deep end of the field has finally been brought to the forefront: generating video is easy, but controlling it is hard; the visuals are stunning, but the output cannot be put into production.

This is why programming ability and multimodal ability are both essential to commercializing intelligent agents. Yet the former is always emphasized, while the latter is always overlooked.

Unexpectedly, the first to systematically solve this problem is not ByteDance, the dominant player in the video generation field, nor Alibaba, which has a complete range of video and image models, but Zhipu, which was previously criticized for lacking multimodal ability.

This time, led by Professor Tang Jie, Zhipu's founder and chief scientist, Zhipu AI and a Tsinghua University research team jointly released a model called SCAIL-2, cutting through the barrier between artificial intelligence and industrial film and television production with surgical precision.

The model was released quietly, but it poses a paradigm-level challenge to the underlying logic of digital content production: it takes on the industry's long-standing "intermediate representation" rule. With a minimalist end-to-end architecture, it heralds an era of intention-driven digital creation.

01 From "Skeleton Dependence" to "Visual Intuition"

In AI video generation, control techniques have long been trapped in a kind of "symbol worship". Whether at Runway or in some early diffusion models, the engineering community had to build a complex translation system before AI could produce controlled motion:

A pose estimator abstracts the human body in the video into skeleton diagrams, and these skeleton diagrams are then fed into the model as constraints.

This "stick-figure" approach essentially teaches AI to "imitate symbols" rather than "understand motion". In an ideal state, reinforcement learning on hundreds of millions of stick figures sounds perfect, but in complex scenarios the results are completely different. When multiple human subjects occlude one another, when hands make fine movements, or when non-human characters are involved, a system built on stick figures collapses instantly because of depth ambiguity.

This is where SCAIL-2 is revolutionary. It announces the end of the era of relying on stick figures. The core architecture of SCAIL-2 abandons the explicit intermediate representation entirely: it takes the latent-space features of the driving video and the latent-space features of the reference character and splices them together at the pixel level, letting the model read the visual context directly.
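To make the idea of splicing latents concrete, here is a minimal sketch in plain Python. It is not SCAIL-2's implementation: the function name `splice_latents`, the nested-list "tensors", and the choice to join the two inputs along the frame axis are all assumptions made for illustration. The point is only that the model receives one combined sequence, with no skeleton or pose map in between.

```python
from typing import List

# Hypothetical stand-in for a latent frame: a grid of tokens, each a list of channels.
Frame = List[List[float]]


def splice_latents(reference: Frame, driving: List[Frame]) -> List[Frame]:
    """Join a reference-character latent with driving-video latents.

    Assumption: the splice is a concatenation along the frame axis, so the
    reference frame simply becomes the first element of the sequence.
    """
    if not driving:
        raise ValueError("driving video must contain at least one latent frame")
    tokens, channels = len(reference), len(reference[0])
    for frame in driving:
        if len(frame) != tokens or any(len(tok) != channels for tok in frame):
            raise ValueError("reference and driving latents must share a shape")
    # No pose estimator runs here: the raw visual latents are passed through.
    return [reference] + driving


# Toy example: one reference frame and two driving frames, each 2 tokens x 2 channels.
reference = [[0.0, 0.0], [0.0, 0.0]]
driving = [[[1.0, 1.0], [1.0, 1.0]], [[2.0, 2.0], [2.0, 2.0]]]
sequence = splice_latents(reference, driving)
print(len(sequence))
```

In a real model the inputs would be tensors produced by a video encoder, and the splice could equally be done along a spatial or channel axis; the article does not say which.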

This design turns the model from a translator into an observer. The advantage of splicing video latent vectors directly is obvious: the model can capture information a skeleton cannot express, such as subtle wrinkles in clothing, the play of light and shadow on characters in complex environments, and the physical logic of interactions between objects.

Building machine visual intuition is even more valuable than improving technical metrics. The way the model understands real-world actions has changed qualitatively. Instead of translating each "point", it internalizes the physical laws of human motion directly from large-scale data.

This end-to-end ability lets SCAIL-2 handle difficult tasks such as animal-driven motion and first-person perspectives in a zero-shot setting, breaking through the ceiling left by traditional skeleton-based models.

02 Zhipu's Foresight

To evaluate the strategic value of SCAIL-2, we need to examine it in the horizontal and vertical coordinate system of China's AI industry.

From a horizontal perspective, Zhipu hopes to become an ecosystem builder beyond model packaging.

A widespread "shell-wrapping anxiety" currently pervades the domestic AI circle. Only a handful of enterprises are willing to put all their energy and funds into underlying innovation. Most AI products that actually reach practical use are little more than simple fine-tuning and UI packaging on top of open-source models.

However, Zhipu has demonstrated a completely independent underlying evolution path through SCAIL-2.

Whether in large language models or multimodal models, the gap between cutting-edge models is visibly narrowing. Instead of building commercial barriers through closed-door development, Zhipu has shrewdly chosen open source plus ComfyUI as its strategic entry point.

ComfyUI can currently be regarded as the workflow base camp for top AI creators and technology geeks worldwide. Connecting SCAIL-2 to ComfyUI is almost equivalent to embedding Zhipu deep into creators' productivity. Contributing a new model is only the surface. Zhipu's real goal is to define itself as the underlying protocol for digital asset circulation.

If the workflows of global creators start to run on Zhipu's protocol, an ecological barrier will naturally form. This is the same logic as NVIDIA's construction of the CUDA ecosystem back then: selling software is not as good as selling rules.

From a vertical perspective, Zhipu's advantage lies in its ability to deeply integrate academic sources with commercial implementation.

Unlike many purely market-driven companies, Zhipu is backed by the KEG Laboratory of Tsinghua University. Its founder, Tang Jie, is a computer science professor at Tsinghua University, and its core competitive strength is self-evident: technological continuity.

From the earliest GLM series of large language models to the current SCAIL-2 video model, Zhipu has kept its large-model infrastructure unified. This coherence, which technology enthusiasts appreciate, means that Zhipu has a rigorous and self-consistent mathematical foundation in multimodal understanding, temporal logic processing, and latent space alignment.

With its deep academic accumulation, Zhipu AI holds an overwhelming, asymmetric advantage when dealing with complex cross-modal data streams. The commercial picture supports this: although Zhipu has gone through a series of controversies such as subscription package changes and price increases, the GLM series remains one of the top choices for many users who rely on domestic AI models.

03 The Last Piece of the Puzzle for Video Model Commercialization

The arrival of true AGI is still far away. Against this background, many people believe that video generation has not yet transformed from a toy into a productivity tool. However, Zhipu's commercial ambition is obviously greater than that. Next, we will try to analyze its commercial logic from three dimensions:

Firstly, the digitization of motion assets and the reconstruction of the production pipeline.

In the traditional special-effects industry, character animation is essentially a black hole of high investment and long lead times. From rigging and motion capture to rendering, producing one high-quality animated character can take anywhere from a few weeks to several months. SCAIL-2 separates motion from the skeleton and turns it into a reusable visual vector.

In essence, this turns performance ability into an asset. In the future, transferring motion to a virtual character may become as simple as copying and pasting. Zhipu is not only lowering the production threshold but also staking a claim on the future production method of digital content.

Secondly, building a moat for the data factory.

This is not only Zhipu's goal but also the goal of all AI enterprises. When AI moves from the dialog box on the web page to each user's computer and then gradually into the real world (i.e., large language model → intelligent agent → embodied intelligence), the most scarce resource in this process is not computing power but data.

SCAIL-2 is powerful because it has both the algorithms and the MotionPair-60K dataset. More importantly, Zhipu has established a high-quality data pipeline that automatically synthesizes, verifies, and screens data through an intelligent agent loop. This internal loop of "producing AI data by AI" lets Zhipu break free from the quality bottleneck of external Internet data. As the number of training rounds grows from linear to exponential, Zhipu's data factory will produce more and more accurate visual models, and the first-mover advantage will turn into an insurmountable gap.
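The synthesize, verify, and screen loop described above can be pictured with a small sketch. Everything here is hypothetical: the function `build_dataset`, the stub generator, the even-id verification rule, and the quality threshold are invented for illustration and say nothing about how Zhipu's actual pipeline or MotionPair-60K was built.

```python
from typing import Callable, List, Tuple

# Hypothetical record: (driving clip id, target clip id).
Pair = Tuple[str, str]


def build_dataset(
    synthesize: Callable[[], Pair],
    verify: Callable[[Pair], bool],
    score: Callable[[Pair], float],
    threshold: float,
    rounds: int,
) -> List[Pair]:
    """Run a synthesize -> verify -> screen loop and keep the pairs that pass."""
    kept: List[Pair] = []
    for _ in range(rounds):
        pair = synthesize()           # an agent proposes a candidate pair
        if not verify(pair):          # hard check: reject invalid pairs outright
            continue
        if score(pair) >= threshold:  # soft check: keep only high-quality pairs
            kept.append(pair)
    return kept


# Deterministic stubs standing in for real generators and checkers.
_ids = iter(range(10))


def synth() -> Pair:
    i = next(_ids)
    return (f"drive_{i}", f"target_{i}")


def pair_id(pair: Pair) -> int:
    return int(pair[0].split("_")[1])


kept = build_dataset(
    synthesize=synth,
    verify=lambda p: pair_id(p) % 2 == 0,  # assumption: even ids count as valid
    score=lambda p: pair_id(p) / 10,       # assumption: score rises with the id
    threshold=0.35,
    rounds=10,
)
print(kept)
```

Separating the hard check from the scored screen is a design choice in this sketch: it lets the threshold be tuned without touching the validity rule.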

Thirdly, the commercial migration from tool to infrastructure.

Once Zhipu fully decouples characters, backgrounds, and actions through SCAIL-2, we have reason to speculate boldly that its monetization model may be completely transformed: from one-time API call fees and monthly subscription fees to "production protocol" fees.

Game makers, live-streaming platforms, and film and television companies alike may need to purchase Zhipu's visual middleware protocol whenever digital virtual humans have to interact. In other words, every gold-digger would buy Zhipu's shovels.

04 The Computing Power Philosophy Behind the End-to-End Architecture

The algorithm has been open-sourced, and the data has been put into production. Naturally, the next question is computing power.

Breaking the monopoly of advanced foreign computing power overnight is clearly a pipe dream. SCAIL-2 achieves its end-to-end design through the familiar method of domestic AI: optimizing the allocation of computing power at a higher level.

In the traditional method, the inference stage requires multiple passes such as skeleton extraction, reprojection, and mask generation, and the resulting computing power bottleneck is self-evident.

Zhipu's end-to-end solution, by contrast, follows the idea that "the greatest truths are the simplest". It folds all of these complex tasks into a single Transformer architecture. This not only significantly reduces inference latency but also reduces the loss incurred when information is converted between intermediate representations. From an engineering perspective, for the same computing power consumption, SCAIL-2 can deliver a much higher information density than traditional models.

Zhipu has given the industry a new solution and has also made public a deep-seated commercial truth that all domestic AI enterprises must admit: holding the right to allocate computing power optimally is roughly equivalent to holding pricing power in the market. Architecture optimization helps customers save video memory and computing time. The commercial stickiness this cost reduction brings is far more stable than advertising and marketing.

05 Control is Sovereignty

Finally, SCAIL-2 is not without weaknesses. Zhipu identifies the model's strict dependence on large-scale, high-quality paired data as its biggest pain point. Although preference alignment has been introduced and largely mitigates collapse in fine areas such as hands and faces, this still reflects a major problem that generative AI commonly faces: fine-grained control remains limited.

However, this is also where Zhipu's foresight lies: By straightforwardly admitting the current AI's deficiency in understanding physical laws and injecting human cognitive feedback into the model through preference alignment, it is actually accelerating the process of AI socialization and engineering.

From the perspective of commercial game theory, Zhipu has started a war for the right to interpret the digital world. If AGI is an operating system that will one day be achievable, then the large language model is its logical center and the video model is its physical presentation layer. SCAIL-2 is the "driver" that holds control rights in this operating system.

In this era of intelligent agents where technology iterates on a weekly basis, Zhipu has not only demonstrated excellent engineering capabilities but also shown profound insights into the industrial paradigm. Zhipu is telling the entire industry: Simple parameter stacking has reached a dead end. Only by reconstructing the underlying interaction logic can we truly achieve the industrial production of AI.

While global attention is focused on whether certain giants can generate a one-hour video, Zhipu is thinking about how to make a character accurately complete the action of "picking up a water cup". This obsession with precise control is the rarest quality in the domestic AI industry and the most admirable highlight of this enterprise.

This article is from the WeChat official account "Silicon-based Starlight", author: Si Qi. It is published by 36Kr with permission.

The views in this article are solely those of the author; the 36Kr platform only provides information storage services.
