Tsinghua-affiliated forces set their sights on world models

白鲸实验室 (White Whale Lab) · 2026-06-24 10:17
Why is it always Tsinghua?

Many entrepreneurs compare the current world model craze to the moment ChatGPT first appeared. In 2023, the keyword for AI startups was "large models." Every major investment institution and tech giant talked about parameters in the tens of billions and the Scaling Law. By 2026, the keyword is shifting to a vaguer, grander, and even somewhat philosophically ambitious term: the world model.

However, the problem is that no one can fully explain it.

In the narrative blueprints of different companies, it can be video generation, robot control, environmental understanding for autonomous driving, or the long-term memory system of multimodal agents. It looks less like a clear-cut technical route and more like a concept whose boundaries keep expanding. Just recently, NVIDIA announced on its official website Halos, billed as the industry's first full-stack comprehensive safety system for robots, autonomous driving, and physical AI, and specifically designed to run on the IGX Thor humanoid robot hardware.

Interestingly, a group of Chinese AI startups at the forefront of the trend are all converging towards this term.

Shengshu Technology, the top player in first-generation AI video generation, has recast its video generation model as "world simulation in the time dimension." Mianbi Intelligence is discussing how to build a longer-range structured reasoning space in edge-side models. Momenta, which has worked in autonomous driving for many years, continues to strengthen its driving system's understanding of the closed-loop world. Zhipu, whose market value has just exceeded one trillion, has also announced that it will gradually expand the boundaries of its "environmental interaction" capabilities in addition to its agent capabilities.

These companies betting on the world model are in different sectors, but most of them have one thing in common: they come from the same academic and industrial network, the Tsinghua circle. This kind of concentration is not common in other industries. It doesn't rely on geographical clusters of cafes and incubators like the consumer Internet, nor does it depend on the physical radius of the supply chain like the semiconductor industry. Instead, it radiates outward from Tsinghua University, Wudaokou, and Zhichun Road.

The reason why the world model has become the new hot topic in the current AI industry is not only that the technology is maturing but also that the object of modeling, "language," is approaching its bottleneck.

Entrepreneurs who have worked on large language models for a long time have realized that language models can simulate descriptions of the world but do not truly understand the rules by which the world operates. At the same time, video generation models are running into problems with "temporal consistency," robot models cannot escape "physical failure," and autonomous driving systems must handle continuous feedback from the real world. The significance of the world model lies in trying to unify these scattered problems.

In the previous round, the Tsinghua circle all but defined large models. In this round, it is leading the way on the world model, which is considered an important path to general intelligence.

01

Why the Tsinghua circle?

The influence of the Tsinghua circle in the AI investment community sometimes takes on an air of mystery. At a time when the market value of Zhipu, founded by Tang Jie, the leader of Tsinghua-affiliated entrepreneurs, is soaring towards one trillion, people in the industry joke that when they hear that the core team of a project is from the EE (Department of Electronic Engineering) or CS (Department of Computer Science) of Tsinghua University, they subconsciously raise the "technical ceiling" by two notches even before looking at the business plan.

In fact, there is nothing mysterious about it. If you look at the trajectories of these entrepreneurs in their respective fields over the past decade, you will find that they are quite similar. They are never satisfied with just building a plug-in; they always try to rewrite the underlying logic of the operating system and explore the cutting edge.

To understand today's world model craze, we must go back to the starting point of the previous round, the large models.

The Tsinghua-affiliated entrepreneurs were the focus of attention in the previous round. Companies like Zhipu, Mianbi, Moonshot AI (Yuezhianmian), and Shengshu Technology all took on the task of catching up with OpenAI at different stages. Now, with the performance of its GLM series of models, Zhipu has become the new benchmark for the "pursuer of Anthropic" in China.

Starting from AMiner, Tang Jie's team essentially created a "knowledge system." It is not a model but a structured representation of the human academic world. Then, in the GLM stage, this system was transformed into a language model and quickly entered the large-scale competition after the release of GPT-3.

In Zhipu's decision-making logic, a recurring keyword is "reaching the top." At a crucial decision-making meeting in 2021, when the team was discussing whether to invest tens of millions of yuan to catch up with large models, the internal controversy centered on one question: whether this direction was worth "proving that China can also achieve world-class results." Tang Jie's stance was straightforward. If successful, it could at least prove one thing: China's large-model technology could stand in the world's first echelon.

At that time, GPT-3 had been out for more than a year, and no one knew whether domestic models could catch up. While making the bet, Tang Jie also bore the pressure of "possibly having no returns for five years." In the end, this pursuit of excellence also gave domestic models a place of their own.

At the Zhipu Open Day in 2024, Tang Jie clearly stated that they would build a "cognition-driven world model." Zhipu is trying to make the model able not only to chat but also to operate mobile apps autonomously, book hotels, and plan trips. The model needs a deep understanding of the "mobile interface" micro-world: which page a tap on an icon leads to, and which step the process should return to after a payment failure.
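One way to picture this "micro-world" is as a table of state transitions: given the current screen and an action, predict the next screen. The sketch below is purely illustrative and is not Zhipu's method; the screen names, action names, and the fallback after a failed payment are all hypothetical assumptions.

```python
from typing import Dict, List, Tuple

# Hypothetical screens and actions for a toy hotel-booking app (assumed, not from any real product).
TRANSITIONS: Dict[Tuple[str, str], str] = {
    ("home", "tap_hotel_icon"): "hotel_list",
    ("hotel_list", "tap_hotel"): "hotel_detail",
    ("hotel_detail", "tap_book"): "payment",
    ("payment", "pay_success"): "confirmation",
    ("payment", "pay_failure"): "hotel_detail",  # assumed step to return to after a failed payment
}


def predict_next_screen(screen: str, action: str) -> str:
    """Predict the screen reached after `action`; stay on the same screen if the pair is unknown."""
    return TRANSITIONS.get((screen, action), screen)


def rollout(start: str, actions: List[str]) -> List[str]:
    """Imagine a whole action sequence without touching a real app."""
    screens = [start]
    for action in actions:
        screens.append(predict_next_screen(screens[-1], action))
    return screens


if __name__ == "__main__":
    plan = ["tap_hotel_icon", "tap_hotel", "tap_book", "pay_failure", "tap_book", "pay_success"]
    print(" -> ".join(rollout("home", plan)))
```

A hand-written table like this only memorizes one version of an interface. A learned world model would be expected to infer such transitions on its own and to keep working when the layout changes.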

And this logic of "environmental interaction" is the starting point for Tang Jie and others to explore the world model.

If Tang Jie and others represent the exploratory drive and competitiveness of the technical school, Tang Jiayu, the CEO of Shengshu Technology, and Zhu Jun, its chief scientist, seem more to be using an AI startup to validate their theoretical research.

Zhu Jun and Tang Jiayu are a "Tsinghua master-apprentice pair" in the field of generative models. Zhu Jun is a representative of the Bayesian method and the generative model school. After founding Shengshu Technology, they didn't move far from Wudaokou. The company is only 2 kilometers away from Tsinghua University. Based on their years of research on diffusion models, they concluded that a model should output not just a single result but a probability distribution over results.

Driven by this technical background, in 2024 Shengshu Technology developed its own U-ViT architecture, trying to handle the spatial details and temporal continuity of visual generation under a unified framework and allowing the model to learn the spatio-temporal laws of the physical world.

Cao Xudong, the founder of Momenta, is closer to an engineering realist. In the 2016 AI wave, Cao Xudong didn't choose to become a supplier of perception modules, the quicker route to revenue. Instead, he wanted to build an autonomous driving brain, a far more complex piece of systems engineering than "face recognition." This choice of deep cooperation with automobile manufacturers also allowed Momenta to accumulate nearly ten years of real-world scenario data in end-to-end autonomous driving.

Autonomous driving requires an understanding of physical-level interactions, such as the friction between tires and the ground, as well as spatio-temporal deduction to predict the movement of pedestrians and vehicles. More importantly, it also needs cognitive reasoning to understand traffic police gestures and traffic lights. The world model is a natural next step for him.

If we look at the paths of these people together, we'll find that they have all chosen to "reinvent the wheel."

Tang Jie developed the GLM series in-house. Zhu Jun didn't use the ready-made framework of Stable Diffusion but developed U-ViT instead. In 2016, Cao Xudong of Momenta chose to do full-stack autonomous driving instead of selling perception modules to automobile manufacturers, which meant taking on every aspect of perception, decision-making, and control himself. Their academic training has accustomed them to problems without ready-made solutions, and they are more willing to accept long-term investments measured in years.

As the AI wave has advanced to the present, the short-term optimization of traditional business logic no longer applies. The correctness of a direction may take many years to verify.

For these entrepreneurs, when a key part of a system is stuck, the most natural choice is not to bypass it but to build it themselves. The habits formed during their long-term doctoral training make them more inclined to question the underlying problems and improve their core capabilities. This has almost become their muscle memory.

02

Three routes of the world model

These star entrepreneurs from Tsinghua University each approach a different aspect of the world model, starting from the system they know best.

For Zhipu and Mianbi Intelligence, the world model is a long-range, structured reasoning space.

Beyond the GLM system, Zhipu is gradually expanding its capabilities to the fields of agents and interactions. Mianbi Intelligence emphasizes long-context and reasoning abilities, hoping to give the model continuous modeling capabilities through a longer "memory window." However, this path cannot avoid one question: is language sufficient to express the structure of the world?

Tang Jie has said more than once that a model trained solely on large-scale data can learn statistical correlations in massive data but may not truly master the structure and causal relationships behind the knowledge. If video is a time slice and a robot is a spatial interaction, then the language model is more like a compressed expression of the world. In this framework, the world model is not simply about generating a world but about enabling the machine to establish an internal representation of the world's states, causal relationships, and evolutionary laws.

To some extent, this is also the reason why more and more Tsinghua-affiliated entrepreneurs are turning to the world model.

At the WAIC in July 2024, Tang Jie proposed: "The world model needs to have an understanding of physical laws and social common sense. This kind of understanding cannot be solved by more data. It requires the combination of knowledge engineering and deep learning." Since 2025, Zhipu has frequently mentioned the AutoGLM and Agent strategies, trying to find a feasible technical route.

For Shengshu Technology, the world model is more like a super video generation engine unfolding along the time axis.

When Tang Jiayu and Zhu Jun's team launched China's first long-duration video generation model to compete with Sora in 2024, they positioned it as "world simulation in the time dimension." They used a large amount of visual data to train the model's intuition about common-sense physics. For example, when you throw a ball, the model can predict that it will fall due to gravity. Even if the ground is not shown in the picture, it can imagine the parabolic trajectory.
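The ball example can be made concrete with a few lines of classical mechanics. The sketch below is not how Vidu or any video model works. It is only the ground-truth ballistic motion that such a model would have to reproduce implicitly from pixels. The function name, time step, and initial values are illustrative assumptions, and air resistance is ignored.

```python
from typing import List, Tuple

G = 9.81  # gravitational acceleration in m/s^2


def simulate_throw(
    x0: float, y0: float, vx: float, vy: float, dt: float = 0.01, steps: int = 100
) -> List[Tuple[float, float]]:
    """Step a thrown ball forward in time under gravity alone (no drag, no ground contact)."""
    trajectory = [(x0, y0)]
    x, y = x0, y0
    for _ in range(steps):
        x += vx * dt   # horizontal velocity is constant
        vy -= G * dt   # gravity reduces vertical velocity each step
        y += vy * dt   # update height with the new vertical velocity
        trajectory.append((x, y))
    return trajectory


if __name__ == "__main__":
    # Hypothetical throw: released 1.5 m above the origin, 2 m/s forward, 3 m/s upward.
    for x, y in simulate_throw(0.0, 1.5, 2.0, 3.0)[::20]:
        print(f"x={x:.2f} m, y={y:.2f} m")
```

The positions rise, peak, and then fall along a parabola even though no ground appears anywhere in the code, which is the behavior the article attributes to a model with good physical intuition.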

Shengshu Technology's U-ViT architecture combines the Transformer and diffusion models. Proponents of this route believe that once a video model can perfectly predict the next frame, it can become a highly realistic virtual world engine, which can in turn support the research and development of embodied intelligence and autonomous driving.

Momenta emphasizes that the world model can be achieved through the reconstruction of physical laws and real-time interactions.

Momenta is trying to continuously map and understand the real world in the digital space and let the system learn and iterate through a closed data loop, taking a path that integrates perception, decision-making, and self-evolution.

Cao Xudong said as early as 2016 that the ultimate challenge in autonomous driving is not seeing but understanding and predicting. For a player that has long accumulated data in driving scenarios, high-precision physical modeling is needed for vehicle dynamics, sensor simulation, road surface friction coefficients, and the impact of weather on visibility. Once these models are verified, they can be applied directly to real-world vehicles, and real-world road data can then be collected for further model training.

On June 23, Momenta passed the Hong Kong Stock Exchange's hearing and officially entered the IPO sprint stage, and it is expected to become the first stock in physical AI.

This means that Momenta has provided a more realistic answer for the market. Physical AI is not about having a perfect model first and then waiting for application scenarios. A better way is to continuously collect, train, verify, and apply the model in mass-production scenarios and then return to the real world for further evolution. At least in the field of autonomous driving, this flywheel has started to turn.

03

The Tsinghua-affiliated entrepreneurs and the world model are still on the way

It should be noted that the world model is still in its infancy, far from the grand vision of containing the entire physical and logical world.

This year, NVIDIA's release of NVIDIA Cosmos 3 marked the emergence of a relatively large and unified world model. It shows the ability to extend to general tasks, but its scale is still much smaller than that of the most advanced language models. And when DeepMind released Genie 2, the official blog was very cautious. They said: "Genie 2 is a research preview and has not been publicly released. It shows the possible directions for the future."

The concept of the world model is good, but most people can't feel its impact yet.

This is because, at this stage, the obstacles to making a model truly perceive and understand the world come from almost every direction.

Firstly, there is the illusion of physical authenticity. In video generation tools such as Shengshu Technology's Vidu, Runway, and Sora, "object penetration" often occurs when users ask the AI to generate a video. For example, when the prompt calls for a person drinking water, the cup may pass through the palm, or a chair may suddenly flow like a liquid. Such violations of physical laws show that current video generation models still cannot achieve strict physical simulation.

In addition, while the data flywheel of language models is already in motion, the world model still lacks data. When Wang Xingxing mass-produced robots at Unitree, he found that even after the robots had been trained into gymnastic champions in simulation, in the real world a slightly reflective floor tile or a loose shoelace could make them fall to the ground without warning.

This phenomenon is called the Sim-to-Real Gap.

The world model needs an endless amount of data to cover the long-tail problems in the real world. However, many physical details, such as the friction coefficient of materials, the deformation of soft objects, and the scattering of light, are almost impossible to model exhaustively. The common sense that "a cup will break when it falls on the ground," which humans take for granted, requires the model to understand multiple attributes such as material brittleness, gravitational acceleration, and ground hardness. If one link is missing, the reasoning will collapse. This is a problem that stumps all model manufacturers.

"From a macro perspective, the so-called world models currently do not have a fully unified technical stack and are still talking about different things. An important topic in the future is how to gather the data of all downstream tasks into the same model architecture and achieve real scale-up," Dr. Ma Xiaoteng, the chief scientist of Mind Lab, told us.

For both industry players and developers, the world model is mostly confined to laboratories and papers at present. The small-scale models that have been released have limited understanding and break down when dealing with even slightly complex physical interactions. For large-scale models, in an era when tokens have become a new currency, the inference cost is too high. There is still a long way to go before the world model can be "useful."

Interestingly, Tang Jie proposed early on that relying solely on the statistical correlations of language can easily turn the model into a high-level "parrot" that can generate answers smoothly but may not truly understand the world.

Over the past decade or more, Tang Jie has always tried to combine cognitive maps with neural networks, hoping to enable machines to establish a knowledge structure similar to that of humans. To some extent, this obsession can even be regarded as an early version of the world model idea.

Reality soon proved that language is not equal to the world. Even in the AutoGLM era, when an app updates its interface and the button positions change, the agent is at a loss and the model has to learn again, because it can remember the pages but cannot understand the underlying operating rules.

Currently, Zhipu regards the world model as the key breakthrough for tackling this tough problem.

These may be the characteristics of the Tsinghua-affiliated entrepreneurs. The world model is still on the way, and this path is destined to be long.