A Tsinghua Team Built a Distributed Predictive World Model, Secured Hundreds of Millions of Yuan in Series A Funding, and Deployed on 100,000 Terminal Devices | 36Kr Exclusive
Author | Huang Nan
Editor | Yuan Silai
Yingke has learned that Qianjue Technology, an embodied-intelligence world model company, recently closed a Series A round worth hundreds of millions of yuan. The round was led by Jingming Capital, with participation from Shandong New Kinetic Energy, Shandong Financial Capital, Yuanhe Hope, Xinneng Venture Capital, Nanchuang Investment, Inno Angel Fund, Shangshi Capital, Ren'ai Group, Xuansu Investment, and other institutions. The investor roster spans state-backed funds, industrial players, market-oriented funds, and family offices. Maple Pledge Capital served as the company's long-term private-placement financial advisor.
The proceeds will go mainly toward building out the architecture of the company's self-developed world model, iterating on its algorithms, and deploying it in real-world scenarios. The company also plans to expand its core R&D and project delivery teams and strengthen the capabilities that support commercialization.
Qianjue Technology was founded in June 2023, and its core team was incubated at the Brain-like Research Center of Tsinghua University. The company has long focused on developing and deploying large models for embodied-intelligence decision-making and planning, moving beyond the limitations of traditional device tasks so that robots can adapt to dynamic environments and operate fully autonomously.
World models are rapidly sweeping into embodied intelligence and becoming a key breakthrough point for bringing artificial general intelligence into the physical world. Yann LeCun, a pioneer of convolutional neural networks, put forward a core theory of world models. The AMI team he founded has continued to focus on technical directions such as modeling in abstract representation space and predicting the laws of the physical world, laying a theoretical foundation for the industry.
From causal reasoning to spatial intelligence, from physical simulation to generative prediction, research built on different technical paradigms and theoretical foundations is under way in parallel across the industry. The field has not yet converged and leaves ample room for imagination. All of its explorers are trying to answer the same question: how to make machines truly understand and predict changes in the physical world.
In the mainstream generative route, a typical approach is to predict the next frame through pixel-level reconstruction. However, Zhang Tianren, CTO of Qianjue Technology, told Yingke that this method has an easily overlooked problem: feature pollution.
"Image input from the real physical world carries an enormous amount of information, including a lot of task-irrelevant noise such as lighting, shadows, and texture," Zhang Tianren explained. In pursuit of pixel-level lossless reconstruction, the model is forced to bind useful features together with irrelevant information. As a result, its internal representation may no longer be "pure." "It can indeed extract generalizable features from real-world data, but those features are mixed with interference."
This pollution directly affects the model's ability to understand the physical world. The original purpose of a world model is to let the model learn to make predictions that conform to physical laws, rather than simply fit images. Once the features are polluted, the model struggles to extract real causal relationships and physical invariances, and its generalization ability is limited as a result.
"When people look at a picture, they don't spread their attention evenly across every pixel; they quickly lock onto the area relevant to the task," Zhang Tianren said. "A generative model, by contrast, is more likely to reproduce appearances than to understand the world."
Against this limitation of feature extraction in the generative route, the predictive world model offers a different approach. Its core logic is that a robot comes to understand the physical world not by restoring every frame of pixels, but by predicting the low-dimensional trajectory along which the physical state evolves.
Gao Haichuan, CEO of Qianjue Technology, used an example to explain the essential difference between the two: when people play a racket sport, they don't picture crisp frames in their minds; they swing the racket directly, relying on a low-dimensional prediction of the ball's trajectory. That prediction contains no pixel information, only the evolution of state under physical laws. "When humans play ball games in the physical world, it's impossible to imagine clear, complete pixel-level pictures. There isn't enough time, and that information is unstable," Gao Haichuan said.
The same logic applies to embodied intelligence. When performing a task, a robot does not need to imagine "what the future will look like"; it needs to predict "which state to move to next." The core output of a predictive model is not video frames but low-dimensional abstract features. These features can be decoded directly into motion trajectories or planning instructions, bypassing the computational burden and feature pollution that pixel reconstruction brings.
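The contrast can be made concrete with a toy sketch in Python. It is not Qianjue's model: the `BallState` type, the hand-written gravity update in `predict_next`, and the 640x480 RGB frame size are all assumptions chosen for illustration, and a real predictive world model would learn its latent state and dynamics from data rather than hard-code them. The sketch only shows that a state-space rollout carries a handful of numbers per step and yields a trajectory directly.

```python
from dataclasses import dataclass
from typing import List, Tuple

GRAVITY = -9.81  # m/s^2, acts on the vertical axis
STATE_DIM = 4    # x, y, vx, vy


@dataclass(frozen=True)
class BallState:
    """Hypothetical low-dimensional state: position and velocity only."""
    x: float
    y: float
    vx: float
    vy: float


def predict_next(state: BallState, dt: float) -> BallState:
    """Advance the state by one time step (semi-implicit Euler)."""
    vy = state.vy + GRAVITY * dt
    return BallState(
        x=state.x + state.vx * dt,
        y=state.y + vy * dt,
        vx=state.vx,
        vy=vy,
    )


def rollout(state: BallState, steps: int, dt: float) -> List[Tuple[float, float]]:
    """Predict a trajectory of (x, y) waypoints without rendering any pixels."""
    waypoints = []
    for _ in range(steps):
        state = predict_next(state, dt)
        waypoints.append((state.x, state.y))
    return waypoints


def values_per_frame(width: int, height: int, channels: int = 3) -> int:
    """How many numbers a pixel-level predictor must output for one frame."""
    return width * height * channels


# Illustrative initial condition: ball at 1 m height moving horizontally.
trajectory = rollout(BallState(x=0.0, y=1.0, vx=2.0, vy=0.0), steps=10, dt=0.01)
```

In this toy setup each predicted step is four numbers, whereas a single 640x480 RGB frame is 921,600 values.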
Building on the predictive route, Qianjue Technology has further proposed a distributed prediction architecture. It is organized in a way that resembles the connectivity of regions in the human brain: different regions have their own functions, with close collaboration inside each connected region and relative independence between regions.
Compared with the traditional approach of compressing and processing all information together, the distributed prediction architecture first routes information to different regions and then compresses and predicts within each region separately, which yields higher sample efficiency and faster inference. "For the same task, learning from scratch may take 1,000 'state-action' pairs; with a good representation, 100 are enough, which effectively reduces the demonstration data a robot needs to adapt to a new scenario," Zhang Tianren said.
With this distributed architecture, the model can learn how physical states evolve in the abstract representation space, rather than only the temporal correlations between pixels, and so better serve downstream planning and control. When robots face a new environment, they can more quickly understand "what causes what," which is particularly important for deployment in real-world scenarios.
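One way to picture "distribute first, then compress and predict per region" is a block-structured predictor. The sketch below is an assumption, not Qianjue's published architecture: it splits a feature vector into hypothetical regions, gives each region its own small linear predictor, and compares the weight count with a single dense predictor over the full vector. It also simplifies "relative independence" between regions to full independence, and a smaller weight count is only a rough proxy for sample efficiency and inference speed.

```python
from typing import List, Sequence


def matvec(weights: Sequence[Sequence[float]], vec: Sequence[float]) -> List[float]:
    """Multiply a small dense matrix by a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weights]


def predict_by_region(
    features: Sequence[float],
    region_dims: Sequence[int],
    region_weights: Sequence[Sequence[Sequence[float]]],
) -> List[float]:
    """Predict the next features region by region; regions do not mix."""
    if sum(region_dims) != len(features):
        raise ValueError("region sizes must cover the feature vector exactly")
    predicted: List[float] = []
    start = 0
    for dim, weights in zip(region_dims, region_weights):
        predicted.extend(matvec(weights, features[start:start + dim]))
        start += dim
    return predicted


def dense_weight_count(total_dim: int) -> int:
    """Weights in one linear predictor over the whole feature vector."""
    return total_dim * total_dim


def regional_weight_count(region_dims: Sequence[int]) -> int:
    """Weights when each region has its own square linear predictor."""
    return sum(dim * dim for dim in region_dims)


# Hypothetical example: a 2-dim region (identity map) and a 1-dim region (x2).
next_features = predict_by_region(
    features=[1.0, 2.0, 3.0],
    region_dims=[2, 1],
    region_weights=[[[1.0, 0.0], [0.0, 1.0]], [[2.0]]],
)
```

For a 64-dimensional feature vector split into four regions of 16, the regional predictors hold 1,024 weights in total against 4,096 for the dense one.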
A robot equipped with Qianjue's world model working in a restaurant (Image source/Enterprise)
On the application side, Qianjue Technology decouples the embodied "brain" from the "cerebellum." Its world model handles perception, prediction, and planning without being bound to any specific action space for execution. As long as the modality is shared, the model can use observed environmental changes as a unified source of training data. This means the same "brain" can be migrated quickly to different bodies. The decoupled design reduces migration cost and speeds up the closed-loop data flywheel in real-world scenarios.
Yingke has learned that Qianjue Technology's self-developed embodied brain has been adapted to multiple categories of hardware, including wheeled robots, quadrupeds, bipedal humanoids, drones, and cleaning robots, and has been deployed in real-world projects such as hotel cleaning, commercial services, and precision indoor operations. The number of connected terminal devices has now reached 100,000. Drawing on the real interaction data these terminals continuously generate, the world model will continue to be iterated and optimized.
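The decoupling can be sketched as an interface boundary. Everything below is hypothetical (the `Plan` type, the `Body` protocol, the two adapters, and the command strings are illustrative names, not Qianjue's API): the "brain" emits a body-agnostic plan, and each body translates that plan into its own action space.

```python
from dataclasses import dataclass
from typing import List, Protocol, Tuple

Waypoint = Tuple[float, float]


@dataclass(frozen=True)
class Plan:
    """Body-agnostic output of the 'brain': where to go, not how to move."""
    waypoints: List[Waypoint]


class Body(Protocol):
    """The 'cerebellum' side: turns a plan into body-specific commands."""

    def execute(self, plan: Plan) -> List[str]: ...


class WheeledBase:
    def execute(self, plan: Plan) -> List[str]:
        return [f"drive_to({x:.1f}, {y:.1f})" for x, y in plan.waypoints]


class Quadruped:
    def execute(self, plan: Plan) -> List[str]:
        return [f"walk_gait_to({x:.1f}, {y:.1f})" for x, y in plan.waypoints]


def brain_plan(start: Waypoint, goal: Waypoint) -> Plan:
    """Placeholder planner: a straight line through the midpoint."""
    mid = ((start[0] + goal[0]) / 2, (start[1] + goal[1]) / 2)
    return Plan(waypoints=[mid, goal])


# The same plan is handed to two different bodies without modification.
plan = brain_plan(start=(0.0, 0.0), goal=(2.0, 4.0))
bodies: List[Body] = [WheeledBase(), Quadruped()]
commands = [body.execute(plan) for body in bodies]
```

Under this split, supporting a new body means writing one more adapter; the planner and its training data stay untouched.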
A robot equipped with Qianjue's world model autonomously delivering in a coffee shop (Image source/Enterprise)
The following is an excerpt from an interview between Yingke and Gao Haichuan, CEO of Qianjue Technology, and Zhang Tianren, CTO of Qianjue Technology (slightly edited):
Yingke: In open-loop prediction, a world model's long-horizon reasoning error accumulates with the number of steps. How does Qianjue's predictive architecture address this? To what extent can the closed-loop feedback of embodied tasks suppress error amplification?
Zhang Tianren: This problem can be analyzed from several angles. First, the magnitude of the accumulated error depends on whether the application scenario has closed-loop feedback. A video generation model is completely open-loop: it predicts many future frames at once with no external information to correct it, so errors accumulate easily. Embodied intelligence is different because it has closed-loop feedback. We don't have the robot predict 1,000 steps at once and plan the entire task before executing. Instead, we first predict 50 steps and select an action to execute; after execution, the environment returns a new state as feedback, and we correct the subsequent predictions based on that feedback.
This "execute, observe, correct" cycle is the most essential difference between embodied tasks and video generation, and it effectively suppresses the amplification of errors.
Second, the memory module. Qianjue has experimented with building a memory system on some platforms, but it has not been directly integrated with the vision center. Because closed-loop feedback is already in place, explicit long-term memory is not needed for now in many scenarios.
Third, Qianjue's model supports multi-step prediction. One "step" predicted by the model does not necessarily correspond to a single low-level control instruction; it can correspond to a complete semantic action, for example 50 low-level steps. The fewer steps that must be predicted, the lower the probability and magnitude of accumulated error.
Overall, we believe the upper-bound challenge for a world model's capability lies in fully open-loop, ultra-long-horizon planning, for example a robot having to plan every detail of the next few hundred steps in one go before it starts to act. But that kind of usage is rare in real embodied tasks. A more natural and realistic approach is to "act while observing" and adjust whenever a problem is found.
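The predict, execute, observe, and correct loop Zhang describes resembles receding-horizon control. The toy below is a sketch under stated assumptions, not Qianjue's planner: a 1-D robot whose model believes each action moves it by the full commanded amount, while the hypothetical real dynamics deliver only 90% of it. The open-loop run commits to the whole plan up front; the closed-loop run predicts 50 steps ahead, executes only the first action, and replans from the observed state.

```python
from typing import List


def model_step(x: float, action: float) -> float:
    """What the (imperfect) world model believes will happen."""
    return x + action


def real_step(x: float, action: float) -> float:
    """Hypothetical real dynamics: only 90% of the commanded motion occurs."""
    return x + 0.9 * action


def plan(x: float, target: float, horizon: int) -> List[float]:
    """Roll the model forward and pick actions (clipped to [-1, 1]) toward target."""
    actions = []
    for _ in range(horizon):
        action = max(-1.0, min(1.0, target - x))
        actions.append(action)
        x = model_step(x, action)
    return actions


def run_open_loop(x: float, target: float, steps: int) -> float:
    """Plan every step once, then execute blindly with no correction."""
    for action in plan(x, target, steps):
        x = real_step(x, action)
    return x


def run_closed_loop(x: float, target: float, steps: int, horizon: int = 50) -> float:
    """Predict `horizon` steps, execute the first action, observe, replan."""
    for _ in range(steps):
        action = plan(x, target, horizon)[0]
        x = real_step(x, action)  # the new observation corrects the next plan
    return x


TARGET = 10.0
open_final = run_open_loop(0.0, TARGET, steps=12)
closed_final = run_closed_loop(0.0, TARGET, steps=12)
```

With the same 12-step budget, the open-loop run in this toy stops about 1.0 short of the target because the model's error is never corrected, while the closed-loop run ends about 0.01 away.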
Yingke: Qianjue has reached large-scale deployment of 100,000 units. During deployment, what in the customer feedback surprised you? How did it affect your product iteration?
Gao Haichuan: Qianjue currently has 100,000 machines running in real-world scenarios. Users treat the robots as real products, and the feedback they give is real as well. So there is no "real-to-real gap" between the models we train and the scenarios where they are deployed.
A robot equipped with Qianjue's world model autonomously cleaning a desktop (Image source/Enterprise)
Two points in the market feedback exceeded our expectations.
The first is sensitivity to response speed. Tolerance for latency varies greatly across scenarios. A 4-second response from a generative model is basically unusable in robot scenarios. Our predictive model infers quickly and can return results within 0.5 seconds, but some robots incur about 1 second of cloud transmission delay, and customers still report "lag." When we cut the latency by 0.5 seconds, the user experience improved qualitatively. This kind of millisecond-level latency optimization often translates into user satisfaction more directly than improvements in model capability do.
The second is the value of initiative. Most of the time, customers do not want robots to be merely passive tools that execute instructions; they expect robots to "be proactive," perceiving the environment and making decisions on their own rather than waiting for human instructions one at a time. In a hotel, for example, a robot that detects a stain on the floor and starts cleaning on its own strikes customers as more "intelligent" than one that acts only after receiving an instruction. This leap in experience from a "driven device" to an "intelligent agent member" is becoming a key dimension of product differentiation.