Exclusive: A startup with ties to Tsinghua University has closed a seed round of several hundred million yuan: We don't want to be labeled a „World Model“ company
Text | Zhou Xinyu
Editor | Zhang Yuxin
„The Lychees from Chang'an“ is a story that Li Yiming, a doctoral supervisor at Tsinghua University who was born in 1997, is very fond of.
In the story, the junior official Li Shande has to solve a chain of closely linked problems, such as preservation, post stations, routes, and supplies, in order to transport fresh lychees, whose color „changes within a day“, from Lingnan to Chang'an. Without this complete system, the fresh lychees cannot be moved.
In Li Yiming's view, this story set in the Tang Dynasty neatly parallels the current „World Model“ race:
The scenarios and problems that Physical AI must solve are the „fresh lychees“. To achieve the goal of „transport“, practitioners likewise have to build a whole system of solutions, including data collection, model development, and hardware provision.
„The first principle of the world model is not which technology is chosen, but which problem is ultimately solved,“ he told „Intelligent Genesis“. In his view, the so-called world model is just „a horse for transporting lychees“: a problem-solving technology that would be worthless without the other steps working alongside it.
When this former researcher at NVIDIA Vision & Robotics returned to China in early 2026 as an assistant professor at Tsinghua University's School of AI, he found the AI sector gripped by FOMO (fear of missing out) over the „World Model“.
The World Model is one of the most confusing concepts in 2026, with many different factions and opinions.
Due to the lack of consensus and imagination, the World Model is currently the sector with the largest valuation bubble. Whether it is video models, 3D models, or the VLA (Vision-Language-Action) approach to the embodied brain, everyone who touches simulation and physics counts themselves among the „World Model“ camp.
By contrast, Li Yiming believes that pinning down the definition of the World Model matters less than working out a system that lets different robots generalize across different scenarios.
Recently, Li Yiming's team proposed a Physical AI Infra driven by data and physics. It contains two self-developed components:
Data Pipeline: Data collection is scaled up rapidly, from the industry average of hundreds of thousands of hours to millions or even tens of millions of hours.
Physics Engine: A Real-to-Sim-Real loop is realized. A simulated world is built from real-world data, robots are trained in it with reinforcement learning, and the learned tasks are finally executed in the real world.
Although the World Model is not an independent component, it runs through every step of this infrastructure. For example, on top of the collected data, the „World Model“ serves as the pre-training objective; in the post-training step, the „World Model“ becomes the simulation environment in which robots learn through reinforcement learning.
This infrastructure supports the training of fine manipulation skills such as cutting, turning, inserting, stirring, pressing, grasping, and threading, and these skills can be transferred across different types of wrists and robot arms of different forms. It can also be adapted to scenarios such as manufacturing, retail and services, hotel operations, food preparation, and medical support.
This technology package is being carried forward by „Klaren Intelligence“, which was founded in April 2026. Backed by Li Yiming's team, this new player in the field of Physical AI completed several financing rounds within two months of its founding.
„Intelligent Genesis“ learned from exclusive sources that the amount of the seed round of Klaren Intelligence is several hundred million yuan. The investors include Shunwei Capital, Sequoia China, Hillhouse Capital, Fengrui Capital, Xinglian Capital, Tsinghua University Alumni Seed Fund, SEE FUND, and other funds, as well as industrial investors such as Zhiyuan Robot, Lingxin Qiaoshou, and Century Golden Resources Group.
Rarity is an important reason why the primary market invests in Klaren.
On the one hand, there is the rarity of people with both software and hardware capabilities. Li Yiming's career spans spatial perception, multimodal reasoning, autonomous driving, and embodied intelligence.
During his doctoral studies at New York University, he published research on embodied visual reasoning together with Xie Saining (co-founder and chief scientist of AMI Labs). He also published several papers at CVPR and NeurIPS together with NVIDIA and received the NVIDIA Scholarship 2024 (awarded to only 10 people worldwide).
△ Li Yiming. Image source: provided by the interviewee
Most of the more than 50 members of the Klaren team are students from Tsinghua University, and the average age is 23. „People with both software and hardware capabilities are very rare in China, so Tsinghua University gives us a good talent platform,“ Li Yiming said.
On the other hand, there is the rarity of Klaren's technology. Li Yiming has boldly chosen a „difficult“ path: everything from data collection to model training to the physics engine is self-developed.
This is quite rare in China. The high upfront investment and the difficulty of combining software and hardware have already deterred many companies. But Li Yiming believes that only when all steps are connected can information flow smoothly between steps and modules, and only then can the steps be optimized jointly.
Under Li Yiming's plan, the team will release a world model by the end of this year that can be used in different business-customer (B2B) scenarios. In 2028, Klaren aims to scale up its solutions. Ultimately, his goal is to offer customers an integrated software-hardware solution that solves problems across different robots and scenarios.
Recently, „Intelligent Genesis“ talked with Li Yiming about his technical judgments and his views on world models and Physical AI.
The following is a summary of Li Yiming's views, compiled by „Intelligent Genesis“:
Physical AI companies are neither robot manufacturers nor model companies
🤖 We don't just make world models, but a whole system.
We are oriented not toward technology but toward real-world problems. The goal of training world models is not the model itself, but to solve concrete Physical AI problems and raise the success rate of tasks.
Therefore, we are not interested in what exactly the world model is, but in how data, models, hardware, and infrastructure can be combined into a system to finally become a world model that works in scenarios.
Our goal is to create an ecosystem driven by data and physics. The „World Model“ runs through every step:
In pre-training, the „World Model“ serves as the self-supervised training objective, and state and action are modeled at the same time. In post-training, the „World Model“ becomes an interactive environment in which robots learn through reinforcement learning.
Klaren Intelligence is actually not just a „World Model company“. The whole team is developing a whole system that includes data pipelines, world models, and physics engines. The so-called „model“ is just one of its technological components.
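The two roles of the world model described above (training objective first, practice environment second) can be made concrete with a deliberately tiny sketch. This is not Klaren's method: it assumes a hypothetical scalar system whose next state is a linear function of state and action, and every name in it (`fit_world_model`, `rollout_cost`, `improve_policy`) is invented for illustration. The grid search over policy gains is only a crude stand-in for reinforcement learning.

```python
def fit_world_model(transitions):
    """Least-squares fit of next_state ≈ a*state + b*action.

    The training signal is the observed next state itself, so no human
    labels are needed (a toy form of self-supervised pre-training).
    """
    ss = sum(s * s for s, u, y in transitions)
    uu = sum(u * u for s, u, y in transitions)
    su = sum(s * u for s, u, y in transitions)
    sy = sum(s * y for s, u, y in transitions)
    uy = sum(u * y for s, u, y in transitions)
    det = ss * uu - su * su  # non-zero when states and actions vary independently
    a = (sy * uu - uy * su) / det
    b = (uy * ss - sy * su) / det
    return a, b


def rollout_cost(model, gain, start=1.0, steps=20):
    """Simulate the policy action = -gain * state inside the learned model."""
    a, b = model
    state, cost = start, 0.0
    for _ in range(steps):
        action = -gain * state
        state = a * state + b * action  # the model plays the environment
        cost += state * state           # penalize distance from the target 0
    return cost


def improve_policy(model, candidate_gains):
    """Pick the gain with the lowest simulated cost (crude stand-in for RL)."""
    return min(candidate_gains, key=lambda g: rollout_cost(model, g))


# Hypothetical logged data from an unknown system: next = 1.1*s + 0.5*u.
logged_inputs = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (2.0, -1.0), (-1.0, 0.5)]
logged = [(s, u, 1.1 * s + 0.5 * u) for s, u in logged_inputs]

model = fit_world_model(logged)                                   # "pre-training"
best_gain = improve_policy(model, [i / 10 for i in range(41)])    # "post-training"
```

In this sketch the policy is improved without touching the real system again: once the model has been fitted, every candidate is scored purely on simulated rollouts.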
🤖 The core feature of a new Physical AI team is full-stack development.
From data collection devices to data pipelines, from differentiable physics engines to model training, we build everything ourselves:
Self-developed devices such as full-hand tactile sensor gloves cut the cost per set from the dollar level to the yuan level, which lets us scale up data collection to millions of hours.
The self-developed differentiable physics engine realizes the Real-to-Sim-Real loop, can model complex materials such as liquids, soft bodies, and elastoplastic deformable objects, and serves as an efficient platform for reinforcement-learning post-training.
Based on the data collected in different scenarios and the physics engine used for post-training, our self-developed world-model operating system can be quickly transferred to different scenarios and also supports cross-embodiment (transfer across different robots).
🤖 A new embodied-intelligence company should be neither a robot manufacturer nor a model company, but a company that offers world models as a service (World Model as a Service).
In the future, as data accumulates rapidly, we can achieve rapid generalization across different robots. Ultimately, what we provide to customers is not a world model but an integrated software-hardware system.
This system can automatically select the optimal hardware solution based on the implementation scenario and the customer's budget and is ready for immediate use.
🤖 The talent profile for Physical AI requires both software and hardware capabilities.
Tsinghua University provides a good talent platform. The average birth year in our team is 2003, and there are even freshmen born in 2007.
The talent profile in Physical AI is different from that in LLMs. We need people who are strong in both software and hardware. Currently, such people are very rare because our education system is still maturing.
Therefore, we train good talents ourselves. Today's students can make great progress in about six months to a year in a good team.
One cannot only engage in data collection and ignore the physical laws
🤖 The parameter count of an embodied model must at least match that of a language model, or even exceed it by several orders of magnitude, before one can talk about the „emergence of intelligence“.
Language is a compressed form of the world's rules. Even language models already require hundreds of billions of parameters. An embodied model trained on natural signals requires even more data and parameters.
🤖 Human data is easier to scale than data from real machines.
There are hundreds of millions of people in China who work on the front line or live at home. Compared with data collection through robot control, data collection by humans with devices is much more efficient. It is much easier to scale the number of people than the number of machines or the collection duration.
Currently, we have already established cooperation with scenario partners in factories, hotels, real-estate companies, shopping malls, and kitchens to collect millions of hours of data in a short time.
🤖 It is unrealistic to create a whole Physical AI Infra only through data collection. Many physical laws are also needed.
The data collected so far is not enough for Physical AI to transfer autonomously to all scenarios. Moreover, the real world contains so many different scenarios that even two apples can look different, so collecting data from every scenario is impossible.
Physical laws can currently compensate for the limitations of data. Laws such as Newton's laws and the Navier-Stokes equations (the equations of motion for viscous Newtonian fluids) are humanity's summaries of the rules of the physical world and hold with a certain universality.
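For reference, the two laws named above can be written in their standard textbook forms. The incompressible form of the Navier-Stokes equations is shown as one common special case; the article does not say which formulation Klaren's engine uses. Here $m$ is mass, $\mathbf{a}$ acceleration, $\rho$ density, $\mathbf{u}$ the velocity field, $p$ pressure, $\mu$ dynamic viscosity, and $\mathbf{f}$ the body force per unit volume.

```latex
% Newton's second law for a body of constant mass
\mathbf{F} = m\,\mathbf{a}

% Incompressible Navier-Stokes equations for a Newtonian fluid
\rho \left( \frac{\partial \mathbf{u}}{\partial t}
          + (\mathbf{u} \cdot \nabla)\,\mathbf{u} \right)
  = -\nabla p + \mu \nabla^{2} \mathbf{u} + \mathbf{f},
\qquad
\nabla \cdot \mathbf{u} = 0
```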
🤖 Klaren Intelligence has developed a world-model approach that takes physical constraints into account. Using only 1% of the real-machine data that others need, it can train a policy model that achieves the same success rate.
We first collect a small amount of data from real machines. Then we align the state transitions (the change of the world state caused by actions) of the real machines with those of the physical world model and backpropagate the loss (a measure of the model's error) to continuously optimize the world model.
The advantage of this is that we only need a small amount of real data to „calibrate“ the state transitions of the world model so that robots can learn autonomously in the virtual world.
For example, a robot used to have to cut hundreds of apples to learn cutting. Now it only has to cut ten times in reality, and the rest of the exercises can be done in the physical world model.
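The calibration idea described above can be illustrated with a minimal sketch. It is not Klaren's engine: it assumes a hypothetical unit mass whose only unknown property is a damping coefficient, and all names (`step`, `calibrate`, `practice`, `TRUE_DAMPING`) are invented for illustration. Ten "real" transitions are used to fit the simulator's parameter by gradient descent on the state-transition error. The gradient is derived by hand here, whereas a differentiable physics engine would obtain it automatically.

```python
DT = 0.1  # simulation time step in seconds (assumed)


def step(velocity, force, damping):
    """Toy simulator for a unit mass: v_next = v + DT * (force - damping * v)."""
    return velocity + DT * (force - damping * velocity)


def calibrate(transitions, damping=0.0, lr=20.0, epochs=200):
    """Fit `damping` so simulated transitions match the observed ones.

    transitions: list of (velocity, force, next_velocity) from the real system.
    Minimizes the mean squared transition error by gradient descent.
    The learning rate is large because the gradient scales with DT**2.
    """
    n = len(transitions)
    for _ in range(epochs):
        grad = 0.0
        for v, f, v_next in transitions:
            error = step(v, f, damping) - v_next
            grad += 2.0 * error * (-DT * v) / n  # d(step)/d(damping) = -DT * v
        damping -= lr * grad
    return damping


def practice(damping, forces, velocity=0.0):
    """Roll the calibrated simulator forward; no real hardware involved."""
    trajectory = [velocity]
    for f in forces:
        velocity = step(velocity, f, damping)
        trajectory.append(velocity)
    return trajectory


TRUE_DAMPING = 0.8  # hidden property of the "real" system (assumed)
velocities = [-2.0, -1.5, -1.0, -0.5, 0.5, 1.0, 1.5, 2.0, 0.8, -0.8]
real_data = [(v, 0.1 * i, step(v, 0.1 * i, TRUE_DAMPING))
             for i, v in enumerate(velocities)]

estimated = calibrate(real_data)              # ten real transitions
trajectory = practice(estimated, [1.0] * 50)  # unlimited virtual practice
```

The design point mirrors the apple example: the expensive real interactions are spent only on making the simulator's transitions agree with reality, and the bulk of the practice then happens in simulation.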
VLA, Video Models, and JEPA are not the „original world models“
🤖 The world model is responsible for the interaction between the machine and the world, while the language model is responsible for the interaction between the machine and humans.
Everyone is now aware that VLMs (Vision-Language Models) and VLAs (Vision-Language-Action models) built on LLMs are fundamentally not well matched to the physical world.
Because the language model is a highly discretized...