
Exclusive | Tsinghua-affiliated startup secures hundreds of millions of yuan in seed round: We don't want to be labeled as a "world model"

阿菜cabbage, 2026-07-01 21:21
A Physical AI company in the new era is neither a robot-hardware company nor a model company.

Text by | Zhou Xinyu

Edited by | Zhang Yuxin

"The Lychees from Chang'an" is a story that Li Yiming, a doctoral supervisor at Tsinghua University born in 1997, likes very much.

In the story, to transport fresh lychees that "change color within a day" from Lingnan to Chang'an, the junior official Li Shande has to solve a series of interlocking problems: preservation, post stations, routes, and supplies. Without this complete system, the fresh lychees could not move forward even an inch.

In Li Yiming's eyes, this Tang Dynasty story neatly parallels the current "world model" track:

The scenarios and problems that Physical AI solves are the "fresh lychees". To achieve "transportation", practitioners also need to build a complete system solution covering data collection, model research and development, and hardware deployment.

"The first principle of the world model is not which technical route to take, but which problems to solve in the end," he told "Intelligent Emergence". The so-called world model is just "a horse for transporting lychees", one technical route for solving problems. Without the cooperation of the other links, it is worthless.

However, at the beginning of 2026, when this former researcher at NVIDIA's Vision & Robotics returned to China as an assistant professor at the School of Artificial Intelligence of Tsinghua University, he saw the AI track gripped by intense FOMO over the "world model".

The world model is one of the most confusing concepts in 2026, with various factions and different opinions.

Lack of consensus and room for imagination have made the world model the track with the largest valuation bubble at present. Whether they build video models, 3D models, or embodied brains following the VLA (Vision-Language-Action) route, companies with any connection to simulation and physics all place themselves in the "world model" camp.

By contrast, Li Yiming believes that clarifying the definition of the world model matters less than defining a system that allows various robots to generalize across various scenarios.

Recently, Li Yiming's team proposed a Physical AI Infra driven by both data and physics. It includes two self-developed components:

Data Pipeline: Rapidly scale up the data collection volume from the industry average of hundreds of thousands of hours to millions or even tens of millions of hours.

Physics Engine: Achieve the Real-to-Sim-Real closed loop. That is, build a simulated world from real-world data, let robots conduct reinforcement learning inside that simulation, and finally execute tasks in the real world.

Although the world model is not a standalone component, it permeates every link of this infrastructure. For example, the system uses the "world model" as the pre-training target on the collected data; in the post-training phase, the "world model" becomes the simulation environment in which robots conduct reinforcement learning.

This infrastructure can train fine manipulation skills such as cutting, screwing, plugging, stirring, pressing, pinching, and threading, and can be deployed across different dexterous hands, robotic arms, and other hardware. It can also be adapted to diverse scenarios such as manufacturing, retail services, hotel operations, food preparation, and medical assistance.

This technical solution is also adopted by "Liquing Intelligence", which was established in April 2026. Backed by Li Yiming's team, this new player in the Physical AI field completed multiple rounds of financing within just two months of its establishment.

"Intelligent Emergence" exclusively learned that Liquing Intelligence's seed-round financing reached hundreds of millions of yuan. The investors include funds such as Shunwei Capital, Sequoia China, Hillhouse Ventures, Fengrui Capital, Xinglian Capital, Tsinghua Alumni Seed Fund, and SEE FUND, as well as industrial investors such as Zhiyuan Robotics, Lingxin Qiaoshou, and Century Golden Resources.

Scarcity is an important reason for the primary market to bet on Liquing.

On the one hand, there is the scarcity of talent with both hardware and software capabilities. Li Yiming's resume spans spatial perception, multimodal reasoning, autonomous driving, and embodied intelligence.

During his doctoral studies at New York University, he collaborated with Xie Saining (co-founder and chief scientist of AMI Labs) to publish research results on embodied visual reasoning. At the same time, he co-published several highlight papers in CVPR and NeurIPS with NVIDIA and won the NVIDIA Scholarship in 2024 (only 10 recipients globally).

△ Li Yiming. Image source: Provided by the interviewee

Most of the more than 50 members of the Liquing team are students from Tsinghua University, with an average age of 23. "Talent with both hardware and software capabilities is very scarce in China, so Tsinghua provides us with a good talent platform," Li Yiming told us.

On the other hand, there is the scarcity of Liquing's technical route. Li Yiming boldly chose a "heavy" route: developing the entire stack in-house, from data collection and model training to the physics engine.

This is quite rare in China. The huge upfront investment and the technical difficulty of spanning hardware and software have deterred many companies. But Li Yiming believes that only by connecting all the links can information flow freely between modules and the links be optimized jointly.

In Li Yiming's plan, by the end of this year the team will release a world model that can be applied across B-end (enterprise) scenarios. In 2028, Liquing will achieve large-scale implementation of its solutions. Ultimately, his goal is to deliver a hardware-software integrated solution that solves customers' problems across different robot bodies and scenarios.

Recently, "Intelligent Emergence" had a conversation with Li Yiming about his technical judgments and his views on the world model and Physical AI.

The following is a summary of Li Yiming's views by "Intelligent Emergence":

Physical AI companies are neither robot-hardware companies nor model companies

🤖 What we do is not just a world model, but a system.

We are not guided by technical routes, but by practical problems. The purpose of training the world model is not to train the model itself, but to solve some problems in Physical AI and optimize the success rate of tasks.

Therefore, we care less about what exactly the world model is and more about how to couple data, models, hardware, and Infra into a system that ultimately yields a world model that works in real scenarios.

Our goal is to build an ecosystem driven by both data and physics, with the "world model" permeating every link:

In the pre-training process, the "world model" is used as the self-supervised training target, and both state and action are modeled. In the post-training process, the "world model" is used as an interactive environment where robots can conduct reinforcement learning.
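The two roles described above can be illustrated with a toy sketch. Everything here is an assumption made for illustration, not Liquing's system: the one-dimensional linear dynamics, the helper names `fit_world_model`, `ModelEnv`, and `rollout_return`, and the simple search over controller gains that stands in for reinforcement learning. The toy model also predicts only the next state, whereas the text says both state and action are modeled.

```python
import random

def fit_world_model(transitions):
    """Least-squares fit of next_state ≈ alpha * state + beta * action."""
    sxx = sum(s * s for s, u, n in transitions)
    sxu = sum(s * u for s, u, n in transitions)
    suu = sum(u * u for s, u, n in transitions)
    sxn = sum(s * n for s, u, n in transitions)
    sun = sum(u * n for s, u, n in transitions)
    det = sxx * suu - sxu * sxu
    alpha = (sxn * suu - sxu * sun) / det
    beta = (sxx * sun - sxu * sxn) / det
    return alpha, beta

class ModelEnv:
    """Wraps the fitted model so a policy can practise inside it."""
    def __init__(self, alpha, beta, goal):
        self.alpha, self.beta, self.goal = alpha, beta, goal
        self.state = 0.0

    def reset(self, start=0.0):
        self.state = start
        return self.state

    def step(self, action):
        self.state = self.alpha * self.state + self.beta * action
        reward = -abs(self.goal - self.state)  # closer to the goal is better
        return self.state, reward

def rollout_return(env, gain, horizon=20):
    """Total reward of the policy action = gain * (goal - state)."""
    state = env.reset()
    total = 0.0
    for _ in range(horizon):
        state, reward = env.step(gain * (env.goal - state))
        total += reward
    return total

# Role 1 (pre-training target): learn to predict the next state from logged data.
# The logged data is synthetic, generated from hypothetical dynamics 0.9*s + 0.5*u.
rng = random.Random(0)
data = []
for _ in range(50):
    s, u = rng.uniform(-1, 1), rng.uniform(-1, 1)
    data.append((s, u, 0.9 * s + 0.5 * u))
alpha, beta = fit_world_model(data)

# Role 2 (post-training environment): improve a policy using model rollouts only.
env = ModelEnv(alpha, beta, goal=1.0)
candidate_gains = [0.0, 0.5, 1.0, 1.5, 2.0]
best_gain = max(candidate_gains, key=lambda g: rollout_return(env, g))
print(alpha, beta, best_gain)
```

The point of the sketch is only the division of labor: the same learned transition function is first the thing being trained, and then the environment in which a policy is improved without touching the real system.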

Liquing Intelligence is actually not just a "world model company". The whole team is working on a complete system including the data pipeline, world model, and physics engine. The so-called "model" is just one of the technical components.

🤖 The core feature of the new-generation Physical AI team is full-stack.

We build everything from data collection equipment to data pipelines, from differentiable physics engines to model training:

Self-developed equipment such as full-palm tactile gloves reduces the cost of a single set from the dollar level to the RMB level, enabling large-scale data collection of up to millions of hours.

The self-developed differentiable physics engine achieves the Real-to-Sim-Real closed loop, can model complex materials such as fluids, soft bodies, and elasto-plastic deformable objects, and serves as an efficient post-training platform for reinforcement learning.

Based on the data collected from a wide range of scenarios and the post-training physics engine, our self-developed world model operating system can quickly generalize to various scenarios and also transfer across embodiments.

🤖 New-era embodied AI companies should be neither robot-hardware companies nor model companies, but World Model as Service companies.

In the future, with the rapid accumulation of data, we can achieve rapid cross-embodiment generalization. Ultimately, what we deliver to customers is not a world model, but a hardware-software integrated system.

This system can automatically match the optimal hardware solution according to the implementation scenario and the customer's budget, and is ready to use right out of the box.

🤖 The talent profile for Physical AI is someone with both hardware and software capabilities.

Tsinghua provides a good talent platform. On average, our team members were born in 2003, and there are even freshmen born in 2007.

The talent profile for Physical AI is different from that for LLMs. We need talent with both hardware and software capabilities. Currently, such people are very scarce because our training system is still maturing.

So we will train good candidates ourselves after finding them. Students in a good team can make great progress in about half a year to a year.

Don't just focus on data collection and ignore physical laws

🤖 The parameters of an embodied model need to reach at least the same level as those of a language model, or even several orders of magnitude higher, before we can talk about "intelligent emergence".

Language is a compressed set of world rules. Now, language models need hundreds of billions of parameters. Embodied models trained based on natural signals require more data and parameters.

🤖 Human-collected data is easier to scale up than robot-collected data.

There are hundreds of millions of people working on the front line and living in families across China. Compared with collecting data by operating robots, real people with equipment can collect data much more efficiently. After all, it's easier to scale up the number of people than the number of machines or the data collection duration.

Currently, we have found partners in scalable scenarios such as factories, hotels, property management, shopping malls, and kitchens. We will quickly accumulate millions of hours of data in a short period.

🤖 It's unrealistic to build a complete Physical AI Infra just by data collection. We also need a lot of physical laws.

At present, the amount of data collected is not enough for Physical AI to generalize autonomously to all scenarios. And the real world contains countless scenarios: even two apples look different. It's impossible to collect data from all of them.

Physical laws can make up for the current limitations of data. Physical laws such as Newton's laws and the Navier-Stokes equations (the equations of motion for viscous Newtonian fluids) are humanity's summaries of the rules of the physical world and have a certain degree of universality.

🤖 Liquing Intelligence has designed a world model solution that satisfies physical constraints. It can train a policy model with 1% of the real-robot data used by others and achieve the same success rate.

We first collect a small amount of data from real robots. Then we align the state transitions in the real-robot data (the changes in world state caused by actions) with those of the physical world model and back-propagate the loss to continuously optimize the world model.

The advantage of this approach is that we only need a small amount of real data to "calibrate" the state transition modeled by the world model, and then the robot can learn autonomously in the virtual world.

For example, in the past, a robot needed to cut hundreds or thousands of apples to learn how to cut. Now, it only needs to cut ten times in reality, and the rest of the practice can be done in the physical world model.
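The calibration idea above can be sketched in a few lines. This is a minimal illustration under loud assumptions, not Liquing's engine: the "physics" is a hypothetical damped point mass with a single unknown damping coefficient, the ten "real" transitions are synthetic, and the gradient is written out by hand instead of coming from a differentiable physics engine.

```python
def simulate_step(v, force, damping, dt=0.1):
    """One explicit-Euler step of a unit mass: dv/dt = force - damping * v."""
    return v + (force - damping * v) * dt

def calibrate_damping(real_transitions, damping=0.0, lr=5.0, steps=200, dt=0.1):
    """Gradient descent on the mean squared error between simulated and
    observed next velocities, with respect to the damping coefficient."""
    for _ in range(steps):
        grad = 0.0
        for v, force, v_next_real in real_transitions:
            error = simulate_step(v, force, damping, dt) - v_next_real
            # d(simulated v_next)/d(damping) = -v * dt
            grad += 2.0 * error * (-v * dt)
        damping -= lr * grad / len(real_transitions)
    return damping

# Ten synthetic "real-robot" transitions, standing in for a handful of real trials.
# TRUE_DAMPING is unknown to the simulator before calibration.
TRUE_DAMPING = 0.8
velocities = [0.5 + 0.25 * i for i in range(10)]
real_transitions = [(v, 1.0, simulate_step(v, 1.0, TRUE_DAMPING)) for v in velocities]

calibrated = calibrate_damping(real_transitions)
print(calibrated)
```

Once the simulator's transition matches the observed transitions, further practice can happen in simulation, which is the role the text assigns to the calibrated world model. A real system would have many parameters, noisy data, and automatic differentiation through the engine.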

VLA, video models, and JEPA are not "native world models"

🤖 The world model is responsible for the interaction between machines and the world, while the language model is responsible for the interaction between machines and humans.

Now, people have realized that building VLM (Vision-Language Model) and VLA (Vision-Language-Action Model) based on LLM is not really suitable for the physical world.

This is because the language model operates in a highly discretized space. Simply put, language is a set of grammar rules we have summarized from interacting with the world. However, different countries have different languages, and language is full of human biases about the world. Moreover, many things cannot be clearly explained in language.

In essence, the purpose of language is communication, which is the interface for human-machine interaction, not a modality. A modality is your observation of the world, while language is your summary after receiving signals. Therefore, when training a world model, language is not the center but an auxiliary.

🤖 Training the world model requires both SFT (Supervised Fine-Tuning) and RL (Reinforcement Learning).

The world model needs to conduct SFT in the physical world, but the amount of physical data is insufficient. So we need to collect data ourselves and establish data standards.

An LLM can generate arbitrary tokens during post-training, but the world model must follow physical laws. So we developed a differentiable physics engine in-house to allow post-training to be carried out under physical constraints.

Therefore, training the world model is a systems problem. It requires combining pre-training and post-training with data Infra and hardware Infra to maximize training efficiency.

🤖 Only a model that fully integrates perception, reasoning, decision-making, and action output and is designed for the interaction tasks between machines and the world is a "native world model".

VLA is a non-native world model because its representation is a discrete language space, not the real world. JEPA (Joint Embedding Predictive Architecture) can only predict states but cannot output actions.

Video generation models are also not native world models because the reasoning process is not native. The pixels they generate can only fit the appearance of the world, and it's difficult to ensure the geometric and physical consistency required for learning complex task strategies.

🤖 The key to training a "native world model" is how to efficiently tokenize the physical world.

How multimodal observations (vision, touch, and force) are compressed into token sequences that the model can digest and reason over directly determines what the model can and cannot understand. The quality of this representation is the ceiling for all subsequent capabilities.
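As a rough illustration of what "compressing observations into tokens" can mean, the sketch below uses plain uniform quantization with a separate token range per modality. This is a generic textbook-style scheme chosen for simplicity, not the tokenizer described in the interview; the value ranges, the bin count, and the names `MODALITIES`, `quantize`, and `tokenize` are all hypothetical.

```python
# Hypothetical value ranges per modality (low, high); purely illustrative.
MODALITIES = {
    "vision": (0.0, 1.0),     # e.g. normalized pixel intensity
    "touch": (0.0, 10.0),     # e.g. contact pressure
    "force": (-50.0, 50.0),   # e.g. wrist force
}
BINS = 16  # tokens per modality

def quantize(value, low, high, bins):
    """Map a continuous reading to a bin index in [0, bins - 1]."""
    clipped = min(max(value, low), high)
    index = int((clipped - low) / (high - low) * bins)
    return min(index, bins - 1)  # the upper edge falls into the last bin

def tokenize(observation):
    """Flatten a multimodal observation into one token sequence.
    Each modality gets its own block of BINS token ids."""
    tokens = []
    for block, (name, (low, high)) in enumerate(MODALITIES.items()):
        for value in observation[name]:
            tokens.append(block * BINS + quantize(value, low, high, BINS))
    return tokens

observation = {"vision": [0.0, 0.5], "touch": [10.0], "force": [-50.0, 0.0]}
print(tokenize(observation))  # [0, 8, 31, 32, 40]
```

Even in this toy form, the design choices (bin count, ranges, ordering of modalities) decide which differences in the physical signal survive into the token sequence, which is the sense in which the representation sets a ceiling on what a model can learn.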

We are one of the few companies globally that can tokenize at the representation level, that is, efficiently compress the physical world into tokens that machines can easily understand and learn from.

The barrier of this system lies not in technology but in cognition. It requires strong know-how and an understanding of how to build the entire ecosystem, for example how to clean data and how to optimize models. These issues carry strong cognitive barriers.

Currently, the visual tokenizer (used to translate the physical world into tokens) trained within our team has better performance than Meta's visual foundation model DINOv3. Efficient representation of the physical world will also be a key research direction for our team in the future.

🤖 Another challenge in training the world model is how to build the Physical AI Infra.

In addition to building a data platform, we also need to design a good physics engine Infra. For example, how to make the physics engine efficiently model the states of flexible objects and fluids to efficiently calculate state transitions. Only in this way can robots conduct reinforcement learning in the physics engine.

If a company's so-called "Infra" can only support a robot in performing some simple grasping tasks, it's not a real Physical AI Infra.

A real Physical AI Infra can continuously improve data efficiency, improve the results of pre-training and post-training on complex tasks, or allow generalization to and deployment on complex long-horizon tasks after training on short-horizon tasks.

2028 will be a milestone for the large-scale implementation of Physical AI

🤖 Wheeled robotic arms are the hardware implementation form suitable for most operation scenarios.

Humanoid robots have great imagination space, but the technical difficulty is also high. For example, the current payload capacity limits humanoid robots from performing tasks that require greater strength and complex operations. Accurately modeling various parts of