
Yingke Exclusive Interview | Luo Jianlan: The Real Scaling Law for Robots Happens in the Closed Loop of Real-World Deployment

Qiu Xiaofen · 2026-06-17 14:14
What no one in the world has cracked yet is the robot's brain. We should close the loop quickly instead of focusing only on the robot body.

Author | Qiu Xiaofen

Editor | Yuan Silai

Over the past six months, China's embodied intelligence sector has undergone a quiet shift in focus: the spotlight has gradually moved from the hardware "degrees-of-freedom race" to the deeper question of what determines the upper limit of robot intelligence.

Yet while the industry keeps debating whether robots can replicate the Scaling Law of large language models by brute-force data stacking, Luo Jianlan, an associate professor at Shanghai Chuangzhi College and chief scientist of Zhiyuan Robotics, offers a judgment that departs from the mainstream: embodied intelligence cannot simply copy the development path of large language models.

Luo Jianlan has a highly recognizable way of speaking. He switches quickly between Chinese and English technical terms, his logic is dense, and he rarely gives vague, compromise answers.

Rather than getting stuck in the single-point debate over "which matters more: data, model, or Infra", he prefers to go straight to the problem itself: the core issue for embodied intelligence today is not a breakthrough in any single link, but whether these links can form a closed loop in real-world deployment.

This judgment comes from experience spanning academic research and industrial deployment. He earned his Ph.D. at Berkeley under Sergey Levine, a founding figure in the field of embodied intelligence. After graduation, he worked as a research scientist at Google X and DeepMind. Fourteen months ago, he returned to China and joined Chuangzhi College and Zhiyuan Robotics.

In his view, a considerable share of the so-called "embodied foundation models" in the industry today are not the product of real pre-training; their training is closer to mid-training or fine-tuning.

The reason is practical: high-quality real-robot interaction data is still scarce. In particular, data that covers multiple scenarios, tasks, and robot embodiments, and that includes failures, error corrections, and long-tail interactions, is far from sufficient to support large-scale pre-training of the kind used for large language models.

As a result, while real-robot interaction data remains insufficient, many teams choose to layer high-quality teleoperation data on top of an existing open-source base model and perform alignment or fine-tuning on specific tasks.

This path can quickly improve performance on laboratory tasks in the short term, but it is not equivalent to real pre-training of an embodied foundation model. A better Loss curve on offline data only shows that the model fits the existing data better; whether it can transfer to new physical scenarios, handle long-tail disturbances, and recover from failures still has to be verified through real-world deployment.

(Author's note: Loss is a score of how far the model's predictions are from the correct answers, and the Loss curve plots this score as training progresses. A downward-sloping Loss curve usually indicates that the model fits the training data better; in robotics, however, it does not necessarily mean a higher deployment success rate in real scenarios.)

Therefore, Luo Jianlan believes that embodied intelligence cannot blindly copy the GPT-style Scaling Law.

Specifically, in large language models there is a relatively stable and predictable statistical relationship between pre-training Loss and model capabilities.

In robotics, however, a decrease in offline Loss does not necessarily correspond to a higher real-world deployment success rate. Robots face an open physical world involving contact, disturbances, long-tail scenarios, hardware differences, and task feedback. A model that has "remembered" the data cannot necessarily "control" reality.

The real breakthrough in embodied intelligence is therefore not just a matter of stacking parameters or data, but of closing the loop in deployment. Only when, as robot deployment scales up, the cost of adapting to new scenarios keeps falling and the data flowing back reliably improves the model's capabilities will the physical world have reached its "Scaling Law moment".

Within this framework, Luo Jianlan's core task since returning to China has been to build a scalable, evolving closed loop for embodied intelligence.

He condenses this year's work into three technical pillars:

First is SOP (Scalable Online Post-training). SOP addresses the infrastructure required for large-scale online post-training of robots, including low-latency data backflow, cloud computing, training scheduling, and model updates. Its value lies not only in being an algorithm module, but in verifying whether robot data can flow efficiently from the deployment site into the training loop.

Second is LWD (Learning While Deploying). It attempts to break the traditional separation between "training" and "deployment", so that a robot becomes a system that keeps evolving in real scenarios such as convenience stores and supermarkets, rather than a product whose capabilities are fixed at the factory. When a robot encounters an unfamiliar shelf layout, product placement, or operational disturbance, the system keeps accumulating data through real-world interaction and converts that experience into subsequent model improvements.

Finally, there is the τ0-WM world model recently released jointly by Shanghai Chuangzhi College and Zhiyuan Robotics.

τ0-WM does not treat video generation as the end goal; it uses video prediction as a means to learn physical dynamics and evaluate the consequences of actions. More specifically, it aims to be an action-conditioned physical reasoning engine: before a robot actually executes an action, the model first compares the likely future outcomes of different candidate actions, helping the system choose a more reliable one.

For example, faced with an egg on a table, an ordinary VLA may directly output a grasping action, whereas an action-conditioned world model can first compare the future consequences of several candidate trajectories and avoid choosing one that would sweep the egg off the table.
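The egg example can be sketched in a few lines of Python. This is a toy illustration only, not τ0-WM: `predict` is a hand-written stand-in for a learned action-conditioned world model, and `State`, `TABLE_EDGE`, `score`, `choose_trajectory`, and all numbers are assumptions made up for the example.

```python
from dataclasses import dataclass

TABLE_EDGE = 1.0  # hypothetical: an egg beyond this x position has fallen off


@dataclass(frozen=True)
class State:
    gripper_x: float  # gripper position along the table
    egg_x: float      # egg position along the table


def predict(state, trajectory):
    """Toy stand-in for a learned world model: roll a trajectory forward.

    Each action is a gripper displacement. The gripper is assumed to start
    to the left of the egg; if it overshoots the egg, the egg is pushed along.
    """
    gripper, egg = state.gripper_x, state.egg_x
    for dx in trajectory:
        gripper += dx
        if gripper > egg:
            egg = gripper
    return State(gripper, egg)


def score(start, end):
    """Higher is better: end near the egg without moving it or losing it."""
    if end.egg_x > TABLE_EDGE:
        return float("-inf")
    moved = abs(end.egg_x - start.egg_x)
    gap = abs(end.egg_x - end.gripper_x)
    return -(gap + 10.0 * moved)


def choose_trajectory(state, candidates):
    """Compare the imagined outcome of each candidate and return the best one."""
    return max(candidates, key=lambda t: score(state, predict(state, t)))


start = State(gripper_x=0.0, egg_x=0.6)
candidates = [
    [0.3, 0.3, 0.3],   # overshoots and pushes the egg
    [0.5, 0.5, 0.5],   # sweeps the egg off the table
    [0.3, 0.2, 0.05],  # stops just short of the egg
]
best = choose_trajectory(start, candidates)
```

No action is executed on a real robot here: every candidate is evaluated inside the model first, and only the one with the best predicted outcome would be sent to the hardware.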

In Luo Jianlan's view, what will decide the future of embodied intelligence is not hardware, and still less the standalone strength of data, models, or Infra. It is whether they can form a closed loop with one another. They are like the staves of a wooden barrel: if any key link falls short, the system's capabilities cannot be fully realized.

"Whoever is first to get the 'deployment-data-iteration' data flywheel running in semi-structured scenarios such as convenience stores, supermarkets, and warehouses will have a real chance at large-scale commercialization," he said.

And the critical window may be the next 12 to 18 months.

Yingke recently spoke with Luo Jianlan. The following is the interview transcript, lightly edited.

The threshold for real embodied pre-training is higher than expected

Yingke: Why do you think there are few teams in the domestic embodied intelligence industry that are truly engaged in foundation model training?

Luo Jianlan: By analogy with the development stages of large language models, I think very few teams in robotics are capable of pre-training an embodied foundation model. Most teams are doing fine-tuning or "mid-training".

Even much of that mid-training is not solid enough. Many so-called "robot foundation models" in the industry today are closer to task adaptation or mid-training on existing open-source bases, and have not really entered a pre-training stage driven by large-scale, heterogeneous, real-world interaction data.

There is even a half-joking saying in the industry: "In papers, PI (Physical Intelligence) has never won; in reality, PI has never lost."

What this saying reflects is a real problem: robot models cannot be evaluated on paper metrics alone. Ultimately, what counts is how they perform when deployed in the real world.

Looking back at the path of LLMs, the output of a pre-trained model is itself full of noise. It needs to be aligned with high-quality data through mid-training, and then have specific capabilities further activated through post-training.

Real pre-training of a robot foundation model should, like that of an LLM, absorb extremely broad data, even noisy data. In robotics, however, the data is not static text but real-world interactions, failures, error corrections, recoveries, and long-tail scenarios.

Yingke: What are the differences in data and architecture between pre-training, mid-training, and post-training?

Luo Jianlan: These are three stages of training, and the core differences lie in data and training algorithms.

Pre-training trains the model on extremely broad data, covering a little of every data type.

Mid-training uses high-quality robot teleoperation demonstration data to align the model with task requirements.

Post-training optimizes specific capabilities. For example, the reasoning ability of large language models often needs to be further activated and aligned through post-training, reinforcement learning, or high-quality task data.

Yingke: What challenges might domestic companies face in filling the gaps in pre-training and post-training?

Luo Jianlan: The core issues are data and deployment in real-world scenarios. The whole system, from data to Infra to the model, is interlocking, and no single part is absolutely more important than the others. This is the barrel effect.

I believe that real-world data must serve as the foundation. It is like reading the same book at different ages: a 3-year-old can't understand it, a 20-year-old can follow the plot, and a 40-year-old can see human nature in it.

The stronger the foundation model, the more efficiently it absorbs heterogeneous data and transfers to new tasks. But without real data as the foundation, relying only on simulation or video data, the model's upper limit will be restricted.

Yingke: Many companies are talking about the "GPT moment" of robots. How much data do you think is needed to truly achieve generalization?

Luo Jianlan: I am against blindly benchmarking against the GPT-style Scaling Law.

If we restrict ourselves to high-quality, real-world robot interaction data that can be used for closed-loop deployment, the industry's current data scale is still far from sufficient. Many so-called "million-level" or "ten-million-level" data claims use inconsistent definitions: some count videos, some trajectories, some simulation, some teleoperation, and some repeated collections of a single task. The industry has not yet converged on how to measure robot data.

The Scaling Law of large language models rests on a relatively stable and predictable statistical relationship between pre-training Loss and model capabilities. This law does not automatically hold in embodied intelligence.

A lower training Loss for a robot model only means that it fits static data better; it does not mean a higher deployment success rate in the physical world. Physical interaction is complex enough that a model that has "remembered" the data cannot necessarily "control" reality.

Therefore, the gold standard for embodied intelligence is not the data scale or Loss value, but the deployment efficiency in real scenarios. The real breakthrough point is when we observe that as the number of deployed robots increases, the adaptation cost of new scenarios continuously decreases, and the model iteration efficiency continuously improves. This is the critical point when the data flywheel starts to turn.
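One hedged way to make this criterion concrete is to record the adaptation cost of each new deployment and check whether it trends downward. The sketch below is illustrative only: the function names, the choice of a least-squares slope as the trend measure, and the sample numbers are all assumptions, not a metric described in the interview.

```python
def adaptation_trend(costs):
    """Least-squares slope of adaptation cost against deployment order.

    `costs[i]` is the (hypothetical) cost of adapting to the i-th new site,
    for example hours of data collection. A negative slope means each new
    site is getting cheaper to bring online.
    """
    n = len(costs)
    if n < 2:
        raise ValueError("need at least two deployments to estimate a trend")
    mean_x = (n - 1) / 2
    mean_y = sum(costs) / n
    cov = sum((i - mean_x) * (c - mean_y) for i, c in enumerate(costs))
    var = sum((i - mean_x) ** 2 for i in range(n))
    return cov / var


def flywheel_is_turning(costs, tolerance=0.0):
    """True if adaptation cost falls faster than `tolerance` per deployment."""
    return adaptation_trend(costs) < -tolerance


# Made-up numbers for illustration: cost per new site, in deployment order.
costs_per_new_site = [100.0, 80.0, 65.0, 50.0, 30.0]
slope = adaptation_trend(costs_per_new_site)
turning = flywheel_is_turning(costs_per_new_site)
```

A single slope is a deliberately crude summary. In practice one would also want to track deployment success rate and model iteration speed, which the passage above names as part of the same signal.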

Unfortunately, neither academia nor industry has yet been able to calculate precisely how much data this critical point corresponds to.

Robots need a closed loop

Yingke: You returned to China more than a year ago. What do you think is the biggest difference between the domestic and foreign embodied intelligence robot industries?

Luo Jianlan: A robot is a full-stack system that requires hardware, models, and intelligence, and that also needs to form a data closed loop through real-world deployment. We cannot wait for one technology to mature fully before starting on the next.

China's advantages lie in its industrial chain, supply chain, engineering capabilities, and talent density. What no one in the world has truly cracked yet is the robot "brain". We should combine these advantages to close the loop quickly and make full use of China's existing strengths in hardware, scenarios, and deployment, rather than competing only on the robot body.

Yingke: You have done a lot of work since returning to China, such as LWD, SOP, and the world model released some time ago. What are the functions of these research results? What are the main components of this complete closed - loop?

Luo Jianlan: Starting from the bottom, the lowest layer is a large number of robots deployed in real scenarios, that is, fleet learning. First, you need a robot "fleet" of sufficient scale.

The next layer is infrastructure, including real-time cloud computing, data backflow, communication, training acceleration, and inference acceleration: an integrated hardware, software, and cloud Infra. The SOP we released earlier is in effect a proof of concept for this Infra, showing that the pipeline can work.

Above that is the algorithm layer, which has two parts: pre-training and post-training. The LWD we released a few months ago addresses robot post-training and self-evolution. We will also continue to advance our own pre-trained foundation model.

The overall logic of our closed loop is that real-world deployment is not the end of training but the starting point for the continuous evolution of intelligence. It can form a positive flywheel: deploying more robots generates more data, which trains better models, which in turn allows more robots to be deployed.

Yingke: What is the ideal effect of the data flywheel?

Luo Jianlan: It is a positive cycle in which the system gets stronger the more it is deployed: a stronger model leads to more robots being deployed; more deployed robots lead to more data backflow; more data backflow trains a stronger model.

For example, in semi-structured scenarios such as convenience stores and supermarkets, a large amount of interaction data may need to be collected when deploying the first 20 stores. As the number of deployments grows, however, the cost of adapting to new scenarios drops significantly. Ideally, by the 100th store, the amount of data required to adapt to a new scenario becomes very small, or the system is close to working out of the box.

Yingke: What is the significance of getting this closed loop running?

Luo Jianlan: Although current hardware is not perfect, it is basically sufficient for building a closed loop on specific tasks and is not the core bottleneck. The real weakness is the data closed loop, that is, the ability to iterate continuously across the model, the data, and the entire pipeline.

Far-sighted CEOs around the world are now paying attention to embodied intelligence, and everyone is waiting for the "first signal". Once someone gets the commercial closed loop running in semi-open scenarios and proves that the data flywheel can turn, capital and industrial resources will quickly concentrate in this direction.

This is an opportunity for startups. Large companies are restricted by OKRs and existing moats and turn relatively slowly. The advantage of startups lies in speed. We don't need to disrupt all scenarios.

In the next 12 to 18 months, if a team can be the first to get the positive "deployment-data-iteration" cycle running in semi-structured scenarios such as convenience stores, supermarkets, and warehouses, it will establish a very strong first-mover advantage.

The world model is not about generating videos but predicting the consequences of actions

Yingke: The world model is very popular now. What's your understanding of it?

Luo Jianlan: This topic has come back every couple of years since 2017 and 2018. It used to be discussed mainly within technical circles. Now, with society paying so much attention to AI, the world model has become widely known as well.

As for the world model, what I care about most is the action-conditioned predictive model, which can be understood as a forward dynamics model. Given the current state and an action, it predicts the future state, reward, or other change in utility after executing that action. Its core purpose is to evaluate the impact of an action on the future state of the world without actually executing it.

For example, when boiling an egg in the morning, I will predict in my mind that it will take a long time to boil it on a low flame, so it's better to use a high flame. This process does not require me to actually execute each action first, but to judge the quality of the plan in my mind.
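The boiling-egg example maps directly onto a forward dynamics model: a function from (state, action) to (next state, reward) that can be rolled out without touching the stove. The sketch below is a toy under stated assumptions; the heating rates, `forward_model`, and `evaluate_plan` are invented for illustration, and a real world model would be learned from data.

```python
BOILING_C = 100.0
# Hypothetical heating rates in degrees Celsius per simulated minute.
HEATING_RATE = {"low": 5.0, "high": 20.0}


def forward_model(temp_c, flame):
    """Predict the next water temperature and a reward for one minute.

    The reward is -1 per minute, so plans that boil sooner score higher.
    """
    next_temp = min(BOILING_C, temp_c + HEATING_RATE[flame])
    reward = -1.0
    return next_temp, reward


def evaluate_plan(start_temp_c, flame, max_minutes=60):
    """Roll the model forward in imagination and sum the predicted reward."""
    temp, total_reward = start_temp_c, 0.0
    for _ in range(max_minutes):
        if temp >= BOILING_C:
            break
        temp, reward = forward_model(temp, flame)
        total_reward += reward
    return total_reward


# Compare the two plans mentally before turning on the stove.
best_flame = max(HEATING_RATE, key=lambda flame: evaluate_plan(20.0, flame))
```

With these made-up rates, the low flame is predicted to need 16 minutes and the high flame 4, so the planner picks the high flame, mirroring the mental comparison in the example.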

Yingke: Why is the technical route of the world model so inconsistent now?

Luo Jianlan: The biggest problem with the world model today is that its definition is too broad. Many people's understanding of a world model is actually closer to a video prediction model, that is, predicting how the image will change. But what robots really need is not just the future image; it is how actions will change the subsequent state of the world. With that, they can do planning and action evaluation.

If a model only generates future images but cannot be used to evaluate the impact of actions on the world state, its value for robot decision-making is very limited. For me, the more important thing is the