
VLA won't die, except for those that do not integrate with world models.

Intelligence Relativity, 2026-06-02 08:31
The silent war in the data factory determines who will have the last laugh.

Text | Intelligence Relativity

Author | Ye Yuanfeng

In May 2026, a not-so-funny made-up joke was circulating in embodied intelligence circles: during a demonstration, a VLA model was asked to "bring me the apple on the table". The robotic arm reached out gracefully and firmly grasped a mug. The room fell dead silent. The engineer broke out in a cold sweat and quickly typed on a tablet: "Redefine apple".

Over the past six months there have been many similarly embarrassing stories. The protagonists range from the most highly valued domestic unicorns to Figure AI and Physical Intelligence across the ocean, and none has been spared.

For the two years before that, the industry had been cheering for the VLA (Vision-Language-Action) technical route. When Covariant's RFM-1 first appeared, the media rushed to label it the "singularity of general-purpose robots". As soon as Google DeepMind's RT-2 paper was published, secondary-market analysts stayed up all night revising their reports, pulling the commercialization schedule of embodied intelligence forward by three years.

Now, no one mentions the "singularity" anymore.

What people care about now is whether this thing can actually drive a screw into a hole on a factory floor rather than stab the screwdriver into its own motor. The clumsy performance of VLA-based embodied intelligence has even led Jim Fan, the head of NVIDIA's robotics division, to declare outright that "VLA is dead".

However, it's too early to say so.

VLA won't die. The VLAs that try to build general-purpose robots from nothing but Internet images, text, and videos plus a small amount of robotic-arm tele-operation data really should be buried. But something else is emerging: VLA that integrates the "world model", a concept the industry has talked about for several years but never taken seriously. This may be the only viable path for embodied intelligence in the next three years.

The "Brain in a Vat" Living in the Internet

To understand why VLA keeps failing, we need to first figure out its genetic flaws.

The logic of today's mainstream VLA architectures, whether Google's RT-2 or what domestic companies like Stardust Intelligence are building, is essentially the same. First, use the massive image-text data on the Internet to align vision and language, so that the model can understand pictures and human language. Then, add robot action data for end-to-end fine-tuning, so that the model can output action commands.

The biggest allure of this approach is cost saving. It tries to reuse the infrastructure of large language models and vision-language models, turning robot learning into a "lightweight" fine-tuning task.

Investors like this story: There's no need to collect expensive physical world interaction data from scratch, just stand on the shoulders of Internet giants.

But here is the problem. Internet data has taught the model that "an apple is a red, round object", but it hasn't taught it that "an apple will deform and may roll away when a force of 10 newtons is applied to it".

The videos on the Internet are all edited segments that conform to human visual aesthetics, full of smooth transitions and large leaps in causal relationships.

When a cup falls off the edge of a table, the next shot usually shows it already shattered on the floor or firmly caught by a hand. The decisive moment, when the cup slips from the fingertips because the friction is insufficient and the tilt angle is too large, is lost forever.

The physics that VLA learns is a "pseudo-physics" built on surface associations. It knows that "falling" is often accompanied by "breaking", but it doesn't understand at what tilt angle the lid of a glass pot full of hot coffee will slide off because the center of gravity has become unstable. Google DeepMind's RT-2 paper also admits that the model's generalization ability drops sharply when it faces new combinations of objects or scenarios that require fine force control.

Furthermore, a paper from Physical Intelligence reveals an uncomfortable reality: even if you scale the model up tenfold and pour in more Internet images, its ability to predict physical interactions stays almost flat. In the dimension of physical interaction, the scaling law has hit a wall.

So, the current VLA demonstrations are like a well-rehearsed magic show.

You only ever see the robot smoothly grasping objects within a 0.5-square-meter area in the laboratory, using three to five fixed props, under strictly controlled lighting and background. Once you change the background slightly or put in a reflective or transparent object, the "brain in a vat" nature of the model is exposed.

It only knows the answer but doesn't know the process.

The World Model Is Not a Panacea, but It's the Only Antidote

The popularity of the term "world model" recently is a bit like that of the metaverse a few years ago. Everyone is talking about it, but it seems that no one has seen its true form. Yann LeCun in Meta's AI department talks about the world model all day long, believing that it is the key to true intelligence. Jensen Huang of NVIDIA also supports it at GTC.

In the context of embodied intelligence, the world model is highly anticipated, but in some hands it has almost become a word game. Some teams' approach is simple and crude: at the output end of the VLA, they bolt on a ready-made physics simulation engine to "correct" actions that violate physical common sense.

For example, if the model commands the arm to reach through the table to get something, the simulator raises a "collision warning" and stops the arm.

Is this called integrating the world model? This is just patching bad code.

The core of true integration lies in internalization.

A powerful world model should be the "subconscious" and "intuition module" of VLA, rather than an external safety supervisor.

Before the VLA commits to a decision, it should be able to simulate internally, and very quickly, how the physical scene will change over the next few seconds, and use that prediction to constrain and guide the generation of actions.
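
The idea of rolling the next few seconds forward internally before acting can be sketched in a few lines. This is only an illustrative toy, not any company's method: `toy_world_model` is a hypothetical stand-in for a learned dynamics model (here a 1-D point mass with friction), and the candidate actions stand in for proposals from a VLA policy.

```python
def toy_world_model(state, action, dt=0.1):
    """Hypothetical stand-in for a learned dynamics model.

    state is (position, velocity) of a 1-D point mass; action is a force.
    The 0.9 factor is an assumed friction term, chosen for illustration.
    """
    pos, vel = state
    vel = 0.9 * vel + action * dt
    pos = pos + vel * dt
    return (pos, vel)


def rollout_cost(state, actions, target):
    """Imagine the future by stepping the model, then score the end state."""
    for action in actions:
        state = toy_world_model(state, action)
    pos, vel = state
    # Penalize distance from the target and any leftover speed.
    return (pos - target) ** 2 + 0.1 * vel ** 2


def choose_action(state, target, candidates, horizon=10):
    """Pick the candidate whose imagined rollout ends closest to the goal.

    Each candidate is held constant over the horizon, which is a
    simplification; real planners search over action sequences.
    """
    return min(
        candidates,
        key=lambda a: rollout_cost(state, [a] * horizon, target),
    )


if __name__ == "__main__":
    best = choose_action((0.0, 0.0), target=1.0,
                         candidates=[-1.0, 0.0, 1.0, 2.0])
    print("chosen action:", best)
```

The point of the sketch is where the physics sits: the prediction runs before the action is chosen and shapes the choice, instead of vetoing a finished action afterwards.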

When I reach out to catch a thrown key, my brain doesn't first plan the precise trajectory of my fingers and then wait for visual feedback to correct it. There is an internalized model in my brain about "how the key will fly in a parabola, how much air resistance there is, and where it will land", which directly drives my muscle memory and makes me almost instinctively adjust my body posture.

The work of Fei-Fei Li's team on RoboAgent and some more recent attempts are moving in this direction. They do not only make the model learn "see the cup, output the grasping action"; they also force it, while learning actions, to predict the next frame's depth map, object segmentation map, and even the distribution of contact forces.

This is not just an expansion of input and output channels. It forces the model to break away from associations between two-dimensional pixels and to construct an internal, three-dimensional, causal representation of physics.
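
One common way to express this kind of joint training is a weighted sum of per-head losses. The sketch below is a minimal illustration under assumptions of my own: the head names, the weights, and the use of mean squared error for every head (a real segmentation head would more likely use cross-entropy) are all hypothetical and are not taken from any of the papers mentioned.

```python
def mse(pred, target):
    """Mean squared error over two equal-length lists of floats."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


# Assumed weights, for illustration only.
DEFAULT_WEIGHTS = {
    "action": 1.0,         # the primary objective
    "depth": 0.5,          # next-frame depth map
    "segmentation": 0.5,   # next-frame object masks
    "contact_force": 0.5,  # predicted contact forces
}


def combined_loss(pred, target, weights=None):
    """Weighted sum of the action loss and the auxiliary prediction losses.

    pred and target map each head name to a flat list of floats.
    Returns (total, per_head) so each term can be monitored separately.
    """
    weights = weights or DEFAULT_WEIGHTS
    per_head = {name: mse(pred[name], target[name]) for name in weights}
    total = sum(weights[name] * per_head[name] for name in weights)
    return total, per_head
```

Because the auxiliary heads share a backbone with the action head, the gradient from a wrong depth or contact-force prediction also reshapes the features the action head relies on, which is the sense in which the model is "forced" to represent physics.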

When the model can accurately predict that "if I push that bottle at this angle and speed, it will tilt to the right in the next 0.5 seconds", it can be said to truly "understand" the dynamic characteristics of the bottle. Only then will the grasping action not be as timid as it is now, either afraid to touch or pushing the object away with too much force.

The prospect is visible. Robot companies of all sizes have started such integration. VLA + world model, with various concept labels, will become the industry consensus.

What Jim Fan shouted, "Long live WAM", is essentially such a combination.

Before long, every serious embodied intelligence company will write in its technical white paper that "we have built an end-to-end world model", or describe some similar integration of VLA and the world model. The names will differ, and some may even still be called VLA models, but the essence will be the same.

The Silent War in the Data Factory Determines Who Will Laugh Last

Debating whether VLA is dead or whether the world model works is actually a bit off-target.

These problems of the superstructure ultimately boil down to the most basic and least glamorous thing: data.

The person in charge of data collection at a leading humanoid robot company privately told "Intelligence Relativity" that their biggest headache right now is not tuning algorithm parameters but keeping the tele-operation operators from dozing off.

To collect high-quality manipulation data, they invited retired senior engineers to put on gloves and repeat the motion of screwing in a part all day long. But older people's hands shake, and the tele-operation mapping of fine movements keeps going wrong. After a day's worth of data has been cleaned and aligned, less than 10% of it can actually be fed to the model.

And this is just one action. For VLA + world model to truly learn to make a cup of coffee, it needs to know how the kettle's weight changes, how the steam's temperature is distributed, how hard the water flow hits, and what the cup is made of. No Internet image-text database can provide such data.

This is an unprecedented war in the data factory.

Tesla's Optimus team is being watched so closely not only because of Elon Musk's star power but, more importantly, because it is migrating the "shadow mode" and data engine of its autonomous driving program to robots. Every success and failure of Optimus driving a screw in the factory will be automatically labeled, fed back, and used for iterative training. This is a formidable, self-sustaining data flywheel.

In contrast, most domestic robot companies are still using the old crowd-based model. They rent a site of a few thousand square meters and, like the data annotation villages of the past, hire large numbers of people for tele-operation. Data quality is uneven, and collection costs remain high.
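
The loop described here (attempt, automatic label, feed back) can be illustrated with a toy logger. This is not a description of Tesla's system: the class names, the torque-based success rule, and the threshold value are all hypothetical, chosen only to show what "automatically labeled" could mean for a screw-driving task.

```python
from dataclasses import dataclass, field


@dataclass
class Episode:
    task: str
    final_torque_nm: float  # hypothetical sensor reading at the end of the attempt
    success: bool


@dataclass
class DataFlywheel:
    # Assumed rule: the screw counts as seated once this torque is reached.
    torque_threshold_nm: float = 1.5
    dataset: list = field(default_factory=list)

    def record(self, task, final_torque_nm):
        """Label an attempt from sensor data alone and keep it, pass or fail."""
        success = final_torque_nm >= self.torque_threshold_nm
        episode = Episode(task, final_torque_nm, success)
        self.dataset.append(episode)
        return episode

    def success_rate(self):
        if not self.dataset:
            return 0.0
        return sum(e.success for e in self.dataset) / len(self.dataset)

    def failures(self):
        """Failed attempts are training signal too, so they are kept and queryable."""
        return [e for e in self.dataset if not e.success]
```

The design choice worth noticing is that no human annotator appears anywhere in the loop: labels come from the robot's own sensors, which is what lets the dataset grow with every attempt.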

This directly leads to a result: Although the technical route of VLA + world model will become the consensus, the real technical barrier will quickly shift from the model architecture itself to the scale and efficiency of the data factory.

The future competition is hierarchical. At the top level are companies that can build the "physical world foundation model", such as OpenAI, Google DeepMind, and NVIDIA. They provide the most basic VLA base that can understand basic physical laws.

In the middle are robot companies that own efficient, massive, and diverse private data factories. They use the "private-domain data" from their own scenarios to fine-tune the base model in depth, producing super-expert models for specific fields (such as 3C assembly and catering services).

Companies without efficient data factories will become distributors for base-model makers, or will be left to compete desperately in low-tech inspection and guidance scenarios.

Data, specifically high-quality data on physical interactions, is the only ammunition that VLA can ultimately use. Without ammunition, even the most advanced gun is just a useless stick.

Look at Physical Intelligence, a star company founded by a group of top academics. Since the start of this year, it has been frantically signing cooperation agreements with manufacturing and logistics companies. What it is after is not service fees but the most real, messy, and uncertain physical interaction data in those scenarios. Uber's rise back then was not due to algorithms either, but to the data monopoly created by private cars running on city streets around the world.

The Uber moment of embodied intelligence hasn't arrived yet, but the countdown has begun.

Conclusion

VLA is not dead; it's just growing up. The sign of this growth is that it must be uprooted from the Internet's greenhouse and thrown into the soil of the physical world.

It needs to grow a new cognitive organ, the world model, to understand and predict physical causality. Whether all this can happen depends on the corners least illuminated by the spotlight: in the data factory, whether the workers' movements are up to standard, whether sensor noise is filtered out, and whether failed operations are carefully recorded.

The grand narrative of embodied intelligence has come to an end, and a more boring and cruel engineering battle has just begun.
