
VLA is dead, long live WAM: Has the GPT moment for robots arrived?

脑极体 2026-05-19 20:37
The obituary for VLA might have been written too soon.

At the end of April, at the AI Ascent 2026 conference hosted by Sequoia Capital, Jim Fan, who leads NVIDIA's robotics efforts, made a highly controversial claim: "The vision-language-action model (VLA) is dead, and the world action model (WAM) should rise." He also predicted that within the next one to two years, the main data source for robot learning will shift from expensive human teleoperation to first-person human videos readily available on the Internet.

The remarks immediately caused a stir in the field of embodied intelligence.

Shortly before Jim Fan's speech, the LDA-1B model, jointly released by the Chinese embodied intelligence company Galaxy Universal, NVIDIA, Tsinghua University, and Peking University, had clearly taken a step towards "abandoning reflexive imitation and taking the world model route". Meanwhile, Motubrain, the general world action model launched by Shenshu Technology, topped both the WorldArena and RoboTwin 2.0 international leaderboards.

Jim Fan's speech and these companies' work led some to cheer that the field had "finally found the right direction", while others sneered that "NVIDIA is creating momentum for itself again". Supporters see this as the inevitable path for robots to move from imitation to understanding; opponents point out that VLA's advantages in fine control remain irreplaceable.

So what exactly is the dispute over the roadmap for the robot brain? Is VLA really a thing of the past in embodied intelligence? And what does this technological shift mean for startups in the field?

01 What difficulties has WAM overcome?

To understand the value of WAM, we first need to figure out what the problems with VLA are.

The training logic of VLA is straightforward: imitate human teleoperation. If you teach it to pick up a red cup, it remembers what the red cup looks like and the corresponding action. The next time it sees the same cup, it can pick it up.

But the real world is not a laboratory. The cup's color and the lighting will change. Changes that seem insignificant to humans are huge challenges for VLA robots. In other words, what VLA learns is an extremely fragile, stereotyped "conditioned reflex" that is difficult to apply to complex real-world scenarios.

WAM takes a completely different approach. Its core is prediction and understanding. WAM tries to let the robot rehearse an action in its internal model before performing it: how the object will move, how the liquid will flow, and how the entire scene will change as a result.
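The contrast between reflexive imitation and "imagine first, then act" can be made concrete with a deliberately tiny sketch. Everything in it is a hypothetical illustration rather than how any real VLA or WAM is implemented: the 1-D world, the lookup-table "reflex" policy, and the hand-written `world_model` are all assumptions chosen only to show why a policy that predicts outcomes can cope with a starting position it never saw in its demonstrations.

```python
from typing import Callable, Dict, Sequence

# Hypothetical toy world: the state is the gripper's 1-D position and the
# goal is the cup's position. None of these names come from a real library.
ACTIONS: Sequence[int] = (-1, 0, 1)  # move left, stay, move right


def reflex_policy(memory: Dict[int, int], observation: int) -> int:
    """Caricature of pure imitation: replay the action memorized for this observation."""
    # Unseen observations fall back to "stay", mimicking brittle imitation.
    return memory.get(observation, 0)


def world_model(position: int, action: int) -> int:
    """Assumed dynamics: predict the next position after taking an action."""
    return position + action


def imagine_then_act(position: int, goal: int,
                     model: Callable[[int, int], int] = world_model) -> int:
    """Caricature of predict-then-act: roll every action through the model
    and choose the one whose predicted outcome is closest to the goal."""
    return min(ACTIONS, key=lambda a: abs(model(position, a) - goal))


def run(policy: Callable[[int], int], start: int, steps: int = 10) -> int:
    """Apply a policy for a fixed number of steps and return the final position."""
    position = start
    for _ in range(steps):
        position = world_model(position, policy(position))
    return position


if __name__ == "__main__":
    goal = 5
    # Demonstrations only ever covered positions 0 through 5.
    demos = {0: 1, 1: 1, 2: 1, 3: 1, 4: 1, 5: 0}
    for start in (0, 8):  # 0 was demonstrated, 8 was never seen
        reflex_end = run(lambda p: reflex_policy(demos, p), start)
        planner_end = run(lambda p: imagine_then_act(p, goal), start)
        print(start, reflex_end, planner_end)
```

From the demonstrated start both policies reach the cup at position 5; from the unseen start the lookup-table policy never moves, while the predictive one still gets there. Real systems replace the lookup table with a learned network and the one-line dynamics with a learned video or latent-state predictor, so this toy says nothing about their actual accuracy or cost.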

The first breakthrough brought by this physical imagination is a leap in generalization. Even when entering a kitchen it has never seen before, a well-trained WAM robot can make reasonable judgments based on its understanding of gravity, friction, and inertia. The HarmoWAM research shows that in zero-shot scenarios where the background, position, and object semantics have all changed, WAM performs 33% better than the previous SOTA-level VLA model.

Beyond the breakthrough in generalization, WAM has achieved something of even greater industrial significance: a structural loosening of the constraints on data sources.

VLA has long been trapped on the expensive island of teleoperation data. Every frame of operation data has to be collected through real-time human remote control of a real robot. WAM can learn from the massive supply of ready-made human first-person videos generated every day, just as a large language model learns from Internet text. This means that, for the first time, WAM gives robots the possibility of teaching themselves about the physical world from Internet videos. Zhizaiwujie's Being-H0.7 was pre-trained directly on 200,000 hours of human videos, proving the feasibility of this path. Galaxy Universal's LDA model goes a step further by jointly training on simulation data, human videos, and robot operation data, breaking the industry's long-standing "perfect data superstition".

Moreover, WAM has made progress on another problem that has long troubled robotics: long-horizon tasks. VLA can usually handle only simple tasks of two to three actions and easily loses its way when the sequence gets slightly longer. WAM's performance has begun to move beyond the demo stage. Shenshu Technology's Motubrain can already complete complex tasks on the order of ten atomic actions, which means robots can execute more continuously and robustly in real-world scenarios.

The pace of Chinese teams on this track deserves attention. Galaxy Universal's LDA-1B was co-authored with Tsinghua University, Peking University, and NVIDIA; Shenshu Technology's Motubrain topped two international leaderboards; Zhizaiwujie's Being-H0.7 ranked first globally in the comprehensive ranking.

Meanwhile, overseas frontier laboratories are also advancing rapidly. DreamZero, proposed by NVIDIA, demonstrated strong generalization to new tasks and new environments in real-robot experiments, a more than twofold improvement over top-tier VLA models.

On this new track, Chinese and overseas players are almost at the same starting line. But behind the excitement, a more fundamental question emerges: Should VLA really exit the stage?

02 Is VLA dead?

The direction of WAM is sound, but the verdict that "VLA is dead" deserves a calm examination.

On the one hand, WAM does show exciting technological potential. It enables robots to move from mechanical imitation to understanding and predicting the physical world, and from relying on expensive teleoperation data to using massive human videos. Being-H0.7 of Zhizaiwujie, pre-trained with 200,000 hours of human videos, was able to rank first in the comprehensive ranking in six international evaluations, which was unimaginable in the pre-VLA era.

On the other hand, there is also a commercial narrative behind this judgment. To understand this, let's first see who is saying "VLA is dead".

NVIDIA is the world's largest AI chip supplier. Whether the model is VLA or WAM, the underlying compute runs on its chips. However, the compute consumption of the two is not of the same order of magnitude. WAM needs to pre-train on massive video data and also perform complex physical simulation or diffusion generation during inference, demanding far more GPU compute than VLA. Jim Fan's strong promotion of WAM would mean greater chip shipments and higher unit prices for NVIDIA. A chip company naturally hopes that the market will shift to technology routes that "consume" more compute.

As observers, however, we need to distinguish objective technological breakthroughs from expectations magnified by commercial interests when weighing a technological narrative. Commercial interests aside, WAM itself still has hard nuts to crack.

On the one hand, because video generation prioritizes pixel-level consistency over joint-level fine control, WAM performs significantly worse than action-optimized VLA models in precision assembly tasks that require millimeter-level positioning or dual-arm coordination, and its inference latency remains higher even after optimization.

On the other hand, the data and compute thresholds are not low either. Jointly training on videos and actions requires massive real-robot interaction data and incurs high diffusion-model training costs, which not every team can afford.

Moreover, when tasks involve abstract language instructions or complex social contexts, pure physical world modeling can easily end up understanding the scene but not the human instruction. This shows that although WAM has taken an important step towards "understanding the physical world", it still has a long way to go in "entering the real world". Intriguingly, this is precisely VLA's comfort zone.

In fact, VLA still has value that is difficult for WAM to replace at this stage.

First, consider deployment efficiency. In tasks that require millimeter-level precision and real-time force adjustment, such as precision assembly and surgical assistance, VLA's lightweight architecture is easier to deploy in real time. In essence, VLA is an end-to-end "observation-action" mapping. It requires no complex physical simulation during inference, so its computational overhead is low and its response is fast. A mature VLA system can run on edge devices at relatively low compute cost.

Second, consider engineering maturity. After more than a year of rapid development, VLA's model architecture has become quite mature. There are many open-source models to draw on, and the ecosystem tooling is relatively complete. From data collection and model training to deployment and inference, the entire pipeline already has fairly standard solutions. A startup team can build a usable VLA system in a relatively short time. WAM's architecture is more complex, its training is less stable, its inference overhead is large, and the threshold for engineering implementation is significantly higher.

There is also an easily overlooked dimension: compatibility with the existing industrial system. In industrial robotics, a large share of automation tasks do not require complex physical understanding, only stable, reliable, high-precision repetition. VLA's imitation learning paradigm is naturally compatible with the needs of industrial scenarios: enterprises can teach robots specific operations with a small number of demonstrations.

Therefore, the more likely evolution path is not "VLA being eliminated" but a deep integration of the two. "VLA is dead" is a highly contagious slogan, but it may be too early to treat it as a technological verdict. It is more like an alarm bell, reminding the industry not to stay in VLA's comfort zone but to think about how to integrate physical understanding into the existing framework.

So, while the discussion about whether WAM will replace VLA is in full swing, what are the startup companies that have bet on VLA experiencing?

03 The situation changes in half a year, and startup companies are under pressure

It has been only a little over half a year from the rise of VLA to the question of whether it is "dead". Technological iteration in robotics has become so fast that it leaves the industry breathless. For large technology giants, this may be just an adjustment of research direction, but for startups with limited resources, each "change of situation" may be a gamble that requires placing their bets all over again.

The starting point of all this is, first of all, the huge sunk-cost risk in the R&D route.

In the past year, a large number of startups have built their technology stacks around VLA, investing heavily in teleoperation equipment and specialized data collection teams. Founders believed that accumulating high-quality teleoperation data would be the moat of the future. After its founding at the end of 2023, Zibianliang Robotics completed a Series B round of nearly 2 billion yuan and has raised more than 4 billion yuan in total, with a considerable part used to build its data collection factory and real-robot data collection team. Zhifang completed 12 rounds of financing within a year, raising more than 1 billion yuan in total. Its self-built production line went into operation in September 2025, and by December of the same year it was delivering 100 units of AlphaBot 2 per month. Behind these numbers is a complete set of assets, teams, and cognitive frameworks built around VLA.

However, when the wave of WAM hits, the value of these investments is being re-evaluated. For companies that have just completed large-scale financing and have expanded their team size to hundreds of people, adjusting the direction means huge sunk costs.

The switch of the technology route quickly triggered a chain reaction in the talent market.

In the VLA era, the industry needed people skilled in imitation learning and teleoperation data collection; in the WAM era, demand has shifted to video understanding, physical simulation, and world model construction. This rapid change in the required skill set pressures startups to restructure teams they have only just built.

Moreover, the rapid switch of the technology route means that supply and demand in the talent market are also fluctuating violently. As the WAM direction has become popular, the premium for relevant talent has risen rapidly, and the VLA teams originally hired at high salaries face the twin dilemmas of attrition or retraining. According to the "2026 Spring Recruitment Workplace Insight Report" of Maimai, from January to April 2026, the number of embodied intelligence positions increased by 15 times year-on-year, and the average monthly salary increased from 59,000 yuan to 62,000 yuan. Some industry insiders revealed that the salary increase for job-hopping in the industry can be as high as 150%. For a startup with limited resources, it is not easy to be squeezed from both sides: competing for talent in the new direction while absorbing the inertia of the team built for the old one.

More direct than the talent problem is the doubt about product value.

A cruel reality is that when the technology route changes every half year, products developed on the old route may suddenly lose their market value. For example, robot skill models trained under the VLA paradigm and dependent on teleoperation data face revaluation under the WAM narrative. If the staple food of robots in the future really is Internet video, how many customers will be willing to pay for these "personal-trainer" skills trained at high cost?

All these problems will ultimately be reflected in the capital market. The patience of investors and the window period of the capital market may not keep up with the pace of technology.

The "China Investment Development Report 2026" gives a judgment: The investment in the humanoid robot industry is entering a critical stage of "eliminating the false and retaining the true", and the valuation logic is shifting from concept hype to order verification and supply-chain positioning. The report clearly points out that mid-stream whole