
Annual summary of embodied intelligent robots, from the head of robotics at NVIDIA

QbitAI | 2025-12-29 17:53
It's time to stop showing demos that succeed only once in a hundred tries.

"The robotics field is still in the Wild West era."

This is the judgment given by Jim Fan, the head of robotics at NVIDIA, as 2025 was coming to an end.

At first glance, this conclusion may sound a bit harsh.

After all, this year we saw robots play table tennis and basketball, and complete complex long-distance transport and cross-scenario tasks.

Of course, there were also plenty of embarrassing failure scenes.

But as Jim Fan and smart netizens have repeatedly pointed out:

Many demonstrations are, in essence, just the best attempt selected from hundreds of tries.

This exposes a core problem: the robotics field still lacks a unified, reproducible standard evaluation system.

That's why almost everyone can claim to have achieved SOTA by adding qualifiers.

In addition, Jim Fan also pointed out —

Robot hardware is currently progressing faster than software, yet poor hardware reliability limits how fast software can iterate; meanwhile, the mainstream VLM→VLA technical paradigm itself has structural problems.

The following is the full text of his post:

Three Things the Robotics Field Taught Me in 2025

Everyone is excited about vibe coding. Amid the festive atmosphere, allow me to share my concerns about the "Wild West" of robotics: here are the three lessons I learned in 2025.

Hardware is ahead of software, but hardware reliability severely limits software iteration speed

We have seen extremely sophisticated engineering masterpieces: Optimus, e-Atlas, Figure, Neo, G1, and so on.

But the problem is that our best AI has not come close to exhausting the potential of this cutting-edge hardware. The robots' physical capabilities far exceed what their brains can currently command.

However, keeping these robots "served" often requires an entire operations and maintenance team.

Robots don't self-repair the way humans do: overheating, motor failures, and strange firmware problems are an almost daily nightmare.

Once an error occurs, it is irreversible and unforgiving.

The only thing that is really "scaled" is my patience.

Benchmark testing in the robotics field is still an epic disaster

In the world of large models, everyone knows what MMLU and SWE-Bench are.

But there is no consensus in robotics: which hardware platform to use, how to define tasks, what the scoring criteria are, which simulator to use, or whether to go straight to the real world.

By definition, everyone is SOTA, because each new announcement comes with its own freshly defined benchmark.

Everyone picks the best-looking demo out of 100 failed attempts.

In 2026, our field must do better: reproducibility and scientific rigor can no longer be treated as second-class citizens.

The VLM-based VLA route has always felt wrong

VLA refers to the Vision-Language-Action model, which is currently the mainstream paradigm for robot brains.

The recipe is simple: take a pre-trained VLM checkpoint and "graft" an action module onto it.
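To make the "graft" pattern concrete, here is a minimal sketch in PyTorch-style Python. The backbone is a placeholder for any pre-trained VLM (no specific checkpoint or library API is implied), and all module names are illustrative assumptions:

```python
import torch
import torch.nn as nn

class VLAFromVLM(nn.Module):
    """Toy sketch: a pre-trained VLM backbone with a small action head grafted on."""

    def __init__(self, vlm_backbone: nn.Module, hidden_dim: int = 1024,
                 action_dim: int = 7, chunk_len: int = 8):
        super().__init__()
        self.vlm = vlm_backbone                    # frozen or lightly fine-tuned VLM (placeholder)
        self.action_head = nn.Sequential(          # the "grafted" action module
            nn.Linear(hidden_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, action_dim * chunk_len),
        )
        self.action_dim, self.chunk_len = action_dim, chunk_len

    def forward(self, images, instruction_tokens):
        # Assumed: the backbone fuses pixels and text into hidden states (B, T, hidden_dim).
        hidden = self.vlm(images, instruction_tokens)
        pooled = hidden[:, -1]                                # take the last hidden state
        actions = self.action_head(pooled)                    # (B, action_dim * chunk_len)
        return actions.view(-1, self.chunk_len, self.action_dim)
```

The point of the sketch is the asymmetry Jim Fan criticizes: almost all capacity sits in the VLM, while the part that actually produces actions is a thin afterthought.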

But if you think it through, the problem becomes clear. The VLM is essentially optimized for climbing benchmarks such as visual question-answering, which leads directly to two consequences:

Most of the VLM's parameters serve language and knowledge, not the physical world;

The visual encoder is actively trained to discard low-level details, because question-answering only needs high-level understanding; but for robots, those tiny details are crucial for dexterous manipulation.

Therefore, there is no reason to expect VLA performance to improve linearly as the VLM's parameter count grows. The problem is the misalignment of the pre-training objectives themselves.

In contrast, the video world model is clearly a more reasonable pre-training objective for robots. I'm placing a big bet in this direction.
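For contrast with the VLA graft above, here is a rough sketch of what an action-conditioned video world-model pre-training objective looks like: given the current frame and action, predict the next frame. The architecture and dimensions are purely illustrative assumptions, not a published model:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyWorldModel(nn.Module):
    """Toy action-conditioned video predictor: predict the next frame from (frame, action)."""

    def __init__(self, latent_dim: int = 256, action_dim: int = 7):
        super().__init__()
        self.encoder = nn.Sequential(nn.Flatten(), nn.LazyLinear(latent_dim), nn.GELU())
        self.dynamics = nn.GRUCell(latent_dim + action_dim, latent_dim)
        self.decoder = nn.LazyLinear(64 * 64 * 3)   # reconstruct a 64x64 RGB frame

    def forward(self, frame, action, state):
        z = self.encoder(frame)                                   # encode the current frame
        state = self.dynamics(torch.cat([z, action], dim=-1), state)  # roll dynamics forward
        next_frame_pred = self.decoder(state).view(-1, 3, 64, 64)
        return next_frame_pred, state

def world_model_loss(model, frames, actions):
    """frames: (B, T, 3, 64, 64), actions: (B, T, action_dim).

    The pre-training signal is simply: given frame t and action t, predict frame t+1.
    """
    B, T = frames.shape[:2]
    state = torch.zeros(B, 256)        # matches the default latent_dim above
    loss = 0.0
    for t in range(T - 1):
        pred, state = model(frames[:, t], actions[:, t], state)
        loss = loss + F.mse_loss(pred, frames[:, t + 1])
    return loss / (T - 1)
```

Unlike VQA-style pre-training, this objective forces every parameter to earn its keep by modeling temporal dynamics and physical consequences, which is the argument for it as a robot-native pre-training goal.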

Many netizens also expressed their agreement under Jim Fan's tweet.

Some netizens noted that hardware fault tolerance is indeed critical:

Iteration slowing down because of hardware constraints is an often-underestimated bottleneck. Software can be updated frequently, but the physical system must be built on a reliable mechanical foundation, which requires real-time verification and refinement.

Hardware is crucial, but so is data

Jim Fan's discussion put hardware at the center, but data, an equally core element, went unmentioned.

In robotics research, data shapes the model's capabilities, and the model's performance in turn depends on hardware; this is the field's characteristically full-stack nature.

This year we have seen new hardware platforms such as Figure 03, Unitree H2, Zhongqing T800, the XPENG IRON robot, and the ZHIYUAN ELF G2.

Judging from the demos, this new hardware performs well in terms of motion capability:

Whether it's Unitree's somersaults or the XPENG robot's gait control, they significantly exceed the average level seen at the beginning of the year, and prove that large (adult-height) robots can be as agile as small ones.

But the real practical problem, as Jim Fan and netizens noted, may be how to further improve hardware reliability while maintaining high performance: engineering challenges such as fall resistance, battery heat, and long-term operational stability.

On the data side, one of the most notable examples this year is Generalist, which demonstrated a scaling law for embodied intelligence with large amounts of data.

The more data and the more model parameters, the better the model performs on specific tasks, consistent with what we have observed in LLMs.
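As a toy illustration of what such a scaling-law fit looks like in practice, the snippet below fits a power law in log-log space. The numbers are invented for illustration only; they are not Generalist's results:

```python
import numpy as np

# Made-up (data hours, task error rate) pairs purely to illustrate the fitting procedure.
data_hours = np.array([100, 300, 1_000, 3_000, 10_000], dtype=float)
error_rate = np.array([0.52, 0.40, 0.30, 0.23, 0.17])

# Fit error ≈ a * hours^b by linear regression in log-log space (b is expected to be negative).
slope, log_a = np.polyfit(np.log(data_hours), np.log(error_rate), 1)
a = np.exp(log_a)
print(f"error ≈ {a:.2f} * hours^({slope:.2f})")
```

A straight line in log-log space is the signature of a scaling law: each multiplicative increase in data buys a predictable multiplicative reduction in error.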

At the same time, custom robot hardware designed to facilitate data collection has also emerged, such as Sunday.

The system is co-designed with the robot's hand: skill-capture gloves collect human motion data, which can then be converted into robot-usable data with a success rate of nearly 90%.

Also drawing attention is Egocentric-10K, a large-scale dataset aggregating 10,000 hours of work data.

In embodied intelligence, the importance of data is self-evident. But the specific data route has not yet converged: human-centric collection (wearable devices, UMI, videos), real-robot teleoperation data, simulation data, and Internet data, as well as data modalities and mixing ratios, are still open questions.

The Hottest Word in the Robotics Field in 2025 — VLA

In terms of models, VLA is undoubtedly the hottest word in the robotics field in 2025.

According to the latest reviews from research institutions such as King's College London and The Hong Kong Polytechnic University, more than 200 VLA works were published in 2025 alone.

Some netizens even joked recently that there might be 10,000 VLA works in 2026.

So, what exactly is VLA?

Simply put, VLA endows the robot with a brain that can simultaneously process information in the following three modalities (a toy interface sketch follows the list):

Vision (V): Perceive the environment through cameras and understand the shape, position, color, state, and scene layout of objects.

Language (L): Understand human natural language instructions (e.g., "Put the red apple on the table into the bowl") and conduct high-level reasoning.

Action (A): Convert the understood instructions into a sequence of low-level physical actions that the robot can execute (e.g., moving joints, grasping, pushing, etc.).
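Here is the promised toy sketch of that three-modality interface: an observation bundles vision and language, and the policy returns a short chunk of low-level actions. Every name here is hypothetical; the stub returns zeros rather than running a real model:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Observation:
    rgb: np.ndarray          # V: camera image, e.g. (H, W, 3) uint8
    instruction: str         # L: natural-language command

@dataclass
class ActionChunk:
    joint_targets: np.ndarray   # A: low-level commands, e.g. (chunk_len, num_joints)

class VLAPolicy:
    """Toy interface only: maps (vision, language) to a short sequence of actions."""

    def act(self, obs: Observation) -> ActionChunk:
        # A real VLA would run a VLM plus an action head here; this stub returns zeros.
        return ActionChunk(joint_targets=np.zeros((8, 7)))

policy = VLAPolicy()
chunk = policy.act(Observation(rgb=np.zeros((224, 224, 3), dtype=np.uint8),
                               instruction="Put the red apple on the table into the bowl"))
```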

Traditional robots usually need to be specially programmed or trained for each new task. A VLA model, by learning from large-scale data, can perform tasks it never explicitly saw in training and can even operate in unfamiliar environments; in other words, it generalizes.

But as Jim Fan noted above, the VLM (vision-language model) skeleton of the VLA model is essentially optimized for question-answering and knowledge reasoning; its huge parameter budget and training objectives are seriously misaligned with the fine physical manipulation robots require.

In the aforementioned review, we also found responses to the points Jim Fan raised, organized below as questions and answers:

Q: The VLM's visual encoder tends to discard low-level physical details and retain only high-level semantics (e.g., "This is an apple"), yet those tiny details are precisely what determine the success rate of actions such as grasping and pushing.

A: Future VLA needs to integrate a physics-driven world model that internally represents 3D geometry, physical dynamics, causality, and affordances, unifying semantic instructions with physical precision.

Q: Since the VLM's pre-training objectives are misaligned with robot control, increasing model parameters will not linearly improve performance.

A: Decouple high-level semantic planning from low-level embodied perception and control through morphology-independent representations, so that a general robot brain can achieve zero-shot cross-embodiment transfer via lightweight adapters, exploiting the generalization that data scale brings instead of blindly stacking parameters.
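A minimal, purely illustrative sketch of this "shared brain plus lightweight per-embodiment adapter" idea follows; the module names and dimensions are assumptions, not taken from the review:

```python
import torch
import torch.nn as nn

class SharedRobotBrain(nn.Module):
    """Morphology-independent core: maps fused vision-language features to a latent plan."""
    def __init__(self, feat_dim: int = 512, plan_dim: int = 256):
        super().__init__()
        self.core = nn.Sequential(nn.Linear(feat_dim, 512), nn.GELU(), nn.Linear(512, plan_dim))

    def forward(self, features):
        return self.core(features)

class EmbodimentAdapter(nn.Module):
    """Lightweight per-robot head: translates the latent plan into one body's action space."""
    def __init__(self, plan_dim: int = 256, num_joints: int = 7):
        super().__init__()
        self.head = nn.Linear(plan_dim, num_joints)

    def forward(self, plan):
        return self.head(plan)

brain = SharedRobotBrain()                         # trained once on pooled cross-robot data
arm_adapter = EmbodimentAdapter(num_joints=7)      # cheap to train for a new embodiment
humanoid_adapter = EmbodimentAdapter(num_joints=29)

features = torch.randn(1, 512)
plan = brain(features)
print(arm_adapter(plan).shape, humanoid_adapter(plan).shape)   # same plan, different bodies
```

The design choice is that scale goes into the shared, body-agnostic core, while each new robot only pays for a small adapter.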

Q: Jim Fan suggested using the video world model as the pre-training objective for robots, because it naturally encodes temporal dynamics and physical laws.

A: The current research trend is to "graft" world-model capabilities onto the VLM: for example, train a data-driven simulator to learn physical dynamics, then embed it in the VLA as a decoupled internal simulator for explicit planning, turning the VLA from a "passive sequence generator" into an active, physically aware agent.
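To show what "explicit planning with an internal simulator" can mean in the simplest case, here is a toy sampling-based planner that imagines rollouts inside a learned model and picks the best first action. Both callables are hypothetical placeholders for illustration, not an API from the review:

```python
import numpy as np

def plan_with_world_model(world_model_step, score_fn, state, horizon=10, num_candidates=64):
    """Toy sampling-based planner around a learned simulator.

    world_model_step(state, action) -> predicted next state  (a learned model, not a physics engine)
    score_fn(state) -> how well the state satisfies the instruction
    """
    best_score, best_first_action = -np.inf, None
    for _ in range(num_candidates):
        actions = np.random.uniform(-1.0, 1.0, size=(horizon, 7))   # random candidate action sequence
        s = state
        for a in actions:
            s = world_model_step(s, a)          # "imagine" the rollout inside the world model
        score = score_fn(s)
        if score > best_score:
            best_score, best_first_action = score, actions[0]
    return best_first_action                     # execute the first action, then re-plan
```

This is the sense in which the agent becomes "active": instead of emitting one action sequence, it evaluates imagined futures before acting.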

In addition, on data and evaluation benchmarks, the review leans toward the "simulation school".