NVIDIA Equips Robots with Lobster Brains, and the Harness for Embodied Intelligence Arrives
The wind of "Harness" has finally blown from large models to robots!
Just now, NVIDIA open-sourced a brand-new robot control framework:
CaP-X.
Based on this framework, a robot can understand its environment through the camera and then write Python code on the spot to control itself.
Crucially, this isn't a one-off. If a piece of code successfully completes a task, it is automatically stored in a skill library and can be reused across robot systems with different bodies and form factors.
(Doesn't that sound a bit like OpenClaw's Skills?)
The most astonishing part is that this framework can also use embodied large models (such as VLA) as APIs, directly allowing a "big brain" to "harness" various "small brains" (perception and control).
In actual tests, CaP-Agent0, built on this framework, matched or even exceeded the success rate of programs written by human experts on 4 of 7 core tasks.
Even against pre-trained end-to-end large models such as OpenVLA and the Pi series, CaP-X's "logic-based" approach delivered comparable or better performance.
Jim Fan, head of NVIDIA's robotics division, concluded directly:
The era of agentic robotics has arrived!
If a Harness for large models is like installing an engine in a car,
then CaP-X for robots lets that engine write its own driver code according to road conditions and upgrade its "cheat codes" at any time.
The release of this framework marks that the robotics field has officially entered its own "Harness" era.
Regarding this, Ken Goldberg, a professor at UC Berkeley, commented:
I'm very excited about the prospects of "Code as Policy" (CaP) for robots!
From "Manual Scaffolding" to "Code as Policy"
To understand what CaP - X is doing, let's first briefly review the current mainstream approaches to robot control.
In traditional robot control, engineers write the logic for perception, planning, and feedback line by line (the classic TAMP framework, for example): the so-called human-in-the-loop approach.
This method is precise and transparent, but it generalizes extremely poorly: often, "change the cup, rewrite the code."
Later, inspired by the Scaling Law of large models, the robotics field turned to data-driven, end-to-end Vision-Language-Action (VLA) models.
Over the past year, the VLA architecture has achieved remarkable results, and robots have started folding clothes and doing odd jobs.
But VLA is a "black box": when an error occurs it is hard to debug, and new tasks require collecting new data and retraining.
Recently, inspired by the progress of programming agents such as OpenClaw and Claude Code, researchers began to wonder: could large models like Gemini and GPT replace the human engineers of traditional control and call robot interfaces directly with Python code?
This is the backdrop against which CaP-X emerged: it turns the large model from a "commander issuing orders" into a "programmer writing code."
Furthermore, in the CaP-X framework, even the VLA policy is just an API that can be called at any time.
Simply put, VLA used to be the robot's "whole brain," responsible for everything from image recognition to finger movement. In CaP-X, VLA becomes a function in the code.
For example, for a high-frequency, delicate task like "unscrewing a lid," the programming agent no longer hand-computes complex geometric coordinates; it directly calls the VLA to execute the fine-grained operation.
In this way, CaP-X replaces the human engineer with a general programming agent, equips it with a full set of perception and actuation interfaces, lets it automatically synthesize a skill library as it works, and allows it to call embodied models specialized for manipulation.
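The "VLA as a function call" idea can be sketched in a few lines of Python. Everything here (`SimEnv`, `detect`, `move_to`, `vla_policy`) is an illustrative stand-in, not CaP-X's actual interface:

```python
class SimEnv:
    """Toy stand-in for a CaP-Gym-style environment (hypothetical API)."""
    def __init__(self):
        self.log = []

    def detect(self, name):
        """Perception primitive: returns a structured semantic object."""
        self.log.append(f"detect:{name}")
        return {"name": name, "position": (0.4, 0.1, 0.2)}

    def move_to(self, position):
        """Control primitive: a planner/IK solver would handle the motion."""
        self.log.append(f"move_to:{position}")

    def vla_policy(self, instruction, max_steps=200):
        """The VLA reduced to a function call for contact-rich skills."""
        self.log.append(f"vla:{instruction}")
        return True  # pretend the VLA sub-policy succeeded

def open_jar(env):
    """Agent-written policy: high-level logic in code, dexterity delegated."""
    jar = env.detect("jar")
    env.move_to(jar["position"])
    return env.vla_policy("unscrew the lid")
```

The point of the sketch: the gross motion stays in transparent, debuggable code, while the one step that truly needs learned dexterity is a single opaque call.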
Next, let's take a closer look.
The Harness of Embodied Intelligence
CaP-X is essentially not a model but a complete harness framework, comprising: an interactive training environment (CaP-Gym), a hierarchical benchmark (CaP-Bench), a training-free agent framework (CaP-Agent0), and a reinforcement-learning evolution algorithm (CaP-RL).
CaP - Gym
As the core of the entire framework, CaP-Gym is a hierarchical control framework built on the standard Gymnasium interface.
It connects the digital brain with the physical body: every time the large model writes a line of code, the physical world (simulator or real robot) gives real-time feedback.
Architecturally, CaP-Gym unifies perception primitives and control primitives:
On the perception side, the agent obtains data from the environment through modular perception primitives, which abstract raw sensor data into structured semantic objects.
Built-in tools such as SAM3 (semantic segmentation) and Molmo 2 (point selection) turn raw images directly into structured semantic objects like "there is an apple here" and "there is a cup there."
On the control side, the agent does not issue joint-space action commands directly; instead it calls a motion planner or an inverse kinematics (IK) solver (such as PyRoki), which automatically handles collision detection and path planning.
In other words, whether for single-arm grasping, dual-arm cooperation, or mobile robots, CaP-Gym provides an interactive sandbox that lets large models do "logical programming" directly in Cartesian space.
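The perception/control split above can be mocked up as follows. The class and method names are assumptions for exposition; the real CaP-Gym builds on the Gymnasium step/reset interface rather than this toy class:

```python
class CaPGymSketch:
    """Minimal mock of the perception/control split (not the real API)."""
    def __init__(self):
        # Toy scene state: object label -> Cartesian position (x, y, z)
        self.scene = {"apple": (0.5, 0.0, 0.1), "cup": (0.3, 0.2, 0.1)}
        self.gripper = (0.0, 0.0, 0.5)

    def segment(self, label):
        """Perception primitive: raw sensors -> structured semantic object."""
        pos = self.scene.get(label)
        return {"label": label, "position": pos} if pos else None

    def move_gripper_to(self, position):
        """Control primitive: Cartesian target in, planned motion out.

        A real implementation would call an IK solver / motion planner
        (the article mentions PyRoki) and do collision checking.
        """
        self.gripper = position
        return True
```

An agent-written snippet then reads like `env.move_gripper_to(env.segment("apple")["position"])`: the model reasons over labels and Cartesian targets, never over raw pixels or joint angles.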
CaP - Bench
Based on CaP-Gym, the researchers also introduced CaP-Bench to measure whether a model can truly "harness" a robot.
It specifically tests a model's code quality, logical rigor, and error-correction ability when the model is pushed to the front line to "write action code" and face physical feedback.
CaP-Bench tests along three dimensions:
Abstraction level: shifting the action space from manually designed macro-commands (high-level) down to atomic primitives (low-level);
Temporal interaction: comparing zero-shot single-round program generation with multi-round interaction, to quantify fault recovery and iterative reasoning;
Perceptual grounding: evaluating how different forms of visual feedback affect the agent's ability to turn task-relevant visual features into generated code.
A single-round blind test of 12 state-of-the-art large models (including OpenAI o1, Gemini 3 Pro, etc.) showed that:
as human priors (scaffolding) are removed, the performance of all frontier models drops precipitously, and none reaches the zero-shot success rate of human experts on the low-level primitives.
This shows that without good interfaces, even models as powerful as GPT and Gemini 3 Pro still get "lost" in low-level action logic, far from human-expert level.
CaP - Agent0
Building on the failure modes observed in CaP-Bench, the researchers further introduced CaP-Agent0.
CaP-Agent0 enhances the base model with a dedicated multi-round reasoning loop and a dynamically synthesized skill library. The core components are:
Multi-round visual difference comparison (VDM): models often "go blind" when looking directly at raw images. VDM converts the visual differences between frames into structured natural-language feedback, and the agent then revises its code based on that feedback.
Automatically synthesized persistent skill library: when the model stumbles onto a low-level success through trial and error, CaP-Agent0 automatically extracts the successful code and encapsulates it as a reusable "skill." As attempts accumulate, so does the skill library, making complex problems progressively simpler.
Parallel ensemble reasoning: for hard problems, multiple candidate solutions are sampled in each round and attempted in parallel.
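Put together, the multi-round loop and the skill library behave roughly like this sketch, where `generate_code`, `run_in_env`, and `visual_diff_to_text` are placeholders for the LLM call, the CaP-Gym rollout, and the VDM module (the parallel sampling of candidates is omitted here for brevity):

```python
# Toy version of CaP-Agent0's reasoning loop; all callables are placeholders.
skill_library = {}  # task description -> code that has succeeded before

def solve(task, generate_code, run_in_env, visual_diff_to_text, rounds=3):
    if task in skill_library:                 # reuse a stored skill directly
        return skill_library[task]
    feedback = ""
    for _ in range(rounds):
        code = generate_code(task, feedback)  # LLM writes/patches the policy
        ok, before, after = run_in_env(code)  # rollout returns success + frames
        if ok:
            skill_library[task] = code        # promote the success to a skill
            return code
        # VDM: turn the frame-to-frame visual difference into language
        feedback = visual_diff_to_text(before, after)
    return None
```

Note how the two mechanisms compose: VDM drives improvement within one task, while the skill library amortizes that effort across tasks, so later problems start from a richer vocabulary.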
In addition, the team introduced CaP-RL, which uses success or failure from environment feedback as a verifiable reward and applies reinforcement learning (GRPO) to post-train the programming model itself, sharpening its code-writing intuition.
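The "verifiable reward" part reduces to a very simple function. This is a generic sketch of the idea, not the paper's implementation; a GRPO-style group baseline is shown alongside it:

```python
def verifiable_reward(code, rollout):
    """Return 1.0 iff executing the generated policy makes the env succeed.

    `rollout` is a placeholder for running the code in simulation.
    """
    try:
        success = rollout(code)
    except Exception:
        return 0.0                # crashing code earns zero reward
    return 1.0 if success else 0.0

def group_advantages(rewards):
    """GRPO-style group baseline: center each reward on the group mean."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]
```

Because the reward is checked by the environment rather than by a learned judge, there is nothing for the policy to reward-hack: the code either completes the task or it doesn't.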
Experimental Conclusions
As mentioned at the beginning, on the 7 core tasks of CaP-Bench, even with all high-level interfaces removed and only the lowest-level atomic primitives available, CaP-Agent0 still performed excellently.
On 4 of the 7 tasks, it matched or even surpassed the reference programs written by human experts.
On the long-horizon tasks of LIBERO-PRO, facing random perturbations of instructions or object positions, the training-free CaP-Agent0 proved more robust than end-to-end models such as OpenVLA.
Moreover, because CaP-RL does its reinforcement learning at the level of code logic rather than pixels, the learned ability transfers zero-shot, without loss, to real-world robots.
At the end of the paper, the team also candidly shared the current limitations:
Although programmatic control (CaP) works well for long-horizon reasoning and logical planning, the current pure-code approach still looks fragile on "delicate tasks" that demand very high-frequency visual feedback and fine tactile perception (such as pouring water or precise plugging and unplugging).
A very promising direction is the CaP–VLA hybrid strategy:
The programming agent manages high - level task logic and error recovery, while delegating low - level execution to the VLA model.
From a robotics standpoint, introducing optimization-based control primitives (letting the agent specify task-level constraints, obstacle avoidance included) could further improve robustness.
Code as Policy
To be honest, the idea of "Code as Policy" is not something new.
As early as 2022, Google proposed CaP.
(Yes, Karol Hausman, the CEO of Physical Intelligence, was also one of the authors of that paper.)
The core idea of CaP is: Don't let the large model only output "what to do next," but let it directly write the Python code for the robot to execute.
That is to say, unlike the earlier approach of using the large model as a high-level planner that outputs abstract steps for other modules to execute, CaP directly generates policy code closer to the final control layer.
There are two obvious benefits to doing this:
Firstly, code is naturally suitable for expressing conditional judgments, local feedback loops, and precise numerical control.
Secondly, it is also easier to convert vague instructions like "get closer" and "go faster" into concrete action parameters.
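Both benefits show up in a toy generated policy like the one below. Here `distance_to`, `blocked`, `step_toward`, and `sidestep` are hypothetical robot primitives, and `margin` is how "get closer" becomes a number:

```python
def get_closer(robot, target, margin=0.05):
    """Approach until within `margin` meters: 'closer' made numeric."""
    while robot.distance_to(target) > margin:   # local feedback loop
        if robot.blocked():                     # conditional branch
            robot.sidestep()
        else:
            robot.step_toward(target, speed=0.1)
    return robot.distance_to(target)
```

A step-list planner can only say "approach the target"; the code version natively encodes the stopping condition, the obstacle check, and the speed, all of which are exactly the things that go wrong in open-loop execution.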
In recent years there have been many explorations along this path, but most studies struggle to disentangle:
Is it the model itself that is smart, or have the interfaces designed by engineers already done most of the work in advance?
Moreover, it has been unclear whether letting the model think longer and debug more (i.e., spend test-time compute) can compensate for its clumsiness in low-level operations.
CaP-X is an enhanced version of this route, and it demonstrates that:
the breakthrough in embodied intelligence may not require endlessly accumulating real-world teleoperation data.