NVIDIA Initiates Robot Development and In-house Research on Robotics Technologies

To make you burn tokens, NVIDIA has gone all-in on robots.

Well well well, NVIDIA has found a new way to burn tokens (doge).

Just now, NVIDIA, CMU, and Berkeley jointly launched an autoresearch framework for embodied intelligence:

ENPIRE.

Put simply, ENPIRE lets AI agents conduct robot research on their own: 8 coding agents each control a dual-arm robot.

The agents read papers, modify algorithms, train policies, deploy experiments, analyze results, and summarize what they learned, all on their own. If they're not satisfied, they change their approach and start over.

The researchers at GEAR don't need to stare at the screen to adjust parameters. They just need to come in the next morning to check the reports.

The setup works like this: the lab prepares the scene in advance, then leaves the rest to Codex and the robots.

The results are actually quite good.

In the most representative task, Pin Insertion, the robot's success rate at inserting a pin into a 4-millimeter hole rose from 0 to 99% in just 3 hours.

There was no human involvement throughout the process. Jim Fan, one of the project leaders, tweeted:

Part of the GEAR laboratory is now self-improving overnight. We just need to come in the morning to read the reports.

However, some netizens said:

The high-EQ way of putting it: self-improving overnight. The low-EQ way of putting it: burning tokens around the clock.

Harness for Embodied Intelligence Research

First, note that ENPIRE doesn't have agents directly write control code to drive the robots. Instead, the agent acts more like a robotics researcher: it has to reset the experimental scene in the real world, retrieve literature, implement ideas, verify results, analyze problems, and optimize the next iteration.

Unlike code-as-policy methods, ENPIRE's final product is not a control script but a real policy that can be deployed on robots.

Building an automated framework for the real world is difficult because the real world is different from the code world.

In the code world, if an agent writes wrong code, it can simply delete it and start over; if an experiment crashes, it can be restarted.

But in robot research, after an experiment fails, objects may be tilted, the scene may be a mess, and the robot may even have knocked things over.

If researchers have to manually reset the scene, record results, and organize data for each experiment, agents can't conduct research continuously around the clock.

So, what ENPIRE does is essentially build an automated experimental platform for AI researchers.

The paper calls it the Harness Framework.

In effect, it equips coding agents with the complete infrastructure that physical experiments require.

This infrastructure consists of four parts, which also correspond to the name ENPIRE:

EN (Environment): Sets up the experimental environment, including safety boundaries, automatic reset, and automatic scoring.
PI (Policy Improvement): Agents propose new solutions based on the task objectives. They can try behavior cloning, reinforcement learning, heuristic rules, or a combination of several methods.
R (Rollout): Deploys the new policy on real robots and records trajectories, videos, and sensor signals.
E (Evolution): The core of multi-agent collaboration. 8 agents each occupy a robot, share code through Git, absorb effective solutions from each other, and eliminate failed routes.

After connecting the four modules, a complete closed loop is formed:

Propose an idea → Train a policy → Test on a real machine → Automatic scoring → Summarize experiences → Propose a new idea.

The entire process doesn't require human supervision. Agents are responsible for conducting experiments and learning from them on their own.
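The closed loop above can be sketched in a few lines of Python. This is only an illustration of the control flow, not ENPIRE's code: `propose_idea` and `train_and_rollout` are hypothetical stand-ins for the coding agent and for training plus real-robot scoring, and the scores are random numbers.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Experiment:
    idea: str
    success_rate: float


@dataclass
class ResearchLog:
    history: list = field(default_factory=list)

    def best(self):
        # Highest-scoring experiment so far, or None before the first run.
        return max(self.history, key=lambda e: e.success_rate, default=None)


def propose_idea(log):
    # Hypothetical stand-in for the agent: extend the best idea found so far.
    parent = log.best()
    base = parent.idea if parent else "baseline"
    return f"{base}+tweak{len(log.history)}"


def train_and_rollout(idea, rng):
    # Hypothetical stand-in for training a policy and auto-scoring it on hardware.
    return rng.random()


def research_loop(iterations, seed=0):
    rng = random.Random(seed)
    log = ResearchLog()
    for _ in range(iterations):
        idea = propose_idea(log)                      # propose an idea
        score = train_and_rollout(idea, rng)          # train, test, auto-score
        log.history.append(Experiment(idea, score))   # record the experience
    return log
```

Calling `research_loop(5)` returns a log of five experiments, and `log.best()` picks the one the next proposal would build on.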

The most crucial part is actually the Environment module, because it solves the biggest headache in embodied intelligence research:

How to make the experiment run automatically.

In the simulation environment, resetting often only requires a single command: env.reset()

But there is no env.reset() in the real world.

After a failed experiment, the robot must first restore the scene to its initial state before the next round of the experiment can start.

Take the GPU Insertion task as an example. To reset, the robot first has to unplug the GPU from the motherboard, move it to a specified position and release it, and then return to its initial state.

The entire process involves complex force-control operations, because a slight mistake may damage the GPU pins.

The same goes for automatic scoring.

For example, in the Zip-tie threading task, the agent needs to determine: "Has the end of the zip-tie successfully passed through the head of the zip-tie?"

To answer this question, the agent even designed its own visual detection solution.

Two cameras, one on top and one on the side, observe the target area simultaneously, and each runs its own image segmentation. Only when both views confirm that the end of the zip-tie has passed through the head does the system judge the experiment a success.

The end-to-end detection latency is kept under 150 milliseconds, which is close to human visual reaction speed.
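The two-view agreement rule can be illustrated as follows. This is a minimal sketch, not the agent's actual detector: the masks are plain nested lists standing in for segmentation output, and `view_confirms`, the region format, and the `min_pixels` threshold are all assumptions made for the example.

```python
def view_confirms(mask, region, min_pixels=3):
    """Return True if enough segmented pixels fall inside the target region.

    mask is a 2D list of 0/1 values; region is (row0, row1, col0, col1),
    half-open on the upper bounds. Both formats are hypothetical.
    """
    r0, r1, c0, c1 = region
    hits = sum(mask[r][c] for r in range(r0, r1) for c in range(c0, c1))
    return hits >= min_pixels


def zip_tie_threaded(top_mask, side_mask, top_region, side_region):
    # Success requires BOTH camera views to confirm, as described above.
    return view_confirms(top_mask, top_region) and view_confirms(side_mask, side_region)
```

Requiring both views trades a few missed successes for fewer false positives, which matters when the score feeds directly back into the agent's next decision.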

Once the automatic reset, automatic scoring, and safety control interfaces have been tuned, they are frozen into standard APIs.

Subsequent agents don't need to worry about the underlying experimental process when conducting research.
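To make the idea of "standard APIs" concrete, here is one way such an interface could look. The article does not show ENPIRE's actual API, so the class and method names below (`RealWorldEnv`, `reset`, `step`, `score`, `is_safe`) are hypothetical, and `ToyPinEnv` is an in-memory stand-in so the sketch runs without hardware.

```python
from abc import ABC, abstractmethod


class RealWorldEnv(ABC):
    """Hypothetical harness interface; names are illustrative only."""

    @abstractmethod
    def reset(self): ...            # restore the scene to its initial state

    @abstractmethod
    def step(self, action): ...     # execute one action on the robot

    @abstractmethod
    def score(self): ...            # automatic success check

    @abstractmethod
    def is_safe(self, action): ...  # safety-boundary check


class ToyPinEnv(RealWorldEnv):
    """Toy stand-in: 'insertion' succeeds once depth reaches 2.0."""

    def __init__(self, limit=1.0):
        self.limit = limit
        self.depth = 0.0

    def reset(self):
        self.depth = 0.0  # a real reset would physically move objects back

    def is_safe(self, action):
        return abs(action) <= self.limit

    def step(self, action):
        if not self.is_safe(action):
            raise ValueError("action outside safety boundary")
        self.depth += action

    def score(self):
        return self.depth >= 2.0


def run_episode(env, actions):
    # The agent-facing loop only touches the four interface methods.
    env.reset()
    for action in actions:
        env.step(action)
    return env.score()
```

With an interface of this shape, research code calls `run_episode` and never handles the reset or scoring details itself.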

Thus, for the first time, the real world has become a research environment that can be called repeatedly and optimized continuously.

Good Agents Are No Worse Than Researchers

Of course, having an experimental platform alone is not enough. The really interesting question is:

After you've prepared the robots, GPUs, and tokens, can agents actually conduct research?

The answer given by ENPIRE is: Yes, and they do it quite well.

The paper verified the method on four difficult dexterous manipulation tasks:

Push-T (push the T-shaped block to the target position), Pin Insertion (insert the pin into a 4-millimeter hole), GPU Insertion (insert the GPU into the motherboard slot), and Zip-tie (thread and cut the zip-tie).

Ultimately, the success rate of all four tasks reached 99%.

But what's more interesting than the results is the process by which the agents achieved these results, especially in the Pin Insertion task.

The paper directly published the agent's Idea Tree, which is the complete evolution process of its research ideas.

From it, we can clearly see a very familiar research path:

First, try behavior cloning. The effect is average.
Add online reinforcement learning data, and the performance starts to improve.
Add a regularization term, and the success rate increases significantly.
Then continue to adjust the batch size, compensate for the controller delay, and further improve the stability.

Like a human researcher, the agent worked step by step, raising the success rate from nearly zero to nearly 100%.

Throughout the process, no human told it which modules to add, and no human specified the experimental order.

All solutions come from the agent's own hypotheses, which are then verified through real experiments.
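An Idea Tree of this kind is straightforward to represent as a data structure. The sketch below is an assumption about how one might store it, not the paper's format: `IdeaNode` and `best_path` are hypothetical names, and the scores in the usage lines are placeholders, not results from the paper.

```python
from dataclasses import dataclass, field


@dataclass
class IdeaNode:
    name: str
    success_rate: float
    children: list = field(default_factory=list)

    def add(self, name, success_rate):
        # Attach a follow-up idea that builds on this one.
        child = IdeaNode(name, success_rate)
        self.children.append(child)
        return child


def best_path(node):
    """Return (path, score) for the root-to-node path ending at the highest score."""
    best = ([node.name], node.success_rate)
    for child in node.children:
        path, score = best_path(child)
        if score > best[1]:
            best = ([node.name] + path, score)
    return best


# Placeholder scores for illustration only.
root = IdeaNode("behavior cloning", 0.2)
rl = root.add("+ online RL data", 0.5)
rl.add("+ regularization term", 0.9)
root.add("abandoned branch", 0.1)
path, score = best_path(root)
```

Here `path` traces the sequence of additions that led to the best policy, while abandoned branches stay in the tree as a record of what did not work.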

If we hide these records and only look at the research process, it's hard to say that there is any fundamental difference between this and a robot Ph.D. student conducting research in the laboratory.

What's even more interesting is that the agent can even actively change the research route according to the task characteristics.

In the Zip-tie task, it quickly found that end-to-end training was not effective.

The reason is simple: the task has too many steps.

Find the scissors → Grab the scissors → Find the zip-tie → Align the position → Complete the cutting.

The operation chain spans multiple stages, and an end-to-end policy alone struggles to learn it well. So the agent changed its route on its own.

It first uses a VLA (Vision-Language-Action) model for rough positioning, and then calls the tool API to perform fine-grained operations.

To some extent, it even designed a system architecture on its own.

The most direct point of comparison is Autoresearch, proposed by Karpathy some time ago.

Both are essentially doing the same thing: Let AI automatically propose ideas, run experiments, compare results, and continue to iterate based on the results.

The difference is that Autoresearch takes place in the digital world. If the code is written wrong, it can be rewritten; if the experiment goes wrong, it can be restarted.

Computing power is almost the only cost. ENPIRE, by contrast, has moved this research cycle into the physical world for the first time. Robots are not code.

You can't git revert a damaged robotic arm. In the real world, friction changes, object positions change, lighting changes, and sensors generate noise.

The core value of ENPIRE is that, through automatic reset, automatic scoring, and safety control interfaces, it packages the originally chaotic physical world into an experimental environment that agents can call repeatedly.

For agents, the real world has gained, for the first time, the kind of iterability found in a software development environment.

Another interesting discovery is the so-called "Physical Scaling".

In the past, large models scaled parameters, data, and computing power. ENPIRE starts to scale the number of experiments.

In the paper, 8 agents each occupy one robot and explore different routes simultaneously.

As a result, the time required for the Pin Insertion task to reach the target success rate was shortened from 1.5 hours in single-robot mode to 40 minutes.
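A quick calculation puts those two times in perspective. It assumes the 40-minute figure comes from the 8-robot setup described above; `parallel_speedup` is a helper written for this example, and "efficiency" here simply means speedup divided by the number of robots.

```python
def parallel_speedup(t_single_min, t_parallel_min, n_robots):
    """Return (speedup, efficiency) for a parallel run versus a single robot."""
    speedup = t_single_min / t_parallel_min
    efficiency = speedup / n_robots  # 1.0 would be perfect linear scaling
    return speedup, efficiency


# 1.5 hours = 90 minutes on one robot, 40 minutes on 8 robots (assumed).
speedup, efficiency = parallel_speedup(90, 40, 8)
print(f"speedup {speedup:.2f}x, efficiency {efficiency:.0%}")
# prints "speedup 2.25x, efficiency 28%"
```

Under that assumption, eight robots deliver a 2.25x speedup rather than 8x, so the scaling is real but well short of linear.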

In other words, where large models in the past expanded GPU clusters, ENPIRE expands a fleet of robots.

Of course, this kind of scaling is not cheap.

As the number of agents increases, each agent needs to read the code of other agents, understand others' discoveries, summarize experiences, and synchronize knowledge.

Therefore, the consumption of tokens increases faster than the number of robots. The paper even specifically proposed two indicators to measure this cost:

Mean Robot Utilization: How much time the robots are actually used for experiments.
Mean Token Utilization: How many tokens the system burns per minute.
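Read literally, the two indicators could be computed as below. The paper's exact definitions may differ; this sketch assumes robot utilization is the average fraction of wall-clock time each robot spends running experiments, and token utilization is total tokens divided by wall-clock minutes. The function names and inputs are hypothetical.

```python
def mean_robot_utilization(busy_minutes, wall_minutes):
    """Average fraction of wall-clock time each robot spent on experiments.

    busy_minutes holds one entry per robot.
    """
    return sum(b / wall_minutes for b in busy_minutes) / len(busy_minutes)


def mean_token_utilization(total_tokens, wall_minutes):
    """Tokens consumed per minute of wall-clock time."""
    return total_tokens / wall_minutes
```

For instance, two robots busy for 30 and 60 minutes of a 60-minute run give a mean robot utilization of 0.75.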

By now, we can probably understand why Jim Fan is so excited: they found that research itself seems to have become scalable.

Even the inheritance of experience has emerged. There is a very interesting experiment in the paper:

The experience accumulated by the agent in the Pin Insertion task was summarized into a text, which was then directly put into the prompt of the GPU Insertion task.

As a result, the subsequent research efficiency was significantly improved. Note that what was transferred here was neither the model weights nor the training data.

It was a research note, much like the notes a mentor hands down to a new student in a human lab.

The Last Piece of the Puzzle for The Great Parallel

In May this year, Jim Fan gave a speech at the Sequoia Capital AI Ascent Conference and proposed The Great Parallel framework: the robotics field is rapidly following the path that large language models have taken.

Counting the latest addition, autonomous research, language models go through four stages: pre-training, alignment fine-tuning, reinforcement learning for reasoning, and autonomous research.

Robots are also going through the same four steps, except that the medium for each step has changed from text to the physical world.

NVIDIA already has work in place for the first three steps. In the pre-training stage, there is EgoScale (which trains motion priors on 20,000 hours of first-person human video) and DreamZero, a brand-new World Action Model (WAM) that uses a video world model to predict the next physical state, rather than a language model predicting the next token. In the alignment stage, a small amount of sensorized human data is used for action fine-tuning.

In the reinforcement learning stage, there is Dream Dojo, a purely neural simulator that uses no physics engine and instead generates the simulated environment directly from a video world model. Robots perform RL in the "dream".

But the fourth step, autonomous research, has never had an executable implementation in the physical world. ENPIRE is this step.

Wenli Xiao, the first author, wrote on Twitter:

Autoresearch has finally left the sandbox and entered the embodied world.

The views in this article are solely those of the author. The 36Kr platform only provides information storage services.
