Shanghai Jiao Tong University Gives VLA a Sense of Space: 0.9B Model Reaches About 90% Success Rate on Real Robots

A balance of task performance, cost, and real-time operation

Robots can see, but they may not see accurately.

Many VLA models still rely mainly on 2D vision. On tasks that require spatial perception, such as precise positioning, fine placement, and occlusion judgment, their success rates drop significantly.

There are two ways to supplement spatial information, but both come with costs.

The explicit 3D approach relies on depth sensors and point cloud reconstruction. Its hardware chain is long and it is sensitive to calibration errors. The implicit 3D approach learns geometry from RGB images, which saves on hardware. However, many such solutions rely on heavy base models, so training and inference costs are relatively high.

Now, the MINT team from Shanghai Jiao Tong University has proposed an intermediate approach:

Evo-Depth, a model with approximately 0.9B parameters. It adds no extra hardware. Instead, it builds spatial awareness into the VLA policy through a compact implicit depth encoding, balancing performance and deployment efficiency both in simulation and on real robots.

In simulation, it achieves 84.4% on Meta-World and 95.4% on LIBERO. On real robots, the average success rate is about 90%. On the deployment side, it requires about 3.2 GB of GPU memory and runs inference at about 12.3 Hz.

The code, weights, and training scripts are fully open-sourced.

Lightweight and end-to-end trainable

To get straight to the point, the core idea of Evo-Depth is:

Extract compact implicit depth representations from multi-view RGB images, then integrate them into the vision-language pathway in a lightweight manner, and finally output continuous actions through a flow-matching action expert.

The entire system mainly consists of three parts:

1. IDEM: Implicit Depth Encoding Module.

IDEM extracts implicit depth features from multi-view images, emphasizing spatial layout and relative geometric relationships rather than explicitly generating costly 3D intermediate representations.

In the paper, the IDEM backbone has about 0.13B parameters and is initialized from multi-view depth pre-training, which introduces depth-related inductive biases while keeping the module lightweight.

2. SEM: Spatial Enhancement Module.

SEM uses implicit depth as a modulation signal to enhance vision-language representations.

Compared with directly adding an independent depth branch, this fusion method is more restrained:

The original VLM continues to be responsible for semantic understanding.

The depth features are mainly responsible for spatial enhancement.

At the same time, the design aims to keep latency and GPU memory overhead under control.
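
One way to picture this kind of modulation is a FiLM-style scale-and-shift, sketched below in plain Python. The article does not specify SEM's actual fusion operator, so the form `x * (1 + gamma) + beta`, the function names, and the tiny dimensions are all illustrative assumptions rather than the authors' implementation.

```python
# Illustrative FiLM-style modulation; NOT the published SEM architecture.
# All names, shapes, and the scale-and-shift form are assumptions.

def linear(vec, weight, bias):
    """Dense layer: `weight` holds one row per output dimension."""
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weight, bias)]

def modulate(vl_token, depth_feat, w_gamma, b_gamma, w_beta, b_beta):
    """Scale and shift one vision-language token using depth features."""
    gamma = linear(depth_feat, w_gamma, b_gamma)  # per-channel scale offset
    beta = linear(depth_feat, w_beta, b_beta)     # per-channel shift
    return [x * (1.0 + g) + b for x, g, b in zip(vl_token, gamma, beta)]

vl_token = [1.0, 2.0]          # hypothetical 2-dim VLM token
depth_feat = [0.5, -0.5, 1.0]  # hypothetical 3-dim implicit depth feature
zeros_w = [[0.0] * 3 for _ in range(2)]
zeros_b = [0.0, 0.0]

# Zero-initialised projections leave the token untouched.
unchanged = modulate(vl_token, depth_feat, zeros_w, zeros_b, zeros_w, zeros_b)

# A non-zero scale projection lets depth re-weight the first channel only.
w_gamma = [[1.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
scaled = modulate(vl_token, depth_feat, w_gamma, zeros_b, zeros_w, zeros_b)
```

With zero-initialised projections the module starts out as an identity, so pretrained vision-language features pass through unchanged until the depth pathway has learned something useful. That is one common way to keep a fusion step conservative.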

3. Progressive Alignment Training.

Joint training of multiple modules often suffers from optimization instability.

To address this, the authors adopt Progressive Alignment Training, which proceeds in phases: depth representation alignment, then multi-modal fusion, then action learning.
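
A phased schedule of this kind is often implemented by switching which modules receive gradient updates in each stage. The sketch below illustrates only that pattern: the stage names follow the article, but which modules are trainable in which stage is a hypothetical choice made for illustration, and the module names are placeholders.

```python
# Hypothetical stage-to-trainable-modules mapping; the article names the
# three stages but does not say which modules are frozen in each.
MODULES = ("idem", "sem", "vlm", "action_expert")

STAGES = [
    ("depth_representation_alignment", {"idem"}),
    ("multimodal_fusion", {"idem", "sem"}),
    ("action_learning", {"sem", "action_expert"}),
]

def trainable_flags(stage_name):
    """Return {module: bool} saying which modules get gradients in a stage."""
    for name, trainable in STAGES:
        if name == stage_name:
            return {m: m in trainable for m in MODULES}
    raise KeyError(f"unknown stage: {stage_name}")

# Walk through the stages in order and report what would be trained.
for name, _ in STAGES:
    flags = trainable_flags(name)
    active = [m for m in MODULES if flags[m]]
    print(f"{name}: train {active}")
```

In a framework such as PyTorch, flags like these would typically be applied by setting `requires_grad` on each module's parameters before building the optimizer for that stage.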

The action head uses the flow-matching approach common in current VLA models.
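
At inference time, a flow-matching head typically produces an action by starting from Gaussian noise and integrating a learned velocity field from t = 0 to t = 1. The following is a toy sketch of that sampling loop using Euler steps. The closed-form `toy_velocity` stands in for the learned, observation-conditioned network, and the step count and 3-dimensional action are assumptions; none of this is taken from the Evo-Depth code.

```python
import random

def toy_velocity(x, t, target):
    """Stand-in for a learned velocity network v(x, t | observation).

    For straight-line paths that all end at one target point, the ideal
    velocity field is (target - x) / (1 - t).
    """
    return [(g - xi) / (1.0 - t) for xi, g in zip(x, target)]

def sample_action(target, num_steps=10, seed=0):
    """Euler-integrate the velocity field from noise (t=0) towards t=1."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in target]  # start from Gaussian noise
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt                              # t stays strictly below 1
        v = toy_velocity(x, t, target)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

target = [0.2, -0.1, 0.05]  # hypothetical 3-dim continuous action
action = sample_action(target)
```

In a real policy the velocity network is trained by regression on interpolated noise-action pairs, and the number of integration steps trades action quality against inference latency.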

With about 0.9B parameters in total, the results reported in the paper are as follows.

Simulation: 84.4% on Meta-World, 41.1% on VLA-Arena, 95.4% on LIBERO, 69.6% on LIBERO-Plus.

Real robot: The average success rate is about 90%.

Deployment: It requires about 3.2 GB of GPU memory and runs inference at about 12.3 Hz.

It's worth noting that, beyond benchmark scores, the paper also reports deployment-side overhead and real-time performance figures.

For a VLA model that has to run inside the robot control loop, this information is often just as important.
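
As a quick worked example of why the frequency figure matters: an inference rate of about 12.3 Hz corresponds to roughly 81 ms per forward pass. The snippet below does that conversion and compares it with a control period. The 10 Hz control rate is a hypothetical figure chosen for illustration, not one reported in the article.

```python
INFERENCE_HZ = 12.3   # reported in the article (approximate)
CONTROL_HZ = 10.0     # hypothetical control-loop rate, for illustration

def period_ms(rate_hz):
    """Convert a rate in Hz to a period in milliseconds."""
    return 1000.0 / rate_hz

inference_ms = period_ms(INFERENCE_HZ)
control_ms = period_ms(CONTROL_HZ)

# Can one full inference finish within a single control tick?
fits = inference_ms <= control_ms

print(f"inference: {inference_ms:.1f} ms, "
      f"control budget: {control_ms:.1f} ms, fits: {fits}")
```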

Trade-off between performance, cost, and real-time performance

Ultimately, the problem that Evo - Depth solves can be summarized in one sentence:

How to improve the spatial ability of VLA without significantly increasing the system burden.

Compared with a pure 2D VLA, it adds spatial information; compared with heavier 3D approaches, it aims to preserve deployment efficiency.

For teams working on robot manipulation, spatial intelligence, or VLA systems, solutions that trade off performance, cost, and real-time operation in this way may become increasingly important.

Official repository: https://github.com/MINT-SJTU/Evo-Depth

Model weights: https://huggingface.co/MINT-SJTU/EVO-Depth-LIBERO

This article is from the WeChat official account “QbitAI”, written by the MINT team from Shanghai Jiao Tong University, and published by 36Kr with authorization.

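The views expressed in this article are solely those of the author; the 36Kr platform only provides information storage services.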
