Achieve 120-Fold Acceleration: Peking University, CUHK, and Shanghai AI Lab Develop VGGT-Edit for 3D Scene Editing in 5 Seconds!

No more falling back to 2D

3D models can "see" the world, but they can't yet "modify" it.

From NeRF to 3D Gaussian Splatting, and then to feed-forward 3D reconstruction models such as VGGT and π³, progress across the field has accelerated significantly. With just a few pictures, a complete 3D scene can be reconstructed within seconds.

But this is exactly where the problem lies. Although these models can understand the 3D world, they can't yet modify it. You can ask one to reconstruct a room, but it's hard to actually tell it:

Move the chair to the window, delete the middle chair, and change the gray leather sofa to a white shaggy sofa.

What's even more troublesome is that once complex editing is involved, the existing methods often break down rapidly. For example, the chair disappears from certain angles but reappears from another perspective; the background that wasn't supposed to be changed deforms along with it.

To address this challenge, research teams from institutions such as Peking University, The Chinese University of Hong Kong, Shanghai AI Lab, and NTU have proposed a native 3D editing framework: VGGT-Edit.

The core idea can be summarized in one sentence:

Instead of going back to 2D, perform editing directly in the 3D space.

On the DeltaScene test set, VGGT-Edit outperforms existing methods in three dimensions: semantic consistency, multi-view stability, and inference speed. A single edit takes only about 5 seconds, up to 120 times faster than existing methods.

The problem has always been with 2D

Most current 3D editing methods are essentially based on "2D thinking". They first break down the scene into multiple 2D images, edit each image one by one, and then piece them back together into a 3D scene.

However, since each perspective is processed independently, the following issues are likely to occur:

The chair has been deleted from one perspective;
The chair reappears from another angle;
The background area drifts along;
Ghosting and flickering appear at the edges of objects.

Many results look more like "Photoshopped images from different angles" than a truly stable 3D space.

For fields such as robotics, AR/VR, and spatial intelligence, this is almost a fatal problem. What these scenarios really need is not just "looking correct from a certain angle", but a consistently stable 3D world.

Native 3D editing is starting to move from concept to usability

The core idea of VGGT-Edit is very straightforward: since the problem comes from 2D, don't go back to 2D.

The entire framework is built on a VGGT-like feed-forward reconstruction model, inheriting its fast and efficient 3D representation ability. Interestingly, the team didn't choose to regenerate the entire scene but proposed an ingenious mechanism:

Residual Field Prediction.

Put simply, the model first retains the stable 3D structure of the original scene and then only learns "where changes are needed". For example:

Move the chair to the right;
Change the material of the sofa;
Delete an object;
Add a piece of furniture.

These changes are represented as: New scene = Original scene + Local residual changes

This design has an important advantage. Since most areas don't need to change, the model doesn't need to "regenerate the entire world". It only needs to modify local areas, so the unedited background stays very stable.

This is also one of the most obvious differences between VGGT-Edit and many existing methods.
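The composition rule above can be illustrated with a minimal sketch. It assumes the scene is a small list of 3D points and the residual field is a per-point offset that is zero wherever nothing changes. The names `apply_residual`, `scene`, and `residual` are hypothetical and are not taken from the paper, which predicts the residual with a network rather than writing it by hand.

```python
from typing import List, Tuple

Point = Tuple[float, float, float]


def apply_residual(scene: List[Point], residual: List[Point]) -> List[Point]:
    """New scene = original scene + residual, applied point by point."""
    return [
        (x + dx, y + dy, z + dz)
        for (x, y, z), (dx, dy, dz) in zip(scene, residual)
    ]


# Toy scene: two background points followed by two points of a chair.
scene = [(0.0, 0.0, 0.0), (4.0, 0.0, 0.0), (1.0, 0.0, 1.0), (1.0, 0.5, 1.0)]

# "Move the chair to the right": only the chair points get a non-zero offset.
residual = [(0.0, 0.0, 0.0), (0.0, 0.0, 0.0), (0.5, 0.0, 0.0), (0.5, 0.0, 0.0)]

edited = apply_residual(scene, residual)

# Background points are carried over exactly; only the chair has moved.
background_unchanged = edited[:2] == scene[:2]
```

Because the residual is zero over the background, those points come out identical to the input, which is the stability property the paragraph above describes.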

Text semantics begin to truly "align" with 3D space

The research team found that if the text is simply fed into the model, a common situation may occur: the model knows "what you want to change" but doesn't know "where to change it".

To solve this problem, VGGT-Edit introduces a key mechanism:

Depth-Synchronized Text Injection

Essentially, it can be understood as continuously synchronizing text semantics and 3D space features at the same depth level.

Traditional methods usually inject text information only once at the beginning, but VGGT-Edit continuously fuses text semantics at multiple key layers. In this way, during the entire 3D generation process, the model always knows:

Which area should be modified currently;
What the modification target is;
Where the spatial location is.
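The difference between injecting text once and injecting it at several layers can be sketched as follows. This is a toy illustration, not the paper's architecture: `backbone_layer` is a stand-in for a transformer block, `inject_text` is a cross-attention step without learned projections, and all names and the choice of layers are assumptions made for the example.

```python
import math
from typing import List, Set

Vec = List[float]


def softmax(xs: Vec) -> Vec:
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def backbone_layer(tokens: List[Vec]) -> List[Vec]:
    """Stand-in for one backbone block: blend each token with the token mean."""
    dim = len(tokens[0])
    mean = [sum(t[d] for t in tokens) / len(tokens) for d in range(dim)]
    return [[0.5 * t[d] + 0.5 * mean[d] for d in range(dim)] for t in tokens]


def inject_text(tokens: List[Vec], text: List[Vec]) -> List[Vec]:
    """Each scene token attends to the text tokens and adds the result."""
    scale = math.sqrt(len(text[0]))
    out = []
    for tok in tokens:
        scores = [sum(a * b for a, b in zip(tok, t)) / scale for t in text]
        weights = softmax(scores)
        attended = [
            sum(w * t[d] for w, t in zip(weights, text)) for d in range(len(tok))
        ]
        out.append([a + b for a, b in zip(tok, attended)])
    return out


def forward(tokens: List[Vec], text: List[Vec], num_layers: int,
            inject_at: Set[int]) -> List[Vec]:
    """Run the backbone, fusing text after every layer listed in inject_at."""
    for layer in range(num_layers):
        tokens = backbone_layer(tokens)
        if layer in inject_at:
            tokens = inject_text(tokens, text)
    return tokens


scene_tokens = [[1.0, 0.0], [0.0, 1.0]]
text_tokens = [[1.0, 0.0], [0.0, 2.0]]

once = forward(scene_tokens, text_tokens, num_layers=4, inject_at={0})
synced = forward(scene_tokens, text_tokens, num_layers=4, inject_at={0, 1, 2, 3})
```

In the `once` run the text signal enters at the first layer and is then diluted by the remaining blocks, whereas in the `synced` run every block sees the instruction again, which is the intuition behind fusing text at multiple depths.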

Meanwhile, the team also designed a "view importance weighting" mechanism. Not all views are equally reliable: some angles may be occluded, and some views show only half of an object.

VGGT-Edit can automatically determine which views are more trustworthy, ultimately making the multi-view editing results more stable.
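One simple way to picture view weighting is a softmax over per-view reliability scores, followed by a weighted average of what each view predicts. The sketch below assumes exactly that; the scores, the function names `view_weights` and `fuse`, and the use of a softmax are illustrative assumptions, since the article does not say how the weights are computed.

```python
import math
from typing import List, Sequence


def view_weights(scores: Sequence[float]) -> List[float]:
    """Turn raw reliability scores into weights that sum to 1 (softmax)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse(per_view: Sequence[Sequence[float]], scores: Sequence[float]) -> List[float]:
    """Weighted average of per-view predictions for one 3D point."""
    weights = view_weights(scores)
    dim = len(per_view[0])
    return [sum(w * v[d] for w, v in zip(weights, per_view)) for d in range(dim)]


# Three views estimate how far a chair point should move along x, y, z.
# The third view is occluded and wrongly reports "no movement".
per_view_offsets = [(0.5, 0.0, 0.0), (0.5, 0.0, 0.0), (0.0, 0.0, 0.0)]
reliability = [2.0, 2.0, -2.0]  # hypothetical scores; low means untrustworthy

weights = view_weights(reliability)
fused = fuse(per_view_offsets, reliability)

# A plain average would be pulled toward the occluded view's wrong answer.
plain_average_x = sum(v[0] for v in per_view_offsets) / len(per_view_offsets)
```

With the occluded view down-weighted, the fused offset stays close to what the two clear views agree on, while the plain average is dragged toward zero.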

An editing head truly designed for "3D editing"

In addition to the overall framework, VGGT-Edit also has a crucial component: an editing head specifically designed for 3D editing tasks.

The research team found that for VGGT-like models, the original reconstruction head focuses more on "how to restore the scene", but the real problem that 3D editing needs to solve is how to modify only local areas while keeping the whole stable.

Therefore, VGGT-Edit adds an editing branch that specifically predicts local changes in the scene.

This editing head directly acts on the 3D representation space and outputs the corresponding residual field changes. Essentially, it learns:

Which areas should remain unchanged;
Which areas need to be edited;
How to maintain multi-view consistency after editing.

Compared with directly regenerating the entire scene, this method is more stable and efficient. This is also a key step in enabling the VGGT-like feed-forward reconstruction model to have editing capabilities.

A dataset of 100,000 samples specifically for training "3D editing"

To train VGGT-Edit, the team built a new 3D editing dataset, DeltaScene, with nearly 100,000 sample groups covering scenarios such as living rooms, offices, residences, and commercial spaces.

More importantly, the entire data generation process is highly automated.

The team used Qwen3.5-Plus, SAM3, and Qwen-Image-Editing-Max to automatically complete editing instruction generation, target recognition, multi-view editing, and 3D consistency filtering, ultimately obtaining training data that truly meets "multi-view geometric consistency".

For native 3D editing, this step is very crucial. What the model really needs to learn is not just "image changes", but how the same edit can maintain spatial consistency from different perspectives.

3D editing begins to approach real-time interaction

The results show that this approach is indeed effective.

On the DeltaScene test set, VGGT-Edit outperforms existing methods in three dimensions: semantic consistency, multi-view stability, and inference speed.

Especially in complex tasks such as adding furniture, adjusting positions, and modifying materials, many traditional methods still show an obvious pasted-on "texture mapping" look and geometric drift, whereas the results generated by VGGT-Edit look more like a real, stable 3D space.

Even more important is speed. According to the paper, a single edit with VGGT-Edit takes only about 5 seconds. Compared with many traditional methods that require lengthy optimization, it can be up to 120 times faster.

This means that 3D editing is truly starting to approach real-time interaction for the first time.

For fields such as robotics, digital twins, and AR/VR, this change is very important. Only when editing is fast enough can the 3D world truly become an "interactive" world.
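As a back-of-the-envelope reading of these two figures, and assuming the 120-fold speedup is measured against the roughly 5-second edit time (the article does not name the baseline that reaches it), the slowest compared method would need on the order of ten minutes per edit:

```latex
t_{\text{baseline}} \approx 120 \times t_{\text{edit}} \approx 120 \times 5\,\text{s} = 600\,\text{s} = 10\,\text{min}
```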

The model is starting to truly understand "spatial changes"

There is also a very interesting experiment in the paper. The researchers input an instruction that had never appeared in the training data: "Rotate the middle chair 90 degrees clockwise."

The model still successfully completed the edit.

This shows that what VGGT-Edit has learned is not just fixed templates. It is starting to understand how text semantics map to changes in 3D space.

And this may be more important than "being able to generate 3D" itself. For spatial intelligence, the truly crucial ability in the future may not be "generating a world", but being able to modify the world freely, stably, and in real time, as a human would.

VGGT-Edit takes a step in that direction.

Paper link: https://arxiv.org/abs/2605.15186

This article is from the WeChat official account "Quantum Bit", author: VGGT-Edit team. Republished by 36Kr with authorization.

The views expressed in this article are solely those of the author. The 36Kr platform only provides information storage services.