
Netflix has also released a video model: It's not just about "erasing," but "rewriting" the physical world.

[Account deleted] 2026-04-08 07:43
Netflix proposed the VOID framework to achieve the removal of video objects with physical interaction perception.

Video object removal is a fundamental task in video editing. Existing methods have performed excellently in handling "simple" removal scenarios, such as filling in the background behind an object after removing it, or eliminating its shadow and reflection.

But here comes the question: What if the object to be removed has physical interactions with other objects in the scene?

Imagine a row of dominoes falling in a chain reaction. If we use a video inpainting model to remove a few dominoes in the middle, existing methods would let the remaining dominoes continue to fall, which is physically impossible because there are no dominoes to push them. Another example is a person spinning a top with their hand. If we remove the hands, the top should continue to spin on its own, rather than suddenly disappearing or stopping.

These scenarios require the model to have causal reasoning ability: not only to "remove" the object, but also to deduce "how the entire scene would develop if this object had never existed." This is exactly the ability that current video editing models generally lack.

In this work, the Netflix team and its collaborators proposed the "Video Object and Interaction Deletion" (VOID) framework.

Paper link: https://arxiv.org/pdf/2604.02296

VOID not only removes the target object but also plausibly models the physical chain reactions triggered by its removal. The framework contains three core innovations: a counterfactual dataset built with a physics simulation engine, an interaction-aware "quadmask" conditioning strategy, and a vision-language model (VLM) that automatically identifies affected regions at inference time.

It is worth noting that VOID is built on Zhipu's CogVideoX video generation model and fine-tuned for the video inpainting task with interaction-aware mask conditioning.

The results show that VOID was preferred in 64.8% of human preference evaluations, far exceeding second-place Runway's 18.4%.

VOID also generalizes to physical effects that never appeared in the training data, such as a balloon floating away after the bear holding it is removed, or a blender failing to start after the person pressing its button is removed. This suggests that VOID does not simply memorize training samples, but reasons using the physical intuition of the underlying model.

Overall, this work provides an important reference for video editing models to move towards a "world simulator".

"Video removal" with a better understanding of physics

VOID is built on the CogVideoX DiT backbone and initialized from the pre-trained weights of Generative Omnimatte, inheriting its hierarchical object-effect decoupling ability.

On this basis, the research team uses counterfactual data pairs and quadmasks for fine-tuning, allowing the model to learn to generate physically reasonable new motion trajectories after removing the object.

The overall process of VOID is as follows: The user provides a video and specifies the object to be removed. The system automatically infers which areas will change due to the disappearance of the object, and then generates a physically reasonable counterfactual video.

Figure | Schematic diagram of VOID

1. VLM-guided quadmask generation during inference

During inference, the user simply clicks on the target object. The system then uses a vision-language model (VLM) to analyze the scene, infer which objects will be affected, and predict where they will appear in the counterfactual scenario. The process is as follows:

1) The VLM receives the video and the object mask and outputs a list of descriptions of the affected objects;

2) Use SAM 3 to segment the affected objects and obtain their original position masks;

3) Overlay a spatial grid on the video, and the VLM predicts the new positions of these objects in the counterfactual scenario;

4) Merge the two sets of masks to generate the final quadmask.
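The four-step pipeline above can be sketched as mask bookkeeping. In the sketch below, the VLM and SAM calls are replaced by toy arrays (the function and channel names are my own illustration, not the paper's API), and "quadmask" is read as a four-channel conditioning mask: the removed object, its effects, and the affected objects at their original and counterfactual positions.

```python
# Hypothetical sketch of quadmask assembly; VLM/SAM outputs are stubbed
# with toy arrays, and the channel layout is one plausible reading of
# the paper's "quadmask", not a confirmed specification.
import numpy as np

H, W = 4, 6  # tiny frame for illustration

def merge_quadmask(object_mask, effect_mask, affected_orig, affected_new):
    """Stack four mask components into one (4, H, W) conditioning tensor.

    Channels (assumed layout):
      0: object to remove
      1: its shadows/reflections (effects)
      2: affected objects at their original positions (from SAM)
      3: affected objects at their VLM-predicted counterfactual positions
    """
    return np.stack([object_mask, effect_mask,
                     affected_orig, affected_new]).astype(np.float32)

# Toy masks standing in for VLM + SAM outputs
object_mask   = np.zeros((H, W)); object_mask[1:3, 1:3] = 1   # removed object
effect_mask   = np.zeros((H, W)); effect_mask[3, 1:3] = 1     # its shadow
affected_orig = np.zeros((H, W)); affected_orig[1:3, 4] = 1   # e.g. held balloon
affected_new  = np.zeros((H, W)); affected_new[0, 4] = 1      # balloon floated up

quadmask = merge_quadmask(object_mask, effect_mask, affected_orig, affected_new)
print(quadmask.shape)  # (4, 4, 6)
```

The point of keeping the components separate rather than merging them into a single binary mask is that the model can treat "inpaint this region" and "synthesize new motion here" differently.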

2. Two-stage inference

Based on the generated quadmask, VOID generates the final result through two-stage inference.

First stage: counterfactual trajectory synthesis. The model generates a preliminary counterfactual prediction from the input video and the quadmask. This stage captures the broadly correct motion, such as an object entering free fall after losing its support. However, because video diffusion models are prone to object deformation when generating complex motion, further refinement is needed.

Second stage: optical-flow-guided noise stabilization. Inspired by the Go-with-the-Flow method, VOID extracts the optical flow field from the first-stage output, uses it to generate temporally correlated warped noise, and feeds that noise into the second stage. This lets the diffusion model denoise consistently along the correct trajectory, significantly reducing object deformation. The VLM automatically decides whether to trigger the second stage (it is enabled only when significant dynamic changes are detected).
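The core idea of flow-warped noise can be shown in a few lines: the next frame's latent noise is the previous frame's noise advected along the optical flow, so denoising stays correlated along motion trajectories. This is a minimal nearest-neighbour numpy sketch, not the paper's implementation (which operates on diffusion latents with a more careful warping scheme).

```python
# Minimal sketch of flow-warped noise in the spirit of Go-with-the-Flow.
# Nearest-neighbour backward warping on a toy 8x8 noise field; the real
# method works on diffusion latents, this only illustrates the principle.
import numpy as np

rng = np.random.default_rng(0)
H, W = 8, 8
noise_t = rng.standard_normal((H, W))

# Toy flow field: everything moves one pixel to the right, (dy, dx) per pixel
flow = np.zeros((H, W, 2))
flow[..., 1] = 1.0

def warp_noise(noise, flow):
    """Backward warp: the noise at (y, x) is sampled from (y - dy, x - dx)."""
    ys, xs = np.meshgrid(np.arange(noise.shape[0]),
                         np.arange(noise.shape[1]), indexing="ij")
    src_y = np.clip(np.round(ys - flow[..., 0]).astype(int), 0, noise.shape[0] - 1)
    src_x = np.clip(np.round(xs - flow[..., 1]).astype(int), 0, noise.shape[1] - 1)
    return noise[src_y, src_x]

noise_t1 = warp_noise(noise_t, flow)
# Each interior pixel of the warped frame equals its left neighbour in the source,
# i.e. the noise pattern has moved with the flow.
assert np.allclose(noise_t1[:, 1:], noise_t[:, :-1])
```

Because the same noise pattern rides along with each moving object, the denoiser sees a consistent "seed" for that object across frames, which is what suppresses frame-to-frame deformation.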

Research results

Experiments on both real and synthetic data show that compared with existing video object removal methods, this method can better maintain the consistency of scene dynamics after object removal.

1. Real-world video evaluation

Since there is no "standard answer" for real-world videos, the research team adopted a variety of evaluation methods.

Human preference study: 25 participants each evaluated 5 scenarios and picked the best result among the outputs of 7 models. VOID was preferred 64.8% of the time, even though Runway received additional text instructions describing the expected scene changes.

VLM judge evaluation: The research team used three VLMs (Gemini 3 Pro, GPT-5.2, and Qwen 3.5-32B) as automatic judges, scoring dimensions such as interaction physics, object removal, temporal consistency, and scene preservation. VOID obtained the highest total score from all three judges, with the largest margin on the "interaction physics" dimension: under Gemini 3 Pro, VOID scored 3.66 versus 2.61 for second-place Runway.

Qualitative comparison: In multiple real scenarios, the baseline methods had various failures: the object was not correctly removed in the collision scenario, the pillow remained sunken after removing the heavy object, and new paint still appeared on the wall after removing the paint roller. VOID showed correct physical reasoning in all cases.

Generalization to unseen effects: In terms of generalization, VOID successfully handled a variety of interaction types that never appeared in the training data. As shown in the following figure: after removing the cartoon bear holding the balloon, the balloon floats upward; after removing the child pressing the blender button, the blender no longer starts; after removing the dog biting the stick, the stick naturally falls; after removing the rubber duck obstacle, the ball changes its rolling trajectory, etc.

2. Synthetic dataset evaluation

On a synthetic benchmark containing 10 classic shadow/reflection removal cases and 30 dynamic interaction cases, VOID also achieved state-of-the-art results.

VOID ranked first on all metrics except LPIPS. Notably, LPIPS is sensitive to small spatial displacements: a model that correctly simulates a falling object but gets the speed slightly wrong can score worse than one that simply deletes the object. On the video-level metrics (FVD and the VLM judge scores), the gap between VOID and the baselines is largest, which supports its advantage in physical plausibility and semantic consistency.
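The displacement-sensitivity argument can be made concrete with a toy example. Here plain per-pixel MSE stands in for LPIPS (LPIPS is perceptual, but it too compares spatially aligned features): a correctly moving object that is slightly offset scores worse than no object at all.

```python
# Toy illustration of why aligned metrics punish correct-but-offset motion.
# MSE is used as a stand-in for LPIPS; the effect, not the metric, is the point.
import numpy as np

frame_gt = np.zeros((16, 16))
frame_gt[8, 4] = 1.0                 # ground truth: ball at column 4

frame_shifted = np.zeros((16, 16))
frame_shifted[8, 6] = 1.0            # model A: ball falls, but 2 px off

frame_deleted = np.zeros((16, 16))   # model B: simply erases the ball

mse = lambda a, b: float(np.mean((a - b) ** 2))
print(mse(frame_gt, frame_shifted))  # 2/256: penalised twice (miss + false hit)
print(mse(frame_gt, frame_deleted))  # 1/256: smaller error despite wrong physics
```

This is why the authors lean on video-level metrics like FVD, which compare distributions of motion rather than pixel-aligned frames.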

In addition, an ablation study on 75 real-world test cases shows that mixing two datasets (even at the same total size) outperforms training on a single data source, and that fine-grained quadmasks combined with the VLM-guided mask-generation process significantly outperform a coarse global-mask strategy.

Limitations and future prospects

Although VOID shows strong generalization ability, the work still has several limitations:

  • Domain gap problem: When the camera angle of the test video is abnormal or too close to the object, the performance will decline.
  • Data source limitation: Currently, all the training data comes from the rendering engine. In the future, more diverse data acquisition methods can be explored.
  • Video length and resolution: The generated video is still limited to a length of a few seconds, and there is room for improvement in resolution.

The research team noted that as more powerful video generation models and VLMs emerge, the framework's performance should improve further. More importantly, this work highlights an interesting and under-explored direction: how to transfer the strong world-modeling ability of video generation models into the field of video editing.

This article is from the WeChat official account "Academic Headlines" (ID: SciTouTiao), author: Academic Headlines. Republished by 36Kr with permission.