
New Work from Xie Saining's Team: Precise Control of 3D Images Without Prompts

QbitAI 2025-07-03 15:49
Visual generation is moving toward a new paradigm of intelligent creation that integrates multimodality, 3D semantics, and interaction.

By now, generating images from text has become as common as drawing with a pen.

But have you ever thought about controlling an image with the arrow keys?

Like this: press the arrow keys (or drag the slider with the mouse) to move an object in the image left and right:

You can also rotate the angle:

Zoom in and out:

This trick comes from BlenderFusion, a framework newly released by Xie Saining's team. By combining a graphics tool (Blender) with a diffusion model, it frees visual synthesis from relying solely on text prompts, enabling precise and flexible image control.

Three steps for image synthesis

The core of BlenderFusion's "press-a-key" image generation lies not in a novel model, but in an efficient combination of existing techniques (segmentation, depth estimation, Blender rendering, and a diffusion model) into a new pipeline.

This pipeline consists of three steps: First, separate the objects from the scene → Then, perform 3D editing with Blender → Finally, generate high-quality composite images with the diffusion model.

Next, let's see how each step is done!

Step 1: Object-centric Layering

The first step is to separate each object in the input image or video from the original scene and infer their three-dimensional information.

Specifically, BlenderFusion uses existing, powerful visual foundation models for segmentation and depth estimation: the Segment Anything Model (SAM) to segment the objects in the image, and the Depth Pro model to estimate depth and assign it to each object.

By performing depth estimation on each segmented object, the 2D input from the image or video is lifted into 3D space, laying the foundation for subsequent 3D editing.

This approach avoids training a 3D reconstruction model from scratch and makes full use of the existing large-scale pre-training capabilities.
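To make this concrete, here is a minimal sketch (not the authors' code) of how one segmented object can be lifted from 2D into 3D given its mask, a depth map, and pinhole camera intrinsics. The mask would come from SAM and the depth map from Depth Pro; the commented-out run_sam / run_depth_pro calls and the intrinsics K are hypothetical placeholders.

```python
import numpy as np

def unproject_object(mask: np.ndarray, depth: np.ndarray, K: np.ndarray) -> np.ndarray:
    """Lift the pixels of one segmented object into 3D camera coordinates.

    mask  : (H, W) boolean object mask, e.g. from SAM
    depth : (H, W) metric depth map, e.g. from Depth Pro
    K     : (3, 3) pinhole camera intrinsics
    Returns an (N, 3) point cloud for the masked pixels.
    """
    ys, xs = np.nonzero(mask)
    z = depth[ys, xs]
    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]
    # Invert the pinhole projection: X = (u - cx) * Z / fx, Y = (v - cy) * Z / fy
    x = (xs - cx) * z / fx
    y = (ys - cy) * z / fy
    return np.stack([x, y, z], axis=-1)

# Hypothetical usage (run_sam / run_depth_pro stand in for the real SAM and Depth Pro calls):
# masks = run_sam(image)        # list of (H, W) boolean masks
# depth = run_depth_pro(image)  # (H, W) metric depth map
# clouds = [unproject_object(m, depth, K) for m in masks]
```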

Step 2: Blender-grounded Editing

The second step is to import the separated objects into Blender for various refined edits. In Blender, you can perform various operations on the objects (such as changing colors, textures, local editing, adding new objects, etc.), and you can also control the camera (such as changing the camera viewpoint and background).
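For illustration only, the kinds of edits described above can be scripted with Blender's Python API (bpy). The sketch below assumes a scene that already contains an imported object named "chair" (a made-up name) and an active camera; the output path is also invented.

```python
import math
import bpy

# Grab an object that has already been imported into the scene (name is hypothetical).
obj = bpy.data.objects["chair"]

# Object-level edits: translate, rotate, and scale.
obj.location.x += 0.5                      # shift half a unit to the right
obj.rotation_euler.z += math.radians(30)   # rotate 30 degrees around the vertical axis
obj.scale *= 1.2                           # enlarge the object by 20%

# Camera-level edits: move the active camera and change its orientation.
cam = bpy.context.scene.camera
cam.location = (4.0, -4.0, 2.5)
cam.rotation_euler = (math.radians(65), 0.0, math.radians(45))

# Render the coarse edited scene that the next stage will refine.
bpy.context.scene.render.filepath = "/tmp/edited_render.png"
bpy.ops.render.render(write_still=True)
```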

Step 3: Generative Compositing

Although the scene after Blender rendering is highly accurate in terms of spatial structure, the appearance, texture, and lighting are still relatively rough.

Therefore, in the last step of the pipeline, BlenderFusion introduces a diffusion model (SD v2.1) to enhance the visual quality of the result.

For this purpose, BlenderFusion proposes a dual-stream diffusion compositor.

This model receives both the original input scene (unedited) and the coarsely rendered image after editing. By comparing the two, it learns to make high-fidelity changes only in the regions that need editing while keeping the global appearance consistent. This avoids the distortion caused by a conventional diffusion model "redrawing the whole image" and prevents degradation of the unmodified parts.
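To illustrate the dual-stream idea (not the paper's actual architecture, which builds on SD v2.1), the toy PyTorch module below encodes the unedited source and the coarse edited render in two separate streams and feeds both, concatenated, to a stand-in denoiser.

```python
import torch
import torch.nn as nn

class DualStreamCompositor(nn.Module):
    """Toy dual-stream conditioner: one encoder for the original (unedited) scene,
    one for the coarse Blender render; both condition a stand-in denoiser."""

    def __init__(self, channels: int = 64):
        super().__init__()

        def encoder() -> nn.Sequential:
            return nn.Sequential(
                nn.Conv2d(3, channels, 3, padding=1), nn.SiLU(),
                nn.Conv2d(channels, channels, 3, padding=1), nn.SiLU(),
            )

        self.source_enc = encoder()  # stream 1: unedited input scene
        self.edited_enc = encoder()  # stream 2: coarse edited render
        # Stand-in for the diffusion UNet: sees the noisy image plus both condition streams.
        self.denoiser = nn.Conv2d(3 + 2 * channels, 3, 3, padding=1)

    def forward(self, noisy, source, edited):
        cond = torch.cat([self.source_enc(source), self.edited_enc(edited)], dim=1)
        # Having both streams lets the model confine changes to the edited regions.
        return self.denoiser(torch.cat([noisy, cond], dim=1))

model = DualStreamCompositor()
noisy = torch.randn(1, 3, 64, 64)
source = torch.randn(1, 3, 64, 64)   # unedited scene
edited = torch.randn(1, 3, 64, 64)   # coarse Blender render after editing
print(model(noisy, source, edited).shape)  # torch.Size([1, 3, 64, 64])
```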

Some tricks

In addition, to improve BlenderFusion's generalization ability, the paper also reveals two important training techniques:

Source Masking: Randomly mask parts of the source image during training to force the model to learn to restore the complete image based on the conditional information.

Simulated Object Jittering: Simulate random offsets and perturbations of the objects to improve the model's ability to decouple camera motion from object motion. Together, the two techniques significantly improve the realism and consistency of the generated results.
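For intuition only, here is a rough sketch of what these two augmentations could look like; the function names, patch size, and jitter scale are invented for this example, not taken from the paper.

```python
import numpy as np

def source_masking(image: np.ndarray, mask_ratio: float = 0.5, patch: int = 32) -> np.ndarray:
    """Randomly zero out square patches of the source image so the model must
    rely on the conditioning signals to restore the complete picture."""
    out = image.copy()
    h, w = image.shape[:2]
    n_patches = int(mask_ratio * (h // patch) * (w // patch))
    for _ in range(n_patches):
        y = np.random.randint(0, h - patch + 1)
        x = np.random.randint(0, w - patch + 1)
        out[y:y + patch, x:x + patch] = 0
    return out

def simulated_object_jitter(points: np.ndarray, sigma: float = 0.02) -> np.ndarray:
    """Apply a small random offset to an object's 3D points so the model learns
    to separate object motion from camera motion."""
    offset = np.random.normal(scale=sigma, size=3)
    return points + offset

masked = source_masking(np.ones((256, 256, 3), dtype=np.float32))
jittered = simulated_object_jitter(np.random.rand(1000, 3))
```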

Result demonstration

BlenderFusion achieves strong results in visual generation involving object and camera manipulation.

As shown in the demo at the beginning of this article, even when the arrow keys are used to move objects around the image at will, the result maintains strong consistency and coherence.

In addition, BlenderFusion maintains spatial relationships and visual coherence across a range of complex scene edits, including:

Single image processing: Flexibly rearrange, copy, and transform objects, and change the camera perspective.

Multi-image scene recombination: Combine objects from any image to create a brand-new scene.

Generalization: These editing functions have been successfully extended to objects and scenes that were not seen during training.

At a time when AI visual synthesis is becoming increasingly competitive, BlenderFusion is like handing creators an extra "third hand".

Users are no longer trapped by prompts, and they can piece together the ideal image without repeated trial and error.

From object layering to 3D editing to high-fidelity generation, this pipeline not only makes AI image synthesis more "obedient", but also gives creators far more room to play.

Perhaps your next generated image will no longer be about "choosing the right words"; instead, you will be able to put every detail in place by hand, just like building with blocks.

Paper link: https://arxiv.org/abs/2506.17450 

Project page: https://blenderfusion.github.io/#compositing 

This article is from the WeChat official account "QbitAI" (ID: QbitAI), author: henry. It is published by 36Kr with authorization.