NVIDIA uses 3D models to form an "AI architect task force". Eight Chinese researchers co-authored the work, including an intern from Qianwen (Qwen).

Zhidx, 2026-02-03 19:42
A new NVIDIA study lets AI first build houses and then renovate them.

According to a report by Zhidx on February 3rd, NVIDIA recently announced that its new paper on a general-purpose 3D model will be published at the International Conference on 3D Vision 2026. A preprint of the paper was already released in July of last year. The paper develops a new paradigm for building 3D worlds and confirms that “AI-generated 3D synthetic data” can replace hand-annotated data at scale, which can significantly reduce the cost of pre-training vision models.

The paper's main result is the 3D-GENERALIST model, which folds the four core factors of 3D environment generation (layout, materials, lighting, and assets) into a single sequential decision-making framework. The research team also proposes a self-improving fine-tuning strategy based on CLIP scores, which lets the model correct errors from one generation round in the next.

The paper lists eight Chinese authors. The first two authors are Chinese students. Jiajun Wu, an assistant professor at Stanford University who came out of Tsinghua University's famous “Yao Class”, is also among them.

At CES 2025, NVIDIA officially introduced Cosmos, its world foundation model platform. In his CES 2026 keynote, Jensen Huang went further, defining “Physical AI” as the core of the entire release and positioning Cosmos as the “basic code” and the “world simulator” of Physical AI. Huang also released Cosmos Reason 2, which enables AI not only to generate worlds but also to reason about causal chains in natural language.

Which missing piece does 3D-GENERALIST add to NVIDIA's Cosmos, and how is the technical breakthrough achieved? We look for answers in the paper.

01 Existing Problems: Only whole 3D images are generated; glasses and cups cannot be interacted with independently

Currently, there are still many problems in creating interactive 3D environments.

For example, existing technologies often focus on single aspects of 3D generation, such as optimizing the layout or synthesizing textures. It is difficult to achieve coordinated optimization of all factors.

In addition, scenes generated by existing technologies lack separable, manipulable objects and surfaces. Even with methods built on large language models or diffusion models, it is difficult to improve generation quality simply by scaling up compute. The generated data is also unsuitable for synthetic-data applications that require precise annotations, or for robot simulation scenarios. A gap remains between the generated data and the quality that downstream tasks require of a 3D environment.

Simply put, existing technologies only generate a 3D scene as one monolithic image. Glasses and books in the virtual world cannot be interacted with independently.

These are precisely the problems that 3D-GENERALIST aims to solve.

02 Research Approach: A self-improving mechanism; the diffusion model draws, the VLM guides, the API executes

The core idea of the research by Stanford and NVIDIA is to expand a single “designer” into an “architectural team” and break the work of building a house into steps, each assigned to a specific expert.

Specifically, the research team first generates a 360° guide image using a panoramic diffusion model. This is equivalent to drawing a floor plan, and all further construction work must follow this image.

The research team then proposes a scene-level strategy consisting of three steps:

First, HorizonNet extracts the basic structure of the room, which amounts to putting up the frame of the house. Then, Grounded-SAM segments the exact regions of doors and windows on the identified walls. Finally, a VLM (vision-language model) such as GPT-4o annotates the door and window types as well as the materials, and a 3D room with its basic components is generated programmatically.

Once the bare room is complete, the VLM serves as the “brain” for decision-making. Its inputs are multi-view scene renderings annotated with coordinate markers and asset names, together with text prompts.

The VLM then directly outputs specific action instructions as code, such as adding assets, adjusting the lighting, or changing a material. These instructions are passed to the tool API of the 3D environment, which executes them automatically and updates the entire 3D room in real time.
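The hand-off from VLM output to tool API can be pictured with a small dispatch loop. This is only an illustrative sketch: the `Scene` class, the tool names (`add_asset`, `set_material`, `set_light`), and the dictionary action format are all hypothetical and are not taken from the paper or from any NVIDIA API.

```python
from dataclasses import dataclass, field


@dataclass
class Scene:
    """Toy stand-in for a 3D environment (hypothetical, not the paper's model)."""
    assets: dict = field(default_factory=dict)      # asset name -> (x, y, z)
    materials: dict = field(default_factory=dict)   # surface name -> material
    light_intensity: float = 1.0


def add_asset(scene, name, position):
    scene.assets[name] = tuple(position)


def set_material(scene, surface, material):
    scene.materials[surface] = material


def set_light(scene, intensity):
    scene.light_intensity = float(intensity)


# Registry mapping action names to tool functions (names are assumptions).
TOOLS = {
    "add_asset": add_asset,
    "set_material": set_material,
    "set_light": set_light,
}


def apply_actions(scene, actions):
    """Execute actions in order, rejecting any tool name not in the registry."""
    for action in actions:
        tool_name = action["tool"]
        if tool_name not in TOOLS:
            raise ValueError(f"unknown tool: {tool_name}")
        TOOLS[tool_name](scene, **action["args"])
    return scene


# Actions as a VLM might emit them, already parsed into dictionaries.
scene = apply_actions(Scene(), [
    {"tool": "add_asset", "args": {"name": "sofa", "position": [1.0, 0.0, 2.0]}},
    {"tool": "set_material", "args": {"surface": "floor", "material": "oak"}},
    {"tool": "set_light", "args": {"intensity": 0.8}},
])
```

Routing every action through a fixed registry means malformed or unexpected model output fails loudly instead of silently changing the scene.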

To ensure that each object in the virtual scene can be interacted with independently, the research team also developed an asset-level optimization strategy.

Specifically, the team first uses GPT-4o to identify the container assets in the scene that can hold small objects, such as tables and bookshelves. Mesh-based surface detection is then used to locate the valid regions on these supporting assets where objects can be placed.

Next, the vision-language model Molmo-7B, which specializes in fine-grained pixel-level reasoning, determines the exact pixels at which small objects should be placed. By casting 3D rays through those pixels, the pixel positions are converted into precise 3D coordinates.
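As a rough illustration of how a chosen pixel can become a 3D placement point, the sketch below casts a ray through the pixel with a pinhole camera model and intersects it with a horizontal surface. The camera intrinsics, the coordinate convention (x right, y down, z forward, camera at the origin), and the flat-plane surface are simplifying assumptions; the paper's actual ray conversion presumably works against the scene geometry.

```python
def pixel_to_ray(u, v, fx, fy, cx, cy):
    """Direction of the camera ray through pixel (u, v) for a pinhole camera."""
    return ((u - cx) / fx, (v - cy) / fy, 1.0)


def hit_horizontal_plane(direction, plane_y):
    """Intersect a ray from the camera origin with the plane y = plane_y.

    Returns the 3D hit point in camera coordinates, or None when the ray is
    parallel to the plane or the plane lies behind the camera.
    """
    dx, dy, dz = direction
    if abs(dy) < 1e-12:
        return None
    t = plane_y / dy
    if t <= 0:
        return None
    return (t * dx, t * dy, t * dz)


# Hypothetical intrinsics for a 640x480 image; the tabletop lies 1.0 unit
# below the camera (positive y points down in this convention).
ray = pixel_to_ray(420, 490, fx=500.0, fy=500.0, cx=320.0, cy=240.0)
point = hit_horizontal_plane(ray, plane_y=1.0)
```

In a full pipeline the camera-space point would still need to be transformed into world coordinates with the camera pose, which is omitted here.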

Combined with collision detection, 3D-GENERALIST ultimately achieves placements that follow real-world logic, such as putting a book on a table or a pen on a book.
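The article does not say which collision test is used. One common and simple choice is an axis-aligned bounding box (AABB) overlap check, sketched below under that assumption; the `Box` type and the example dimensions are made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box given by its minimum and maximum corners."""
    min_corner: tuple
    max_corner: tuple


def overlaps(a, b, eps=1e-9):
    """True if the boxes interpenetrate; faces that merely touch do not count."""
    return all(
        a.min_corner[i] < b.max_corner[i] - eps
        and b.min_corner[i] < a.max_corner[i] - eps
        for i in range(3)
    )


table = Box((0.0, 0.0, 0.0), (1.0, 0.75, 1.0))
book_on_table = Box((0.2, 0.75, 0.2), (0.4, 0.78, 0.4))    # rests on the top face
book_sunk_in = Box((0.2, 0.70, 0.2), (0.4, 0.73, 0.4))     # intersects the tabletop

resting_ok = not overlaps(table, book_on_table)
sunk_rejected = overlaps(table, book_sunk_in)
```

Treating touching faces as non-colliding is what allows stacked placements, such as a pen on a book on a table, to pass the check.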

In addition, 3D-GENERALIST is supported by three key technologies:

First, the research team introduced a self-improving fine-tuning mechanism. In each fine-tuning round, the model generates multiple candidate action sequences, and CLIP scoring selects the one that best matches the text prompt. The VLM is then fine-tuned with supervision on this best action sequence, which strengthens the model's ability to correct itself.

Second, the research team standardized a domain-specific language (DSL) for scenes. It defines core descriptors such as category, placement position, material, and lighting, so that the action instructions output by the VLM follow a uniform format that is compatible with the tool API.

Third, the research team uses an in-context corpus of action code snippets, which can significantly improve the CLIP matching score. During generation, snippets are randomly sampled from it as examples to improve the diversity and effectiveness of the action sequences.
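One round of the loop described above (sample in-context examples, generate candidates, keep the best-scoring one as a fine-tuning target) can be sketched as follows. Everything here is a simplified assumption: `toy_score` is a word-overlap stand-in for a real CLIP image-text score, the candidates are hard-coded rather than generated by a VLM, and the corpus entries are invented strings.

```python
import random


def toy_score(prompt, caption):
    """Stand-in for a CLIP score: Jaccard overlap of words (an assumption)."""
    p, c = set(prompt.lower().split()), set(caption.lower().split())
    union = p | c
    return len(p & c) / len(union) if union else 0.0


def sample_examples(corpus, k, rng):
    """Randomly pick k action snippets to use as in-context examples."""
    return rng.sample(corpus, k)


def select_best(prompt, candidates, score_fn=toy_score):
    """Return the candidate whose rendered result best matches the prompt."""
    return max(candidates, key=lambda cand: score_fn(prompt, cand["caption"]))


def build_sft_pairs(rounds, score_fn=toy_score):
    """Turn (prompt, candidates) rounds into (prompt, best actions) pairs."""
    return [
        (prompt, select_best(prompt, cands, score_fn)["actions"])
        for prompt, cands in rounds
    ]


# Hypothetical corpus of action snippets and hard-coded candidate sequences.
corpus = ["add_asset('bed')", "set_light(0.6)", "set_material('wall', 'plaster')"]
examples = sample_examples(corpus, k=2, rng=random.Random(0))

prompt = "a cozy bedroom with a wooden desk"
candidates = [
    {"actions": ["add_asset('stove')"], "caption": "an empty kitchen"},
    {"actions": ["add_asset('desk')"], "caption": "a bedroom with a wooden desk"},
]
sft_pairs = build_sft_pairs([(prompt, candidates)])
```

In a real setup the caption-based score would be replaced by a CLIP similarity between the text prompt and a rendering of the scene, and the resulting pairs would feed a supervised fine-tuning job.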

03 Results: 99% physical plausibility; training on synthetic data approaches training on real data

On the task of generating 3D environments for simulation, 3D-GENERALIST outperforms baseline methods such as LayoutGPT, Holodeck, and LayoutVLM on every measure of 3D environment generation quality.

In terms of physical plausibility, 3D-GENERALIST achieves a collision-free score of 99.0 and an in-bounds score of 98.0. In terms of semantic consistency, its position and rotation consistency scores reach 78.2 and 79.1 respectively. Its combined physical-semantic alignment score reaches 67.9, well above the best baseline score of 58.8.

After three rounds of self-improving fine-tuning, 3D-GENERALIST achieves a CLIP score of 0.275, significantly higher than both the version without fine-tuning and the version without the in-context corpus. It can also iteratively correct defects in the scene.

Scenes generated with the asset-level strategy reach an average CLIP score of 0.282, higher than the 0.269 of the baseline methods. Small objects are thus placed in a way that is both semantically appropriate and physically plausible, and overlapping objects are avoided.

Self-improving fine-tuning also reduces the VLM's rate of visual hallucination. On the Object HalBench and AMBER benchmarks, the fine-tuned model's hallucination-related metrics are better than those of the original GPT-4o.

When a vision model is pre-trained on synthetic data generated by 3D-GENERALIST using 860,000 labels, its ImageNet-1K Top-1 accuracy reaches 0.731, exceeding that of a model pre-trained on the manually built HyperSim dataset.

When the number of labels is expanded to 12.17 million, the ImageNet-1K Top-1 accuracy rises to 0.776, approaching the performance of a model trained on 5 billion real data samples. This confirms the advantages of 3D-GENERALIST for generating synthetic data at scale.

04 Research Team: Eight Chinese authors, including startup founders, Tsinghua University Yao Class talent, and an intern from Qwen

Beyond the research itself, the paper's author list is also eye-catching.

The first author, Fan-Yun Sun, is a doctoral student in computer science at the Stanford Artificial Intelligence Laboratory (SAIL) and belongs to the Autonomous Agents Lab and the Stanford Vision and Learning Laboratory (SVL).