NVIDIA's new 3D model works like an "AI architect team"; the paper's eight Chinese co-authors include a former Qwen intern.
According to a February 3 report by Zhidongxi, NVIDIA recently announced that a paper on its new general-purpose 3D model has been accepted to the 2026 International Conference on 3D Vision (3DV); a preprint was posted last July. The paper constructs a new paradigm for building 3D worlds and verifies that AI-generated 3D synthetic data can replace manually labeled data at scale, significantly cutting the cost of pre-training vision models.
The paper's main achievement is the 3D-GENERALIST model, which integrates the four core elements of 3D environment generation (layout, materials, lighting, and assets) into a single sequential decision-making framework. The research team also proposes a self-improvement fine-tuning strategy based on CLIP scores, which lets the model autonomously correct its earlier errors in the next round of generation.
The paper has eight Chinese authors; the first and second authors are both Chinese students studying overseas. Among them is also Jiajun Wu, an assistant professor at Stanford University and a graduate of Tsinghua University's "Yao Class".
At CES 2025, NVIDIA officially launched its world foundation model platform, Cosmos. In his CES 2026 keynote, Jensen Huang again placed "physical AI" at the center of the announcements and officially positioned Cosmos as the "underlying code" and "world simulator" of physical AI. He also released Cosmos Reason 2, which enables AI not only to generate worlds but also to carry out chained causal reasoning in natural language.
What piece of the puzzle does 3D-GENERALIST complete for NVIDIA's Cosmos, and how does it achieve its technical breakthroughs? We looked for answers in the paper.
01 Existing Pain Points: Only Whole 3D Images Are Generated; Cups and Books Cannot Interact Independently
Currently, creating an interactive 3D environment still faces many pain points.
For example, existing techniques often focus on a single aspect of 3D generation, optimizing only the layout or only synthesizing textures, which makes it difficult to optimize all the elements jointly.
Moreover, the scenes generated by existing techniques lack separable, manipulable objects and surfaces. Even with large language models or diffusion models, it is hard to raise generation quality by scaling up compute. The resulting data is also unsuitable for synthetic-data applications that require precise labels, or for robot-interaction simulation, falling short of what downstream tasks demand of a 3D environment.
Simply put, existing techniques generate the 3D scene as one monolithic image: the cups and books in the virtual world cannot be interacted with individually.
3D-GENERALIST is designed to solve these pain points.
02 Research Method: Introduce a Self-Improvement Mechanism; Let the Diffusion Model Draw, the VLM Command, and the API Execute
The core idea of the Stanford and NVIDIA researchers is to expand a single "designer" into an "architect team": break the job of building a house into steps and hand each step to a specialist.
Specifically, the research team first generates a 360° guiding image with a panoramic diffusion model. This step is equivalent to drawing the blueprint first; all subsequent construction is based on this image.
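To make this first step concrete, here is a minimal sketch of generating an equirectangular guiding panorama with a text-to-image diffusion pipeline. The paper does not name a specific checkpoint, so the model identifier below is a placeholder, and the call assumes a standard diffusers text-to-image interface.

```python
# Minimal sketch of step one: produce a 360-degree equirectangular guiding
# image from a text prompt. "example-org/panorama-diffusion" is a hypothetical
# checkpoint name; substitute whichever panoramic diffusion model you use.
import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    "example-org/panorama-diffusion",  # placeholder, not the paper's model
    torch_dtype=torch.float16,
).to("cuda")

prompt = "a cozy living room with a sofa, a bookshelf, and large windows"
# Equirectangular panoramas cover 360 x 180 degrees, hence the 2:1 aspect ratio.
panorama = pipe(prompt, height=1024, width=2048).images[0]
panorama.save("guiding_panorama.png")  # the blueprint for later construction
```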
Then, the research team proposed a "scenario-based strategy", divided into three steps:
First, HorizonNet extracts the basic structure of the room and erects the structural frame. Then, Grounded-SAM identifies and segments the door and window regions on the detected walls. Finally, a vision-language model (VLM) such as GPT-4o labels the types and materials of the doors and windows, and procedural generation assembles a 3D room from these basic components. A minimal sketch of this pass follows.
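The sketch below only shows how the three stages hand data to one another. HorizonNet, Grounded-SAM, and GPT-4o each have their own real interfaces; the helper functions here are hypothetical stand-ins, not those APIs.

```python
# Skeleton of the scenario-based structure pass. The three helpers are
# hypothetical wrappers, not the real HorizonNet / Grounded-SAM / GPT-4o APIs.
from dataclasses import dataclass, field

@dataclass
class BaseRoom:
    wall_corners: list                 # room footprint from layout estimation
    ceiling_height: float
    openings: list = field(default_factory=list)  # labeled doors and windows

def estimate_layout(panorama_path: str) -> BaseRoom:
    """Step 1 (HorizonNet-style): recover walls, floor, and ceiling height."""
    raise NotImplementedError  # placeholder for the layout network

def segment_openings(panorama_path: str, walls: list) -> list:
    """Step 2 (Grounded-SAM-style): detect and segment door/window regions."""
    raise NotImplementedError  # placeholder for open-vocabulary segmentation

def label_openings(regions: list) -> list:
    """Step 3 (VLM labeling): ask a VLM such as GPT-4o for type and material."""
    raise NotImplementedError  # placeholder for the VLM call

def build_base_room(panorama_path: str) -> BaseRoom:
    """Chain the three steps; the result feeds procedural 3D generation."""
    room = estimate_layout(panorama_path)
    regions = segment_openings(panorama_path, room.wall_corners)
    room.openings = label_openings(regions)
    return room
```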
Once this rough, unfurnished room is built, the research team uses the VLM as the decision-making "brain", feeding it text prompts along with multi-view renders of the scene annotated with coordinate markers and asset-name markers.
The VLM then directly outputs concrete action instructions in code form, such as adding assets, adjusting lighting, and changing materials. These instructions are wired into the 3D environment's tool API, which executes them automatically and updates the whole room in real time. A sketch of this observe-decide-act loop follows.
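In the sketch below, the `scene` tool API and the `vlm` client are assumed interfaces; the paper's actual tool API is not public, so only the shape of the loop is shown.

```python
# Sketch of one observe-decide-act step. `scene` and `vlm` are assumed
# interfaces, not the paper's real tool API.
import json

ALLOWED_OPS = {"add_asset", "set_material", "set_light"}  # assumed vocabulary

def generation_step(scene, vlm, prompt_text: str):
    # Observe: multi-view renders annotated with coordinates and asset names.
    views = scene.render_annotated_views(n_views=4)
    # Decide: the VLM replies with executable action instructions, e.g.
    # [{"op": "add_asset", "args": {"name": "sofa_01", "position": [1, 0, 2]}}]
    reply = vlm.complete(images=views, text=prompt_text)
    actions = json.loads(reply)
    # Act: dispatch each instruction to the scene's tool API and re-render.
    for action in actions:
        if action["op"] not in ALLOWED_OPS:
            continue  # reject anything outside the standardized action format
        getattr(scene, action["op"])(**action["args"])
    return scene
```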
To let every object in the virtual scene be interacted with independently, the research team also designed a dedicated asset-level optimization strategy.
Specifically, the team first uses GPT-4o to identify container-type assets in the scene that can hold small objects, such as tables and bookshelves, and then applies grid-based surface detection to precisely locate the regions on these carriers where objects can validly be placed.
Next, the team brings in Molmo-7B, a vision-language model skilled at fine-grained pixel-level reasoning, to determine the exact pixels at which small objects should be placed, and converts those pixel positions into precise 3D coordinates via 3D ray casting.
Combined with collision detection, 3D-GENERALIST finally achieves interactions that obey real-world logic, such as placing a book on the table and a pen on the book.
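The geometric core of this placement step, converting a pointed-at pixel into a 3D coordinate, can be written down exactly. The sketch below assumes a pinhole camera with intrinsics `K` and a camera-to-world transform, plus a support plane found by the surface detection; these inputs are illustrative, not the paper's code.

```python
# Un-project the pixel Molmo-7B points at into a camera ray, then intersect
# that ray with the detected support plane to get a 3D placement point.
import numpy as np

def pixel_to_surface_point(u, v, K, cam_to_world, plane_point, plane_normal):
    """Cast a ray through pixel (u, v) and intersect a support plane."""
    # Ray direction in camera coordinates via the inverse intrinsic matrix.
    d_cam = np.linalg.inv(K) @ np.array([u, v, 1.0])
    # Rotate into world coordinates; the camera center is the ray origin.
    R, origin = cam_to_world[:3, :3], cam_to_world[:3, 3]
    d = R @ d_cam
    d /= np.linalg.norm(d)
    # Ray-plane intersection: t = ((p0 - o) . n) / (d . n).
    denom = d @ plane_normal
    if abs(denom) < 1e-8:
        return None                    # ray runs parallel to the surface
    t = ((plane_point - origin) @ plane_normal) / denom
    if t <= 0:
        return None                    # surface is behind the camera
    return origin + t * d              # candidate point, pending collision check
```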
In addition, three key technologies support 3D-GENERALIST (a code sketch of the first, the self-improvement loop, follows this list):
First, the research team introduced a self-improvement fine-tuning mechanism. In each fine-tuning round, the model generates multiple candidate action sequences, uses CLIP scores to select the action best aligned with the text prompt, and then uses that optimal action for supervised fine-tuning of the VLM, strengthening the model's ability to correct itself.
Second, the research team standardized a scene-specific domain language, defining core descriptors such as category, placement position, material, and lighting, and fixing the format of the action instructions the VLM outputs to ensure compatibility with the tool API.
Third, the research team maintains a library of action code snippets that markedly improves CLIP alignment scores; during generation, snippets are randomly sampled as examples to increase the diversity and effectiveness of action sequences.
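As promised above, here is a minimal sketch of one self-improvement round, using OpenAI's reference CLIP package for scoring. The `vlm` and `scene` objects are assumed interfaces, and best-of-N selection with N=8 is an illustrative choice, not a figure from the paper.

```python
# One self-improvement round: sample candidate action sequences, keep the one
# whose render best matches the prompt under CLIP, and bank it for SFT.
import torch
import clip  # OpenAI's reference CLIP package

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def clip_score(image, text: str) -> float:
    """Cosine similarity between CLIP image and text embeddings."""
    with torch.no_grad():
        img = model.encode_image(preprocess(image).unsqueeze(0).to(device))
        txt = model.encode_text(clip.tokenize([text]).to(device))
    img = img / img.norm(dim=-1, keepdim=True)
    txt = txt / txt.norm(dim=-1, keepdim=True)
    return (img @ txt.T).item()

def self_improve_round(vlm, scene, prompt: str, n_candidates: int = 8):
    """Best-of-N selection; returns an (observation, prompt, actions) SFT pair."""
    candidates = [vlm.propose_actions(scene, prompt) for _ in range(n_candidates)]
    scores = [clip_score(scene.preview(a), prompt) for a in candidates]
    best = candidates[scores.index(max(scores))]
    scene.apply(best)                  # commit the winning action sequence
    return scene.observation(), prompt, best  # later used to fine-tune the VLM
```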
03 Performance Verification: 99% Physical Plausibility, Synthetic-Data Training Close to Real Data
On the task of generating simulation-ready 3D environments, the quality of environments produced by 3D-GENERALIST comprehensively surpasses baseline methods such as LayoutGPT, Holodeck, and LayoutVLM.
In terms of physical plausibility, 3D-GENERALIST reaches a collision-free score of 99.0 and an in-boundary score of 98.0. In terms of semantic consistency, its position-coherence and rotation-coherence scores are 78.2 and 79.1, and its combined physical-semantic alignment score of 67.9 is far above the best baseline's 58.8.
After three rounds of self-improvement fine-tuning, 3D-GENERALIST's CLIP score reaches 0.275, clearly higher than both the un-fine-tuned version and the version without the action snippet library, and the model can iteratively repair scene defects.
Scenes generated with the asset-level strategy average a CLIP score of 0.282, above the baseline's 0.269, naturally achieving semantically aligned and physically reasonable placement of small objects without overlaps.
Self-improvement fine-tuning also lowers the VLM's visual hallucination rate: on the Object HalBench and AMBER benchmarks, the fine-tuned model's hallucination metrics beat the original GPT-4o.
When a vision model is pre-trained on 860,000 labeled synthetic images generated by 3D-GENERALIST, its ImageNet-1K Top-1 accuracy reaches 0.731, exceeding pre-training on the manually constructed HyperSim dataset.
Scaling the synthetic set to 12.17 million labels lifts ImageNet-1K Top-1 accuracy to 0.776, approaching a model trained on 5 billion real samples and verifying the advantage of generating synthetic data at scale.
04 Research Team: Eight Chinese Authors, Including a Startup CEO, a Tsinghua Yao Class Alumnus, and a Former Qwen Intern
In addition to the research itself, the paper's author list is also very eye-catching.
The first author of this paper, Fan-Yun Sun, is a Ph.D. student in computer science at the Stanford Artificial Intelligence Laboratory (SAIL), affiliated with the Autonomous Agents Lab and the Stanford Vision and Learning Laboratory (SVL).
During his Ph.D. studies, he has also been deeply involved with NVIDIA Research, working in the Learning and Perception Research Group, Metropolis Deep Learning (Omniverse), and the Autonomous Vehicle Research Group.
His research interests center on generating embodied environments and data for training robots and reinforcement learning policies, and on advancing embodied, multimodal foundation models and their reasoning abilities.
In addition, he founded the AI game company Moonlake, a frontier AI lab focused on interactive world-building that integrates multimodal reasoning and world modeling.
The startup has previously raised $28 million (approximately RMB 195 million) in seed funding from Threshold Ventures, AIX Ventures, and NVentures, NVIDIA's venture capital arm.
The second author, Shengguang Wu, is currently a Ph.D. student in the Department of Computer Science at Stanford University and obtained his master's degree from Peking University.
Previously, he was a research intern on the Qwen team and participated in the research work on Qwen 1.
Jiajun Wu is an assistant professor of computer science and psychology at Stanford University. He graduated in 2014 from the "Yao Class" at Tsinghua University's Institute for Interdisciplinary Information Sciences, where he studied under Professor Zhuowen Tu. As an undergraduate, he ranked first in his grade for three years and served as a reviewer for the world's top conferences.