Fei-Fei Li's world model is here: generate a 3D world from a single sentence. AI is really starting to understand reality.
The world model is finally here!
Early this morning, Fei-Fei Li, the Stanford professor known as the "Godmother of AI," announced that her startup, World Labs, has officially launched its first product, Marble. It is the first time a world-model product has been put before the public in a usable form.
The core capabilities of Marble can be summarized into three points:
First, multi-modal generation. It can reconstruct a 3D world with a complete structure and rich details based on a single image, a video, or even a text prompt.
Second, AI-native world editing capabilities. Marble allows users to make local replacements, change materials, adjust lighting, or restructure the layout of the world, just as they would in a real scene.
Third, a production pipeline that actually works in practice. Marble supports exporting the generated world as Gaussian splats, triangle meshes, or video, which can be imported directly into common creation tools such as Unreal, Unity, and Blender and folded into game and film workflows.
Fei-Fei Li believes that the significance of Marble goes far beyond "making 3D creation more convenient." As she wrote in her long article "From Language to the World: Spatial Intelligence is the Next Frontier of AI," Marble is just the first step in creating a world model with true spatial intelligence.
From this perspective, Marble not only brings the world model to the public for the first time in the form of a "usable product," but also symbolizes the official start of the era of spatial intelligence:
from a tool that helps creators build 3D worlds today, to potentially helping robots understand real environments, to running virtual experiments and predicting outcomes in scientific research.
More importantly, it allows the outside world to clearly feel for the first time:
AI may move from understanding images and language to understanding and manipulating a complete world composed of structures, physical laws, and dynamic rules.
Marble is now officially open for use at https://marble.worldlabs.ai/
01 Create a World with Just a Sentence or an Image
The most "magical" aspect of Marble can actually be summed up in one sentence:
No matter what you give it - a sentence, an image, a few videos, or even a rough blockout made of a few cubes - Marble can turn it into a complete 3D world.
This may sound like exaggerated advertising, but let's start with the simplest input method.
Let's first look at text generation. Suppose you give Marble the following prompt:
"An open kitchen that combines the aesthetics of a mid-century restaurant with orbital technology, featuring a checkerboard floor and stainless-steel fittings, and illuminated by soft light blue lighting."
Although it seems long, Marble will automatically extract the key elements - checkerboard floor, stainless steel, light blue lighting, and open kitchen - and generate a three-dimensional space that you can "walk into" within a few seconds.
It will look something like this:
In addition to text, Marble also supports more complex creative methods:
Single-image generation: given a single photo, Marble infers the scene's perspective, lighting direction, and object arrangement, and completes it into a navigable 3D world.
The result will be something like this:
Furthermore, if you provide multi-view images or videos, it can capture the key elements and reconstruct a more complete and accurate three-dimensional space.
▲ The first photo is the front view, and the second is the side view
For example, you can give Marble two photos, one of the front and one of the side. The system combines the information from both photos to reconstruct a more complete, more fully three-dimensional space. The effect is as follows:
In addition to text, images, and videos, Marble also provides the Chisel tool for more professional creators.
Chisel is an experimental editing mode aimed at advanced creators. Using it, a creator first builds a very rough framework in three-dimensional space.
This framework can be as simple as consisting of just a few boxes, planes, or walls, or as complex as containing multiple rooms, corridors, or even multi-story structures. In addition, users can also import existing 3D resources and embed them into the scene as part of the world.
After such a "skeleton" is built, the AI will enter the second stage. Creators only need to describe the desired style in one sentence, whether it's a modern art museum, a Nordic-style guesthouse, or a science fiction experimental cabin. The system will then complete the materials, lighting, and details based on the existing structure, giving the entire world a unified visual language.
For example, starting from a rough 3D geometric blockout and the text prompt "A beautiful modern art museum with a wooden floor, filled with colorful paintings and sculptures with elegant curves," Marble produces a result like this:
The key to this method is that it clearly distinguishes between spatial logic and visual style. The rough three-dimensional layout determines the basic structure of the scene, while the text prompt controls the final style and atmosphere. The two can be freely combined, so the same framework can give rise to completely different worlds.
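To make this split concrete, here is a hypothetical sketch of how a blockout and a style prompt might be represented as data. This is not Chisel's actual format, only an illustration of how the same skeleton can be paired with different style prompts.

```typescript
// Hypothetical data shape, illustrative only: not Chisel's real format.
// It separates spatial logic (the blockout) from visual style (the prompt).
interface Block {
  size: [number, number, number];      // width, height, depth in meters
  position: [number, number, number];  // placement in the scene
  role: "floor" | "wall" | "room" | "prop";
}

interface ChiselScene {
  blockout: Block[];    // rough skeleton: determines structure and layout
  stylePrompt: string;  // one sentence: determines materials, lighting, mood
}

const gallery: ChiselScene = {
  blockout: [
    { size: [12, 0.2, 8], position: [0, 0, 0], role: "floor" },
    { size: [12, 4, 0.2], position: [0, 2, -4], role: "wall" },
    { size: [1, 1.2, 1], position: [3, 0.6, 0], role: "prop" }, // sculpture stand
  ],
  stylePrompt:
    "A beautiful modern art museum with a wooden floor, filled with colorful " +
    "paintings and sculptures with elegant curves.",
};

// The same blockout paired with a different prompt yields a different world.
const sciFiCabin: ChiselScene = {
  ...gallery,
  stylePrompt: "A science fiction experimental cabin with cool metallic surfaces.",
};
```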
02 The Real Disruption of Marble: A World That Can Continuously Evolve
Generation is just the starting point. Another important breakthrough of Marble is that it turns "world editing" into an AI-native capability.
It allows users to adjust the generated three-dimensional world just as they would a real scene: delete an object, replace a material, change the lighting, expand an area, or even restructure the entire spatial layout.
Here is a demonstration case of Marble:
This editability lets 3D generation break free of one-shot output for the first time and become a continuous creative process, much closer to a real 3D production workflow.
Moreover, Marble also provides a new method for "expanding" the world.
In traditional 3D creation, larger scenes are more expressive, but expanding a scene usually means higher cost. Here, Marble gives creators a great deal of freedom.
After the initial world is generated, users can expand any area within it. Simply select an area, and the system will infer a new environment based on the existing scene logic and complete the previously blurred or undeveloped parts.
For example, the corners of a room may lack detail in the first generated version, or the backs of furniture may not be fully modeled. Expanding those areas fills in the weak points and makes the scene more unified and complete.
Larger areas can be extended into courtyards, streets, or even entire landscapes, allowing the initial small scene to naturally develop into an immersive space.
Here is a demonstration case of Marble:
In addition to extending within a single world, Marble offers another way to build large-scale scenes: a "combination mode" that joins multiple independently generated worlds together.
This "combination mode" allows creators to arrange the relationships between different worlds just like piecing together a map. Whether it's juxtaposition, connection, or nesting, they can be freely arranged according to needs.
This means that users can first generate several spaces with different styles and then combine them to form a large-scale and multi-layered virtual environment.
These two methods make Marble not just a tool for one-time scene generation, but more like a world-building platform that can be continuously expanded.
03 Generated Content as Assets: The AI World Can Be Used in Gaming and Film
After the world is generated, how to integrate it into the real production process is the key to whether a 3D creative tool can truly realize its value.
Marble does quite well in this regard. It allows users to export the scene in multiple formats for further use in subsequent game development, film production, architectural visualization, or robot simulation.
One export option is Gaussian splatting. A Gaussian splat scene can be understood as a "three-dimensional image composed of countless tiny points." When generating a world, Marble breaks the scene into a vast number of small splats carrying color, opacity, and depth information, then blends them together to form a realistic three-dimensional image.
This representation is particularly good at soft lighting, complex materials, and delicate spatial layering, so it is used to showcase the Marble world at its highest fidelity.
To let these splat worlds be viewed directly on the web, Marble has also released an open-source renderer called Spark. Built on the widely used three.js (a library for displaying 3D content in the browser), it lets users load and display these Gaussian splat worlds directly in the browser.
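As a minimal sketch, loading one of these splat exports in the browser might look like the snippet below. It assumes Spark ships an npm package @sparkjsdev/spark exposing a SplatMesh class, and the file path is a placeholder; check the Spark documentation for the exact API. Everything else is standard three.js.

```typescript
// Minimal splat viewer sketch. The Spark package name, the SplatMesh class,
// and the file path are assumptions; verify them against the Spark docs.
import * as THREE from "three";
import { SplatMesh } from "@sparkjsdev/spark";

const renderer = new THREE.WebGLRenderer({ antialias: true });
renderer.setSize(window.innerWidth, window.innerHeight);
document.body.appendChild(renderer.domElement);

const scene = new THREE.Scene();
const camera = new THREE.PerspectiveCamera(
  60,
  window.innerWidth / window.innerHeight,
  0.1,
  1000
);
camera.position.set(0, 1.6, 3); // roughly eye height inside the generated room

// Each splat carries position, color, opacity, and an anisotropic footprint;
// the renderer blends them into the final image.
const world = new SplatMesh({ url: "/assets/marble-scene.spz" });
scene.add(world);

renderer.setAnimationLoop(() => {
  renderer.render(scene, camera);
});
```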
If you need more traditional 3D assets, Marble also supports exporting the world as triangle meshes. This is a common format in the gaming, film, and design industries, and almost all professional software can open it directly.
Marble provides two types of meshes with different levels of precision:
One is the "collision mesh", which has a relatively rough structure and is used for physical simulation, such as character collision detection and robot path planning.
The other is the "high-quality mesh", which retains as many details and lighting effects of the original world as possible and is suitable for use in official game levels, animation shots, or architectural displays.
After being exported as meshes, these scenes can seamlessly enter mainstream production tools such as Blender, Maya, Unity, and Unreal, and be fully integrated into the existing creative pipeline without the need for additional conversion. This means that the assets generated by Marble have the opportunity to be directly used in the workflows of industries such as gaming and film.
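As a rough illustration of that hand-off, the sketch below loads both mesh exports into a three.js scene: the high-quality mesh for rendering, and the collision mesh, hidden, for physics-style queries. The .glb file names are placeholders (the exact container format Marble exports is not specified here), and the GLTFLoader import path depends on how your project sets up three.js.

```typescript
// Sketch: wiring Marble's two mesh exports into an existing three.js scene.
// File names are placeholders; the import path assumes an import-map setup.
import * as THREE from "three";
import { GLTFLoader } from "three/addons/loaders/GLTFLoader.js";

const scene = new THREE.Scene();
const loader = new GLTFLoader();

// High-quality mesh: the geometry the camera actually renders.
loader.load("/assets/marble-highquality.glb", (gltf) => {
  scene.add(gltf.scene);
});

// Collision mesh: coarser geometry, hidden from view and used only for
// queries such as a raycast "where is the floor?" check.
loader.load("/assets/marble-collision.glb", (gltf) => {
  gltf.scene.traverse((node) => {
    node.visible = false; // never drawn, only queried
  });
  scene.add(gltf.scene);

  const ray = new THREE.Raycaster(
    new THREE.Vector3(0, 2, 0),  // start 2 m above the origin
    new THREE.Vector3(0, -1, 0)  // cast straight down
  );
  const hits = ray.intersectObject(gltf.scene, true);
  if (hits.length > 0) console.log("floor height:", hits[0].point.y);
});
```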
Of course, if your goal is just to showcase the world, Marble also supports directly rendering the entire world as a video. Almost all of the official example videos are directly generated by Marble.
In addition, Marble supports enhanced processing of the exported video. It automatically adds finer detail, smooths out unnatural artifacts in the image, and can even add dynamic effects such as flickering flames, drifting smoke, or rippling water. The enhancement is still grounded in the original three-dimensional structure, so camera angles, lighting, and perspective stay consistent.
Through these export methods, Marble is no longer just an "AI that can generate worlds," but a three-dimensional creative platform that can truly be integrated into the workflows of various industries.
04 What Does It Mean When AI Starts "Generating Worlds"?
After seeing the capabilities of Marble, a question almost naturally arises:
What does it mean when AI really starts "generating worlds"?
Actually, before the release of Marble, Fei-Fei Li published a long article titled "From Language to the World: Spatial Intelligence is the Next Frontier of AI," which almost serves as a theoretical foundation for world models like Marble.
The article discusses a more fundamental question: the relationship between spatial intelligence and world models, and why they will become the key to the next generation of AI.
Fei-Fei Li believes that spatial intelligence determines how humans interact with the physical world and is the scaffolding for almost all cognitive abilities. From the improvement of spinning machines to the discovery of the DNA double helix, most of the breakthroughs in civilization have come from the understanding of "spatial problems," which cannot be solved by language description alone.
Therefore, if AI wants to truly understand the world, enter real-world scenarios, and interact with the physical environment, it must possess this "spatial intelligence," which in turn depends on a more fundamental ability: world models.
In Fei-Fei Li's view, a mature world model should have at least three core capabilities:
First, generativity. It can create a three-dimensional world with complete structure and plausible physics. Rather than producing a single image, it creates a world that can be operated in, and different kinds of input (text, images, structures) yield continuous, coherent scenes.
Second, multi-modality. It can infer the state of the world from various sensory inputs - pictures, videos, text, actions, and even gestures - which allows humans and intelligent agents to communicate in the same world.
Third, interactivity. When you take a step forward, open a door, or move an object, the world model must be able to predict the next frame of the world and maintain internal logical consistency.
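Purely as an illustration, the three capabilities can be restated as a hypothetical TypeScript interface. None of these type or method names come from World Labs; they only make the division of labor concrete.

```typescript
// Hypothetical interface, illustrative only: not an actual World Labs API.
type WorldInput =
  | { kind: "text"; prompt: string }
  | { kind: "image"; pixels: ImageBitmap }
  | { kind: "video"; frames: ImageBitmap[] }
  | { kind: "layout"; blocks: unknown[] }; // rough 3D blockout

interface Action {
  move?: [number, number, number]; // e.g. take a step forward
  interactWith?: string;           // e.g. "door_01"
}

// Opaque handle to the model's internal 3D/4D representation.
type WorldState = { readonly id: string };

interface WorldModel {
  // Generativity: build a structurally complete, physically plausible world.
  generate(input: WorldInput): Promise<WorldState>;

  // Multi-modality: fuse further observations of any kind into the same state,
  // so humans and agents can talk about one shared world.
  observe(state: WorldState, input: WorldInput): WorldState;

  // Interactivity: predict the next state after an action while keeping the
  // world internally consistent.
  step(state: WorldState, action: Action): WorldState;
}
```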
To achieve these capabilities, the technological threshold that world models need to cross is much higher than that of language models:
There is no unified training task like "predicting the next word." The input of world models is much more complex than text.
It requires a huge amount of highly complex data, including not only videos but also information such as depth, lighting, materials, and physical behavior.
It requires a new model architecture to represent 3D/4D space, rather than "flattening" all information like large language models (LLMs).
The release of Marble marks the first time a world model has debuted as a product. Its ability to generate a consistent 3D environment from multi-modal input is just one of a world model's basic capabilities.
In the future, when such models truly master the complete chain of "seeing, thinking, and acting," robots will be the most direct application area. Deeper scientific applications, such as automated experiments, material design, and simulation research, may take longer to mature.
However, the emergence of Marble shows that this path has begun to become clear:
From language intelligence to spatial intelligence; from the text world to the three-dimensional world.
This article is from the WeChat public account "Silicon Base Observation Pro," author: Silicon Base Jun. Republished by 36Kr with permission.