
The latest judgment from Fei-Fei Li's World Labs: After AI writes code, is the next step to "write the world"?

MachineHeart | 2026-03-04 19:44
3D as code

In the era of AI, we've grown accustomed to accomplishing everything by "talking": If you want a table, just speak up, and the large model will generate it in minutes. If you need to complete a complex task, simply clarify the goal and constraints, and the "AI assistant" will get to work.

In this era, as long as you can use language, you can mobilize increasingly complex systems. The reason large models have rapidly penetrated various industries is essentially that they are built on the mature interface of "text".

However, when the problem shifts to the spatial domain, things become more complicated. For example, if you want to change the layout of a house or teach a robot to handle goods in a new warehouse environment, you can't simply rely on repeatedly "generating images" with a single sentence. If you have to re-render the entire world every time you move a wall or change a light, both efficiency and reliability will be compromised. The spatial world requires structure, persistent objects, and rules, just as a program requires code rather than recalculating the result every time.

This blog post from World Labs addresses precisely this question: when AI truly starts to participate in spatial creation and real-world tasks, what is the "universal interface" through which it communicates with humans and other systems? The author's answer: 3D. 3D is not just a visual effect; it is a structured representation, like code. It can be generated, inspected, modified, and version-controlled. It can also be integrated into simulation systems, robot systems, and existing design toolchains.

Building on this core analogy, the article further elaborates: neural graphics is like a programming language, responsible for expressing spatial structure; the simulation engine is like a chip, responsible for executing rules and physics; and the world model begins to play the role of "writing spatial code". Understanding this is really understanding a larger shift: when space itself becomes a programmable medium, the way humans and machines collaborate will also be redefined.

The following is the detailed content of the blog.

3D: The "Code" of Space

We can understand the role of 3D in the spatial domain by comparing it with code. Code is a persistent abstraction designed to specify the underlying logic to be executed by a processor. For decades, it has driven a large part of the modern world. Today, AI models have become extremely proficient in reasoning and generating code; subsequently, this code is executed on hardware that predates the emergence of LLMs. As interfaces, code and 3D share important structural similarities in terms of why and how we use them.

Between Humans and Machines

Code is an extremely powerful interface between humans and machines. When an AI system generates code, humans can inspect, modify, debug, and integrate it into a larger system. This enables composite workflows: programmers and AI programming agents can collaborate to iteratively refine solutions.

3D representations can play a similar role. When a world model generates a 3D scene, object, or environment, humans can open it in familiar tools, edit the geometry, adjust constraints, rerun simulations, and correct errors. Here, composite workflows and pipelines can also be built: designers and engineers can collaborate with generative world models.

Between Machines

Code can also serve as a machine-to-machine interface. AI-generated programs can be plugged into compilers, runtime environments, APIs, and existing software infrastructures. Since code follows established abstractions, it can interoperate with existing tools.

Similarly, 3D outputs can be integrated with rendering engines, simulation systems, physics solvers, robot software stacks, and CAD tools. When a world model generates a structured 3D representation rather than pixels, it can participate in existing pipelines and interface with editing software and simulation engines.

In both cases, the key property is the same: state is externalized into structured components that other systems can consume.
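
This point about externalized state can be made concrete. Below is a minimal sketch, with an invented scene schema and field names, of how a structured scene description, unlike raw pixels, can be edited in place and then serialized for any downstream tool to inspect or diff:

```python
import json

# A hypothetical structured scene description. The schema and field
# names here are illustrative, not a real interchange format.
scene = {
    "objects": [
        {"id": "wall_01", "type": "mesh", "transform": [0.0, 0.0, 0.0]},
        {"id": "lamp_01", "type": "light", "transform": [1.5, 2.0, 0.2]},
    ]
}

def move_object(scene, obj_id, new_transform):
    """Edit one object in place; every consumer of the scene sees the change."""
    for obj in scene["objects"]:
        if obj["id"] == obj_id:
            obj["transform"] = list(new_transform)
            return obj
    raise KeyError(obj_id)

# Because the state lives outside any model, another tool (a simulator,
# a renderer, a version-control diff) can read the same serialized form.
move_object(scene, "lamp_01", (1.5, 2.0, 1.0))
serialized = json.dumps(scene, indent=2)  # inspectable, diffable, versionable
```

The design point is simply that the state survives as data: a pixel stream offers nothing comparable to `move_object` or a JSON diff.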

Imagine an alternative in the "code" domain. Instead of having an LLM write a program, we could make it be the program. For example, we could prompt an LLM: "Sort the following list of one million numbers." The model could try to simulate this behavior entirely within its token stream, ingesting the list and attempting to re-output it in sorted order.

However, we rarely use LLMs in this way, except as a "trick", and we don't expect them to succeed perfectly at such tasks. Why? Because code execution provides guarantees that raw reasoning does not, such as repeatability, human readability, and modular composability. Code can be stored, version-controlled, tested, and run independently of the model's transient context window. It separates reasoning, representation, and execution: you think about the algorithm, write the program as text, and then run it.
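
As a toy illustration of this separation, consider the difference between a model that is the sorting process and a model that writes one. In the sketch below no actual LLM is called; the string `generated_code` stands in for a model's emitted program, and a small list stands in for the million numbers:

```python
import random

numbers = [random.randint(0, 10**6) for _ in range(1000)]

# "Model as program": imagine the model emitting the sorted list token
# by token. Nothing guarantees the output is even a permutation of the
# input, and the work is redone from scratch on every query.

# "Model writes a program": the model emits code once; that code is then
# executed by deterministic hardware with guarantees the model lacks.
generated_code = "def sort_list(xs):\n    return sorted(xs)\n"

namespace = {}
exec(generated_code, namespace)       # run the emitted program text
result = namespace["sort_list"](numbers)

assert result == sorted(numbers)      # repeatable, testable, inspectable
```

The emitted program can be stored, version-controlled, and rerun on new inputs without the model in the loop, which is exactly the property the token-stream approach lacks.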

There is a direct parallel in spatial systems. The equivalent of making an LLM "be the program" is to abandon structured world representations and simulation engines and instead rely solely on black-box systems that mix state and observation, such as querying a model frame by frame for action-conditioned pixels or states. Such models may excel at their core tasks and find use in many applications, but they lack operable structure: their outputs cannot be inspected, edited, easily shared (for example, in multiplayer experiences, or as intentions and states shared between robots), or integrated into existing simulation and control systems.

Neural Graphics: The "Programming Language" of Space

If 3D is the analogy of code in the spatial domain, then what plays the role of a programming language: precise, expressive, and general enough to simulate the world?

Over the decades, a variety of 3D representations have emerged: meshes, voxels, point clouds, implicit fields, CAD formats, and more. However, creating rich, large-scale spaces, especially for digital twins, has been difficult and limited by hardware. Traditional 3D engines are built around strict memory and compute constraints, requiring simplified geometry and often manual asset creation. To minimize memory usage and bandwidth, pipelines are designed around asset reuse and compression. Data-driven methods are too expensive and conflict with the basic assumptions behind these systems' designs.

The explosive growth of hardware and software optimized for machine learning has broken these limitations. Modern GPUs, originally created for rendering triangles, have proven to be extremely useful for supporting large - scale matrix multiplication operations in neural networks. A new generation of GPUs is explicitly designed to handle AI workloads, with large memory chips to accommodate models and datasets. At the same time, these GPUs can still render graphics and run simulations exceptionally well.

Specifically, this hardware trend has enabled new memory- and compute-intensive technologies such as NeRF and Gaussian splatting to shine. We can now generate, store, and render world-scale representations that fit into memory and dynamically recalculate them when needed. Pipelines that once relied on static assets can become (partially or fully) generative. This has led to more realistic environments, greater diversity, and new application areas. For example, digital twins can transform from simplified, manually updated models into high-precision mirrors that continuously update with their physical counterparts, supporting monitoring, control, and safety-critical workflows.
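
For intuition, a Gaussian-splat scene is essentially a large array of simple primitives. The sketch below is illustrative only: the field names are invented, and real implementations store anisotropic covariances (scale plus rotation) and view-dependent color coefficients rather than a flat RGB value:

```python
from dataclasses import dataclass

# A minimal, illustrative sketch of the per-primitive data in a
# Gaussian-splat scene. Not a real file format or library API.

@dataclass
class Splat:
    position: tuple   # (x, y, z) center of the Gaussian
    scale: tuple      # per-axis extent
    color: tuple      # RGB (real systems use view-dependent coefficients)
    opacity: float    # blending weight in [0, 1]

# A "world-scale" scene is just a large array of such primitives that
# fits in GPU memory and can be regenerated or edited on demand.
scene = [
    Splat((0.0, 0.0, 0.0), (0.1, 0.1, 0.1), (1.0, 0.0, 0.0), 0.9),
    Splat((1.0, 0.5, 0.0), (0.2, 0.1, 0.3), (0.0, 1.0, 0.0), 0.5),
]

memory_per_splat = 3 + 3 + 3 + 1   # floats per primitive in this sketch
```

Because each primitive is just a handful of floats, the representation is memory-hungry but trivially parallel, which is exactly the workload modern ML-oriented GPUs are built for.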

In this novel architecture stack, neural graphics plays a role similar to a programming language. It provides a richly expressive medium for describing and generating spatial structures, just as high - level languages describe computational structures.

Simulation Engine: The "Chip" of Space

A world model becomes truly useful when it runs over time to enable interaction, persistence, and dynamic change. If 3D is code, then the simulation engine is the chip that runs it.

Interactivity is not just a single function. It's a series of system problems that simulation engines have been solving for decades: state management, physics, collision detection, lighting, synchronization, determinism, and replay.

At the very least, long-term interactive experiences require persistence. The world must have an identity that survives beyond a single rendering pass. Actions leave traces, objects maintain state, and a session can be resumed. This involves three core components:

  • State management (what exists)
  • Update rules (how actions and physics/rules change it)
  • Observation (how the current state is rendered into pixels or sensor outputs)
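
The three components above can be sketched as a minimal, factorized loop. All names here are illustrative; the point is that state persists outside any single rendering pass, so actions leave traces and a session can resume:

```python
# State: what exists, held outside any renderer or model.
def make_state():
    return {"ball": {"pos": 0.0, "vel": 0.0}}

# Update rule: how actions and simple dynamics change the state.
def update(state, action, dt=0.1):
    ball = state["ball"]
    ball["vel"] += action.get("kick", 0.0)
    ball["pos"] += ball["vel"] * dt
    return state

# Observation: how the current state is rendered into an output frame.
def observe(state):
    return f"ball at x={state['ball']['pos']:.2f}"

state = make_state()
update(state, {"kick": 2.0})   # a physical interaction at runtime
frame1 = observe(state)
update(state, {})              # no action: the state still persists and evolves
frame2 = observe(state)
```

Because the state dictionary outlives each call to `observe`, the kick leaves a trace across frames; an end-to-end (history + action) → pixels model would have to carry the same information implicitly in its activations.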

In principle, a large diffusion or generative model can collapse all of this into an end-to-end mapping: (history + action) → next frame. Here, "state" exists only in transient neural activations. This is an attractive research direction, and several models and projects are exploring how far this "fully pixelated" approach can go.

However, collapsing this stack introduces a fundamental trade-off. When memory, dynamics, and rendering are all entangled within a single network, the boundary between creation and consumption blurs. Physical interactions at runtime (kicking a ball) and non-physical edits (demolishing a wall) become the same kind of input. In our analogy above, editing code becomes indistinguishable from executing code. While this is convenient as a training objective for large-scale models, the conflation weakens guarantees of physical consistency, replayability, and determinism.

An alternative is a factorized or hybrid runtime: learned world models generate and interpret structure but, mediated by 3D interfaces and representations, selectively call external tools akin to existing engine components. Given the trajectory of LLM-based programming, these models will likely be able to build custom logic better suited to their use cases than off-the-shelf libraries and engines. Still, we predict there will remain a clear distinction between components used for perception, generation, and reasoning and those where "the rules matter".

In a factorized system, 3D becomes a powerful interface between humans and machines, exposing controllable, repeatable, and interoperable inputs and outputs.

3D as a Human-Machine Interface

Given our analogy of 3D to code, let's explore why 3D is a powerful medium for interaction between humans and machines, capable of describing and interacting with the physical and virtual worlds.

For machines: Many software systems already operate in the spatial dimension: simulators, robot software stacks, game engines, CAD tools, and GIS systems all interact through geometries, transformations, materials, trajectories, and constraints. If a world model generates outputs in the same structured language, it can directly plug into existing pipelines.

Equally important, machines increasingly need to communicate spatial intentions with each other. Planning agents may mark target areas, security monitors may mark restricted areas, perception modules may label uncertain geometries, and rendering modules may request new perspectives: these are all spatial concepts.

If all spatial reasoning is entangled in a single large model, one way to achieve this might be to share latent vectors. However, this is a strong assumption that requires sharing models or at least a shared latent space. In a heterogeneous, modular environment, this assumption does not hold. Even language is an inefficient exchange format for conveying geometry and constraints; structured 3D is a more natural lingua franca.
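
The contrast with latent vectors can be made concrete. In the sketch below, with an invented message schema, a structured spatial annotation is exchanged between heterogeneous modules using ordinary serialization, with no shared model or latent space required:

```python
import json

# Illustrative only: a structured spatial message that heterogeneous
# agents can exchange. The schema and field names are invented.
def make_zone_message(sender, zone_id, polygon, label):
    return {
        "sender": sender,
        "zone_id": zone_id,
        "polygon": polygon,   # explicit geometry: a list of (x, y) vertices
        "label": label,       # e.g. "restricted", "target", "uncertain"
    }

# A planning agent marks a target area...
msg = make_zone_message(
    "planner", "zone_7", [(0, 0), (4, 0), (4, 3), (0, 3)], "target"
)

# ...and any other module can parse it with ordinary tools. Exchanging a
# latent vector instead would only work if both sides shared weights or
# at least a common embedding space.
wire = json.dumps(msg)
received = json.loads(wire)
```

Explicit geometry also survives the round trip unambiguously, whereas describing the same polygon in natural language ("a roughly four-by-three region near the origin") loses precision.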

The ability to export is also crucial. When a world model can externalize its "thoughts" into concrete representations (such as splats, meshes, or videos), they become components that can be inspected, verified, version-controlled, tested, and reused: composable pipelines emerge.

For humans: 3D interaction is also natural for humans. We spend our waking lives navigating in space: reaching, walking, manipulating, aligning... Our mental models are built around persistent objects and relationships: "The chair is under the table", "The porch connects these rooms". When systems expose this explicit structure, they align with our existing ways of thinking.

This stands in sharp contrast to purely image-based workflows. In 2D animation, every frame must be redrawn, effectively reconstructing the world dozens of times per second. In 3D, the world is built once; after that you only move the camera, change the lighting, or animate objects. A single spatial edit automatically propagates to every rendered frame.

This separation of the 3D spatial representation from rendering reflects the separation between code and execution. You only need to modify the source code once and then rerun it, rather than rewriting every output from scratch.
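
This "edit once, rerun" property can be sketched in a few lines, with invented names: the scene is built once, each frame merely re-observes it, and a single edit shows up in every subsequent frame:

```python
# A persistent scene, built once (the "source code" of the world).
scene = {"chair": {"color": "red", "pos": (0, 0, 0)}}

def render(scene, camera_angle):
    """Rendering is "execution": it re-observes the scene, never rebuilds it."""
    chair = scene["chair"]
    return f"frame(angle={camera_angle}): {chair['color']} chair at {chair['pos']}"

# A short camera move: three frames, one world.
frames_before = [render(scene, a) for a in (0, 30, 60)]

# A single spatial edit ("modify the source once")...
scene["chair"]["color"] = "blue"

# ...automatically propagates to every frame rendered afterwards ("rerun").
frames_after = [render(scene, a) for a in (0, 30, 60)]
```

In a purely image-based workflow, the color change would instead have to be repainted into each of the three frames by hand or by another generation pass.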

Towards the Future

If 3D plays a role similar to code as a human - machine interface, then the development trajectory is clear: The world becomes "programmable", a medium that both humans and machines can generate, edit, combine, and share.

This is precisely the direction we're working towards at World Labs:

Marble is a multimodal world model designed to reconstruct, generate, and simulate 3D worlds. It can create persistent, navigable worlds from text, images, videos, or rough 3D layouts. These worlds can be edited, expanded, exported (as Gaussian splats, meshes, or videos), and integrated into downstream tools.

Marble's 3D conditioning interface is an experimental feature called Chisel, which advances the concept of using 3D as a coarse-grained control layer. It lets creators block out structures with walls, planes, volumes, and imported assets, then feed these as inputs to our model to generate rich, detailed visuals on top. Separating layout from style gives users explicit control over composition and appearance.

RTFM and Spark explore the rendering layer. RTFM experiments with "learned rendering", producing complex visual effects (such as reflections and shadows) from simple structured inputs. Spark is a high-performance Gaussian splatting renderer that integrates with WebGL, bringing neural graphics into the real-time web environment.

This field is evolving rapidly. World models will increasingly participate in hybrid architecture stacks: generating structured worlds ("code"), expressing them through neural graphics ("language"), and executing them within simulation engines ("chips"). This is a paradigm shift towards programmable, data-driven spatial systems that can support realistic environments, digital twins, robots, training, design, and entirely new application categories. The core premise remains the same: reliable communication and collaboration between humans, agents, and software require a precise, compact, inspectable, and manipulable interface.

That interface is 3D.

This article is from the WeChat official account "MachineHeart", and is published by 36Kr with authorization.