
Fei-Fei Li's 10,000-word article has gone viral, defining the next decade of AI.

新智元 2025-11-11 10:56
The next decade of AI will be about building machines with spatial intelligence. Fei-Fei Li's latest in-depth article reveals the core framework and three key pillars of the "world model" for spatial intelligence.

The next frontier of AI is "spatial intelligence."

It is a technology that can elevate "seeing" to "reasoning," transform "perception" into "action," and turn "imagination" into "creation."

But what exactly is "spatial intelligence"? Why is it so important? How can it be built? And how can it be applied?

Today, Fei-Fei Li wrote a ten-thousand-word article sharing her thoughts on building and using "world models" to unlock spatial intelligence.

In the new article, she outlined a framework for the goals that a "world model" truly equipped with spatial intelligence must achieve.

Specifically, building such an AI means equipping it with three core capabilities:

Endow AI with the imagination of a storyteller to create,

Give it the agility of a first responder to navigate,

And equip it with the rigor of a scientist to reason about space.

One point on which Fei-Fei Li and LeCun agree is that the "world model" is the key to unlocking spatial intelligence.

It must be able to generate worlds that obey the laws of physics and remain spatially consistent, handle multimodal inputs ranging from images to actions, and predict how those worlds will evolve or respond when something interacts with them.

The application domain of spatial intelligence is evolving along a clear path.

Currently, it is empowering creativity. The World Labs Marble project has already put these capabilities into the hands of creators and storytellers.

Next, it will move into the physical world, enabling robots to close the loop between perception and action.

The most transformative scientific applications, although they require more time, are expected to have a profound impact on human well-being.

The philosopher Ludwig Wittgenstein once wrote, "The limits of my language mean the limits of my world."

Fei-Fei Li said, "I'm not a philosopher, but I know well that, at least for AI, the world is far more than just words."

Spatial intelligence represents the frontier beyond language—it is an ability that integrates imagination, perception, and action, opening up infinite possibilities for machines to truly improve human life, from healthcare to creative expression, from scientific exploration to daily assistance.

Many netizens commented that this is a very important article by Fei-Fei Li, a must-read on spatial intelligence!

Here is the full translation. Let's take a look.

From Language to the World: Spatial Intelligence is the Next Frontier of AI

In 1950, when computing was little more than automated arithmetic and simple logic, Alan Turing posed a question that still resonates today: Can machines think? It took extraordinary imagination to see what he saw: that intelligence might one day be built rather than born.

This insight later launched a relentless scientific quest known as "artificial intelligence" (AI).

Twenty-five years into my own work in AI, Turing's foresight still inspires me. But how close are we to this goal? The answer is not straightforward.

Today, top AI technologies represented by large language models (LLMs) have begun to change the way we access and apply abstract knowledge.

However, they remain wordsmiths in the dark: eloquent but inexperienced, knowledgeable but ungrounded in reality.

Spatial intelligence will change the way we create and interact with the real and virtual worlds—bringing revolutionary changes to fields such as storytelling, creativity, robotics, and scientific discovery. This is the next frontier of AI.

The pursuit of visual and spatial intelligence has been the "North Star" guiding me since I entered this field.

That's why I spent several years building ImageNet, the first large-scale dataset for visual learning and benchmarking, which became one of the three cornerstones of modern AI, alongside neural network algorithms and modern computing hardware such as graphics processing units (GPUs).

That's why my academic laboratory at Stanford University has spent the past decade combining computer vision with robot learning.

And that's why, more than a year ago, I founded World Labs together with Justin Johnson, Christoph Lassner, and Ben Mildenhall: to bring this possibility to full fruition for the first time.

The founding team of World Labs, from left to right: Ben Mildenhall, Justin Johnson, Christoph Lassner, and Fei-Fei Li

In this article, I will explain what spatial intelligence is, why it is important, and how we can build a "world model" that can unlock it—its far-reaching impact will reshape creativity, embodied intelligence, and human progress.

Spatial Intelligence: The Cornerstone of Human Cognition

The development of artificial intelligence has never been more exciting. Generative AI, such as large language models, has moved from the laboratory to daily life, becoming a tool for creativity, productivity, and communication for billions of people.

These models have demonstrated capabilities that were once considered out of reach, easily generating coherent text, vast amounts of code, realistic images, and even short video clips. Whether AI will change the world is no longer in doubt.

By any reasonable standard, it has already done so.

However, there are still many areas beyond our reach. The vision of autonomous robots, although fascinating, remains theoretical and far from becoming the daily necessity that futurists have long promised.

The dream of making great leaps in research in fields such as disease treatment, new material discovery, and particle physics remains largely unfulfilled.

And the promise of AI truly understanding and empowering human creators—whether it's helping students understand the complex concepts of molecular chemistry, assisting architects in envisioning spaces, supporting filmmakers in building worlds, or providing support for anyone seeking a fully immersive virtual experience—also remains out of reach.

To understand why these capabilities are still difficult to achieve, we need to examine the evolutionary history of spatial intelligence and how it shapes our perception of the world.

Vision has long been the cornerstone of human intelligence, but its power stems from a more fundamental ability. Long before animals learned to build nests, raise offspring, communicate in language, or establish civilizations, the simple act of perception quietly ignited the evolutionary spark towards intelligence.

This seemingly isolated ability to collect information from the external world (whether it's a glimmer of light or a touch) built a bridge between perception and survival, and it became stronger and more sophisticated with each generation. Layers of neurons grew from this bridge, forming a nervous system that could interpret the world and coordinate the interaction between organisms and the environment.

Therefore, many scientists infer that the cycle of perception and action became the core driving force for the evolution of intelligence and the foundation for nature to create us humans—the ultimate creation that integrates perception, learning, thinking, and action.

Spatial intelligence plays a crucial role in defining how we interact with the physical world.

Every day, we rely on it to perform the most ordinary actions: parking a car by imagining the shrinking gap between the bumper and the curb, catching a set of keys tossed across the room, weaving through a crowded sidewalk without colliding with anyone, or sleepily pouring coffee into a cup without looking.

In more extreme situations, firefighters make their way through a smoke-filled, unstable building, making split-second judgments about its structural stability and their own survival, and communicating through gestures, body language, and a shared professional instinct that words cannot replace.

And infants learn about the world through playful interactions with the environment long before they learn to speak. All of this happens so intuitively and naturally—something that machines have yet to achieve with such ease and proficiency.

Spatial intelligence is also the cornerstone of our imagination and creativity. Storytellers create incredibly rich worlds in their minds and use various visual media, from ancient cave paintings to modern movies and immersive video games, to present these worlds to others.

Whether it's a child building a sandcastle on the beach or playing Minecraft on a computer, spatially grounded imagination forms the basis of interactive experiences in the real or virtual world. In industry, the simulation of objects, scenes, and dynamic interactive environments powers countless key business use cases, from industrial design to digital twins and robot training.

History is full of moments that define the progress of civilization, in which spatial intelligence played a central role.

In ancient Greece, Eratosthenes turned light and shadow into geometry: by measuring a shadow angle of about 7 degrees in Alexandria at the moment the sun stood directly overhead in Syene, he calculated the circumference of the Earth.
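The geometry behind Eratosthenes' estimate can be sketched in a few lines of Python. The shadow angle measured in Alexandria equals the angle between the two cities as seen from the Earth's center, so the distance between them is that same fraction of the full circle. The function name and the 5,000-stadia distance are assumptions for illustration: the article gives only the 7-degree angle, and 5,000 stadia is the figure traditionally attributed to Eratosthenes.

```python
def circumference_from_shadow(angle_deg: float, distance: float) -> float:
    """Scale the arc between two cities up to a full 360-degree circle.

    angle_deg: shadow angle measured at the northern city, in degrees.
    distance:  north-south distance between the two cities (any unit).
    Returns the estimated circumference in the same unit as `distance`.
    """
    if not 0 < angle_deg <= 360:
        raise ValueError("angle_deg must be in the range (0, 360]")
    # The arc `distance` spans `angle_deg` out of 360 degrees.
    return distance * 360.0 / angle_deg


if __name__ == "__main__":
    # Assumed inputs: 7 degrees (from the article), 5,000 stadia (traditional figure).
    estimate = circumference_from_shadow(7.0, 5000.0)
    print(f"Estimated circumference: {estimate:,.0f} stadia")
```

The result is only as good as the two measurements; the point of the sketch is that a single angle and a single distance, combined through spatial reasoning, are enough to size a planet.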

Hargreaves' "Spinning Jenny" revolutionized the textile industry with a spatial insight: placing multiple spindles side by side in a single frame allowed one worker to spin several threads at once, an eightfold increase in productivity.

Watson and Crick discovered the structure of DNA by manually building 3D molecular models. They kept manipulating metal plates and wires until the spatial arrangement of base pairs "clicked" into perfect alignment.

In each case, spatial intelligence drove the progress of civilization whenever scientists and inventors had to manipulate objects, envision structures, and reason about physical space, none of which can be captured in words alone.

Spatial intelligence is the cornerstone on which our cognition is built. Whether we are passively observing or actively creating, it is at work. It drives our reasoning and planning, even when dealing with the most abstract topics.

It is essential for the way we interact—whether verbally or physically, with our peers or with the environment itself.

Although most of us are not uncovering new truths about the cosmos every day, as Eratosthenes did, we think in much the same way: we perceive a complex world through our senses and then use an intuitive understanding of how physics and space work to make sense of it.

Unfortunately, today's AI cannot think like this.

Great progress has indeed been made in the past few years. Multimodal large language models (MLLMs) are trained on large amounts of multimedia data in addition to text, which gives them some basic spatial awareness. Today's AI can analyze pictures, answer questions about them, and generate hyper-realistic images and short videos.

Thanks to breakthroughs in sensor and tactile technologies, our most advanced robots are beginning to manipulate objects and tools in highly constrained environments.

However, the frank truth is that the spatial capabilities of AI are still far from human levels, and its limitations are quickly exposed.

In tasks such as estimating distances, directions, and sizes, or "mentally rotating" objects by generating images from new angles, even the most advanced MLLMs rarely perform better than random guessing. They cannot navigate mazes, identify shortcuts, or predict basic physical phenomena. AI-generated videos, although promising and undeniably cool, usually