Fei-Fei Li Reveals the "Achilles' Heel" of Large Models: Without Spatial Intelligence, All the Chatting is Just Empty Talk
While the tech community remains mired in the "parameter involution" of ever-larger models, Fei-Fei Li, a professor at Stanford University and co-founder of World Labs, has pointed to a more fundamental bottleneck: current AI is trapped in a "flat world" of text and two-dimensional images, seriously out of touch with the three-dimensional, physically governed real world we live in.
On November 11th, in a widely circulated long essay, Fei-Fei Li stated plainly that spatial intelligence is the key to breaking through this cognitive barrier: it is not only the next frontier in the evolution of artificial intelligence but also the turning point at which AI truly enters the physical world and transforms from a "conversation tool" into an "action partner."
This article summarizes the technical path and application prospects of spatial intelligence that Fei-Fei Li lays out in that essay and, drawing on the insights of industry practitioners, considers how this transformative force may reshape the human-machine relationship and the industrial ecosystem.
From language to the world: spatial intelligence is the dawn of AI's next era
Current artificial intelligence, especially generative AI, has profoundly changed the world in terms of creativity, efficiency, and communication.
However, Fei-Fei Li pointed out that in many key areas the grand vision for today's AI remains far from realized. Autonomous robots have not yet moved beyond laboratories and narrow scenarios, and their promised "integration into daily life" is still largely a concept on paper;
in scientific research, AI has shown potential but remains a long way from delivering a true efficiency revolution in disease diagnosis and treatment, new-materials R&D, and fundamental physics;
and in creative work, whether helping students grasp complex abstract concepts, supporting architects' spatial thinking, or helping creators build virtual worlds, AI still lacks a deep understanding of human intent and scene requirements and struggles to achieve genuine cognitive collaboration.
She further emphasized that, fundamentally, this is because AI lacks the spatial intelligence that humans possess innately.
Spatial intelligence is the cornerstone of human cognition and civilization. It is not an advanced skill but the fundamental ability through which we interact with the physical world via the "perception-action" cycle, driving our daily behavior, non-verbal communication, imagination, and creativity. From Eratosthenes' measurement of the Earth's circumference to Watson and Crick's discovery of the DNA double helix, civilization's major breakthroughs have often stemmed from the ability to manipulate, visualize, and reason about space, an ability that pure text cannot provide.
Regrettably, the spatial ability of current AI has fundamental limitations.
Multi-modal large language models (MLLMs), trained on massive multimedia data, have acquired basic spatial perception: they can analyze images, answer questions about them, and generate ultra-realistic images and short videos. Meanwhile, thanks to breakthroughs in sensing and tactile technologies, the most advanced robots can manipulate objects and tools in highly constrained environments.
However, AI's spatial ability is still far from the human level. The most advanced multi-modal large language models perform little better than random guessing at estimating distances, directions, or sizes, or at "mental rotation," imagining objects from new angles. They cannot navigate mazes, identify shortcuts, or anticipate basic physics, and AI-generated videos often lose coherence after a few seconds.
In her analysis, these models' understanding of the world is superficial and fragmented, lacking humans' holistic, associative, and intuitive cognition. Human cognition of the world is holistic, encompassing not only what we see but also the spatial relationships between things, their meanings, and their importance.
Without this ability, AI is out of touch with the physical reality it tries to understand. It cannot effectively drive cars for us, manipulate robots in homes and hospitals, provide new immersive interactive experiences for learning and entertainment, or accelerate the exploration process in the fields of material science and medicine.
The power of spatial intelligence lies in understanding the world through imagination, reasoning, creation, and interaction (rather than just description).
Therefore, Fei-Fei Li concluded that the future of AI lies in going beyond the boundaries of language and developing powerful spatial intelligence, which will be the key to the next leap forward.
The key to the next generation of AI is to develop a "world model"
Fei-Fei Li pointed out that building artificial intelligence with spatial intelligence requires going beyond the current paradigm of large language models and developing a more fundamental "world model." The core of such a model is the ability to understand, reason about, and generate complex worlds that are consistent in semantics, geometry, physics, and dynamics.
She added that, to achieve this, a world model needs three basic abilities. The first is generative ability: creating simulated worlds that are fully consistent in perception, geometry, and physical dynamics, with a deep grasp of how the world state evolves continuously over time;
the second is multi-modal ability: naturally handling many forms of input and output, including images, videos, text, and actions;
the third is interactive ability: predicting the next state of the world from an input action, thereby closing the perception-action loop (a minimal code sketch of these three abilities follows below).
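To make the three abilities concrete, here is a minimal, purely illustrative Python sketch of what such an interface might look like. All names (WorldState, Action, WorldModel) and their fields are hypothetical placeholders for exposition, not World Labs' actual design or API.

```python
# Illustrative sketch only: hypothetical names, not World Labs' API.
from dataclasses import dataclass, field
from typing import Protocol

@dataclass
class WorldState:
    """A unified state spanning semantics, geometry, and physical dynamics."""
    semantics: dict = field(default_factory=dict)   # object labels, relations
    geometry: dict = field(default_factory=dict)    # 3D layout, poses
    dynamics: dict = field(default_factory=dict)    # velocities, forces

@dataclass
class Action:
    """An agent's action, e.g. a camera move or a gripper command."""
    kind: str
    parameters: dict = field(default_factory=dict)

class WorldModel(Protocol):
    # 1. Generative: create a perceptually, geometrically, and physically
    #    consistent world from a prompt.
    def generate(self, prompt: str) -> WorldState: ...

    # 2. Multimodal: ingest images, video, or text as conditioning signals.
    def observe(self, images=None, video=None, text=None) -> WorldState: ...

    # 3. Interactive: predict the next state given an action,
    #    closing the perception-action loop.
    def step(self, state: WorldState, action: Action) -> WorldState: ...
```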
Building such a world model, however, is far harder than building a language model. Language is a purely generative phenomenon of human cognition and, as a one-dimensional sequential signal, is comparatively simple to represent; the "world" follows far more complex rules, and representing it demands far greater dimensionality and complexity.
At World Labs, Fei-Fei Li and her research team are committed to making fundamental progress towards the following goals:
The first is to define a new, general training objective function whose role is analogous to "next-word prediction" in language models. Because a world model's input and output spaces are vastly more complex, defining such an objective is itself a major challenge. Although the path is not yet clear, this objective and the internal representations it induces must accurately capture geometric and physical law, so that the world model can serve as a unified carrier of both reality and imagination (a schematic comparison with the language-model objective is sketched after this list);
The second is to solve the problem of large-scale training data. Internet images and videos constitute a huge data source, but the core challenge is extracting three-dimensional spatial information from these two-dimensional signals, and the research hinges on model architectures that can fully exploit such large-scale visual data. High-quality synthetic data and additional modalities such as depth and touch are also indispensable, and further progress will depend on better sensing systems, more robust signal-extraction algorithms, and more powerful neural simulation methods;
The third is new model and representation architectures. Existing paradigms (such as MLLMs and video diffusion models) treat data as one- or two-dimensional sequences and struggle with basic spatial tasks such as counting and long-horizon memory. Breakthroughs will depend on new architectures with 3D/4D awareness and memory mechanisms; for example, the RTFM model developed by World Labs introduces space-related frames as memory units, achieving efficient real-time generation while keeping the generated world persistent, pointing to a direction for architectural innovation.
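To make the analogy in the first point concrete: next-token prediction gives language models a single, universal loss, shown first below. The second formula is only an illustrative schematic of what a world-model analogue could look like, with world state s_t, action a_t, a learned transition f_theta, and some discrepancy measure d; it is an assumption for exposition, not a formulation proposed in the essay, which treats the objective as an open problem.

```latex
% Next-token prediction: the universal training objective of language models.
\mathcal{L}_{\mathrm{LM}}(\theta) = -\sum_{t} \log p_\theta\!\left(x_t \mid x_{<t}\right)

% Purely illustrative analogue for a world model (not a settled formulation):
% predict the next world state s_{t+1} from the current state s_t and action a_t,
% scored by a discrepancy d(.,.) that must respect geometry and physics.
\mathcal{L}_{\mathrm{WM}}(\theta) = \sum_{t} d\!\left(s_{t+1},\; f_\theta(s_t, a_t)\right)
```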
Fei-Fei Li believes that although the challenges are huge, this is the key path to achieving a breakthrough in the spatial intelligence of artificial intelligence. This research will give rise to a new generation of creative and productivity tools, and ultimately enable artificial intelligence to obtain the core ability to interact deeply and effectively with the physical world.
From creative tools to scientific partners, the three-stage empowerment path of spatial intelligence
Fei-Fei Li also elaborated on her core motivation for advancing artificial intelligence and her vision for applying spatial intelligence. She firmly believes that the fundamental purpose of artificial intelligence must be to enhance human capabilities rather than replace humans: AI should expand human creativity, human connection, and fulfillment in life, while always respecting human autonomy and dignity. Guided by this people-centered conviction, she regards spatial intelligence as the key frontier for realizing that vision.
She pointed out that the application of spatial intelligence will be deepened in multiple fields in stages.
In the short term, creative tools such as World Labs' Marble platform are enabling creators to quickly build and iterate on explorable 3D worlds, transforming storytelling and spatial narrative in film, games, architecture, and industrial design, and giving rise to new immersive, interactive experiences.
In the medium term, robotics is the core arena for embodying spatial intelligence. Addressing the central bottleneck of scarce robot training data, Fei-Fei Li believes that world models can greatly expand the boundaries of robot learning by generating high-fidelity simulation data, rapidly narrowing the gap between simulation and reality and letting robots learn across a vast range of states and environments, yielding generalizable understanding, reasoning, and interaction (a toy sketch of this pipeline follows below).
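As a rough illustration of that idea, and nothing more, the toy loop below uses a stand-in "world model" with random linear dynamics to generate cheap simulated transitions for a robot policy. Every name here (ToyWorldModel, collect_rollouts, random_policy) is a hypothetical placeholder, and a real world model would be a learned, high-fidelity generator rather than a random stub.

```python
# Toy illustration: a world model standing in for a physical simulator
# to mass-produce (state, action, next_state) training data.
import numpy as np

class ToyWorldModel:
    """Stub dynamics: next_state = state + A @ state + B @ action + noise."""
    def __init__(self, state_dim=8, action_dim=3, seed=0):
        self.rng = np.random.default_rng(seed)
        self.A = self.rng.normal(scale=0.1, size=(state_dim, state_dim))
        self.B = self.rng.normal(scale=0.1, size=(state_dim, action_dim))

    def reset(self):
        return self.rng.normal(size=self.A.shape[0])

    def step(self, state, action):
        next_state = state + self.A @ state + self.B @ action
        next_state += self.rng.normal(scale=0.01, size=state.shape)  # model noise
        return next_state

def collect_rollouts(model, policy, episodes=100, horizon=50):
    """Roll the policy inside the world model to gather simulated experience."""
    data = []
    for _ in range(episodes):
        s = model.reset()
        for _ in range(horizon):
            a = policy(s)
            s_next = model.step(s, a)
            data.append((s, a, s_next))
            s = s_next
    return data

# Example: a random exploration policy generating cheap simulated experience.
random_policy = lambda s: np.random.uniform(-1, 1, size=3)
dataset = collect_rollouts(ToyWorldModel(), random_policy)
print(len(dataset))  # 100 episodes x 50 steps = 5000 transitions
```

The point is only the shape of the pipeline: rolling a policy out inside a learned simulator can yield far more experience than physical hardware alone, which is exactly the data bottleneck Fei-Fei Li describes.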
On this basis, for robots to become true partners in human-machine collaboration, they need not only the spatial intelligence to perceive, plan, and act, but also an empathetic grasp of human goals and behaviors, so that they can effectively assist people in settings such as laboratories and homes while fully respecting their autonomy.
In addition, Fei-Fei Li pointed out that the world model will drive robots to break through the limitations of humanoid forms and develop towards diverse forms such as nano and soft robots. By integrally modeling robot perception, movement, and the environment, it provides key simulation training and evaluation support, unlocking their full potential in specific scenarios.
In the long term, the profound impact of spatial intelligence will radiate to key fields such as science, healthcare, and education.
In science, it can simulate experiments, test hypotheses, and explore extreme environments, accelerating discoveries in fields such as climate science and material research.
In healthcare, it will bring change at many levels, from simulating molecular interactions and assisting medical-image diagnosis to environmental monitoring and robot-assisted care.
In education, it can greatly improve learning efficiency and skill training effects by concretizing abstract concepts and creating immersive and interactive learning experiences.
Fei-Fei Li finally emphasized that although the application scenarios are infinite, the common goal of all these developments remains the same: to use artificial intelligence (especially spatial intelligence) to enhance human expertise, accelerate human discoveries, and deepen human care, rather than replace the unique judgment, creativity, and empathy of humans. Achieving this grand blueprint requires the collective efforts of the entire artificial intelligence ecosystem.
Spatial intelligence: Reconstructing the human-machine relationship and industrial ecosystem
The blueprint of spatial intelligence that Fei-Fei Li depicts shows that it is far more than a technological breakthrough; it is the cornerstone of the next revolution in human-machine interaction. Traditional AI has been positioned as a "tool" that understands the world through screens and text, whereas spatial intelligence lets AI truly enter the real environment and become a "scene partner" able to perceive context, understand intent, and collaborate proactively.
Liu Zhenfei, the chairman of AutoNavi Maps, recently argued at the Yunqi Conference that spatial intelligence will become standard infrastructure for every industry's interaction with the physical world, just as cloud computing has. He emphasized: "If large language models endow AI with thinking ability, then spatial intelligence endows AI with the ability to understand and predict the physical space-time, driving AI to transform from a conversation tool into an action partner."
This judgment reveals the core direction of technological evolution: when AI can not only understand instructions but also perceive the environment, anticipate needs, and perform tasks in three-dimensional space, its value creation mode will undergo a qualitative leap.
This means the yardstick for intelligence will shift from processing speed to the ability to adapt to real scenarios. Whether it is cutting-edge VR/AR glasses, robots regarded as the next-generation computing platform, or the self-driving cars reshaping transportation, each is, in essence, an intelligent agent that must "survive" autonomously in the three-dimensional physical world: it must perceive its environment accurately, understand physical laws, make decisions in real time, and execute actions dexterously.
Although many challenges remain, Fei-Fei Li's theoretical framework is converging with industrial practice, sketching a path from technological breakthrough to ecosystem construction.
Huang Xiaohuang, co-founder of Qunhe Technology, one of Hangzhou's "Six Little Dragons," stated plainly that spatial intelligence is the crucial new field after large language models and hailed Fei-Fei Li's research direction as "true spatial intelligence," one that spans tools, large models, and data, rather than the previous generation of surveillance-style technologies built on image or video understanding.
He sees this as the inevitable path for machines to move from automation to "embodied intelligence" and predicts that robots may eventually number 70 billion, far exceeding the human population; facing such a network of intelligent agents, business models will shift from "charging humans" to "serving machines."
Qunhe Technology has transformed itself from an Internet company into a spatial intelligence company focused on robots' "spatial understanding." With hardware already well covered by other players, it concentrates on the intelligence and algorithms itself.
Meanwhile, Deng Yongqiang, founding partner of InnoSpring, approached the topic from the perspective of investment and ecosystem, proposing the concept of a "New AI Continent" and elevating spatial intelligence to the level of civilizational evolution. He believes this is not merely a technological revolution but a "super cycle" comparable to the Industrial Revolution, its core being a fundamental leap from "information intelligence" to "embodied intelligence."
He particularly emphasized that the relationship between AI and traditional fields is "not about replacement but symbiosis and co-prosperity," a framing that offers a more inclusive path for technological development. Deng Yongqiang predicts that 2025 will be a pivotal year for the large-scale deployment of spatial intelligence technologies, and that the current window, while the "technological paradigm has not yet converged," is a strategic opportunity for innovators to help define the next generation of standards.