World models are attracting fierce funding: Are they the ultimate destination for AI?
In November last year, Fei-Fei Li, a professor at Stanford University, put forward a concept: the world model, which sparked extensive discussions in the AI industry.
Meanwhile, the valuation of Fei-Fei Li's company, World Labs, skyrocketed to $5 billion. Similarly, Yann LeCun, a scientist also focusing on large models, has a company valued at over $3 billion.
Thus, a crucial question arises: Is the world model the ultimate destination for artificial intelligence? This article will explore this topic. The guests participating in this discussion are as follows:
Highlights in advance:
1. What is the "world model"?
Answer: It is a model that enables AI to understand and predict the real world. There are many versions, and there is no consensus.
2. Why is the world model so popular?
Answer: It attracts strong financing, commands high valuations, and is extremely useful to humans.
3. Is the world model the ultimate destination for AI?
Answer: It might be. AI can handle the execution, and humans only need to focus on creativity.
4. How can the world model make money?
Answer: By turning it into products, such as the brain of embodied intelligence.
For more insights, please read the live transcript of the roundtable discussion.
What exactly is the world model?
Lin Juemin: Currently, the "world model" is indeed very popular. People see that Yann LeCun has raised billions of dollars in financing, and Fei-Fei Li has raised $5 billion. The valuations of these companies are rising rapidly, and a new wave has also emerged in China.
Interestingly, there seems to be no consensus on what the "world model" exactly is.
Wang Sheng: First of all, people may have different understandings of the world model.
In fact, there are two typical schools of the world model: one is the world model of embodied intelligence, and the other is the world model of the digital space.
The world model as we understand it does not fully simulate the real world but models specific domains or "worlds", such as medicine, finance, and law. Each field can be regarded as an independent world.
Taking the medical field as an example, suppose we build a "medical world model" that can simulate the entire process after you get sick. If someone gets influenza A, through this model, we can see the patient's physical reactions, symptom changes, and biochemical index changes without intervention.
If the patient receives treatment, the model will show the effects of the medication until recovery or the aggravation of the condition. We use this model to explore the real "ground truth".
One example is the Tsinghua Zijing Zhikang team we invested in. Their medical AI has outperformed 97% of doctors globally in diagnosing nearly 40 diseases across more than 30 diagnosis and treatment fields.
Their success comes from simulating the entire process of disease development through the medical world model. Through this world model, we can enable AI to learn faster and even accumulate experience in a short time to become a world-class doctor.
Wu Wei: We believe that to understand the essence of the world model, we first need to understand its two core keywords: simulation and interaction.
"Simulation" refers to building a virtual world through simulation technology to train AI for reasoning and decision-making. "Interaction" refers to enabling AI to better adapt to and respond to changes in the real world through interaction with the environment and humans.
From the perspective of academic and industrial development, the concept of the world model was first proposed around 2018 and has been developing for about seven or eight years. During this period, three main schools of the world model have emerged:
The first school uses the world model as a simulator, synthesizing a large amount of simulation data in the cloud for agent training. NVIDIA's Omniverse and Cosmos systems follow this route.
The second school uses the world model as a general interaction interface. Projects such as Google's Gemini 3 and Marble from Fei-Fei Li's team belong to this category, mainly for entertainment and digital experience applications.
The third school, which is also our focus, empowers the brain directly with the reasoning ability of the world model, enabling AI to have endogenous spatial reasoning and imagination abilities. In this way, AI can guide robots to make more efficient decisions and interactions through reasoning and simulation without having seen certain data.
This method is different from traditional imitation learning because imitation learning relies on the accumulation of offline data, while we focus more on how AI can predict and adapt to new environments through endogenous simulation abilities.
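The idea of deciding through internal simulation rather than replaying recorded demonstrations can be illustrated with a toy sketch. Everything in it is an assumption made for illustration: `world_model` is a hand-written stand-in for a learned dynamics model, the state is a single number, and `plan` and `run` are hypothetical helpers, not any team's actual method.

```python
import itertools


def world_model(state, action):
    """Hypothetical dynamics model: predicts the next 1-D position.

    A real system would use a trained network; this is a hand-written stand-in.
    """
    return state + action


def plan(state, goal, actions=(-1, 0, 1), horizon=3):
    """Return the first action of the imagined action sequence with the lowest cost.

    Cost is the summed distance to the goal over all imagined steps, so
    sequences that reach the goal sooner are preferred.
    """
    best_seq, best_cost = None, float("inf")
    for seq in itertools.product(actions, repeat=horizon):
        s, cost = state, 0
        for a in seq:
            s = world_model(s, a)  # imagined step, no real-world interaction
            cost += abs(goal - s)
        if cost < best_cost:
            best_seq, best_cost = seq, cost
    return best_seq[0]


def run(start, goal, max_steps=10):
    """Replan at every step and record the visited positions."""
    state = start
    trajectory = [state]
    for _ in range(max_steps):
        if state == goal:
            break
        # In a real robot this step would be executed in the environment;
        # here the same toy model plays the role of the environment.
        state = world_model(state, plan(state, goal))
        trajectory.append(state)
    return trajectory


print(run(0, 2))  # [0, 1, 2]
```

The planner never consults recorded demonstrations. It only queries the model for predicted outcomes, which is the contrast with imitation learning drawn above.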
Wu Wei: In our understanding, the world model is a "foundation model", which is the basic model required for robots. What we need to model is the endogenous cognition at the level of physical space movement and operation. This is our technical route.
From our definition, the world model is actually an end-to-end model, or it can be understood as a two-end end-to-end large model.
In contrast, Qianjue Technology mainly focuses on the internal brain architecture. That is, the human brain has multiple partitions, and each partition corresponds to a different small model or, to use a popular term in the agent field, a "skill". Combining these small models with the top-level model can achieve a lower-power brain simulation.
This is my understanding of the two. Of course, the team's genes also play a decisive role. Our genes determine that we focus on the end-to-end construction of the model and data scaling.
Song Yachen: Fei-Fei Li recently completed a new round of financing, and the valuation has reached $5 billion. I secretly read their business plan, and it mentions that there are three main application scenarios for the world model defined by Fei-Fei Li:
The first is 3D generation in the entertainment industry;
The second is spatial intelligence in XR (extended reality) and the metaverse;
The third is robotics.
Actually, the first two scenarios were the earliest ones to be explored.
The key point I want to make is that I think the world model may indeed be the ultimate answer in the AI era, which includes two things:
First, the world model can help develop embodied abilities, making various embodied scenarios more popular and enabling more robots to replace human work.
Second, if the labor force is liberated, what should humans do?
From the agricultural era to the industrial era and then to the information era, we can observe two characteristics:
1. The happiness index of humans is getting higher, life expectancy is getting longer, the infant survival rate is getting higher, and the choices of products in supermarkets are increasing;
2. However, people's working hours are getting longer, and the competition is getting fiercer.
In the early agricultural era, people worked for a short time. But as time passed, we entered a more competitive society. Especially in the information era, the 996 work schedule (9 am to 9 pm, six days a week) has become the norm, and people work hard for the so-called "blessings" of large companies.
However, the emergence of AI has changed all this. Theoretically, the AI era should be more competitive, but in fact, there seems to be nothing left to compete for because robots can do everything for humans. So, where will productivity and the labor force go?
I have a theory that humans will ultimately compete in their own creativity.
When AI can help humans amplify their creativity in real time without any threshold or cost, everyone can create world-class, interactive experiences, just like God created the world, setting physical rules and creating all things.
If this day comes, everyone can create their own virtual world and have a better experience.
For example, gravitational acceleration is no longer limited to 9.8 m/s². You can fly, grow wings, and create different social rules, evaluation systems, and even physical rules. Humans will have more choices and can spend time on things they really like.
This will be an era where everyone serves others and others serve themselves. Everyone can use AI to amplify their creativity and attract more people to join their world.
If such a world can really come, then we will be in an era of creating a blissful world for others, like saints providing the best experiences for others.
The role of AI in this is to enable everyone to create complete, world-class, interactive experiences like a god. This is why I think the world model is so important for the ultimate destination of AI.
In the future, everyone will be like "Ma Liang" with a magic brush, making their wishes come true.
Jiang Yizhou: The earliest research on the world model was mainly to understand and predict the physical world around us.
Newton, for example, deduced the law of universal gravitation after observing a falling apple. Without a world model, our reasoning ability is limited, and we can only make judgments based on the observed phenomena.
As research deepened, Fei-Fei Li proposed that the world model is not only about understanding the world but also about predicting it. We started working on "video prediction" ten years ago to predict the movement trajectory of robots, which is very useful for robots.
Robots need to be able to predict future situations in reality, not just make decisions based on past data.
For example, folding clothes, a seemingly simple task, requires different operations for clothes of different shapes. Through the world model, robots can better understand the characteristics of clothes and make more precise movements.
Brain-like intelligence is the direction I am currently focusing on. It emphasizes the collaboration of multiple small models rather than solving all problems with a single large model. In the field of robotics, the world model helps robots predict future scenarios, making them more efficient in performing tasks.
For example, when a robot is cleaning, it will adjust the task steps based on the predicted results to improve execution efficiency.
An interesting experiment is tying a knot on a plastic bag. Our initial training method was not flexible enough. Later, we created a "plastic bag world model" to enable the robot to understand the physical characteristics of different plastic bags and handle various situations intelligently.
This method allows the model to adapt to more scenarios, not just specific tasks.
In short, the world model helps robots better understand and predict the unknown world, thereby improving work efficiency.
How can the world model be implemented? Who can succeed?
Lin Juemin: With so many development directions for the world model, what are people ultimately competing for? How can we compare different technical routes?
Wang Sheng: From an investor's perspective, why are people so interested in the world model?
For us investors, the "world model" is now a consensus label, just like "embodied intelligence" in the past two years. Once you hear it, you feel like investing.
But in fact, it's just a consensus label.
People have different definitions of the world model. Just like the guests here today, each person's understanding is not exactly the same.
As investors, we are willing to accept all seemingly reasonable definitions of the world model. The key is whether it can be implemented in specific technologies, whether it can achieve continuous growth, and whether it has high market potential.
From my personal perspective, the future world model needs to have two core elements:
First, it needs to have a verification system close to the "ground truth" that can generate a large amount of high-quality data. The data not only needs to be large in quantity but also real and of high quality to provide valuable feedback for model training.
Second, the data distribution should be balanced, including both dense data and sparse data, to avoid overfitting and ensure that the trained model has strong generalization ability. Generating a large amount of high-quality data through the world model is the basis for model training.
Wu Wei: From the perspective of the business essence, the competition of the world model still boils down to a core question: Can a company survive in the competition?
As a commercial company, we must understand that there are only two ways to ensure survival: either having a healthy cash flow or having a path of high growth and a high ceiling.
From the development stage of the world model, it is currently closer to the second model - the stage of rapid growth. For companies working on the world model, the key to survival is whether they can find a suitable implementation direction and achieve rapid growth.
Taking our company as an example, our first productization direction is the embodied brain. Through thought experiments, we estimate that the amount of data a person collects in a lifetime is about 3 million one-minute video clips, which is equivalent to the experience accumulated before the age of 18.
If we assume that it takes one year to master a job skillfully, then the data volume is about 300 million clips. We use the accumulation of this data to estimate the maximum intelligent upper limit of the human world model.
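The order of magnitude of the 3 million figure can be sanity-checked with simple arithmetic. The sketch below assumes about 8 hours of usable visual experience per day. That value is an assumption, not something stated in the discussion, and `one_minute_clips` is a hypothetical helper name.

```python
MINUTES_PER_HOUR = 60
DAYS_PER_YEAR = 365


def one_minute_clips(years, hours_per_day):
    """Count one-minute clips observed over `years` at `hours_per_day` (assumed rate)."""
    return int(years * DAYS_PER_YEAR * hours_per_day * MINUTES_PER_HOUR)


clips_by_18 = one_minute_clips(18, 8)    # 18 * 365 * 8 * 60 = 3,153,600
clips_one_year = one_minute_clips(1, 8)  # 365 * 8 * 60 = 175,200

print(f"{clips_by_18:,} clips by age 18")
print(f"{clips_one_year:,} clips in one year")
```

Under this assumption, 18 years of experience comes to roughly 3 million clips, which matches the first estimate. A single person-year at the same rate is on the order of 10^5 clips; how the 300 million figure is derived is not spelled out in the discussion.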
If we can build a world model with a data volume of billions and pre-train it to perform well in zero-shot and few-shot scenarios, then the commercial value of this world model will be very high.
Therefore, the key in the future is how to collect enough high-quality data, conduct good pre-training, and finally have strong generalization ability in practical application scenarios.
Song Yachen: In fact, we need to think about a core question: Why are people discussing the world model now? Why are startups, capital, and top talents flocking to this field? Is it because AI has developed to a certain stage and the world model has emerged? Or is it because the technology of embodied intelligence has matured?
I think these two factors are not the fundamental reasons for the rise of the world model.
The most fundamental reason for the emergence of the world model lies in the change of information carriers. The evolution of information carriers has been a process of continuous dimensional upgrading, from text to pictures, then to videos, and now to the 3D world. As information density and experience quality improve, we are welcoming the 3D world and the world itself as new information carriers.
In the past, text, pictures, and videos were the mainstream forms of information expression. But now, with the progress of AI technology and hardware infrastructure, the 3D world and higher-dimensional worlds have become the ultimate carriers for us to express and transmit information.
For thousands of years, text has been a tool to express the world. But with the development of information technology, the 3D world and the world itself are just beginning to become mainstream forms of expression.
We are about to enter a new era where AI can help us directly process and understand the 3D world and create richer interactive experiences.
This is actually an improvement in information utilization efficiency. The higher the information density, the higher the communication efficiency.
When we could only engrave words on turtle shells in ancient times, the information communication efficiency was very low. But with technological progress, the emergence of the Internet, pictures, and videos has gradually improved the communication efficiency. And the 3D world and the world itself will ultimately become the main carriers for our information transmission and creation.
Jiang Yizhou: I have a different understanding of the world model.
We are working on brain-like intelligence, which is a non-end-to-end design. Initially, we were working on brain-like robots, especially in national projects. We believe that the world model is not limited to vision or a single input mode.
Taking a blind person operating an object as an example, even though he cannot perceive the world through vision, he can still master the characteristics of the object through other senses and infer the possible consequences of his actions.
This understanding of causal relationships is what we consider the most core part.
Through the brain-like model, our advantage is that we do not require as much data. Traditional reinforcement learning requires a large amount of data, while our non-end-to-end method can effectively reduce data requirements by understanding the causal relationships of the world.
We believe that the world model is not limited to the natural world but also applies to the world constructed by humans. The large language model (LLM) is a typical example. Language, as an abstract tool for humans to understand the world, can help us understand and express most things.
Through the understanding of these abstractions, machines can also build a logical world model.
This article does not constitute any investment advice.