Meta released a 40-page report arguing that the next step for embodied intelligence is the "mental world model": an agent that can listen, see, understand, and empathize.
Recently, Meta has been very active. On one hand, CEO Zuckerberg has personally stepped in, spending $100 million to recruit talent.
On the other hand, Meta's in-house AI research has also produced something significant: a 40-page report.
Besides LeCun's oft-mentioned world model, the most eye-catching part is this:
For the first time, this report places the inference of human mental states on an equal footing with the physical world model, conceptualizing it as the "mental world model".
Compared with traditional world models (such as LeCun's JEPA) that focus only on physical laws (object motion, mechanical causality), the mental world model for the first time incorporates psychological laws (intentions, emotions, social relationships) into the world model framework, achieving "dual-track modeling".
It has to be said that Meta is still far ahead!
From the Physical World Model to the Mental World Model
As is well known, under LeCun's leadership, Meta is quite critical of large models, and this report is no exception:
Although large models are powerful, they are too bulky, inefficient, and lack abstract reasoning ability.
Just like when we come home and open the door, we don't mentally predict every pixel of the door in the next second. Instead, we focus on the state of the door (open or closed) and the position of the keyhole, then take the corresponding action, such as finding the key to complete the task of entering the house.
Therefore, to construct an embodied intelligent agent like a human, the world model needs to abstract useful information from perception to understand the environment, then conduct reasoning, planning, and take actions.
So, the question is, what kind of information is considered useful?
Here, the report divides the information required by the world model into two categories. One category is the information required by the physical world model, which includes:
Objects and their attributes (e.g., shape, size, color)
Spatial relationships between objects (e.g., proximity, distance)
Dynamic changes in the environment (e.g., motion, changes over time)
Causal relationships between actions and results based on physical laws
The other category is the information required by the mental world model, including:
Goals and intentions (including their motives, preferences, and values)
The user's emotional and affective states, and understanding how these emotions affect behavior
Capturing social dynamics, including relationships between individuals, groups, and institutions, as well as cultural norms, customs, and expectations
Understanding verbal and non-verbal communication, including language, intonation, body language, and facial expressions
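The two categories above can be pictured as two halves of a single agent state. The following sketch is purely illustrative: the class and field names are made up for this article, not taken from Meta's report.

```python
from dataclasses import dataclass

# Hypothetical sketch of the two state categories the report describes.
# All names here are illustrative, not from the report.

@dataclass
class PhysicalState:
    objects: dict            # object id -> attributes (shape, size, color)
    spatial_relations: list  # e.g. ("pen", "near_edge_of", "table")
    dynamics: dict           # object id -> motion / change over time

@dataclass
class MentalState:
    goals: list              # inferred goals and intentions of the user
    emotions: dict           # e.g. {"frustration": 0.8}
    social_context: dict     # relationships, norms, expectations
    signals: list            # verbal and non-verbal cues observed

@dataclass
class WorldModelState:
    physical: PhysicalState  # what the world is doing
    mental: MentalState      # what the people in it are thinking and feeling
```

The point of the split is that both halves are updated from the same perceptual stream, but they support different kinds of prediction: the physical half predicts trajectories, the mental half predicts behavior.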
We are all familiar with the role of the physical world model. For example, knowing Newton's laws, an embodied intelligent agent can predict the motion of objects in the future environment.
For instance, if a pen rolls off the edge of a table, it will undergo free fall, and the intelligent agent needs to catch the pen before it hits the ground.
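As a toy illustration of this kind of physical prediction (mine, not the report's), the agent can estimate how much reaction time it has from the free-fall formula t = sqrt(2h/g):

```python
import math

G = 9.81  # gravitational acceleration, m/s^2 (ignoring air resistance)

def time_to_impact(height_m: float) -> float:
    """Time for an object dropped from rest to fall height_m metres."""
    return math.sqrt(2.0 * height_m / G)

# A pen falling from a typical 0.75 m table gives the agent roughly 0.39 s to react.
```

A real physical world model would of course be learned rather than hand-coded, but the role is the same: compress the scene into the few variables that matter for acting in time.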
Then why do we still need the mental world model?
For humans, the mental world model is the process of mental representation of the world, including the representation of objects, events, and relationships.
It enables humans to simulate situations, predict results, conduct counterfactual and causal reasoning, and thus make more informed decisions.
For example, suppose Xiaoming receives a burnt hamburger at a hamburger shop, then leaves the shop angrily without paying.
Then, according to the mental world model, we can reasonably infer that Xiaoming didn't eat the hamburger.
Therefore, in order to better assist and cooperate with humans, intelligent agents must learn human mental states and understand human behavior patterns and cultural practices.
To achieve this, the mental world model is needed to represent the mental states of human users or other AI intelligent agents.
By representing and understanding these mental states, embodied intelligent agents can:
Predict the user's goals and intentions, enabling the agent to proactively provide help or guidance toward those goals
Infer belief differences and predict how people holding false beliefs will act
Predict emotional responses and adjust strategies to better meet the user's needs
This will greatly improve the efficiency and comfort of human-machine interaction and multi-agent interaction.
Then how do I know that this thing won't make wild guesses, cause trouble, or do more harm than good?
In response, Meta has designed a series of benchmarks to test the performance of embodied intelligent agents.
Unfortunately, taking goal inference as an example, on the Egocentric Multi-modal Goal Inference Benchmark, vision-language models achieve a success rate of only 55%, far below the level required for practical use.
Yes, there is still a long way to go.
The Future of the World Model
Although the current performance is "dismal", the physical (and mental) world model remains a promising direction.
To achieve this, Meta points out in the report:
To enable AI to have true autonomous learning ability, it is necessary to combine System A's Learning by Observation and System B's Learning by Action.
System A learns abstract representations from a large amount of perceptual data (e.g., self-supervised or unsupervised learning).
Its advantage is that it can efficiently learn general and abstract representations, which are helpful for subsequent tasks.
However, it requires a large amount of clean data, has no sense of what is worth learning, and struggles to connect what it learns to actual actions: it often stops at "understanding" without becoming "usable".
System B learns how to do things through exploration and trial and error, as in reinforcement learning.
Its advantage is that it is directly related to actual behavior, can adapt to dynamic environments, and may discover new methods.
However, it is very inefficient: it needs a huge number of trials to learn even simple tasks, easily gets stuck in complex situations, and depends heavily on clear reward signals, which reality rarely provides ready-made.
In simple terms, System A is good at extracting knowledge from big data but can't "take action"; System B is good at exploration and action but has low learning efficiency.
By effectively integrating the two, System A provides abstract structures, priors, and compressed representations that help System B plan efficiently, while System B collects better data through active exploration and provides practical verification for System A.
Perception drives action, and action in turn enriches perception, letting the AI system improve autonomously.
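The division of labor above can be sketched in a toy loop. This is my illustration under stated assumptions, not Meta's design: "System A" is reduced to a hand-written encoder that compresses a noisy observation into an abstract state (a real system would learn this self-supervised), and "System B" is a tiny bandit-style value learner over those abstract states.

```python
import random

random.seed(0)

# --- System A: map raw observations to an abstract state. ---
# Illustrative stand-in for learned representation: only the first sensor
# reading matters (door open vs closed); the rest is noise to be abstracted away.
def encode(raw_obs):
    return "open" if raw_obs[0] > 0.5 else "closed"

# --- System B: trial-and-error value learning over System A's states. ---
ACTIONS = ["walk_in", "use_key"]
q = {}  # (abstract_state, action) -> estimated value

def choose(state, eps=0.1):
    """Epsilon-greedy action selection."""
    if random.random() < eps:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: q.get((state, a), 0.0))

def step(state, action):
    """Toy environment: the correct action depends only on the abstract state."""
    correct = "walk_in" if state == "open" else "use_key"
    return 1.0 if action == correct else 0.0

for _ in range(2000):
    raw = [random.random(), random.random(), random.random()]  # noisy sensors
    s = encode(raw)   # System A abstracts perception
    a = choose(s)     # System B acts on the abstraction
    r = step(s, a)    # environment feedback verifies the abstraction in practice
    key = (s, a)
    q[key] = q.get(key, 0.0) + 0.1 * (r - q.get(key, 0.0))  # incremental update
```

Because System B plans over two abstract states instead of a continuous sensor space, it converges quickly; that compression is exactly what System A contributes, and the reward feedback is the "practical verification" flowing back.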
One More Thing
Although the mental world model's current performance is still immature, its potential in multi - agent collaboration should not be underestimated.
It provides a theoretical basis for establishing a "shared mental model" among multiple agents:
It enables each intelligent agent not only to see the external world but also to infer others' beliefs and intentions, forming a higher-order understanding than perception alone.
When different embodied intelligent agents perform tasks together, the mental model can help them align goals, coordinate actions, and even find a balance in conflicts in an uncertain environment.
This is also an important step for human-machine interaction to move from mechanical execution to being empathetic and context-aware.
In this sense, the mental world model may not be an easy path, but it opens the door for embodied intelligence to enter a more complex social form.
Report link: https://arxiv.org/abs/2506.22355
This article is from the WeChat official account "QbitAI", author: henry, published by 36Kr with permission.