
With 5 million views, 1X has truly applied the "world model" to its robot NEO.

机器之心 (Synced) · 2026-01-14 10:16
As long as it can save me from housework, it is my god.

Remember the home humanoid robot NEO that wears "Lululemon" leggings and offers gentle companionship?

Last time we talked about it, people were complaining about the privacy and security issues of its "remote control" and joking that there might be an "Indian guy" behind each robot.

Yesterday, 1X unveiled its brand-new "brain": the 1X World Model. This time, NEO looks ready to do without the "operator behind the scenes".

To put it simply, NEO no longer just memorizes actions by rote. It has learned to "imagine" the way a human does. By watching vast amounts of online video and real-world manipulation footage shot from a human first-person perspective, it has come to understand how the physical world works: things fall when dropped, and doors can be pushed open.

They built a video-generation technology similar to Sora into NEO's "brain". When NEO receives an instruction, it first generates a video in its "mind" of successfully completing the task, then works out how its body should move to turn that imagination into reality.

However, the official blog also admits that sometimes "the mind gets it, but the hands don't": the video imagined in its head is perfect, yet the actual movement may still miss the target.

So, is this the real deal under the "yoga suit", or just editing magic that only exists in the demo? Whether or not the technology actually delivers, the hype has already hit an all-time high: at the time of writing, the official tweet had been viewed over 5 million times.

It seems that after being bombarded by all kinds of cool demos in the AI era, people still can't help but wonder: Has it really developed a "brain" this time?

Here is a hardcore breakdown of this "new brain" by the 1X technology team:

For household robots to truly operate in real-world environments, they must have commonsense behavioral capabilities and a deep understanding of the physical world.

Currently, many robot foundation models adopt the VLA paradigm: an output head for predicting robot actions is added on top of a pre-trained VLM (e.g., PI0.6, Helix, GR00T N1.5). A VLM can learn a wealth of knowledge from Internet-scale data, but its training objective is focused on visual and semantic understanding rather than on predicting physical dynamics.
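For readers who want a concrete picture of what "adding an action head on top of a pre-trained VLM" means, here is a minimal PyTorch sketch. The module name, dimensions, and horizon are illustrative assumptions, not the actual architecture of any of the models cited above:

```python
import torch
import torch.nn as nn

class ActionHead(nn.Module):
    """Maps a pooled VLM feature vector to a short chunk of robot actions."""
    def __init__(self, feat_dim: int, action_dim: int, horizon: int):
        super().__init__()
        self.horizon, self.action_dim = horizon, action_dim
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim, 512), nn.GELU(),
            nn.Linear(512, horizon * action_dim),
        )

    def forward(self, vlm_features: torch.Tensor) -> torch.Tensor:
        # vlm_features: (batch, feat_dim) pooled image+text embedding from the VLM
        out = self.mlp(vlm_features)
        return out.view(-1, self.horizon, self.action_dim)

# Stand-in for VLM output features; only the action head is trained on robot data.
feat = torch.randn(2, 1024)
head = ActionHead(feat_dim=1024, action_dim=23, horizon=16)
print(head(feat).shape)  # torch.Size([2, 16, 23]) -- a chunk of joint-space actions
```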

Therefore, even for tasks that are very simple for humans, the model often needs tens of thousands of hours of costly robot data to learn to complete them. In addition, to further strengthen the model's understanding of spatial relationships in physical interactions, researchers usually need to introduce various auxiliary training objectives (such as MolmoAct, Gemini Robotics 1.5).

In this blog, 1X introduced a video-pre-trained world model, 1XWM, and integrated it into the NEO robot as its control policy.

Unlike VLA models, which directly predict action trajectories from static image-language inputs, the world-model-driven policy infers the actions the robot should take through text-conditioned video generation. By leveraging the real-world dynamics contained in Internet-scale video, the world model can generalize to new objects, motion patterns, and task scenarios without large-scale robot-data pre-training and without relying on any relevant teleoperation demonstrations.

This marks a paradigm shift in robot intelligence: robots are beginning to benefit directly from the capability leap brought about by large-scale video pre-training, and all of this is enabled by a complete hardware system designed for high-fidelity transfer from the human embodiment to the robot embodiment.

From Video Knowledge to World Model

Today's cutting-edge text-to-video models such as Veo and Sora can generate extremely realistic video content. However, these models are not aligned with the robot's embodiment in zero-shot generation, so they often fall short on several key dimensions required for control:

Visual/Spatial Level: Is the generated video consistent with the robot's camera intrinsics and first-person perspective? Can it accurately retain the depth information and precise spatial relationships required for manipulation tasks?

Kinematics Level: Are the robot's motions in the generated video achievable by its embodiment? Do they respect its structural characteristics, joint limits, speed constraints, and actuator capabilities? (A toy feasibility check along these lines is sketched after this list.)

Physical Level: Does the generation avoid physically impossible outcomes (such as objects teleporting), so that the result can actually be executed in the real world?
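To make the kinematics-level check concrete, here is a toy feasibility test on a decoded joint trajectory. The joint limits, velocity bounds, and control rate are placeholder values for illustration only, not NEO's actual specifications:

```python
import numpy as np

def kinematically_feasible(traj: np.ndarray, joint_low: np.ndarray,
                           joint_high: np.ndarray, max_vel: np.ndarray,
                           dt: float) -> bool:
    """traj: (T, num_joints) joint positions decoded from a generated video.
    Passes only if every frame respects the joint limits and the implied
    finite-difference velocities stay within the actuator limits."""
    within_limits = np.all((traj >= joint_low) & (traj <= joint_high))
    vel = np.abs(np.diff(traj, axis=0)) / dt
    return bool(within_limits and np.all(vel <= max_vel))

# Placeholder example: 5 joints sampled at 10 Hz with made-up limits.
T, J = 48, 5
traj = np.cumsum(0.01 * np.random.randn(T, J), axis=0)
print(kinematically_feasible(traj, joint_low=-2.0 * np.ones(J),
                             joint_high=2.0 * np.ones(J),
                             max_vel=1.5 * np.ones(J), dt=0.1))
```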

Raw video can show what might happen, but it does not say how to act. To turn video knowledge into a world model that can actually be used for control, 1X adopted a two-stage alignment process on top of its own end-to-end system architecture, in line with prior work such as DreamGen and UniPi:

World Model Backbone: a text-conditioned diffusion model. It is first pre-trained on Internet-scale video data, then mid-trained on human first-person video, and finally fine-tuned on NEO's own sensorimotor logs. The model predicts how a scene evolves over time with high fidelity and performs well on visual, spatial, and physical consistency.

Inverse Dynamics Model (IDM): the IDM connects pixel space to actuator control, predicting the exact action sequence required to realize the state transitions between generated frames. In addition, the IDM's evaluation metrics and a rejection-sampling mechanism impose kinematic constraints on the generated results, ensuring the actions are feasible for the embodiment.

During the inference phase, the system receives a text instruction and an initial frame: the world model is responsible for generating the evolution of the future scene that matches the intention, the inverse dynamics model extracts the required action trajectory from it, and finally the robot executes this action sequence in the real world.
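As a rough illustration of this loop, the sketch below wires together stand-in components for the world model, the IDM, and the robot. All class and method names here are hypothetical stubs, and the rejection step simply keeps the candidate the IDM scores best, following the description above:

```python
from dataclasses import dataclass
import numpy as np

# Random stubs standing in for the three real components: the diffusion world
# model, the inverse dynamics model, and the robot's control stack.
class StubWorldModel:
    def generate(self, instruction, first_frame):
        return np.random.rand(48, 64, 64, 3)           # fake future frames

class StubIDM:
    def infer_actions(self, video):
        return 0.01 * np.random.randn(len(video), 23)  # fake joint-space actions
    def score(self, video, actions):
        return -float(np.abs(actions).max())           # prefer small, smooth motion

class StubRobot:
    def execute(self, actions):
        print(f"executing {actions.shape[0]} control steps")

@dataclass
class Candidate:
    actions: np.ndarray
    score: float

def plan_and_act(world_model, idm, robot, instruction, first_frame, num_samples=4):
    """Sample several candidate futures, keep the one the IDM scores as most
    feasible (rejection sampling), and execute its action sequence."""
    candidates = []
    for _ in range(num_samples):
        video = world_model.generate(instruction, first_frame)
        actions = idm.infer_actions(video)
        candidates.append(Candidate(actions, idm.score(video, actions)))
    best = max(candidates, key=lambda c: c.score)
    robot.execute(best.actions)
    return best

plan_and_act(StubWorldModel(), StubIDM(), StubRobot(),
             "pick up the cup", first_frame=np.zeros((64, 64, 3)))
```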

Training and Inference Process of 1XWM

The backbone of 1XWM is based on a 14-billion-parameter generative video model. To adapt this model to NEO's embodiment, 1X adopted a multi-stage training strategy:

First-Person Mid-Training: train on 900 hours of human first-person video to align the model with egocentric manipulation tasks. At this stage the model learns general manipulation behavior patterns, but it still struggles to generate videos of NEO performing specific tasks.

Embodiment Fine-Tuning: then fine-tune on 70 hours of robot data to further adapt the model to NEO's visual appearance and kinematic characteristics. (A schematic of this staged schedule is sketched below.)
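A purely schematic view of the staged adaptation, with illustrative dataset identifiers, step counts, and learning rates standing in for whatever 1X actually uses:

```python
# Illustrative stage list: dataset names, step counts, and learning rates are
# placeholders, not 1X's actual values. The backbone arrives at this point
# already pre-trained on Internet-scale video.
STAGES = [
    {"name": "first_person_mid_training",
     "data": "human_egocentric_900h", "steps": 200_000, "lr": 1e-4},
    {"name": "embodiment_fine_tuning",
     "data": "neo_robot_logs_70h", "steps": 50_000, "lr": 3e-5},
]

def run_stages(model, make_loader, train):
    """Run each adaptation stage in order on the same video-model backbone."""
    for stage in STAGES:
        loader = make_loader(stage["data"])
        train(model, loader, steps=stage["steps"], lr=stage["lr"])
        print(f"finished {stage['name']}")
```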

Prior work such as DALL·E 3 has shown that training with more descriptive visual-text annotations can significantly improve a visual foundation model's ability to follow prompts. However, many first-person datasets contain only brief task descriptions, so 1X used a VLM to generate more detailed descriptive captions and trained on them via caption upsampling.
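In sketch form, caption upsampling can be as simple as mixing the original short label with a richer VLM-written description during training. The `vlm_describe` call and the mixing ratio below are placeholder assumptions:

```python
import random

def vlm_describe(clip) -> str:
    """Placeholder for a VLM call that writes a detailed description of a clip."""
    return "a right hand reaches across the table, grips the mug handle, and lifts it"

def upsampled_caption(clip, short_label: str, p_detailed: float = 0.8) -> str:
    """Mix brief task labels with richer VLM-generated captions during training,
    so the video model learns to follow more descriptive prompts."""
    return vlm_describe(clip) if random.random() < p_detailed else short_label

print(upsampled_caption(clip=None, short_label="pick up mug"))
```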

In addition, the IDM is trained on 400 hours of unfiltered robot data, which includes both random exploration data and motion trajectories unrelated to any specific tasks. This enables the model to accurately track NEO's movements in any state.
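For intuition, an inverse dynamics model can be as small as a network that looks at features of two consecutive frames and predicts the action in between; the dimensions below are illustrative, not 1X's actual IDM:

```python
import torch
import torch.nn as nn

class TinyIDM(nn.Module):
    """Toy inverse dynamics model: given features of two consecutive frames,
    predict the action that produced the transition. Dimensions are illustrative."""
    def __init__(self, frame_feat_dim: int = 256, action_dim: int = 23):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * frame_feat_dim, 512), nn.ReLU(),
            nn.Linear(512, action_dim),
        )

    def forward(self, feat_t: torch.Tensor, feat_t1: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([feat_t, feat_t1], dim=-1))

idm = TinyIDM()
print(idm(torch.randn(4, 256), torch.randn(4, 256)).shape)  # torch.Size([4, 23])
```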

At test time, the system receives an initial frame and a text instruction telling NEO what to do. 1XWM generates the future video sequence, and the IDM then extracts the corresponding robot action trajectory from the generated video and sends it directly to the robot for execution. To keep the trajectory smooth, the IDM's output is time-averaged over multiple initial noise samples and over sliding windows.
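A sketch of that smoothing step, assuming the IDM has already been run on several videos generated from different initial noise (the window length is a placeholder):

```python
import numpy as np

def ensemble_actions(action_samples: np.ndarray, window: int = 5) -> np.ndarray:
    """action_samples: (num_noise_samples, T, action_dim) IDM outputs obtained
    from videos generated with different initial noise. Average across samples,
    then smooth each joint with a sliding window to avoid jerky execution."""
    mean_traj = action_samples.mean(axis=0)            # (T, action_dim)
    kernel = np.ones(window) / window
    return np.stack([np.convolve(mean_traj[:, j], kernel, mode="same")
                     for j in range(mean_traj.shape[1])], axis=1)

samples = 0.01 * np.random.randn(4, 48, 23)            # 4 noise samples, 48 steps
print(ensemble_actions(samples).shape)                 # (48, 23)
```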

NEO's post-training dataset consists mainly of high-quality pick-and-place data (98.5%), filtered to include only tabletop manipulation scenes in which the hands are visible. By leveraging the Internet-scale pre-training of the base video model, 1XWM can generalize to a wide range of unseen objects, environments, and tasks.

What Can 1XWM Do?

The research team further evaluated 1XWM's task generalization, focusing on whether it can complete tasks NEO has never encountered before and on how consistent the generated video is with the robot's real execution.

In the experiment, NEO equipped with 1XWM was used to perform a variety of tasks beyond its existing experience, including:

Grasping in-distribution and out-of-distribution objects;

Manipulating previously unseen objects with complex affordances;

Completing new tasks that require new action patterns.

The experimental results show that the videos generated by 1XWM are, on the whole, highly consistent with the real-world execution process. Comparing the model-generated video side by side with footage of the robot actually completing the task, the two are visually very similar, which indicates that 1XWM is strong at understanding spatial structure, modeling kinematic constraints, and maintaining physical consistency.

Grasping:

New Action: Cleaning

Next, 1X attempted tasks that require bimanual coordination and human-robot interaction. These abilities are not included in the robot training dataset, which indicates that the knowledge comes from video pre-training and from first-person human interaction footage. Since NEO's body structure is very similar to a human's, skills learned from human video data can be transferred and applied directly.

The research team also evaluated the performance of 1XWM on in-distribution (ID) and out-of-distribution (OOD) tasks through systematic physical experiments, with each task repeated 30 times. The results show that 1XWM maintains a stable success rate across action primitives, but tasks that require high-precision manipulation (such as pouring liquids or drawing) still pose challenges.

Can Video Quality Be Linked to Task Success Rate?

If so, visual metrics could be used to measure and improve video quality and to estimate the likelihood of real-world task success.

Sometimes it is obvious whether a generated video will succeed. For example, given an instruction to pull out a tissue, 1XWM sometimes generates a video of NEO picking up the tissue box instead of pulling a tissue from it. When these wrongly generated videos are executed, the success rate is close to 0%.

The 1X team noted that test-time compute methods can improve task success rates. Inspired by this, they generate multiple videos in parallel and execute the one with the best quality. This selection can be done manually, but it can also be automated with a VLM evaluator.
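A best-of-N selection step of that kind can be sketched as follows; the `vlm_score` judge here is a placeholder for prompting an actual VLM with the instruction and each candidate clip:

```python
def select_best_rollout(videos, vlm_score):
    """Best-of-N at test time: score each generated video with a judge and
    return the index of the highest-scoring candidate for execution."""
    scores = [vlm_score(v) for v in videos]
    best_idx = max(range(len(videos)), key=lambda i: scores[i])
    return best_idx, scores

# Placeholder judge: in practice this would prompt a VLM with the instruction
# and the candidate clip, then parse a numeric rating from its answer.
videos = ["clip_a", "clip_b", "clip_c"]
print(select_best_rollout(videos, vlm_score=lambda v: len(v)))
```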

The Importance of First-Person Data and High-Quality Captions

Building on the hypothesis above that generated-video quality correlates with task success rate, the research team conducted a visual ablation analysis on several