New research from Alibaba: Unifying VLAs and world models
If vision enables AI to see the world and actions allow AI to change the world, then WorldVLA is what enables AI to understand the world.
As the name suggests, WorldVLA is a unified framework that integrates the Vision-Language-Action (VLA) model with the world model. It was jointly proposed by Alibaba DAMO Academy, Hupan Laboratory, and Zhejiang University.
Under this framework:
The world model predicts future images by jointly interpreting actions and images, aiming to learn the underlying physical laws of the environment and thereby improve the accuracy of action generation;
The action model generates subsequent actions based on image observations; it not only contributes to visual understanding but in turn strengthens the visual generation ability of the world model.
Experimental results show that the performance of WorldVLA is significantly better than that of independent action models and world models, fully demonstrating the mutual enhancement effect between the two.
Let's take a closer look below.
Unifying VLA and the World Model
Today, although VLA and the world model are advancing separately, their functional limitations have become the key bottlenecks restricting development:
The VLA model: Built on a pre-trained multi-modal large language model (MLLM), it can generalize across robot tasks. However, it treats actions only as outputs and does not integrate them as inputs for analysis, so it lacks a comprehensive understanding of actions.
The world model: It can predict future visual states from current observations and actions, and it understands both visual information and behavioral dynamics. However, it cannot directly generate actions, which limits its use in scenarios that require explicit action planning.
To solve these problems, the research team proposed WorldVLA, an autoregressive action world model that unifies action and image understanding and generation.
The team initialized WorldVLA from the Chameleon model and equipped it with three independent tokenizers to encode images, text, and actions.
The image tokenizer uses the VQ-GAN model (an image generation model that combines vector quantization and generative adversarial networks) and introduces perceptual loss optimization for specific image regions (such as faces, prominent objects, etc.).
It is worth mentioning that this tokenizer downsamples images spatially by a factor of 16 and has a codebook size of 8192. A 256×256 image therefore yields 256 tokens, and a 512×512 image yields 1024 tokens.
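As a quick sanity check on those numbers (a minimal sketch, not code from the paper), the token count follows directly from the 16× spatial downsampling:

```python
# Sketch: token count implied by a VQ-GAN tokenizer with 16x spatial downsampling.
# The factor 16 and codebook size 8192 come from the article; this helper is
# illustrative, not the authors' implementation.

def num_image_tokens(height: int, width: int, downsample: int = 16) -> int:
    """Each downsample x downsample patch maps to one codebook index (token)."""
    return (height // downsample) * (width // downsample)

print(num_image_tokens(256, 256))  # 256 tokens
print(num_image_tokens(512, 512))  # 1024 tokens
```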
The action tokenizer discretizes each dimension of a continuous robot action into 256 bins, with bin widths determined by the range of the training data. Each action is represented by 7 tokens: 3 relative positions, 3 relative angles, and 1 absolute gripper state.
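A minimal sketch of that per-dimension binning, assuming uniform bins over each dimension's training-data range; the function, the example ranges, and the variable names are illustrative and not taken from the paper's code:

```python
import numpy as np

def discretize_action(action, low, high, n_bins=256):
    """Map a 7-D continuous action to 7 integer tokens in [0, n_bins - 1].

    low/high hold the per-dimension minimum and maximum observed in the
    training data (uniform binning is assumed for this sketch)."""
    action = np.asarray(action, dtype=np.float64)
    normalized = (action - low) / (high - low)           # scale each dim to [0, 1]
    return np.clip((normalized * n_bins).astype(int), 0, n_bins - 1)

# Example with made-up ranges: 3 relative positions, 3 relative angles, 1 gripper state.
low  = np.array([-0.05] * 3 + [-0.5] * 3 + [0.0])
high = np.array([ 0.05] * 3 + [ 0.5] * 3 + [1.0])
print(discretize_action([0.01, -0.02, 0.0, 0.1, 0.0, -0.3, 1.0], low, high))
```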
The text tokenizer uses a trained BPE tokenizer with a vocabulary size of 65536, including 8192 image tokens and 256 action tokens.
All text, actions, and images are thus discretized into tokens, and the model is trained on them in an autoregressive manner.
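To make the unified formulation concrete, here is a rough sketch of how the two kinds of training samples could be flattened into a single token stream; the helper names and the exact ordering of segments are assumptions for illustration, not the paper's actual data format:

```python
# Illustrative only: both data types become one autoregressive token sequence
# over the shared vocabulary of text, image, and action tokens.

def action_model_sample(text_tokens, image_tokens, action_tokens):
    """Action-model data: predict actions from a text instruction and image observations."""
    return text_tokens + image_tokens + action_tokens

def world_model_sample(image_tokens, action_tokens, next_image_tokens):
    """World-model data: predict the next frame from the current frame and the action."""
    return image_tokens + action_tokens + next_image_tokens
```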
The standard attention mechanism in an autoregressive model uses a causal attention mask: the current token can only attend to previous tokens and cannot see subsequent ones, as shown in Figure (a) below.
However, this conventional setup has a clear drawback when generating action chunks (i.e., multiple consecutive actions): under the default mask, errors from early actions propagate to subsequent actions, degrading performance.
To solve this problem, the team proposed an alternative attention mask for action generation, as shown in Figure (b) above. This mask ensures that the current action is generated conditioned only on the text and visual inputs, shielding it from the influence of previous actions.
This design enables the autoregressive framework to generate multiple actions in parallel, while the world model part still follows the traditional causal attention mask, as shown in Figure (c) above.
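A minimal PyTorch-style sketch of that masking scheme, assuming the sequence is laid out as [text and image context tokens | action-chunk tokens]; the function names and the token-level treatment of actions are simplifications, not the released implementation:

```python
import torch

def action_chunk_mask(n_ctx: int, n_action: int) -> torch.Tensor:
    """Mask in the spirit of Figure (b): context tokens (text + image) attend
    causally to each other, while each action token sees only the context and
    none of the other action tokens, so a chunk can be generated in parallel.
    Returns a boolean matrix where True means attention is allowed."""
    n = n_ctx + n_action
    mask = torch.zeros(n, n, dtype=torch.bool)
    mask[:n_ctx, :n_ctx] = torch.tril(torch.ones(n_ctx, n_ctx, dtype=torch.bool))
    mask[n_ctx:, :n_ctx] = True   # actions attend to context only
    return mask

def causal_mask(n: int) -> torch.Tensor:
    """Standard causal mask, as kept for the world-model part (Figure (c))."""
    return torch.tril(torch.ones(n, n, dtype=torch.bool))

print(action_chunk_mask(n_ctx=4, n_action=3).int())
```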
After that, the team jointly trained WorldVLA by fusing action model data and world model data.
The introduction of world model data to enhance action generation ability rests mainly on three considerations:
1. Understanding of environmental physics: The world model can predict future observations through the current state and executed actions, thereby learning the physical laws in the environment. This kind of cognition is particularly important for manipulation tasks.
2. Action evaluation and risk avoidance: The world model can simulate and evaluate the potential results of candidate actions, which helps to avoid actions that may lead to adverse states.
3. Precise action analysis: The world model needs to accurately interpret action inputs, which in turn supports the action model in generating more effective and context-appropriate actions.
In addition, the action model can also enhance visual understanding ability, thereby further supporting the visual generation of the world model.
The Action Model and the World Model Help Each Other
Benchmark Test Results
As can be seen from the table below, even without pre-training, the WorldVLA model outperforms the discretized OpenVLA model, which demonstrates the effectiveness of its architectural design.
In addition, the model performance is positively correlated with the image resolution. Specifically, a 512×512 pixel resolution brings a significant improvement compared with a 256×256 pixel resolution.
This phenomenon is mainly attributed to the pre-training strategy of the Chameleon backbone model: its image tokenizer and large language model components were optimized at a 512×512 resolution.
At the same time, a higher resolution naturally provides more visual detail information, which is particularly important for robot grasping tasks that require high operation accuracy.
The World Model Helps the Action Model
In addition, the research also shows that the introduction of the world model can significantly improve the performance of the action model.
The core function of the world model is to predict how the environment state changes given the current state and the executed actions. This generative mechanism pushes the model to learn the underlying physical laws of the system, and mastering these laws is a key prerequisite for fine-grained manipulation tasks such as grasping.
Looking deeper, the world model endows the system with the ability of forward-looking deduction: by predicting the likely consequences of candidate actions, it provides key information for the decision-making process, thereby optimizing the action selection strategy and improving the task success rate.
The following comparison case shows this advantage intuitively: the baseline action model moves directly to the target position but fails to grasp the cheese or the bottle, whereas WorldVLA keeps attempting the grasp until it confirms success and only then moves to the target position.
The Action Model Helps the World Model
In terms of generation quality, WorldVLA is significantly better than the pure world model, especially when generating longer video sequences.
In addition, the pure world model shows obvious deficiencies in several scenarios: it fails to open the drawer (a), the bowl disappears after the plate is moved (b), and it fails to place the bowl stably on the stove (c). In these same scenarios, the action world model generates subsequent states that are coherent and consistent with physical laws.
Introduction to the Core Authors
The first author of the paper is Cen Jun. He joined Alibaba DAMO Academy as an Alibaba Star in August 2024. He earned his bachelor's degree from Zhejiang University and his master's and doctoral degrees from the Hong Kong University of Science and Technology. In 2023, he spent half a year as a visiting researcher at Nanyang Technological University in Singapore, and he has interned at Microsoft Research Asia (MSRA), Shanghai AI Lab, Hikvision, and Alibaba Tongyi Laboratory.
One More Thing
Regarding VLA and the world model, Chen Long, the senior research director and chief scientist of Xiaomi Auto, also publicly expressed his views:
There is no need to choose between VLA and WM; the two can be combined and reinforce each other.
One handles "abstract thinking" and the other handles "physical perception". The combination of VLA + WM is the answer to embodied artificial general intelligence (AGI).
Paper link: https://t.co/ZgHyhqQnyf
Github link: https://t.co/SxDZGuhbL7
Reference link: https://x.com/EmbodiedAIRead/status/1980216687124476256
This article is from the WeChat official account "QbitAI". Author: Shiling. Republished by 36Kr with permission.