World Model == VQA? Robots Don't Need to Imagine Images, Predicting Semantics Is Enough
Is it really necessary for robots to imagine precise future scenarios through world models? In a new paper, researchers from the University of Washington and Sony AI posed this question.
As is well known, a world model is a learning method that enables AI to "imagine the future": it learns how the world works from large amounts of data and then predicts what might happen next given the current state. This ability is crucial, because an AI that can make reasonable predictions about the future can plan smarter, more robust action strategies in advance.
In practice, world models take various forms, ranging from small state-based dynamics models to large action-conditioned video prediction models. Regardless of the form, however, most of them attempt to reconstruct future scenes. Although this approach can often generate realistic images, it may not be well suited to decision-making: no matter how real the images look, they may miss truly crucial semantic details, such as whether two objects are actually in contact.
Some previous methods have tried to model only "task-relevant" information, but they often require additional assumptions, such as knowing the reward function or certain factors of the task in advance, which makes them less flexible in practice.
If pixel information is not necessary for planning, then what is truly needed to make action decisions?
This paper proposes that predicting semantic information about future outcomes is sufficient. World models should no longer focus on predicting raw visual frames but instead capture task-relevant objects and their interactions, for example: "Is the robotic arm closer to the target object?", "Has the red square toppled?", "Has the blue ball been picked up?"
The paper frames this information as a visual question answering (VQA) problem about the future, exploiting the fact that any target outcome can be expressed as a series of "yes/no" questions. In other words, world modeling can be redefined as VQA about future outcomes.
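To make this concrete, a goal under this framing is nothing more than a list of question/expected-answer pairs, and a candidate plan is judged by whether the model's predicted answers match. A minimal sketch (the helper names are illustrative, not from the paper):

```python
# A task goal encoded as yes/no questions about the future outcome,
# each paired with the answer we want to come true.
goal_qa = [
    ("Is the robotic arm closer to the target object?", "yes"),
    ("Has the red square toppled?", "no"),
    ("Has the blue ball been picked up?", "yes"),
]

def goal_satisfied(predicted_answers, goal_qa):
    """A candidate plan achieves the goal when every predicted answer
    matches the expected one."""
    return all(predicted_answers.get(q) == a for q, a in goal_qa)

# Predictions a world model might return for one candidate action sequence:
preds = {
    "Is the robotic arm closer to the target object?": "yes",
    "Has the red square toppled?": "no",
    "Has the blue ball been picked up?": "yes",
}
print(goal_satisfied(preds, goal_qa))  # True
```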
There is already a class of models with a mature toolset for visual question answering: vision-language models (VLMs). For world modeling tasks, VLMs have two major advantages:
First, large-scale pre-training has given them strong visual question-answering capabilities and broad generalization;
Second, they encode prior knowledge about the semantics of tasks and scenes.
These advantages allow cutting-edge VLMs to pose task-relevant questions and give reliable answers for static observations. However, they cannot predict future outcomes, which limits their direct use in decision-making.
To address this, the new paper proposes the "Semantic World Model" (SWM): a world model with generalization ability, instantiated as an action-conditioned vision-language model that can answer questions about the semantic effects of future actions.
Paper title: SEMANTIC WORLD MODELS
Paper link: https://arxiv.org/pdf/2510.19818
Project link: https://weirdlabuw.github.io/swm/
Unlike traditional world models that predict future frames, the SWM, given the current observation (an image) and a sequence of actions, answers natural-language questions about the future.
As shown in Figure 1, the model's inputs are the current observation, a sequence of actions to be executed, and a natural-language question about the future. By understanding the consequences these actions will have in the environment, the model generates the corresponding text answer.
Since the SWM is essentially a task-agnostic world model, it places few demands on data quality: it can be trained on generic sequence data, including play data and suboptimal data. Training data can easily be extracted from any (expert or non-expert) corpus in the format of current observation, actions, (future-related) question, and expected answer.
By using the SWM to reason about future outcomes, AI can perform flexible, open-world, multi-task planning in the action space.
When the task is described in natural language, the system can obtain the goal in two ways: automatically parse the task intent with a pre-trained VLM, or have humans decompose the task into a set of textual "question/expected answer" pairs. Given this Q&A set, the SWM can then plan actions that maximize the likelihood of obtaining the expected answers in the future.
Although many techniques exist for this kind of planning, this study shows it is compatible both with zero-order sampling-based methods and with first-order gradient-based planning methods that optimize the expected-likelihood objective. These planners are computationally feasible, deliver significant test-time improvements over conventional action selection, and scale to multi-step, long-horizon problems.
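As a rough illustration of the zero-order variant, the planner can simply sample many action sequences and keep the one whose expected answers the model scores as most likely. A toy sketch with a 1-D stand-in for the SWM's answer log-likelihood (all names and the scoring function are invented for illustration):

```python
import math
import random

def answer_logprob(obs, actions, question, answer):
    """Stand-in for the SWM: log p(answer | obs, actions, question).
    A real implementation would score the answer tokens with the fine-tuned
    VLM; this toy version just rewards driving a 1-D state toward 0 and
    ignores the question/answer text."""
    final_state = obs + sum(actions)
    return -abs(final_state)

def sample_based_plan(obs, goal_qa, horizon=5, n_samples=256, seed=0):
    """Zero-order (random-shooting) planning: sample action sequences, score
    each by the summed log-likelihood of the expected answers under the
    world model, and keep the best one."""
    rng = random.Random(seed)
    best_score, best_actions = -math.inf, None
    for _ in range(n_samples):
        actions = [rng.uniform(-1.0, 1.0) for _ in range(horizon)]
        score = sum(answer_logprob(obs, actions, q, a) for q, a in goal_qa)
        if score > best_score:
            best_score, best_actions = score, actions
    return best_actions
```

MPPI, as used in the paper, refines this idea by reweighting and resampling around the best sequences rather than taking a single argmax.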
In experiments, the SWM was evaluated on two commonly used multi-task simulation environments, Language Table (LangTable) and OGBench. The results show that the SWM accurately answers questions about future outcomes and generalizes to new scenes. Combined with standard sampling-based planning and gradient-based refinement techniques, it achieves significant policy improvement through test-time optimization, solving a variety of robot tasks.
In summary, the SWM represents a new type of world model that leverages the rich pre-trained knowledge of VLMs to achieve practical, flexible, and scalable robot control.
Overview of the Semantic World Model
Figure 2 below gives an overview of the semantic world model. The SWM is a vision-language model fine-tuned to answer questions about the future, conditioned on a given sequence of actions. Through a set of questions and expected answers, its predictions can be turned into a planning signal, and the action sequence can be iteratively optimized.
Dataset Generation
To train a world model that can answer questions about the future, this paper builds a state-action-question-answer (SAQA) dataset. Figure 3 shows how a single state is paired with multiple questions and answers in this dataset.
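Assuming trajectories of (state, action) pairs and some source of ground-truth answers (e.g. simulator state or an off-the-shelf VLM), SAQA tuples of this shape could be assembled roughly as follows; the helper and the toy labeler are illustrative, not the paper's actual pipeline:

```python
def make_saqa(trajectory, questions, labeler, chunk=3):
    """Build SAQA tuples from one trajectory.
    trajectory: list of (state, action) pairs.
    Yields (state_i, actions_i:j, question_about_state_j, answer)."""
    data = []
    for i in range(len(trajectory) - chunk):
        j = i + chunk
        state_i = trajectory[i][0]
        actions = [a for _, a in trajectory[i:j]]   # action chunk a_{i:j}
        state_j = trajectory[j][0]                  # future state the Q is about
        for q in questions:
            data.append((state_i, actions, q, labeler(state_j, q)))
    return data

# Toy example: 1-D states increasing by 1 per step, one threshold question.
traj = [(s, 1) for s in range(6)]
label = lambda s, q: "yes" if s >= 4 else "no"
dataset = make_saqa(traj, ["Is the state at least 4?"], label)
```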
Architecture Overview
The SWM is a model that answers questions about future events conditioned on actions; such a model is essentially an action-conditioned visual question-answering model. It is therefore natural to start from a large pre-trained vision-language model (VLM) and transfer its generalization ability to robot tasks. The SWM architecture is based on the open-source VLM PaliGemma.
The model consists of three core pre-trained components: a Transformer-based autoregressive language model (with token-embedding size d_tok), a visual encoder v_ϕ (with feature size d_img), and a projection matrix W ∈ R^(d_img × d_tok). The PaliGemma architecture is built on two separately trained components: the Gemma large language model and the SigLIP image encoder. W projects from Z_sc to Z_LLM, where Z_sc is the feature space of v_ϕ and Z_LLM is the input token-embedding space of the language model. This paper uses the 3-billion-parameter checkpoint of PaliGemma as the base model.
To enable the base model to answer questions about a specific future (the one produced by a given action sequence), the model must be conditioned on those actions. To this end, the authors introduce a new action-projection matrix, which, like the projection matrix W, maps a single action to the latent space Z_LLM.
Given a tuple (S_i, a_{i:j}, Q_{S_j}, A_{S_j}) from the dataset D_SAQA, the input sequence is formed by concatenating the image embeddings, action embeddings, and question token embeddings. The model is then fine-tuned end to end by minimizing the standard cross-entropy loss on the target answer A_{S_j}.
This training process enables the model to capture the dynamics of the environment in the language space, thus answering questions about future states without explicitly generating pixel - level representations.
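The shapes involved can be sketched with toy stand-ins: random matrices in place of the pre-trained weights, mean-pooling in place of the transformer. All sizes and names below are illustrative, not the paper's:

```python
import numpy as np

rng = np.random.default_rng(0)
d_tok = 8    # token-embedding size of the language model
vocab = 5    # toy vocabulary

# Toy stand-ins for the pre-trained pieces (shapes only, weights random):
W_img = rng.normal(size=(16, d_tok))     # projects image features into Z_LLM
W_act = rng.normal(size=(4, d_tok))      # new action projection (the added piece)
embed = rng.normal(size=(vocab, d_tok))  # token-embedding table
head = rng.normal(size=(d_tok, vocab))   # LM output head

def forward(img_feats, actions, question_ids, answer_id):
    """Concatenate image, action, and question embeddings into one input
    sequence, then compute the cross-entropy of the target answer token
    (mean-pooling stands in for the transformer)."""
    seq = np.concatenate([
        img_feats @ W_img,       # image patch embeddings
        actions @ W_act,         # one embedding per action
        embed[question_ids],     # question token embeddings
    ], axis=0)
    logits = seq.mean(axis=0) @ head
    # numerically stable log-softmax, then negative log-likelihood
    logp = logits - logits.max() - np.log(np.exp(logits - logits.max()).sum())
    return -logp[answer_id]

loss = forward(rng.normal(size=(2, 16)), rng.normal(size=(3, 4)),
               np.array([1, 2, 3]), answer_id=0)
```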
Experimental Results
Is the SWM an effective decision-making world model?
First, the authors evaluated the SWM's planning ability by applying the sampling-based planner MPPI on top of the SWM on the LangTable and OGBench tasks.
As shown in Table 2, sampling-based planning can be used directly on top of the semantic world model, achieving near-perfect success rates on the reaching and block-separation tasks in both environments.
However, for large models sampling-based planning is computationally expensive, and running MPPI on more challenging tasks that require more samples is infeasible. For these tasks, the authors therefore consider a setting in which a base policy generates candidate trajectories, which are then refined with the SWM via gradient-based optimization. As shown in Figure 5, this refinement yields significant gains over the base policy: on LangTable, average performance rises from 14.4% with the base policy to 81.6% with the SWM; on OGBench, from 45.33% to 76%. The SWM also outperforms the AVD and IDQL baselines on all tasks, demonstrating its effectiveness for planning.
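The refinement step can be illustrated with a toy first-order update: treat the model's log-likelihood of the expected answer as a differentiable function of the action sequence and ascend its gradient. The 1-D objective below is an invented stand-in for the SWM score, not the paper's implementation:

```python
import numpy as np

def expected_answer_logprob(actions, obs=2.0):
    """Toy differentiable stand-in for log p('yes' | obs, actions): the score
    is highest when the actions drive a 1-D state from obs to the origin."""
    return -(obs + actions.sum()) ** 2

def refine(actions, obs=2.0, lr=0.05, steps=50):
    """First-order refinement: gradient ascent on the answer log-likelihood
    with respect to the action sequence (analytic gradient of the toy score)."""
    actions = actions.copy()
    for _ in range(steps):
        grad = -2.0 * (obs + actions.sum()) * np.ones_like(actions)
        actions += lr * grad
    return actions

base = np.zeros(5)       # candidate action sequence from a base policy
better = refine(base)    # refined sequence scores strictly higher
```

In practice the gradient would come from backpropagating through the frozen SWM rather than from a closed-form expression.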
The SWM also handles longer-horizon tasks by first selecting sub-goals and then planning toward them. As shown in Table 1, on multi-step tasks the SWM achieves an average policy improvement of 52.0%, outperforming the AVD baseline.
Can suboptimal data improve modeling performance?