Why do VLA models ignore language? Cracking the instruction-following illusion and breaking new ground in out-of-distribution scene generalization
[Summary] Current VLA models often rely on visual cues rather than language instructions, which leads to poor performance in new scenarios. The paper proposes LangForce, a method that strengthens the model's reliance on language by introducing a log-likelihood ratio loss, improving generalization in out-of-distribution environments while preserving the model's core language capabilities.
Vision-Language-Action (VLA) models combine visual understanding, natural language processing, and action generation, enabling robots to follow human instructions. These models build on pre-trained Vision-Language Models (VLMs) to understand what they see and what they are asked to do, and then translate this understanding into physical actions.
However, current VLA models face a fundamental problem: they often form a "visual shortcut", causing them to ignore language instructions and only rely on visual cues.
This happens because typical robot training datasets create a predictable mapping between what the robot sees and the actions it should perform, making language instructions redundant. For example, seeing a cabinet almost always means "open the cabinet", regardless of the actual instructions given.
In other words, during VLA training, the language instruction often provides the model with no information beyond what vision already conveys.
From a Bayesian perspective, a VLA policy $\pi(a \mid v, \ell)$ can be decomposed as:

$$\pi(a \mid v, \ell) \;\propto\; p(a \mid v)\, p(\ell \mid a, v)$$

Here, $p(a \mid v)$ is the visual-only prior (i.e., which actions are plausible in this scene?), while $p(\ell \mid a, v)$ is the language likelihood (i.e., how well does action $a$ explain the instruction $\ell$?). When $p(a \mid v)$ is sharply peaked, the model can predict $a$ from $v$ alone without attending to $\ell$. The likelihood term $p(\ell \mid a, v)$ then effectively reduces to $p(\ell \mid v)$, a constant with respect to $a$, and the posterior policy degenerates to the prior:

$$\pi(a \mid v, \ell) \;\approx\; p(a \mid v)$$
In other words, the model effectively ignores language instructions and learns a "visual shortcut", which fails when the task is ambiguous or the environment changes.
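To make this concrete, here is a minimal numerical sketch (ours, not from the paper; the action names and probabilities are made up) of the decomposition above: with a sharply peaked visual-only prior, the posterior barely changes across different instructions, whereas a flat prior lets the instruction dominate.

```python
# Toy illustration of pi(a|v,l) ∝ p(a|v) * p(l|a,v) with made-up numbers.
import numpy as np

actions = ["open_cabinet", "close_cabinet", "pick_cup"]

def posterior(prior, likelihood):
    """pi(a|v,l) ∝ p(a|v) * p(l|a,v), renormalized over actions."""
    unnorm = prior * likelihood
    return unnorm / unnorm.sum()

# Language likelihoods p(l|a,v) for two different instructions.
lik_open  = np.array([0.80, 0.10, 0.10])   # l = "open the cabinet"
lik_close = np.array([0.10, 0.80, 0.10])   # l = "close the cabinet"

# Case 1: sharp prior -- the training scene almost always means "open the cabinet".
sharp_prior = np.array([0.98, 0.01, 0.01])
print(posterior(sharp_prior, lik_open))    # ~[0.997, 0.001, 0.001]
print(posterior(sharp_prior, lik_close))   # ~[0.92, 0.07, 0.01] -> instruction barely matters

# Case 2: flat prior -- the scene alone does not determine the action.
flat_prior = np.array([1/3, 1/3, 1/3])
print(posterior(flat_prior, lik_open))     # ~[0.80, 0.10, 0.10]
print(posterior(flat_prior, lik_close))    # ~[0.10, 0.80, 0.10] -> instruction dominates
```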
Paper link: https://arxiv.org/abs/2601.15197
In a recent paper from Huazhong University of Science and Technology, Harbin Institute of Technology, The Hong Kong University of Science and Technology (Guangzhou), and other institutions, the authors first provide empirical evidence for the hypothesis that standard VLA models trained on goal-driven datasets typically learn visual-only policies $\pi(a \mid v)$ rather than truly language-conditioned policies $\pi(a \mid v, \ell)$.
Specifically, the researchers conducted three preliminary experiments using the Qwen3VL-4B-GR00T model in starVLA as a representative VLA architecture to reveal this "illusion of instruction following".
In all three experiments, the model was trained with only the visual observation $v$ as input, while the language instruction $\ell$ was masked out.
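The article does not include implementation code for this baseline; the snippet below is a hypothetical sketch of how such instruction masking could look in a typical VLA data pipeline. The field names are illustrative assumptions, not the actual starVLA code.

```python
# Hypothetical sketch of the "visual-only" baseline: the same architecture is
# trained, but every instruction is replaced with an empty string so the
# policy can rely only on visual observations. Field names are illustrative.

def build_batch(samples, mask_language=True):
    """Prepare a training batch; with mask_language=True the policy sees only vision."""
    images = [s["images"] for s in samples]
    # Replace each instruction with "" so the text encoder carries no task information.
    instructions = ["" if mask_language else s["instruction"] for s in samples]
    actions = [s["action"] for s in samples]
    return images, instructions, actions

# Toy usage with dummy samples (stand-ins for real observations/actions).
samples = [
    {"images": "frame_0.png", "instruction": "open the cabinet", "action": [0.1, 0.0, 0.2]},
    {"images": "frame_1.png", "instruction": "pick up the cup",  "action": [0.0, 0.3, 0.1]},
]
print(build_batch(samples, mask_language=True))   # instructions become ""
print(build_batch(samples, mask_language=False))  # original instructions kept
```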
Experiment 1: Identifying Visual Shortcuts in Recognition Tests
The researchers first trained a standard VLA model on the humanoid tabletop-manipulation data from PhysicalAI-Robotics-GR00T-X-Embodiment-Sim (a HuggingFace dataset) and evaluated it on 24 tasks from the RoboCasa benchmark. Because the training and test scenes are highly similar, the visual-only model achieved a 44.6% success rate across the 24 tasks, very close to the language-conditioned baseline (47.8%).
This small gap indicates that the model can succeed without relying on language instructions because the training and evaluation scenarios and tasks are highly similar, enabling the model to learn an approximately deterministic mapping from vision to actions. The following figure provides a relevant example.
Experiment 2: Failure in Divergent Situations
To further study this behavior, the researchers trained a VLA model on the classic LIBERO benchmark, which contains four subsets: Spatial, Object, Long, and Goal. The same model was jointly trained on all four training sets and evaluated on all four test sets.
The results show that on three subsets (Spatial: 95.7%, Object: 92.7%, Long: 95.3%), the performance of the visual-only model is comparable to that of the full VLA model. In these subsets, each visual scene corresponds to a single task. However, on the LIBERO Goal subset, the success rate of the visual-only model drops sharply to 12.4%.
The key difference is that LIBERO Goal itself is divergent: during training, the same object configuration may correspond to multiple valid tasks. For example, a scene containing multiple bowls, a stove, and a drawer may correspond to "put the bowl in the drawer" or "put the bowl on the stove".
Experiment 3: Catastrophic Failure in Out-of-Distribution Generalization
Finally, the researchers tested the generalization ability of the model by training it on the high-quality BridgeDataV2 dataset (diverse, in-the-wild scenes) and evaluating it on SimplerEnv (simulation, OOD).
When trained on the Bridge dataset, the visual-only model reaches an action loss of 0.13, comparable to the 0.08 of the full language-conditioned model (Figure 2(b)). This indicates that even in diverse, real-world scenes, the model can still find visual shortcuts (e.g., specific lighting or background features that correspond to specific actions) and thereby minimize the training objective without truly understanding the language instructions.
However, this reliance on visual shortcuts has a catastrophic impact on generalization ability.
When evaluated on SimplerEnv, a simulation environment with visually distinct characteristics, the visual-only baseline achieved a success rate close to 0%, confirming that the low training loss on the Bridge tasks stems from overfitting to domain-specific visual patterns rather than from learning generalizable manipulation skills.
Consequently, when these specific visual cues are absent in out-of-distribution (OOD) environments, the policy fails completely.
Information Collapse
The researchers formalize the "visual shortcut" as a collapse of the conditional mutual information (CMI) between instructions and actions. Ideally, a robust VLA policy should maintain a high $I(\ell; a \mid v)$, meaning that the chosen action substantially reduces the uncertainty about the instruction. However, the CMI is bounded by the conditional entropy of the language:

$$I(\ell; a \mid v) \;\le\; H(\ell \mid v)$$

In goal-driven datasets, the nearly deterministic mapping from scene to instruction (each scene implies a single task) drives $H(\ell \mid v) \approx 0$. Therefore, $I(\ell; a \mid v) \approx 0$: the instruction contributes essentially no information to action prediction.
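As a concrete (and purely illustrative, not from the paper) check of this collapse, the plug-in CMI estimate below is zero when each scene always co-occurs with the same instruction, and rises to $\log 2$ nats when the same scene is paired with two distinct instructions, as in LIBERO Goal.

```python
# Toy plug-in estimate of I(l; a | v) on discrete (scene, instruction, action) data.
import numpy as np
from collections import Counter

def conditional_mutual_information(triples):
    """Estimate I(l; a | v) in nats from (v, l, a) samples via empirical counts."""
    n = len(triples)
    p_vla = Counter(triples)
    p_v   = Counter(v for v, _, _ in triples)
    p_vl  = Counter((v, l) for v, l, _ in triples)
    p_va  = Counter((v, a) for v, _, a in triples)
    cmi = 0.0
    for (v, l, a), c in p_vla.items():
        joint = c / n
        cmi += joint * np.log(joint * (p_v[v] / n) / ((p_vl[(v, l)] / n) * (p_va[(v, a)] / n)))
    return cmi

# Goal-driven data: each scene always comes with the same instruction and action.
goal_driven = [("cabinet_scene", "open the cabinet", "open")] * 50 + \
              [("cup_scene", "pick up the cup", "pick")] * 50
print(conditional_mutual_information(goal_driven))   # ~0.0, since H(l|v) ~ 0

# Divergent data: the same scene is paired with two different instructions/actions.
divergent = [("bowl_scene", "put the bowl in the drawer", "drawer")] * 50 + \
            [("bowl_scene", "put the bowl on the stove", "stove")] * 50
print(conditional_mutual_information(divergent))     # ~log(2) ≈ 0.69 nats
```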