Unifying the VLA paradigm: the Hong Kong University of Science and Technology has open-sourced the Lego-style StarVLA architecture, sharply reducing reproduction costs.
Reported by Xinzhiyuan (新智元)
Editor: LRST
【Introduction】The VLA (Vision-Language-Action) track of embodied intelligence is currently stuck in a typical "fragmentation" quagmire: different teams adopt heterogeneous action-decoding paradigms, tightly coupled data pipelines, and incompatible evaluation protocols, making horizontal comparison of methods difficult and reproduction extremely costly. The open-source project StarVLA does not pile on compute or chase leaderboard rankings; instead, it attacks these pain points at the system-abstraction level with a "Lego-style" unified Backbone-Action-Head architecture.
Although VLA models have become the mainstream paradigm of embodied general intelligence, academic research is facing a triple "Tower of Babel" dilemma:
- Architecture fragmentation: Autoregressive discrete tokenization, parallel continuous regression, flow matching denoising, dual-system reasoning... Different action decoding paradigms use completely different code implementations and interface assumptions.
- Strongly coupled pipelines: Most existing open-source frameworks are "single-method customized". Data preprocessing, training loops, and evaluation protocols are deeply bound, resulting in modules that cannot be reused across projects.
- Inconsistent evaluation standards: Each paper only reports results on disjoint benchmark subsets, and the preprocessing and inference protocols are opaque, making fair comparison almost impossible.
This fragmentation has seriously slowed down the iteration rhythm of embodied foundation models.
The Hong Kong University of Science and Technology has open-sourced a new project, StarVLA. Its core insight is that VLM-based and World-Model-based VLAs are not fundamentally opposing paradigms but variants of the same policy framework that differ only in their auxiliary learning signal (L_aux).
Building on this, the team has constructed a highly modular open-source base with unified interfaces, letting researchers freely combine backbone networks and action heads like Lego bricks and verify the impact of a single design variable under fully controlled conditions.
Open-source address: https://github.com/starVLA/starVLA
Project homepage: https://starvla.github.io
Paper link: https://arxiv.org/abs/2604.05014
Decoding the architecture, a policy-centric "Lego" abstraction
StarVLA introduces a unified, policy-centric formulation at the system level, mapping multimodal observations, language instructions, and future action chunks onto the same computational graph:

π_θ(a, y_aux | o, ℓ)

where o is the multimodal historical observation, ℓ is the language instruction, a is the predicted action chunk, and y_aux is an optional auxiliary output (such as future visual frames or spatial-reasoning text). The training objective decomposes uniformly as:

L = L_action + λ · L_aux

- Direct VLA: λ = 0, pure action supervision.
- VLM-based VLA: L_aux introduces language-alignment auxiliary objectives (such as subtask planning or spatial grounding).
- WM-based VLA: L_aux introduces future-observation prediction as an auxiliary objective or implicit prior.
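The paper's "one framework, different auxiliary signals" insight can be sketched in a few lines. This is an illustrative stub, not StarVLA's API: the function name, the placeholder loss values, and the weight `lam` are all assumptions.

```python
# Minimal sketch of the unified objective: all three paradigms share
# L = L_action + lam * L_aux and differ only in what L_aux measures.
def vla_loss(l_action, l_aux=0.0, lam=0.0):
    # Direct VLA:    lam = 0 -> pure action supervision
    # VLM-based VLA: l_aux = language-alignment loss
    # WM-based VLA:  l_aux = future-observation prediction loss
    return l_action + lam * l_aux
```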
Under this abstraction, StarVLA achieves Bidirectional Modularity:
Plug-and-play Backbone: Supports instruction-tuned VLMs such as Qwen3-VL and InternVL, as well as world models such as Cosmos-Predict2. They can be connected to the unified representation contract with only a lightweight adaptation layer.
Plug-and-play Action Head: Four representative action decoders are built-in, sharing the same forward() and predict_action() interfaces:
StarVLA-FAST: Autoregressive discrete token generation
StarVLA-OFT: Lightweight MLP parallel continuous regression
StarVLA-π: Inter-layer Cross-DiT flow matching denoising
StarVLA-GR00T: System 2 (slow reasoning) + System 1 (fast action) dual-system architecture
All variants share the same data interface, training loop, and evaluation pipeline. The paradigm can be switched by simply replacing the Backbone or Action Head. This completely eliminates the "hidden variable interference" when comparing different methods.
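A shared head contract of this kind might look as follows. The base class, method signatures, and the toy regression head are illustrative assumptions, not StarVLA's literal code; they only show how a common forward()/predict_action() surface lets backbones and heads recombine freely.

```python
# Hypothetical sketch of a shared action-head interface.
class ActionHead:
    def forward(self, backbone_features, actions):
        """Training pass: return the action loss for a batch."""
        raise NotImplementedError

    def predict_action(self, backbone_features):
        """Inference pass: return a predicted action chunk."""
        raise NotImplementedError

class MLPRegressionHead(ActionHead):  # OFT-style parallel regression
    def predict_action(self, feats):
        return [0.0 for _ in feats]   # placeholder regression output

    def forward(self, feats, actions):
        pred = self.predict_action(feats)
        # mean-squared error against ground-truth actions
        return sum((p - a) ** 2 for p, a in zip(pred, actions)) / len(actions)
```

Any decoder implementing these two methods drops into the same training loop and evaluation pipeline, which is what makes paradigm switching a one-line change.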
Training paradigm, moving from single-benchmark fine-tuning to multimodal collaboration
StarVLA abstracts the training strategy into a reusable configuration decoupled from the architecture, supporting three core paradigms:
1. Behavior-cloning supervised fine-tuning (SFT)
A complete distributed training script (Accelerate + DeepSpeed ZeRO-2) is provided, supporting full-parameter fine-tuning and sub-module freezing. The optimizer uses independent learning rates for multiple parameter groups, bfloat16 mixed precision, and cosine decay scheduling to ensure stable training of heterogeneous components.
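Independent learning rates for heterogeneous components are typically expressed as optimizer parameter groups. A dependency-free sketch of the grouping step (the name prefixes and the rates are illustrative assumptions; the resulting dicts can be passed to e.g. torch.optim.AdamW):

```python
# Hypothetical sketch: split VLA parameters into groups with separate
# learning rates for the backbone and the action head.
def build_param_groups(named_params, lr_backbone=1e-5, lr_head=1e-4):
    groups = {"backbone": [], "action_head": []}
    for name, p in named_params:
        key = "backbone" if name.startswith("backbone.") else "action_head"
        groups[key].append(p)
    return [
        {"params": groups["backbone"], "lr": lr_backbone},
        {"params": groups["action_head"], "lr": lr_head},
    ]
```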
2. Multi-objective collaborative training (Co-Training)
Pure action fine-tuning easily induces "catastrophic forgetting" in the VLM backbone.
StarVLA therefore builds in a dual-data-stream co-training mechanism: VLA action forward passes and VLM language-modeling forward passes are executed alternately, and trainer.loss_scale.vlm dynamically balances action learning against retention of multimodal representations. Experiments show that co-training significantly improves spatial-grounding ability and yields a 4%-10% success-rate gain on WidowX and Google Robot.
3. Cross-embodiment mixed training
Through the LeRobotMixtureDataLoader, users can declare any combination of robot datasets and sampling weights in YAML. The framework automatically handles action space alignment and embodiment label tracking. This design turns "cross-embodiment pre-training" from a customized script into a standardized configuration.
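A mixture might be declared along these lines; the field names and dataset identifiers below are illustrative assumptions, not the framework's exact schema:

```yaml
# Hypothetical cross-embodiment mixture config for LeRobotMixtureDataLoader
mixture:
  - dataset: widowx_bridge
    weight: 0.5
  - dataset: google_robot_rt1
    weight: 0.3
  - dataset: robotwin_aloha
    weight: 0.2
```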
Evaluation and deployment, the Server-Client architecture connects Sim2Real
To keep benchmark dependencies from polluting the model environment, StarVLA adopts a lightweight WebSocket Server-Client evaluation abstraction:
- The model side only exposes the predict_action() interface and starts the policy service after loading the checkpoint.
- The evaluation side (such as LIBERO, SimplerEnv, RoboTwin 2.0 official environments) encapsulates the observation dictionary through an independent Client, communicates with msgpack, and returns normalized actions.
- No code modification is required for real robot deployment: Simply replace the robot controller with a Client, provide camera observations and instructions in the same format, and it can be seamlessly migrated to the physical world.
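The pattern above can be sketched without any robotics stack. StarVLA uses WebSocket + msgpack; this sketch substitutes plain TCP + JSON so it runs with the standard library alone, and all names (PolicyHandler, PolicyClient, dummy_predict_action) are illustrative.

```python
# Dependency-free sketch of the Server-Client policy-evaluation pattern.
import json
import socket
import socketserver

def dummy_predict_action(obs):
    # Stand-in for a loaded policy checkpoint: a zero 7-DoF action chunk.
    return [[0.0] * 7 for _ in range(obs.get("chunk", 1))]

class PolicyHandler(socketserver.StreamRequestHandler):
    def handle(self):
        # One JSON observation per line in, one JSON action chunk out.
        obs = json.loads(self.rfile.readline())
        reply = {"action": dummy_predict_action(obs)}
        self.wfile.write((json.dumps(reply) + "\n").encode())

class PolicyClient:
    """Client side: a simulator or real-robot controller only needs this."""
    def __init__(self, host, port):
        self.addr = (host, port)

    def predict_action(self, obs):
        with socket.create_connection(self.addr) as s:
            s.sendall((json.dumps(obs) + "\n").encode())
            line = s.makefile().readline()
        return json.loads(line)["action"]
```

Swapping the simulator client for a real-robot client changes nothing on the server side, which is what makes the Sim2Real migration seamless.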
Currently, 7 mainstream benchmarks have been integrated (LIBERO, SimplerEnv, RoboTwin 2.0, RoboCasa-GR1, BEHAVIOR-1K, CALVIN, etc.), along with complete benchmark-specific adapters implementing post-processing logic such as action de-normalization, chunk splitting, and Delta/Absolute conversion.
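The three post-processing steps can be sketched as small pure functions. The normalization statistics and the additive delta convention are illustrative assumptions, not StarVLA's exact implementation.

```python
# Hypothetical sketch of a benchmark adapter's post-processing chain.
def denormalize(action, mean, std):
    # Undo dataset-level normalization: a_raw = a_norm * std + mean.
    return [a * s + m for a, m, s in zip(action, mean, std)]

def split_chunk(chunk, horizon):
    # Execute only the first `horizon` steps of a predicted action chunk.
    return chunk[:horizon]

def delta_to_absolute(delta, current_state):
    # Convert a relative (delta) command into an absolute target.
    return [c + d for c, d in zip(current_state, delta)]
```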
Performance and efficiency, proof of strong generalization under minimal configuration
StarVLA deliberately avoids complex data engineering and online optimization (such as DAgger). Simply fine-tuning on each benchmark's official demonstration set from publicly available VL pre-trained weights already yields highly competitive performance.
More importantly, swapping the Backbone hardly affects performance: replacing Qwen3-VL-4B with Cosmos-Predict2-2B keeps the LIBERO average stable above 95.2%, verifying the architecture's robustness to backbone choice.
In the cross-benchmark Generalist setting, a single model is jointly trained on LIBERO + SimplerEnv + RoboTwin 2.0 + RoboCasa-GR1. Its average success rate on RoboCasa rises from the best Specialist result of 48.8% to 57.3%, demonstrating the feasibility of All-in-One training under a unified pipeline.
In terms of computational efficiency: Tests on a single node with 8×A100 show that when the Per-GPU Batch Size = 8, the GPU utilization reaches 92%, and the sample throughput is 56.6 samples/s. When expanding to a multi-node setting with 256 GPUs, the communication overhead only has a single jump from 8