A Month of Work Done in a Week: Training Speed of NVIDIA's World Model Increases Almost Fourfold
On the road to AGI, the world model is regarded as the key component that enables AI to truly understand and predict the physical world.
The World Action Model (WAM) DreamZero, recently presented by NVIDIA, topped two robot benchmarks, RoboArena and MolmoSpaces, right after its release and has attracted great attention in the field of embodied intelligence.
In contrast to traditional Vision-Language-Action (VLA) models, WAM takes videos containing complete spatio-temporal information as its central learning material. It first learns how the world changes and only then decides how to act. As a result, the model naturally absorbs the vast physical experience contained in Internet videos.
It no longer has to learn through numerous repetitions of a single movement but can learn the physical laws of the world from diverse data. It can therefore also act stably in previously unseen environments and tasks.
Direct comparison of the best current VLA model with the DreamZero world model in terms of task success rate, generalization, cross-embodiment transfer, etc.
The above table shows that DreamZero has significant advantages over the best open-source VLA model, π0.5, in task success rate, task generalization, success-rate improvement after training, and generalization across real robot embodiments. Its success rate is more than twice as high.
This paradigm change not only reduces learning costs significantly but also means that adapting to a new robot form or extending its capabilities no longer requires a large amount of task-specific data. It offers a practical path toward cooperation between multiple robot types, rapid deployment, and cost-effective iteration.
However, the multimodal WAM model with its diffusion architecture also places enormous demands on computing power and GPU memory.
According to the official open-source training code of DreamZero, training on 247.5 million video frames with 8 H100 GPUs takes 25 days. The high training cost and long duration are the main barriers to reproducing the experiments in industry.
To make such research projects more efficient, Wuwen Xinqiong, in cooperation with Tsinghua University and others, has developed the large-scale deep reinforcement learning platform RLinf, which now also fully supports DreamZero training.
In addition, thanks to its strong system optimization capabilities, RLinf has thoroughly rebuilt and accelerated the DreamZero training pipeline.
Compared with the baseline training script of DreamZero, RLinf achieves an almost four-fold increase in training throughput and better convergence.
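As a rough back-of-the-envelope check of what this means in wall-clock time, the sketch below divides the 25-day baseline quoted above by the speedup. It assumes the speedup applies uniformly to the whole run on the same 8 H100 GPUs, which the article does not state explicitly; the function name is illustrative only.

```python
# Back-of-the-envelope estimate, not a measurement.
# Assumption: the throughput gain applies uniformly to the entire 25-day run.
BASELINE_DAYS = 25.0  # from the text: 8 H100 GPUs, official training code
SPEEDUP = 4.0         # "almost four-fold", so treat the result as a lower bound


def estimated_days(baseline_days: float, speedup: float) -> float:
    """Wall-clock days after dividing the baseline duration by the speedup."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return baseline_days / speedup


days = estimated_days(BASELINE_DAYS, SPEEDUP)
print(f"{days:.2f} days")  # 6.25 days
```

Under that assumption, a run of roughly a month shrinks to roughly a week.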
How does RLinf achieve an almost four-fold training acceleration and make optimal use of the GPU's computing power? The following explains the central optimization ideas and the logic behind them.
Code Link: https://github.com/RLinf/RLinf
Hugging Face Link: https://huggingface.co/RLinf
Usage Documentation Link: https://rlinf.readthedocs.io/zh-cn/latest/rst_source/examples/embodied/sft_dreamzero.html
01
Revealing the Core Secret
Behind the Almost Four-Fold Acceleration
3 Optimization Dimensions
To overcome the performance limitations of the official script, the RLinf system optimization team has optimized three areas in depth: the compute graphs, FSDP2 parallelization together with global parameter tuning, and the data processing pipeline.
Extreme Optimization of Operators/Compute Graphs: Torch Compile + CUDA Graph
Inefficient operators and Python-level scheduling overhead are often the "invisible killers" that keep the GPU from reaching peak performance.
In RLinf, we have deeply integrated torch.compile and CUDA Graph:
Torch Compile: Compilation at the lower level fuses operators (kernel fusion), including inefficient operators in the diffusion architecture such as WanRMSNorm and adaLN-zero.
CUDA Graph: The compute graph is captured once and then replayed, which removes the CPU-side scheduling bottleneck when launching GPU kernels. In DreamZero training, kernel launches are particularly dense in the CausalWanSelfAttention part, and CUDA Graph handles this case effectively.
With these optimizations, the DreamZero 5B and 14B models train 50% faster (from 1.8 s/step to 1.2 s/step) and 34% faster (from 9 s/step to 6.7 s/step), respectively, at an unchanged configuration of mbs = 1 (here and below, mbs means the microbatch size per GPU).
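The percentages above can be reproduced from the step times. The short sketch below (helper names are illustrative) shows that 50% and 34% are throughput gains, computed as old step time divided by new step time minus one; expressed as a reduction in step time, the same measurements correspond to roughly 33% and 26%.

```python
def throughput_gain_percent(old_step_s: float, new_step_s: float) -> float:
    """Percent more steps per second after the step time drops."""
    return (old_step_s / new_step_s - 1.0) * 100.0


def step_time_cut_percent(old_step_s: float, new_step_s: float) -> float:
    """Percent by which the duration of a single step shrinks."""
    return (1.0 - new_step_s / old_step_s) * 100.0


# Step times quoted in the text (mbs = 1 per GPU).
gain_5b = throughput_gain_percent(1.8, 1.2)
gain_14b = throughput_gain_percent(9.0, 6.7)
cut_5b = step_time_cut_percent(1.8, 1.2)
cut_14b = step_time_cut_percent(9.0, 6.7)

print(f"5B:  +{gain_5b:.0f}% throughput, -{cut_5b:.0f}% step time")
print(f"14B: +{gain_14b:.0f}% throughput, -{cut_14b:.0f}% step time")
```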
Joint Optimization of Computing Power and GPU Memory: Unlocking Comprehensive Performance Tuning
Support for arbitrary microbatch sizes, parallelization types, parameter tuning, and recompute (activation recomputation) is an indispensable means of performance tuning when training large models in industry.
However, the DreamZero baseline script has obvious technical limitations. For example, it uses the DeepSpeed ZeRO-2 Offload parallelization method by default, and the image encoder runs sample by sample without batch processing. This significantly narrows the room for performance tuning.
The RLinf team has strengthened the technical foundation and completely solved these problems. It has developed a robust and highly configurable optimization matrix:
Stable Adaptation to FSDP2: FSDP2 is the PyTorch team's latest official ZeRO-style implementation and RLinf's standard parallelization strategy for medium-sized models. The DeepSpeed strategy in the official DreamZero code has two limitations. First, ZeRO-3 is incompatible with the context management mechanism of the causal convolution in the VAE module, so developers often had to fall back on the slower ZeRO-2 Offload mode. Second, DeepSpeed's post-backward hooks cause high CPU overhead during backward propagation, which limits overall training throughput. Migrating to the FSDP2 training backend resolves both the architectural conflict and the performance limitation. Users can now switch flexibly between sharding strategies according to their GPU memory requirements while keeping training efficient and stable.
Flexible Microbatch Setting: In the first version of DreamZero training on FSDP2, combining microbatch size (mbs), recompute (activation recomputation), and the FSDP2 strategy often led to complex conflicts in the underlying compute graphs. In addition, because the image encoder lacked batch processing, part of the speedup from a larger mbs was lost. RLinf has resolved the compatibility problems for mbs > 1 and enabled efficient batch processing in the image encoder. This makes the training platform more flexible: users can set any mbs and tune it for the best balance between GPU memory usage and execution performance. For example, when training DreamZero 5B with recompute disabled, raising mbs from 1 to 2 increased throughput by 85%, while the step time rose only from 1.2 s/step to 1.3 s/step.
Deep Coordination between Recompute Mechanism and Accelerating Operators: PyTorch has compatibility limitations under complex parallelization strategies. RLinf works around them through in-depth technical optimization, so that recompute (activation recomputation), CUDA Graph, and FSDP2 are decoupled and work together stably. This turns recompute into a reliable and quantifiable tuning dimension. On hardware with limited GPU memory, the system trades a small amount of extra compute time for a large memory saving, which supports larger workloads and significantly increases overall training throughput. When training DreamZero 5B with recompute enabled, the microbatch size could be raised from 2 to 32. This lifted throughput from 1.7 samples/sec/GPU to 4.4 samples/sec/GPU, an increase of 158%. Enabling recompute thus permits a much larger mbs, which in turn improves operator efficiency.
By tuning FSDP2, mbs, and recompute together, we increased the training performance of the DreamZero 5B model by 266% to 4.4 samples/sec/GPU, relative to the result of the first, operator-level optimization (1.2 samples/sec/GPU).
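The tuning figures in this section can be checked with a few lines of arithmetic. The sketch below assumes that each optimizer step processes exactly one microbatch per GPU (no gradient accumulation), so per-GPU throughput is mbs divided by step time. The absolute samples/sec values are used as reported rather than re-derived from step times, since the text does not say how the two sets of measurements relate. Helper names are illustrative.

```python
def samples_per_sec(mbs: int, step_time_s: float) -> float:
    """Per-GPU throughput, assuming one microbatch per optimizer step."""
    return mbs / step_time_s


def percent_increase(before: float, after: float) -> float:
    """Relative gain of `after` over `before`, in percent."""
    return (after / before - 1.0) * 100.0


# mbs 1 -> 2 with recompute disabled (step times from the text).
mbs_gain = percent_increase(samples_per_sec(1, 1.2), samples_per_sec(2, 1.3))

# mbs 2 -> 32 with recompute enabled (throughputs from the text).
recompute_gain = percent_increase(1.7, 4.4)

# Whole tuning stage relative to the operator-optimized starting point.
overall_gain = percent_increase(1.2, 4.4)
overall_factor = 4.4 / 1.2

print(f"mbs 1 -> 2:       +{mbs_gain:.1f}%")        # about +85%
print(f"recompute on:     +{recompute_gain:.1f}%")  # about +159%
print(f"tuning stage:     +{overall_gain:.1f}% ({overall_factor:.2f}x)")
```

The computed values land within a percentage point of the quoted 85%, 158%, and 266%. A 266% increase corresponds to a factor of about 3.67, which is where the "almost four-fold" wording comes from.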
Overcoming the I/O Throughput Limitation: Efficient Video Data Processing Pipeline
As compute efficiency rises significantly (a result of the two optimizations above), data loading gradually becomes the new bottleneck for overall training throughput.
In DreamZero training, decoding and pre-processing video data consumes an enormous amount of CPU resources.
Traditional solutions such as PyAV cannot meet the high throughput requirements. Simply raising num_workers in the dataset class often treats only the symptom: too many data-reading processes compete heavily for CPU resources, which delays kernel launches from the main training thread and slows down GPU execution.
To find the optimal solution between "decoding speed" and "system resource consumption", the RLinf team has thoroughly tested the performance of common video processing libraries:
Although Decord is slightly faster in pure decoding speed, TorchCodec shows more stable CPU utilization at comparable performance.
This allows us to reserve enough computing capacity for the main training thread and start more num_workers to process the data in parallel.
Compared with the original PyAV solution, the decoding time per video dropped by almost 400 ms. In DreamZero training with multiple camera views (left, right, and wrist), this saves about 1.2 s of total video decoding time.
This performance improvement on the I/O side provides enough data to further tap the computing potential of the GPU.
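The 1.2 s figure follows directly from the per-video saving. The sketch below assumes that the three camera videos of one sample are decoded one after another in the same worker, which the text implies but does not state; the names are illustrative.

```python
CAMERA_VIEWS = ("left", "right", "wrist")
SAVED_PER_VIDEO_MS = 400.0  # "almost 400 ms" per video, so an upper bound


def total_saving_s(num_videos: int, saved_per_video_ms: float) -> float:
    """Decode time saved per sample if its videos are decoded sequentially."""
    return num_videos * saved_per_video_ms / 1000.0


saving = total_saving_s(len(CAMERA_VIEWS), SAVED_PER_VIDEO_MS)
print(f"{saving:.1f} s saved per sample")  # 1.2 s saved per sample
```

For scale, 1.2 s is about as long as one full training step of the 5B model at mbs = 1 as quoted earlier, which illustrates why decoding cost matters once the compute side has been optimized.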
02
Performance Test
The Leap from "Works" to "Extremely Efficient"
In an End-to-End Scenario
To verify the overall effectiveness of the multi-dimensional optimizations above, we ran strict end-to-end tests on DreamZero models of different sizes on the DROID dataset (each sample contains three camera views: left, right, and wrist; video size: 33 frames × 480 × 640).
DreamZero-14B: Leap in Throughput for Large Models
For the 14B model, GPU memory pressure is so high that the official baseline script usually has to use the DeepSpeed ZeRO Offload strategy. This causes significant compute and communication losses as well as the cost of swapping data between CPU and GPU memory.
Compared with the original DeepSpeed strategy, RLinf achieves a 2.7-fold acceleration for the 14B model. Even compared with unoptimized FSDP2, throughput is 35% higher.
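If both figures were measured on the same setup (an assumption; the text does not say so), they imply how much of the gain comes from the switch to FSDP2 alone. The sketch below is plain arithmetic on the two quoted ratios, not an additional measurement.

```python
# Ratios quoted in the text for the 14B model.
RLINF_VS_DEEPSPEED = 2.7     # RLinf throughput / DeepSpeed baseline throughput
RLINF_VS_PLAIN_FSDP2 = 1.35  # RLinf throughput / unoptimized FSDP2 throughput

# Implied ratio: unoptimized FSDP2 throughput / DeepSpeed baseline throughput.
plain_fsdp2_vs_deepspeed = RLINF_VS_DEEPSPEED / RLINF_VS_PLAIN_FSDP2
print(f"{plain_fsdp2_vs_deepspeed:.1f}x")  # 2.0x
```

Under that assumption, moving off the offload strategy accounts for roughly a factor of two, and the remaining optimizations contribute the further 35%.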
DreamZero-5B: Maximum Utilization of Computing Power
For the 5B model, the advantage of RLinf is that it can stably start larger microbatch sizes (mbs) through efficient recomputation logic and, in combination with other compute graph optimizations, fully utilize the computing power of the GPU.