Accomplish a Month's Work in a Week: NVIDIA's World Model Training Speed Surges by 400%

NVIDIA's world action model, DreamZero, takes a full 25 days to train on 8 H100 GPUs. RLinf has rebuilt the training stack end to end at the system level, from operator fusion to I/O, raising training throughput nearly 4 times. A job that took about a month can now finish in roughly a week.

On the path to AGI, the world model is regarded as a key piece of the puzzle for AI to truly understand and predict the physical world.

DreamZero, the World Action Model (WAM) recently released by NVIDIA, topped two robot benchmarks, RoboArena and MolmoSpaces, on release and has drawn wide attention in the embodied intelligence field.

Unlike traditional models such as VLA, WAM uses video, which carries complete spatio-temporal information, as its core learning material. It follows a "first understand how the world changes, then decide how to act" approach, which lets the model absorb the vast physical experience contained in Internet videos.

It no longer needs a large number of repeated demonstrations to learn a single action. Instead, it can learn the physical laws of the world from diverse data, thus maintaining stable execution ability in unseen environments and tasks.

Comparison of the current best VLA model and the DreamZero world model in terms of task success rate, generalization ability, cross-embodiment transfer, etc.

The table above shows that, compared with π0.5, the best open-source VLA model, DreamZero has clear advantages in task success rate, task generalization, the success-rate gain from post-training, and generalization across real robot embodiments, with a success rate improvement of more than 2x.

This paradigm shift not only significantly reduces learning cost but also frees robot embodiment adaptation and skill expansion from dependence on large amounts of dedicated data, opening a feasible path to multi-model collaboration, rapid deployment, and low-cost iteration.

However, a multimodal WAM built mainly on the Diffusion architecture also places heavy demands on compute and GPU memory.

According to the official open-source DreamZero training code, training on 247.5 million frames of data with 8 H100s takes up to 25 days for a complete run. This training cost and duration have become the main barrier to reproduction in industry.

To help frontier research be put into practice more efficiently, RLinf, a large-scale reinforcement learning framework jointly launched by Wuwen Xinqiong and Tsinghua University, now officially provides in-depth support for DreamZero training.

Going beyond functional adaptation, the team has used RLinf's low-level system optimization capabilities to deeply restructure and accelerate the DreamZero training pipeline.

Compared with the official DreamZero baseline training script, RLinf achieves nearly 4 times the training throughput, with better convergence behavior.

How does RLinf squeeze every bit of computing power out of the GPU to reach a 4x training speedup? This article breaks down the core optimization ideas and the logic behind them.

Code link: https://github.com/RLinf/RLinf

Hugging Face link: https://huggingface.co/RLinf

Usage documentation link: https://rlinf.readthedocs.io/zh-cn/latest/rst_source/examples/embodied/sft_dreamzero.html

01 Core Revelation

Behind the Nearly 4-fold Acceleration

3 Major Optimization Dimensions

To break through the performance bottlenecks of the official script, the RLinf system optimization team optimized three areas in depth: the computational graph, FSDP2 parallelism together with global parameter tuning, and the data processing pipeline.

Extreme Operator/Computational Graph Optimization: Torch Compile + CUDA Graph

The operator and scheduling overhead at the Python level is often the "invisible killer" that limits the peak performance of the GPU.

In RLinf, we have deeply integrated torch.compile and CUDA Graph technologies:

Torch Compile: Compiler-level optimization fuses operators (kernel fusion), including inefficient ones in the Diffusion architecture such as WanRMSNorm and adaLN-zero.

CUDA Graph: Captures the computational graph once and replays it, removing the CPU-side scheduling bottleneck of GPU kernel launches. In DreamZero training, kernel launches in the CausalWanSelfAttention part are relatively dense, and CUDA Graph optimizes them effectively.
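The two techniques above can be sketched on a toy module. This is a minimal illustration, not RLinf's code: the RMSNorm class below is a generic stand-in for WanRMSNorm (whose real implementation is not shown in the article), and the use of torch.compile with mode="reduce-overhead", which combines kernel fusion with CUDA Graph replay, is an assumption about one common way to enable both. On a machine without a GPU the sketch falls back to eager execution.

```python
import torch
import torch.nn as nn


class RMSNorm(nn.Module):
    """Generic RMSNorm; a stand-in for WanRMSNorm, not the real layer."""

    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Several small elementwise ops in a row: a typical fusion target.
        inv_rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return x * inv_rms * self.weight


def build_norm(dim: int) -> nn.Module:
    norm = RMSNorm(dim)
    if torch.cuda.is_available():
        # "reduce-overhead" asks the compiler to fuse kernels and replay
        # them through CUDA Graphs, cutting per-launch CPU overhead.
        return torch.compile(norm.cuda(), mode="reduce-overhead")
    # CPU fallback: plain eager execution keeps the sketch runnable anywhere.
    return norm


device = "cuda" if torch.cuda.is_available() else "cpu"
norm = build_norm(64)
x = torch.randn(4, 16, 64, device=device)
with torch.no_grad():
    y = norm(x)
print(tuple(y.shape))
```

The first call on a GPU pays a one-time compilation cost; the launch-overhead savings only appear on later calls with the same input shapes.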

With these optimizations, the DreamZero 5B and 14B models train 50% faster (from 1.8s/step to 1.2s/step) and 34% faster (from 9s/step to 6.7s/step) respectively, without changing the original configuration of mbs = 1 (mbs here and below means micro-batch size per GPU).
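The two percentages follow directly from the step times when the batch size is held fixed. A small sketch of that arithmetic (the helper name is ours, not part of RLinf):

```python
def throughput_gain_percent(old_step_s: float, new_step_s: float) -> float:
    """Throughput gain in percent when step time drops at a fixed batch size."""
    return (old_step_s / new_step_s - 1.0) * 100.0


gain_5b = throughput_gain_percent(1.8, 1.2)   # DreamZero 5B step times
gain_14b = throughput_gain_percent(9.0, 6.7)  # DreamZero 14B step times

print(f"5B: {gain_5b:.0f}%  14B: {gain_14b:.0f}%")
```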

Joint Optimization of Computation and GPU Memory: Unlocking All-around Performance Tuning

Being able to tune any Microbatch Size, parallel mode, and Recompute (activation recomputation) setting is an essential performance tuning capability when training large models in industry.

However, the official DreamZero baseline has obvious engineering limitations. For example, it uses DeepSpeed's ZeRO2 offload parallel mode by default, and its image encoder runs sample by sample without batching, which greatly narrows the performance tuning space.

The RLinf team reinforced the underlying engineering, fixed these pain points, and delivered a robust, highly configurable tuning matrix:

Stable adaptation to FSDP2: FSDP2 is the PyTorch team's latest ZeRO-style implementation and is also RLinf's default parallel scheme for medium-scale large models. The DeepSpeed scheme used in the official DreamZero code had limitations: because ZeRO3 conflicts with the context maintenance mechanism of the causal conv in the VAE module, developers were often forced to fall back to the slower ZeRO2 offload mode. In addition, DeepSpeed's post-backward hook generated high CPU-side overhead during backpropagation, limiting overall training throughput. Migrating to the FSDP2 training backend resolves these architectural conflicts and performance bottlenecks. Users can now switch flexibly between sharding strategies according to their GPU memory budget while keeping training efficient and stable.

Flexible Microbatch settings: In the initial version of FSDP2 support for DreamZero training, combining Microbatch Size (mbs), Recompute (activation recomputation), and FSDP2 strategies often triggered complex conflicts in the underlying computational graph, and the unbatched image encoder ate into the speedup from increasing mbs. RLinf has resolved the incompatibility when mbs > 1 coexists with these features and made the image encoder run efficiently in batches. Users can now configure any mbs and tune it finely against the GPU memory capacity and compute throughput of their hardware, striking a better balance between memory footprint and execution efficiency. For example, when training the DreamZero 5B model without Recompute, increasing mbs from 1 to 2 barely changes the step time (from 1.2s/step to 1.3s/step), so throughput increases by 85%.
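The 85% figure can be reproduced from the step times, assuming each reported step processes one micro-batch per GPU (the article does not state this explicitly). The helper below is illustrative only:

```python
def samples_per_sec_per_gpu(mbs: int, step_s: float) -> float:
    """Per-GPU throughput, assuming one micro-batch of size mbs per step."""
    return mbs / step_s


base = samples_per_sec_per_gpu(mbs=1, step_s=1.2)     # before raising mbs
doubled = samples_per_sec_per_gpu(mbs=2, step_s=1.3)  # mbs = 2, no Recompute
gain_percent = (doubled / base - 1.0) * 100.0

print(f"{base:.2f} -> {doubled:.2f} samples/sec/gpu (+{gain_percent:.0f}%)")
```

Doubling the micro-batch while the step time grows by less than a tenth is what makes the larger batch nearly free here.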

Deep collaboration between the Recompute mechanism and acceleration operators: To address the compatibility limits of native PyTorch under complex parallel strategies, RLinf has made Recompute (activation recomputation), CUDA Graph, and FSDP2 work together stably while remaining decoupled, through in-depth low-level engineering. This turns Recompute into a reliable, quantifiable performance tuning dimension. On hardware with limited GPU memory, the system can trade a small amount of compute time for a large release of GPU memory, which supports larger-scale parallel workloads and significantly improves overall training throughput. In DreamZero 5B training without Recompute, the per-GPU mbs can only be raised to 2, and the best speed is about 1.2s/step, that is, 1.7 samples/sec/gpu. With Recompute, the per-GPU mbs can be raised to 32 at 7.2s/step, that is, 4.4 samples/sec/gpu, a 158% throughput increase on the same hardware. Enabling Recompute thus allows a much larger mbs, which greatly improves operator efficiency.
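For readers unfamiliar with activation recomputation, here is a minimal sketch using PyTorch's torch.utils.checkpoint. The Block class is a toy stand-in rather than a DreamZero layer, and the sketch says nothing about how RLinf combines Recompute with CUDA Graph and FSDP2; it only shows that recomputation changes memory behavior, not gradients.

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint


class Block(nn.Module):
    """Toy residual MLP block; a stand-in, not a DreamZero layer."""

    def __init__(self, dim: int, recompute: bool = False):
        super().__init__()
        self.recompute = recompute
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if self.recompute:
            # Do not keep the MLP's intermediate activations; recompute them
            # during backward. Less memory, slightly more compute.
            return x + checkpoint(self.mlp, x, use_reentrant=False)
        return x + self.mlp(x)


def input_grad(block: Block, x: torch.Tensor) -> torch.Tensor:
    """Gradient of the summed block output with respect to its input."""
    x = x.clone().requires_grad_(True)
    block(x).sum().backward()
    return x.grad


torch.manual_seed(0)
block = Block(dim=32)
x = torch.randn(8, 32)

grad_plain = input_grad(block, x)
block.recompute = True
grad_recompute = input_grad(block, x)

print(torch.allclose(grad_plain, grad_recompute))
```

The memory freed this way is what lets the micro-batch size grow, which is where the throughput gain reported above comes from.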

Through this global tuning of FSDP2, mbs, and Recompute, DreamZero 5B training performance improves by a further 266% over the result of the first, operator-level optimization (that is, 1.2 samples/sec/gpu), reaching 4.4 samples/sec/gpu.

Breaking the I/O Throughput Bottleneck: Efficient Video Data Processing Pipeline

As the two optimizations above sharply raise computational density, data loading efficiency gradually becomes the new bottleneck for overall training throughput.

In DreamZero training practice, decoding and preprocessing video data consume a great deal of CPU resources.

Traditional solutions (such as PyAV) struggle to deliver the decoding performance that high-throughput training requires. Simply raising the dataset's num_workers to trade quantity for speed treats the symptoms rather than the root cause: too many data-reading processes compete heavily for CPU resources, delaying kernel launches from the main training thread and slowing the GPU's execution pace.

To find the best balance between decoding speed and system resource overhead, the RLinf team benchmarked the mainstream video processing libraries in depth:

Although Decord has a slight edge in raw decoding speed, Torchcodec shows more stable CPU usage while delivering comparable performance.

This leaves enough CPU headroom for the main training thread and allows more num_workers to process data concurrently.

Compared with the native PyAV scheme, decoding a single video is nearly 400ms faster. In DreamZero's multi-view training scenario (left, right, and wrist camera videos), this saves 1.2s of video decoding time in total.
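To put the 1.2s figure in context, the sketch below scales it to a whole training step. It assumes the saving applies to every sample and borrows the 5B Recompute operating point quoted earlier (mbs = 32 at 7.2s/step); the "worker equivalents" number is our own back-of-the-envelope measure, not a figure reported by the RLinf team.

```python
VIEWS = ("left", "right", "wrist")  # camera views named in the article
SAVED_MS_PER_VIDEO = 400            # "nearly 400ms" per decoded video

# Decode time saved for one training sample (one video per view).
saved_ms_per_sample = len(VIEWS) * SAVED_MS_PER_VIDEO

# Assumed operating point: 5B model with Recompute, per GPU.
MBS = 32
STEP_MS = 7200

# CPU decode time saved per step, and how many fully busy data workers
# that corresponds to over the duration of one step.
saved_ms_per_step = saved_ms_per_sample * MBS
worker_equivalents = saved_ms_per_step / STEP_MS

print(saved_ms_per_sample, saved_ms_per_step, round(worker_equivalents, 2))
```

Under these assumptions, the faster decoder frees roughly five workers' worth of CPU time per GPU, which is the headroom the main training thread needs.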

This performance improvement at the I/O end provides sufficient data "ammunition" for further squeezing the GPU computing potential.

02 Performance Test

End-to-End Leap from "Runnable" to "Extremely Efficient"

To verify the combined effect of these optimizations, we ran strict end-to-end tests on DreamZero models of different scales on the Droid dataset (each sample contains left, right, and wrist views, and each video is 33 frames × 480 × 640).

DreamZero-14B: Throughput Leap with a Large Number of Parameters

For the 14B model, the heavy GPU memory pressure usually forces the official baseline to adopt the DeepSpeed ZeRO-offload scheme, which wastes computation and communication and adds CPU swap-in/swap-out overhead.

For the 14B model, RLinf achieves a 2.7x speedup over the native DeepSpeed scheme; even compared with unoptimized FSDP2, throughput improves by a further 35%.

DreamZero-5B: Extreme Squeezing of Computing Power Density

For the medium-scale 5B model, RLinf's advantage lies in its ability to raise the Microbatch Size (mbs) stably through efficient recomputation logic and, combined with other computational graph tuning, fully release GPU compute.

With RLinf tuning, training throughput rises from 1.1 samples/sec/gpu with the official code to 4.44 samples/sec/gpu, about 4 times the official figure. Compared with the heavily constrained FSDP2 Base, the gain is 5.84 times.
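These ratios can be checked, and tied back to the headline claim, with a few lines of arithmetic. Two assumptions are ours: that "5.84 times" is a plain throughput ratio, and that the 5B speedup carries over to the 25-day baseline run, which the article does not state outright.

```python
OFFICIAL = 1.1             # samples/sec/gpu, official code
RLINF = 4.44               # samples/sec/gpu, RLinf
GAIN_VS_FSDP2_BASE = 5.84  # reported gain over FSDP2 Base
BASELINE_DAYS = 25         # reported full training time on 8 H100s

speedup_vs_official = RLINF / OFFICIAL
implied_fsdp2_base = RLINF / GAIN_VS_FSDP2_BASE  # samples/sec/gpu
projected_days = BASELINE_DAYS / speedup_vs_official

print(f"speedup vs official: {speedup_vs_official:.2f}x")
print(f"implied FSDP2 Base:  {implied_fsdp2_base:.2f} samples/sec/gpu")
print(f"projected training:  {projected_days:.1f} days")
```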

The step time and throughput of both the 14B and 5B models were measured on 8×H100. The 14B model was tested with MBS = 1 and GBS = 8: because its intermediate dimension is large, mbs = 1 already achieves good operator efficiency and hides the FSDP2 communication overhead. For the 5B model, we used GBS = 256. The FSDP2 Base version cannot raise MBS because of some PyTorch bugs, which limits its throughput: with a small MBS, operator efficiency is low, CPU overhead is significant, and FSDP2 communication cannot be hidden. We have solved these problems and achieved a large throughput increase.
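The relationship between GBS, MBS, and GPU count in these settings can be sketched as follows. The helper assumes plain data parallelism across all 8 GPUs, and the mbs = 32 value for the 5B run is taken from the Recompute discussion earlier rather than stated for this test.

```python
def grad_accum_steps(gbs: int, mbs: int, num_gpus: int) -> int:
    """Micro-batches each GPU accumulates to reach the global batch size."""
    samples_per_micro_step = mbs * num_gpus
    if gbs % samples_per_micro_step != 0:
        raise ValueError("GBS must be a multiple of mbs * num_gpus")
    return gbs // samples_per_micro_step


accum_14b = grad_accum_steps(gbs=8, mbs=1, num_gpus=8)           # 14B setting
accum_5b_rlinf = grad_accum_steps(gbs=256, mbs=32, num_gpus=8)   # 5B, mbs = 32
accum_5b_small = grad_accum_steps(gbs=256, mbs=1, num_gpus=8)    # 5B stuck at mbs = 1

print(accum_14b, accum_5b_rlinf, accum_5b_small)
```

A configuration limited to mbs = 1 needs many small micro-steps per optimizer update, which is the regime where low operator efficiency and exposed communication hurt most.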

Training Convergence Effect Test: Pursuing Speed and Ensuring Accuracy

Beyond extreme performance optimization, ensuring training correctness and stable convergence is the cornerstone of putting a framework into practice.

We conducted a strict convergence verification on the RLinf version of DreamZero.

The following figure compares the Loss curves of the DreamZero 5B model on the LIBERO dataset (configuration: LR = 1e-5, Global Batch Size = 256, 8 H100s, 38 hours of training).

Loss Curve Comparison Analysis: The orange line (RLinf) and the blue line (official Baseline) show a consistent convergence trend. Notably, the Loss of the official code fluctuates more sharply during training because it reads data in units of Episodes, whereas RLinf, through low-level restructuring, samples randomly at Step granularity within Episodes, which smooths training noise and stabilizes gradient updates.
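The difference between the two reading orders can be illustrated with a small sketch. This is one possible interpretation of step-granularity sampling, a shuffle over all (episode, step) pairs; the function names are hypothetical and RLinf's actual sampler may differ.

```python
import random


def episode_order(episode_lengths):
    """Episode-unit reading: all steps of one episode are consumed back to back."""
    return [(ep, t) for ep, n in enumerate(episode_lengths) for t in range(n)]


def step_shuffled_order(episode_lengths, seed=0):
    """Step-granularity sampling: shuffle every (episode, step) pair."""
    order = episode_order(episode_lengths)
    random.Random(seed).shuffle(order)
    return order


def episodes_per_batch(order, batch_size):
    """Number of distinct episodes contributing to each batch."""
    return [
        len({ep for ep, _ in order[i:i + batch_size]})
        for i in range(0, len(order), batch_size)
    ]


lengths = [4, 4, 4]  # three toy episodes of four steps each
sequential = episode_order(lengths)
shuffled = step_shuffled_order(lengths)

print(episodes_per_batch(sequential, batch_size=4))
print(episodes_per_batch(shuffled, batch_size=4))
```

With episode-unit reading, every batch here comes from a single episode, so consecutive batches are highly correlated; shuffling at step level mixes episodes within a batch, which is consistent with the smoother loss curve described above.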
