Overnight, the myth of Sora has been shattered: a single H200 GPU can generate a video in 5 seconds. An open-source AI project from an all-Chinese team has set the video-generation community on fire.
A single H200 can generate a 5-second video in just 5 seconds.
Recently, UCSD, UC Berkeley, and MBZUAI jointly introduced the FastWan series of video generation models.
Paper link: https://arxiv.org/pdf/2505.13389
At its core is a brand-new training recipe called "Sparse Distillation," which enables efficient generation and speeds up video denoising by 70x.
Based on the FastVideo architecture, FastWan2.1-1.3B completes denoising in just 1 second on a single H200 and generates a 480P 5-second video within 5 seconds.
On a single RTX 4090, generating a video takes 21 seconds, with a denoising time of 2.8 seconds if only the DiT processing time is counted.
The upgraded FastWan2.2-5B can generate a 720P 5-second video in just 16 seconds on a single H200.
The weights, training recipe, and dataset of the FastWan models are all open-sourced.
Now, real-time AI video generation has finally been achieved.
Sparse Distillation: AI Video Generation Enters Ultra-Fast Mode
What exactly is this "Sparse Distillation" that lets the model generate videos so quickly?
Video diffusion models have long been the mainstream approach to AI video generation; Sora, for example, combines a diffusion model with a Transformer architecture.
Although these models are powerful, they have long been troubled by two major bottlenecks:
1. Generating a video requires a large number of denoising steps.
2. Attention has quadratic computational cost on long sequences, which is unavoidable for high-resolution video.
Take Wan2.1-14B as an example: the model runs 50 diffusion steps, and generating a 5-second 720P video requires processing more than 80,000 tokens, with attention operations alone consuming over 85% of inference time.
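To see why attention dominates at this scale, here is a rough back-of-the-envelope comparison of quadratic attention FLOPs against the linear-in-length projection and MLP FLOPs. The model width and expansion factor are assumed, illustrative values, not figures from the paper.

```python
# Rough, illustrative estimate of why attention dominates for long video
# sequences. Model dimensions are assumptions chosen for illustration.

seq_len = 80_000                 # tokens in a 5-second 720P video (from the article)
hidden = 5_120                   # assumed model width

# Self-attention per layer scales quadratically in sequence length:
# QK^T and the attention-weighted sum over V each cost ~2 * L^2 * hidden FLOPs.
attn_flops = 4 * seq_len**2 * hidden

# Q/K/V/output projections and the MLP scale only linearly in sequence length.
linear_flops = 8 * seq_len * hidden**2      # four projections
linear_flops += 16 * seq_len * hidden**2    # MLP with assumed 4x expansion

share = attn_flops / (attn_flops + linear_flops)
print(f"attention share of per-layer FLOPs: {share:.0%}")   # ~72% at these sizes
```

Even under these crude assumptions, attention accounts for the large majority of per-layer compute, consistent with the over-85% share reported for the real model.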
This is where "Sparse Distillation" comes in.
As FastWan's core innovation, it is the first to jointly train sparse attention and denoising-step distillation in a single unified framework.
At its heart is a fundamental question: when diffusion is compressed aggressively, say from 50 steps down to 3, can the speedup from sparse attention still be retained?
Previous work suggested it could not, but the new paper rewrites the answer with Video Sparse Attention (VSA).
Why does traditional sparse attention fail in distillation?
Existing methods such as STA and SVG rely on the redundancy of multi-step denoising to prune the attention map, typically sparsifying only the later denoising steps.
However, when distillation compresses 50 steps down to 1-4, the redundancy they rely on disappears entirely.
Experiments confirm that traditional schemes degrade sharply at fewer than 10 steps. And while sparse attention by itself brings only about a 3x speedup, distillation can deliver a gain of more than 20x.
To make sparse attention truly valuable in production, it must be compatible with distillation training.
Video Sparse Attention (VSA) is a dynamic sparse attention algorithm that autonomously identifies the key tokens in a sequence.
Unlike schemes that rely on heuristic rules, VSA can directly replace FlashAttention during training, learning the optimal sparse pattern in a data-driven way while preserving generation quality as much as possible.
During step distillation, as the student model learns to denoise in fewer steps, VSA does not depend on multi-step redundancy to prune the attention map; instead, it dynamically adapts to the new sparsity pattern.
This makes VSA the first sparse attention mechanism fully compatible with distillation training. Moreover, the team even trains VSA and the distillation objective simultaneously.
To the team's knowledge, this is a major breakthrough in the field of sparse attention.
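To make the two-stage idea concrete, here is a minimal, illustrative sketch of block-sparse attention in the spirit of VSA: a coarse stage scores pooled blocks and keeps the top-k key blocks per query block, then a fine stage computes exact attention only inside the kept blocks. The function name, block size, and mean-pooling choice are assumptions for illustration, not the paper's optimized kernel.

```python
import torch
import torch.nn.functional as F

def block_sparse_attention(q, k, v, block=64, keep_blocks=8):
    """Sketch of two-stage block-sparse attention (illustrative, single head).
    q, k, v: (seq_len, dim); seq_len is assumed divisible by `block`."""
    L, d = q.shape
    nb = L // block
    qb = q.view(nb, block, d)
    kb = k.view(nb, block, d)
    vb = v.view(nb, block, d)

    # Coarse stage: mean-pool each block and score query blocks vs key blocks.
    q_c = qb.mean(dim=1)                                  # (nb, d)
    k_c = kb.mean(dim=1)                                  # (nb, d)
    coarse = q_c @ k_c.T / d ** 0.5                       # (nb, nb)
    topk = coarse.topk(keep_blocks, dim=-1).indices       # kept key blocks per query block

    out = torch.empty_like(q).view(nb, block, d)
    for i in range(nb):
        # Fine stage: exact attention restricted to the selected key blocks.
        ks = kb[topk[i]].reshape(-1, d)
        vs = vb[topk[i]].reshape(-1, d)
        attn = F.softmax(qb[i] @ ks.T / d ** 0.5, dim=-1)
        out[i] = attn @ vs
    return out.view(L, d)
```

With, say, 64 blocks in total and keep_blocks=8, each query block attends to only 12.5% of key blocks, i.e. 87.5% sparsity, the level reported later for FastWan's scaling experiments.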
Three Components, Fully Compatible
Building on Video Sparse Attention (VSA), the team proposed the sparse distillation method.
It is a post-training technique that combines sparse attention training with step distillation.
Its core idea is to have a "few-step + sparse-attention" student model learn to match the output distribution of a "full-step + dense-computation" teacher model.
As shown in the figure below, the overall framework comprises three key components:
Sparse student network (driven by VSA, trainable)
Real score network (frozen, full attention)
Pseudo-score network (trainable, full attention)
These three components are all initialized based on the Wan2.1 model.
During training, the sparse-distilled student network takes a noisy video as input and performs single-step denoising with VSA to produce an output.
This output is re-noised and fed into the two full-attention score networks, each of which performs a full-attention denoising pass.
The difference between the two branches' outputs forms the distribution-matching gradient, which is backpropagated to optimize the student network; meanwhile, the pseudo-score network is updated with a diffusion loss on the student's output.
The ingenuity of this architecture is that the student model uses VSA for computational efficiency, while the two score networks keep full attention to provide high-fidelity training supervision. This design decouples runtime acceleration (the student model) from distillation quality (the score networks), making sparse attention compatible with aggressive step reduction.
More broadly, since sparse attention acts only on the student model, the scheme can be combined with various distillation methods, including consistency distillation, progressive distillation, or GAN-based distillation losses.
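To make the data flow concrete, below is a minimal, schematic sketch of one sparse-distillation training step in the style of distribution-matching distillation (DMD). All names (student, real_score, fake_score, sched and its add_noise/sample_timestep helpers) are hypothetical placeholders, and the losses are simplified stand-ins for the paper's actual objectives.

```python
import torch
import torch.nn.functional as F

def sparse_distillation_step(student, real_score, fake_score,
                             opt_student, opt_fake, latent, sched):
    """Schematic DMD-style sparse distillation step (placeholder APIs)."""
    # 1) The VSA student denoises a noisy latent in a single step.
    t = sched.sample_timestep()
    noisy = sched.add_noise(latent, torch.randn_like(latent), t)
    student_out = student(noisy, t)                  # sparse attention inside

    # 2) Re-noise the student's output at a freshly sampled timestep.
    t2 = sched.sample_timestep()
    eps2 = torch.randn_like(student_out)
    renoised = sched.add_noise(student_out, eps2, t2)

    # 3) Both full-attention score networks evaluate the re-noised sample.
    with torch.no_grad():
        real_pred = real_score(renoised, t2)         # frozen "real" score
        fake_pred = fake_score(renoised, t2)         # trainable pseudo-score

    # 4) Distribution-matching update: the gap between the two scores acts as
    #    the gradient direction for the student's output.
    dm_grad = fake_pred - real_pred
    loss_student = F.mse_loss(student_out, (student_out - dm_grad).detach())
    opt_student.zero_grad()
    loss_student.backward()
    opt_student.step()

    # 5) The pseudo-score network is trained with an ordinary denoising loss on
    #    the (detached) re-noised student output so it tracks that distribution.
    loss_fake = F.mse_loss(fake_score(renoised.detach(), t2), eps2)
    opt_fake.zero_grad()
    loss_fake.backward()
    opt_fake.step()
```

The key property the sketch preserves is that only the student runs sparse attention; both score networks stay dense, which is what keeps the training signal high-fidelity.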
So, how does FastWan achieve distillation?
High-quality data is crucial for any training scheme, and especially for diffusion models, so the researchers used the high-quality Wan models themselves to generate a synthetic dataset.
Specifically, 600,000 480P videos and 250,000 720P videos were generated with Wan2.1-T2V-14B, and 32,000 videos were generated with Wan2.2-TI2V-5B.
When running sparse distillation with DMD, three 14-billion-parameter models must be loaded into GPU memory at the same time:
· Student model
· Trainable pseudo-score model
· Frozen real-score model
Two of these models (the student and the pseudo-score model) are trained continuously, which requires storing optimizer states and retaining gradients. Combined with the long sequence lengths, memory efficiency becomes a key challenge.
Their key solutions (see the sketch after this list) were:
1. Shard the parameters of all three models across GPUs with FSDP2, significantly reducing memory overhead.
2. Apply activation checkpointing to relieve the high activation memory caused by long sequences.
3. Finely control when gradients are computed at each stage of distillation (e.g., when updating the student model versus the pseudo-score model).
4. Use gradient accumulation to increase the effective batch size under limited GPU memory.
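A minimal sketch of this memory recipe follows. It assumes a recent PyTorch release that exposes the FSDP2 fully_shard API; the model attributes, data loader, and loss function are placeholder names, not the FastVideo implementation.

```python
import torch
from torch.distributed.fsdp import fully_shard        # FSDP2 per-module API (recent PyTorch)
from torch.utils.checkpoint import checkpoint

def prepare_models(student, fake_score, real_score):
    """(1) Shard all three large models across GPUs; (3) freeze the teacher."""
    for model in (student, fake_score, real_score):
        for block in model.transformer_blocks:          # placeholder attribute name
            fully_shard(block)                          # shard each transformer block
        fully_shard(model)                              # shard the remaining parameters
    real_score.requires_grad_(False)                    # no gradients for the frozen real score

def checkpointed_forward(model, x, t):
    """(2) Recompute activations in the backward pass instead of storing them."""
    return checkpoint(model, x, t, use_reentrant=False)

def train(student, opt_student, loader, loss_fn, accum_steps=4):
    """(4) Gradient accumulation: average the loss over several micro-batches."""
    for step, batch in enumerate(loader):
        loss = loss_fn(student, batch) / accum_steps
        loss.backward()
        if (step + 1) % accum_steps == 0:
            opt_student.step()
            opt_student.zero_grad()
```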
Sparse distillation of Wan2.1-T2V-1.3B runs for 4,000 steps on 64 H200 GPUs, consuming a total of 768 GPU hours.
Generate Videos in Seconds on a Single GPU
In the scaling experiments, the research team pre-trained a 410-million-parameter video DiT with a latent-space shape of (16, 32, 32).
While maintaining 87.5% sparsity, VSA achieves almost the same loss as full attention.
At the same time, it reduces attention FLOPs by 8x and end-to-end training FLOPs by 2.53x.
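As a quick sanity check on these numbers: skipping 87.5% of the attention work directly implies an 8x reduction in attention FLOPs, and an Amdahl-style calculation with an assumed attention share of roughly 69% of training FLOPs (an inferred value, not one reported in the paper) reproduces the 2.53x end-to-end figure.

```python
# Illustrative arithmetic only; the attention share below is assumed, not reported.
sparsity = 0.875
attn_speedup = 1 / (1 - sparsity)               # 87.5% of work skipped -> 8x
f = 0.691                                       # assumed fraction of FLOPs spent in attention
end_to_end = 1 / ((1 - f) + f / attn_speedup)   # Amdahl's law
print(attn_speedup, round(end_to_end, 2))       # 8.0 2.53
```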
Scaling from 60 million to 1.4 billion parameters further confirms that VSA consistently achieves a better Pareto frontier than full attention.
To evaluate VSA's practical effect, the team fine-tuned Wan-1.3B on synthetic data in a (16×28×52) video latent space generated by Wan-14B.
As shown in Table 2, the model using VSA even outperforms the original Wan-1.3B on the VBench score.
Under extreme sparsity, VSA outperforms the training-free attention sparsification method SVG even at a higher sparsity level, validating the effectiveness of training with sparse attention.
In practice, the DiT inference time of Wan-1.3B drops from 31 seconds with full attention to 18 seconds with VSA.
VSA's fine-grained block-sparse kernel approaches the theoretical limit on long sequences, achieving nearly a 7x speedup over FlashAttention-3.
Even when accounting for the computational overhead of the coarse-grained stage, VSA still maintains a speedup of more than 6x.
In contrast, FlexAttention with the same block-sparse mask (64×64 blocks) achieves only a 2x speedup.
The results show that applying VSA to the Wan-1.3B and Hunyuan models (Figure 4a) increases inference speed by 2-3x.
As shown in Figure 5 below, the team also observed that the block-sparse attention patterns produced in the coarse-grained stage by the fine-tuned 1.3-billion-parameter model are highly dynamic.
Finally, the team ran a qualitative experiment. The figure below shows that as training progresses, the model gradually adapts to the sparse attention mechanism and eventually recovers the ability to generate coherent videos.