Generate a video in 2 seconds on a single card: Tsinghua University and Shengshu open-source TurboDiffusion. Video generation's DeepSeek moment has arrived.
Now, generating a video is faster than watching one.
Thanks to a new open-source framework, video generation can be accelerated by over 200 times while maintaining quality!
Moreover, it can even run on a single graphics card. See for yourself:
Previously, generating a 5-second 480P video with a 1.3B model on a single RTX 5090 took about 184 seconds.
Now it takes only 1.9 seconds, a roughly 97-fold speedup!
Behind this is the new open-source video generation acceleration framework, TurboDiffusion, jointly developed by the TSAIL Lab of Tsinghua University and Shengshu Technology.
As soon as the framework was released, excited netizens exclaimed:
We have entered an era where more videos are generated than watched.
Even Meta researchers and Berkeley professors have shown their support:
Generate a video in 2 seconds
Video generation has long been impressive, but its slowness has always been a major pain point.
Generating just a few seconds of high-quality video often meant running a model on high-end, high-memory graphics cards for minutes to tens of minutes. That latency severely constrained creators' inspiration and ruled out real-time interaction.
TurboDiffusion was built to solve exactly this problem.
Let's look at the numbers.
On a single RTX 5090, with a 1.3B text-to-video model:
- Original model: about 184 seconds (over 3 minutes) to generate a 480P video.
- TurboDiffusion: just 1.9 seconds.
That works out to a roughly 97x speedup over the original model!
The gains carry over to larger models: a 14B image-to-video model at 720P finishes in just 38 seconds:
A 720P text-to-video model takes only 24 seconds:
And a 14B 480P image-to-video model takes 9.9 seconds:
More importantly, this acceleration is almost lossless.
On Shengshu Technology's in-house Vidu model, videos generated with TurboDiffusion still maintain a very high level of motion fluency, lighting and texture quality, and instruction following.
For a high-quality 1080P, 8-second video, TurboDiffusion cuts end-to-end generation latency from 900 s to 8 s compared with generation without any inference-acceleration optimization.
The acceleration effects of TurboDiffusion for different sizes and resolutions can be summarized as follows:
TurboDiffusion is also easy to use: it provides out-of-the-box acceleration recipes for today's mainstream video generation models.
The TurboDiffusion project on GitHub provides detailed usage instructions:
So how is this speed achieved?
Compress time to the limit in four steps
Video generation models (usually built on the Diffusion Transformer architecture) are slow for three main reasons: many sampling steps (denoising iterations), heavy compute (attention), and limited memory bandwidth (weight transfer).
To address this, the TurboDiffusion team has integrated four key technologies, each precisely targeting the performance bottlenecks of diffusion model inference.
First is SageAttention.
The attention mechanism is one of the most time-consuming parts of a diffusion model. Conventional implementations run it in FP16 (half-precision floating point), which is compute-heavy and memory-hungry.
TurboDiffusion introduces the team's own SageAttention2++, a low-bit quantized attention scheme.
It compresses weights and activations down to INT8 or even INT4, while outlier smoothing and thread-level quantization keep precision from collapsing.
The result: attention runs 3-5x faster, memory usage is halved, and visual quality is almost unchanged.
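To make the idea concrete, here is a minimal PyTorch sketch of low-bit attention in the spirit of this approach (not the actual SageAttention kernels, which are hand-written CUDA): Q and K are quantized to INT8 per token, K is smoothed by removing its per-channel mean, the score matmul runs in integer arithmetic, and the result is dequantized before the softmax. All function names are illustrative.

```python
import torch

def int8_quantize(t, axis=-1):
    """Symmetric per-token INT8 quantization: returns the int8 tensor and its scale."""
    scale = t.abs().amax(dim=axis, keepdim=True).clamp(min=1e-6) / 127.0
    q = (t / scale).round().clamp(-127, 127).to(torch.int8)
    return q, scale

def quantized_attention(q, k, v):
    """Illustrative low-bit attention: QK^T in INT8, softmax and PV in floating point.
    q, k, v: (seq, head_dim) tensors. Hypothetical helper, not the SageAttention API."""
    # Smooth K by removing its per-channel mean: the removed term shifts every
    # logit in a row by the same amount, so the softmax is unchanged while
    # outliers (and thus quantization error) shrink.
    k = k - k.mean(dim=0, keepdim=True)
    q_i8, q_scale = int8_quantize(q)   # per-token scale for Q
    k_i8, k_scale = int8_quantize(k)   # per-token scale for K
    # Integer matmul, emulated in INT32 here; real kernels use INT8 Tensor Cores.
    scores = (q_i8.to(torch.int32) @ k_i8.to(torch.int32).T).float()
    scores = scores * (q_scale * k_scale.T) / q.shape[-1] ** 0.5  # dequantize + scale
    probs = torch.softmax(scores, dim=-1)
    return probs @ v.float()
```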
Second is Sparse-Linear Attention (SLA).
Where SageAttention speeds up each individual computation, SLA reduces the workload at the algorithmic level.
SLA combines sparsity (attending only to the important tokens) with linear complexity (keeping compute from blowing up quadratically as resolution and sequence length grow).
Crucially, sparse computation and low-bit acceleration are orthogonal, so SLA can be stacked directly on top of SageAttention to squeeze out a further several-fold speedup during inference.
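As a rough illustration of how the two ingredients can coexist, the sketch below routes a small fraction of "important" key blocks through exact attention and handles everything else with a cheap linear-attention approximation. This is only a schematic of the block-sparse-plus-linear idea, not the SLA implementation; the block size, scoring rule, and kernel feature map are all assumptions.

```python
import torch

def sla_attention_sketch(q, k, v, block=64, keep_ratio=0.1):
    """Schematic block-sparse + linear attention mix. q, k, v: (seq, d),
    with seq divisible by `block`. Hypothetical layout, not the SLA kernels."""
    seq, d = q.shape
    nb = seq // block
    # Pool queries/keys per block and rank block pairs by pooled similarity.
    qb = q.reshape(nb, block, d).mean(dim=1)
    kb = k.reshape(nb, block, d).mean(dim=1)
    k_keep = max(1, int(keep_ratio * nb))
    keep = (qb @ kb.T).topk(k_keep, dim=-1).indices      # (nb, k_keep) important key blocks

    # Linear-attention fallback for everything: O(seq * d^2) instead of O(seq^2 * d).
    phi_q, phi_k = torch.relu(q) + 1e-6, torch.relu(k) + 1e-6
    out = (phi_q @ (phi_k.T @ v)) / (phi_q @ phi_k.sum(dim=0, keepdim=True).T)

    # Overwrite the selected query blocks with exact attention over their top key blocks.
    for i in range(nb):
        rows = slice(i * block, (i + 1) * block)
        cols = torch.cat([torch.arange(j * block, (j + 1) * block) for j in keep[i].tolist()])
        attn = torch.softmax(q[rows] @ k[cols].T / d ** 0.5, dim=-1)
        out[rows] = attn @ v[cols]
    return out
```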
The third technique is rCM step distillation.
Traditional diffusion models need to go through dozens or even hundreds of iterations to denoise and generate images.
TurboDiffusion adopts rCM (Score-regularized Continuous-time Consistency Models) for step distillation.
rCM is among the most advanced distillation schemes available: a video that previously required dozens of sampling steps can now reach nearly the same quality in just 1-4 steps.
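The sampling side of a distilled model is strikingly simple. Below is a schematic few-step consistency-style sampler, assuming a trained model `cm(x, sigma)` that maps a noisy latent directly to a clean estimate; it illustrates generic multi-step consistency sampling rather than rCM's specific training recipe, and the noise schedule is made up for the example.

```python
import torch

@torch.no_grad()
def few_step_sample(cm, shape, sigmas=(80.0, 24.0, 5.0, 0.5), device="cuda"):
    """Generic multi-step consistency sampling: 1-4 network evaluations in total.
    `cm(x, sigma)` is assumed to return a denoised estimate of x (hypothetical interface)."""
    x = torch.randn(shape, device=device) * sigmas[0]      # start from pure noise
    for i, sigma in enumerate(sigmas):
        x0 = cm(x, sigma)                                  # one-shot denoise at this noise level
        if i + 1 < len(sigmas):
            # Re-noise the clean estimate down to the next, smaller noise level.
            x = x0 + sigmas[i + 1] * torch.randn_like(x0)
        else:
            x = x0
    return x
```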
Finally, there is W8A8 quantization + custom operators.
Beyond attention, the model's linear layers also account for a large share of the compute. TurboDiffusion applies W8A8 quantization (8-bit weights and 8-bit activations) to them, processing in 128×128 blocks to make full use of the RTX 5090's INT8 Tensor Cores.
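In plain PyTorch, the block-wise W8A8 bookkeeping looks roughly like the sketch below: the weight matrix is cut into 128×128 tiles, each with its own INT8 scale, activations are quantized per 128-column block, and the matmul is accumulated tile by tile and dequantized. This emulation only illustrates the idea; the real kernels run the INT8 matmuls on Tensor Cores, and all names here are hypothetical.

```python
import torch

def quantize_tiles(w, block=128):
    """Per-(block x block)-tile symmetric INT8 quantization of a 2-D weight matrix."""
    rows, cols = w.shape
    w_q = torch.empty(rows, cols, dtype=torch.int8)
    scales = torch.empty(rows // block, cols // block)
    for i in range(0, rows, block):
        for j in range(0, cols, block):
            tile = w[i:i + block, j:j + block]
            s = tile.abs().max().clamp(min=1e-6) / 127.0
            w_q[i:i + block, j:j + block] = (tile / s).round().clamp(-127, 127).to(torch.int8)
            scales[i // block, j // block] = s
    return w_q, scales

def w8a8_linear(x, w, block=128):
    """Emulated block-wise W8A8 linear: y = x @ w.T with 128x128 weight tiles and
    per-(token, k-block) activation scales. Dimensions assumed divisible by `block`."""
    tokens, k_dim = x.shape
    out_dim = w.shape[0]
    w_q, w_s = quantize_tiles(w, block)                       # w: (out_dim, k_dim)
    y = torch.zeros(tokens, out_dim)
    for kb in range(0, k_dim, block):
        x_blk = x[:, kb:kb + block]
        x_s = x_blk.abs().amax(dim=1, keepdim=True).clamp(min=1e-6) / 127.0
        x_q = (x_blk / x_s).round().clamp(-127, 127).to(torch.int8)
        for ob in range(0, out_dim, block):
            tile = w_q[ob:ob + block, kb:kb + block]
            acc = x_q.to(torch.int32) @ tile.to(torch.int32).T   # INT32 accumulation
            y[:, ob:ob + block] += acc.float() * x_s * w_s[ob // block, kb // block]
    return y
```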
Moreover, the team has rewritten basic operators such as LayerNorm and RMSNorm using Triton/CUDA to eliminate the overhead of the default PyTorch implementation.
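For a sense of what such a fused operator looks like, here is a minimal Triton RMSNorm kernel with one program per row. It is a generic sketch, not TurboDiffusion's actual operator, but it shows how the square, mean, rsqrt, and scale steps collapse into a single kernel launch instead of several separate PyTorch ops.

```python
import torch
import triton
import triton.language as tl

@triton.jit
def rmsnorm_kernel(x_ptr, w_ptr, out_ptr, n_cols, eps, BLOCK: tl.constexpr):
    # One program instance normalizes one row of the input.
    row = tl.program_id(0)
    cols = tl.arange(0, BLOCK)
    mask = cols < n_cols
    x = tl.load(x_ptr + row * n_cols + cols, mask=mask, other=0.0).to(tl.float32)
    rms = tl.sqrt(tl.sum(x * x, axis=0) / n_cols + eps)
    w = tl.load(w_ptr + cols, mask=mask, other=0.0).to(tl.float32)
    tl.store(out_ptr + row * n_cols + cols, x / rms * w, mask=mask)

def rmsnorm(x, weight, eps=1e-6):
    """x: (rows, n_cols) CUDA tensor; returns the RMS-normalized, scaled output."""
    x = x.contiguous()
    out = torch.empty_like(x, dtype=torch.float32)
    BLOCK = triton.next_power_of_2(x.shape[1])
    rmsnorm_kernel[(x.shape[0],)](x, weight, out, x.shape[1], eps, BLOCK=BLOCK)
    return out
```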
These four technologies mesh tightly: distillation cuts the number of steps, quantization lightens each step, and SLA and SageAttention cut the attention compute. Together they deliver the remarkable 200-fold acceleration.
All four core technologies were developed in-house by Tsinghua's TSAIL team in collaboration with Shengshu Technology. Their significance goes far beyond better benchmark numbers: they bridge the last mile from research to real-world deployment of video generation models:
- Consumer-grade deployment becomes possible: on a single RTX 5090, 720P video generation drops from hundreds of seconds to tens of seconds, delivering genuinely second-scale output and putting usable tools in the hands of individual creators and small and medium-sized businesses.
- Cloud inference costs drop sharply: a 100-fold reduction in inference latency means the same compute can serve 100 times as many users, significantly cutting operating costs for cloud providers and SaaS platforms.
- AIGC product innovation accelerates: new scenarios such as real-time video editing, interactive video generation, and automated production of AI short dramas become feasible, giving rise to new product forms.
- It is friendly to domestic chips: low-bit computation, sparsity, and custom operators map naturally onto the architectures of Chinese AI chips, helping make China's AI infrastructure independently controllable.
Among them, SageAttention is the world's first scheme to accelerate attention computation via quantization, and it has already been widely deployed in industry.
For example, SageAttention has been integrated into NVIDIA's TensorRT inference engine and deployed on mainstream GPU platforms such as Huawei Ascend and Moore Threads S6000.
Leading teams at home and abroad, including Tencent Hunyuan, ByteDance Doubao, Alibaba Tora, Shengshu Vidu, Zhipu Qingying, Baidu PaddlePaddle, Kunlun Wanwei, Google Veo3, SenseTime, and vLLM, have also applied the technology in their core products, where its performance has generated considerable economic value.
Video generation is getting closer to real time
From an hour to 2 seconds: TurboDiffusion is not just a technical breakthrough but a paradigm shift.
It proves that high-quality AI video need not come at the cost of efficiency. Once generation speed falls within the range of human reaction time (under 5 seconds), AI stops being merely a post-production tool and becomes a creative partner: when you speak, it moves; when you sketch, it tells a story.
Perhaps that is the real meaning of the real-time generation era: the lag between idea and output disappears, and imagination becomes the only limit.
Now, we are only 2 seconds away from that era.
TurboDiffusion project address: https://github.com/thu-ml/TurboDiffusion?tab=readme-ov-file
Paper address: https://arxiv.org/pdf/2512.16093
This article is from the WeChat official account “QbitAI”, author: Jin Lei. Republished by 36Kr with permission.