
A single graphics card can generate a video in 2 seconds. Tsinghua University teams up with Shengshu to open-source TurboDiffusion; the "DeepSeek moment" for video has arrived.

QbitAI (量子位) 2025-12-25 20:09
The video quality remains almost unchanged.

You can now generate a video faster than you can watch one.

Thanks to a new open-source framework, video generation can be accelerated by more than 200 times while maintaining the same quality!

And this can even be done with a single graphics card. Here's a preview:

That's right: it originally took about 184 seconds on a single RTX 5090 to generate a 5-second 480P video with a 1.3B model.

Now it only takes 1.9 seconds – a 97-fold acceleration!

Behind this success is TurboDiffusion, a new open-source framework for accelerating video generation, jointly developed by the TSAIL Lab at Tsinghua University and Shengshu Technology.

As soon as the framework was announced, netizens got excited, exclaiming:

We are in an era where more videos are generated than watched.

Even researchers from Meta and professors from the University of California, Berkeley have supported the framework:

Generate a video in just 2 seconds

In the past, video generation was impressive but slow – a persistent problem.

To generate a high-quality video of a few seconds, the model often had to run for minutes to hours on a powerful graphics card with a large amount of memory. This delay significantly limited the creativity of artists and the possibility of real-time interaction.

The goal of TurboDiffusion is to solve this problem.

Let's take a look at some data directly.

On a single RTX 5090, with a 1.3B text-to-video model:

  1. Original generation: Generating a 480P video took about 184 seconds (over 3 minutes).
  2. TurboDiffusion: Only 1.9 seconds.

Compared with the original model, TurboDiffusion achieves a 97-fold acceleration!
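
(A quick check of the arithmetic: 184 s ÷ 1.9 s ≈ 96.8, which rounds to the quoted 97×.)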

For a larger model, such as a 14B image-to-video model at 720P resolution, the gain is even more striking: it takes only 38 seconds:

A 720P text-to-video model likewise takes only 24 seconds:

A 14B-480P image-to-video model takes 9.9 seconds:

The most important thing is that this acceleration is almost lossless.

In Shengshu Technology's own Vidu model, the dynamic fluidity, light-and-shadow quality, and instruction following remain at a very high level after applying TurboDiffusion.

When generating a high-quality 8-second 1080P video, TurboDiffusion cuts the end-to-end generation latency from 900 seconds to 8 seconds, compared with a pipeline without any inference-acceleration optimization.

The acceleration effects of TurboDiffusion for different sizes and resolutions can be summarized as follows:

In addition, TurboDiffusion is very easy to use: it ships optimized, ready-to-use recipes for today's common video generation models.

The TurboDiffusion GitHub project also lays out detailed usage instructions:

Now the question is: How is this speed achieved?

Reduce time to a minimum in four steps

The slowness of video generation models (most of which use a diffusion-transformer architecture) comes mainly from three things: too many sampling steps (the denoising loop), heavy computation (attention), and memory pressure (weight transfer).

To address this, the TurboDiffusion team integrated four key technologies, each targeting one of these inference bottlenecks of diffusion models.

First, SageAttention comes into play.

The attention mechanism is one of the most time-consuming parts of diffusion models. The traditional implementation uses FP16 (half-precision floating-point numbers), which consumes a lot of computing power and memory.

TurboDiffusion introduces the self-developed SageAttention2++, a low-bit quantization method for attention.

It compresses the weights and activations to INT8 or even INT4 and at the same time avoids a loss of accuracy through outlier smoothing and thread-level quantization techniques.

The result is a 3- to 5-fold acceleration of the attention calculation, a halving of the memory consumption, and almost unchanged image quality.
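
To make the idea concrete, here is a minimal PyTorch sketch of low-bit attention in the spirit of SageAttention: Q and K are quantized to INT8 with per-block scales, the score matrix is computed on the quantized values, and the softmax and value multiplication stay in floating point. The function names, block size, and scaling scheme are illustrative assumptions rather than the library's actual API, and details such as outlier smoothing and thread-level quantization are omitted.

```python
import torch

def quantize_int8(x, block=64):
    """Symmetric per-block INT8 quantization along the sequence dimension."""
    B, H, N, D = x.shape
    tiles = x.view(B, H, N // block, block, D)
    scale = tiles.abs().amax(dim=(-2, -1), keepdim=True) / 127.0 + 1e-8
    q = torch.clamp((tiles / scale).round(), -127, 127).to(torch.int8)
    return q.view(B, H, N, D), scale            # scale: (B, H, N//block, 1, 1)

def int8_attention(q, k, v, block=64):
    """Q·K^T on INT8 values with per-block scales; softmax and P·V stay in FP."""
    B, H, N, D = q.shape
    qi, qs = quantize_int8(q, block)
    ki, ks = quantize_int8(k, block)
    # INT8 matmul is emulated via float here; real kernels use INT8 Tensor Cores.
    scores = torch.matmul(qi.float(), ki.float().transpose(-1, -2))
    # Re-apply the per-block scales to recover approximate FP attention scores.
    qs = qs.expand(B, H, N // block, block, 1).reshape(B, H, N, 1)
    ks = ks.expand(B, H, N // block, block, 1).reshape(B, H, N, 1)
    scores = scores * qs * ks.transpose(-1, -2) / D ** 0.5
    probs = torch.softmax(scores, dim=-1)
    return torch.matmul(probs, v)

q, k, v = (torch.randn(1, 8, 128, 64) for _ in range(3))
print(int8_attention(q, k, v).shape)            # torch.Size([1, 8, 128, 64])
```

The float emulation of the INT8 matmul is only for readability; the speedup comes from running that product on INT8 Tensor Cores.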

Next is Sparse-Linear Attention (SLA).

While SageAttention accelerates the individual steps, SLA reduces the load at the algorithmic level.

SLA combines sparsity (only paying attention to important pixel points) and linear complexity (the computing power does not grow explosively with the resolution).

The best part is that sparse computation and low-bit acceleration are orthogonal: SLA can be layered directly on top of SageAttention, yielding a further several-fold speedup during inference.
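
As a conceptual sketch (not the actual SLA kernel), the snippet below routes each query's strongest key positions through exact softmax attention and covers the remaining weak interactions with a cheap kernelized linear-attention branch. The top-k selection, the ELU feature map, and the mixing weight are assumptions for illustration; the real method selects at block level so that the full N×N score matrix is never materialized.

```python
import torch
import torch.nn.functional as F

def sparse_linear_attention(q, k, v, keep_ratio=0.1, linear_weight=0.1):
    B, H, N, D = q.shape
    scores = torch.matmul(q, k.transpose(-1, -2)) / D ** 0.5   # (B, H, N, N)

    # Sparse branch: exact softmax attention over the top-k keys per query.
    topk = max(1, int(N * keep_ratio))
    idx = scores.topk(topk, dim=-1).indices
    mask = torch.zeros_like(scores, dtype=torch.bool).scatter_(-1, idx, True)
    sparse = torch.softmax(scores.masked_fill(~mask, float("-inf")), dim=-1)
    out_sparse = torch.matmul(sparse, v)

    # Linear branch: kernelized attention, O(N·D²) instead of O(N²·D).
    phi = lambda x: F.elu(x) + 1.0
    qf, kf = phi(q), phi(k)
    kv = torch.einsum("bhnd,bhne->bhde", kf, v)
    norm = torch.einsum("bhnd,bhd->bhn", qf, kf.sum(dim=2)).clamp_min(1e-6)
    out_linear = torch.einsum("bhnd,bhde->bhne", qf, kv) / norm.unsqueeze(-1)

    # The exact sparse part carries most of the signal; the cheap linear
    # part fills in the long tail of weak interactions.
    return out_sparse + linear_weight * out_linear

q, k, v = (torch.randn(1, 8, 256, 64) for _ in range(3))
print(sparse_linear_attention(q, k, v).shape)   # torch.Size([1, 8, 256, 64])
```

Because the sparse branch still performs ordinary softmax attention on the selected positions, a quantized kernel such as SageAttention can execute it, which is why the two accelerations stack.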

The third trick is rCM step distillation.

Traditional diffusion models have to go through several dozen or even hundreds of iterations to denoise and generate an image.

TurboDiffusion introduces rCM (Score-regularized Continuous-time Consistency Models) for step distillation.

rCM is one of the most advanced distillation methods. With it, a video that originally required several dozen steps can be generated in only 1 to 4 steps with almost the same quality.
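
The toy sketch below contrasts classic many-step denoising with a few-step, consistency-style sampler of the kind rCM distills a model into. The stand-in denoiser and the simple re-noising schedule are illustrative assumptions, not the actual rCM training or sampling procedure; the point is only that the number of network evaluations drops from dozens to a handful.

```python
import torch

def sample_many_steps(denoiser, shape, steps=50):
    """Classic iterative sampling: `steps` network evaluations per video."""
    x = torch.randn(shape)
    for i in range(steps, 0, -1):
        t = torch.full((shape[0],), i / steps)
        x = x - (1.0 / steps) * denoiser(x, t)    # one small Euler-style step
    return x

def sample_few_steps(denoiser, shape, steps=4):
    """Consistency-style few-step sampling: the distilled student predicts
    clean data directly, then is re-noised to the next (lower) noise level."""
    x = torch.randn(shape)
    for i in range(steps, 0, -1):
        t = torch.full((shape[0],), i / steps)
        x0 = denoiser(x, t)                       # direct clean-data prediction
        if i > 1:
            x = x0 + ((i - 1) / steps) * torch.randn_like(x0)
        else:
            x = x0
    return x

dummy_denoiser = lambda x, t: 0.9 * x             # stand-in network for the demo
latent = sample_few_steps(dummy_denoiser, (1, 16, 4, 32, 32))
print(latent.shape)                               # 4 network calls instead of 50
```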

Last but not least is W8A8 quantization + custom operators.

In addition to attention, the linear layers in the model also account for a large share of the compute. TurboDiffusion applies W8A8 quantization (8-bit weights and 8-bit activations) to them and processes the computation in 128×128 blocks to fully utilize the INT8 Tensor Cores of the RTX 5090.

In addition, the team has rewritten basic operators such as LayerNorm and RMSNorm in Triton/CUDA to eliminate the overhead of the standard implementation in PyTorch.
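
A rough sketch of the blockwise W8A8 idea, under the assumptions spelled out in the comments: the weight matrix gets one INT8 scale per 128×128 tile, activations get a per-token scale, and the output is accumulated tile by tile before the scales are undone. Real kernels run the inner products on INT8 Tensor Cores and fuse the surrounding operators; the float emulation here is only for clarity.

```python
import torch

BLOCK = 128

def quantize_weight_blockwise(w, block=BLOCK):
    """Symmetric INT8 quantization with one scale per block x block tile."""
    out_f, in_f = w.shape
    tiles = w.view(out_f // block, block, in_f // block, block)
    scale = tiles.abs().amax(dim=(1, 3), keepdim=True) / 127.0 + 1e-8
    q = torch.clamp((tiles / scale).round(), -127, 127).to(torch.int8)
    return q.view(out_f, in_f), scale.view(out_f // block, in_f // block)

def w8a8_linear(x, w_q, w_scale, block=BLOCK):
    """Blockwise W8A8 matmul, emulated in float for readability."""
    x_scale = x.abs().amax(dim=-1, keepdim=True) / 127.0 + 1e-8   # per-token scale
    x_q = torch.clamp((x / x_scale).round(), -127, 127).to(torch.int8)

    tokens, in_f = x.shape
    out_f = w_q.shape[0]
    y = torch.zeros(tokens, out_f)
    for i in range(out_f // block):               # accumulate tile by tile
        for j in range(in_f // block):
            a = x_q[:, j * block:(j + 1) * block].float()
            b = w_q[i * block:(i + 1) * block, j * block:(j + 1) * block].float()
            y[:, i * block:(i + 1) * block] += (a @ b.t()) * w_scale[i, j]
    return y * x_scale                            # undo the activation scale

w = torch.randn(256, 256)                         # toy linear layer, y = x @ w.T
x = torch.randn(4, 256)
w_q, w_scale = quantize_weight_blockwise(w)
y = w8a8_linear(x, w_q, w_scale)
print((y - x @ w.t()).abs().max())                # small quantization error vs. FP
```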

These four technologies work hand in hand: distillation reduces the number of steps, quantization lightens the load of each step, and SLA together with SageAttention cuts the attention compute. The end result is the striking 200-fold acceleration.

These four core technologies were independently developed by the TSAIL group at Tsinghua University in cooperation with Shengshu Technology. Their significance goes far beyond improved benchmark numbers: they bridge the gap between research on video generation models and their practical application:

Consumer-grade deployment becomes possible: on a single RTX 5090, 720P video generation drops from several hundred seconds to just a few seconds, finally putting practical tools within reach of individuals and small and medium-sized enterprises.

The costs for cloud inference drop drastically: A 100-fold reduction in inference delay means that the same computing power can serve 100 times more users. This significantly reduces the operating costs of cloud providers and SaaS platforms.

It drives the innovation of AIGC products: Real-time video editing, interactive video generation, automatic production of AI short films, and other new scenarios become possible and lead to new product forms.

It is compatible with Chinese chips: features such as low-bit computation, sparsification, and custom operators map naturally onto the architecture of Chinese AI chips and contribute to the independence and controllability of China's AI infrastructure.

Notably, SageAttention is the first method worldwide to achieve quantized acceleration of attention computation, and it is already deployed at scale in industry.

For example, SageAttention has been successfully integrated into NVIDIA's TensorRT inference engine and brought up on mainstream hardware platforms such as Huawei Ascend and Moore Threads S6000.

In addition, leading technology companies and teams such as Tencent Hunyuan, ByteDance Doubao, Alibaba Tora, Shengshu Vidu, Zhipu Qingying, Baidu PaddlePaddle, Kunlun Wanwei, Google Veo3, SenseTime, and vLLM have adopted this technology in their core products and achieved considerable economic benefits thanks to its excellent performance.

Video generation gets closer to real-time

From one hour to 2 seconds: TurboDiffusion is not only a technological breakthrough but also a paradigm shift.

It proves that high-quality AI video does not have to come at the expense of efficiency. Once generation speed falls within the range of human reaction time (under 5 seconds), AI is no longer just a post-processing tool but a creative partner: when you speak, it reacts; when you draw a sketch, it tells a story.

Perhaps this is the true meaning of real-time generation: the delay in creation is eliminated, and the only limit left is imagination.

And now we are only 2 seconds away from that moment.

TurboDiffusion project address: https://github.com/thu-ml/TurboDiffusion?tab=readme-ov-file

Paper address: https://arxiv.org/pdf/2512.16093

This article is from the WeChat account QbitAI (量子位).