VAE takes another blow: Tsinghua University and Kuaishou's SVG diffusion model debuts, with a 62x boost in training efficiency and a 35x boost in generation speed.
Shortly after Saining Xie declared that VAE could retire from the image generation field, teams from Tsinghua University and Kuaishou Keling have introduced SVG, a latent diffusion model that works without a VAE.
This method has achieved a 62-fold improvement in training efficiency and a 35-fold improvement in generation speed.
Why is VAE being abandoned by one team after another? The main culprit is semantic entanglement: all semantic attributes are packed into the same latent space, so adjusting one value affects everything else. For example, if you only want to change a cat's color, its body shape and expression change along with it.
Unlike the RAE from Saining Xie's team, which simply reuses a pre-trained encoder, modifies the DiT architecture, and concentrates on generation performance, SVG achieves multi-task generality through a semantic-plus-detail dual-branch design combined with distribution alignment.
Let's take a closer look below.
Actively constructing a feature space that fuses semantics and details
In the traditional "VAE + diffusion model" image generation paradigm, the VAE's core role is to compress high-resolution images into low-dimensional latent features (which can be understood as simplified codes of the images) on which the downstream diffusion model then learns the generation process.
However, this causes the features of images with different categories and semantics to become chaotically intertwined; for example, the feature boundary between cats and dogs is blurred.
This directly leads to two problems:
First, the diffusion model trains extremely inefficiently: it takes millions of iterations to barely sort out the feature structure.
Second, generation is cumbersome: it often takes dozens or even hundreds of sampling steps to output a clear image.
Moreover, the resulting feature space has only a single use: apart from image generation, it can hardly be adapted to other visual tasks such as image recognition or semantic segmentation.
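To make the traditional pipeline concrete, here is a minimal, self-contained PyTorch sketch of the "VAE + diffusion" setup. The `vae_encoder` and `denoiser` modules are tiny placeholders invented for illustration (real systems use a pre-trained VAE such as SD-VAE and a transformer denoiser such as DiT or SiT), and the single training step shown is a simplified flow-matching-style objective, not any specific paper's code:

```python
import torch
import torch.nn as nn

# Placeholder modules for illustration only; real systems use a pre-trained
# VAE (e.g., SD-VAE) and a transformer-based denoiser (e.g., DiT / SiT).
vae_encoder = nn.Conv2d(3, 4, kernel_size=8, stride=8)   # image -> low-dim latent "code"
denoiser = nn.Conv2d(4, 4, kernel_size=3, padding=1)     # predicts the injected noise

image = torch.randn(1, 3, 256, 256)                      # dummy 256x256 RGB image
latent = vae_encoder(image)                               # (1, 4, 32, 32) compressed latent

# One simplified training step: the diffusion model only ever sees the latent.
t = torch.rand(1)                                         # random timestep in [0, 1]
noise = torch.randn_like(latent)
noisy_latent = (1 - t) * latent + t * noise               # linear noising (flow-matching style)
loss = ((denoiser(noisy_latent) - noise) ** 2).mean()     # learn to recover the noise
loss.backward()
```

The key point is simply that the diffusion model never touches pixels: everything it learns is dictated by the structure of the latent space the encoder provides.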
Facing VAE's dilemma, the RAE approach from Saining Xie's team goes all-in on generation. It directly reuses mature pre-trained encoders such as DINOv2 and MAE without modifying the encoder structure, restores image detail solely by optimizing the decoder, and tailors the diffusion model architecture accordingly.
The result is a leap in generation efficiency and quality. Simply put, it is entirely focused on generating images quickly and well.
The SVG technology of the Tsinghua & Kuaishou Keling team takes a route that combines generation and multi-task generality. The core difference lies in the construction logic of the feature space.
RAE directly reuses pre-trained features, while SVG actively constructs a feature space that integrates semantics and details.
Specifically, SVG chooses the DINOv3 pre-trained model as the semantic extractor.
The reason: after large-scale self-supervised learning, DINOv3 accurately captures the high-level semantic information of images, so the feature boundaries between categories such as cats, dogs, and cars are clearly separated, which tackles semantic entanglement at the root.
However, the team also found that DINOv3's features concentrate on macro semantics and lose high-frequency details such as color and texture. They therefore designed a lightweight residual encoder dedicated to learning exactly this overlooked detail information.
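As a rough picture of this dual-branch design, the sketch below pairs a frozen "semantic" encoder with a small trainable residual encoder and fuses their outputs into one latent. All modules here are placeholders invented for illustration (in SVG the semantic branch is a frozen pre-trained DINOv3), and channel concatenation is just one plausible way to combine the two branches, not necessarily the paper's exact fusion operation:

```python
import torch
import torch.nn as nn

# Placeholder semantic branch: in SVG this is a frozen, pre-trained DINOv3 encoder.
semantic_encoder = nn.Conv2d(3, 768, kernel_size=16, stride=16)   # "DINOv3-like" patch features
for p in semantic_encoder.parameters():
    p.requires_grad_(False)                                        # frozen, never fine-tuned

# Lightweight residual encoder: small and trainable, it learns the high-frequency
# details (color, texture) that the semantic features discard.
residual_encoder = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=16, stride=16),
    nn.GELU(),
    nn.Conv2d(64, 64, kernel_size=1),
)

image = torch.randn(1, 3, 256, 256)
semantic_feat = semantic_encoder(image)      # (1, 768, 16, 16) high-level semantics
detail_feat = residual_encoder(image)        # (1, 64, 16, 16) complementary details

# Joint latent handed to the diffusion model: semantics and details side by side.
latent = torch.cat([semantic_feat, detail_feat], dim=1)   # (1, 832, 16, 16)
```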
To make the semantic branch and the detail-supplement branch blend seamlessly, SVG also adds a key distribution alignment mechanism.
This mechanism adjusts the detail features output by the residual encoder so that their numerical distribution matches that of DINOv3's semantic features, preventing the detail information from disturbing the semantic structure.
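The article does not spell out the exact alignment operator. One simple way to realize such a mechanism is moment matching: shift and rescale the detail features so their statistics track those of the semantic features. The snippet below is only an illustration of that idea under this assumption, not SVG's actual implementation:

```python
import torch

def align_distribution(detail_feat, semantic_feat, eps=1e-6):
    """Illustrative moment matching: shift/scale detail features so their
    per-channel mean and std track those of the semantic features."""
    d_mean = detail_feat.mean(dim=(0, 2, 3), keepdim=True)
    d_std = detail_feat.std(dim=(0, 2, 3), keepdim=True)
    s_mean = semantic_feat.mean(dim=(0, 2, 3), keepdim=True)
    s_std = semantic_feat.std(dim=(0, 2, 3), keepdim=True)
    normalized = (detail_feat - d_mean) / (d_std + eps)    # zero mean, unit variance
    return normalized * s_std + s_mean                     # re-expressed on the semantic scale

detail_feat = torch.randn(2, 64, 16, 16) * 5.0 + 3.0       # dummy detail features, "wrong" scale
semantic_feat = torch.randn(2, 64, 16, 16)                 # dummy semantic features
aligned = align_distribution(detail_feat, semantic_feat)
print(aligned.mean().item(), aligned.std().item())         # now close to the semantic statistics
```

Keeping the detail statistics on the same scale as the semantic features is what stops the added high-frequency information from distorting the semantic structure the diffusion model relies on.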
Experimental data confirm the importance of this mechanism: with distribution alignment removed, the FID of SVG's generated images (the core metric of similarity between generated and real images; lower is better) rose from 6.12 to 9.03, a clear drop in generation quality.
The experimental results show that SVG comprehensively surpasses the traditional VAE solution in terms of generation quality, efficiency, and multi-task generality.
In terms of training efficiency, on the ImageNet 256×256 dataset the SVG-XL model needs only 80 training epochs to reach an FID of 6.57 with no classifier guidance, far better than the VAE-based SiT-XL of the same scale (22.58). Extending training to 1400 epochs brings the FID down to 1.92, approaching today's top generation models.
In terms of inference efficiency, in the ablation study with only 5 sampling steps, SVG-XL achieves a gFID of 12.26, versus 69.38 for SiT-XL (SD-VAE) and 74.46 for SiT-XL (VA-VAE). This shows that SVG-XL reaches good generation quality with far fewer sampling steps.
The gains are not limited to image generation: SVG's feature space inherits DINOv3's capabilities and can be used directly, without fine-tuning the encoder, for tasks such as image classification, semantic segmentation, and depth estimation. For example, on ImageNet-1K classification it reaches 81.8% Top-1 accuracy, nearly matching the original DINOv3; on ADE20K semantic segmentation it reaches 46.51% mIoU, approaching dedicated segmentation models.
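Because the encoder stays frozen, reusing SVG's feature space for recognition tasks typically amounts to training a small head on top of it. The linear-probe sketch below uses a hypothetical frozen `encoder` as a stand-in for the DINOv3-based backbone; the probe and training step are generic, not the paper's evaluation code:

```python
import torch
import torch.nn as nn

# Hypothetical frozen feature extractor standing in for SVG's DINOv3-based encoder.
encoder = nn.Sequential(
    nn.Conv2d(3, 768, kernel_size=16, stride=16),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
)
for p in encoder.parameters():
    p.requires_grad_(False)

# Linear probe: the only trainable part for an ImageNet-1K-style classification task.
probe = nn.Linear(768, 1000)
optimizer = torch.optim.AdamW(probe.parameters(), lr=1e-3)

images = torch.randn(8, 3, 256, 256)                  # dummy batch
labels = torch.randint(0, 1000, (8,))

with torch.no_grad():                                  # the backbone is never fine-tuned
    feats = encoder(images)                            # (8, 768) pooled features

loss = nn.functional.cross_entropy(probe(feats), labels)
loss.backward()
optimizer.step()
```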
Team introduction
The team is led by Zheng Wenzhao, who is currently a postdoctoral fellow at the University of California, Berkeley. Previously, he obtained his doctorate from the Department of Automation at Tsinghua University, and his research focuses on the fields of artificial intelligence and deep learning.
Shi Minglei and Wang Haolin, also from the Department of Automation at Tsinghua University, are currently pursuing their doctoral degrees. Their research focuses on multi-modal generation models.
Among them, Shi Minglei revealed that he is also founding a company focusing on artificial intelligence applications.
Ziyang Yuan, Xiaoshi Wu, Xintao Wang, and Pengfei Wan are from the Kuaishou Keling team.
Among them, Pengfei Wan heads the Kuaishou Keling video generation model.
From the RAE of Saining Xie's team to the SVG of Tsinghua and Kuaishou, the technical routes differ in emphasis, but their breakthroughs suggest that the feature spaces of pre-trained visual models may already be capable of replacing VAE.
Paper address: https://arxiv.org/abs/2510.15301
Code address: https://github.com/shiml20/SVG
This article is from the WeChat official account "QbitAI", author: Wen Le. It is published by 36Kr with authorization.