
New work from He Kaiming's team: Diffusion models may be misused.

QbitAI · 2025-11-19 19:18
Let the diffusion model return to the essence of "denoising".

He Kaiming has returned to simplicity once again.

The latest paper directly overturns the mainstream approach of diffusion models—instead of having the model predict noise, it directly generates clean images.

If you are familiar with He Kaiming's works, you will find that this is his typical path of innovation. Instead of proposing a more complex architecture, he breaks the problem down to its original form, letting the model do what it is best at.

In fact, diffusion models have been popular for many years, and their architectures have become increasingly complex. For example, they predict noise, predict velocity, align latent variables, stack tokenizers, add VAEs, and add perceptual loss...

But it seems that everyone has forgotten that diffusion models were originally denoising models.

Now, this new paper brings this issue back to the table. Since it is called a denoising model, why not directly denoise?

Therefore, following ResNet and MAE, He Kaiming's team has arrived at another "simplest is best" conclusion: diffusion models should return to their origins and directly predict images.

Diffusion models may have been misused

Although the current mainstream diffusion models are designed with the idea of "denoising" in mind, during training, the targets predicted by neural networks are often not clean images, but noise, or a velocity field that is a mixture of images and noise.

In fact, predicting noise is very different from predicting clean images.
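The difference is subtle because the three common training targets carry the same information. Under the linear interpolation used in flow-style diffusion (an assumption here; the paper's exact schedule may differ), clean image, noise, and velocity are exact linear reparameterizations of one another, so what changes is not the information but what the network must represent. A minimal NumPy sketch:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "image": one flattened patch. Assumed linear interpolation
# x_t = (1 - t) * x0 + t * eps (a flow-matching-style schedule).
x0 = rng.normal(size=768)    # clean data
eps = rng.normal(size=768)   # Gaussian noise
t = 0.7
x_t = (1 - t) * x0 + t * eps

# The three common prediction targets:
target_x0 = x0          # x-prediction (clean image)
target_eps = eps        # epsilon-prediction (noise)
target_v = eps - x0     # velocity for this interpolation: v = dx_t/dt

# Given a perfect x0 prediction, the other targets follow in closed
# form, so the parameterizations are mathematically interchangeable.
eps_from_x0 = (x_t - (1 - t) * x0) / t
v_from_x0 = (x_t - x0) / t  # since x_t - x0 = t * (eps - x0)

assert np.allclose(eps_from_x0, target_eps)
assert np.allclose(v_from_x0, target_v)
```

The asymmetry only appears when the network is imperfect: an x0-prediction error stays bounded, while recovering x0 from a predicted eps divides the error by t, which blows up near clean data.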

According to the manifold hypothesis, natural images are distributed on a low-dimensional manifold within high-dimensional pixel space: they are clean, structured data. Noise, by contrast, is spread uniformly throughout the high-dimensional space and has no such low-dimensional structure.

To put it simply, imagine the high-dimensional pixel space as a huge 3D room, and clean natural images are actually crowded on a 2D screen in the room. This is the manifold hypothesis—although natural data seems to have a high dimension, it is actually concentrated on a low-dimensional "surface (manifold)".

But noise is different: like snowflakes filling the entire 3D room, it does not lie on the screen. A velocity field is similar, half on the screen and half off it, likewise departing from the manifold.

This leads to a core contradiction. With high-dimensional data, for example an image divided into large patches of 16x16 or even 32x32 pixels, asking a neural network to fit patternless high-dimensional noise forces it to retain all of that information, which demands large model capacity and can easily make training collapse.

Conversely, if the network directly predicts clean images, it is in essence learning to project noisy points back onto the low-dimensional manifold. This requires far less network capacity and is closer to what neural networks were originally designed to do: filter out noise and retain the signal.
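This projection intuition can be checked numerically. In the toy sketch below (an illustration, not from the paper), "images" live on a 2-D plane embedded in a 100-D space; orthogonally projecting noisy points back onto the plane removes most of the noise, and the projector only needs to know the 2-D structure, not the full 100-D noise:

```python
import numpy as np

rng = np.random.default_rng(0)

# "Manifold": a 2-D plane embedded in 100-D space (a stand-in for
# images on a low-dimensional manifold in pixel space).
basis, _ = np.linalg.qr(rng.normal(size=(100, 2)))  # orthonormal 100x2
coords = rng.normal(size=(500, 2))
clean = coords @ basis.T                  # 500 points on the plane

noisy = clean + 0.5 * rng.normal(size=clean.shape)  # pushed off-manifold

# Denoising as projection back onto the manifold.
projected = noisy @ basis @ basis.T

# Points already on the manifold are left untouched by the projector.
assert np.allclose(clean @ basis @ basis.T, clean)

# Projection discards the off-plane noise component, shrinking the error.
err_noisy = np.linalg.norm(noisy - clean)
err_proj = np.linalg.norm(projected - clean)
assert err_proj < err_noisy
```

Only the in-plane noise component survives the projection, roughly sqrt(2/100) of the original error here, which is why a manifold-aware denoiser needs so much less capacity than a noise-fitter.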

Therefore, this paper proposes an extremely simple architecture JiT—Just image Transformers.

As its name suggests, this is a pure Transformer for processing images, and its design is very simple. It does not use a VAE to compress the latent space like common diffusion models, nor does it design any tokenizers. It does not require the alignment of pre-trained features such as CLIP or DINO, and does not rely on any additional loss functions.

It starts entirely from pixels and denoises with a pure Transformer.

JiT looks like a standard ViT: it divides the raw pixels into large patches (each of which can be 3072-dimensional or higher) and feeds them in directly. The only change is the output target, which is set to directly predict the clean image patches.
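To make the dimensions concrete, here is a minimal NumPy sketch of such a patch pipeline; `patchify`/`unpatchify` are illustrative helpers, not the paper's code, and an identity stand-in replaces the Transformer so the sketch stays runnable. With 32x32 RGB patches, each token is 32·32·3 = 3072-dimensional:

```python
import numpy as np

def patchify(img, p):
    """Split an (H, W, C) image into flattened p x p patches."""
    h, w, c = img.shape
    x = img.reshape(h // p, p, w // p, p, c)
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, p * p * c)

def unpatchify(patches, h, w, c, p):
    """Inverse of patchify: reassemble flattened patches into an image."""
    x = patches.reshape(h // p, w // p, p, p, c)
    return x.transpose(0, 2, 1, 3, 4).reshape(h, w, c)

img = np.random.default_rng(0).normal(size=(256, 256, 3))

tokens = patchify(img, 32)
# 256x256 image, 32x32 patches -> 64 tokens of 32*32*3 = 3072 dims each.
assert tokens.shape == (64, 3072)

# A JiT-style model would map noisy tokens -> clean tokens with a plain
# Transformer; an identity stand-in keeps the round trip checkable.
pred_clean = tokens
recon = unpatchify(pred_clean, 256, 256, 3, 32)
assert np.allclose(recon, img)
```

Scaling the same arithmetic to 64x64 patches gives 64·64·3 = 12288 dimensions per token, the ">10,000-dimensional" regime discussed below.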

The experimental results show that in low-dimensional settings, predicting noise and predicting the original image perform comparably; but once the dimensionality gets high, the traditional noise-prediction model collapses completely and its FID (lower is better) explodes, while JiT, which directly predicts the original image, remains stable.

The model also has excellent scalability. Even if the patch size is increased to 64x64, making the input dimension as high as more than ten thousand dimensions, as long as it persists in predicting the original image, high-quality generation can be achieved without increasing the network width.

The team even found that deliberately introducing a bottleneck layer at the input for dimensionality reduction not only does not break the model, it actually improves generation quality, because filtering out noise is exactly the essence of manifold learning.

This extremely simple architecture achieves SOTA-level FID scores of 1.82 and 1.78 on ImageNet 256x256 and 512x512 without relying on any complex components or pre-training.

Author introduction

The first author of this paper is Li Tianhong, one of He Kaiming's first students. He earned his bachelor's degree from Tsinghua University's Yao Class, followed by a master's degree and a PhD from MIT, and is now a postdoctoral researcher in He Kaiming's group.

His main research directions are representation learning, generative models, and the synergy between the two. His goal is to build an intelligent vision system that can understand the world beyond human perception.

Previously, he was first author of RCG, a self-conditioned image generation framework developed with He Kaiming, and he has taken part in many of the team's recent research projects.

He is also, it is worth noting, a scholar who loves Hunan cuisine; he even posts recipes on his personal homepage.

Paper link: https://arxiv.org/abs/2511.13720

This article is from the WeChat official account "QbitAI". Author: Wen Le. Republished by 36Kr with permission.