
Is DiT mathematically and formally wrong? Saining Xie responds: Don't do science in your head.

机器之心 (Machine Heart) | 2025-08-20 15:18
Academic controversy is a good thing, but one shouldn't fan the flames.

“Guys, DiT is wrong!”

Recently, a post on X sparked heated discussion. A blogger claimed that DiT has architectural flaws, attaching a screenshot from a research paper.

Figure 1 (caption from the TREAD paper): We introduce TREAD, a training strategy that significantly improves the training efficiency of backbone networks in token-based diffusion models. Applied to the standard DiT backbone, we achieve a 14×/37× increase in training speed in terms of unguided FID, while also converging to better generation quality.

In the figure, the horizontal axis shows training time (A100 GPU hours on a log scale, from 100 to 10,000 hours), and the vertical axis shows FID (lower is better, indicating higher-quality generated images).

The blogger believes the core message of the figure is not TREAD's speed advantage but the premature plateauing of DiT's FID, which he takes as a sign that DiT has "hidden architectural flaws" preventing it from continuing to learn from the data.

The paper the blogger cites was published in January this year (updated to v2 in March) and introduces a method called TREAD. Without changing the model architecture, the work substantially improves training efficiency and generation quality through a "token routing" mechanism, beating the DiT baseline on both speed and final performance.

Specifically, during training TREAD operates on a "partial token set" rather than the "full token set": a predefined route carries a subset of tokens past part of the network, preserving their information and reintroducing it at a deeper layer, so the skipped layers process fewer tokens and compute drops. The routing is used only during training; inference uses the standard setup. The idea is similar to methods like MaskDiT, but more efficient. A minimal sketch of this routing appears after the paper links below.

Paper title: TREAD: Token Routing for Efficient Architecture-agnostic Diffusion Training

Paper link: https://arxiv.org/abs/2501.04765

Code: https://github.com/CompVis/tread
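To make the routing idea concrete, here is a minimal PyTorch sketch of training-time token routing as described above. It is my own simplification, not the authors' implementation: `ToyBlock`, `routed_forward`, `keep_ratio`, and the route boundaries `start`/`end` are all illustrative names.

```python
import torch
import torch.nn as nn

class ToyBlock(nn.Module):
    """Stand-in for a transformer block (hypothetical simplification)."""
    def __init__(self, dim):
        super().__init__()
        self.body = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, dim), nn.GELU())

    def forward(self, x):
        return x + self.body(x)

def routed_forward(blocks, x, start, end, keep_ratio=0.5, training=True):
    """TREAD-style token routing (a sketch, not the authors' code).

    Tokens dropped at block `start` bypass blocks start..end-1 unchanged
    and are re-inserted before block `end`; only the kept subset pays the
    compute cost of the routed span. Used at training time only.
    """
    B, N, D = x.shape
    saved, mask = None, None
    for i, blk in enumerate(blocks):
        if training and i == start:
            k = int(N * keep_ratio)
            keep = torch.rand(B, N, device=x.device).argsort(dim=1)[:, :k]
            mask = torch.zeros(B, N, dtype=torch.bool, device=x.device)
            mask.scatter_(1, keep, True)
            saved = x                     # full token set, preserved via the route
            x = x[mask].view(B, k, D)     # only kept tokens enter the routed span
        if training and i == end:
            merged = saved.clone()
            merged[mask] = x.reshape(-1, D)  # re-insert processed tokens deeper in the net
            x = merged
        x = blk(x)
    return x

# Usage (shapes are illustrative):
blocks = nn.ModuleList(ToyBlock(64) for _ in range(8))
x = torch.randn(2, 16, 64)
y = routed_forward(blocks, x, start=2, end=6)  # routed span: blocks 2..5
```

With `training=False` the function reduces to an ordinary full-token forward pass, matching the paper's statement that routing is used only during training.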

The blogger further criticized DiT in subsequent responses and explained how TREAD exposed these issues.

The blogger argued that the paper exposes a design flaw in DiT. Specifically, the researchers found that if some of the model's computational blocks are replaced during training with an "identity function" (the blocks do no computation and simply pass the data through, as if temporarily disabled), the model's final evaluation score actually improves.
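As a hedged illustration of that ablation (names are hypothetical; `model.blocks` is assumed to be an `nn.ModuleList` of DiT-style blocks that take a token tensor plus a conditioning argument):

```python
import torch.nn as nn

class PassThrough(nn.Module):
    """Identity that ignores conditioning arguments (DiT-style blocks take (x, c))."""
    def forward(self, x, *args, **kwargs):
        return x

def disable_blocks(model: nn.Module, indices):
    """Swap the chosen blocks for pass-through identities; return the originals."""
    originals = {}
    for i in indices:
        originals[i] = model.blocks[i]
        model.blocks[i] = PassThrough()  # the block now just forwards its tokens
    return originals

def restore_blocks(model: nn.Module, originals):
    """Put the original blocks back after the ablation run."""
    for i, blk in originals.items():
        model.blocks[i] = blk
```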

The blogger then pointed out two "suspicious" design choices in DiT:

  • The entire architecture uses “Post-LayerNorm”

In the blogger's view, DiT handles a task with extremely drastic changes in numerical range (the diffusion process) using a technique known to be less stable (Post-LayerNorm).

  • adaLN-zero

The blogger argues that although the model bills itself as a "Transformer" architecture, it processes the most critical "guidance information" (i.e., the conditioning) not with a powerful Transformer but with a very simple MLP (multilayer perceptron).

More specifically, the blogger claims that adaLN-zero completely overwrites the input of the attention unit and injects an arbitrary bias over its output, which limits the model's expressive power, as if the design "hates the attention operation", and thereby caps DiT's overall potential.

The blogger also cited an earlier paper on LayerNorm, noting that LayerNorm's bias and gain parameters may matter more for gradient adjustment than for genuinely improving model performance. He believes adaLN-zero exploits exactly this: it is billed as "gradient adjustment" but in practice looks like "quietly injecting an overfitting bias into a small model".

Paper title: Understanding and Improving Layer Normalization

Paper link: https://arxiv.org/abs/1911.07013
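For reference, here is a minimal sketch of what an adaLN-zero block computes, paraphrased from the DiT paper's description; the class and variable names are mine. A conditioning MLP produces per-branch shift, scale, and gate vectors, with the gate zero-initialized so each residual branch starts out as the identity.

```python
import torch
import torch.nn as nn

class AdaLNZeroBlock(nn.Module):
    """Sketch of a DiT block with adaLN-zero conditioning (illustrative names)."""
    def __init__(self, dim, n_heads):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim, elementwise_affine=False)
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim, elementwise_affine=False)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        # An MLP maps the conditioning (timestep/class embedding) to
        # shift/scale/gate vectors for both residual branches (6 in total).
        self.ada = nn.Sequential(nn.SiLU(), nn.Linear(dim, 6 * dim))
        nn.init.zeros_(self.ada[-1].weight)  # the "zero" in adaLN-zero: gates start at 0,
        nn.init.zeros_(self.ada[-1].bias)    # so each block initially acts as the identity

    def forward(self, x, c):
        # x: tokens (B, N, dim); c: conditioning embedding (B, dim)
        s1, b1, g1, s2, b2, g2 = self.ada(c).chunk(6, dim=-1)
        h = self.norm1(x) * (1 + s1.unsqueeze(1)) + b1.unsqueeze(1)  # modulate attention input
        x = x + g1.unsqueeze(1) * self.attn(h, h, h, need_weights=False)[0]  # gate its output
        h = self.norm2(x) * (1 + s2.unsqueeze(1)) + b2.unsqueeze(1)
        x = x + g2.unsqueeze(1) * self.mlp(h)
        return x
```

Whether one reads the zero-initialized gate as harmless modulation or as "covering" the attention output is precisely the point under dispute.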

After reading the post, Saining Xie, assistant professor of computer science at New York University and an author of DiT, couldn't hold back.

In 2022, Saining Xie co-authored the DiT paper, one of the first works to combine diffusion models with Transformers.

Paper title: Scalable Diffusion Models with Transformers

Paper link: https://arxiv.org/pdf/2212.09748

After DiT came out, Transformers gradually replaced the U-Net in diffusion models and delivered high-quality results in image and video generation tasks.

The core idea is to use a Transformer, instead of a traditional convolutional network, as the backbone of the diffusion model.

This design became the foundational architecture of Sora and Stable Diffusion 3 and cemented DiT's academic standing.
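To make "Transformer as the backbone" concrete, here is a hedged skeleton of a DiT-style forward pass. It is my simplification, not the official code: it reuses the `AdaLNZeroBlock` sketched above (assumed to be in scope) and omits positional embeddings and the final adaLN output layer.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DiTBackbone(nn.Module):
    """Skeleton of a DiT-style backbone (illustrative, not the official code)."""
    def __init__(self, in_ch=4, patch=2, dim=384, depth=12, n_heads=6):
        super().__init__()
        self.patchify = nn.Conv2d(in_ch, dim, kernel_size=patch, stride=patch)
        self.blocks = nn.ModuleList(AdaLNZeroBlock(dim, n_heads) for _ in range(depth))
        self.out = nn.Linear(dim, patch * patch * in_ch)
        self.patch, self.in_ch = patch, in_ch

    def forward(self, z, c):
        # z: noisy latent (B, C, H, W); c: conditioning embedding (B, dim)
        x = self.patchify(z).flatten(2).transpose(1, 2)  # (B, N, dim) tokens
        for blk in self.blocks:
            x = blk(x, c)                                # conditioned Transformer blocks
        x = self.out(x)                                  # per-token noise prediction
        B, _, H, W = z.shape
        p = self.patch
        x = x.transpose(1, 2).reshape(B, self.in_ch * p * p, H // p, W // p)
        return F.pixel_shuffle(x, p)                     # unpatchify back to (B, C, H, W)
```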

When the DiT paper first appeared, it was repeatedly questioned and was even rejected by CVPR 2023 for "lack of novelty".

This time, facing the claim that DiT is "mathematically and formally wrong", Saining Xie posted a series of responses on X.

Judging from his wording, Saining Xie was somewhat worked up about the post:

I know the original post is click-bait, but I'll take the bait anyway...

To be honest, every researcher's dream is to discover that their own architecture is wrong. If it never shows a problem, that is the real problem.

Every day we try to break DiT with methods like SiT, REPA, and REPA-E. But that requires making hypotheses, running experiments, and verifying them, not doing make-believe science in your head... Otherwise the conclusions you reach are not just wrong; they're not even wrong.

No wonder Saining Xie's tone was a bit unfriendly; some of the original poster's statements were indeed rather provocative.

Saining Xie also replied to some of the questions raised in the original post from a technical perspective. After rebutting several of its claims, he went on to point out the flaws that do exist in the current DiT architecture.

As of today, the problems with DiT:

  • TREAD is closer to stochastic depth. I think its convergence gain comes from a regularization effect that strengthens the learned representations (note that inference is standard: every module processes every token). It's a very interesting piece of work, but it says something completely different from the original post. (A sketch of the stochastic-depth analogy follows this list.)
  • LightningDiT is a proven, robust upgrade (combining SwiGLU, RMSNorm, RoPE, and patch size 1); prefer it when possible.
  • There is no evidence that post-norm has a negative impact.
  • The biggest gains over the past year have come from internal representation learning: REPA came first, and there are now many variants (tokenizer-level fixes such as VA-VAE / REPA-E, concatenating semantic tokens onto the noisy latents, decoupled architectures such as DDT, and regularizers such as dispersive loss and self-representation alignment).
  • Always prefer stochastic interpolants / flow matching (SiT should be the baseline here).
  • Use adaLN-zero for timestep embeddings; for more complex conditioning distributions (such as text embeddings), use cross-attention.
  • But use it the right way: adopt PixArt-style shared adaLN, otherwise roughly 30% of the parameters are wasted.
  • The real "inherent flaw" in DiT is actually sd-vae, an obvious but long-neglected problem: it is bloated and inefficient (445.87 GFLOPs to process a 256×256 image?) and it is not end-to-end. Methods like VA-VAE and REPA-E fix it only partially; more progress is on the way.
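For readers unfamiliar with the comparison in the first point, here is a minimal PyTorch sketch of stochastic depth (a standard regularization technique; the class name and survival probability are illustrative). Each residual block is randomly skipped during training, while inference always runs every block, the same train/inference asymmetry that Xie points to in TREAD.

```python
import torch
import torch.nn as nn

class StochasticDepthBlock(nn.Module):
    """Residual block that is randomly skipped during training (stochastic depth)."""
    def __init__(self, inner: nn.Module, survival_prob: float = 0.8):
        super().__init__()
        self.inner = inner
        self.survival_prob = survival_prob

    def forward(self, x):
        if self.training and torch.rand(()) > self.survival_prob:
            return x  # skip the block entirely: pure identity, like TREAD's routed tokens
        out = self.inner(x)
        if self.training:
            out = out / self.survival_prob  # rescale so expectations match inference
        return x + out
```

In both cases computation is dropped only at training time and the full network runs at inference, so the benefit reads as regularization rather than as evidence of an architectural flaw.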

Some netizens were also interested in the technical details mentioned in the response, and Saining Xie answered their questions.

Algorithmic progress is always accompanied by doubts about existing algorithms. As the saying goes, "you can't make an omelette without breaking eggs", and DiT is still in the spotlight, isn't it?

This article is from the WeChat public account 机器之心 (Machine Heart, ID: almosthuman2014), author: Leng Mao +0, republished by 36Kr with authorization.