
The younger generation is truly formidable: He Kaiming's team has released a new result, and one of the co-first authors is a sophomore from Tsinghua University's Yao Class.

QbitAI (量子位) 2025-12-04 10:18
Single-step generation models are making a comeback with renewed strength.

After proposing MeanFlow (MF) in May this year, He Kaiming's team recently introduced an improved version: Improved MeanFlow (iMF).

iMF addresses the original MF's three core problems: training stability, guidance flexibility, and architectural efficiency.

By reformulating the training objective as a more stable instantaneous-velocity loss, and by introducing flexible classifier-free guidance (CFG) and efficient in-context conditioning, iMF significantly improves the model's performance.

On the ImageNet 256x256 benchmark, the iMF-XL/2 model achieves an FID of 1.72 at 1-NFE (a single function evaluation), roughly a 50% improvement over the original MF, demonstrating that a single-step generative model trained from scratch can match multi-step diffusion models.

Geng Zhengyang, the first author of MeanFlow, remains the first author. Notably, co-first author Yiyang Lu is currently a sophomore in Tsinghua University's Yao Class, and He Kaiming is the paper's last author.

Other collaborators include Adobe researchers Zongze Wu and Eli Shechtman, as well as Zico Kolter, director of the Machine Learning Department at CMU.

Reconstructing the prediction function: back to a standard regression problem

The core improvement of iMF (Improved MeanFlow) is to transform the training process into a standard regression problem by reconstructing the prediction function.

The original MeanFlow (MF) directly minimizes a loss on the average velocity, where u_tgt is the target average velocity derived from the MeanFlow identity and the conditional velocity e - x.

The problem is that the derived target u_tgt contains a derivative term of the network's own prediction, and this "self-referential target" structure makes optimization highly unstable and high-variance.

Based on this, iMF constructs the loss from the perspective of instantaneous velocity, making the entire training stable.

Notably, the network's output is still the average velocity; only the training loss is recast as an instantaneous-velocity loss, yielding stable, standard regression training.

iMF first simplifies the input to a single noisy sample z and then modifies how the prediction function is computed internally.

Specifically, when iMF computes the composite prediction function V (a prediction of the instantaneous velocity), the tangent vector fed into the Jacobian-vector product (JVP) term is no longer the external e - x but the marginal velocity predicted by the network itself.

This removes the dependence of the composite prediction V on the target quantity e - x, so iMF can set the regression target of the loss to the stable conditional velocity e - x.

As a result, iMF turns training into a stable, standard regression problem, providing a solid optimization foundation for learning the average velocity.
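To make this concrete, below is a minimal PyTorch-style sketch of this kind of loss under common flow-matching conventions. The interpolation path, the function name u_theta, and the stop-gradient placement are illustrative assumptions based on the description above, not the paper's official implementation.

```python
import torch
from torch.func import jvp

def imf_style_loss(u_theta, x, e, r, t):
    """Minimal sketch of an iMF-style regression loss (illustrative names).

    u_theta(z, r, t) -> predicted average velocity (a user-supplied network).
    x: clean data, e: Gaussian noise, r <= t: time samples in [0, 1], shape (B,).
    """
    t_ = t.view(-1, 1, 1, 1)
    r_ = r.view(-1, 1, 1, 1)
    # Linear interpolation path (flow-matching convention): z_t = (1 - t) x + t e
    z = (1.0 - t_) * x + t_ * e

    # Conditional (instantaneous) velocity of this path: the stable regression target.
    v_cond = e - x

    # Tangent for the JVP: the network's *own* predicted marginal velocity,
    # obtained here by evaluating the average velocity at r = t (detached).
    with torch.no_grad():
        v_pred = u_theta(z, t, t)

    # u and its total time derivative du/dt via a single forward-mode JVP,
    # with tangents (dz/dt, dr/dt, dt/dt) = (v_pred, 0, 1).
    u, dudt = jvp(u_theta, (z, r, t), (v_pred, torch.zeros_like(r), torch.ones_like(t)))

    # Composite prediction of the instantaneous velocity: V = u + (t - r) * du/dt.
    V = u + (t_ - r_) * dudt

    # Standard regression against the fixed target e - x.
    return torch.mean((V - v_cond) ** 2)
```

Because the target e - x is a fixed quantity computed from data and noise, the loss no longer references the network's own output on the target side, which is exactly what makes this a standard regression problem.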

In addition to improving the training objective, iMF also comprehensively enhances the practicality and efficiency of the MeanFlow framework through the following two major breakthroughs:

Flexible classifier-free guidance (CFG)

One major limitation of the original MeanFlow framework is that, to support single-step generation, the classifier-free guidance (CFG) scale must be fixed during training, which severely limits the ability to trade off image quality and diversity by adjusting the scale at inference.

iMF solves this problem by internalizing the guidance scale as a learnable condition.

Specifically, iMF directly provides the guidance scale as an input condition to the network.

During the training phase, the model randomly samples different guidance scales from a power distribution biased towards smaller values. This approach enables the network to adapt to and learn the average velocity fields under different guidance intensities, thereby unlocking the full flexibility of CFG during inference.

In addition, iMF also extends this flexible conditioning to support the CFG interval, further enhancing the model's control over sample diversity.
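As a rough illustration of this training-time sampling, the sketch below draws per-example guidance scales from a power distribution biased toward smaller values and passes them to the network as an extra condition. The range, exponent, and names are assumptions, not the paper's exact settings.

```python
import torch

def sample_guidance_scale(batch_size, w_min=1.0, w_max=5.0, power=2.0):
    """Sample CFG scales biased toward smaller values (illustrative parameters)."""
    u = torch.rand(batch_size)
    # power > 1 concentrates probability mass near w_min.
    return w_min + (w_max - w_min) * u ** power

# During training, the sampled scale is simply fed to the network as one more
# condition alongside time and class label (hypothetical call signature):
#   w = sample_guidance_scale(z.shape[0])
#   u_pred = u_theta(z, r, t, class_label, w)
# At inference, the user can pick any scale in the trained range without retraining.
```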

Efficient in-context conditioning architecture

The original MF relies on the parameter-heavy adaLN-zero mechanism to handle multiple heterogeneous conditions (such as time steps, class labels, and guidance scales).

As the number of conditions grows, simply summing all condition embeddings and passing them to adaLN-zero becomes inefficient and parameter-redundant.

iMF introduces improved in-context conditioning to solve this problem.

Its innovation is to encode every condition (time steps, class labels, and the CFG scale) as multiple learnable tokens rather than a single vector, concatenate these condition tokens with the image latent tokens along the sequence axis, and feed them jointly into the Transformer blocks.

The biggest benefit of this architectural change is that iMF can remove the parameter-heavy adaLN-zero module entirely.

This lets iMF shrink the model while improving performance: the iMF-Base model, for example, is reduced by about one third (from 133M to 89M parameters), greatly improving efficiency and design flexibility.
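To illustrate the idea, here is a minimal PyTorch-style sketch in which each condition becomes one or more learnable tokens that are concatenated with the image tokens along the sequence axis. The module structure, token counts, and names are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class InContextConditioning(nn.Module):
    """Sketch: embed each condition as token(s) and prepend them to image tokens."""

    def __init__(self, dim, num_classes, tokens_per_cond=1):
        super().__init__()
        self.dim = dim
        self.tokens_per_cond = tokens_per_cond
        self.time_mlp = nn.Sequential(nn.Linear(1, dim), nn.SiLU(),
                                      nn.Linear(dim, dim * tokens_per_cond))
        self.cfg_mlp = nn.Sequential(nn.Linear(1, dim), nn.SiLU(),
                                     nn.Linear(dim, dim * tokens_per_cond))
        self.class_emb = nn.Embedding(num_classes, dim * tokens_per_cond)

    def forward(self, img_tokens, t, y, w):
        # img_tokens: (B, N, dim); t, w: (B,) floats; y: (B,) integer class labels
        B = img_tokens.shape[0]
        shape = (B, self.tokens_per_cond, self.dim)
        cond_tokens = torch.cat([
            self.time_mlp(t[:, None]).view(shape),   # time-step token(s)
            self.class_emb(y).view(shape),            # class token(s)
            self.cfg_mlp(w[:, None]).view(shape),     # CFG-scale token(s)
        ], dim=1)
        # Concatenate along the sequence axis; subsequent Transformer blocks
        # process condition and image tokens jointly, with no adaLN-zero.
        return torch.cat([cond_tokens, img_tokens], dim=1)
```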

Experimental results

iMF demonstrates excellent 1-NFE performance on the challenging ImageNet 256x256 benchmark.

The FID of iMF-XL/2 under 1-NFE reaches 1.72, pushing the performance of the single-step generation model to a new height.

iMF trained from scratch even outperforms many fast-sampling models distilled from pre-trained multi-step models, demonstrating the strength of the iMF framework when training from scratch.

The following figure shows 1-NFE (single function evaluation) generation results on ImageNet 256x256.

At 2-NFE, iMF's FID reaches 1.54, further narrowing the gap with multi-step diffusion models (FID roughly 1.4 - 1.7).

One more thing

As mentioned above, the first author of iMF, Geng Zhengyang, carries over from the core team of the previous work MeanFlow (selected as an Oral at NeurIPS 2025).

He graduated from Sichuan University with a bachelor's degree and is currently pursuing a doctorate at CMU, under the supervision of Professor Zico Kolter.

The co-first author is Yiyang Lu, a sophomore from Tsinghua University's Yao Class. He is currently researching computer vision at MIT under the guidance of Professor He Kaiming. Previously, he studied robotics under the guidance of Professor Xu Huazhe at the Institute for Interdisciplinary Information Sciences of Tsinghua University.

Part of the content of this paper was completed by them at MIT under the guidance of Professor He Kaiming.

The other authors of the paper include Adobe researchers Zongze Wu and Eli Shechtman, J. Zico Kolter (director of the Machine Learning Department at CMU), and Professor He Kaiming.

Among them, Zongze Wu received his bachelor's degree from Tongji University and his doctorate from the Hebrew University of Jerusalem; he is currently a research scientist at Adobe Research in San Francisco.

Eli Shechtman also comes from Adobe: he is a senior principal scientist in the Adobe Research image lab. He joined Adobe in 2007 and was a postdoctoral researcher at the University of Washington from 2007 to 2010.

J. Zico Kolter is the supervisor of Geng Zhengyang, the first author of the paper. He is a professor in the School of Computer Science at CMU and the director of the Machine Learning Department.

The last author of the paper is the famous machine learning scientist He Kaiming. He is currently a tenured associate professor at MIT.

His most widely known work is the deep residual network, ResNet.