He Kaiming's first language model: 105M parameters, and it doesn't follow GPT's old autoregressive path
He Kaiming has also entered the field of language model development.
This time, however, the model his team is building does not follow the well-known autoregressive "next token prediction" paradigm behind ChatGPT.
Instead, it takes an approach that has been hugely popular in image generation over the past few years and is increasingly being applied to text generation: the diffusion language model (DLM).
In the latest paper, He Kaiming's team introduces a brand-new continuous diffusion language model: ELF (Embedded Language Flows).
Unlike many diffusion language models that diffuse at the token level, ELF keeps the entire generation process in continuous embedding space; only at the very last step is the representation re-discretized back into tokens.
With this design, ELF outperforms a number of mainstream diffusion language models using only 105M parameters, 45B training tokens, and 32 sampling steps.
The most intuitive indicator: on OpenWebText, it brings generative perplexity down to 24.
A quick note on generative perplexity: in essence, a strong external language model is used to "check the work" of the generated text, measuring how closely it resembles text written by real humans.
The lower the value, the higher the generation quality and the less the output carries an "AI flavor", i.e., the more natural it reads.
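To make this concrete, here is a minimal sketch of how generative perplexity can be computed, assuming GPT-2 (via Hugging Face transformers) as the scoring model; the paper's exact evaluation setup may differ.

```python
import math
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

def generative_perplexity(texts, device="cpu"):
    """Rough sketch: score generated texts with an external LM (GPT-2 here,
    an assumed choice) and exponentiate the average negative log-likelihood.
    Lower values mean the text looks more like human-written corpora."""
    tok = GPT2TokenizerFast.from_pretrained("gpt2")
    lm = GPT2LMHeadModel.from_pretrained("gpt2").to(device).eval()
    total_nll, total_tokens = 0.0, 0
    with torch.no_grad():
        for text in texts:
            ids = tok(text, return_tensors="pt").input_ids.to(device)
            # labels=ids makes the model return the mean token-level NLL
            nll = lm(ids, labels=ids).loss.item()
            total_nll += nll * (ids.shape[1] - 1)
            total_tokens += ids.shape[1] - 1
    return math.exp(total_nll / total_tokens)
```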
Compared with mainstream diffusion language models, ELF achieves lower generative perplexity with roughly ten times fewer training tokens and far fewer sampling steps.
For a long time, progress in diffusion language models has come almost entirely from the discrete side.
ELF is the first to show that the continuous route not only works, but works well.
What exactly does ELF do?
To understand ELF, we first need to understand what diffusion language models are currently doing.
Diffusion language models follow two main technical routes. One is the discrete school, represented by MDLM and Duo, which diffuses directly in token space, so each step deals with discrete random variables.
The other is the continuous school, including Diffusion-LM, CDCD, and DiffuSeq, which maps tokens to continuous embeddings and denoises in that continuous space.
In previous work, discrete approaches such as MDLM, LLaDA, and Dream 7B have taken the lead, for a simple reason: language itself is discrete.
He Kaiming's team takes the opposite view of this seemingly common-sense understanding.
The problem may not be that language must be modeled discretely; it may be that previous work never carried the continuous approach all the way through.
Methods like Diffusion-LM denoise in embedding space, but they compute a token-level cross-entropy at every step, tying the continuous trajectory to the vocabulary.
Later, LD4LG and Cosmos followed the latent-diffusion route: the denoising process is continuous, but they must train a separate decoder to map the latent back into tokens, which adds an extra module.
ELF, by contrast, keeps all denoising in the continuous embedding space; only at the final step t = 1 is the result projected back into tokens.
Specifically, during training, discrete tokens are first encoded into continuous embeddings, and noise is added to produce z_t. The model is trained either to recover the clean embedding (an MSE loss) or to predict tokens directly (a cross-entropy loss).
During inference, the model starts from Gaussian noise z_0 and denoises entirely in the continuous space; only at the last step does it switch to decode mode and project the embeddings back into tokens.
ELF is the first to fully separate two problems that were previously thought to require repeated alignment, "continuous representation" and "discrete output":
The intermediate denoising is completely handled in the continuous space; the final language generation is only carried out in the last step of discretization.
There is no need to force alignment with the vocabulary at each step, and there is no need to train an additional decoder. For the first time, the entire generation process truly achieves:
Continuous is continuous, and discrete is discrete.
This is precisely the key to ELF's ability to outperform many diffusion language models with fewer sampling steps and fewer training tokens.
In other words, ELF is not a "diffuse first, then hand off to a separate decoder" pipeline.
In terms of concrete implementation, ELF has to solve three problems:
how to turn tokens into continuous embeddings, how to denoise in the continuous space, and how to turn the embeddings back into tokens.
Converting tokens into continuous embeddings
To apply continuous diffusion to language, the first step is to convert discrete tokens into continuous representations.
In the paper, ELF first splits the text into a token sequence and then maps it into the continuous embedding space; there are several options for how to perform this mapping.
By default, ELF uses a pre-trained T5 encoder to produce bidirectional contextual embeddings; the paper also ablates alternatives such as jointly trained embeddings and random embeddings.
It is worth noting that this encoder is only used during the training phase and does not add an extra module during inference.
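As a rough illustration, extracting such bidirectional contextual embeddings from a pre-trained T5 encoder might look like the sketch below (using Hugging Face transformers; the checkpoint name is an assumption, not a detail from the paper).

```python
import torch
from transformers import AutoTokenizer, T5EncoderModel

# Sketch: map a token sequence to continuous, bidirectional embeddings
# with a frozen pre-trained T5 encoder (checkpoint choice is assumed).
tok = AutoTokenizer.from_pretrained("t5-base")
encoder = T5EncoderModel.from_pretrained("t5-base").eval()

text = "Diffusion in embedding space."
ids = tok(text, return_tensors="pt").input_ids
with torch.no_grad():
    # Shape (batch, seq_len, d_model): one continuous vector per token.
    clean_embeddings = encoder(input_ids=ids).last_hidden_state
```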
Performing Flow Matching in the continuous embedding space
After obtaining the continuous representation, ELF performs Flow Matching in the embedding space.
Simply put, Flow Matching defines a continuous flow trajectory from noise to real data:
When t = 0, it is Gaussian noise;
When t = 1, it is a clean embedding;
All intermediate states are linear interpolations between the two; this is the rectified flow used in the paper.
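In code, this trajectory is just a linear blend of noise and the clean embedding; the sketch below assumes standard (batch, seq_len, dim) embedding tensors and is only an illustration of the interpolation, not ELF's implementation.

```python
import torch

def rectified_flow_interpolate(x_clean, t):
    """Linear path between Gaussian noise (t = 0) and clean embeddings (t = 1):
    z_t = (1 - t) * eps + t * x_clean.
    x_clean: (batch, seq_len, dim); t: (batch,) with values in [0, 1]."""
    eps = torch.randn_like(x_clean)
    t = t.view(-1, 1, 1)                 # broadcast over seq_len and dim
    z_t = (1.0 - t) * eps + t * x_clean
    velocity = x_clean - eps             # constant along the straight path
    return z_t, eps, velocity
```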
In traditional Flow Matching, the network usually directly predicts the "velocity field" v.
However, ELF does not do this. Instead, it follows the idea proposed by He Kaiming's team half a year ago in "Back to Basics: Let Denoising Generative Models Denoise":
Directly predict the clean embedding x, i.e., x-prediction.
The training objective is to minimize the mean squared error (MSE) between the predicted embedding and the real embedding.
The paper gives two reasons for using x - prediction:
First, it is more stable for high-dimensional representations, such as 768-dimensional or higher token embeddings; second, it aligns naturally with the goal of "predicting clean tokens" at the last step.
The paper also notes that although one could in principle predict the velocity v first and then convert it into x, doing so makes it hard to share weights between denoising and decoding.
In experiments, they found that once weights are shared, the performance of v-prediction deteriorates significantly.
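Under this x-prediction view, the denoising objective reduces to an MSE between the predicted and the true clean embedding. The sketch below assumes a generic model(z_t, t) interface rather than ELF's actual architecture.

```python
import torch
import torch.nn.functional as F

def x_prediction_loss(model, x_clean, t):
    """Denoising objective under x-prediction: the network reads the noisy
    embedding z_t and regresses the clean embedding directly (MSE),
    rather than predicting the velocity field. `model(z_t, t)` is an
    assumed interface for illustration only."""
    eps = torch.randn_like(x_clean)
    t_ = t.view(-1, 1, 1)
    z_t = (1.0 - t_) * eps + t_ * x_clean   # rectified-flow interpolation
    x_pred = model(z_t, t)                   # predicted clean embedding
    return F.mse_loss(x_pred, x_clean)
```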
Converting from continuous embeddings back to discrete tokens
The final output of language generation is still discrete tokens.
So, ELF only needs to project the continuous embeddings back into the token space at the last time step (t = 1).
However, unlike many latent diffusion methods, ELF does not train an additional decoder for this step. Instead, it treats the last step itself as a continuous-to-discrete decoding step.
In other words, the decoder and the previous denoiser are actually the same network.
To keep the last-step training from being trivial (as t→1, the input is theoretically already very close to the clean embedding), ELF applies an extra token-level corruption at the last step to construct a perturbed input.
Subsequently, the same network outputs a clean embedding, which is then projected into token logits through a learnable unembedding matrix W.
The training objective is the standard token-level cross-entropy loss. The whole network shares one set of parameters and additionally receives a binary mode token: denoising mode or decoding mode.
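A rough sketch of this decode-mode training step is given below; the model(z, t, mode=...) interface, the specific token-level corruption, and the corruption rate are all assumptions for illustration, not the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def decode_mode_loss(model, W_unembed, x_clean, token_ids, corrupt_prob=0.1):
    """Last-step (t = 1) training: perturb the near-clean embeddings at the
    token level, run the *same* network in decode mode, project its output
    through a learnable unembedding matrix W, and apply cross-entropy.
    Interfaces and the corruption scheme are assumed, not taken from the paper."""
    # Token-level corruption: replace a few positions with Gaussian noise.
    mask = torch.rand(x_clean.shape[:2], device=x_clean.device) < corrupt_prob
    z = torch.where(mask.unsqueeze(-1), torch.randn_like(x_clean), x_clean)

    t_one = torch.ones(x_clean.shape[0], device=x_clean.device)
    x_pred = model(z, t_one, mode="decode")   # shared denoiser, decode mode
    logits = x_pred @ W_unembed               # (batch, seq_len, vocab)
    return F.cross_entropy(logits.transpose(1, 2), token_ids)
```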
During inference, ELF starts from Gaussian noise and denoises entirely in the continuous space; only at the last step t = 1 does it switch to decode mode and output the final tokens via argmax.
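Putting the pieces together, sampling can be sketched as a fixed number of Euler steps in embedding space followed by a single decode step; the update rule is the standard rectified-flow step under x-prediction, and all interfaces (model, W_unembed, the mode flag) are assumptions for illustration.

```python
import torch

@torch.no_grad()
def sample(model, W_unembed, shape, num_steps=32, device="cpu"):
    """Sketch of ELF-style sampling: start from Gaussian noise, take Euler
    steps along the rectified flow using the network's x-prediction, and
    only at t = 1 switch to decode mode and argmax over the vocabulary."""
    batch = shape[0]
    z = torch.randn(shape, device=device)              # t = 0: pure noise
    ts = torch.linspace(0.0, 1.0, num_steps + 1).tolist()
    for t, t_next in zip(ts[:-1], ts[1:]):
        t_batch = torch.full((batch,), t, device=device)
        x_pred = model(z, t_batch, mode="denoise")     # predicted clean embedding
        # On the straight rectified-flow path the velocity can be recovered
        # from the x-prediction: v = (x_pred - z_t) / (1 - t).
        v = (x_pred - z) / max(1.0 - t, 1e-5)
        z = z + (t_next - t) * v                       # Euler step toward t = 1
    t_one = torch.full((batch,), 1.0, device=device)
    x_final = model(z, t_one, mode="decode")           # single decode step
    logits = x_final @ W_unembed                       # (batch, seq_len, vocab)
    return logits.argmax(dim=-1)                       # discrete tokens
```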
Notably, ELF also adopts classifier-free guidance (CFG), one of the most widely used techniques in image generation.
ELF uses self-conditioning as the conditioning signal and applies training-time CFG, which folds the usual two guided forward passes into a single one so that it adds no inference overhead, borrowing the recipe directly from the image domain.
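For reference, the guidance rule borrowed from the image domain is the standard classifier-free combination of a conditional and an unconditional prediction; the sketch below shows only that underlying rule, not how ELF folds it into training.

```python
def cfg_combine(x_cond, x_uncond, guidance_scale):
    """Standard classifier-free guidance from image generation: push the
    conditional prediction away from the unconditional one. ELF applies
    this at training time (one forward pass at inference); this sketch
    only illustrates the basic rule, not the paper's exact recipe."""
    return x_uncond + guidance_scale * (x_cond - x_uncond)
```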
Experimental comparison
The experiments essentially answer a question that has hung over the field for the past two years:
Can continuous diffusion language models compete? The answer is: not only can they compete, but for the first time, they win in terms of quality, speed, and training cost simultaneously.
As mentioned at the beginning, in the OpenWebText generation task, without distillation, ELF reduced the generative perplexity to 24 with only 32 sampling steps.
Previously, mainstream discrete diffusion models usually needed to run 1024 steps to approach this level.
Even more impressively, ELF reached this result with only 45B training tokens.
For comparable competitors, that number is generally over 500B. In other words, ELF gets better results with an order of magnitude fewer sampling steps and an order of magnitude less training data.
ELF also holds its own on conditional generation tasks, where diffusion models are most prone to lag.
On both WMT14 machine translation and XSum summarization, ELF consistently outperforms existing diffusion language models and even beats many autoregressive baselines.
The conclusion at the end of the paper is actually rather conservative: ELF achieves a strong trade-off between generation quality, sampling efficiency, and training cost.
In plain terms: the continuous school was never incapable of competing; it just had never been carried all the way through before.
About the authors
Finally, a word about the paper's authors.
The two co-first authors made equal contributions, and the order of their names was decided by a coin toss.
Hu Keya, one of the two co-first authors, is a first-year EECS doctoral student at MIT. She is among the first cohort of doctoral students He Kaiming has supervised at MIT and is currently co-advised by He Kaiming and Jacob Andreas.
She graduated from the ACM Class at Shanghai Jiao Tong University. Her research focuses on the intersection of language and vision, aiming to build agents with better data efficiency and stronger generalization.
It's worth mentioning that on He Kaiming's MIT homepage, Hu Keya is listed first among his graduate students; she is, so to speak, the most senior student in the group.