Kaiming He's team has just released new work: ELF, "Embedded Language Flows."
"Language is discrete, but language models don't have to be."
Last year, a project named LLaDA drew considerable attention in the AI community. This language model, built on the principle of "masked diffusion," claimed to match same-scale autoregressive large models (the token-by-token generators exemplified by GPT) on several benchmarks.
Once the news broke, the previously niche research direction of diffusion language models (DLMs) suddenly drew much wider attention.
Text consists of discrete tokens, while diffusion models are naturally suited to continuous data. This makes the mainstream technique of visual generation inherently difficult to transfer to large language models.
After LLaDA demonstrated the feasibility of diffusion models for text, team after team followed up. Researchers broadly agree that diffusion models hold real potential for text generation: they naturally support parallel decoding and can in theory be far faster than autoregressive models that emit one token at a time, and they handle tasks such as infilling and bidirectional editing that are hard for autoregressive models.
In this general direction, researchers have taken two paths:
- Discrete diffusion language models (discrete DLMs): define the diffusion process directly in token space, for example by replacing tokens with a MASK symbol and gradually restoring them (MDLM), or by diffusing tokens toward a uniform distribution and gradually correcting them (Duo). This path has been the mainstream in recent years and has delivered the stronger results.
- Continuous diffusion language models (continuous DLMs): first map tokens to continuous embedding vectors, denoise in that continuous space, and only convert back to tokens at the end. This path is theoretically more elegant, but in practice it has long lagged behind the discrete camp (see the sketch after this list for how the two forward processes differ).
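To make the two paths concrete, here is a minimal illustrative sketch in plain PyTorch (not code from any of these papers) of how each side corrupts a training example: the discrete path replaces token ids with a MASK id, while the continuous path adds Gaussian noise to embedding vectors. The flow-matching convention below (noise at t = 0, data at t = 1) is one common choice; papers differ on the exact parameterization.

```python
import torch

def discrete_mask_corrupt(token_ids: torch.Tensor, mask_id: int, t: float) -> torch.Tensor:
    """MDLM-style forward process: each token is independently replaced
    by the MASK id with probability t (t in [0, 1])."""
    keep = torch.rand(token_ids.shape) >= t
    return torch.where(keep, token_ids, torch.full_like(token_ids, mask_id))

def continuous_noise_corrupt(embeddings: torch.Tensor, t: float) -> torch.Tensor:
    """Flow-matching-style forward process: interpolate between Gaussian noise
    (t = 0) and the clean embedding (t = 1): x_t = (1 - t) * noise + t * x_clean."""
    noise = torch.randn_like(embeddings)
    return (1.0 - t) * noise + t * embeddings

# Toy usage: 2 sequences of 5 token ids, and their 16-dim embedding vectors.
ids = torch.randint(0, 100, (2, 5))
emb = torch.randn(2, 5, 16)
print(discrete_mask_corrupt(ids, mask_id=103, t=0.5))
print(continuous_noise_corrupt(emb, t=0.5).shape)
```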
The new paper from Kaiming He's team takes the latter path, which is clearly the harder one.
The model they propose is called ELF (Embedded Language Flows). Its core idea fits in one sentence: move the diffusion process into continuous vector space and translate the result into words only at the last step.
Tweet by Linlu Qiu, co-first author of the paper
The experiments show that the idea is not only feasible but surprisingly effective: with less than one-tenth of the training data used by other methods, ELF's generation quality already leads across the board.
Paper title: ELF: Embedded Language Flows
Paper address: https://arxiv.org/pdf/2605.10938v1
Code repository: https://github.com/lillian039/ELF
Kaiming He's answer: turn into words only at the last step
The paper comes from an eight-person team at MIT. Keya Hu and Linlu Qiu are co-first authors, and the corresponding author is Kaiming He, one of the iconic figures in computer vision.
Readers with even a passing knowledge of deep learning history will recognize the name. In 2015, at Microsoft Research Asia, he proposed the Residual Network (ResNet), which at a stroke removed the bottleneck that made very deep neural networks hard to train. That paper remains one of the most cited in all of AI, and the residual connection he introduced has found its way into virtually every modern AI system, from the Transformer to AlphaGo Zero and AlphaFold. In 2024 he moved from Meta AI to MIT and began to study generative models systematically.
"When I see a paper by He Kaiming, I click in."
ELF is the team's most distinctive contribution to language generation so far.
Since diffusion models are best at handling continuous spaces, why not let them complete the whole journey in the continuous space and only do a "translation" at the end?
Specifically, this is how ELF works:
First, each token in a sentence is converted into a continuous high-dimensional vector by a pretrained encoder (the paper uses a T5 encoder). These are contextual embeddings: each vector represents not just the individual token but also the semantics of its surrounding context.
Then, use "Flow Matching", a continuous diffusion framework that has become popular in image generation in recent years, to perform denoising on these vectors: starting from a cloud of Gaussian noise, push the noise step by step towards the clean embedding vectors along the learned velocity field.
Finally, and only at this last step, ELF maps the denoised vectors back to the vocabulary through an "un-embedding" layer and outputs concrete tokens.
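A minimal sketch of this three-step inference loop, with toy stand-in modules in place of the paper's actual denoising network and un-embedding layer (a tiny MLP here, a Transformer in any real system): start from Gaussian noise in embedding space, integrate the velocity field with a simple Euler loop, and decode to tokens only after the final step.

```python
import torch
import torch.nn as nn

dim, vocab = 64, 1000

# Toy stand-ins for the denoising network and the final un-embedding layer.
velocity_net = nn.Sequential(nn.Linear(dim + 1, 128), nn.SiLU(), nn.Linear(128, dim))
decode_head = nn.Linear(dim, vocab)

@torch.no_grad()
def sample(seq_len: int, steps: int = 32) -> torch.Tensor:
    """Euler integration of the learned velocity field in embedding space,
    followed by a single decode step back to token ids."""
    x = torch.randn(1, seq_len, dim)                 # start from pure Gaussian noise
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((1, seq_len, 1), i * dt)
        v = velocity_net(torch.cat([x, t], dim=-1))  # predicted velocity at time t
        x = x + dt * v                               # move toward the clean embeddings
    logits = decode_head(x)                          # only now: embeddings -> vocabulary
    return logits.argmax(dim=-1)                     # token ids, shape (1, seq_len)

print(sample(seq_len=8))
```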
Unlike previous continuous diffusion language models, ELF never converts the vectors back to token space midway through denoising. By not interrupting the flow, it gives the diffusion dynamics maximum freedom. And because the whole process lives in vector space, techniques developed for image diffusion, such as classifier-free guidance (CFG), carry over almost unchanged.
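Because everything stays in vector space, classifier-free guidance transfers directly: compute one velocity with the condition and one without, then extrapolate between them. A hedged sketch (the velocity function's signature here is illustrative, not the paper's API):

```python
import torch

def guided_velocity(velocity_fn, x, t, cond, guidance_scale: float = 2.0):
    """Classifier-free guidance in embedding space:
    v = v_uncond + w * (v_cond - v_uncond)."""
    v_cond = velocity_fn(x, t, cond)     # velocity conditioned on, e.g., a prompt embedding
    v_uncond = velocity_fn(x, t, None)   # velocity with the condition dropped
    return v_uncond + guidance_scale * (v_cond - v_uncond)

# Toy usage with a dummy velocity function.
dummy_velocity = lambda x, t, cond: -x if cond is None else -0.5 * x
x = torch.randn(1, 8, 64)
print(guided_velocity(dummy_velocity, x, t=0.3, cond="prompt").shape)
```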
One network, two modes
Another clever aspect of ELF's design is that a single network handles both denoising and decoding, with the switch controlled by a "mode token."
During training, the same network spends 80% of its steps learning to denoise (with an MSE loss) and the remaining 20% learning to map final embeddings back to tokens (with a cross-entropy loss).
During inference, the network stays in denoising mode until the last step, then switches to decoding mode and translates the continuous vectors into output tokens. No separate decoder needs to be trained, and the whole pipeline stays simple and unified.
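A minimal training-step sketch of the "one network, two modes" idea under the stated 80/20 split; the mode embedding, toy MLP backbone, and exact way the mode signal is injected are illustrative assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

dim, vocab = 64, 1000
DENOISE, DECODE = 0, 1                                  # the two "mode tokens"

mode_emb = nn.Embedding(2, dim)                         # tells the network which job to do
backbone = nn.Sequential(nn.Linear(2 * dim + 1, 256), nn.SiLU(), nn.Linear(256, dim))
vocab_head = nn.Linear(dim, vocab)                      # only used in decode mode

def training_step(clean_emb, token_ids):
    """One shared-network step: ~80% of steps learn denoising (MSE on the
    velocity target), ~20% learn decoding (cross-entropy on token ids)."""
    b, n, _ = clean_emb.shape
    if torch.rand(()) < 0.8:                            # denoising mode
        t = torch.rand(b, 1, 1)
        noise = torch.randn_like(clean_emb)
        x_t = (1 - t) * noise + t * clean_emb           # flow-matching interpolation
        mode = mode_emb(torch.full((b, n), DENOISE))
        inp = torch.cat([x_t, mode, t.expand(b, n, 1)], dim=-1)
        v_pred = backbone(inp)
        return F.mse_loss(v_pred, clean_emb - noise)    # target velocity: data minus noise
    else:                                               # decoding mode
        t = torch.ones(b, n, 1)                         # decoding happens at the clean end
        mode = mode_emb(torch.full((b, n), DECODE))
        inp = torch.cat([clean_emb, mode, t], dim=-1)
        logits = vocab_head(backbone(inp))
        return F.cross_entropy(logits.reshape(-1, vocab), token_ids.reshape(-1))

loss = training_step(torch.randn(2, 8, dim), torch.randint(0, vocab, (2, 8)))
print(loss.item())
```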
ELF also introduces a self-conditioning mechanism: at each denoising step, the network can take its prediction from the previous step as a reference input instead of starting from scratch. This improves generation quality and, with almost no extra compute, also provides a ready-made conditioning signal for CFG.
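Self-conditioning can be sketched as follows: at each sampling step the previous estimate of the clean embedding is fed back as an extra input, so the network refines a guess rather than starting over. The zero-initialized first estimate and the way the estimate is reconstructed from the velocity are one common implementation choice, not necessarily the paper's.

```python
import torch

@torch.no_grad()
def sample_with_self_conditioning(velocity_fn, seq_len: int, dim: int, steps: int = 32):
    """Flow-matching sampling in which each step also sees the previous step's
    estimate of the clean embedding (self-conditioning)."""
    x = torch.randn(1, seq_len, dim)      # start from noise
    x1_est = torch.zeros_like(x)          # no estimate exists before the first step
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        v = velocity_fn(x, t, x1_est)     # previous prediction fed back as extra input
        x1_est = x + (1.0 - t) * v        # updated estimate of the clean endpoint
        x = x + dt * v
    return x

# Toy usage with a dummy velocity function.
velocity_fn = lambda x, t, x1_est: torch.zeros_like(x)
print(sample_with_self_conditioning(velocity_fn, seq_len=8, dim=64).shape)
```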
Experimental results: ahead of rivals with one-tenth of the training data
The experimental results of the paper are quite convincing.
The researchers use the standard benchmark setting in the diffusion-language-model literature: train on the OpenWebText corpus and measure quality with generative perplexity (lower is better, indicating more fluent and natural text) and token entropy (higher is better, indicating more diverse output).
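For readers unfamiliar with the two metrics, here is a hedged sketch of how they are typically computed in this literature (the scoring-model choice and exact formulas are the usual community convention, not taken from the paper): generative perplexity scores the generated text with a separate pretrained language model, while entropy measures how spread out the generated token distribution is.

```python
import math
from collections import Counter

def unigram_entropy(token_ids):
    """Entropy (in nats) of the empirical unigram distribution of generated
    tokens; higher means more diverse output."""
    counts = Counter(token_ids)
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values())

def generative_perplexity(nlls):
    """Perplexity from per-token negative log-likelihoods assigned to the
    generated text by a separate scoring LM (e.g., a pretrained GPT-2);
    lower is better."""
    return math.exp(sum(nlls) / len(nlls))

print(unigram_entropy([1, 2, 2, 3, 3, 3]))
print(generative_perplexity([3.2, 2.8, 3.0]))
```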
ELF reaches a generative perplexity of 24 with only 320 sampling steps. In contrast, today's mainstream discrete diffusion language models (such as MDLM and Duo) do worse at the same number of steps even after dedicated distillation training to speed up inference, while ELF uses no distillation at all.
The gap in training cost is even more striking. According to the paper's statistics, mainstream methods such as MDLM, Duo, and FLM were each trained on about 50 billion tokens, while ELF used roughly one-tenth as many.
ELF also shines on the more practical conditional generation tasks. On the WMT14 German-English machine translation benchmark, it reaches a BLEU score of 26.4, exceeding a same-scale autoregressive model (25.2) as well as competitors such as MDLM (18.4) and CDCD (24.9). On the XSum news summarization task, ELF ranks first on all three of ROUGE-1, ROUGE-2, and ROUGE-L.
Conclusion
Over the past two years, progress on diffusion language models has been concentrated almost entirely in the discrete space: more sophisticated masking strategies, more efficient decoding methods, larger-scale training. The continuous route has stayed on the margins because of its natural tension with the discrete nature of language.
ELF offers a different reference point: continuous diffusion is not an obstacle to language modeling but may be an underexploited advantage. Flows in continuous space are smoother, techniques accumulated in image generation are easier to borrow, and guidance and control are easier to apply. The healthy scaling behavior ELF shows (quality keeps improving from 100 million to 650 million parameters) also suggests there is considerable headroom left on this path.
Of course, ELF has so far been evaluated only on medium-scale models and academic benchmarks. Whether it can genuinely compete with today's strongest autoregressive large models at larger scales and on broader tasks remains to be seen. But on the current evidence, it gives a clear answer to one open question:
Continuous diffusion language models seem to have finally found the right approach.
For more details, please refer to the original paper.
This article comes from the WeChat official account "MachineHeart" (ID: almosthuman2014), authored by MachineHeart, which focuses on large models. It is republished by 36Kr with authorization.