New work on diffusion models from Kaiming He's team: discrete decoding only in the "last mile"
Diffusion models have become the mainstream approach for image and video generation. So why, when applied to text generation, do they tend to produce garbled characters and repeated words?
The reason is that text is fundamentally composed of discrete tokens, whereas diffusion models excel at handling continuous data. To apply diffusion models to text generation, researchers have mainly taken two routes:
1. Discrete diffusion language models define the diffusion process directly in the discrete token space: for example, corrupting tokens with a MASK symbol and then gradually restoring them, or perturbing tokens toward a near-uniform distribution and correcting them step by step. This route has been the mainstream in recent years and generally performs better.
2. Continuous diffusion language models first map tokens to continuous embedding vectors, perform denoising in that continuous space, and finally map the result back to discrete tokens. This approach is theoretically more natural and closer to the practice of visual diffusion models, but its actual performance has long lagged behind the discrete route; the sketch after this list contrasts the two corruption processes.
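To make the distinction concrete, here is a minimal PyTorch-style sketch of the two kinds of corruption that training typically starts from. The masking probability, the embedding table, and the noise-blending schedule are generic placeholders for illustration, not any specific paper's setup.

```python
import torch

def discrete_corrupt(token_ids, mask_id, p):
    """Masking-style discrete diffusion: each token is independently replaced
    by a [MASK] id with probability p; the model then learns to un-mask."""
    mask = torch.rand(token_ids.shape, device=token_ids.device) < p
    return torch.where(mask, torch.full_like(token_ids, mask_id), token_ids)

def continuous_corrupt(token_ids, embed, t):
    """Embedding-space continuous diffusion: embed the tokens, then blend them
    with Gaussian noise (here t = 1 is clean data and t = 0 is pure noise,
    matching the convention used later in this article); the model learns to
    denoise in the embedding space."""
    x = embed(token_ids)                  # (B, L, D) clean embeddings
    noise = torch.randn_like(x)
    return (1 - t) * noise + t * x
```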
To address this problem, the team led by Kaiming He, an associate professor at the Massachusetts Institute of Technology and a Distinguished Scientist at Google DeepMind, introduced "Embedded Language Flows" (ELF), a diffusion model that operates in a continuous embedding space using continuous-time flow matching.
Unlike existing diffusion language models, ELF remains in the continuous embedding space for almost all time steps and maps to discrete tokens only at the final time step, through a weight-shared network. This design lets it directly reuse mature techniques from image diffusion models.
Paper link: https://arxiv.org/abs/2605.10938
The results show that continuous diffusion language models can be highly competitive even with minimal discretization. Without distillation, ELF achieves lower generative perplexity with fewer sampling steps, while requiring only about one-tenth as many training tokens as previous methods.
Figure | Without distillation, ELF achieves lower generative perplexity than previous diffusion language models with fewer sampling steps, while using roughly 10x fewer training tokens.
Generate continuously first, then decode discretely
The core idea of ELF is to first map discrete tokens into a continuous embedding space and, in that space, model the denoising trajectory from Gaussian noise to clean embeddings with continuous-time flow matching. Only at the last time step does the model switch to decoding mode and map the result back to discrete tokens.
Figure | Schematic of ELF. Orange dots represent data in the continuous embedding space, and purple lines show the denoising trajectory from Gaussian noise to clean embeddings. Discretization happens only at the final time step (t = 1), through the weight-shared network.
During training, the research team uses a pre-trained T5 encoder to convert text tokens into contextual continuous embeddings. Each embedding corresponds to one token, but it is not a vocabulary entry itself; it is a vector representation of that token in context. ELF then models, within this embedding space, the continuous flow path from noise to clean embeddings.
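Writing z_0 for the Gaussian noise, x for the clean contextual embedding, and z_t for the noised point at time t (the same notation as the training/inference figure further below), a standard linear flow-matching path would be z_t = (1 − t)·z_0 + t·x, with target velocity x − z_0. The exact interpolation schedule ELF uses is not spelled out in this article, so treat this as the textbook form rather than the paper's precise choice.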
At inference time, ELF no longer calls the encoder. The model gradually generates text representations in the continuous embedding space, switches to decoding mode at the final time step, and outputs tokens through the weight-shared network and a learnable inverse embedding matrix.
A key design choice in ELF is using a single network for both denoising and decoding, with a binary mode token distinguishing the two. During training, the model takes the denoising branch 80% of the time and the decoding branch 20% of the time, optimized with an MSE loss and a cross-entropy loss, respectively.
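A minimal PyTorch-style sketch of what one such training step could look like is below. The model interface (a callable returning both predicted embeddings and token logits), the per-sequence branch choice, and the linear noise path are illustrative assumptions, not ELF's released implementation.

```python
import torch
import torch.nn.functional as F

def elf_training_step(model, x_clean, token_ids, denoise_prob=0.8):
    """One illustrative training step for a shared denoiser/decoder.

    x_clean:   (B, L, D) contextual embeddings from a frozen text encoder
    token_ids: (B, L)    the discrete tokens that produced x_clean
    model:     hypothetical network returning (x_hat, logits) given (z_t, t, mode)
    """
    B, device = x_clean.size(0), x_clean.device

    # Corrupt clean embeddings along a linear path from Gaussian noise (t = 0)
    # to data (t = 1); the exact schedule is an assumption, not the paper's.
    t = torch.rand(B, 1, 1, device=device)
    z0 = torch.randn_like(x_clean)
    z_t = (1 - t) * z0 + t * x_clean

    # Per-sequence branch choice: ~80% denoising, ~20% decoding, signalled to
    # the shared network by a binary mode token.
    denoise_mode = torch.rand(B, device=device) < denoise_prob
    x_hat, logits = model(z_t, t.view(B), mode=denoise_mode)

    # Denoising branch: MSE regression onto the clean embeddings.
    mse = ((x_hat - x_clean) ** 2).mean(dim=(1, 2))
    # Decoding branch: per-token cross-entropy against the ground-truth tokens.
    ce = F.cross_entropy(
        logits.flatten(0, 1), token_ids.flatten(), reduction="none"
    ).view(B, -1).mean(dim=1)

    # Each sequence contributes only its branch's loss.
    return torch.where(denoise_mode, mse, ce).mean()
```

One appeal of this weight sharing is that a single backbone amortizes both tasks, so no separately trained decoder is needed.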
In addition, the research team introduced a self-conditioning mechanism: during inference, the model uses its prediction from the previous step as a condition for the next denoising step, rather than predicting from scratch. This not only improves generation quality but also provides a ready-made conditioning signal for classifier-free guidance (CFG) at almost no extra computational cost.
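A matching sketch of the inference loop with self-conditioning follows. The plain Euler update on a linear path, the `self_cond` argument, and the function names are again illustrative assumptions under the setup described above, not the paper's exact sampler.

```python
import torch

@torch.no_grad()
def elf_sample(model, B, L, D, steps=32, device="cpu"):
    """Illustrative sampler: denoise in embedding space, decode once at the end."""
    z = torch.randn(B, L, D, device=device)        # z_0: pure Gaussian noise
    x_prev = torch.zeros_like(z)                   # self-conditioning buffer
    ts = torch.linspace(0.0, 1.0, steps + 1, device=device)
    denoise = torch.ones(B, dtype=torch.bool, device=device)

    for i in range(steps):
        t, t_next = ts[i], ts[i + 1]
        # Predict the clean embeddings, conditioned on the previous prediction.
        x_hat, _ = model(z, t.expand(B), mode=denoise, self_cond=x_prev)
        # Euler step along the linear path toward the predicted clean point.
        v = (x_hat - z) / (1.0 - t)                # velocity implied by the path
        z = z + (t_next - t) * v
        x_prev = x_hat

    # Final step: switch to decoding mode and project the embeddings back to
    # discrete tokens through the learnable inverse embedding layer (assumed
    # here to live inside `model` and to return per-token logits).
    _, logits = model(z, ts[-1].expand(B), mode=~denoise, self_cond=x_prev)
    return logits.argmax(dim=-1)                   # token ids, shape (B, L)
```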
Figure | During training, discrete tokens are first encoded into clean embeddings x and then perturbed into z_t. ELF then predicts x̂ from z_t. The model is trained with one of two losses: the denoising loss L_MSE or the per-token cross-entropy loss L_CE. During inference, ELF starts from Gaussian noise z_0 and iteratively denoises the embeddings from z_t to z_{t+1}. Only at the last step does ELF switch to decoding mode and project the final embeddings back to discrete tokens through the inverse embedding layer.
Fewer sampling steps, lower training budget
The research team evaluated ELF on three types of tasks: unconditional text generation on OpenWebText (OWT), machine translation on WMT14 German-to-English, and news summarization on XSum.
For unconditional generation, the main model, ELF-B, has 105M parameters. In the system-level comparison on OWT, without additional distillation, ELF-B reduces generative perplexity to 24 with only 32 sampling steps, outperforming the other discrete and continuous diffusion language model baselines in the comparison. In terms of training budget, ELF uses roughly 45.2B effective training tokens, whereas baselines such as MDLM, Duo, and LangFlow use roughly 524.3B, the distilled versions MDLM+SDTT and Duo+DCD use 550.5B, and FMLM uses 576.7B.
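As a quick check on these figures, 524.3B / 45.2B ≈ 11.6, so the "one-tenth of the training tokens" framing earlier in the article is, if anything, slightly conservative relative to the non-distilled baselines.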
Figure | System-level comparison. ELF-B outperforms discrete and continuous diffusion language models under comparable experimental settings (a); it remains competitive against baselines that require additional distillation training (b); and it uses significantly fewer training tokens (c).
In conditional generation, ELF-B reaches a BLEU score of 26.4 on WMT14 German-to-English; on XSum summarization, it reaches ROUGE-1, ROUGE-2, and ROUGE-L scores of 36.0, 12.2, and 27.8, respectively. Compared with autoregressive models and diffusion language models of similar scale, ELF-B achieves the best results on both tasks.
Figure | Results on machine translation and summarization. The research team evaluated ELF-B on WMT14 German-English (De-En) translation and XSum summarization, comparing it with baseline models of similar parameter scale. † indicates results taken directly from prior work (the default result source for De-En); ‡ indicates results reproduced by the team using public codebases (the default result source for XSum). For XSum, where available, the team also reports the standard error across evaluation samples. ELF achieves the best performance in both settings.
Ablation experiments further show that contextual embeddings from a pre-trained encoder perform better than ordinary token embeddings or learnable embeddings. The weight-shared denoiser-decoder matches a separately trained decoder while keeping the pipeline simpler. On the sampling side, the SDE-inspired sampler outperforms the ODE sampler in few-step generation. The team also notes that scaling the model from 105M to 342M and 652M parameters yields lower generative perplexity at similar diversity, and higher text diversity at similar generative perplexity.
Figure | Ablation experiments on key design choices.
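To give a feel for the sampler ablation, the two update rules can be contrasted roughly as follows. Both functions are generic textbook-style steps under the same linear noise-to-data path used in the sketches above; they are not claimed to be the paper's exact samplers.

```python
import torch

def ode_step(z, x_hat, t, t_next):
    """Deterministic (ODE / Euler) update: move along the velocity implied by
    the current prediction x_hat of the clean embeddings."""
    v = (x_hat - z) / (1.0 - t)
    return z + (t_next - t) * v

def sde_style_step(z, x_hat, t, t_next):
    """Stochastic update: jump to the prediction, then re-inject fresh Gaussian
    noise at the next noise level; restarts of this kind are often more robust
    when only a few sampling steps are used."""
    noise = torch.randn_like(x_hat)
    return (1 - t_next) * noise + t_next * x_hat
```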
Limitations and future directions
The research team pointed out that the current ELF model still has limitations, mainly in the following aspects:
1. The model scale is still limited
The models evaluated are mainly 105M, 342M, and 652M parameters, and ELF has not been compared directly with large-scale instruction-tuned models such as GPT-4, Claude, or Llama. ELF therefore demonstrates competitiveness among diffusion language models of similar scale, not a wholesale replacement for mainstream autoregressive large models.
2. The task scope is still limited
In the experiments, generative perplexity on OpenWebText is a proxy metric and does not directly reflect real user preference. WMT14 and XSum illustrate performance on translation and summarization, but they do not cover complex reasoning, long-context dialogue, code generation, or multi-turn interaction.
3. The continuous space depends on a pre-trained encoder
The team tried an encoder trained from scratch as well as non-contextual embeddings, but pre-trained contextual embeddings still performed best. This suggests that part of ELF's effectiveness comes from the existing pre-trained encoder rather than from learning the continuous language space entirely from scratch.
4. The real deployment cost has not been verified
The research team reported sampling-step counts, training-token budgets, and automatic metrics, but not end-to-end latency, throughput, or GPU memory cost in real services, nor did they directly compare against the deployment setups of mature autoregressive models. Whether the savings in sampling steps and training tokens translate into real deployment gains therefore remains to be verified.
This article is from the WeChat official account “Academic Headlines” (ID: SciTouTiao), author: Academic Headlines. Republished by 36Kr with permission.