New work from He Kaiming's team on diffusion models: discrete decoding only in the "last mile"
In image and video generation, diffusion models have become the mainstream approach. So why do they so easily produce problems such as misspellings and repeated words when generating text?
The reason is that text fundamentally consists of discrete tokens, while diffusion models are better suited to continuous data. To apply diffusion models to text generation, researchers have so far mainly followed two routes:
1. Discrete diffusion language models define the diffusion process directly in the discrete token space: for example, tokens are replaced by a MASK and gradually restored, or they are first perturbed toward an approximately uniform distribution and then progressively corrected. This line has been the mainstream in recent years and has achieved the better overall results.
2. Continuous diffusion language models first map tokens into continuous embedding vectors, perform the denoising in this continuous space, and finally map the vectors back to discrete tokens. This approach is theoretically more natural and closer to how visual diffusion models work, but in practice its results have long lagged behind the discrete methods.
To address this gap, the team of He Kaiming, an associate professor at the Massachusetts Institute of Technology and a Distinguished Scientist at Google DeepMind, has proposed "Embedded Language Flows" (ELF), a class of diffusion models built on continuous-time flow matching in a continuous embedding space.
Unlike existing diffusion language models, ELF stays in the continuous embedding space for nearly all time steps and maps to discrete tokens only at the final time step, through a weight-shared network. This design allows established techniques from image diffusion models to be reused directly.
Link to the study: https://arxiv.org/abs/2605.10938
The results show that continuous diffusion language models can be competitive with only minimal discretization: ELF achieves lower generative perplexity with fewer sampling steps and without distillation, while requiring only about one-tenth of the training tokens used by previous methods.
Figure | With fewer sampling steps and without distillation, ELF achieves lower generative perplexity than previous discrete diffusion language models (DLMs), while using 10× fewer training tokens.
First continuous generation, then discrete decoding
The core of ELF is to map discrete tokens into a continuous embedding space and, in that space, to model the denoising trajectory from Gaussian noise to clean embeddings with continuous-time flow matching. Only at the last time step does the model switch into decoding mode and map the result back to discrete tokens.
Figure | Schematic representation of ELF. Orange dots represent the data in the continuous embedding space, and purple lines show the denoising trajectory from Gaussian noise to clean embeddings. The discretization occurs only in the last time step (t = 1) via a network with shared weights.
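To make the flow-matching trajectory concrete: a standard continuous-time formulation (shown here only as an illustration; the paper's exact interpolation path and parameterization may differ) interpolates linearly between Gaussian noise and the clean embedding, and trains the network to predict the clean embedding from any intermediate point:

z_t = (1 − t) · ε + t · x,  with ε ~ N(0, I) and t ∈ [0, 1]
x̂ = f_θ(z_t, t),  trained so that x̂ ≈ x, e.g. L_MSE = E‖x̂ − x‖²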
During training, the team uses a pre-trained T5 encoder to convert text tokens into contextual continuous embeddings. Each embedding corresponds to one token, but it is not a specific vocabulary entry itself; it is the token's vector representation in context. ELF then models the denoising process in this continuous embedding space, i.e. the continuous flow path from noise to clean embeddings.
At inference time, ELF no longer calls the encoder. The model gradually generates the text representation in the continuous embedding space and, at the last time step, switches into decoding mode to emit tokens through the weight-shared network and a learnable inverse embedding matrix.
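A minimal sketch of how such contextual embeddings can be obtained with a pre-trained T5 encoder, using the Hugging Face transformers library; the checkpoint name and shapes here are illustrative assumptions, not the paper's exact setup:

```python
import torch
from transformers import AutoTokenizer, T5EncoderModel

# Illustrative checkpoint; the paper's exact encoder and size may differ.
tokenizer = AutoTokenizer.from_pretrained("t5-base")
encoder = T5EncoderModel.from_pretrained("t5-base").eval()

text = "Diffusion models are the mainstream in image generation."
batch = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    # One contextual embedding per token: shape (1, seq_len, d_model).
    # These serve as the "clean" targets x of the flow in embedding space.
    clean_embeddings = encoder(**batch).last_hidden_state
```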
A key design decision in ELF is to use a single network for both denoising and decoding, with the two roles distinguished by a binary mode token. During training, the model routes 80% of the time into the denoising branch, trained with an MSE loss, and 20% into the decoding branch, trained with a cross-entropy loss.
The team also introduces a self-conditioning mechanism: at inference, the model uses the previous step's prediction as a condition for the next denoising step instead of starting from scratch. This not only improves generation quality but also provides a ready-made conditioning signal for classifier-free guidance (CFG), at almost no additional computational cost.
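The following is a minimal sketch of how one training step with a shared denoise/decode network and a binary mode token could look. The `model` interface, the linear noise schedule, and the shapes are hypothetical; the snippet illustrates the 80/20 loss split described above rather than the paper's implementation:

```python
import torch
import torch.nn.functional as F

def training_step(model, clean_emb, token_ids, p_decode=0.2):
    """One illustrative ELF-style step: 80% denoising (MSE), 20% decoding (CE).

    model(z_t, t, mode) is a hypothetical shared network that returns either
    a predicted clean embedding (mode=0) or token logits (mode=1).
    """
    eps = torch.randn_like(clean_emb)            # Gaussian noise
    t = torch.rand(clean_emb.size(0), 1, 1)      # random time in [0, 1]
    z_t = (1 - t) * eps + t * clean_emb          # assumed linear flow path

    if torch.rand(()) > p_decode:
        # Denoising branch: predict the clean embedding, MSE loss.
        x_hat = model(z_t, t, mode=0)
        return F.mse_loss(x_hat, clean_emb)
    else:
        # Decoding branch: predict discrete tokens, cross-entropy loss.
        logits = model(z_t, t, mode=1)           # (B, L, vocab_size)
        return F.cross_entropy(logits.transpose(1, 2), token_ids)
```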
Figure | During training, discrete tokens are first encoded into clean embeddings x and then perturbed into z_t. ELF predicts x̂ from z_t. The model can be trained with either of two losses: the denoising loss L_MSE or the token-wise cross-entropy loss L_CE. At inference, ELF starts from Gaussian noise z_0 and denoises the embeddings step by step from z_t to z_{t + 1}. Only at the last step does ELF switch into decoding mode and project the final embedding into discrete tokens through an inverse embedding layer.
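Putting the pieces together, a simplified sampling loop with self-conditioning might look like the sketch below. The network interface, step schedule, and update rule are assumptions made for illustration, and CFG is omitted:

```python
import torch

@torch.no_grad()
def sample(model, seq_len, d_model, steps=32):
    """Illustrative ELF-style sampling loop (hypothetical interface, simplified math).

    model(z, t, mode, self_cond) is assumed to return a predicted clean embedding
    in denoising mode (mode=0) and vocabulary logits in decoding mode (mode=1),
    where the decoding head uses the learnable inverse embedding matrix.
    """
    z = torch.randn(1, seq_len, d_model)     # z_0: pure Gaussian noise
    x_hat = torch.zeros_like(z)              # self-conditioning signal, initially empty
    ts = torch.linspace(0.0, 1.0, steps + 1)

    for i in range(steps):
        t, t_next = ts[i].item(), ts[i + 1].item()
        # Predict the clean embedding, conditioned on the previous step's prediction.
        x_hat = model(z, t, mode=0, self_cond=x_hat)
        # One possible deterministic (ODE-style) update toward the prediction;
        # the paper also reports SDE-based samplers.
        z = z + (t_next - t) / (1.0 - t) * (x_hat - z)

    # Only at the very end: switch to decoding mode and emit discrete tokens.
    logits = model(z, 1.0, mode=1, self_cond=x_hat)   # (1, seq_len, vocab_size)
    return logits.argmax(dim=-1)
```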
Fewer sampling steps, lower training cost
The research team evaluated ELF on three types of tasks: unconditional text generation on OpenWebText (OWT), machine translation on the WMT14 German-English task, and news summarization on XSum.
For unconditional generation, the main model, ELF-B, has 105M parameters. In a system-level comparison on OWT, ELF-B reduces generative perplexity to 24 with only 32 sampling steps and without additional distillation, better than the other discrete and continuous diffusion language models in the comparison. In terms of training cost, ELF uses about 45.2M effective training tokens, compared with roughly 524.3M for baselines such as MDLM, Duo, and LangFlow, 550.5M for the distilled variants MDLM + SDTT and Duo + DCD, and 576.7M for FMLM.
Figure | System-level comparison. Under comparable experimental settings, ELF-B outperforms discrete and continuous diffusion language models (a); it is also competitive against baselines that require additional distillation training (b); and it uses significantly fewer training tokens (c).
For conditional generation, ELF-B reaches a BLEU score of 26.4 on the WMT14 German-English task; on XSum news summarization, its ROUGE-1, ROUGE-2, and ROUGE-L scores are 36.0, 12.2, and 27.8. Compared with autoregressive models and diffusion language models of similar size, ELF-B achieves the best results on both tasks.
Figure | Results for machine translation and news summarization. The team evaluated ELF-B on the WMT14 German-English translation task and the XSum summarization task and compared it with baselines of similar parameter size. † marks results taken directly from prior work, which serve as the reference numbers for the German-English task; ‡ marks results reproduced by the team with public codebases, which serve as the reference numbers for the XSum task. For XSum, standard errors across evaluation samples are reported where available. ELF achieves the best performance on both tasks.
The ablation experiments further show that contextual embeddings from a pre-trained encoder outperform plain token embeddings and learnable embeddings. The weight-shared denoise-and-decode network performs on par with a separately trained decoder while keeping the pipeline simpler. Among sampling methods, the SDE-based sampler beats the ODE sampler when the number of sampling steps is small. The team also found that scaling the model from 105M to 342M and 652M parameters yields lower generative perplexity at similar diversity, and higher text diversity at similar generative perplexity.
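For readers unfamiliar with the distinction: an ODE sampler takes deterministic steps along the learned flow, while an SDE sampler re-injects a small amount of fresh noise at each step. The sketch below contrasts the two in generic, simplified form; it is not the paper's specific sampler, and a principled SDE sampler would also adjust the drift term accordingly:

```python
import torch

def ode_step(z, x_hat, t, dt):
    """Deterministic Euler step along the flow (assuming a linear noise-to-data path)."""
    velocity = (x_hat - z) / (1.0 - t)
    return z + dt * velocity

def sde_step(z, x_hat, t, dt, noise_scale=1.0):
    """Schematic stochastic step: re-inject Gaussian noise each step.

    A full SDE sampler would also correct the drift; omitted here for brevity.
    """
    velocity = (x_hat - z) / (1.0 - t)
    noise = torch.randn_like(z)
    return z + dt * velocity + noise_scale * (dt ** 0.5) * noise
```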
Figure | Ablation experiments for the most important design decisions.
Limitations and future directions
The research team notes that ELF still has limitations, mainly in the following respects:
1. The model size is still limited
The evaluated models have 105M, 342M, and 652M parameters, and ELF has not been compared directly with large instruction-tuned models such as GPT-4, Claude, or Llama. The work therefore only demonstrates competitiveness within the class of diffusion language models, not a full replacement for mainstream autoregressive models.
2. The task scope is still limited
In the experiments, generative perplexity on OpenWebText is a proxy metric and does not directly reflect the preferences of real users. The results on WMT14 and XSum demonstrate performance on translation and summarization, but they do not cover complex reasoning, long-context dialogue, code generation, or multi-step interaction.
3. The continuous space relies on a pre-trained encoder
The team did test an encoder trained from scratch as well as non-contextual embeddings, but pre-trained contextual embeddings still performed best. This indicates that ELF's effectiveness partly relies on an existing pre-trained encoder rather than on a continuous language space learned entirely from scratch.
4. Real-world deployment costs have not been validated
The team reports sampling-step counts, training-token budgets, and automatic metrics, but not end-to-end latency, throughput, or GPU memory cost in a real serving setting, and there is no direct comparison with deployed autoregressive models. Whether the savings in sampling steps and training tokens translate into practical advantages therefore still needs to be validated in real deployments.