
Diffusion won't die, BERT will live forever. Karpathy had a late-night reflection: Should the autoregressive era come to an end?

新智元 | 2025-11-05 12:40
Google's Overlooked Gem and IBM's Prophecy: An Article Awakens Karpathy, Diffusion Models May Be the Next Step for LLMs.

The Irresistible Temptation for Karpathy!

Nathan Barry, a former Apple employee and a computer science graduate student at the University of Texas at Austin (UT Austin), came to an astonishing conclusion:

Essentially, BERT is just one step in text diffusion!

Building on RoBERTa, an "enhanced BERT", he successfully turned a representation-learning algorithm into a generative one.

After reading the post, Karpathy, a founding employee of OpenAI and the former AI director at Tesla, fell into deep thought:

Human thinking feels more autoregressive, a step-by-step process. But it's hard to say there isn't some more diffusion-like mechanism in our mental latent space.

Perhaps there's room for interpolation or further generalization between the two.

This generative piece of the LLM architecture still seems relatively "flexible".

However, Karpathy has recently been busy with the capstone project for Eureka Labs' "LLM 101n" course, a "take-home ChatGPT for $100", so he reluctantly set the idea aside:

Now I have to resist the urge to train nanochat with a diffusion model and not deviate from the main task to work on side quests.

Incidentally, later that same day, DeepSeek-OCR sparked a new round of ideas for him.

The Overlooked Gem of Google

When Nathan Barry first read the papers on language diffusion models, he was surprised to find that their training objectives were just a generalization of masked language modeling (MLM).

Since BERT in 2018, everyone has been accustomed to masked language modeling.

Preprint: https://arxiv.org/abs/1810.04805

An idea immediately popped into his mind: Can we fine-tune a BERT-like model to also do text generation?

Out of curiosity, he conducted a quick verification experiment. Then he found that someone had already done it - DiffusionBERT is basically this idea, but more rigorously implemented.

It's worth mentioning that DiffusionBERT was proposed about three years ago by researchers from Chinese universities, 100% made in China!

Preprint: https://arxiv.org/abs/2211.15029

Initially, diffusion models became popular in the field of image generation.

In image generation, a diffusion model first adds Gaussian noise to an image step by step (forward process), and then trains a neural network to iteratively denoise it (reverse process).
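As an illustration (not from the original article), a single forward-diffusion step in the image case can be sketched in a few lines, assuming the standard update x_t = sqrt(1 − β)·x_{t−1} + sqrt(β)·ε:

```python
import math
import random

def forward_step(x, beta):
    """One forward-diffusion step: shrink the signal slightly and add
    Gaussian noise, x_t = sqrt(1 - beta) * x_{t-1} + sqrt(beta) * eps."""
    return [math.sqrt(1 - beta) * v + math.sqrt(beta) * random.gauss(0.0, 1.0)
            for v in x]

# Repeatedly applying forward_step drives any "image" toward pure noise;
# the reverse process trains a network to undo one such step at a time.
pixels = [0.5, -0.2, 0.8]
noised = pixels
for _ in range(10):
    noised = forward_step(noised, beta=0.1)
```

With beta = 0 the step is the identity; as beta grows, the original signal is progressively drowned in noise, which is exactly the "forward process" the text describes.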

Applying this idea to the text field means we need to find ways to add noise to text and then eliminate it in stages.

The simplest implementation is a mask-based noise processing process:

In the forward process, we start from the intact text; at each step, following a preset schedule, a growing proportion of tokens (from 0% up to 100%) is randomly replaced with the special <MASK> token.

In the reverse (denoising) process, the model is trained to predict the original token behind each <MASK>. This is just masked language modeling (MLM), but with a dynamic masking rate.
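The forward (noising) half of this scheme can be sketched in a few lines; this is an illustrative reconstruction, not code from the original post:

```python
import random

MASK = "<MASK>"

def mask_tokens(tokens, p):
    """Forward-process step: replace a fraction p of tokens with <MASK>."""
    n_mask = round(p * len(tokens))
    idx = random.sample(range(len(tokens)), n_mask)
    noised = list(tokens)
    for i in idx:
        noised[i] = MASK
    return noised

tokens = "the cat sat on the mat".split()
print(mask_tokens(tokens, 0.5))
```

With p fixed at 0.15 this is exactly BERT's MLM corruption; letting p sweep from 0 to 1 across steps turns it into the diffusion forward process.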

To overcome the limitations of earlier left-to-right language models, BERT introduced masked language modeling (Masked LM).

The specific approach: randomly mask 15% of the tokens in each training input sequence and predict only those masked tokens.

In other words, BERT's MLM training objective can be regarded as a special case of text diffusion, one that uses a fixed masking rate.

As long as we let the masking rate vary over the full range from 0 to 1, BERT's training objective naturally extends into a complete text-generation process.

Generalization Is Everywhere: Turning a Self-supervised Model into a Generative One

The RoBERTa model, released in 2019, is an enhanced upgrade based on the original BERT.

Preprint: https://arxiv.org/abs/1907.11692

It adjusted the hyperparameters, expanded the training corpus, and simplified the training objective -

It only retains MLM (masked language modeling) and removes the "next sentence prediction" task.

Nathan Barry used HuggingFace's open-source libraries to load RoBERTa's pretrained weights and tokenizer, and fine-tuned the model with the Trainer class on the WikiText dataset (the full code can be found in the original post).

The current implementation uses 10 diffusion steps. For each training batch, a masking ratio p is randomly sampled from [1.0, 0.9, ..., 0.1], and that fraction of tokens is masked. This logic is encapsulated in a custom diffusion_collator:
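The post's actual collator isn't reproduced here; a minimal pure-Python sketch of the same logic might look like the following (the real version returns PyTorch tensors for the HuggingFace Trainer, and the MASK_ID value is an assumption, check tokenizer.mask_token_id):

```python
import random

MASK_ID = 50264   # RoBERTa's <mask> token id (assumed; verify with the tokenizer)
IGNORE = -100     # HuggingFace convention: loss is ignored at these label positions

def diffusion_collator(batch_ids, steps=10):
    """Corrupt one training batch: sample a masking ratio p from
    {1.0, 0.9, ..., 0.1}, replace that fraction of tokens with <MASK>,
    and supervise only the masked positions."""
    p = random.choice([i / steps for i in range(steps, 0, -1)])
    input_ids, labels = [], []
    for seq in batch_ids:
        masked, lab = [], []
        for tok in seq:
            if random.random() < p:
                masked.append(MASK_ID)
                lab.append(tok)        # predict the original token here
            else:
                masked.append(tok)
                lab.append(IGNORE)     # no loss on visible tokens
        input_ids.append(masked)
        labels.append(lab)
    return {"input_ids": input_ids, "labels": labels}
```

Sampling one p per batch (rather than per token) mirrors the schedule described above: each batch trains the model at one noise level, and across batches the model sees every level from nearly-clean to fully masked.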

During inference, start from an input of length 256: the first 16 positions hold the prompt's token IDs, and the remaining 240 are all <MASK>. Then gradually lower the masking ratio, and at each step predict the masked tokens, sample, and re-mask a smaller fraction.

The original post also includes simplified inference code.
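As an illustrative reconstruction of such a loop (not the original code), with predict_fn standing in for a RoBERTa forward pass plus sampling at the masked positions:

```python
import random

MASK_ID = 50264     # RoBERTa's <mask> token id (assumed)
SEQ_LEN = 256
PROMPT_LEN = 16

def diffusion_generate(prompt_ids, predict_fn, steps=10):
    """Iterative unmasking: start fully masked after the prompt, let the
    model fill every <MASK>, keep the predictions, then re-mask a shrinking
    fraction of the generated region and repeat."""
    ids = prompt_ids[:PROMPT_LEN] + [MASK_ID] * (SEQ_LEN - PROMPT_LEN)
    gen_region = list(range(PROMPT_LEN, SEQ_LEN))
    for step in range(steps):
        ids = predict_fn(ids)                        # model fills all masked slots
        p = 1.0 - (step + 1) / steps                 # masking ratio decreases each step
        remask = random.sample(gen_region, int(p * len(gen_region)))
        for i in remask:
            ids[i] = MASK_ID                         # re-mask for the next round
    return ids
```

At the final step p reaches 0, so nothing is re-masked and the canvas is fully populated, matching the schedule described above.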

After 30 minutes of training on an H200 GPU, given the prompt "Following their victory in the French and Indian War, Britain began to assert greater...", the model generated the following continuation:

...dominion over Europe beginning about the early 19th. There conflict took place on the island, between British and Irish Ireland. British officials administered British Ireland, a Celtic empire under the control of the Irish nationalist authorities, defined as a dominion of Britain. As the newly Fortic states acquired independent and powerful status, many former English colonies played their part in this new, British @-@ controlled colonial system. Following this period the Non @-@ Parliamentaryist Party won its influence in Britain in 1890, led by the support of settlers from the Irish colonies. Looking inwards, Sinclair, Lewis questioned, and debated the need to describe " The New Britain "

The generated text looks surprisingly coherent! Nathan Barry attributes most of the "oddities" to formatting quirks of the WikiText dataset itself, such as spaces around punctuation marks and hyphens being rendered as @-@.

In a head-to-head comparison, GPT-2 had a slight edge in output coherence and generation speed (about 9 seconds versus 13 seconds for RoBERTa Diffusion).

But RoBERTa Diffusion is completely unoptimized, which makes these results all the more surprising.

This proof of concept is clearly a success: combined with emerging techniques such as AR-Diffusion and skip-step sampling, and properly optimized, both generation quality and inference speed could improve substantially.

The Return of Diffusion Models

The experiment shows that a masked language model like RoBERTa, originally built for fill-in-the-blank tasks, can become a full-fledged generative engine simply by recasting variable-rate masking as a discrete diffusion process.

By progressively corrupting text with <MASK> tokens and training the model to iteratively denoise under increasing masking intensity, the standard MLM objective becomes a progressive text-generation process.

It's worth noting that even without touching the model architecture, RoBERTa can generate visibly coherent text just by changing the training objective during fine-tuning.

This strongly confirms an important insight: Essentially, BERT-based models are text diffusion models trained with a fixed masking rate.

Karpathy liked Nathan Barry's short post:

Although the post is short, it explains how simple text (discrete) diffusion models can be.

...

Many papers on diffusion models seem quite obscure, but if you strip away the mathematical formalism, you often end up with a simple basic algorithm.

For example, whether in continuous space (methods closer to flow matching) or in discrete-space schemes like this one, the model is still essentially the classic Transformer, but with bidirectional attention: following a noise schedule, it iteratively resamples and re-masks all tokens on the "token canvas" until a complete sample emerges at the final step.

Autoregressive generation keeps appending tokens to the token canvas, each time attending only to the existing context on the left;

diffusion-based generation instead repeatedly rewrites the entire token canvas, relying on bidirectional attention to refresh and update every position.
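The contrast can be reduced to the attention mask each scheme uses; a toy sketch:

```python
def causal_mask(n):
    """Autoregressive: position i may attend only to positions j <= i."""
    return [[j <= i for j in range(n)] for i in range(n)]

def bidirectional_mask(n):
    """Diffusion-style: every position attends to the whole token canvas."""
    return [[True] * n for _ in range(n)]
```

Everything else in the Transformer can stay the same; swapping the lower-triangular mask for an all-ones mask is what lets a diffusion model revise any position on the canvas at any step.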

From the perspective of the entire large language model (LLM) technology stack, there is still great potential in the generative field, with room for optimization and innovation.

Earlier this year, at Google I/O 2025, Google DeepMind released an experimental diffusion language model, Gemini Diffusion.

In terms of speed, diffusion language models have an obvious advantage. So much so that some net