
Completely rewriting the Transformer: the "energy-based architecture" has arrived. Is the era of general reasoning coming?

新智元 (New Intelligence Yuan), 2025-07-15 11:39
The EBT architecture uses energy minimization to drive dynamic inference, delivers up to 35% better pre-training scalability, and supports multiple modalities.

UIUC, Stanford, and Harvard have jointly proposed a brand-new "Energy-Based Transformer (EBT)" architecture that breaks through the traditional feed-forward inference paradigm. It mimics human System 2 thinking through energy minimization, and its pre-training scaling performance is up to 35% higher than Transformer++'s. The revolution in next-generation AI infrastructure has arrived!

The Transformer has dominated the AI world for nearly a decade.

Now the era of Attention is fading, and true thinking is just beginning.

The Energy-Based Transformer (EBT), jointly proposed by top institutions such as UIUC, Stanford, and Harvard, has made a stunning debut.

It is the first to introduce the Transformer architecture into the Energy-Based Model (EBM) framework, completely breaking the old paradigm of "feed-forward equals inference".

Paper link: https://arxiv.org/pdf/2507.02092

EBT is neither a lightweight fine-tuning method nor an improvement on RNNs, but a completely different inference mechanism:

The model no longer "spits out the answer" all at once. Instead, like humans, it starts from a vague guess and gradually optimizes the inference path.

EBT is more efficient in training, more accurate in inference, and more robust to OOD (out-of-distribution) data. It significantly outperforms the feed-forward Transformer (Transformer++) in both training efficiency and the margin of improvement.

Moreover, EBT shows remarkable scaling performance in multimodal tasks such as text and images, and is expected to enable unsupervised, cross-modal general reasoning.

"One-time generation" vs "Dynamic optimization"

The traditional Transformer is a typical "feed-forward predictor": each inference runs in one pass, from the input prompt, through a fixed forward-propagation path, to the output result.

Whether the problem is simple or complex, the model completes inference along a fixed computation path with a fixed number of steps, and cannot adjust to the difficulty.

Each token gets exactly one decision, with no "second thoughts" and no "revisions".

It is like a student who must write the answer down in one pass, with no changes allowed.

In this mode, the model can neither "check its answer" nor "correct its reasoning", let alone "think deeply".

EBT completely subverts this mechanism.

EBT optimizes each prediction through multiple rounds:

  • It does not directly output a token; it starts from a random initial prediction.
  • The model computes an "energy value" between this prediction and the context (low energy means high compatibility, high energy means low compatibility).
  • Via gradient descent on the energy, it keeps updating the prediction, gradually "adjusting it toward a better fit".

This process repeats for multiple rounds until the energy converges, that is, until the model deems the prediction "reasonable enough".

In this way, every token EBT finally outputs is the product of dynamic computation and multi-step correction, converging toward the best answer like "walking downhill" on the energy landscape.

In other words, the model's "thinking" is framed as a small optimization problem: instead of outputting the answer all at once, it repeatedly tries, verifies, updates, and converges.
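To make the loop concrete, here is a minimal, hypothetical PyTorch-style sketch of this kind of energy-minimization inference. The function name, step count, learning rate, and toy squared-error energy are illustrative assumptions, not the paper's actual implementation:

```python
import torch

def ebt_style_predict(energy_fn, context, dim, steps=50, lr=0.1):
    # Start from a random initial prediction, then refine it by gradient
    # descent on the energy E(context, prediction); lower energy = better fit.
    y = torch.randn(dim, requires_grad=True)
    for _ in range(steps):
        energy = energy_fn(context, y)            # scalar "compatibility" score
        (grad,) = torch.autograd.grad(energy, y)  # d(energy) / d(prediction)
        with torch.no_grad():
            y -= lr * grad                        # move "downhill" in energy
    return y.detach(), energy_fn(context, y.detach()).item()

# Toy energy for illustration only: squared distance to a context embedding.
context = torch.randn(16)
energy_fn = lambda c, y: ((y - c) ** 2).sum()
prediction, final_energy = ebt_style_predict(energy_fn, context, dim=16)
print(final_energy)  # far lower than the energy of the initial random guess
```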

This "energy minimization" process is EBT's unprecedented System 2 Thinking — a slower, more accurate, and more general human - like deep - thinking ability.

"Three major leaps" of EBT

The thinking process of EBT endows it with fundamental breakthroughs in three key capabilities.

Dynamic computation

The traditional Transformer is static: every token and every prediction uses a fixed computation path and depth. Whether the problem is simple or complex, the amount of computation is the same.

EBT, by contrast, can allocate computation dynamically. Like a human, it handles simple problems quickly and invests more thinking in hard ones.

In other words, EBT can dynamically decide whether to "think a few more steps" or "converge quickly".
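As a rough sketch of what such dynamic compute allocation could look like, the refinement loop above can simply stop early once a step no longer improves the energy by much, so easy predictions take few steps and hard ones take more. Everything here (names, thresholds, the toy energy) is an assumption for illustration:

```python
import torch

def ebt_style_predict_adaptive(energy_fn, context, dim,
                               max_steps=64, lr=0.1, tol=1e-3):
    # Same refinement loop as before, but the step count is not fixed:
    # stop as soon as one gradient step improves the energy by less than `tol`.
    y = torch.randn(dim, requires_grad=True)
    prev_energy, steps_used = float("inf"), 0
    for _ in range(max_steps):
        energy = energy_fn(context, y)
        if prev_energy - energy.item() < tol:     # converged: stop "thinking"
            break
        prev_energy = energy.item()
        (grad,) = torch.autograd.grad(energy, y)
        with torch.no_grad():
            y -= lr * grad
        steps_used += 1
    final_energy = energy_fn(context, y.detach()).item()
    return y.detach(), final_energy, steps_used   # steps_used = compute spent

# Under this toy squared-error energy, a context close to the random init
# ("easy") converges in fewer steps than one far from it ("hard").
energy_fn = lambda c, y: ((y - c) ** 2).sum()
_, _, easy_steps = ebt_style_predict_adaptive(energy_fn, torch.zeros(16), 16)
_, _, hard_steps = ebt_style_predict_adaptive(energy_fn, 10 * torch.ones(16), 16)
print(easy_steps, hard_steps)  # hard_steps should be larger
```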

Uncertainty

Moreover, EBT's energy-based design lets it express uncertainty in continuous spaces.

A Transformer can use softmax to represent a probability distribution over discrete token outputs, but it struggles to express uncertainty in continuous modalities such as images and video.

By modeling the energy between a prediction and its context, EBT naturally expresses how "credible" a prediction is through its energy level.

This ability lets EBT identify which positions in continuous tasks such as images and video are "worth thinking harder about".
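A loose illustration of the contrast, using toy tensors rather than anything from the paper: softmax gives per-token probabilities in the discrete case, while in the continuous case an energy value can serve as the confidence signal.

```python
import torch

# Discrete case: softmax over logits yields an explicit probability
# distribution, so uncertainty over tokens is easy to read off.
logits = torch.randn(5)
token_probs = torch.softmax(logits, dim=-1)

# Continuous case (e.g. an image patch): there is no finite vocabulary to
# softmax over. An energy score plays a similar role: the lower the energy
# of a prediction given its context, the more "confident" the model is.
energy = lambda pred, ctx: ((pred - ctx) ** 2).mean()    # toy energy
context_patch = torch.randn(8, 8)
good_guess = context_patch + 0.05 * torch.randn(8, 8)    # close to context
bad_guess = torch.randn(8, 8)                            # unrelated guess
print(energy(good_guess, context_patch) < energy(bad_guess, context_patch))
# Almost surely prints tensor(True): the better guess has lower energy.
```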

Self-verification

Backed by the energy score, EBT comes with explicit self-verification built in.

Every time it makes a prediction, it computes an "energy score" that measures how well the prediction matches the context.

This score can be used not only to judge whether an answer is reliable, but also to generate multiple candidate answers and pick the one with the lowest energy as the final result.

This mechanism removes the dependence on external verifiers or reward functions entirely, building the "reflection" step into the model architecture itself.
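A hypothetical sketch of such best-of-N self-verification, reusing the ebt_style_predict helper from the earlier sketch (the candidate count and toy energy are again illustrative assumptions):

```python
import torch

def ebt_self_verify(energy_fn, context, dim, num_candidates=8, **kwargs):
    # Refine several independently initialized candidates with the earlier
    # ebt_style_predict helper, then keep the candidate whose final energy is
    # lowest, i.e. the one the model itself judges most compatible with context.
    best_pred, best_energy = None, float("inf")
    for _ in range(num_candidates):
        pred, energy = ebt_style_predict(energy_fn, context, dim, **kwargs)
        if energy < best_energy:
            best_pred, best_energy = pred, energy
    return best_pred, best_energy

# Usage with the same toy energy as before.
context = torch.randn(16)
energy_fn = lambda c, y: ((y - c) ** 2).sum()
answer, score = ebt_self_verify(energy_fn, context, dim=16)
```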

By contrast, traditional architectures are almost completely outclassed when it comes to "thinking ability".

Whether feed-forward Transformers or RNNs, they lack dynamic computation allocation, cannot model uncertainty in continuous spaces, and are nowhere near able to verify their own predictions.

Even the Diffusion Transformer, so sought after in generative modeling, only breaks through on "dynamic computation"; the other two capabilities remain blank.

EBT is, so far, the solution closest to a "human-like thinking process".

The more it thinks, the more accurate it gets! Transformer can't keep up.

EBT not only impresses on paper; it also delivers striking results in actual experiments.

Across data volume, batch size, and model depth, EBT learns faster, more economically, and more stably than the classic Transformer++.

Specifically, EBT's perplexity falls 35.98% faster with respect to data: to reach the same perplexity it needs only about two-thirds of the training corpus, a major cost advantage in the era of "data bottlenecks".

In distributed large-batch training, EBT converges 28.46% faster than Transformer++, and its depth-scaling efficiency is 5.29% higher, so efficiency does not fall behind as models grow deeper.

EBT also shows stronger robustness on OOD (out-of-distribution) data.

Through "multi-round inference" and "self-verification", EBT significantly alleviates the usual drop in generalization performance.

By contrast, the traditional Transformer++'s performance barely changes no matter how many inference steps it is given.

This means that even if EBT's pre-training metrics are slightly worse than the Transformer's, once it starts to "think" it can catch up, getting more accurate the more it thinks.

This "thinking improves generalization" behavior is unique among today's mainstream large-model architectures.

The more it thinks, the more general it gets! Transformer can't catch up.

As long as the "input" and "candidate predictions" are clearly defined, EBT can think and optimize in an unsupervised manner.

EBT's design relies on no supervision or extra reward signals, and it is not limited to text or code; it is naturally suited to any modality and task.

For text, EBT automatically learns the patterns of different words: easy words get low energy and hard words get high energy, naturally expressing semantic uncertainty.

In image tasks, EBT says goodbye to the hundreds of denoising steps required by diffusion models: with only 1% of the inference steps, it surpasses the Diffusion Transformer (DiT) on image denoising and classification.

Naturally, EBT can also handle "uncertainty" prediction and attention adjustment across video frames.

This unified, flexible, and efficient inference mechanism is likely to be the key to "general intelligence".

After all, the ultimate question about large models always exists: Can they really "think"?

EBT may be one of the first architectures qualified to answer this question.

References

https://x.com/AlexiGlad/status/1942231878305714462

https://x.com/du_yilun/status/1942236593479102757

https://arxiv.org/pdf/2507.02092

This article comes from the WeChat public account "New Intelligence Yuan" (新智元), authored by Haili, and is published by 36Kr with authorization.