
Is the Transformer, which has dominated AI for a decade, about to be shattered by its own creator?

新智元 2026-05-27 09:56
Nothing can stop humanity's yearning for AGI.

An 80-minute, boxing-style debate: a co-inventor of the Transformer steps into the ring in person to defend his work, while three challengers across from him lay out five major flaws. It is the most intense head-on confrontation in AI architecture in the past decade. Is the foundation of the architecture that dominated AI's golden decade already shaky?

Why has the Transformer dominated AI for so long?

Can new architectures really overcome its shortcomings in long context, memory, and reasoning?

What exactly does "post-Transformer" mean? A stronger memory mechanism, more efficient sequence modeling, or a complete overhaul from training through to systems?

On May 5 in San Francisco, Pathway hosted a debate staged as a boxing match.

This is not a metaphor; it's a real ring.

In one corner is Łukasz Kaiser, a co-inventor of the Transformer. In the other are advocates of new architectures for the "post-Transformer era".

Note one detail: Llion Jones, a fellow co-inventor of the Transformer and one of the original paper's eight authors (the "Transformer Eight"), stands in the corner opposite Kaiser.

The topic is simple: What will the next - generation AI architecture look like?

The venue was packed with researchers, entrepreneurs, and investors. The winner was decided not by a vote but by a "clapometer", a device that scores each side by the volume of its applause. The louder applause wins.

This was a head-to-head confrontation, no holds barred, with names named.

When the referee called the start, the architecture that has dominated global AI for nearly a decade was, for the first time, put in the dock, with one of its own creators speaking in its defense.

This heavyweight intellectual bout opens with the Transformer's five major flaws.

Tired of Transformer for a Long Time

Five Major Flaws

Łukasz Kaiser's identity gives this debate much of its weight.

He is a co-inventor of the Transformer.

He is one of the authors of the 2017 paper "Attention Is All You Need", which changed the entire AI landscape. Later, he participated in the actual engineering development of ChatGPT, the GPT series, and o1.

He is a party to the case, and he came to defend his own work.

The three challengers on the opposite side also have impressive backgrounds.

Llion Jones, another co-inventor of the Transformer and co-founder of Sakana AI.

Adrian Kosowski, Chief Scientific Officer of Pathway and inventor of the BDH architecture.

Mathias Lechner, Chief Technology Officer of Liquid AI and co-inventor of the liquid neural networks developed at MIT.

This is an extremely rare scene in the history of technology: the people who created the same thing fundamentally disagree about its future.

Kaiser started with an analogy.

He said that the Transformer's attention mechanism works like a librarian's card catalog.

You walk into the library and say what you are looking for (the query). The librarian checks the card catalog (the keys), finds the matching shelf location, and hands you the book (the value).

Simple. Efficient. Global retrieval.
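The librarian analogy can be made concrete with a small sketch of scaled dot-product attention for a single query. The toy catalog, the vector sizes, and the function names are assumptions made for illustration. Real models use learned projections, many heads, and batched tensor math rather than Python lists.

```python
import math

def softmax(scores):
    """Turn raw match scores into weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Scaled dot-product attention for one query vector."""
    scale = math.sqrt(len(query))
    # Compare the request against every catalog card (key).
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    # Blend the books (values) in proportion to how well each card matched.
    dim = len(values[0])
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return output, weights

# Hypothetical toy catalog: three cards, each pointing at one "book".
keys = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
query = [4.0, 0.0]  # a request that resembles the first card most

output, weights = attend(query, keys, values)
print([round(w, 3) for w in weights])
print([round(x, 3) for x in output])
```

Unlike a real librarian, attention does not pick one book. It returns a weighted blend of all of them, which is why every card has to be scored for every query.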

But the challengers ask: What if this library has 100 million books? If you have to go through all the cards for each query, can this system still hold up?

This is the O(n²) cost of attention in the sequence length n, the sword of Damocles hanging over the Transformer.
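A back-of-the-envelope count shows why the quadratic term bites. The sketch below assumes full self-attention, where each of n tokens is compared with all n tokens, and counts comparisons for one head in one layer. The function name is hypothetical. Causal masking roughly halves the count but leaves the growth quadratic.

```python
def attention_score_count(num_tokens):
    """Query-key comparisons for full self-attention (one head, one layer)."""
    return num_tokens * num_tokens

for n in (1_000, 10_000, 100_000):
    print(f"{n:>7} tokens -> {attention_score_count(n):,} comparisons")
```

Ten times more context means a hundred times more comparisons.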

The three challengers didn't simply say "Transformer is no good". They identified five specific open problems that the current Transformer architecture cannot solve at the design level.

Each one hits the nail on the head.

The challengers' sharpest metaphor targets the Transformer's flaws in memory and continual learning: "Groundhog Day".

In the movie "Groundhog Day", every time the protagonist wakes up, the world resets, and all yesterday's memories are gone.

Today's Transformer is the same.

At inference time, its weights are completely frozen.

Even if you talk to it for ten hours today and it picks up wonderful new knowledge, it starts the next conversation as a complete amnesiac.

The industry is currently throwing RAG (Retrieval-Augmented Generation) and ever longer contexts (backed by the KV cache) at this problem.

But this is not an architecture-level solution. It is a band-aid on the wound, paid for with expensive compute.
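To see why long context is an expensive band-aid, consider a rough estimate of KV cache size. The sketch assumes the cache stores one key and one value vector per token, per layer, per key/value head. The model shape below is hypothetical and was chosen only for illustration.

```python
def kv_cache_bytes(seq_len, num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Bytes needed to cache keys and values for one sequence.

    The leading factor 2 covers the two cached tensors (keys and values).
    """
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value

# Hypothetical model shape, not taken from the debate or from any specific model.
config = dict(num_layers=32, num_kv_heads=8, head_dim=128, bytes_per_value=2)

for seq_len in (8_000, 128_000, 1_000_000):
    gib = kv_cache_bytes(seq_len, **config) / 2**30
    print(f"{seq_len:>9} tokens -> {gib:.1f} GiB per sequence")
```

The cache grows linearly with context length and is discarded when the conversation ends, so nothing learned in it carries over to the next session.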

The five major flaws, each significant on its own, together form a complete indictment.

But an indictment is not a verdict.

Kaiser's Trump Card

If You Can Do It, Show the Curve

Facing the five attacks, Kaiser didn't refute them one by one.

He didn't say that O(n²) is not a problem, didn't say that catastrophic forgetting doesn't exist, and didn't say that Transformer is perfect.

He threw out a sentence that became the core of the entire debate:

Unless a post-Transformer architecture demonstrates a better scaling curve, the Transformer remains the mainstream.

The power of this statement is that it shifts the burden of proof back onto the challengers.

What is a scaling curve?

Simply put, it describes how much a model's capability improves as more compute and data are invested.

The core reason the Transformer has dominated for nearly a decade is not that it has no flaws, but that no other architecture has yet beaten its scaling curve.

This is the confidence that allows OpenAI to invest billions of dollars in training GPT and Anthropic to continuously expand the scale of Claude.
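One common way to make "a better scaling curve" measurable is to fit a power law, loss = a * compute^(-b), and compare the fitted exponents. The sketch below assumes that functional form and uses synthetic numbers in arbitrary compute units. None of the values are measurements of a real model, and the names `baseline` and `challenger` are hypothetical.

```python
import math

def fit_power_law(compute, loss):
    """Least-squares fit of loss = a * compute**(-b) in log-log space."""
    xs = [math.log(c) for c in compute]
    ys = [math.log(v) for v in loss]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope  # (a, b)

def predict(a, b, compute):
    """Loss predicted by the fitted power law."""
    return a * compute ** (-b)

# Synthetic data for illustration only (compute in arbitrary units).
compute = [1, 10, 100, 1000]
baseline = [3.2 * c ** -0.05 for c in compute]    # starts lower, flatter curve
challenger = [3.4 * c ** -0.07 for c in compute]  # starts higher, steeper curve

a_base, b_base = fit_power_law(compute, baseline)
a_new, b_new = fit_power_law(compute, challenger)

for c in compute:
    print(c, round(predict(a_base, b_base, c), 3), round(predict(a_new, b_new, c), 3))
```

In this made-up example the challenger loses at small scale and wins at large scale. That is exactly the kind of evidence Kaiser is asking for, and it only shows up if someone pays for the larger runs.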

Kaiser's logic is extremely clear:

You say Transformer has five problems? I agree.

But there is a gap between "has problems" and "should be replaced". To cross it, you need not five papers but a better scaling curve.

Then he mounted a more detailed defense, one with the grit of hands-on engineering.

Parallelism is the hard truth.

Last week, on the latest Nvidia hardware, Kaiser re-implemented a Transformer and several older RNNs and compared them.

A very small GRU ran 50 times slower than a much larger Transformer.

RNNs are indeed elegant, but their sequential execution is a disaster on current hardware.

If there really is a better architecture of this kind, proving it takes 50 times as long, and most labs don't have that patience.
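The parallelism argument comes down to data dependencies, which a toy comparison can show. The sketch below uses a plain tanh recurrence as a stand-in for a GRU and a scalar causal attention. Both are simplifications assumed for illustration, and the weights are arbitrary. It demonstrates the dependency structure only, not the 50x timing.

```python
import math

def rnn_states(inputs, w_in=0.5, w_rec=0.9):
    """Toy recurrence: state t needs state t-1, so the steps must run in order."""
    h = 0.0
    states = []
    for x in inputs:
        h = math.tanh(w_in * x + w_rec * h)
        states.append(h)
    return states

def causal_attention_at(inputs, t):
    """Toy causal attention output at position t.

    Reads only the raw inputs up to t, never another position's output.
    """
    scores = [inputs[t] * inputs[j] for j in range(t + 1)]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return sum(e * inputs[j] for j, e in enumerate(exps)) / total

inputs = [0.2, -1.0, 0.7, 1.5]

# Attention positions can be computed in any order (or all at once on a GPU).
in_order = [causal_attention_at(inputs, t) for t in range(len(inputs))]
reversed_order = {t: causal_attention_at(inputs, t) for t in reversed(range(len(inputs)))}

print(rnn_states(inputs))
print(in_order == [reversed_order[t] for t in range(len(inputs))])
```

Because no attention output waits on another, a whole sequence can be processed in one parallel pass during training, while the recurrence needs one step per token.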

Ten years of engineering accumulation.

It's not just GPU optimization. The entire AI engineering stack, including compilers, training frameworks (PyTorch, JAX), inference engines (vLLM, TensorRT-LLM), and quantization tools, is built around the Transformer.

Changing the architecture means starting all over again.

Implicit "continual learning" is already happening.

Kaiser pointed out that, after large-scale pre-training, the in-context learning a Transformer performs during its forward pass can be shown mathematically to emulate the gradient-descent updates normally computed through backpropagation.

In other words, while you say it can't learn, it is quietly learning in another way.
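The claim about in-context learning can be checked by hand in a deliberately simplified setting. The sketch assumes a linear model, squared-error loss, zero initial weights, a single gradient step, and unnormalized linear attention (no softmax). Under those assumptions the two predictions coincide exactly. Whether full softmax Transformers do the same thing is beyond what this sketch shows. The data and learning rate are hypothetical.

```python
def gd_step_prediction(examples, query, lr):
    """Prediction after one gradient step on squared error, starting from w = 0."""
    dim = len(query)
    # Loss = 0.5 * sum((w.x - y)^2); at w = 0 the gradient is -sum(y * x).
    grad = [-sum(y * x[i] for x, y in examples) for i in range(dim)]
    w = [-lr * g for g in grad]
    return sum(wi * qi for wi, qi in zip(w, query))

def linear_attention_prediction(examples, query, lr):
    """Unnormalized linear attention with keys = x_i and values = y_i, scaled by lr."""
    return lr * sum(
        y * sum(xi * qi for xi, qi in zip(x, query)) for x, y in examples
    )

# Hypothetical in-context examples (x, y) generated from y = 2*x0 - x1.
examples = [([1.0, 0.0], 2.0), ([0.0, 1.0], -1.0), ([1.0, 1.0], 1.0)]
query = [0.5, 2.0]

gd = gd_step_prediction(examples, query, lr=0.1)
attn = linear_attention_prediction(examples, query, lr=0.1)
print(gd, attn)
```

The "learning" here lives in the activations of one forward pass. The weights themselves never change, which is why the challengers still count it as Groundhog Day.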

His defense is not "Transformer is always the optimal solution", but "Transformer is the optimal solution now, unless you prove otherwise".

Then he threw out a sentence that left the opponents speechless:

Maybe it will be Transformer itself, rather than you, that finds the next architecture.

The whole audience laughed.

But everyone could tell that he was serious.

AI: An Unstoppable Bright Future

In Kaiser's closing statement, he didn't say "Transformer is always the optimal solution". He said, "Currently, Transformer still wins."

The word "currently" is the only crack he left for the challengers.

More subtly, he handed over a weapon that originally belonged to his own camp.

The post-Transformer camp's biggest weakness has been the lack of large-scale engineering and hardware validation: new architectures run slowly, and no one wants to redesign chips for them. But Kaiser himself admitted that this barrier is coming down:

AI agents have now learned to write extremely difficult CUDA and Triton kernels.

Even if a new architecture is initially 50 times slower, you can hand the code to an agent, and in a short time it can produce a dedicated kernel that comes close to saturating the GPU.

The barrier of the hardware lottery is being smashed by the agent development ecosystem itself.

This means that once someone produces a better perplexity curve with a post-Transformer architecture on extremely long-context tasks of millions or tens of millions of tokens, even a slight edge, magnified by scaling, would deal a fatal blow to the old empire.

Kaiser even proposed a unified test standard: use perplexity to measure the learning ability of every architecture under the same conditions.

"We should reach a consensus on this and then prove that our own architectures are better."

The subtext of this sentence is: The challenge has officially begun.
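Perplexity, the yardstick Kaiser proposed, is the exponential of the average negative log-probability a model assigns to the tokens that actually occur. Lower is better. The sketch below uses the standard definition with natural logarithms. The two sets of token probabilities are invented for illustration.

```python
import math

def perplexity(token_log_probs):
    """exp of the mean negative natural-log probability of the observed tokens."""
    if not token_log_probs:
        raise ValueError("need at least one token")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))

# Hypothetical probabilities two models assign to the same four held-out tokens.
model_a = [math.log(p) for p in (0.50, 0.25, 0.50, 0.25)]
model_b = [math.log(p) for p in (0.60, 0.30, 0.55, 0.30)]

print(round(perplexity(model_a), 3))
print(round(perplexity(model_b), 3))
```

A fair comparison under this metric requires the same tokenizer, the same evaluation text, and the same context length for every architecture. That is the "same conditions" part of the proposal.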

And Jones' last sentence was even more direct:

Today, I have no reason to doubt my belief that there is something better. When that breakthrough comes, all of us will enter the post-Transformer era, and Łukasz is no exception, because he will