
Can the Transformer support the next generation of Agents?

Key Points · 2025-12-22 16:08
Don't fall asleep in your comfort zone.

On December 18, 2025, the Tencent ConTech Conference and Tencent Hi-Tech Day were officially broadcast. Academicians of the Chinese Academy of Engineering, well-known experts and scholars, founders of leading technology companies, and well-known investors gathered together to discuss the opportunities and challenges in the intelligent era.

During the roundtable forum, when the host handed the microphone to Zhang Xiangyu, chief scientist of Jieyue Xingchen, and asked about the future of model architectures, he dropped a bombshell: the existing Transformer architecture cannot support the next generation of Agents.

Not long ago, Fei-Fei Li, a professor at Stanford University and known as the "Godmother of AI", bluntly pointed out in an in-depth interview that the existing Transformer architecture may struggle to produce high-level abstractions like the theory of relativity. In the next five years, the industry needs to find a new architectural breakthrough to enable AI to move from statistical correlation to real causal logic and physical reasoning.

Ilya Sutskever, the core creator of the GPT series and a former co-founder of OpenAI, also expressed the same judgment in a recent in-depth interview: The "scaling era" that simply relies on stacking computing power and data is hitting a bottleneck, and the industry is returning to the "research era" that focuses on underlying innovation.

In the past seven years, from Google's BERT to OpenAI's GPT series, and then to the suddenly emerging DeepSeek, almost all AI models that have shocked the world have been based on the Transformer. It sent NVIDIA's market value skyrocketing and enabled countless startups to raise enormous rounds of financing.

But now, those who understand it best are starting to question it.

It seems that humanity is once again on the eve of a paradigm revolution. When the marginal returns of the Scaling Law begin to diminish, and when models with trillions of parameters still cannot navigate the physical world the way humans do, we have to face this question:

Has the Transformer, which was supposed to lead us to AGI, reached its ceiling?

The Straight-A Student Who Only Knows How to Solve Problems

Before 2017, the mainstream methods for AI natural language processing (NLP) were still RNN (Recurrent Neural Network) and LSTM (Long Short-Term Memory Network). The way they process information is like a diligent reader who has to read one word after another in sequence, which is inefficient and makes it difficult to capture long-distance semantic associations.

In 2017, Google's paper "Attention Is All You Need" suddenly emerged and completely changed everything.

The Transformer architecture abandoned recurrence and introduced the "self-attention mechanism". It no longer reads in sequence but can focus on all the words in a sentence simultaneously and calculate the association weights between them.

This architecture made parallel computing possible. Given sufficient computing power (GPUs) and data, models exhibit striking emergent capabilities; this observation was later formalized as the Scaling Law.
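The mechanism described above can be sketched in a few lines. This is a toy single-head self-attention over raw inputs (no learned query/key/value projections, no multi-head splitting), meant only to show the two ideas in the text: every position scores every other position at once, and the output is a weighted mix over the whole sequence.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def self_attention(x):
    """Toy single-head self-attention: each position attends to all
    positions simultaneously, so the sequence is processed in parallel
    rather than word by word (queries, keys, and values are the inputs
    themselves in this simplified sketch)."""
    d = len(x[0])
    out = []
    for q in x:  # each position builds a query
        # Scaled dot-product score of this query against every key.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in x]
        weights = softmax(scores)  # association weights over all tokens
        # Output is a weighted mix of all value vectors.
        out.append([sum(w * v[j] for w, v in zip(weights, x))
                    for j in range(d)])
    return out
```

In a real Transformer, `x` would first be multiplied by learned Q/K/V matrices and the operation would run as one batched matrix product, which is exactly what makes it GPU-friendly.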

The combination of Transformer and GPU is like an internal combustion engine meeting oil, directly triggering an artificial intelligence wave on the scale of the third industrial revolution.

However, at its core, the Transformer is an ultimate statistician.

Fei-Fei Li pointed out that one of the most significant breakthroughs in generative AI is the discovery of the "next token prediction" objective function. It sounds elegant but is also very limited. The core logic of Transformer is probability prediction based on massive amounts of data. It has read all the books on the Internet, so when you jump off a cliff, it knows the next sentence should be "fall", not "fly".
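The "cliff" example can be made concrete with a toy decoding step. The logit values below are invented for illustration, not taken from any real model; the point is only that the model converts scores into probabilities and picks the statistically likeliest continuation, with no physics involved.

```python
import math

def softmax(logits):
    # Convert a dict of raw scores into a probability distribution.
    m = max(logits.values())
    exps = {w: math.exp(v - m) for w, v in logits.items()}
    s = sum(exps.values())
    return {w: e / s for w, e in exps.items()}

# Hypothetical scores a trained model might assign to the token after
# "you jump off a cliff, you ..." (numbers are made up).
logits = {"fall": 4.2, "fly": 0.3, "sing": -1.5}
probs = softmax(logits)

# Greedy next-token prediction: pick the highest-probability word.
next_token = max(probs, key=probs.get)
```

The model "knows" that cliffs lead to falling only because the training corpus says so, which is exactly the correlation-without-causality point made in the surrounding text.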

Ilya also offered a metaphor: current models are like a student who has practiced for ten thousand hours to win a programming competition. The student has memorized every algorithm and technique, seen every possible exam question, and covered every blind spot through data augmentation. He seems very strong and scores highly, but in essence he is only doing memory retrieval.

In contrast, a truly talented student may have only practiced for a hundred hours, but he has profound taste and intuition and real generalization ability. The current Transformer models are like that straight-A student who memorizes by rote. Once they encounter an unseen field, their performance will be greatly reduced.

Ilya believes this is because the models are missing some essential ingredient: they learn to satisfy the evaluation criteria without truly mastering reasoning.

Fei-Fei Li also gave a similar judgment: "In most current generative videos, the flowing water or swaying trees are not calculated based on Newtonian mechanics but emerge from statistics of massive amounts of data."

In other words, AI has just seen the appearance of flowing water countless times and imitated it. It doesn't understand the tension between water molecules or the acceleration due to gravity.

Transformer is a perfect curve fitter. It can infinitely approach reality but cannot deduce the rules behind reality. Because it only has correlation, not causality.

The Curse of Long Contexts and the Lack of Slow Thinking

In 2025, an obvious trend in the AI industry is long texts. But in Zhang Xiangyu's view, this may be a trap: "Our current Transformer, no matter how many tokens it claims to support when released, basically becomes unusable at 80,000... Even if the context length can be very long, the performance basically degrades at 80,000 tokens in tests."

The degradation here does not mean the model forgets; rather, its effective intelligence drops rapidly as the text grows longer.

Zhang Xiangyu revealed the underlying mathematical logic - the information flow of Transformer is unidirectional: "All information can only flow from the (L-1)th layer to the Lth layer. No matter how long the context is, the depth of the model will not increase. It only has L layers." Its thinking depth is fixed and will not become deeper just because the book gets thicker.
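Zhang Xiangyu's point can be put in toy accounting terms. This is a rough illustrative sketch, not a real cost model: per-token attention work grows linearly with context length (each new query scores every key), while the number of sequential transformation steps stays pinned at the layer count.

```python
def transformer_cost(num_layers, ctx):
    """Toy accounting for one new token in an L-layer Transformer.
    Attention work scales with context length, but information only
    flows from layer L-1 to layer L, so sequential depth is fixed."""
    attention_ops = num_layers * ctx  # one score per key, per layer
    sequential_depth = num_layers     # does not grow with the context
    return attention_ops, sequential_depth

ops_short, depth_short = transformer_cost(32, 8_000)
ops_long, depth_long = transformer_cost(32, 80_000)
# The ledger grows 10x; the thinking depth does not budge.
```

This is why a longer book makes the model work harder without making it think deeper, which is the asymmetry the quote is pointing at.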

This is similar to the value function emphasized by Ilya. He pointed out that the reason why humans are efficient is that we have an internal value function - you don't need to finish a whole game of chess to know that losing a piece is a mistake. You can get a signal in the middle of the process.

The current Transformer lacks this mechanism. It lays all information out flat and must rummage through its entire life's ledger every time it makes a decision. This resembles the fast, intuitive mode of human thinking: blurting out an answer, with no capacity for slow, deliberate reasoning.

Ilya believes that true intelligence is not just predicting the next token but being able to pre-judge the pros and cons of a path through an internal value function before taking action. For future Agents, they need to survive in an infinite stream of the world. If they continue to use the Transformer architecture that lays out all memories flatly, it will not only be unsustainable in terms of computation but also illogical.

Visual Aphasia and Physical Blind Spots

The crisis of Transformer is not limited to language and logic but also lies in its powerlessness in understanding the physical world.

Fei-Fei Li believes that "Language alone is not enough to build general artificial intelligence." When the existing Transformer deals with visual tasks, it often simply and crudely transplants the next-word prediction to the next-frame prediction, resulting in a lack of spatio-temporal consistency in the generated videos.

There is also a deeper contradiction here: Sample efficiency.

Ilya raised a question in the interview: Why can a teenager learn to drive in just a dozen hours, while AI needs massive amounts of data for training?

The answer lies in "prior knowledge". Humans have powerful prior knowledge and intuition bestowed by evolution (that is, the value function composed of emotions and instincts). We don't need to see a million car accidents to learn to avoid them. Our biological instincts give us a natural perception of the dangers in the physical world.

He Xiaopeng also expressed a similar insight at the conference: Books can't teach you to walk. Skills in the physical world must be learned through interaction.

The current Transformer models lack a world model based on physical and biological intuition. They try to cover up their lack of understanding of physical laws by exhausting all data. Ilya pointed out that the dividends of pre-training data will eventually be exhausted, and data is limited. When you scale up by 100 times, simple quantitative changes may no longer bring about qualitative changes.

Physical AI needs a "digital container" with a built-in 3D structure, causal logic, and physical laws, rather than a language model that only guesses the next frame based on probability.

Return to the Research Era

If the Transformer may be a dead end, then where is the way forward?

Ilya gave a macro judgment: We are saying goodbye to the "scaling era" (2020 - 2025) and returning to the "research era" (2012 - 2020). This is not a regression in history but a spiral ascent - we now have huge computing power, but we need to find a new formula.

This new formula will not be a simple patchwork of single technologies but a systematic reconstruction.

Fei-Fei Li's World Labs is committed to building a model with "spatial intelligence" and establishing a closed loop of seeing, doing, and imagining. The future architecture is very likely to be a hybrid: the core is highly abstract causal logic (implicit), and the interface is the colorful sensory world (explicit).

Zhang Xiangyu revealed a highly forward-looking "nonlinear RNN" direction. This architecture is no longer unidirectional but can circulate, ruminate, and reason internally. As Ilya envisioned, the model needs to have a "value function" like humans and conduct multi-step internal thinking and self-correction before outputting results.

Ilya believes that the future breakthrough lies in giving AI the human-like ability of "continuous learning", rather than leaving it a static pre-trained product. This requires a more efficient reinforcement-learning paradigm: shifting from the rote-memorizing straight-A student's simple imitation to the expert's intuition and taste.

If the underlying architecture undergoes a major change, the entire AI industry chain will also face a reshuffle.

The current hardware infrastructure, from NVIDIA's GPU clusters to various communication and interconnection architectures, is largely tailored for Transformer.

Once the architecture shifts from the Transformer to nonlinear RNNs or other hybrid modes of computation, dedicated chips may face challenges, and the flexibility of general-purpose GPUs will once again become a moat.

The value of data will also be re-evaluated. Video data, sensor data from the physical world, and interaction data from robots will become the new oil.

Conclusion

At the end of the interview, Fei-Fei Li said something meaningful: "Science is the non-linear inheritance of the thoughts of multiple generations."

We often like the single-hero myth. For example, we think Newton discovered the laws of physics, Einstein discovered the theory of relativity, and Transformer opened the AI era. But in fact, science is a river, with countless tributaries converging, changing course, and flowing back.

Transformer is a monument, but it may not be the end. It has shown us the dawn of intelligence, but its innate deficiencies in causal reasoning, physical understanding, and infinite context are destined to make it just a stepping stone on the road to AGI, not the ultimate key.

Fei-Fei Li said the industry needs to find a new architectural breakthrough, and Ilya said the Scaling era is over. Zhang Xiangyu said the Transformer cannot support the next-generation Agents. They are not completely denying its historical achievements but reminding us: Don't fall asleep in the comfort zone.

In the next five years, we may see the Transformer gradually fade into the background and become a sub-module, while a brand-new architecture that integrates spatial intelligence, embodied interaction, and in-depth logical reasoning will take the stage.

For technology companies in this field, this is both a huge challenge and a once-in-a-lifetime opportunity.

This article is from the WeChat official account "Key Points", author: Li Yue. Republished by 36Kr with permission.