
On the path to AGI, the GPU narrative persists, yet the Transformer fails to unlock the door.

Friends of 36Kr · 2025-12-22 08:18
What is the ticket to enter the AGI field?

After Google released Gemini 3, the capital market witnessed a "seesaw" game.

Buoyed by its flagship model of the year, Google at one point saw its market value soar by more than $500 billion. NVIDIA, the reigning computing-power hegemon, meanwhile watched $600 billion of market value evaporate over the same period.

This stark divergence seems to signal a shift in the wind: with Google's TPUs delivering remarkable results in Gemini 3, and even rumors that Meta plans to purchase TPUs, is the computing-power moat built by general-purpose GPUs starting to crumble? Is the hardware paradigm shifting from general-purpose GPUs to dedicated ASICs?

At Tencent Technology's 2025 Hi Tech Day, Wang Sheng, a partner at Inno Angel Fund, put these questions to representatives of domestic model and infrastructure providers including Muxi Technology, SiliconFlow, and StepStar, opening an inquiry into the "stability or transformation" of key AGI infrastructure.

Sun Guoliang from Muxi Technology believes that the narrative of GPUs still holds: "Short-selling by Wall Street might just be a 'bargaining tactic'."

In Sun Guoliang's view, GPUs and ASICs have been in a "super-stable state" for decades. He emphasized that in the current stage of rapid model iteration, the versatility of GPUs is their greatest advantage. "It's difficult to apply a dedicated product to a general-purpose scenario."

When asked about the "open-source vs. closed-source" debate, Hu Jian from SiliconFlow stressed that open-sourcing is a survival rule for "second- and third-tier players." "Just like Android competing with iOS. When DeepSeek emerged, the market was shaken, and everyone had to follow suit. This is a trend driven by competition from below."

Hu Jian stated that if models are not open-sourced and intelligence is concentrated in the hands of a few enterprises, customers will have to rely on these giants and bear higher costs and consequences.

On the algorithm side, Zhang Xiangyu, the chief scientist at StepStar, dropped a "bombshell": The existing Transformer architecture cannot support the next-generation Agents.

Zhang Xiangyu pointed out that in a long-text environment, the "intelligence" of the model declines rapidly as the context length increases. For general-purpose Agents that pursue infinite context, the one-way information flow mechanism of the Transformer has inherent flaws. StepStar's research indicates that the future architecture is likely to evolve towards the "Non-Linear RNN" (non-linear recurrent neural network).

Key highlights from the guests:

Sun Guoliang (Senior Vice President of Muxi Technology)

"Today's AI is using engineering to'reverse - engineer' basic science. Before breakthroughs in mathematics and brain science, we need to rely on GPUs for extensive engineering experiments."

Hu Jian (Co-founder and Chief Product Officer of SiliconFlow)

"If models are not open - sourced and intelligence is concentrated in the hands of a few enterprises, customers will have to rely on these giants and bear higher costs and consequences."

Zhang Xiangyu (Chief Scientist of StepStar)

"Today's Transformer is completely unable to support the next - generation Agents. The real challenge is not computational complexity, but the 'intelligence drop' - the longer the text, the dumber the model."

The following is a transcript of the round-table discussion, edited and abridged without changing the original meaning.

01

The Trillion-Dollar Question: GPU or TPU?

Wang Sheng (Partner at Inno Angel Fund, Chairman of Beijing Frontier International Artificial Intelligence Research Institute): Let's start with a recent bombshell. After Google released Gemini 3, its market value rose by over $500 billion, and people say Google is back. Yet over the same period, NVIDIA's market value evaporated by over $600 billion.

I'd like to ask Guoliang. Your company is at the forefront of domestic GPU development. How do you view this? Is the hardware paradigm starting to shift towards dedicated chips like TPUs or NPUs? Is it outright competition, or a mix of competition and cooperation?

Sun Guoliang: There is no absolute superiority or inferiority among architectures. The most important thing is the application scenario.

In terms of stability or transformation, the two architectures of GPUs and ASICs (dedicated chips) have been in a "super-stable state" for decades. ASICs also include BPUs, APUs, VPUs, DSPs, etc., each having its own advantages in its respective field.

However, today we are in a stage of rapid model iteration. In this stage, the versatility of GPUs is their greatest advantage. It's difficult to use a dedicated product in a general-purpose scenario because it can't handle the complexity.

Models are being updated at an extremely fast pace, sometimes weekly and at the slowest monthly. From our perspective, no foundation model has reached the point of "convergence" yet. For a long time to come, rapid model iteration will remain the norm.

Another issue is the fragmentation of scenarios. Customers' application scenarios are diverse and ever-changing. In such a fragmented landscape, GPUs and ASICs will coexist for a long time, but general-purpose GPUs will have better generalization and adaptability.

As for the fluctuations in NVIDIA's market value, to be honest, this might just be a smart "bargaining tactic" by Wall Street. Wall Street has already made its choice and propelled NVIDIA to the top of the world because, at this historical stage, versatility is still the mainstream.

02

The "Stitching" of the Middle Layer: Are Models Converging?

Wang Sheng: Hu Jian, your company sits at the junction between models on one side and computing power on the other. Does that cause an explosion of workload? For example, do you need to rebuild operators, compilers, and computational graphs? Also, from the perspective of customer usage, are models diverging or converging?

Hu Jian: SiliconFlow has its own cloud platform. The major difference between us and other domestic AI infrastructure providers is that we extensively use domestic chips, such as those from Moore Threads and Muxi Technology, to serve real customers.

Overall, model usage follows the "80/20 rule". Although new models emerge every week or two, calls are highly concentrated on a few models such as DeepSeek, Qwen, Kimi, and GLM.

Despite the rapid turnover of models, model structures are basically in a "gradually stabilizing state". For example, DeepSeek uses the MLA structure; structures like MLA and MQA are mostly variants of the Transformer. This is a significant advantage for domestic chips.
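
To give a sense of what such a variant looks like, here is a minimal sketch of MQA (Multi-Query Attention) in PyTorch. It is a generic textbook form, not DeepSeek's actual implementation (DeepSeek's MLA instead compresses K/V into a latent vector), and causal masking is omitted for brevity:

```python
import torch
import torch.nn.functional as F

def multi_query_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    # q: (batch, n_heads, seq, head_dim); k, v: (batch, 1, seq, head_dim).
    # Every query head shares the single K/V head, so the KV cache is
    # n_heads times smaller than in standard multi-head attention.
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5  # broadcasts over heads
    return F.softmax(scores, dim=-1) @ v                   # (batch, n_heads, seq, head_dim)
```

Because these variants change the KV-cache layout rather than the basic matmul-softmax-matmul pattern, supporting one on a new chip carries over most of the work for the others.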

If scenarios were highly diverse and not Transformer-based, it would be the era of CUDA, whose software stack has been refined over the past decade. However, since structures are relatively stable now, our core task is to help domestic chips achieve "end-to-end benchmarking" against NVIDIA chips of the same specifications.

About 70% of the work is relatively standard. Take quantization: previously, most domestic chips supported only INT8, but DeepSeek now uses FP8, so quantization solutions are general. Other examples include PD (prefill-decode) separation and the shared transfer of the KVCache.
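
As a rough illustration of why such schemes transfer across chips, here is a minimal sketch of symmetric per-tensor INT8 quantization, the scheme most earlier domestic chips supported. FP8 swaps the integer grid for an 8-bit floating-point format but keeps the same scale bookkeeping. This is a generic textbook scheme, not SiliconFlow's production code:

```python
import numpy as np

def int8_quantize(x: np.ndarray):
    # One scale maps the tensor's dynamic range onto the INT8 grid [-127, 127].
    scale = max(float(np.abs(x).max()), 1e-12) / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def int8_dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    # Dequantize back to float for the next unquantized op.
    return q.astype(np.float32) * scale
```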

The remaining 30% requires joint optimization around each chip's performance bottlenecks. For example, if a chip has weak operators or weak communication capabilities, we optimize operator fusion or the communication library. Overall, model structures are becoming more concentrated, and these optimization solutions are highly reusable in large-scale deployment.
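
As a generic example of the operator-fusion idea (not a Muxi- or SiliconFlow-specific kernel), consider a bias add followed by a GELU:

```python
import torch

def bias_gelu(x: torch.Tensor, bias: torch.Tensor) -> torch.Tensor:
    # Unfused: the add and the GELU run as separate kernels, materializing
    # an intermediate tensor in memory between them.
    return torch.nn.functional.gelu(x + bias)

# torch.compile can fuse the two pointwise ops into one kernel, cutting
# memory traffic -- the general shape of the operator-level work described
# above, applied where a particular chip's standalone operators are weak.
bias_gelu_fused = torch.compile(bias_gelu)
```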

03

The "Transformation" of Algorithms: Is Transformer the Ultimate Paradigm for AGI?

Wang Sheng: Xiangyu, you are an algorithm expert. I'd like to directly ask: Is the Transformer destined to be the ultimate paradigm for AGI? Currently, there are also paradigms like RetNet and Mamba in the academic community. Do they have value?

Zhang Xiangyu: First, let me give a conclusion: The current model architectures are indeed becoming more stable, but we are probably on the verge of a major transformation.

My latest research conclusion is that today's Transformer is not sufficient to support our journey to the next stage, especially in the era of Agents.

Let me explain the first part. Indeed, most current architectures have converged to the Transformer. There are incremental refinements such as Linear Attention and Sparse Attention aimed at efficiency, but the fundamental modeling capability remains the same.

Moreover, we have discovered a significant side effect: the real challenge in long-text scenarios is not computational complexity, but the rapid decline of the model's "intelligence" as the text length increases.

General-purpose Agents must face an "infinite stream" world: the context is effectively infinite, covering all experiences from childhood to adulthood. Yet today's Transformer, no matter how many tokens it claims to support, becomes almost unusable past 80,000-120,000 tokens in my tests. GPT-5 might do a bit better, but it eventually degrades too.

What's the underlying reason? The information flow in the Transformer is one-way.

All information can only flow from the (L-1)-th layer to the L-th layer. No matter how long the context is, the depth L of the model either doesn't increase or increases only slightly (in some of the latest architecture variants).

Think of human memory as a powerful compression mechanism: every word I say today is a function of all the information I have seen in the past. Such a complex function cannot be represented by a neural network with a fixed number of layers.
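
To make the depth argument concrete, here is the standard way to write it down (a sketch of the reasoning, not StepStar's formal result):

```latex
% Transformer: layer \ell reads only layer \ell-1, so the output at
% position t composes exactly L functions, no matter how large t is:
h_t^{(\ell)} = f_\ell\!\left(h_{1:t}^{(\ell-1)}\right), \qquad
h_t^{(L)} = (f_L \circ \cdots \circ f_1)(x_{1:t})

% RNN: the state at step t composes the nonlinear update g a total of
% t times, so the effective depth grows with the context length:
s_t = g(s_{t-1}, x_t) = g\bigl(g(\cdots g(s_0, x_1)\cdots, x_{t-1}), x_t\bigr)
```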

Wang Sheng: I understand what you mean. Have you completed this research?

Zhang Xiangyu: We have obtained very positive results from some small-scale experiments. The future architecture should be a combination of a short-window Transformer (for modeling short-term memory) and a large RNN (recurrent neural network, for modeling episodic memory), and it should be a "Non-Linear RNN". Of course, this poses a huge challenge to system efficiency and parallelism, requiring co-design of hardware and software.
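
A minimal sketch of what such a hybrid could look like, assuming a GRU as the stand-in nonlinear recurrent cell and a fixed attention window. The class and its parameters are hypothetical illustrations, not StepStar's architecture:

```python
import torch
import torch.nn as nn

class HybridMemoryBlock(nn.Module):
    """Sliding-window attention models short-term memory; a nonlinear
    recurrent cell compresses the entire past into a fixed-size state
    (episodic memory)."""

    def __init__(self, dim: int, window: int = 128, heads: int = 4):
        super().__init__()
        self.window = window
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.rnn_cell = nn.GRUCell(dim, dim)  # stand-in for a "non-linear RNN"

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        batch, seq_len, dim = x.shape
        state = x.new_zeros(batch, dim)
        outputs = []
        for t in range(seq_len):
            lo = max(0, t + 1 - self.window)
            ctx = x[:, lo:t + 1]                      # short-term window
            query = x[:, t:t + 1]
            attended, _ = self.attn(query, ctx, ctx)  # attention within the window
            state = self.rnn_cell(x[:, t], state)     # nonlinear compression of the full past
            outputs.append(attended.squeeze(1) + state)
        return torch.stack(outputs, dim=1)            # (batch, seq_len, dim)
```

The sequential loop is exactly the parallelism problem he mentions: unlike attention, the nonlinear state update cannot be trivially parallelized across time steps, hence the call for hardware-software co-design.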

04

Physical Bottlenecks: AI-Accelerated "Controlled Nuclear Fusion" and Ten-Thousand-GPU Clusters

Wang Sheng: Xiangyu's sharing was very impactful; I need some time to digest it. Due to time constraints, let me briefly discuss the issue of energy, because we have invested in StarRing Fusion.

After the detonation of the hydrogen bomb, people began exploring "controlled nuclear fusion", a pursuit that has now gone on for more than 70 years. It used to be said that "success is still 50 years away," but in the past two or three years the situation has reversed: the most optimistic estimate is 10-15 years, and a more sober one is 20 years.

How did this happen? This is highly related to AI.

Today, tokamak devices face two major problems:

First, how to generate a magnetic field strong enough to confine the plasma. This depends on materials, which is where AI for Science comes in: people are very optimistic that high-temperature and even room-temperature superconductivity can be achieved with AI's help in the next few years, which would remove a major obstacle.

Second, control of the plasma. The temperature inside reaches hundreds of millions of degrees, with numerous coils outside. How do you control it? It's a "black box" you cannot open and examine. Writing control programs used to be extremely complex, but with the advent of AI, people suddenly believe it is possible through reinforcement learning in simulation.

If we don't solve the energy problem, human civilization will be constrained. That is what makes this so compelling.

We've discussed chips, and now I'd like to discuss networks.

I'd like to hear about the scale of the networks you use to actually train and serve models: not just laboratory demos, but real-world results.

Moreover, NVIDIA has a multi-layer network layout, including NVLink, NVLink Switch, and InfiniBand. I'd like to know which layers our independently built networks cover.

Sun Guoliang: I believe the biggest challenge for AI infrastructure is to clearly define the product. Customers need general-purpose computing power that can support large-scale model training, inference, and service, rather than just a single GPU card.

We also have clusters of thousands of GPUs across the country. We have trained various models, including traditional models, MoE models, and non-Transformer architecture models.

Let me add something about energy. If energy is used to solve the computing power problem, China has a huge advantage.

The core reason is that today's models are a matter of engineering. Engineering rests on mathematical reasoning, and that mathematics ultimately derives from physiology and brain science. But there have been no significant breakthroughs in basic brain science and biology, so there is no major breakthrough in the mathematics, and in engineering we are left making "brute-force attempts".

Conversely, many of our current engineering attempts will "reverse-engineer" the evolution of basic science. This is a cycle. I believe that in the future, domestic computing power, basic energy, and open-source models will have even better prospects.

05

The Ultimate Game of Open-Source vs. Closed-Source

Wang Sheng: Hu Jian, let's talk about open-source versus closed-source. As I understand it, many of the models on SiliconFlow are open-source, while US tech giants mostly use closed-source models. Can open-source models compete with closed-source ones in the future? Are you worried that the most powerful models will be closed-source, squeezing your business space?

Hu Jian: It's easier to answer this question now because we were often asked this by investors when we first started.

When we founded the company, we were firm on two points: open-source would definitely gain momentum, and inference would become the mainstream.

The core factors in the open-source vs. closed-source debate are as follows:

First, the competitive landscape. Second- and third-tier enterprises usually need to open-source their models to avoid being completely marginalized by the industry leaders. Once they open-source, more people join in, which can reverse the situation, just like Android competing with iOS. When DeepSeek emerged, the market was shaken, and everyone had to follow suit. This is a trend driven by competition from below.

Second, demand. If intelligence is concentrated in the hands of a few enterprises, enterprise customers will have to bear higher costs and consequences. Enterprises have their own proprietary data and are reluctant to entrust it to closed-source models over privacy and security concerns. To keep more control over their data and reduce costs, the demand side will sustain the existence of open-source models.

Just as Android developed its own business model, open-source models will also find similar business models, such as advertising or services, in the future.

06

AGI on Mobile Phones: From Inference to Autonomous Learning

Wang Sheng: Xiangyu, StepStar has just released an Agent for Android phones, GELab-Zero. Is this more of an experiment, or can it truly land in the mobile phone industry?

Zhang Xiangyu (StepStar): The reason we are working