Get 15% more computing power without adding a single GPU: the large-model industry is "operating" on the network
In the past two years, there has been only one standard move in the computing power arms race across the entire industry: buying more GPUs, building larger clusters, and piling up higher computing power.
However, this path is now being re-examined.
Recently, Zhipu publicly unveiled, for the first time, an architectural innovation verified in a production cluster: the ZCube networking architecture.
According to the disclosed data, without adding a single GPU, replacing a single server, or changing a single line of application code, the cluster's inference throughput rose by 15%, and the P99 tail latency of TTFT (Time to First Token) fell by 40.6%. These figures come from real-world production traffic, not laboratory simulations.
For a large-model API platform serving millions of developers, this means the same hardware infrastructure can handle 15% more concurrent requests per second, and queuing time during traffic peaks drops significantly. The roughly 40% reduction in P99 tail latency directly determines how much of the "lag" perceived by end users can be eliminated.
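To make the arithmetic concrete, here is a rough back-of-the-envelope sketch in Python. Only the 15% and 40.6% ratios come from the figures cited above; the baseline of 1,000 requests per second and the 2.0-second baseline P99 TTFT are hypothetical values chosen purely for illustration, not numbers disclosed by Zhipu.

```python
# Ratios reported in the article.
THROUGHPUT_GAIN = 0.15        # +15% inference throughput
P99_TTFT_REDUCTION = 0.406    # -40.6% P99 time to first token


def project(baseline_rps: float, baseline_p99_ttft_s: float) -> dict:
    """Apply the reported ratios to a (hypothetical) baseline."""
    new_rps = baseline_rps * (1 + THROUGHPUT_GAIN)
    new_p99 = baseline_p99_ttft_s * (1 - P99_TTFT_REDUCTION)
    return {
        "rps": new_rps,
        "extra_rps": new_rps - baseline_rps,
        "p99_ttft_s": new_p99,
    }


if __name__ == "__main__":
    # Hypothetical baseline: 1,000 requests/s and a 2.0 s P99 TTFT.
    result = project(baseline_rps=1000.0, baseline_p99_ttft_s=2.0)
    print(f"requests/s after: {result['rps']:.0f}")
    print(f"extra requests/s: {result['extra_rps']:.0f}")
    print(f"P99 TTFT after:   {result['p99_ttft_s']:.3f} s")
```

Under those assumed baselines, the same hardware would absorb about 150 additional requests per second, and the slowest 1% of requests would see their first token in roughly 1.19 seconds instead of 2.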
What draws even more attention within the industry is the change in cost structure. According to Zhipu's disclosure, the ZCube architecture requires one-third fewer switches and optical modules than the original solution, and the larger the cluster, the greater the absolute savings. In a market where inference demand keeps growing rapidly and computing power supply is generally tight, an efficiency gain that comes from "changing only the networking, not the hardware" amounts to a very low-cost revaluation of existing computing power assets.
Zhipu is not the only one squeezing computing power
Zhipu has disclosed only limited technical details this time, but the core logic is clear enough: when thousands or even tens of thousands of GPUs in a cluster process inference requests simultaneously, every cross-card transfer of KV Cache and every data synchronization has to pass through the interconnection network between GPUs. The efficiency ceiling of this network directly determines how much real computing power the GPUs can deliver. ZCube's idea is to re-plan this "road network" at the level of topological design and eliminate congestion at the root, rather than waiting for congestion to occur and then trying to relieve it.
Almost at the same time, another event lent further weight to this view.
OpenAI, in collaboration with five industry giants, NVIDIA, AMD, Intel, Microsoft, and Broadcom, officially released the MRC (Multi-Path Reliable Connection) network protocol. It is an open network protocol for ultra-large-scale AI clusters and has been deployed in all of OpenAI's largest supercomputing clusters, including the Oracle supercomputer in Abilene, Texas, and the Microsoft Fairwater supercomputer, to train cutting-edge models such as ChatGPT.
Looking at these two events together, they point to the same judgment: when GPU clusters leap from the ten-thousand-card level to the one-hundred-thousand-card level, the network is no longer a passive "connector" but a core variable restricting overall efficiency.
However, their technical paths are completely different. MRC optimizes the "traffic rules" at the protocol layer; ZCube rebuilds the "road network" at the architecture layer, eliminating the structural root causes of congestion through topological design. One is software-based and the other is hardware-based, but they reach the same goal by different means.
Taking a broader view, "digging efficiency out of infrastructure and system architecture rather than piling up hardware" is quietly becoming an industry-wide shift.
On the hardware side, NVIDIA's latest-generation Blackwell Ultra architecture, through the NVFP4 precision format and attention-layer acceleration, achieves several times the throughput of the base-version GB200 on the DeepSeek-R1 inference task. Google's seventh-generation TPU, Ironwood, delivers single-chip training and inference performance more than four times that of its predecessor, Trillium.
Among chip startups, a group of non-GPU architectures designed specifically for inference is also gaining ground faster. Groq, which focuses on ultra-low latency, has its LPU reaching 300 tokens per second on Llama 2 70B, ten times faster than an H100 cluster. Cerebras, the wafer-scale chip company, claims that its inference speed has exceeded that of NVIDIA Blackwell in multiple tests.
At the level of model architecture, Tongyi Qianwen's Qwen3-Next cuts training cost to less than one-tenth of the previous level through a hybrid attention mechanism and a highly sparse MoE design, while increasing long-context inference throughput more than tenfold. The sparse attention technology launched by DeepSeek makes long-text inference in the new model version two to three times faster than in the previous generation and almost halves the API call cost.
These explorations have a common feature: they no longer rely solely on the lever of "buying more cards" but seek greater output multiples from the existing computing power stock and limited new investment.
When "buying cards" is no longer the only answer
This shift from "piling up hardware" to "digging for efficiency" is having a substantial impact on the upstream supply chain.
The most direct variable comes from the network equipment side. The ZCube solution reduces the use of switches and optical modules by one-third, and the MRC protocol promotes replacing the traditional three- to four-layer architecture with two-layer switch networking. Taken together, the two mean that the procurement logic of AI clusters will undergo a structural adjustment: demand for high-end switches will shift from "more layers" to "fewer layers and greater port density", and optical modules will concentrate more quickly at rates of 800G and above.
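Why fewer layers go hand in hand with higher port density can be illustrated with the textbook non-blocking folded-Clos (fat-tree) model. The Python sketch below is only that generic model: it is not ZCube's topology (which Zhipu has not detailed) and not anything defined by MRC. The function names, the switch radix `k`, and the assumption of two optical modules per inter-switch link are all illustrative choices.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FabricSize:
    tiers: int
    hosts: int
    switches: int
    inter_switch_links: int

    @property
    def optical_modules(self) -> int:
        # Assumption: one pluggable module at each end of every switch-to-switch link.
        return 2 * self.inter_switch_links


def _check_radix(k: int) -> None:
    if k < 2 or k % 2:
        raise ValueError("switch radix k must be an even number >= 2")


def two_tier(k: int) -> FabricSize:
    """Leaf-spine fabric built from k-port switches at full bisection bandwidth."""
    _check_radix(k)
    leaves, spines = k, k // 2          # each leaf: k/2 ports down, k/2 ports up
    hosts = leaves * (k // 2)
    links = leaves * (k // 2)           # leaf-to-spine links
    return FabricSize(2, hosts, leaves + spines, links)


def three_tier(k: int) -> FabricSize:
    """Classic k-ary fat tree: k pods, each with k/2 edge and k/2 aggregation switches."""
    _check_radix(k)
    edge = agg = k * (k // 2)
    core = (k // 2) ** 2
    hosts = edge * (k // 2)
    links = edge * (k // 2) + agg * (k // 2)   # edge-to-agg plus agg-to-core
    return FabricSize(3, hosts, edge + agg + core, links)


if __name__ == "__main__":
    for fabric in (three_tier(64), two_tier(512)):
        print(fabric, "modules per host:", fabric.optical_modules / fabric.hosts)
```

In this idealized model, a three-tier fabric of 64-port switches connects 65,536 hosts using 5,120 switches, while a two-tier fabric of 512-port switches connects 131,072 hosts using only 768 switches and needs half as many inter-switch links per host. That is the sense in which demand moves toward fewer layers and denser ports; real deployments differ in oversubscription, rail design, and port speeds.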
In fact, market data is already bearing this trend out. According to LightCounting, shipments of 800G optical modules will double year-on-year in 2025, and 1.6T optical modules will begin shipping; 800G shipments are expected to double again in 2026, while 1.6T is expected to jump from a small base in 2025 to the level of tens of millions of ports.
From the capital market's perspective, AI network infrastructure is being upgraded from a "supporting project" for ten-thousand-card clusters to a core value link in the industrial chain. Some institutions predict that total sales of data center switches will increase by 86% year-on-year in 2026. The combined 2026 capital expenditure plans of the four major cloud providers, Google, Amazon, Microsoft, and Meta, will reach hundreds of billions of dollars. Coupled with the long-term trend of the MRC protocol accelerating the replacement of InfiniBand by Ethernet in supercomputing clusters, the 800G/1.6T optical module industrial chain, high-density Ethernet switches, related chips, and connector segments are entering a window in which the structure of demand is being reshaped.
Zhang Youyu, secretary-general of the AI Special Committee of the Beijing Computer Society and a specially appointed researcher at Peking University, told a reporter from Science and Technology Innovation Board Daily that, viewed over the longer term, Zhipu's public disclosure of its ZCube practice carries two implications for the industry.
The first is at the technical level. It verifies with real-world production data that in clusters of thousands or even tens of thousands of cards, the network architecture itself can be an independent efficiency lever, and that the marginal cost of the change is extremely low. While the entire industry is burning money on GPU procurement, this kind of "small effort, large return" efficiency gain is obviously more cost-effective than placing another chip order.
The second is at the business level. For platform companies with a large installed base of GPUs, hardware depreciation is a fixed cost. Whoever can squeeze more token output from existing assets can widen its cost advantage in a market where API prices keep falling. At the scale of millions of concurrent requests, Zhipu's 15% increase in throughput and one-third savings in network hardware translate into a considerable reduction in operating costs.
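The fixed-cost argument can be put into numbers with a minimal Python sketch. It rests on a simplifying assumption that is not from the article: the cluster's cost per hour is entirely fixed (depreciation only), so power, staffing, and the separate one-third saving on network hardware are all ignored. The function name is hypothetical.

```python
def unit_cost_reduction(throughput_gain: float) -> float:
    """Fractional drop in cost per token when output rises and cost stays fixed.

    With fixed hourly cost C and throughput T, cost per token is C / T.
    After a gain g it becomes C / (T * (1 + g)), a reduction of 1 - 1 / (1 + g).
    """
    if throughput_gain <= -1:
        raise ValueError("throughput_gain must be greater than -1")
    return 1 - 1 / (1 + throughput_gain)


if __name__ == "__main__":
    # 15% is the throughput gain reported in the article.
    print(f"cost per token falls by {unit_cost_reduction(0.15):.1%}")
```

Under that assumption, a 15% throughput gain lowers the cost per token by about 13%, slightly less than 15% because the same fixed cost is spread over 1.15 times the output.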
This article is from the WeChat official account "Science and Technology Innovation Board Daily", author: Li Mingming, published by 36Kr with authorization.