
Switch revolution in the era of AI supernodes

半导体产业纵横 | 2026-03-31 20:26
The battle for technology and market share around AI switches has already begun.

The parameter scale of large AI models keeps growing, and the physical limits of single-card compute and memory are forcing AI training clusters to keep expanding. In this arms race for AI compute, network performance has become the key factor determining how efficiently a cluster's compute can be released. For AI models with ultra-large parameter counts, higher network bandwidth directly and significantly compresses the model training completion cycle.

The technological foundation for AI computing power release: RDMA

To break through the network performance bottleneck of AI clusters, RDMA has become the industry-recognized solution, and its origins lie in solving the communication bottleneck of the era of general-purpose GPU computing.

GPUDirect RDMA is a hardware-software co-design jointly developed by NVIDIA and Mellanox starting in 2009. At the time, GPUs had shifted from graphics rendering to general-purpose computing (GPGPU) and had become the core accelerator for HPC. Although GPU compute kept improving, data transfers between GPUs in different cluster nodes still had to pass through the CPU, creating a communication bottleneck: the advantages of GPU compute could not be fully realized, and overall cluster efficiency stayed low. NVIDIA recognized that this problem had to be solved and, with its partner Mellanox, began exploring direct communication between GPUs and network cards, GPUDirect over InfiniBand. The solution gradually matured and was officially released in 2012 alongside the Kepler-architecture GPUs and CUDA 5.0, under the official name GPUDirect RDMA.

Before that, data transmission in traditional data centers was hampered by inherent weaknesses of the TCP/IP architecture. In the traditional scheme, memory access and network transmission belong to two separate sets of semantics, and the core work of moving data depends heavily on the CPU: the application first requests resources and notifies the socket, the kernel-mode driver then performs TCP/IP packet encapsulation, and the data is finally sent to the peer through the NIC. On the sending node the data is copied multiple times, passing through the application buffer, the socket buffer, and the transport-protocol buffer in turn; after reaching the receiving node it goes through the same number of copies in reverse, and only after decapsulation can it be written into the system's physical memory.

This traditional approach creates three problems. First, the repeated memory copies introduce high transmission latency. Second, TCP/IP packet encapsulation is done entirely in driver software, imposing a very high CPU load that directly becomes the bottleneck for bandwidth, latency, and other transmission metrics. Third, the application's frequent switching between user mode and kernel mode further amplifies latency and jitter, severely constraining network performance.
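To make the copy chain concrete, here is a minimal Python sketch of a loopback transfer over the kernel's socket path; the function name is ours, purely for illustration. The comments mark where data crosses between user space and kernel buffers.

```python
import socket

# A minimal loopback transfer over the kernel's socket path. Each sendall()
# copies the user buffer into a kernel socket buffer, and each recv() copies
# it back out into the receiver's buffer -- the copy chain and user/kernel
# switching that RDMA's kernel bypass is designed to eliminate.
def kernel_path_roundtrip(payload: bytes) -> bytes:
    left, right = socket.socketpair()       # connected pair, kernel-mediated
    try:
        left.sendall(payload)               # copy: application -> kernel buffer
        received = bytearray()
        while len(received) < len(payload):
            received.extend(right.recv(65536))  # copy: kernel buffer -> application
        return bytes(received)
    finally:
        left.close()
        right.close()

print(len(kernel_path_roundtrip(b"x" * 16384)))  # payload survives, via the copies
```

Every byte here is touched by the CPU at least twice, which is exactly the overhead the next paragraphs describe RDMA removing.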

RDMA (Remote Direct Memory Access) emerged precisely to address these pain points. Through host offloading and kernel bypass, two applications can perform reliable memory-to-memory data transfers directly over the network: once the application initiates a transfer, the RDMA-capable NIC (RNIC) reads the memory and sends the data to the network interface, and the receiving node's NIC writes the data directly into the application's memory, with no deep CPU or kernel involvement anywhere on the path.
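The memory semantics can be sketched by analogy. Real RDMA uses verbs (memory registration, one-sided WRITE, etc.); in the hypothetical sketch below, a shared-memory segment stands in for a registered buffer, and the "target" never issues a receive call:

```python
from multiprocessing import shared_memory

# Analogy only: a shared-memory segment stands in for an RDMA-registered
# buffer. The "initiator" writes bytes straight into the region, and the
# "target" reads them from its own mapping -- no recv() call and no kernel
# copy on the data path, which mirrors the semantic RDMA provides across
# machines. Function and variable names are ours, for illustration.
def one_sided_write_demo(payload: bytes) -> bytes:
    region = shared_memory.SharedMemory(create=True, size=len(payload))
    try:
        region.buf[: len(payload)] = payload           # initiator writes "remote" memory
        target = shared_memory.SharedMemory(name=region.name)
        seen = bytes(target.buf[: len(payload)])       # target reads directly
        target.close()
        return seen
    finally:
        region.close()
        region.unlink()

print(one_sided_write_demo(b"gradient-shard-0"))
```

The point of the analogy is the absence of any per-byte CPU involvement between write and read; over a real network, the RNIC hardware plays that role.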

With these properties, RDMA has become one of the core interconnect technologies in fields with strict requirements for low latency, high bandwidth, and low CPU usage, such as high-performance computing, big-data storage, and machine learning. Standardization of the RDMA protocol also gives devices from different vendors a unified interconnect specification, carrying the technology from concept to large-scale commercial use. Today the mainstream RDMA implementations fall into three categories: the InfiniBand protocol, the iWARP protocol, and the RoCE protocol (in two versions, RoCE v1 and RoCE v2).

As AI model parameters have jumped from billions to trillions, even with single-GPU memory capacity continuing to expand, inter-server data transfer efficiency has become a key factor in whether a system can scale and whether training goals can be met, and the value of RDMA keeps growing. Whether a system can efficiently reach the memory and resources of other servers directly determines its scalability, and direct access to remote memory directly improves end-to-end training performance. It is RDMA that lets data be delivered to GPUs quickly, ultimately shortening the Job Completion Time (JCT).
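To see how directly bandwidth translates into synchronization time, here is a back-of-envelope Python sketch using the standard ring all-reduce traffic formula. All concrete numbers (model size, GPU count, link speeds) are illustrative assumptions, not benchmarks:

```python
# Back-of-envelope: how interconnect bandwidth bounds one gradient sync.
def ring_allreduce_seconds(grad_bytes: float, gpus: int, link_gbps: float) -> float:
    # A ring all-reduce moves 2*(N-1)/N * S bytes per GPU per synchronization.
    volume = 2 * (gpus - 1) / gpus * grad_bytes
    return volume / (link_gbps / 8 * 1e9)   # convert Gb/s to bytes/s

grad = 70e9 * 2    # hypothetical 70B-parameter model, fp16 gradients (2 bytes each)
slow = ring_allreduce_seconds(grad, gpus=1024, link_gbps=400)
fast = ring_allreduce_seconds(grad, gpus=1024, link_gbps=800)
print(f"400G: {slow:.2f} s/sync, 800G: {fast:.2f} s/sync")
```

Doubling the link speed exactly halves the communication-bound portion of each step, which is why bandwidth shows up so directly in JCT.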

The battle between InfiniBand and Ethernet

In the history of AI computing networks, the mature Ethernet stack was the first to be used for inter-cabinet interconnection. As the demand for low latency grew, InfiniBand rose quickly on the strength of its performance. As the representative native RDMA protocol, InfiniBand is led and promoted by Mellanox, now part of NVIDIA. It delivers transmission latency below 2 microseconds with zero packet loss, making it the performance leader in the RDMA field.

To carry InfiniBand's RDMA advantages over to the Ethernet ecosystem, the RoCE protocol emerged. RoCE v1 can only operate within a single Layer 2 subnet, while RoCE v2 routes across subnets via IP/UDP encapsulation, greatly improving deployment flexibility. Although its latency of roughly 5 microseconds is still higher than native InfiniBand's, it lets Ethernet meet the high-bandwidth, low-latency requirements of AI training.
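The IP/UDP encapsulation that gives RoCE v2 its routability costs a fixed per-packet overhead. A small Python sketch with the standard header sizes (RoCE v2 is identified by UDP destination port 4791) shows the overhead is modest at typical RDMA payload sizes:

```python
# RoCE v2 wraps the InfiniBand transport packet in UDP/IP, which is what
# lets it route across subnets. Fixed per-frame overhead on the wire, in
# bytes (VLAN tag and Ethernet preamble/IPG excluded for simplicity):
HEADERS = {
    "Ethernet": 14,      # L2 header
    "IPv4": 20,          # added by RoCE v2 (RoCE v1 stops at Layer 2)
    "UDP": 8,            # destination port 4791 identifies RoCE v2
    "IB BTH": 12,        # InfiniBand Base Transport Header, carried over
    "ICRC+FCS": 4 + 4,   # invariant CRC plus Ethernet frame check sequence
}

payload = 4096           # one typical RDMA MTU worth of data
overhead = sum(HEADERS.values())
efficiency = payload / (payload + overhead)
print(f"{overhead} B overhead per frame, wire efficiency {efficiency:.1%}")
```

At a 4 KB payload, the encapsulation costs only a few percent of goodput, so RoCE v2's routability comes at little bandwidth cost; the latency gap versus InfiniBand comes mainly from the Ethernet switching and congestion-handling path, not from these headers.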

To shake InfiniBand's dominance in AI, industry giants including Broadcom, Microsoft, and Google jointly released the UEC 1.0 specification in June 2025, aiming to rebuild the Ethernet protocol stack so that its performance approaches InfiniBand's, marking Ethernet's full-scale counterattack. The Ultra Ethernet Consortium (UEC) states that UEC 1.0 provides high-performance, scalable, and interoperable solutions at every level of the network stack, from NICs and switches to fibers and cables, enabling seamless multi-vendor integration and accelerating innovation across the ecosystem. The specification not only adapts modern RDMA capabilities to Ethernet and IP but also supports end-to-end scaling to millions of devices, while avoiding vendor lock-in entirely.

Domestic technology companies such as Alibaba, Baidu, Huawei, and Tencent have all joined the UEC to jointly push the standard toward deployment. Beyond participating in global standardization, domestic companies are also developing independently controllable scale-out architectures, all with low latency and zero packet loss as the core goals, benchmarking directly against InfiniBand's performance.

From the perspective of industrial deployment, the trade-offs between the two routes are clear. The RoCE v2 solution is built on the Ethernet architecture: it delivers RDMA's high-bandwidth, low-latency transmission while retaining Ethernet's broad device compatibility, and it is flexible to deploy with significant cost advantages. Compared with InfiniBand, Ethernet-based RDMA holds a strong edge in low cost and high scalability.

Network availability directly determines the stability of GPU-cluster compute, and the AI boom is driving data center switches to iterate toward ever higher speeds. The exponential growth of AI model parameters has driven a large-scale rise in compute demand, but a large cluster does not automatically mean large effective compute. To compress the training cycle, large-model training commonly relies on distributed training, and RDMA is the core technology for bypassing the operating system kernel and reducing inter-card communication latency. The two mainstream deployed solutions are InfiniBand and RoCE v2; of these, InfiniBand offers lower latency but at higher cost, with a supply chain highly concentrated in NVIDIA. Dell'Oro Group forecasts that by 2027, Ethernet's share of the AI computing network market will officially overtake InfiniBand's.

The explosion of supernodes ushers in a golden period for high-end switches

As AI models reach the trillion-parameter scale, the demand for compute has shifted from simply stacking GPUs to rebuilding the system architecture across every dimension. Constrained by single-chip power density, interconnect bandwidth, and memory-capacity bottlenecks, the marginal return on compute growth keeps declining. Both current research and engineering practice show that system-level collaborative architectures (such as high-bandwidth-domain interconnects) are the main technical path past the single-chip performance ceiling, for the fundamental reason that single-chip physical limits have become the core bottleneck on compute growth.

When model scale far exceeds the compute and memory capacity of a single chip, traditional distributed training suffers from sharply rising communication overhead and a significant drop in compute utilization. Against this backdrop, using high-speed lossless interconnects to logically fuse dozens or even hundreds of GPUs into a unified computing unit, presenting externally as a single equivalent "supercomputer," has become the direction that mainstream AI-infrastructure vendors and research institutions worldwide recognize as the breakthrough for the next-generation compute architecture.

The explosion of AI supernodes opens a new incremental market for switches. Compared with traditional servers, AI servers add GPU modules that must be interconnected efficiently with servers and switches through dedicated NICs to complete high-speed inter-node communication. This adds a back-end network tier to AI server networking on top of the traditional architecture, and the number of network ports per server rises significantly, directly driving demand across the industry chain for high-speed switches, NICs, optical modules, and fiber cabling.

At the same time, large-scale supernode deployment accelerates the scale-out of the network architecture. Networking ultra-large clusters of tens of thousands, hundreds of thousands, or even millions of cards generates huge demand for high-speed switches. As AI model parameters keep expanding, cluster scale has jumped rapidly from hundreds and thousands of cards to tens and hundreds of thousands, pushing the networking architecture from two tiers to three and four tiers and further widening the market gap for high-speed switches.
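The jump from two-tier to three- and four-tier networking follows directly from switch radix arithmetic. Assuming an idealized non-blocking fat-tree (a simplification of real deployments), a quick Python sketch shows how each added tier multiplies the reachable host count for a 128-port switch:

```python
def max_hosts(radix: int, tiers: int) -> int:
    # Non-blocking fat-tree (folded Clos): 2 tiers reach radix^2 / 2 hosts,
    # 3 tiers radix^3 / 4, and in general radix^tiers / 2^(tiers - 1),
    # because half of each non-top switch's ports face upward.
    return radix ** tiers // 2 ** (tiers - 1)

# Illustrative: a 128-port switch ASIC (e.g. a 128x400GE box).
for tiers in (2, 3, 4):
    print(f"{tiers}-tier fat-tree: up to {max_hosts(128, tiers):,} hosts")
```

A two-tier fabric tops out in the thousands of endpoints, three tiers reach the hundreds of thousands, and a fourth tier is what million-card ambitions require, with each added tier multiplying the number of switches and optical links purchased.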

The rapid development of the global AI industry places unprecedented demands on the networking architecture, bandwidth, and latency of AI clusters, and it is driving Ethernet switches, the core communication equipment, to iterate toward higher speeds, more ports, white-box designs, and optical switching. Ethernet's deep industrial base and large vendor ecosystem also leave room for its share of the AI network to keep growing. Although InfiniBand still dominates the AI back-end network market with its low latency, congestion control, and adaptive routing, with Ethernet deployment solutions continuing to improve and the Ultra Ethernet Consortium ecosystem maturing, the Ethernet solution's market share will keep rising, directly driving demand for Ethernet switches.

The whole industry joins the fray: domestic and foreign manufacturers compete for the AI switch market

The huge market opportunity in AI switches has drawn comprehensive commitments from global technology giants and domestic manufacturers alike. A battle for technology and market share around AI switches has begun, spanning chips to complete systems and traditional equipment makers to Internet companies.

Among the international giants, NVIDIA has moved most aggressively. Its Spectrum-X platform is an Ethernet solution optimized for ultra-large-scale clusters, and with it NVIDIA broke into the traditional IT switch market in under three years. NVIDIA has also shifted its next-generation Rubin AI platform fully to a co-packaged optics (CPO) architecture and announced that it has entered mass production, making CPO the official "standard configuration" for future AI data centers.

Broadcom, for its part, launched the world's first 102.4 Tbps switch chip, Tomahawk 6, last year, offering twice the bandwidth of the Ethernet switch chips currently on the market. Designed for next-generation scale-up and scale-out AI networks, Tomahawk 6 supports 100G/200G SerDes and co-packaged optics (CPO) for greater flexibility, and it offers the industry's most comprehensive AI routing features and interconnect options, targeting AI clusters of more than one million XPUs.

Domestic traditional equipment manufacturers have quickly followed, each launching flagship products.

Huawei released two flagship products in 2025: the CloudEngine XH9330, the industry's highest-density 128×800GE 100T box Ethernet switch, whose industry-leading port density pushes past the scale limits of AI clusters; and the CloudEngine XH9230, the industry's first 128×400GE 51.2T liquid-cooled box Ethernet switch, which helps enterprises build green, energy-efficient, ultra-large-scale fully liquid-cooled compute clusters.

H3C, a subsidiary of Tsinghua Unigroup, was the first to release a 1.6T intelligent-computing switch, the H3C S98258C-G, in 2024. It supports the all-optical network 3.0 solution, with single-port rates above 1.6T and a total switching capacity of 204.8T, enough to serve the communication needs of 32,000 AIGC nodes. The product carries a self-developed intelligent-computing engine with latency as low as 0.3 microseconds, has passed verification by international customers such as Google, and has become a core supplier of complete OCS systems. The company has also launched the world's first 51.2T 800G CPO silicon-photonics data center switch, laying the groundwork for the iteration to 1.6T products.

Ruijie Networks has completed a demonstration of a commercial 51.2T switch interconnect solution based on CPO technology. With its very high integration, significant energy-efficiency gains, and serviceability-oriented design, the solution meets the high-speed interconnect needs of AI training and ultra-large compute clusters and offers a feasible path to future 800G and 1.6T network upgrades. The 51.2T CPO switch uses Broadcom's Bailly 51.2 Tbps CPO chip and fits 128 400G FR4 optical ports into a 4RU chassis, greatly raising port density and bandwidth capacity. Its core highlight is that co-packaging the optical engine with the switch chip sharply shortens the electrical interconnect path, reducing signal attenuation and transmission power consumption.

ZTE has launched a domestic ultra-high-density 230.4T chassis switch and a full series of 51.2T/12.8T box switches. Their performance is among the best in the industry, and they have been commercially deployed at scale in intelligent-computing clusters ranging from hundreds to millions of cards across operators, Internet companies, and finance.

Beyond the traditional switch vendors, Internet companies have also entered the game, developing their own switches and becoming a market force that cannot be ignored.

Tencent began CPO switch R&D as early as 2022 and launched and lit up the industry's first 25.6T CPO data center switch, Gemini, that same year. The product integrates a 12.8T optical engine providing 16 800G optical interfaces, with the remaining 12.8T of switching capacity supplied through 32 QSFP112 pluggable interfaces on the panel.

ByteDance has officially launched a self-developed 102.4T switch on the Volcengine platform to support its new HPN 6.0 architecture, meeting the efficient interconnect needs of clusters with hundreds of thousands of GPUs. The switch supports LPO on all ports and packs 128 800G OSFP ports into a 4U chassis.

Alibaba exhibited a self-developed 102.4T domestic switch at the Apsara (Yunqi) Conference and was the first to apply 3.2T NPO technology in a new-generation domestic four-chip switch. The device integrates four domestic 25.6T switch chips for 102.4T of total switching capacity and can evolve smoothly to a 409.6T platform by upgrading to 4×102.4T chips.

Compared with linear-drive pluggable optics (LPO), near-packaged optics (NPO) can provide higher bandwidth density and reduce