NVIDIA, AMD, Intel, and Broadcom join forces to stop GPU computing power from going to waste.
On May 7th, Xin Dongxi reported that the previous night, OpenAI, together with AMD, Broadcom, Intel, Microsoft, and NVIDIA, released a new open network protocol, MRC (Multi-Path Reliable Connection). The protocol helps large AI training clusters run faster and more reliably. OpenAI released MRC through the Open Compute Project (OCP).
MRC has been deployed on all of OpenAI's supercomputers used for training cutting-edge models, including the Oracle Cloud Infrastructure (OCI) site in Abilene, Texas, USA, and the Microsoft Fairwater supercomputer.
MRC is a new network protocol built into the latest 800Gb/s network interfaces. It can split a single data transfer across hundreds of paths, route around faulty links in microseconds, and simplify the network's control-plane architecture.
OpenAI's official blog mentioned that while recently training a cutting-edge large model for ChatGPT and Codex, the team had to restart four primary core switches. In the past, restarting switches demanded extreme caution from the operations team. With MRC in place, they can restart switches without even coordinating in advance with the teams running training jobs on the cluster.
Before building the Stargate infrastructure project, OpenAI had spent several years developing and maintaining its first three generations of supercomputers with partners. That experience taught the company that using compute efficiently on supercomputers and completing jobs reliably requires drastically reducing complexity at every layer of the stack, including redesigning the network.
Many commenters on OpenAI's official X account welcomed the release of MRC, calling it a genuine infrastructure improvement and a sign that infrastructure competition is shifting into an era of standardized cluster communication efficiency.
Paper link: https://cdn.openai.com/pdf/resilient-ai-supercomputer-networking-using-mrc-and-srv6.pdf
01 Solving Network Problems, MRC Brings Three Key Benefits to Scaling Supercomputers
When training large models, a single step may involve millions of data transfers. One delayed transfer can ripple through the entire job and leave GPUs idle. Network congestion, link failures, and device failures are the most common causes of transfer delays and jitter.
As computing infrastructure scales up, these problems occur more frequently and become harder to solve. This poses two key network challenges: minimizing the probability of network congestion, and reducing the impact of network failures on the training job itself.
Based on this, OpenAI developed MRC jointly with several chip companies, with the goal of building a network that delivers highly predictable performance even during failures, so that training jobs keep making progress.
MRC is an extension of RDMA over Converged Ethernet (RoCE), a standard from the InfiniBand Trade Association that enables hardware-accelerated remote direct memory access between GPUs and CPUs. MRC borrows techniques developed by the Ultra Ethernet Consortium (UEC) and extends them with SRv6 source routing to support large-scale AI network fabrics.
This network architecture has supported the training of multiple OpenAI models on NVIDIA and Broadcom hardware.
AMD contributed congestion control technology to MRC to improve its real-world performance, and has worked with leading cloud service providers to deploy MRC at scale in test clusters. Before the MRC specification existed, AMD had a pre-standard implementation of an improved RoCEv2 transport protocol, which evolved into today's MRC standard. AMD's press release notes that it is among the first companies to implement MRC on 400G network cards, and that customers can transition seamlessly to the AMD Pensando "Vulcano" 800G AI NIC, which also supports the MRC transport protocol.
MRC is a new transport protocol that was first validated and optimized on NVIDIA Spectrum-X Ethernet. Its fault-bypass technology can detect network path failures within a few microseconds and automatically reroute traffic in hardware. NVIDIA's official blog noted that this capability is especially important for AI training clusters, where thousands of GPUs must stay synchronized: even a brief network interruption can slow or halt an entire training job.
Broadcom's Thor Ultra is an 800Gb/s high-performance Ethernet NIC designed for AI workloads and multi-plane network architectures. It builds on several generations of RoCE NIC technology and adds support for MRC and advanced RoCE features. Broadcom's official blog said it has contributed this technology and experience to the collaborative development of the MRC ecosystem. Thor Ultra integrates a high-bandwidth, line-rate programmable datapath implemented in the Network Programming Language (NPL) to deliver advanced congestion control (sender- and receiver-based), load balancing, and reliable transport, reducing system cost and complexity.
Intel posted on its official X account that, with MRC, it is building a multi-plane Ethernet networking architecture that enables large-scale cluster deployment, reduces the number of switch tiers, lowers power consumption, and improves overall reliability.
MRC brings three key advantages to the expansion of supercomputers:
First, this technology can build a multi-plane high-speed network that supports a supercomputer on the scale of 100,000 GPUs with only two layers of Ethernet switches. The architecture has enough redundancy to ride out network failures smoothly, while consuming less power than a three- or four-layer single-plane network of the same scale.
Second, MRC's adaptive packet scattering provides excellent load balancing, making congestion in the network core almost nonexistent.
This reduces throughput fluctuations between data streams in synchronous training; eliminating outlier delays is the key to optimizing synchronous-training performance. It also means that multiple jobs sharing the same supercomputer cluster do not interfere with one another's performance.
Finally, MRC uses SRv6 source routing to bypass faulty links quickly, forwarding packets only on healthy, available paths.
This allows it to use a simple, static network control plane and fundamentally avoids a large class of failure modes unique to dynamic routing.
02 Supporting Multi-Plane Networks, Achieving Lower Cost and Power Consumption
MRC uses a multi-plane network. Instead of treating each network interface as a single 800Gb/s link, it splits the interface into several finer-grained sub-links. For example, a single network interface can connect to eight different switches simultaneously, building eight independent parallel networks (network planes) of 100Gb/s each, rather than one 800Gb/s network.
The advantage: a switch that originally offered 64 ports at 800Gb/s can instead provide 512 ports at 100Gb/s. With that port count, a network that fully interconnects roughly 131,000 GPUs can be built with only two layers of switches, whereas traditional 800Gb/s networking requires a three- or four-layer switch architecture.
▲ Supporting multi-plane networks
A network designed this way has lower cost and power consumption. It provides more path diversity than traditional network designs and allows more traffic to stay local on the Layer 0 switches, improving performance.
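To make that arithmetic concrete, here is a minimal sketch in Python. The figures (64 ports, 800Gb/s, 8 planes) are the ones quoted above; the two-tier leaf/spine formula (a radix-R fabric attaches roughly R²/2 hosts) is a common approximation assumed here, not a detail published in the paper.

```python
# Back-of-the-envelope check of the multi-plane fan-out described above.
PORTS_PER_ASIC = 64           # one switch: 64 ports at 800 Gb/s
PORT_SPEED_GBPS = 800
PLANES = 8                    # each NIC split across 8 network planes
SUBLINK_GBPS = PORT_SPEED_GBPS // PLANES      # 100 Gb/s sub-links

total_bw = PORTS_PER_ASIC * PORT_SPEED_GBPS   # 51,200 Gb/s of ASIC bandwidth
ports_at_100g = total_bw // SUBLINK_GBPS      # 512 ports per switch

# In a two-tier leaf/spine fabric of radix-R switches, half of each
# leaf's ports face hosts and half face spines, so roughly R^2 / 2
# hosts can be attached (standard approximation, not from the paper).
max_gpus = ports_at_100g ** 2 // 2            # 512^2 / 2 = 131,072

print(f"{ports_at_100g} ports per switch at {SUBLINK_GBPS} Gb/s")
print(f"two-layer fabric scale: about {max_gpus:,} GPUs")
```

Run as written, this reproduces the article's figures: 512 ports per switch and a two-layer fabric of about 131,072 GPUs.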
However, this path diversity is hard to exploit fully. Traditional network protocols for AI training usually pin each data transfer to a single fixed path so that packets arrive in order.
In a large-scale multi-plane network, this creates two major problems. First, different data streams may compete for the same link, causing congestion. Second, a single data stream can occupy only one of the many network planes. Without targeted optimization, a multi-plane network suffers severe congestion and overall performance drops sharply.
▲ Congestion caused by the collision of packet streams
03 Packet Scattering and Forwarding across Hundreds of Paths
MRC fundamentally changes this model.
Rather than confining a single transfer to one path, it scatters that transfer's packets across hundreds of paths in the network and transmits them in parallel over all the independent network planes.
Packets can arrive out of order, because every MRC packet carries its final memory address: the receiver does not need to wait and reorder, and can write each packet to memory the moment it arrives.
Each MRC connection therefore keeps only a small amount of state for the many paths it uses. As soon as congestion is detected on a path, it immediately shifts traffic to other paths to balance the network load.
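The mechanism can be pictured with a short sketch. This is a hypothetical illustration only: the class, field, and function names are invented, and the real protocol runs in NIC hardware, not in Python.

```python
# A minimal, hypothetical sketch of MRC-style packet scattering.
import random
from dataclasses import dataclass

PKT_SIZE = 4096  # assumed payload size per packet (illustrative)

@dataclass
class Path:
    path_id: int
    congested: bool = False  # set when congestion feedback arrives
    failed: bool = False     # set after unexplained packet loss

class MRCConnection:
    """Keeps a little state per path and sprays the packets of one
    transfer across every currently healthy path."""

    def __init__(self, num_paths: int):
        self.paths = [Path(i) for i in range(num_paths)]

    def healthy_paths(self):
        return [p for p in self.paths if not (p.congested or p.failed)]

    def send(self, base_addr: int, payloads: list):
        for seq, data in enumerate(payloads):
            path = random.choice(self.healthy_paths())
            # Every packet carries its final memory address, so the
            # receiver can write it to memory on arrival, in any order.
            self._transmit(path, base_addr + seq * PKT_SIZE, data)

    def _transmit(self, path: Path, dst_addr: int, data: bytes):
        # Stand-in for the NIC's hardware send.
        print(f"path {path.path_id:3d} -> addr {dst_addr:#x} ({len(data)} bytes)")

conn = MRCConnection(num_paths=256)
conn.send(base_addr=0x100000, payloads=[b"x" * PKT_SIZE for _ in range(8)])
```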
If a packet is lost, MRC adopts a conservative strategy: it assumes the path may have failed, immediately stops using it, and retransmits the possibly lost packets.

After eliminating a path, MRC sends probe packets to check whether a fault really exists and, if so, to detect when the link recovers.
Packet loss can also stem from congestion at the destination. MRC handles this through a packet-truncation mechanism: when a switch is about to drop a packet due to congestion, it does not discard the whole packet. Instead, it cuts off the payload and forwards only the header to the destination, triggering an explicit retransmission request.
Packet truncation also effectively reduces misjudgment: loss caused by simple congestion is not mistaken for a path failure.
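Putting the last few paragraphs together, a rough sketch of this loss-handling policy might look like the following. The event names and helper functions are assumptions made to mirror the article's description, not a published MRC state machine.

```python
# Hypothetical sketch: how a sender might react to the two loss signals.
class Path:
    def __init__(self, path_id: int):
        self.path_id = path_id
        self.failed = False

def request_retransmit(path: Path):
    print(f"path {path.path_id}: explicit retransmit requested")

def retransmit_elsewhere(path: Path):
    print(f"path {path.path_id}: resending lost packets on other paths")

def start_probing(path: Path):
    # Probe packets later confirm whether the path really failed and,
    # eventually, whether the link has recovered.
    print(f"path {path.path_id}: probing for failure / recovery")

def on_loss_signal(path: Path, event: str):
    if event == "truncated_header":
        # A switch trimmed the payload instead of silently dropping the
        # packet: destination congestion, not a path failure, so only
        # an explicit retransmission is needed.
        request_retransmit(path)
    elif event == "unexplained_loss":
        # Nothing arrived at all: conservatively assume the path failed,
        # stop using it at once, and resend on other paths.
        path.failed = True
        retransmit_elsewhere(path)
        start_probing(path)

p = Path(7)
on_loss_signal(p, "truncated_header")  # congestion -> retransmit only
on_loss_signal(p, "unexplained_loss")  # loss -> eliminate path, probe
```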
By combining these mechanisms (multi-plane topology, packet scattering, load balancing, and packet truncation), MRC connections can detect network faults and route around them within microseconds, minimizing the impact on synchronous training jobs. Traditional network architectures, by contrast, often take seconds or even tens of seconds to converge and route around a fault.
04 Further Simplifying the Network, Abandoning a Path as Soon as Packets Are Lost
MRC goes a step further in simplifying the network.
In traditional solutions, switches run dynamic routing protocols such as BGP (Border Gateway Protocol) to calculate available paths and route around faults.
But switches are complex devices running very complex software. When a latent anomaly surfaces, the problem is often hard to troubleshoot and keeps causing connection interruptions until the fault is fixed.
With MRC, as soon as packet loss occurs on a path, MRC simply stops using that path.
Its solution is to turn off dynamic routing and instead use IPv6 Segment Routing (SRv6). SRv6 lets the sender specify the forwarding path of each packet directly, by embedding a sequence of switch identifiers in the packet's destination address field.
The principle is as follows:
When a switch forwards a packet, it checks whether its own identifier is at the head of the path list. If it matches, the switch removes its identifier by shifting the destination address field, exposing the identifier of the next-hop switch.
The switch then looks up that identifier in a static routing table to determine where to forward the packet next.
Unlike dynamic routing, these static routing tables are installed once during the switch's initial configuration and never change afterward.
MRC uses SRv6 to distribute packets across all network planes and to use multiple paths in parallel within each plane. Once a path fails, MRC can simply stop using it.
The switch never recalculates routes; it just forwards packets strictly according to the preset static rules, with no additional complex processing.
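The forwarding behavior can be modeled in a few lines. In real SRv6 the segment list lives in the IPv6 destination address and routing header; here a plain Python list and made-up switch identifiers stand in for it.

```python
# Simplified, hypothetical model of SRv6-style source routing as
# described above: the sender embeds an ordered list of switch
# identifiers, each switch pops its own identifier and consults a
# static table. Topology and names are invented for illustration.

# Static forwarding tables, installed once at configuration time and
# never recomputed.
STATIC_TABLE = {
    "leaf-1":  {"spine-3": "port-12"},
    "spine-3": {"leaf-9": "port-4"},
    "leaf-9":  {"gpu-host": "port-0"},
}

def forward(switch: str, segments: list) -> list:
    """One forwarding step: pop this switch's identifier, then use the
    static table to pick the egress port toward the next segment."""
    assert segments[0] == switch, "packet arrived at the wrong switch"
    remaining = segments[1:]
    if remaining:
        next_hop = remaining[0]
        port = STATIC_TABLE[switch][next_hop]
        print(f"{switch}: forwarding toward {next_hop} via {port}")
    else:
        print(f"{switch}: final segment reached, deliver locally")
    return remaining

# The sender chose the full path up front; no switch recalculates it.
segs = ["leaf-1", "spine-3", "leaf-9", "gpu-host"]
for sw in ["leaf-1", "spine-3", "leaf-9"]:
    segs = forward(sw, segs)
```

If a path fails, the sender simply stops choosing segment lists that traverse it; nothing in the switches' tables has to change.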
05 Conclusion: Big Tech Companies Join Hands to Break the Bottleneck of Computing Power Utilization in Supercomputer Clusters
According to the official blog, MRC has significantly improved OpenAI's ability to train new large models and enabled the network architecture to match its AI development roadmap.
As training clusters continue to grow, network design increasingly determines how much of the available computing power is actually utilized. MRC lets GPU clusters keep running through congestion, link failures, and maintenance events that previously would have interrupted training jobs.
At ultra-large computing scales, this reliability and operational efficiency may become a basic prerequisite for the synchronous training of cutting-edge large models.
This article is from the WeChat official account "Xin Dongxi", author: Cheng Qian. Republished by 36Kr with permission.