NVIDIA is facing a group of formidable rivals.

Semiconductor Industry Observation, 2025-09-01 11:44
The Ultra Ethernet (UE) 1.0 specification has been released, targeting AI/HPC. UET supports hardware-accelerated multipath transmission.

The recently released Ultra Ethernet (UE) 1.0 specification defines a transformative high-performance Ethernet standard for future artificial intelligence (AI) and high-performance computing (HPC) systems. This article, co-written by the authors of the specification, provides a high-level overview of Ultra Ethernet's design, laying out the key motivations and scientific background needed to understand its innovations. Although Ultra Ethernet delivers breakthroughs across the entire Ethernet protocol stack, its most prominent contribution is the new Ultra Ethernet Transport (UET): a protocol that can be fully accelerated in hardware, designed specifically for reliable, high-speed, and efficient communication in ultra-large-scale systems.

More than twenty years ago, InfiniBand marked the last major standardization effort in high-performance networking. Ultra Ethernet, by contrast, fully leverages the vast Ethernet ecosystem and the thousand-fold growth in computing available per bit transmitted, opening a new era of high-performance networking.

Introduction

Ultra Ethernet (UE) brings standardized high-performance network support for AI and HPC to Ethernet. This article, written by the authors of the Ultra Ethernet specification, complements the full specification by focusing on the technical evolution and key innovations of the roughly 2.5-year development effort. It is aimed at a broad readership, so many details are simplified and presented in accessible terms. For any question about Ultra Ethernet, the ultimate authoritative reference is the 562-page complete specification.

In 2022, the world rapidly entered a new era of large-scale computing driven by the demands of artificial intelligence systems. Major data center operators realized that InfiniBand and its Ethernet-based offshoot, RDMA over Converged Ethernet (RoCE), had clear limitations. At the same time, Ethernet's success as a general-purpose interconnect was beyond doubt: hundreds of millions of ports are deployed every year, and its inventor received the Turing Award that same year.

About ten years ago, RoCE v2 (the second generation of the protocol) embedded the InfiniBand transport layer into routable Ethernet (Layer 3 of the OSI model, the network layer), enabling its deployment in data centers. RoCE adopted the InfiniBand transport protocol almost unchanged, requiring the network to provide lossless transmission and strict in-order packet delivery. In Converged Ethernet, this lossless, in-order delivery is guaranteed mainly by the Priority Flow Control (PFC) mechanism.

However, PFC requires reserving large headroom buffers for specific traffic classes and is prone to congestion spreading and head-of-line blocking. In addition, the in-order delivery requirement limits path-selection flexibility, which can lead to suboptimal network performance. These and other limitations of RoCE have been surveyed in detail in the literature.

The original InfiniBand transport protocol was designed 25 years ago, in a technological environment where architects had to squeeze out bandwidth under tight computing-resource constraints. After 25 years of exponential progress following Moore's Law, the cost of a transistor (and thus of computation) has fallen by a factor of more than 100,000, while link bandwidth has grown only about 100-fold, from Single Data Rate (SDR) to Extended Data Rate (XDR). Consequently, not only has endpoint-accelerator performance improved dramatically, but the computing resources available to network architects per bit transmitted have grown by more than 1,000x (roughly the 100,000x compute gain divided by the 100x bandwidth gain). This shift prompted many companies to rethink the network protocol stack in their internal AI product lines [16,25,32] and HPC deployments. Some in the industry recognized early on that data center networks and HPC networks would inevitably converge into a single technology, and several companies soon began discussions to drive that convergence.

In the first quarter of 2022, AMD, Broadcom, HPE, Intel, Microsoft, and other companies formed a small working group and agreed to create an open standard for next-generation Ethernet, building on the parallel R&D already under way inside each company in order to grow the market. The project was initially named HiPER and later renamed Ultra Ethernet (UE). In July 2022 a new consortium took shape, and the first in-person meeting was held that September. Opinions at the first meeting ranged widely, from "standardize a RoCE variant" to "build a new standard from existing technical building blocks," and the whole team engaged with great enthusiasm and lively discussion. In January 2023 the working group settled on a direction: combine functionality designed for HPC with the security mechanisms and congestion-management techniques designed for data centers to create something new, a highly scalable transport-layer protocol that runs on plain old Ethernet.

Shortly afterward, in July 2023, the founding of the Ultra Ethernet Consortium (UEC) was officially announced by AMD, Arista, Broadcom, Cisco, Eviden (formerly Atos), HPE, Intel, Meta, and Microsoft. Run as an open project under the Linux Foundation's Joint Development Foundation (JDF), the consortium has grown rapidly: by the end of 2024 it counted more than 100 member companies and more than 1,500 individual participants.

The goal of the Ultra Ethernet Consortium (UEC) is to define an open next-generation HPC and AI networking standard that is compatible with existing Ethernet deployments and supports interoperability between devices from different vendors. The consortium's discussions revolve around the following core principles:

Large-scale scalability: the key to meeting the deployment requirements of future AI systems. Ultra Ethernet (UE) is designed to support flexible deployments with millions of network endpoints and provides a connectionless API. Initially, Ultra Ethernet focuses on supporting the traditional fat-tree topology while not precluding other optimized topologies (such as HammingMesh, Dragonfly, or Slim Fly), although the consortium has not yet tested and validated these non-traditional topologies.

High performance: achieved through an efficient protocol designed for large-scale deployment. For example, Ultra Ethernet's connectionless API is backed by a mechanism that establishes end-to-end reliability context without extra latency: the first packet to arrive can establish the context (in as little as a few nanoseconds), even when packets are delivered massively out of order. Ultra Ethernet also supports optional extensions, such as packet trimming, which enables fast loss detection and recovery.

Compatibility with existing Ethernet data centers: by imposing minimal requirements on switch infrastructure, Ultra Ethernet stays compatible with existing Ethernet data center deployments, enabling easy rollout and gradual expansion of existing infrastructure. Ultra Ethernet switches only need to support Equal-Cost Multi-Pathing (ECMP) and egress-marked Explicit Congestion Notification (ECN), and may optionally support packet trimming to improve network performance (a toy sketch of these switch behaviors follows this list of principles). Ultra Ethernet requires no changes to the physical layer (PHY, Layer 1 of the OSI model) or the link layer (Layer 2), but it defines several optional extensions that improve link-layer performance in new deployments, leaving room for vendor differentiation. Ultra Ethernet remains fully compatible with the Ethernet standard, so users can keep their existing operations, management, debugging, and deployment tools.

Vendor differentiation: within a specification that guarantees interoperability, Ultra Ethernet leaves maximal room for vendor-specific innovation. This lets the existing Ethernet vendor ecosystem drive rapid innovation cycles and R&D in an active, large market. In many areas (such as packet load balancing and fast loss detection), the specification offers a set of options for implementing a compliant protocol rather than mandating one solution; vendors can adopt one of the proposed schemes or develop their own. This allows architects to tailor designs to their system-optimization goals, for example choosing specific load-balancing and loss-detection schemes to build high-performance systems that are easy to operate and deliver predictable performance.
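
The following minimal Python sketch illustrates the two switch-side behaviors mentioned above, egress ECN marking and optional packet trimming. It is not from the UE specification; the queue thresholds, the Packet class, and the enqueue function are illustrative assumptions.

```python
from dataclasses import dataclass

# Illustrative egress-queue model; thresholds are assumed, not from the spec.
ECN_THRESHOLD = 64    # queue depth (packets) above which packets are marked
TRIM_THRESHOLD = 256  # depth above which payloads are trimmed, not dropped

@dataclass
class Packet:
    header: bytes
    payload: bytes
    ecn_marked: bool = False
    trimmed: bool = False

def egress_enqueue(queue: list, pkt: Packet, trimming_enabled: bool) -> None:
    if trimming_enabled and len(queue) >= TRIM_THRESHOLD:
        # Trimming: discard the payload but still forward the header
        # (at high priority in a real switch), so the receiver detects
        # the loss immediately instead of waiting for a timeout.
        pkt.payload, pkt.trimmed = b"", True
    if len(queue) >= ECN_THRESHOLD:
        pkt.ecn_marked = True  # congestion signal carried back to the sender
    queue.append(pkt)
```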

Scope of application

Ultra Ethernet divides networks into three basic types: the local network (usually called the scale-up network), the backend network (usually called the scale-out network), and the frontend network. Figure 1 shows the architecture of these three network types. The local network (purple) connects central processing units (CPUs) and accelerators (XPUs, such as GPUs or TPUs). In typical current deployments, this node-level or rack-level network uses CXL, NVLink, or Ethernet, with reaches up to 10 meters and sub-microsecond latency targets. The frontend network (green) is the traditional data center network, carrying "east-west" (within the data center) and "north-south" (between the data center and the outside world) traffic. The backend network (blue) is a high-performance network that connects computing devices such as accelerators. The backend and frontend networks are both commonly called "scale-out networks" and can be realized as a single physical network instance; Ultra Ethernet supports this converged deployment as well as physically separate network instances.

Key features of Ultra Ethernet

Ultra Ethernet (UE) runs seamlessly on existing Ethernet networks. The specification recommends assigning Ultra Ethernet traffic its own traffic class, but its congestion-control algorithm can also share switch buffers with other traffic and coexist with it. Ultra Ethernet uses routable addresses and packet header formats compatible with IPv4 or IPv6 (Layer 3 of the OSI model) to ensure seamless integration. Ultra Ethernet defines Fabric Endpoints (FEPs) as the logical entities that terminate both ends of the transport layer in unicast operation. Functionally, FEPs are roughly equivalent to traditional Network Interface Controllers (NICs).

The key features of Ultra Ethernet include:

  1. A highly scalable connectionless transport protocol using ephemeral Packet Delivery Contexts (PDCs), sketched in code after this list;
  2. Removal of connection-oriented dependencies at the semantic layer, including buffer address mapping, access authorization, and error models;
  3. Native support for per-packet multipath transmission ("packet spraying"), combined with a flexible and scalable load-balancing scheme, without the need for out-of-order reassembly at the receiving end;
  4. Support for both reliable and unreliable transmission modes, each with in-order and out-of-order delivery, to cover the full range of application scenarios;
  5. Support for a lossy (best-effort) transmission mode to avoid head-of-line blocking, combined with optional packet trimming and other fast loss-detection schemes for rapid recovery;
  6. Innovative congestion-management schemes that adapt quickly to incast traffic and in-network congestion;
  7. Support for vendor products with pure-hardware, pure-software, or hybrid hardware-software implementations;
  8. Integration of scalable end-to-end encryption and authentication;
  9. Link-layer optimizations to support hardware-accelerated implementations.
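
To give a feel for the first feature, here is a toy, assumption-laden Python model of ephemeral contexts. The class names, the packet sequence number (PSN) field, and the reclamation policy are illustrative and do not reflect the actual UET wire protocol.

```python
class Pdc:
    """Ephemeral Packet Delivery Context for one remote peer."""
    def __init__(self, peer: str) -> None:
        self.peer = peer
        self.received: set[int] = set()  # delivered PSNs, for reliability

class Fep:
    """Minimal receiver-side Fabric Endpoint with on-demand PDCs."""
    def __init__(self) -> None:
        self.pdcs: dict[str, Pdc] = {}

    def on_packet(self, peer: str, psn: int) -> Pdc:
        # Connectionless: the first arriving packet creates the context
        # immediately, with no handshake round-trip before data flows.
        pdc = self.pdcs.setdefault(peer, Pdc(peer))
        pdc.received.add(psn)
        return pdc

    def reclaim(self, peer: str) -> None:
        self.pdcs.pop(peer, None)  # ephemeral: torn down once idle
```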

The following sections elaborate on these features and other functions of the Ultra Ethernet architecture. First, though, we introduce ECMP-based packet spraying, the basic load-balancing concept in Ultra Ethernet.

ECMP packet spraying

Equal-Cost Multi-Pathing (ECMP) is a scheme for load-balancing network traffic. An ECMP-capable switch does not map a destination address directly to a single port but to a set of ports corresponding to equal-cost paths toward the destination. An output port p is then selected for each packet by a deterministic hash function p = H(x). The input to the hash function is usually configurable and typically includes the full IP five-tuple (source address, destination address, source port, destination port, and protocol). Under traditional ECMP, therefore, all packets of the same flow travel the same deterministic path (in the absence of failures). Ultra Ethernet repurposes one of these fields to carry the so-called Entropy Value (EV); with standard UDP/IP, that field is the UDP source port (which is otherwise unused in this setting).
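
To make the mechanism concrete, here is a minimal Python sketch of switch-side ECMP port selection. It is not from the specification; the CRC32 hash, the addresses, and the port list are illustrative stand-ins for a vendor's actual hash function and configuration.

```python
import zlib

def ecmp_select_port(src_ip: str, dst_ip: str, src_port: int,
                     dst_port: int, proto: int, ports: list) -> int:
    """Deterministically map an IP five-tuple to one equal-cost port."""
    key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}|{proto}".encode()
    h = zlib.crc32(key)           # stand-in for the switch's hash H(x)
    return ports[h % len(ports)]  # p = H(x) mod the number of paths

# Two packets differing only in the EV (carried in the UDP source port)
# may take different equal-cost paths; identical EVs share one path.
uplinks = [0, 1, 2, 3]
print(ecmp_select_port("10.0.0.1", "10.0.1.2", 17001, 4793, 17, uplinks))
print(ecmp_select_port("10.0.0.1", "10.0.1.2", 17002, 4793, 17, uplinks))
```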

The Internet Assigned Numbers Authority (IANA) has assigned UDP port 4793 to the Ultra Ethernet Transport (UET) protocol, a number that is both a prime and the RoCEv2 port number 4791 incremented twice (++RoCEv2++). Ultra Ethernet also supports a native pure-IP mode, in which the Entropy Value (EV) sits in the same position as the UDP source port. The source Fabric Endpoint (FEP) can thus select a different EV for each packet that should travel a different path; when in-order delivery is required, the packets are assigned the same EV.
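
The following hedged sketch shows where the EV lives on the wire when UET runs over UDP: the destination port is the IANA-assigned 4793, and the source-port field carries the EV. The payload and the zeroed checksum are placeholders; this is not the actual UET header format.

```python
import struct

UET_UDP_PORT = 4793  # IANA assignment for Ultra Ethernet Transport

def udp_header_with_ev(ev: int, payload: bytes) -> bytes:
    """Build a UDP header whose source-port field carries the 16-bit EV."""
    length = 8 + len(payload)  # UDP header is 8 bytes
    checksum = 0               # left to the NIC/stack in this sketch
    return struct.pack("!HHHH", ev, UET_UDP_PORT, length, checksum) + payload

pkt = udp_header_with_ev(ev=0xBEEF, payload=b"uet-payload")  # hypothetical
```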

Figure 2 shows a full Clos network built from 8-port switches (green circles), supporting 64 endpoints (gray squares). Switch X in the second level is highlighted; it has 4 uplink ports and 4 downlink ports. A packet traveling through the network enters each switch on a downlink port and, unless its destination lies below that switch in the tree, is forwarded out an uplink port. In a Clos network, once a packet reaches a switch that is a common ancestor of the source and destination endpoints, it turns onto the unique downward path. In the example network, the 64 endpoints are divided into 4 groups of 16; between any two nodes in the same group (such as C and D) there are 4 equal-cost paths, each crossing 3 switches (the green and red paths in the figure), and between any two nodes in different groups (such as A and B) there are 16 paths (the purple and yellow paths).
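
These path counts follow from standard fat-tree arithmetic, as the short check below shows (assuming, as in the figure, a three-level topology built from switches of radix k = 8 with half the ports facing up):

```python
k = 8  # ports per switch in the Figure 2 example

same_group_paths = k // 2          # one path per second-level switch
cross_group_paths = (k // 2) ** 2  # one path per top-level switch

print(same_group_paths, cross_group_paths)  # -> 4 16, matching the figure
```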

Its simple design gives traditional ECMP certain limitations: endpoints cannot select paths directly. The only guarantee is that, in a fault-free network, two packets with the same Entropy Value (EV) travel the same path; whether two different EVs end up sharing a path due to a hash collision cannot be determined. In practice, such hash collisions and the resulting shared (sub-)paths are common! Between nodes in the same group there are only 4 distinct paths, yet the EV can take 2^16 different values. With an ideal, uniformly distributed hash function, two randomly chosen EVs therefore collide with probability 25%; between nodes in different groups the collision probability is still 6.25%. In actual deployments using switches with higher port counts, these probabilities will differ.
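
The quoted probabilities follow directly: with an ideal uniform hash, two independently chosen EVs land on the same one of n equal-cost paths with probability 1/n, regardless of the size of the 2^16-value EV space:

```python
for n in (4, 16):  # paths within a group, and between groups
    print(f"{n} paths -> collision probability {1 / n:.2%}")
# 4 paths -> collision probability 25.00%
# 16 paths -> collision probability 6.25%
```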

When two paths collide, the bandwidth available to each is halved, which can cause serious performance loss. In traditional Ethernet, a path, once chosen, never changes, producing a phenomenon called traffic polarization that makes the problem even more pronounced.

UE's packet spraying avoids such polarization by assigning a different EV to each packet, distributing packets uniformly across all switches in a statistical sense. Even when hash collisions occur, they are short-lived, and switch buffers absorb the resulting transient imbalance, so the network achieves full bandwidth utilization and long-term average traffic balance. If every endpoint sprays uniformly, the mechanism is very simple; it becomes more challenging when some flows require in-order delivery (and must therefore occupy certain paths deterministically). UE offers a variety of optional load-balancing algorithms for deciding how to assign an EV to each packet; the choice of the best scheme remains open for vendor differentiation and academic research.
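
As a hedged illustration of the trade-off just described (the specification deliberately leaves the actual algorithm to vendors), the sketch below sprays a fresh EV per packet for out-of-order traffic while pinning a single EV for flows that need in-order delivery. The class and its policy are assumptions, not a UE-mandated design:

```python
import itertools
import random

class EvAssigner:
    """Toy sender-side EV policy: spray by default, pin for ordered flows."""
    def __init__(self) -> None:
        self._spray = itertools.count()  # rotating per-packet EVs
        self._pinned: dict = {}          # flow id -> fixed EV

    def next_ev(self, flow_id: str, in_order: bool) -> int:
        if in_order:
            # Ordered flows reuse one EV so ECMP keeps them on one path.
            return self._pinned.setdefault(flow_id, random.randrange(2**16))
        # Sprayed traffic cycles through the 16-bit EV space per packet.
        return next(self._spray) % 2**16

lb = EvAssigner()
sprayed = [lb.next_ev("flowA", in_order=False) for _ in range(4)]  # all differ
pinned = {lb.next_ev("flowB", in_order=True) for _ in range(4)}    # single EV
```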

Ultra Ethernet profiles

The UE specification provides three profiles (HPC, AI Full, and AI Base) that support different feature sets, enabling implementations of differing complexity. The HPC profile provides the richest feature set, including wildcard tag matching, and is optimized for MPI and OpenSHMEM workloads. The AI Full profile is a superset of the AI Base profile.