
Unmasking the "Supernode": Without Unified Memory Addressing, It's Still a Server Stack

Alter聊科技 · 2026-02-02 16:03
Without the ability of "unified memory addressing", it's just riding on the coattails of the "super node" trend.

When multi-modal large models with trillions of parameters become the norm, the "arms race" in the AI industry has long since shifted:

It is no longer just about competing in model parameters and stacking servers. Instead, it has delved into the underlying computing architecture, initiating a "system-level confrontation."

Thus, the "supernode" has become the "new favorite" in the computing industry.

As of now, more than a dozen domestic enterprises have launched "supernodes," but the execution has been distorted: it seems that as long as dozens of servers are stuffed into a cabinet and wired together with optical fiber, the result can be labeled a "supernode" and claimed to have broken Moore's Law.

After comparing the technical logic of multiple "supernodes," we discovered a harsh technical truth: if "unified memory addressing" cannot be achieved, a so-called "supernode" is something of an impostor, and in essence it is still the stacking architecture of traditional servers.

01 Why Do We Need Supernodes? The Root Cause Lies in the "Communication Wall"

Let's go back to the origin: Why does the Scale Out cluster architecture, which has been used for more than two decades in the Internet era, no longer work in the era of large models?

The "Supernode Development Report" released by the China Academy of Information and Communications Technology a few months ago has already provided the answer, vividly summarizing the reasons as "three walls":

The first is the communication wall. In the large model training scenario, the communication frequency increases exponentially with the number of model layers and parallelism. The microsecond-level protocol stack latency accumulates over trillions of iterations, causing the computing units to be in a waiting state for a long time, directly limiting the utilization rate of computing power.

The second is the power consumption and heat dissipation wall. To solve the latency and waiting issues, engineers have to rack their brains to increase the computing power density and stuff as many computing units as possible into a cabinet. The cost is terrifying heat dissipation pressure and power supply challenges.

The third is the complexity wall. The hardware stacking of "brute force works miracles" has pushed cluster scale from thousands of cards to tens of thousands or even hundreds of thousands, but operations and maintenance complexity has grown in step: during large model training, faults must be handled every few hours.

The real challenge at hand is that large models are moving from single-modal to full-modal fusion, with context lengths reaching the megabyte level, training datasets up to 100TB, and latency requirements under 20 milliseconds in scenarios such as financial risk control. The traditional computing architecture has hit an obvious bottleneck.

To meet the new computing power requirements, breaking the "communication wall" is destined to be an inevitable step. Besides stacking servers, are there any other paths?

Let's first sort out the technical principle behind the "communication wall."

In the traditional cluster architecture, the principles of "separation of storage and computing" and "node interconnection" are followed. Each GPU is an isolated island, having its own independent territory (HBM video memory) and only understanding "local language." When it needs to access the data of the neighboring server, it has to go through a cumbersome "diplomatic procedure":

Step one is data migration. The sender copies the data from the HBM to the system memory.

Step two is protocol encapsulation. The data is sliced and encapsulated with TCP/IP or RoCE packet headers.

Step three is network transmission. The data packets are routed to the target node through the switch.

Step four is unpacking and recombination. The receiver parses the protocol stack and strips the packet headers.

Step five is data writing. The data is finally written to the memory address of the target device.

The academic term for this process is "serialization, network transmission, deserialization," and it carries a latency of several milliseconds. When serving web requests, such latency does not hurt the user experience. In large model training, however, the model is split into thousands of pieces, and the computation of each neural-network layer requires extremely high-frequency synchronization between chips. It is like doing a math problem where you have to phone your neighbor to confirm every digit you write down: the problem-solving efficiency is dismal.
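The five-step "diplomatic procedure" above can be sketched in miniature. This toy Python example is illustrative only (real stacks use RDMA verbs or collective-communication libraries, not pickle over a socket): it serializes a payload, pushes the bytes through a socket, and deserializes them on the other side, which is exactly the overhead memory semantics avoid.

```python
import pickle
import socket
import threading

# Toy model of the message-semantics path: serialize, transmit, deserialize.
sender, receiver = socket.socketpair()

def receive(sock, out):
    chunks = []
    while True:
        buf = sock.recv(4096)
        if not buf:                            # EOF: sender closed its end
            break
        chunks.append(buf)
    # Step 4: reassemble and deserialize; Step 5: write to local memory.
    out["tensor"] = pickle.loads(b"".join(chunks))

out = {}
t = threading.Thread(target=receive, args=(receiver, out))
t.start()

payload = pickle.dumps([1.0, 2.0, 3.0])        # Step 1: copy out + serialize
sender.sendall(payload)                        # Steps 2-3: framing and transport
sender.close()                                 # signal EOF to the receiver
t.join()
receiver.close()
```

Even in this single-process toy, every byte is copied and re-parsed on both ends; at cluster scale, that copy-and-parse tax is the "communication wall."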

The industry has therefore proposed the concept of "supernodes" and stipulated three hard indicators: high bandwidth, low latency, and unified memory addressing.

The first two are easy to grasp: widen the road (high bandwidth) and make the cars run faster (low latency). The most essential, and hardest to achieve, is "unified memory addressing." The goal is to build a single global virtual address space in which the memory resources of every chip in the cluster are mapped onto one huge map. Whether the data sits in a unit's own video memory or in the memory of a neighboring cabinet, to the computing unit it is just another address.

Returning to the math-problem analogy: you no longer "phone" your neighbor, you simply "reach over" and take the data. The serialization and deserialization overhead is eliminated, the "communication wall" no longer exists, and compute utilization has room to climb.
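As a thought experiment, the "one huge map" can be modeled in a few lines of Python. The class, device names, and sizes below are all invented for illustration: each device's memory is assigned a range in one flat address space, and a load or store is nothing more than indexing.

```python
class GlobalAddressSpace:
    """Toy flat address space: every device's memory gets a base offset,
    and any compute unit reaches any byte by its global address."""
    def __init__(self, devices):
        self.ranges, base = {}, 0
        for name, size in devices:
            self.ranges[name] = (base, base + size)
            base += size
        self.mem = bytearray(base)           # one unified backing store

    def addr(self, device, offset):
        lo, hi = self.ranges[device]
        assert lo + offset < hi, "offset out of range for this device"
        return lo + offset                   # device-local -> global address

    def store(self, addr, value):
        self.mem[addr] = value

    def load(self, addr):
        return self.mem[addr]

gas = GlobalAddressSpace([("nodeA_hbm", 1024), ("nodeB_ddr", 4096)])
remote = gas.addr("nodeB_ddr", 16)   # an address in the neighbor's memory
gas.store(remote, 7)                 # no packets, no protocol: just an address
assert gas.load(remote) == 7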

02 What Makes Unified Memory Addressing Difficult? The "Generational Gap" in Communication Semantics

Since "unified memory addressing" has been proven to be the correct path, why do some "supernodes" on the market still stay at the level of server stacking?

It is not just the gap in engineering capabilities but also the generational gap in "communication semantics," involving communication protocols, data ownership, and access methods.

There are currently two mainstream communication methods.

One is the message semantics for distributed collaboration, usually embodied by send and receive operations. Its working mode is like "sending a courier."

Suppose you want to deliver a book. You first have to pack the book in a box (construct a data packet), fill in the courier form with the recipient's address and phone number (IP address, port), call a courier to send it to the logistics center (switch), the recipient unpacks the box to take out the book after receiving the courier (unpack), and finally, the recipient has to reply "received" (ACK confirmation).

Even if the courier runs very fast (high bandwidth), the time for packing, unpacking, and intermediate transfer (latency and CPU overhead) cannot be saved.

The other is the memory semantics for parallel computing, usually embodied by load and store instructions. Its working mode is like "taking a book from the bookshelf."

Similarly, when delivering a book, you just walk directly to the public bookshelf, reach out to take it down (Load instruction), and put it back after reading (Store instruction). There is no packing, no form filling, and no "middleman making a profit." The improvement in efficiency is obvious.
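The "bookshelf" analogy maps naturally onto shared memory. Here is a hedged single-process sketch using Python's standard `multiprocessing.shared_memory` module, a software stand-in for hardware load/store, not a real NVLink analogue: both sides address the same buffer directly, with no packing, no headers, and no acknowledgment round trip.

```python
from multiprocessing import shared_memory

# One shared "bookshelf": reads and writes are direct buffer accesses.
shelf = shared_memory.SharedMemory(create=True, size=64)
try:
    shelf.buf[0] = 42        # "store": put the book on the shelf
    # A second process could attach via SharedMemory(name=shelf.name)
    # and see the same byte immediately.
    value = shelf.buf[0]     # "load": reach out and take it
finally:
    shelf.close()
    shelf.unlink()
```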

Protocols such as TCP/IP, InfiniBand, and RoCE v2 support message semantics, which is the direct cause of the communication wall. Protocols such as Lingqu and NVLink, by contrast, already support memory semantics. Why, then, do "pseudo-supernodes" still fail to achieve unified memory addressing?

Because the crown jewel of memory semantics is "cache coherence": If node A modifies the data at the shared memory address 0x1000, and node B's L2 cache has a copy of this address, it must ensure that node B's copy is immediately invalidated or updated.

To achieve "memory semantics," two conditions must be met:

First is the communication protocol and cache coherence.

The communication protocol no longer transmits bulky "data packets" but "Flits" that contain memory addresses, operation codes (read/write), and cache status bits. At the same time, a cache coherence protocol is also needed to broadcast coherence signals through the bus to ensure that all computing units see the same information.

Second is the switching chip that acts as a "translator."

The switching chip plays the role of a "translator," enabling devices such as CPUs, NPUs/GPUs to communicate with each other under a unified protocol and integrate into a unified global address space. No matter where the data is stored in the memory, there is only one "global address," and CPUs, NPUs, and GPUs can directly access through the address.
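The cache-coherence requirement from the first condition can be sketched as a toy write-invalidate protocol. This is a drastic simplification of real MESI-style hardware; the class, the "bus" list, and the write-through policy are all invented for illustration.

```python
class CoherentCache:
    """Toy write-invalidate cache: a store broadcasts an invalidation
    to every peer on the bus before updating its own copy."""
    def __init__(self, bus):
        self.lines = {}                      # addr -> cached value
        self.bus = bus
        bus.append(self)

    def load(self, mem, addr):
        if addr not in self.lines:           # miss: fetch from shared memory
            self.lines[addr] = mem[addr]
        return self.lines[addr]

    def store(self, mem, addr, value):
        for peer in self.bus:                # coherence broadcast
            if peer is not self:
                peer.lines.pop(addr, None)   # invalidate stale copies
        self.lines[addr] = value
        mem[addr] = value                    # write-through, for simplicity

bus, mem = [], {0x1000: 1}
node_a, node_b = CoherentCache(bus), CoherentCache(bus)
node_b.load(mem, 0x1000)              # node B caches a copy of 0x1000
node_a.store(mem, 0x1000, 2)          # node A's write invalidates B's copy
assert node_b.load(mem, 0x1000) == 2  # B re-fetches and sees the new value
```

This is exactly the scenario from the article: node A writes address 0x1000, and node B's stale copy must be invalidated before B's next load.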

Most "pseudo-supernodes" that cannot meet the above conditions adopt the PCIe + RoCE protocol interconnection solution, which is a typical case of "using big words to attract attention and small words to evade responsibility."

RoCE cross-server memory access relies on RDMA, does not support unified memory semantics, and lacks hardware-level cache coherence; it still needs network cards, queues, and doorbell mechanisms to trigger a transfer. In essence it is still "sending a courier," just with a slightly faster courier. Moreover, the theoretical bandwidth of a full PCIe 5.0 x16 link is about 64GB/s, an order of magnitude below what supernodes require.

As a result, such products market themselves as "supernodes" yet do not support unified memory addressing and cannot achieve global memory pooling or memory-semantic access between AI processors. The cluster achieves only "board-level" memory sharing (for example, 8-card interconnection within a single machine). Once an access crosses the server node, it must fall back to message semantics, leaving an obvious optimization bottleneck.

03 What's the Value of Supernodes? The Perfect "Partner" for Large Models

Many people may ask: what is the point of going to so much trouble to achieve "unified memory addressing"? Is it just technical perfectionism?

Let us start with the conclusion: unified memory addressing is by no means a "useless skill." In real large-model training and inference workloads, it has been proven to deliver substantial benefits.

The first scenario is model training.

When training an ultra-large model with trillions of parameters, the HBM capacity is often the primary bottleneck. A single card has 80GB of video memory. After stuffing in the model parameters and intermediate states, there is often little left.

When the video memory is insufficient, the traditional approach is "Swap to CPU" - use PCIe to move the data to the CPU's memory for temporary storage. However, there is a big problem: The bandwidth of PCIe is too low, and the CPU needs to participate in the copying. The time for moving the data back and forth is longer than the time for GPU calculation, and the training speed drops significantly.

Under the real supernode architecture, the CPU's memory (DDR) and the NPU's video memory (HBM) are in the same address space. The strategy of "using storage instead of calculation" can be adopted to finely manage the memory: Offload the temporarily unused data or weights to the CPU memory, and quickly pull them back to the on-chip memory for activation through the "high bandwidth & low latency" capability when needed. The utilization rate of the NPU can be increased by more than 10%.
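The "using storage instead of calculation" strategy can be sketched as a tiny two-tier pool. The capacities, names, and LRU eviction policy below are illustrative only; real systems decide what to offload from the training schedule, not from access recency.

```python
class OffloadPool:
    """Toy offload manager: a small fast "HBM" tier backed by a large
    "DDR" tier, with cold tensors evicted and pulled back on demand."""
    def __init__(self, hbm_capacity):
        self.hbm, self.ddr = {}, {}
        self.capacity = hbm_capacity
        self.order = []                        # LRU order of names in HBM

    def put(self, name, tensor):
        if len(self.hbm) >= self.capacity:
            cold = self.order.pop(0)           # evict the coldest tensor
            self.ddr[cold] = self.hbm.pop(cold)
        self.hbm[name] = tensor
        self.order.append(name)

    def get(self, name):
        if name in self.ddr:                   # pull back into HBM on access
            self.put(name, self.ddr.pop(name))
        self.order.remove(name)                # refresh recency
        self.order.append(name)
        return self.hbm[name]

pool = OffloadPool(hbm_capacity=2)
pool.put("w1", "weights-1")
pool.put("w2", "weights-2")
pool.put("w3", "weights-3")     # HBM full: w1 is offloaded to DDR
assert "w1" in pool.ddr
assert pool.get("w1") == "weights-1"   # transparently pulled back
```

In a real supernode the pull-back is a plain load against a global address over a high-bandwidth link, which is what makes the round trip cheap enough to be worth it.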

The second scenario is model inference.

In multi-round conversations, each round requires Put and Get operations: Put writes KV data into the memory pool, and Get retrieves it. Frequent storage of this data demands an ever larger KV Cache space.

In a traditional cluster, the KV Cache is usually bound to the video memory of a single card. If a user asks an extremely long question and the video memory of node A is filled up by the KV Cache, even if the video memory of the nearby node B is empty, it cannot be borrowed without unified memory addressing. The task has to be re-scheduled and recalculated.

With unified memory addressing, the KV Cache can be pooled globally, and Prefix Cache reuse is supported. For example, the "System Prompt" is usually fixed, so only one copy needs to be stored in global memory, and all nodes can read it directly in a "store once, read many" fashion. When the Prefix Cache hit rate is 100%, cluster throughput can be increased by 3 times.
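The Put/Get pattern with prefix reuse can be sketched as a toy pool keyed by a hash of the token prefix. The hashing scheme, the hit counters, and the placeholder KV payload are all invented for illustration.

```python
import hashlib

class GlobalKVPool:
    """Toy globally pooled KV Cache: one copy of a shared prefix's KV
    data serves every node ("store once, read many")."""
    def __init__(self):
        self.pool, self.hits, self.misses = {}, 0, 0

    def _key(self, tokens):
        # Identical token prefixes map to the same pool entry.
        return hashlib.sha256("\x1f".join(tokens).encode()).hexdigest()

    def put(self, tokens, kv):
        self.pool[self._key(tokens)] = kv

    def get(self, tokens):
        kv = self.pool.get(self._key(tokens))
        self.hits += kv is not None
        self.misses += kv is None
        return kv

pool = GlobalKVPool()
system_prompt = ["You", "are", "a", "helpful", "assistant"]
pool.put(system_prompt, {"layer0": "kv-tensors"})   # computed once
for _ in range(3):                                  # reused by every request
    assert pool.get(system_prompt) is not None
assert pool.hits == 3 and pool.misses == 0
```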

The third scenario is the recommendation system.

Search, advertising, and recommendation are the "cash cows" of the Internet, and they rely on ultra-large-scale Embedding tables. Since these tables usually far exceed a single machine's memory, they must be sharded across different servers.

During inference, the model needs to frequently fetch specific feature vectors from the Host side (CPU memory) or a remote Device. If small packets are handled in the "courier" style of RoCE, the packing and unpacking overhead alone dominates the cost, resulting in severe doorbell overhead and high latency.

With unified memory addressing, combined with a hardware-level memory-transfer engine, the computing unit can issue read instructions directly to remote memory while the hardware handles the transfer automatically. While the first vector is still in flight, the second request has already been issued, greatly reducing communication latency, improving end-to-end recommendation efficiency, and keeping overhead close to minimal.
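The pipelining described above (issuing the second request before the first vector has arrived) can be sketched with asyncio standing in for the hardware transfer engine. The table, the indices, and the latency figure are all invented for illustration.

```python
import asyncio
import time

async def remote_load(table, idx, latency=0.02):
    await asyncio.sleep(latency)       # stand-in for the link round trip
    return table[idx]

async def fetch_pipelined(table, indices):
    # All loads are issued at once; total wait is ~one latency, not N of them.
    return await asyncio.gather(*(remote_load(table, i) for i in indices))

table = {i: [float(i)] * 4 for i in range(100)}    # toy embedding shard
start = time.perf_counter()
vecs = asyncio.run(fetch_pipelined(table, [3, 7, 42]))
elapsed = time.perf_counter() - start

assert vecs == [[3.0] * 4, [7.0] * 4, [42.0] * 4]
assert elapsed < 3 * 0.02 * 2   # far less than three sequential round trips
```

The same idea in hardware: because a remote read is just a load against a global address, the engine can keep many loads in flight and hide the per-access latency.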

Without exaggeration: only when high bandwidth, low latency, and unified memory addressing work together can a cluster truly operate like a single computer. Only then is it a real supernode, the ideal "partner" for large model training and inference, and the inevitable direction for the evolution of computing infrastructure in the AGI era. Without the ability of "unified memory addressing," it is ultimately just riding on the popularity of "supernodes."

04 Conclusion

When