
Exposing the "Super Node" Disguise: Without Unified Memory Addressing, It's Still Just a Server Cluster

Alter聊科技 · 2026-02-02 16:03
Lacking the ability to do "unified memory addressing" ultimately means just riding the wave of the "super node" hype.

When multimodal language models with billions of parameters become the norm, the "arms race" in the AI industry has already taken a turn:

It's no longer just about increasing model parameters and stacking servers. Instead, the focus has shifted to the underlying computing architecture, and a "system-wide duel" has begun.

The "super node" has thus become the "new favorite" of the computing industry.

So far, more than a dozen Chinese companies have launched "super nodes" on the market, but distortions have crept in: it seems all one needs to do is house a few dozen servers in a rack, connect them with fiber-optic cables, slap on the "super node" label, and claim to have broken Moore's Law.

After comparing the technical approaches of several "super nodes", we discovered a harsh truth: if a system cannot achieve "unified memory addressing", the so-called "super node" is more like a fraudster passing itself off as the real thing; in essence, it is still the stacking architecture of traditional servers.

01 Why do we need super nodes? The root lies in the "communication barrier"

Let's first go back to the starting point: Why does the Scale-Out cluster architecture, which served the Internet era for more than twenty years, no longer work in the era of large language models?

The China Academy of Information and Communications Technology has already given the answer in its "Report on the Development of Super Nodes" published a few months ago and vividly summarized the reasons as the "three walls":

The first is the communication wall. In language model training scenarios, communication frequency grows exponentially with the number of model layers and the degree of parallelism. The microsecond-level latency of the protocol stack accumulates over billions of iterations, leaving the compute units idle in a waiting state for long stretches, which directly caps compute utilization.

The second is the power consumption and cooling barrier. To reduce latency and waiting times, engineers have to pull out all the stops to increase the computing power per unit area and house as many computing modules as possible in a rack. However, the price for this is a terrifying cooling pressure and a great challenge in power supply.

The third is the complexity wall. Brute-force hardware scaling has grown clusters from thousands of GPUs to tens of thousands and even hundreds of thousands, but operational complexity has grown with them: during language model training, failures have to be fixed every few hours.

The real challenges we face are that language models are moving from single-modality to full-modality fusion, context lengths are reaching the million-token range, training datasets run to 100 TB, and latency requirements in scenarios such as financial risk assessment are under 20 milliseconds... The bottlenecks of the traditional computing architecture are obvious.

To meet the new requirements for computing power, it is inevitable to break the "communication barrier". Are there any other ways besides stacking servers?

First, let's take a look at the technical principles behind the "communication barrier".

The traditional cluster architecture follows the principles of "compute-storage separation" and "node interconnection". Each GPU is an island with its own independent territory (HBM) that speaks only "its own language". When it needs to access data on a neighboring server, it has to go through a cumbersome "diplomatic process":

Step one is data transfer. The sender copies the data from the HBM to the system memory.

Step two is protocol encapsulation. The data is encapsulated into TCP/IP or RoCE message headers.

Step three is network transmission. The data packet is routed to the destination node via a switch.

Step four is unpacking and reassembly. The receiver parses the protocol stack and strips off the message header.

Step five is data writing. The data is finally written to the memory address of the destination device.
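The five steps above can be sketched as a minimal simulation (all names here are hypothetical, purely to make the sequence concrete):

```python
import json

def send_over_network(src_hbm: dict, key: str, dst_mem: dict) -> None:
    """Simulate the five-step message-semantics path between two nodes."""
    # Step 1: data transfer -- copy from HBM into a system-memory staging buffer
    staging = src_hbm[key]
    # Step 2: protocol encapsulation -- wrap the payload in a message header
    packet = json.dumps({"header": {"proto": "RoCE", "dst": "node-B"},
                         "payload": staging})
    # Step 3: network transmission -- the packet crosses a switch (elided here)
    wire_bytes = packet.encode()
    # Step 4: unpacking -- parse the protocol stack, strip the header
    msg = json.loads(wire_bytes.decode())
    # Step 5: data writing -- land the payload in the destination memory
    dst_mem[key] = msg["payload"]

node_a_hbm = {"grad_0": [0.1, 0.2, 0.3]}
node_b_mem = {}
send_over_network(node_a_hbm, "grad_0", node_b_mem)
print(node_b_mem["grad_0"])  # -> [0.1, 0.2, 0.3]
```

Every one of those intermediate copies and header operations costs time; that accumulated overhead is exactly what the "communication wall" refers to.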

The academic term for this process is "serialization, network transmission, deserialization", and it introduces milliseconds of latency. When serving web page requests, this latency has no impact on the user experience. But when training language models, the model is split into thousands of shards, and every layer of the neural network requires extremely frequent synchronization between chips. It's like solving a math problem where you have to phone a neighbor to confirm every digit you write: the efficiency is abysmal.

The industry then developed the concept of the "super node" and set three hard criteria: high bandwidth, low latency, and unified memory addressing.

The first two are easy to grasp. Simply put, they mean building a wider road (high bandwidth) and letting the cars drive faster (low latency). The core, and the hardest part, is "unified memory addressing": creating a globally unique virtual address space in which the memory resources of every chip in the cluster are mapped onto one huge map. Whether the data sits in a chip's own HBM or in the memory of a neighboring rack, to the compute units it is merely a difference in address.

When solving the math problem, one no longer has to "phone the neighbor" but can simply reach out a hand and take the data. The cost of serialization and deserialization is eliminated, the "communication wall" falls, and compute utilization has room to improve.
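A toy model can make "one huge map" concrete. This is a sketch under simplifying assumptions (the class and method names are invented for illustration, not any vendor's API): every device's memory is mapped into one flat address range, and remote data is reached by an ordinary load or store, not a network message.

```python
class GlobalAddressSpace:
    """Toy model of unified memory addressing: every device's memory is
    mapped into one flat address range; access is just an address lookup."""
    def __init__(self):
        self.regions = []      # (base, size, backing storage)
        self.next_base = 0

    def map_device(self, size: int) -> int:
        """Map one device's memory into the global space; return its base."""
        base = self.next_base
        self.regions.append((base, size, [0] * size))
        self.next_base += size
        return base

    def _locate(self, addr: int):
        for base, size, mem in self.regions:
            if base <= addr < base + size:
                return mem, addr - base
        raise ValueError("unmapped address")

    def load(self, addr: int) -> int:            # analogous to a hardware load
        mem, off = self._locate(addr)
        return mem[off]

    def store(self, addr: int, value: int) -> None:  # analogous to a store
        mem, off = self._locate(addr)
        mem[off] = value

gas = GlobalAddressSpace()
gpu0 = gas.map_device(1024)    # local HBM
gpu7 = gas.map_device(1024)    # HBM in a neighboring rack
gas.store(gpu7 + 16, 42)       # same store instruction, local or remote
print(gas.load(gpu7 + 16))     # -> 42
```

Note what is absent: no packet, no header, no acknowledgment. The only thing distinguishing local from remote is which region the address falls into.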

02 Where lies the difficulty in unified memory addressing? The "generation gap" in communication semantics

If "unified memory addressing" has been confirmed as the right way, why do some "super nodes" on the market still stick to stacking servers?

It is not only a matter of engineering capability; there is also a "generation gap" in communication semantics, which involves the communication protocol, data ownership, and access method.

Currently, there are two common types of communication.

One is the message semantics for distributed cooperation, which is usually manifested through send and receive operations, and the way of working is like "sending a package".

Suppose you want to send a book. First you pack it in a box (create a data packet), write the recipient's address and phone number on the label (IP address and port), and hand it to a courier who takes it to the logistics center (the switch); the recipient opens the package and takes out the book (unpacking), and finally sends back an acknowledgment of receipt (ACK).

After this whole process, even if the courier service is very fast (high bandwidth), you can't save the time for packing, unpacking, and intermediate processing (latency and CPU overhead).

The other is the memory semantics for parallel computing, which is usually manifested through load and store commands, and the way of working is like "taking a book from the bookshelf".

Taking the book example again: you simply walk to the public bookshelf, reach out and take the book down (load), and put it back when you are done (store). There is no packing, no form-filling, no "middleman taking a cut", and the efficiency gain is obvious.

Protocols such as TCP/IP, InfiniBand, and RoCE v2 support message semantics and are the direct cause of the communication wall, while protocols such as Lingqu and NVLink already support memory semantics. So why do the "fake super nodes" still fail to achieve unified memory addressing?

Because the crown jewel of memory semantics is "cache coherence": if node A changes the data at the shared memory address 0x1000 and node B holds a copy of that address in its L2 cache, the system must ensure that node B's copy is immediately invalidated or updated.
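The invalidation behavior can be sketched in a few lines. This is a minimal invalidation-based scheme in the spirit of MSI-style protocols, with invented names, not any real hardware's implementation; real protocols track far richer per-line states:

```python
class CoherentMemory:
    """Minimal invalidation-based coherence: a store to an address
    invalidates every other node's cached copy of that address."""
    def __init__(self):
        self.mem = {}
        self.caches = {}                   # node -> {addr: cached value}

    def register(self, node: str) -> None:
        self.caches[node] = {}

    def load(self, node: str, addr: int) -> int:
        cache = self.caches[node]
        if addr not in cache:              # cache miss: fill from memory
            cache[addr] = self.mem.get(addr, 0)
        return cache[addr]

    def store(self, node: str, addr: int, value: int) -> None:
        self.mem[addr] = value
        self.caches[node][addr] = value
        for other, cache in self.caches.items():
            if other != node:
                cache.pop(addr, None)      # invalidate stale copies

coh = CoherentMemory()
coh.register("A"); coh.register("B")
coh.store("A", 0x1000, 1)
coh.load("B", 0x1000)              # B now holds a cached copy of 0x1000
coh.store("A", 0x1000, 2)          # A's write invalidates B's copy
print(coh.load("B", 0x1000))       # -> 2, not the stale 1
```

Doing this in software is easy; doing it in hardware, across racks, at memory-access latency, is precisely what separates a real super node from a relabeled cluster.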

To realize memory semantics, two conditions must be met:

First, a memory-semantics communication protocol with cache coherence.

The communication protocol no longer transmits bulky "data packets" but "flits" that carry the memory address, the operation code (read/write), and the cache status bit. On top of that, a cache-coherence protocol must propagate coherence signals over the bus so that all compute units see a consistent view of the data.

Second, a switch chip that acts as a "translator".

The switch chip plays the role of the "translator", enabling the CPU, NPU/GPU, and other devices to communicate with each other under a unified protocol and be integrated into a unified global address space. No matter where the data is stored, there is only one "global address", and the CPU, NPU, and GPU can directly access the data via the address.

The "fake super nodes" that fail these conditions mostly interconnect via PCIe + RoCE, a classic case of grabbing attention with the big print while hedging in the fine print.

Cross-server memory access over RoCE relies on RDMA, which does not provide unified memory semantics and lacks hardware-level cache coherence. Every transfer still has to be triggered through the NIC, its queues, and the doorbell mechanism. In essence, it is still "sending a package", just with a faster courier. And the theoretical bandwidth of a PCIe 5.0 x16 link is about 64 GB/s, an order of magnitude below what a super node requires.
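The bandwidth figure is simple arithmetic from the PCIe 5.0 signaling parameters (32 GT/s per lane, 128b/130b line encoding, 16 lanes per x16 link):

```python
# Back-of-the-envelope PCIe 5.0 x16 bandwidth from the spec's parameters.
gt_per_lane = 32e9            # 32 GT/s signaling rate per lane
encoding = 128 / 130          # usable fraction after 128b/130b line encoding
lanes = 16                    # a full x16 link

bytes_per_sec = gt_per_lane * encoding * lanes / 8
print(f"PCIe 5.0 x16: ~{bytes_per_sec / 1e9:.0f} GB/s per direction")
# -> PCIe 5.0 x16: ~63 GB/s per direction
```

That ~63 GB/s is the raw per-direction ceiling before protocol overheads; memory-semantics fabrics for super nodes are specified at hundreds of GB/s per accelerator.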

The result: a product advertises itself as a "super node" but does not support unified memory addressing and cannot achieve global memory pooling or memory-semantics access between AI processors. The cluster shares memory only at the "card level" (e.g., among the 8 cards inside one server); once access crosses server boundaries, everything falls back to message-semantics communication, with obvious optimization bottlenecks.

03 What is the value of the super node? The perfect "partner" for language models

Maybe many people wonder why one should go to so much trouble to achieve "unified memory addressing". Is it just for the sake of "technical purity"?

First, the conclusion: unified memory addressing is no showpiece technology; practice in language model training and inference has proven that it brings huge advantages.

The first scenario is model training.

When training huge models with tens of billions of parameters, HBM capacity is often the first bottleneck. A card has 80 GB of HBM, and after the model parameters and intermediate states go in, little is left.
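A rough budget shows why 80 GB runs out so fast. The figures below are illustrative assumptions, not a benchmark: a commonly cited rule of thumb for Adam-based mixed-precision training is roughly 16 bytes of state per parameter, before activations are even counted.

```python
# Rough HBM budget for training (illustrative figures, not a benchmark).
# Rule of thumb for Adam mixed precision: ~16 bytes/param
# (fp16 weights + fp16 grads + fp32 master weights + two fp32 Adam moments).
params = 20e9                  # a hypothetical 20B-parameter model
bytes_per_param = 16
hbm_per_gpu = 80e9             # one 80 GB card

total_state = params * bytes_per_param
gpus_needed = total_state / hbm_per_gpu
print(f"Training state: ~{total_state / 1e9:.0f} GB "
      f"-> at least {gpus_needed:.0f} GPUs just to hold it")
# -> Training state: ~320 GB -> at least 4 GPUs just to hold it
```

And that is only the persistent state: activations, KV caches, and communication buffers all compete for the same 80 GB.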

When HBM runs out, the traditional remedy is "Swap to CPU": data is moved over PCIe into CPU memory for temporary storage. But there is a big problem: PCIe bandwidth is too low, and the CPU has to participate in the copy. Moving data back and forth takes longer than the GPU computation itself, and training speed drops sharply.
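A quick estimate makes the stall visible. The bandwidth numbers here are illustrative round figures (roughly PCIe 5.0 x16 versus an HBM3-class local link), not measurements:

```python
# Why "Swap to CPU" stalls training: moving a tensor over PCIe vs.
# reading it from local HBM (illustrative round-number bandwidths).
tensor_gb = 8                  # a hypothetical activation block to offload
pcie_gbps = 64                 # ~PCIe 5.0 x16, one direction
hbm_gbps = 3000                # ~HBM3-class local bandwidth

pcie_ms = tensor_gb / pcie_gbps * 1000
hbm_ms = tensor_gb / hbm_gbps * 1000
print(f"over PCIe: {pcie_ms:.0f} ms vs local HBM: {hbm_ms:.2f} ms")
```

With numbers like these the transfer is tens of times slower than a local read, which is why a super node with true global memory pooling, where "remote" HBM is just another address, sidesteps the swap entirely.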