
What to Do When SRAM Stops Scaling?

Semiconductor Industry Observation, 2025-09-17 10:39
Enable explicit management of data layout and migration.

Memory latency, bandwidth, capacity, and energy consumption are increasingly the bottlenecks to improving performance.

In this article, we re-examine system architectures in which a large number of CPUs share a huge memory pool (ranging from several terabytes to petabytes). We argue that the practical engineering limits of scaling and signaling constrain such designs.

We therefore propose the opposite approach.

Instead of building one large, shared, homogeneous memory, the system explicitly partitions memory into smaller segments that are tightly coupled to the compute units. Thanks to advances in monolithic, 2.5D, and 3D integration, such a "compute-memory node" provides private local memory in which node-exclusive data sits within micrometers of the logic, dramatically reducing access cost.

On-package memory holds state shared within the processor and offers better bandwidth and energy efficiency than off-package DRAM, while off-package DRAM serves as bulk memory for large working sets and cold data. Because the hardware makes capacity and access distance explicit at each level, software can build this memory hierarchy efficiently and manage data layout and migration.
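To illustrate what "explicitly defined capacity and access distance" could look like to software, here is a minimal C sketch of a tier-descriptor table that a runtime might consult when placing data. The tier names and all capacity, latency, and energy figures are assumptions made for the sketch, not values from the article.

```c
#include <stdio.h>
#include <stddef.h>

/* Illustrative descriptors for an explicit memory hierarchy. All numbers are
 * placeholder assumptions, not measurements from the article. */
typedef struct {
    const char        *name;
    unsigned long long capacity_bytes;
    unsigned           latency_ns;     /* rough load-to-use latency   */
    double             energy_pj_bit;  /* rough access energy per bit */
} mem_tier_desc;

static const mem_tier_desc tiers[] = {
    { "node-local (3D-stacked)", 512ULL << 20,  15,  0.5 },
    { "package-shared (HBM)",     64ULL << 30, 100,  3.0 },
    { "off-package (DDR DRAM)",    2ULL << 40, 300, 15.0 },
};

int main(void) {
    /* A runtime could consult such a table when deciding where to place data. */
    for (size_t i = 0; i < sizeof tiers / sizeof tiers[0]; i++)
        printf("%-25s %12llu MiB  ~%3u ns  ~%4.1f pJ/bit\n",
               tiers[i].name, tiers[i].capacity_bytes >> 20,
               tiers[i].latency_ns, tiers[i].energy_pj_bit);
    return 0;
}
```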

Introduction

The idea of a single large, distributed memory address space is very attractive. It allows applications to scale seamlessly beyond a single host while delegating the complexity of caching, coherence, and placement to the underlying system. In the 1980s and 1990s, this idea was explored as Distributed Shared Memory (DSM) and contributed to the memory consistency models of modern multi-core and multi-processor systems.

As memory increasingly becomes a bottleneck in data centers and cloud servers, these concepts are being revisited to build a new generation of systems in which a huge network-attached memory pool is shared by many processors. In this article, we argue that this approach is impractical because of two obstacles in modern engineering: scaling and signaling. These are real, physically grounded limits.

The first obstacle is scaling: the ability to make transistors and circuits smaller and cheaper with finer lithography and more complex manufacturing processes. Memory scaling is practically over. The cost per byte of static random-access memory (SRAM) and dynamic random-access memory (DRAM) has plateaued, and there is no clear path to lower costs over the next five years. While logic continues to shrink (albeit more slowly than before), memory's share of system cost keeps growing, which makes large memory configurations unattractive both economically and architecturally. Instead, we should focus on improving memory utilization.

The second obstacle is signaling: for a given bandwidth, the energy required to move signals between components means that memory bandwidth and energy efficiency can only improve through tighter integration with compute logic [1]. Within a chip, reaching a remote SRAM cache line is already slower and more energy-hungry, and crossing a die boundary costs even more. Reaching DRAM over printed-circuit-board traces is an order of magnitude more expensive, and reaching remote memory over CXL or RDMA adds further cost. These penalties make remote memory extremely expensive.

Given these obstacles, we propose a different approach: a physically composable, decoupled architecture. The system consists of compute-memory nodes that tightly couple compute with private local memory and shared on-package memory, while off-package DRAM provides bulk capacity. Software must construct the memory system itself: it must expose both the "almost zero-distance local memory" and the higher-latency tiers through an abstraction layer and decide which data stays local, which is shared, and which is moved to off-package DRAM, so that data layout and migration are managed efficiently.

The End of 2D Scaling: SRAM and DRAM

Two-dimensional (2D) semiconductor scaling has long delivered higher memory density and capacity at falling cost. However, Figure 1 shows that traditional 2D scaling has hit its limits for both SRAM and DRAM. DRAM cost per byte has stagnated for more than a decade, which is why DRAM dominates system cost as servers scale up [2]. SRAM has hit a similar wall: we can no longer make SRAM cells smaller.

For SRAM, the main limitation is that transistor dimensions are approaching atomic scale: manufacturing variation limits how well the transistors in the cross-coupled inverter pair can be matched, eroding noise margins. Compute logic does not suffer from this problem because digital signals are restored at every gate. For DRAM, the main limitations are the etch cost of high-aspect-ratio capacitors and the complex transistor geometry needed to keep leakage low. Advanced processes can still shrink the physical size of a DRAM cell, but they cannot lower the cost per bit. We can still build DRAM DIMMs with larger capacities, but the cost per byte will not fall.

The most important consequence of these limits is that large memory is becoming more and more expensive. On-chip caches cannot grow faster than die area, and modern server packages are already very large (AMD's SP5 package measures 5,428 mm²). Systems must therefore use memory more efficiently.

Locality = Efficiency and Bandwidth

Tighter integration improves the bandwidth and energy efficiency of data movement between memories. Caches are a good example of this effect: L1, L2, and L3 caches are built from the same SRAM technology, yet L1 delivers far better performance thanks to smaller banks, finer access granularity, and physical proximity to the CPU core.

DRAM bandwidth per processor socket is growing only slowly: a modern DDR5-5600 DIMM delivers 358 Gbps, and the number of memory channels per socket has risen from eight to twelve, for a total of roughly 4.3 Tbps. Over the same period, core counts per socket have grown as fast as or faster than bandwidth. Figure 2 plots the per-core bandwidth of Intel and AMD server packages since 2018: it has stagnated.
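For reference, these figures follow from simple arithmetic; the small C program below reproduces them. The 96-core socket used for the per-core estimate is an assumption, not a number from the article.

```c
#include <stdio.h>

int main(void) {
    /* DDR5-5600: 5600 MT/s over a 64-bit data bus per DIMM. */
    double per_dimm_gbps   = 5600e6 * 64 / 1e9;        /* -> ~358.4 Gbps          */
    double per_socket_tbps = 12 * per_dimm_gbps / 1e3; /* 12 channels -> ~4.3 Tbps */

    /* Assumed core count for a current server socket; not taken from the article. */
    int cores = 96;
    double per_core_gbps = 12 * per_dimm_gbps / cores;

    printf("per-DIMM bandwidth  : %.1f Gbps\n", per_dimm_gbps);
    printf("per-socket bandwidth: %.2f Tbps\n", per_socket_tbps);
    printf("per-core bandwidth  : %.1f Gbps (assuming %d cores)\n", per_core_gbps, cores);
    return 0;
}
```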

DRAM's bandwidth limits and energy problems stem from its connection over the printed circuit board (PCB): the number of copper traces and contact pins available is limited (a DDR5 DIMM, for example, has only 288 pins). High-bandwidth memory (HBM) attacks this with tighter integration, restacking DRAM dies close together. By placing a silicon logic die under several DRAM dies inside the package and connecting them with through-silicon vias, each HBM3E stack achieves a 1,024-pin interface over much shorter paths. This large difference in pin count translates directly into HBM's bandwidth advantage. Table 1 shows how tighter physical integration enables more pins, higher bandwidth, and lower energy; a smaller pin count forces faster signaling circuits, which cost more energy.
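As a rough illustration of how pin count drives the bandwidth gap, the sketch below multiplies data pins by a per-pin signaling rate. Only the pin counts come from the text; the ~9.6 Gb/s per-pin rate for HBM3E is an assumed figure for current parts.

```c
#include <stdio.h>

int main(void) {
    /* DDR5-5600 DIMM: 64 data pins at 5.6 Gb/s each (the 288-pin connector
     * also carries command, address, and power pins). */
    double ddr5_gbps = 64 * 5.6;

    /* HBM3E stack: 1,024 data pins; ~9.6 Gb/s per pin is an assumption. */
    double hbm_gbps = 1024 * 9.6;

    printf("DDR5-5600 DIMM: %7.1f Gb/s (~%6.1f GB/s)\n", ddr5_gbps, ddr5_gbps / 8);
    printf("HBM3E stack   : %7.1f Gb/s (~%6.1f GB/s)\n", hbm_gbps, hbm_gbps / 8);
    printf("ratio         : %.1fx\n", hbm_gbps / ddr5_gbps);
    return 0;
}
```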

These integration limits mean that DRAM can no longer drive per-core performance gains. The PCB cannot host more DIMMs because pin counts have already reached their practical limit, and driving higher signaling rates over copper traces is very energy-intensive.

A Physically Composable, Decoupled Design

These scaling problems force us to fundamentally rethink the memory hierarchy: the focus must shift from raw capacity to locality, bandwidth, and energy efficiency.

We propose the opposite of traditional memory "disaggregation": tighter integration of compute and memory that prioritizes memory utilization, even if this means slightly lower compute utilization. Its core is the compute-memory node, in which the compute unit and local memory are stacked on top of each other with 3D integration, as in AMD's 3D V-Cache design and the Milan-X processor, for example.

Unlike an ordinary cache, this private local memory can be explicitly managed and dedicated to node-specific data (e.g., execution stacks and other thread-private state). Micrometer-scale access via microbumps, hybrid bonding, through-silicon vias, or monolithic wafer-level interconnects sharply reduces the latency, energy, and bandwidth penalties of a large address space. Following the practice of modern multi-chip processors, state shared across nodes (e.g., locks) is placed in shared on-package memory (e.g., HBM). Although slower than the private local segments, it still offers better bandwidth and energy efficiency than off-package DRAM.
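A minimal sketch of how such explicit placement might look to software, assuming hypothetical tier names and an alloc_in_tier() helper that a real runtime would have to back with hardware and OS support:

```c
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>

/* Hypothetical memory tiers of a compute-memory node (names are illustrative). */
typedef enum {
    TIER_NODE_LOCAL,     /* 3D-stacked private memory, micrometers from the core */
    TIER_PACKAGE_SHARED, /* on-package HBM shared between nodes                  */
    TIER_OFF_PACKAGE     /* bulk DDR DRAM on the board                           */
} mem_tier_t;

/* Placeholder allocator: it simply forwards to malloc and records the intended
 * tier; a real system would map each tier onto physical memory regions. */
static void *alloc_in_tier(mem_tier_t tier, size_t bytes) {
    printf("allocating %zu bytes in tier %d\n", bytes, (int)tier);
    return malloc(bytes);
}

int main(void) {
    /* Thread-private state (e.g., an execution stack) stays in node-local memory. */
    void *stack = alloc_in_tier(TIER_NODE_LOCAL, 64 * 1024);

    /* A lock shared across nodes lives in on-package shared memory. */
    pthread_mutex_t *lock = alloc_in_tier(TIER_PACKAGE_SHARED, sizeof *lock);
    pthread_mutex_init(lock, NULL);

    /* A large, mostly cold working set goes to off-package DRAM. */
    void *cold = alloc_in_tier(TIER_OFF_PACKAGE, 256u * 1024 * 1024);

    pthread_mutex_destroy(lock);
    free(stack); free(lock); free(cold);
    return 0;
}
```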

However, integration is bounded by physical constraints (e.g., heat dissipation and package size) [3], so off-package DRAM is still needed for bulk capacity. DRAM is no longer treated as a shared, flat address-space pool but as a capacity-oriented tier for large working sets and cold data, while performance-critical accesses are served by the faster, distributed on-package memories. Software must construct the memory system itself: it must expose both the "almost zero-distance local memory" and the higher-latency tiers through an abstraction layer and decide which data stays local, which is shared, and which moves to off-package DRAM, so that data layout and migration are managed efficiently.
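One way to picture this layout-and-migration management is an epoch-based policy that promotes hot pages toward the node and demotes cold ones to off-package DRAM. The following sketch is illustrative only; the page abstraction, epoch, and threshold are assumptions, not a mechanism described in the article.

```c
#include <stdio.h>

#define NPAGES        8
#define HOT_THRESHOLD 4   /* accesses per epoch; an assumed tuning knob */

typedef enum { TIER_OFF_PACKAGE, TIER_NODE_LOCAL } tier_t;

typedef struct {
    unsigned accesses;  /* accesses observed during the current epoch */
    tier_t   tier;      /* where the page currently lives             */
} page_t;

/* At the end of each epoch, promote hot pages to node-local memory, demote
 * cold pages to off-package DRAM, and reset the counters. */
static void migrate_epoch(page_t *pages, int n) {
    for (int i = 0; i < n; i++) {
        tier_t want = pages[i].accesses >= HOT_THRESHOLD ? TIER_NODE_LOCAL
                                                         : TIER_OFF_PACKAGE;
        if (want != pages[i].tier) {
            printf("page %d: %s\n", i,
                   want == TIER_NODE_LOCAL ? "promote to node-local memory"
                                           : "demote to off-package DRAM");
            pages[i].tier = want;  /* a real system would copy the data here */
        }
        pages[i].accesses = 0;
    }
}

int main(void) {
    page_t pages[NPAGES] = {{0}};
    pages[0].accesses = 9;                /* hot: will be promoted */
    pages[3].tier     = TIER_NODE_LOCAL;  /* cold: will be demoted */
    migrate_epoch(pages, NPAGES);
    return 0;
}
```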

This article is from the WeChat account "Semiconductor Industry Observation" (ID: icbank), author: Stanford. Published by 36Kr with permission.