What to do if SRAM stops scaling down?
Memory latency, bandwidth, capacity, and energy consumption are increasingly the bottlenecks to performance improvement.
In this paper, we revisit the class of system architectures in which a large pool of memory (terabytes to petabytes) is shared among many CPUs. We argue that two practical engineering obstacles, scaling and signaling, limit such designs.
We therefore propose the opposite approach.
Instead of building large, shared, homogeneous memory, the system explicitly divides memory into smaller slices that are tightly coupled with compute units. Leveraging advances in monolithic/2.5D/3D integration, these "compute-memory nodes" provide private local memory, so node-exclusive data is reached over micron-scale distances at greatly reduced access cost.
On-package memory holds state shared within the processor, offering better bandwidth and energy efficiency than off-package DRAM, which serves as capacity memory for large working sets and cold data. Because memory capacity and access distance are made explicit in hardware, software can compose this memory hierarchy to manage data layout and migration efficiently.
Introduction
The idea of a large distributed memory address space is appealing. It lets applications scale seamlessly beyond a single host while leaving the complexity of caching, consistency, and placement to the underlying system. In the 1980s and 1990s this idea was explored as Distributed Shared Memory (DSM), which informed the memory consistency models of modern multi-core and multiprocessor systems.
As memory becomes the bottleneck in data centers and cloud servers, research is revisiting these concepts in an attempt to build a new generation of systems in which a vast amount of network-attached memory is shared among many processors. This paper argues that this approach is infeasible because of two obstacles of modern engineering, scaling and signaling, which are practical limitations rooted in physics.
The first is scaling: the ability to make transistors and circuits smaller and cheaper with ever more sophisticated tools and manufacturing processes. The scaling of memory technology has essentially ended. The cost per byte of Static Random Access Memory (SRAM) and Dynamic Random Access Memory (DRAM) has plateaued, with no significant cost-reduction path over the next five years. As logic continues to shrink (albeit more slowly than before), memory accounts for a growing share of system cost, making large memory capacity economically and architecturally infeasible; the focus must shift to using memory more efficiently.
The second obstacle is signaling: for a given bandwidth, the energy needed to move signals between components dictates that memory bandwidth and energy efficiency can only improve through tighter integration with compute logic [1]. Within a chip, reaching a distant SRAM cache line is slower and more energy-hungry, and crossing chips is more expensive still. Reaching DRAM over circuit-board traces costs an order of magnitude more, and reaching remote memory over CXL or RDMA adds further overhead. These costs make remote memory extremely expensive.
Facing these obstacles, we propose a different approach: a physically composable disaggregated architecture. The system consists of compute-memory nodes that tightly couple compute with private local memory and on-package shared memory, while off-package DRAM provides bulk capacity. Software composes the memory system explicitly, deciding which data to keep local, which to share between nodes, and which to spill to DRAM.
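To make this concrete, the sketch below shows what such explicit composition could look like to software. The tier names and the tier_alloc call are hypothetical illustrations rather than an existing API; the placement choices simply mirror the policy described above.

```cpp
// Hypothetical placement interface for a physically composable memory system.
// The tiers and tier_alloc() are illustrative only; a real system would back
// them with tier-aware allocators, but here they fall back to malloc so the
// sketch compiles and runs.
#include <cstddef>
#include <cstdlib>

enum class MemTier {
    NodeLocal,      // private 3D-stacked memory, micron-scale distance
    PackageShared,  // on-package shared memory (e.g., HBM)
    OffPackageDram  // capacity tier for large working sets and cold data
};

void* tier_alloc(MemTier /*tier*/, std::size_t bytes) {
    return std::malloc(bytes);  // placeholder: ignores the tier hint
}

int main() {
    // Thread-private state stays in the node-local slice.
    void* scratch    = tier_alloc(MemTier::NodeLocal, 64 * 1024);
    // State shared across nodes (e.g., a lock table) lives on-package.
    void* lock_table = tier_alloc(MemTier::PackageShared, 4 * 1024);
    // Large, cold data is explicitly pushed out to off-package DRAM.
    void* cold_buf   = tier_alloc(MemTier::OffPackageDram, 16u << 20);

    std::free(cold_buf);
    std::free(lock_table);
    std::free(scratch);
    return 0;
}
```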
The End of 2D Scaling: SRAM and DRAM
Two-dimensional (2D) semiconductor scaling has historically delivered higher storage density and capacity at lower cost. However, Figure 1 shows that conventional 2D scaling has ended for both SRAM and DRAM. The cost per byte of DRAM has been flat for more than a decade, which is why DRAM dominates system cost as servers scale up [2]. SRAM faces a similar bottleneck: we can no longer manufacture smaller SRAM cells.
For SRAM, the main limitation is that transistor dimensions are approaching the atomic scale: manufacturing tolerances limit how well the transistors in a cross-coupled inverter pair can be matched, shrinking the signal margin. Compute logic is not affected in the same way because digital signals are restored at every logic stage. For DRAM, the main constraints are the etch cost of high-aspect-ratio capacitors and the complex transistor geometry required for low leakage. More advanced process nodes can shrink the physical size of a DRAM cell, but they do not reduce the cost per cell. We can keep building higher-capacity DRAM DIMMs, but the cost per byte will not fall.
The key conclusion from these limits is that large memory capacity inevitably comes at a steep cost. On-chip caches cannot grow faster than die area, and modern server processors are already very large (the AMD SP5 package is 5,428 mm²). Systems must therefore use memory more efficiently.
Locality = Efficiency and Bandwidth
Closer integration improves the bandwidth and energy efficiency of moving data to and from memory. Caches are the prime example of this principle: L1, L2, and L3 caches are built from the same SRAM technology, yet L1 achieves far better latency and energy through smaller banks, finer access granularity, and physical proximity to the CPU core.
DRAM bandwidth per processor socket is growing slowly: a modern DDR5-5600 DIMM delivers 358 Gbps, and the number of DIMMs per socket has grown from 8 to 12, for a total of 4.3 Tbps. Over the same period, however, core counts per socket have grown as fast as or faster than bandwidth. Figure 2 shows per-core DRAM bandwidth for Intel and AMD server packages since 2018: it has stagnated.
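A quick back-of-the-envelope check of these figures, written as a runnable sketch; the 96-core count is an assumed example of a current server part, not a number taken from the paper:

```cpp
// Rough per-core DRAM bandwidth arithmetic for the figures quoted above.
#include <cstdio>

int main() {
    // DDR5-5600: 5600 MT/s over a 64-bit data bus per DIMM.
    const double per_dimm_gbps = 5600.0 * 64.0 / 1000.0;    // ~358.4 Gb/s
    const double socket_gbps   = per_dimm_gbps * 12.0;      // ~4300 Gb/s (4.3 Tb/s)

    const int cores = 96;                                   // assumed core count
    const double per_core_GBps = socket_gbps / 8.0 / cores; // bytes per second per core

    std::printf("per DIMM: %.1f Gb/s, per socket: %.2f Tb/s, per core: %.2f GB/s\n",
                per_dimm_gbps, socket_gbps / 1000.0, per_core_GBps);
    return 0;
}
```

At these numbers a core sees only about 5.6 GB/s of DRAM bandwidth, which is why adding cores without adding bandwidth leaves the per-core curve in Figure 2 flat.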
DRAM's bandwidth and energy problems stem from how it is attached over the printed circuit board (PCB): the number of copper traces and package pins is limited (a DDR5 DIMM, for example, has only 288 pins). High Bandwidth Memory (HBM) uses tighter integration to place DRAM dies much closer to the processor. By putting an on-package logic die underneath a stack of DRAM dies and connecting them with through-silicon vias, each HBM3E stack provides 1,024 data pins over much shorter interconnect distances. This large difference in pin count translates directly into HBM's bandwidth advantage. Table 1 shows how closer physical integration yields higher pin density, wider bandwidth, and lower energy per bit; lower pin density forces faster signaling circuits, which in turn costs more energy.
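The pin-count argument can be made concrete with rough per-interface arithmetic. The per-pin signaling rates below (5.6 Gb/s per data pin for DDR5-5600, roughly 9.6 Gb/s per pin for HBM3E) are typical published figures used as assumptions for illustration, not values taken from Table 1:

```cpp
// Illustrative bandwidth-from-pins arithmetic: data pins x per-pin rate.
#include <cstdio>

struct Interface {
    const char* name;
    int data_pins;        // data pins only (a DDR5 DIMM's 288 pins include
                          // power, address, and control; 64 carry data)
    double gbps_per_pin;  // assumed per-pin signaling rate
};

int main() {
    const Interface ifaces[] = {
        {"DDR5-5600 DIMM", 64,   5.6},   // ~358 Gb/s  (~45 GB/s)
        {"HBM3E stack",    1024, 9.6},   // ~9.8 Tb/s  (~1.2 TB/s)
    };
    for (const Interface& i : ifaces) {
        const double gbps = i.data_pins * i.gbps_per_pin;
        std::printf("%-16s %4d pins x %.1f Gb/s = %7.1f Gb/s (%.1f GB/s)\n",
                    i.name, i.data_pins, i.gbps_per_pin, gbps, gbps / 8.0);
    }
    return 0;
}
```

Even though the per-pin rates are of similar magnitude, the 16x difference in pin count accounts for most of the roughly 27x bandwidth gap between the two interfaces.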
These integration limits mean that per-core performance can no longer be recovered through commodity DRAM: boards cannot fit more DIMMs, pin counts are at their practical limit, and driving copper traces at higher signaling rates costs ever more energy.
Physically Composable Disaggregated Solution
These scaling challenges force a fundamental rethink of the memory hierarchy, shifting the focus from raw capacity to locality, bandwidth, and energy efficiency.
We propose to invert the traditional notion of memory "disaggregation," emphasizing finer-grained integration of compute and memory and prioritizing memory utilization, even at the cost of a slight drop in compute utilization. The core of this approach is the compute-memory node, which stacks compute with local memory using 3D integration; AMD's 3D V-Cache design and the Milan-X processors are typical examples.
Unlike an ordinary cache, this private local memory can be explicitly managed and is dedicated to node-specific data (such as the execution stack and other thread-private state). Micron-scale access distances, achieved through microbumps, hybrid bonding, through-silicon vias, or monolithic wafer-level interconnect, greatly relieve the latency, energy, and bandwidth bottlenecks that a large shared address space incurs. Following the practice of modern multi-chip processors, state shared across nodes (such as locks) is placed in on-package shared memory (such as HBM); although slower than the private local slices, its bandwidth and energy efficiency remain far better than off-package DRAM's.
However, the degree of integration is bounded by physical constraints such as heat dissipation and package size [3]. Bulk capacity must still come from off-package DRAM. DRAM is no longer a flat, shared address-space pool; it becomes a capacity tier for large working sets and cold data, while performance-critical accesses are served by the faster on-package distributed memories. Software must compose the memory system itself: an abstraction layer presents the near-zero-distance local memory alongside the higher-latency shared tiers, and software decides which data to keep local, which to share, and which to move to off-package DRAM, managing data layout and migration efficiently.
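As one illustration of the kind of policy such an abstraction layer might run, here is a minimal sketch that promotes hot data toward faster tiers and demotes cold data to the DRAM tier based on per-epoch access counts. The thresholds, tier names, and move_to_tier hook are hypothetical placeholders, not part of any described system:

```cpp
// Hypothetical hot/cold placement policy for a software-composed hierarchy.
// Thresholds and the move_to_tier() hook are illustrative placeholders.
#include <cstdio>
#include <string>
#include <vector>

enum class Tier { NodeLocal, PackageShared, OffPackageDram };

struct Object {
    std::string name;
    bool node_private;             // accessed by only one compute-memory node?
    Tier tier;
    unsigned accesses_last_epoch;  // maintained by instrumentation or sampling
};

// In a real system this would copy or remap the data; here it just records the move.
void move_to_tier(Object& o, Tier target) {
    o.tier = target;
    std::printf("moving %s\n", o.name.c_str());
}

void rebalance(std::vector<Object>& objs) {
    const unsigned kHot = 1000, kCold = 10;  // assumed per-epoch thresholds
    for (Object& o : objs) {
        if (o.accesses_last_epoch >= kHot) {
            // Hot private data goes to the node-local slice; hot shared data
            // stays on-package so every node can still reach it.
            const Tier target = o.node_private ? Tier::NodeLocal : Tier::PackageShared;
            if (o.tier != target) move_to_tier(o, target);
        } else if (o.accesses_last_epoch <= kCold && o.tier != Tier::OffPackageDram) {
            move_to_tier(o, Tier::OffPackageDram);   // demote cold data
        }
        o.accesses_last_epoch = 0;                   // start the next epoch
    }
}

int main() {
    std::vector<Object> objs = {
        {"thread_stack", true,  Tier::NodeLocal,     5000},  // stays local
        {"lock_table",   false, Tier::PackageShared,  800},  // neither hot nor cold
        {"old_log",      true,  Tier::PackageShared,    2},  // demoted to DRAM
    };
    rebalance(objs);
    return 0;
}
```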