Breaking through the "memory wall": advancing on three fronts
Preface
In recent years, the explosive growth of AI and high-performance computing has driven an exponential climb in demand for compute. From the emergence of ChatGPT to the visual impact of Sora, large-scale AI models have not only expanded exponentially in parameter count but also shown an astonishing growth curve in their appetite for computing power.
Behind this prosperity, however, an increasingly severe challenge is emerging: the "memory wall".
From large language models with hundreds of billions of parameters to intelligent terminals at the edge, applications of every kind are imposing unprecedented demands on the performance, power consumption, and area (PPA) of memory. The memory "bandwidth wall" has become the core bottleneck constraining the throughput and latency of AI computing. Traditional memory technologies are struggling to meet systems' energy-efficiency targets, and the resulting performance gap is keeping AI chips from reaching their full potential.
As a global leader in semiconductor manufacturing, TSMC recognizes this fundamental contradiction clearly. In its tutorial at the 2025 IEDM (International Electron Devices Meeting), TSMC pointed out that future competition among AI and high-performance computing chips will be not only a race in transistor density and frequency but also a comprehensive contest in the performance, energy efficiency, and integration innovation of the memory subsystem.
Based on TSMC's technology roadmap, this article focuses on SRAM, MRAM, and CIM, together with the full-stack technology system of 3D packaging and compute-memory integration, and analyzes the technological evolution, current challenges, and future integration trends of high-speed embedded memory for AI computing.
As AI computing power grows rapidly, the memory "bandwidth wall" has become the core pain point
The evolution of AI models can be read as a relentless squeeze on computing power and memory.
From the early AlexNet to today's GPT-4, Llama 2, and PaLM, model parameter counts have jumped from millions to trillions. This expansion in model scale has directly driven compute (FLOPs) in both training and inference to skyrocket. Data show that over the past 70 years, as the parameter scale of machine learning models has grown, training compute has increased by a factor of more than 10^18, and inference compute has likewise grown explosively.
According to the classic roofline model, however, the attainable performance of any computing system is jointly bounded by its peak compute and its memory bandwidth: attainable performance = min(peak compute, memory bandwidth × arithmetic intensity).
This explosive growth in computing demand therefore challenges not only processor performance but also pushes memory to the forefront of technological change: the bandwidth, latency, energy consumption, and density of memory have become the core factors determining the overall performance of AI/HPC systems.
Computing performance has been improving far faster than memory bandwidth, forming a "bandwidth wall" that restricts system performance. According to the statistics, over the past 20 years peak hardware floating-point performance (HW FLOPS) has increased by roughly 60,000 times, averaging 3.0x every two years; DRAM bandwidth has increased by only about 100 times, averaging 1.6x every two years; and interconnect bandwidth has grown by about 30 times, averaging just 1.4x every two years.
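As a quick sanity check on these figures (a back-of-the-envelope calculation only, compounding the quoted per-two-year rates over a 20-year window):

```python
# Compound the per-two-year multipliers quoted above over 20 years (10 periods)
# and compare against the cumulative factors cited in the text.
for name, rate_per_2yr in [("HW FLOPS", 3.0),
                           ("DRAM bandwidth", 1.6),
                           ("Interconnect bandwidth", 1.4)]:
    total = rate_per_2yr ** 10
    print(f"{name}: {rate_per_2yr}x per 2 years -> ~{total:,.0f}x over 20 years")
# Prints roughly 59,049x / 110x / 29x, consistent with the ~60,000x / 100x / 30x above.
```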
This lopsided growth has made memory bandwidth the main bottleneck on computing throughput in scenarios such as AI inference, leaving large amounts of compute idle while waiting for data. Take NVIDIA's H100 GPU as an example: its peak BF16 performance reaches 989 TFLOPS, but its peak memory bandwidth is only 3.35 TB/s. When a workload's arithmetic intensity is too low, system performance is memory-limited and the GPU's enormous computing potential cannot be fully realized.
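To make the roofline argument concrete, below is a minimal sketch (not TSMC's or NVIDIA's tooling; the peak-compute and bandwidth numbers are the H100 figures quoted above, and the sample arithmetic intensities are purely illustrative):

```python
def attainable_tflops(peak_tflops: float, bandwidth_tb_s: float, intensity_flop_per_byte: float) -> float:
    """Roofline model: attainable throughput = min(peak compute, bandwidth * arithmetic intensity)."""
    return min(peak_tflops, bandwidth_tb_s * intensity_flop_per_byte)

PEAK_TFLOPS = 989.0   # H100 BF16 peak, as quoted above
BW_TB_S = 3.35        # H100 peak memory bandwidth, as quoted above

# Ridge point: the arithmetic intensity at which the workload stops being memory-bound.
print(f"ridge point ~ {PEAK_TFLOPS / BW_TB_S:.0f} FLOPs per byte")   # roughly 295

# Hypothetical workloads at different arithmetic intensities:
for ai in (10, 100, 300, 1000):
    print(f"intensity {ai:>4} FLOPs/byte -> attainable ~ {attainable_tflops(PEAK_TFLOPS, BW_TB_S, ai):.0f} TFLOPS")
```

Any workload below roughly 295 FLOPs per byte stays pinned to the bandwidth roof, no matter how much compute the chip offers.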
Facing the stringent requirements of AI and HPC, memory technology must deliver three core attributes at once: large capacity, high bandwidth, and low data-transfer energy. Large capacity covers the storage needs of model parameters and training data; high bandwidth resolves the data-throughput bottleneck; and low energy consumption is the key to green computing: high power consumption not only raises hardware costs (larger batteries, more elaborate cooling) but also limits deployment in scenarios such as edge devices.
Against this backdrop, the traditional compute-centric architecture is accelerating its shift toward a memory-centric one, and high-density, low-energy embedded memory has become the key direction for technological breakthroughs. TSMC sees the evolution path of future memory architectures revolving around "memory-compute collaboration": from traditional on-chip caches, to on-chip caches plus large-capacity in-package memory, to high-bandwidth, low-energy in-package memory, and finally to in-memory and near-memory computing, breaking through performance and energy-efficiency bottlenecks through the deep integration of memory and compute.
To balance the competing requirements of speed, bandwidth, capacity, and power, modern computing systems generally adopt a hierarchical memory architecture. From registers down to storage devices, each level presents a clear performance-cost trade-off: registers and SRAM caches handle high-frequency data access with low latency (on the order of 1 ns for registers and 10 ns for SRAM caches) and high bandwidth; HBM and DRAM main memory balance capacity against performance; and storage devices such as SSDs serve massive data sets with large capacity at low cost per bit.
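The trade-off across these levels can be summarized with the standard average memory access time (AMAT) model; in the sketch below the hit rates and DRAM latency are illustrative assumptions, with only the ~1 ns and ~10 ns cache figures taken from the hierarchy described above:

```python
# AMAT for a simplified two-level cache in front of DRAM:
# AMAT = t_fast + miss_fast * (t_llc + miss_llc * t_dram)
# Latencies in nanoseconds; miss rates are hypothetical.
t_fast, t_llc, t_dram = 1.0, 10.0, 100.0
miss_fast, miss_llc = 0.10, 0.30

amat = t_fast + miss_fast * (t_llc + miss_llc * t_dram)
print(f"AMAT ~ {amat:.1f} ns")   # about 5.0 ns with these assumptions

# A larger last-level cache lowers miss_llc and pulls AMAT toward SRAM latency,
# which is why growing on-chip cache capacity translates so directly into performance.
```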
TSMC believes that the evolution of memory technology in the era of AI and HPC is not a single-point breakthrough of any one technology but a comprehensive, collaborative optimization across materials, processes, architectures, and packaging.
Facing these industry challenges, TSMC continues to optimize embedded memory at each level of this hierarchy: SRAM, as the core of the cache layer, keeps improving density and energy efficiency through process and design innovation; MRAM, with its non-volatility and high density, fills the technology gap for embedded non-volatile memory (eNVM); and Digital Computing-in-Memory (DCiM) breaks the physical boundary between storage and computing, optimizing energy efficiency at the architectural level. Meanwhile, 3D packaging and chiplet integration further shorten the physical distance between memory and compute units, providing a system-level path to breaking through the "bandwidth wall".
SRAM: The "performance cornerstone" of computing scenarios
Static Random Access Memory (SRAM), the mainstay of high-speed embedded memory, has become the technology of choice for key hierarchy levels such as registers and caches thanks to its low latency, high bandwidth, low power, and high reliability, along with its compatibility with advanced CMOS logic processes. From FinFET to the Nanosheet architecture, SRAM has kept improving performance through successive process generations.
In application scenarios, SRAM is widely deployed in various high-performance chips such as data center CPUs, AI accelerators, client CPUs, gaming GPUs, and mobile SoCs. In terms of process nodes, SRAM has covered all nodes from N28 to N2. With the popularization of advanced processes (N3/N2), its usage in high-performance computing chips continues to grow, becoming the core support for improving chip performance.
Area scaling of SRAM cells is key to optimizing chip performance. However, as process nodes advance to 7nm, 5nm, 3nm, and now 2nm, SRAM cell scaling has slowed and faces many technological challenges. TSMC has sustained SRAM scaling through a Design-Technology Co-Optimization (DTCO) strategy combined with a series of innovations.
Looking at the technology's evolution, SRAM area scaling has relied on process and design breakthroughs at key nodes: strained-silicon technology was introduced at the 90nm node; the high-k metal gate (HKMG) process was adopted at 45nm; the FinFET architecture, flying bit lines (FLY BL), and double word-line techniques were introduced at the 28nm node; EUV lithography and metal-coupling techniques were applied at 7nm; and further scaling was achieved with the Nanosheet architecture at the 2nm node.
This density gain lets chips integrate larger caches in a limited area, which directly drives computing performance: instructions per cycle (IPC) rises significantly with L3 cache capacity, and the CPU gain is especially pronounced at a 32-fold increase in cache capacity. In energy efficiency and response time, SRAM caches far outstrip DRAM main memory and SSD storage.
However, as process nodes advance to 7nm, 5nm, 3nm, and now 2nm, SRAM faces increasingly severe challenges. First, area scaling is slowing: the shrink achieved per node by the SRAM cell keeps narrowing, making it ever harder to fit larger caches into a limited chip area. Second, optimizing the minimum operating voltage (VMIN) poses a dilemma: read and write stability at low VMIN is hard to guarantee, which directly affects chip energy efficiency. Third, interconnect losses are growing: once Cu line widths drop below 20nm, resistivity rises rapidly, significantly increasing the resistance and capacitance of word lines and bit lines and limiting SRAM speed gains.
Beyond the process-level evolution and innovation described above, TSMC has also addressed the area limits of on-die SRAM caches at the design level with 3D-stacked V-Cache technology, which uses vertical stacking to improve the capacity, latency, and bandwidth of the last-level cache (LLC).
AMD's Ryzen™ 7 5800X3D processor uses this technology, integrating 8 compute cores, 512KB of L1 cache, 4MB of L2 cache, and a shared L3 cache of up to 96MB. With a 32-byte-per-cycle bidirectional bus, it delivers a step-change in cache performance and a marked improvement in gaming performance, demonstrating how 3D-stacked SRAM can enable computing performance.
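As a rough illustration of what a 32-byte-per-cycle bus implies for L3 bandwidth (the clock frequency below is an assumed value for illustration only; the product's actual cache clocking may differ):

```python
bytes_per_cycle = 32        # per direction, as quoted above
clock_ghz = 4.0             # assumed cache clock, for illustration

per_direction_gb_s = bytes_per_cycle * clock_ghz       # GB/s each way
print(f"~{per_direction_gb_s:.0f} GB/s per direction, "
      f"~{2 * per_direction_gb_s:.0f} GB/s aggregate on the bidirectional bus")
```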
In addition, TSMC has developed write-assist circuits, read-assist circuits, and dual-rail SRAM, lowering the VMIN of N3 SRAM by more than 300mV; staggered triple-metal-layer word lines, flying bit lines, and related techniques have reduced interconnect losses and improved SRAM speed and density.
Looking ahead, SRAM development will focus on three major directions: first, continued process scaling, with deeper integration of the Nanosheet architecture and the DTCO strategy at N2 and beyond to further improve density and energy efficiency; second, combination with 3D packaging, using vertical stacking to achieve step-change increases in cache capacity that match the ultra-high bandwidth demands of AI accelerators; third, collaboration with in-memory computing architectures, serving as the core storage element of DCiM and providing high-speed data access during computation.
In-memory computing: DCiM takes center stage
If optimizing SRAM is a meticulous refinement of the traditional architecture, then Computing-in-Memory (CIM) is a more disruptive architectural revolution, and its core idea directly targets the root cause of the "memory wall": reducing unnecessary data movement.
In a typical AI accelerator, more than 90% of the energy may be spent moving data between memory and compute units rather than on the computation itself; data movement has therefore become the core factor limiting accelerator energy efficiency.
The CIM architecture breaks with the von Neumann separation of storage and computing: simple compute functions are embedded directly into the memory array, tightly coupling compute units with storage cells so that data are processed in place or nearby. This saves a great deal of energy and latency, making CIM a key path to solving the problem.
Unlike a traditional deep-learning accelerator (DLA) architecture, which separates storage from compute and relies on data movement, the CIM architecture performs computation inside memory, significantly raising the data-reuse rate and greatly improving energy efficiency.
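A simple energy model shows why data reuse in or near the memory array matters so much; the per-operation energies below are order-of-magnitude figures commonly cited for older logic nodes (e.g., Horowitz, ISSCC 2014) and are used here purely for illustration:

```python
# Illustrative energy per 32-bit operation, in picojoules (order of magnitude only):
E_MAC = 4.0      # multiply-accumulate in logic
E_SRAM = 5.0     # read from a small on-chip SRAM
E_DRAM = 640.0   # read from off-chip DRAM

def energy_per_mac(reuse: int, from_dram: bool) -> float:
    """Energy per MAC when each operand fetched from memory is reused `reuse` times."""
    e_access = E_DRAM if from_dram else E_SRAM
    return E_MAC + e_access / reuse

for reuse in (1, 16, 256):
    print(f"reuse = {reuse:>3}: DRAM-fed ~{energy_per_mac(reuse, True):6.1f} pJ/MAC, "
          f"in/near-memory ~{energy_per_mac(reuse, False):4.1f} pJ/MAC")
# With no reuse, over 99% of the DRAM-fed energy goes to data movement, echoing the
# ">90% of accelerator energy" observation above; keeping data inside or next to the
# array collapses that overhead.
```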
Regarding the two paths of Analog Computing-in-Memory (ACiM) and Digital Computing-in-Memory (DCiM), TSMC believes that DCiM has more development potential than ACiM.
Compared with ACiM, DCiM has clear advantages in technology scaling, precision control, and scenario fit, owing to its absence of precision loss, its flexibility, and its process compatibility: ACiM suffers from analog signal variation and limited dynamic range, whereas DCiM is compatible with advanced processes, keeps improving as nodes advance, and supports multi-precision computing. It has become the core architectural direction for AI computing, is especially well suited to edge inference, and offers a scalable way to relieve the energy-efficiency bottlenecks of both data centers and terminal devices.
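As a highly simplified illustration of the digital-CIM idea (a toy model, not TSMC's macro design), the sketch below stores each bit-plane of the weights as if it were a column of an SRAM array, reduces bitwise partial products with a digital adder tree, and accumulates with shifts; because everything stays digital, there is no analog precision loss and the bit-width is simply a loop bound, which is where the multi-precision flexibility comes from:

```python
import numpy as np

def dcim_dot(weights: np.ndarray, activations: np.ndarray, w_bits: int = 8) -> int:
    """Toy bit-serial digital compute-in-memory dot product.

    Unsigned weights (< 2**w_bits) are treated as w_bits bit-planes, as if each
    bit-plane lived in one column of an SRAM macro; a digital adder tree reduces
    the bitwise partial products, and a shift-and-add accumulates across planes.
    """
    acc = 0
    for b in range(w_bits):                              # one pass per weight bit-plane
        bit_plane = (weights >> b) & 1                   # bits stored in "column" b
        partial = int(np.sum(bit_plane * activations))   # adder-tree reduction
        acc += partial << b                              # shift-and-accumulate
    return acc

# Quick check against an ordinary dot product:
rng = np.random.default_rng(0)
w = rng.integers(0, 256, size=64)   # 8-bit weights
x = rng.integers(0, 16, size=64)    # 4-bit activations
assert dcim_dot(w, x, w_bits=8) == int(np.dot(w, x))
print("bit-serial DCiM result:", dcim_dot(w, x, w_bits=8))
```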
The core advantages of DCiM are reflected in three aspects:
- High flexibility: It can configure the computing bit width for different AI workloads to achieve the best balance between precision and energy efficiency;
- High computing density: Benefiting from advanced logic processes, the energy efficiency (TOPS/W) and computing density (TOPS/mm²) of DCiM have significantly improved with the progress of the manufacturing process.