Crossing the "Memory Wall": Wafer-Scale Revolution and Computing Power Path in the AI Inference Era

In the era of inference, why does Cerebras dare to challenge NVIDIA?

In 2026, global AI development reached a landmark turning point: for the first time, hyperscale cloud providers spent more capital on inference than on training. The industry's anchor shifted from "training large models" to "using large models", and the structure of computing power demand changed fundamentally.

In the training era, the core constraint on computing power was "double-precision floating-point and cluster scale"; in the inference era, it has become "memory bandwidth and communication latency".

The bottleneck of large model inference is no longer just computing but data transfer: model weights, intermediate activations, and the KV cache must be moved constantly between off-chip DRAM (such as HBM) and the GPU's compute units. The larger the model, the higher the energy cost and latency of this data movement, until it far exceeds the energy cost of the computation itself. This is the memory wall.
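The memory wall can be illustrated with a rough upper bound on decode speed. The sketch below assumes that generating each token requires reading every model weight once and that nothing else (KV cache, batching, compute) matters; the 70B model size and the 8 TB/s bandwidth are hypothetical inputs chosen for illustration, not figures from the experiment described next.

```python
def decode_tokens_per_second(params_billion, bytes_per_param, bandwidth_tb_s):
    """Rough upper bound on single-stream decode rate when each generated
    token requires reading all model weights once from memory."""
    weight_bytes = params_billion * 1e9 * bytes_per_param
    bandwidth_bytes_per_s = bandwidth_tb_s * 1e12
    return bandwidth_bytes_per_s / weight_bytes


# Hypothetical example: a 70B-parameter model in FP16 (2 bytes per parameter)
# on a device with 8 TB/s of memory bandwidth (assumed figure).
rate = decode_tokens_per_second(70, 2, 8)
print(f"{rate:.1f} tokens/s upper bound")
```

Under these assumptions the bound depends only on bandwidth and model size, which is why doubling compute alone does not help a bandwidth-bound workload.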

NVIDIA GPUs have built a solid fortress with CUDA and NVLink, but they still cannot avoid GPU idling caused by bandwidth bottlenecks.

Zhipu, a Chinese large-model company, ran a very simple experiment: in a 512-GPU inference cluster, with the GPUs, model, and code unchanged, raising the network bandwidth limit from 200 GB/s to 400 GB/s alone increased inference throughput by 10% and cut the latency of the first output token by 19%. The principle is simple: widen the road and the cars run faster.

However, non-GPU architectures represented by Cerebras seem to be tearing a hole in the memory wall.

Size comparison between Cerebras WSE-3 chip and NVIDIA B200 GPU

The essence of Cerebras: a near-memory computing machine based on SRAM

Cerebras Systems was founded in Silicon Valley by Andrew Feldman and others. The early founding team all came from a low-power micro-server company called SeaMicro, which was later acquired by AMD. Subsequently:

In 2015, the founding team established the "wafer-scale computing" route;

In 2016, it completed registration and Series A financing and entered the stealth R&D stage;

In 2019, it released its first product, the WSE-1 chip and the CS-1 system, based on TSMC's 16nm process;

In 2021, it released its second-generation product, based on TSMC's 7nm process;

In 2024, it released its third-generation product (WSE-3 / CS-3), based on TSMC's 5nm process. Both the chip and the system were manufactured in the United States, making it a fully American-made chip system.

CS-3 system configuration, including 1 WSE-3 chip

The architectural philosophy of Cerebras' Wafer-Scale Engine (WSE) is simple, straightforward, and hits the pain point directly: use the extreme expansion of physical space to achieve the extreme compression of data transfer latency.

Ordinary chips, NVIDIA GPUs included, are made by cutting a 300mm-diameter wafer into hundreds of small dies. Cerebras does the opposite: it leaves the wafer uncut and turns almost all of it into a single super-large chip, the Wafer-Scale Engine (WSE).

The latest WSE-3 has 4 trillion transistors and 900,000 AI cores, each with 48KB of local SRAM, for a total of 44GB of on-chip SRAM. It provides 21 PB/s of on-chip memory bandwidth and 214 Pb/s of fabric bandwidth, thousands of times the bandwidth of traditional HBM.

The memory bandwidth of the Cerebras WSE-3 is 2625 times that of NVIDIA's B200 package, which breaks the memory bandwidth bottleneck in large model inference.
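The 2625x figure and the 44GB total can be checked with simple arithmetic. In the sketch below, the WSE-3 numbers come from the article; the B200 bandwidth of 8 TB/s is an assumption, since the article does not state the value it used.

```python
# Figures quoted in the article for WSE-3.
WSE3_SRAM_BANDWIDTH_PB_S = 21
WSE3_CORES = 900_000
SRAM_PER_CORE_KB = 48

# Assumption: B200 HBM bandwidth in TB/s (not stated in the article).
B200_HBM_BANDWIDTH_TB_S = 8

# 1 PB/s = 1000 TB/s, so convert before dividing.
bandwidth_ratio = WSE3_SRAM_BANDWIDTH_PB_S * 1000 / B200_HBM_BANDWIDTH_TB_S

# Total SRAM from per-core SRAM, using decimal units (1 GB = 1e6 KB).
total_sram_gb = WSE3_CORES * SRAM_PER_CORE_KB / 1e6

print(f"bandwidth ratio: {bandwidth_ratio:.0f}x")
print(f"total on-chip SRAM: {total_sram_gb:.1f} GB")
```

The per-core product comes out slightly below 44GB, which suggests the 48KB per-core figure is rounded.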

In Cerebras' architecture, weight storage is separated from the computing units: model weights do not reside in on-chip SRAM but in the off-chip memory expansion module MemoryX, and are streamed to the wafer layer by layer.

All model weights are held in MemoryX's DRAM and flash memory. The weights needed for each layer's computation are transferred to the CS-3 system on demand at full bandwidth. They are not stored in the CS-3 system, and not even a temporary cache is kept. The CS-3 relies on its underlying dataflow mechanism to complete the computation.
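The layer-by-layer streaming described above can be sketched as a loop that pulls one layer's weights from an external store, applies them, and keeps only the activations. This is a toy illustration in plain Python, not Cerebras' software: `stream_weights`, `matvec`, and `forward` are hypothetical names, and the "external store" is just a list standing in for MemoryX.

```python
from typing import Iterator, List

Matrix = List[List[float]]


def stream_weights(external_store: List[Matrix]) -> Iterator[Matrix]:
    """Yield one layer's weights at a time, standing in for the
    MemoryX-to-CS-3 transfer (hypothetical stand-in)."""
    for layer_weights in external_store:
        yield layer_weights


def matvec(w: Matrix, x: List[float]) -> List[float]:
    """Multiply matrix w by vector x."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in w]


def forward(external_store: List[Matrix], activations: List[float]) -> List[float]:
    """Run all layers, holding only the current layer's weights."""
    for w in stream_weights(external_store):
        # Activations stay local; w is rebound on the next iteration,
        # so no earlier layer's weights are kept around.
        activations = matvec(w, activations)
    return activations


# Two tiny linear layers: a 2x2 matrix followed by a 1x2 matrix.
store = [[[1.0, 0.0], [0.0, 2.0]], [[1.0, 1.0]]]
out = forward(store, [3.0, 4.0])
print(out)
```

The point of the design is that on-chip memory only has to hold activations and one layer of weights at a time, so model size is bounded by the external store rather than by SRAM capacity.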

With its wafer-scale architecture, Cerebras shows a decisive advantage in LLM inference that is limited by memory bandwidth. When generating tokens one by one, the weights are streamed layer by layer from off-chip MemoryX to the CS-3. Depending on the model, its token rate is 1.5 to 5 times that of NVIDIA's B200.

Comparison of token rates between NVIDIA DGX B200 GPU and Cerebras CS-3 chip when running different large models

Its core advantage is that the CS-3's 44GB of on-chip SRAM provides 21 PB/s of bandwidth (2625 times that of the B200) along with 214 Pb/s of fabric bandwidth, which frees weight streaming from the limits of the HBM interface. It therefore performs particularly well on TTFT (Time To First Token, the time from when a request is sent to when the model returns the first token), long context, and agent workloads.

Although the weights sit externally in MemoryX and are loaded layer by layer without on-chip caching, the CS-3 relies on its dataflow mechanism to complete lossless, full-precision FP16 operations in SRAM. With linear performance scaling, it also delivers very high total throughput in multi-user concurrent inference.

In addition to bandwidth, there is also an advantage in energy consumption. Liu Sheng, the chairman of Zhongji Xuchuang, recently mentioned in a speech that customers want optical modules to reach 1 pJ/bit, while the current level is 10 pJ/bit. Inside the Cerebras chip, interconnect energy is only 0.15 pJ/bit, while GPU interconnect currently consumes 10 pJ/bit.
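To see what the per-bit figures mean in watts, multiply energy per bit by the bit rate. The sketch below uses the two pJ/bit values quoted above; the 100 Tb/s traffic level is a hypothetical number chosen only to make the comparison concrete.

```python
def interconnect_power_watts(bandwidth_tbit_s, energy_pj_per_bit):
    """Power drawn by an interconnect moving data at the given rate.
    Tb/s times pJ/bit gives watts, because 1e12 * 1e-12 = 1."""
    return bandwidth_tbit_s * energy_pj_per_bit


TRAFFIC_TBIT_S = 100  # hypothetical sustained traffic

gpu_watts = interconnect_power_watts(TRAFFIC_TBIT_S, 10)    # GPU interconnect, 10 pJ/bit
wse_watts = interconnect_power_watts(TRAFFIC_TBIT_S, 0.15)  # on-wafer, 0.15 pJ/bit

print(f"GPU interconnect: {gpu_watts:.0f} W")
print(f"On-wafer interconnect: {wse_watts:.0f} W")
print(f"ratio: {gpu_watts / wse_watts:.1f}x")
```

Whatever traffic level is assumed, the ratio between the two stays the same, since it depends only on the pJ/bit figures.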

Comparison of bandwidth and power consumption between Cerebras interconnection and GPU interconnection architecture

If Cerebras' wafer-scale architecture becomes mainstream for AI inference, and even training, it could significantly suppress and structurally change shipments of traditional optical modules and CPO (Co-Packaged Optics). The core logic is that the strong demand for optical modules and CPO exists to relieve the bandwidth bottlenecks of inter-chip and inter-node interconnection in GPU clusters, whereas Cerebras' architecture addresses the problem by eliminating distributed interconnection.

Counterintuitive: the real and supposed flaws of wafer-scale large chips

Chip design always comes down to trade-offs. Cerebras' pursuit of extreme on-chip SRAM bandwidth has also brought some problems.

Low yield?

On the contrary, a single AI core is only 0.05 square millimeters (1% of the size of a single computing core of the H100), so the yield is actually higher. On-chip routing can shut down and bypass defective cores, which makes defect tolerance 100 times higher than that of traditional multi-core processors. In fact, the whole chip has 1 million AI cores, but to allow for yield, only 900,000 are publicly claimed.
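The yield argument can be illustrated with the standard Poisson yield model, in which the chance that a block of silicon is defect-free falls exponentially with its area. This is a sketch under stated assumptions, not Cerebras' own yield analysis: the defect density is a hypothetical value, and the larger core area is derived from the article's "1%" comparison.

```python
import math


def core_yield(defect_density_per_mm2, core_area_mm2):
    """Poisson yield model: probability that a core of the given area
    contains zero defects."""
    return math.exp(-defect_density_per_mm2 * core_area_mm2)


DEFECT_DENSITY = 0.001            # hypothetical: defects per square millimeter
SMALL_CORE_MM2 = 0.05             # WSE core area quoted in the article
LARGE_CORE_MM2 = SMALL_CORE_MM2 * 100  # implied by the article's "1%" comparison

small_yield = core_yield(DEFECT_DENSITY, SMALL_CORE_MM2)
large_yield = core_yield(DEFECT_DENSITY, LARGE_CORE_MM2)

# If a defective core is disabled and routed around, each defect costs
# one core's worth of silicon, so the loss per defect scales with core area.
area_lost_ratio = LARGE_CORE_MM2 / SMALL_CORE_MM2

print(f"small-core yield: {small_yield:.5f}")
print(f"large-core yield: {large_yield:.5f}")
print(f"silicon lost per defect, large vs small: {area_lost_ratio:.0f}x")
```

Under this model the 100x difference in silicon lost per defect follows directly from the 100x difference in core area, which is consistent with the defect-tolerance claim above.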

Only good at inference, not at training?

In the years since Cerebras was founded, training was the mainstream topic, so the company has always done a lot of work around training. It's just that after the demand for inference became popular, people found that its advantages in inference were more obvious.

In fact, the simplified distributed computing also brings a series of advantages such as reduced code complexity and reduced communication overhead.

Training a model with 175 billion parameters on 4000 GPUs usually requires about 20,000 lines of distributed training code.

Cerebras achieved equivalent training with 565 lines of code: the entire model fits on the wafer, so there is no need to deal with the complexity of splitting the model across many devices.

The scaling of SRAM is dead, and the core advantage is facing a physical ceiling.

The third-generation product is based on TSMC's 5nm process, and its SRAM capacity only increased by 10% compared with the second-generation product based on TSMC's 7nm process. After 5nm, the area of SRAM cells hardly decreases with the progress of the process.

This means that Cerebras can no longer significantly increase its core advantage (SRAM capacity) by upgrading TSMC's process (such as from 5nm to 3nm) as it did in the past.

Limited by wafer size, heat dissipation capacity, and manufacturing cost, storage resources such as on-chip SRAM are difficult to expand linearly in sync with computing cores, and the resource ratio has reached a bottleneck. This almost blocks its path of evolution.

Technical specifications of Cerebras' three generations of products

The triple purgatory of heat dissipation, process, and ecosystem.

The whole wafer generates heat intensively, with a relatively high heat flux density, so it must rely on a customized machine room and a dedicated liquid cooling system. In addition, the ecosystem's limited generality means that customers must adapt to its custom software stack, and its compatibility with existing general-purpose programming frameworks such as CUDA is weak, resulting in high software porting and adaptation costs.

Low off-chip bandwidth, becoming an "isolated island" for expansion.

Due to the limitations of the wafer-scale physical design, the number of I/O pins that can be led out at the edge of the WSE is extremely limited, resulting in an I/O bandwidth of only 150 GB/s. Next to NVIDIA's NVLink, with a bidirectional bandwidth of up to 1.8 TB/s, it moves at a snail's pace. This makes it extremely difficult for the WSE to scale outward at high speed. Although Cerebras' SwarmX interconnect performs well in multi-system configurations, for super-large models that require high-speed interconnection of multiple chips, the extremely low off-chip bandwidth is a structural physical constraint.
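The gap between the two link speeds is easy to quantify as transfer time. The sketch below uses the 150 GB/s and 1.8 TB/s figures quoted above; the 140 GB payload (roughly a 70B-parameter model in FP16) is a hypothetical example, and the calculation ignores protocol overhead and the difference between bidirectional and unidirectional bandwidth.

```python
def transfer_seconds(payload_gb, link_gb_per_s):
    """Idealized time to move a payload over a link at full bandwidth."""
    return payload_gb / link_gb_per_s


PAYLOAD_GB = 140           # hypothetical: about 70B parameters at 2 bytes each
WSE_IO_GB_S = 150          # WSE off-chip I/O bandwidth from the article
NVLINK_GB_S = 1800         # NVLink bidirectional bandwidth from the article

wse_seconds = transfer_seconds(PAYLOAD_GB, WSE_IO_GB_S)
nvlink_seconds = transfer_seconds(PAYLOAD_GB, NVLINK_GB_S)

print(f"WSE off-chip I/O: {wse_seconds:.3f} s")
print(f"NVLink: {nvlink_seconds:.3f} s")
print(f"slowdown: {wse_seconds / nvlink_seconds:.0f}x")
```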

Competing routes: with large manufacturers developing their own chips, how long is Cerebras' window?

Large manufacturers have more than one way to solve the problem of "inference requires higher bandwidth + lower latency". They are closing in on the technological lead of start-ups along three parallel paths.

① Self-developed ASIC chips

Google's TPU v8 has been split into two versions: training-specific and inference-specific; AWS Trainium 4 is on the way; Microsoft Maia is already in use within Azure, built on TSMC's 3nm process, with native FP8/FP4 tensor cores, a redesigned memory system, equipped with 216GB HBM3e and 272MB on-chip SRAM; even Anthropic has started to evaluate self-developed inference chips.

This path is very likely to materialize. It would directly compress the TAM (Total Addressable Market) for "third-party inference procurement" by 10% to 25% in 2028.

② Wafer-scale processes going mainstream through the standard packaging route

This is a direct and crushing blow to Cerebras.

TSMC's SoW (System-on-Wafer) has been widely opened to customers, and the CoWoS 9.5x interposer will also be launched in 2027.

What these two products do—stitching multiple dies at the wafer level—essentially makes Cerebras' physical process more general and accessible to the public.

NVIDIA's Vera Rubin will enter this ecosystem in the second half of 2026.

Although Cerebras' cross-reticle stitching is exclusive, the exclusivity window is at most 2 to 3 years. After 2027 to 2028, its process barrier will be diluted by TSMC's advanced packaging.

③ Breakthrough in optical interconnection/optical computing

The interconnect and memory walls of electronic chips have reached their limits. The high bandwidth, low latency, and zero crosstalk of photons make optics the ultimate solution.

The optical route represented by Lumentum is on the rise. The biggest advantage of wafer-scale is on-chip computing, but models will inevitably become larger, and high-speed interconnection above the wafer scale is a rigid demand.

As CPO (Co-Packaged Optics) and optical interconnects mature, we are very likely to see optical I/O introduced directly onto the WSE wafer, breaking the constraints of electrical interconnection. NVIDIA may also acquire companies with specific architectural advantages, such as LPU developer Groq, and combine them with optical interconnection to develop a wafer-scale system compatible with existing NVIDIA super-node software.

A mad dash on the cliff: Cerebras' business and delivery

Huge orders are currently forcing Cerebras into a mad dash along the cliff edge.

Transactions with top customers such as OpenAI have forced Cerebras to transform from a chip company into a new-type cloud service provider. It no longer just sells hardware but needs to lock in and build a large amount of data center power and facilities in the short term.

According to the contract requirements, Cerebras needs to deliver 250MW of data center capacity every
