
Just now, Liang Wenfeng publicly released a "memory" module under his name, and the picture of DeepSeek V4 has become clearer.

机器之心 (Machine Heart) · 2026-01-13 08:38
More details about DeepSeek v4 have emerged!

Just over a dozen hours ago, DeepSeek published a new paper titled "Conditional Memory via Scalable Lookup: A New Axis of Sparsity for Large Language Models", which was completed in collaboration with Peking University. Liang Wenfeng is also among the authors.

Paper link: https://github.com/deepseek-ai/Engram/blob/main/Engram_paper.pdf

Let's briefly summarize the problem this new research aims to solve: Currently, large language models mainly achieve sparsity through Mixture of Experts (MoE), which is known as "conditional computation". However, existing Transformer architectures lack a native knowledge lookup mechanism and can only inefficiently simulate retrieval behaviors through the computation process.

In response to this situation, DeepSeek proposed conditional memory, which complements the conditional computation of MoE, and implemented it by introducing a new module called Engram.

Currently, the implementation related to the "Engram" module has been uploaded to GitHub.

Project link: https://github.com/deepseek-ai/Engram

This made netizens exclaim: "DeepSeek is back!"

In addition, taken together with the "mHC: Manifold-Constrained Hyper-Connections" research announced around New Year's Day, the outline of DeepSeek v4 is becoming increasingly clear, and we are just waiting for its release!

Besides conditional computation (MoE), LLMs also need an independent conditional memory: Engram

MoE models expand model capacity through conditional computation. However, the existing Transformer architecture lacks a native knowledge lookup primitive and can only inefficiently simulate retrieval behaviors through the computation process.

To solve this problem, DeepSeek proposed conditional memory, a sparsity dimension that complements conditional computation, and implemented it through the Engram module. Engram modernizes the classic N-gram embedding, enabling knowledge lookup with O(1) time complexity.

By formally posing the sparsity allocation problem, DeepSeek also discovered a U-shaped expansion rule to characterize the optimal trade-off between neural computation (MoE) and static memory (Engram).

Guided by this rule, DeepSeek scaled Engram to 27 billion parameters. Under strictly equal parameter count and equal FLOPs, it significantly outperforms the pure MoE baseline model overall.

Notably, although the memory module is mainly used to improve knowledge retrieval ability (e.g., +3.4 improvement in MMLU, +4.0 in CMMLU), DeepSeek observed more significant gains in general reasoning ability (e.g., +5.0 in BBH, +3.7 in ARC-Challenge) and code and mathematical reasoning tasks (e.g., +3.0 in HumanEval, +2.4 in MATH).

Further analysis shows that Engram can relieve the burden of reconstructing static knowledge from the shallow layers of the model, effectively increasing the effective depth of the network for complex reasoning. Additionally, by delegating local dependency modeling to the lookup table mechanism, Engram frees up the capacity of the attention mechanism, allowing it to focus more on global context modeling, thus significantly improving long-context retrieval ability (e.g., the accuracy of Multi-Query NIAH increased from 84.2 to 97.0).

Finally, Engram also demonstrates infrastructure-aware efficiency at the system level: its deterministic addressing method supports prefetching from the host memory at runtime with almost no additional performance overhead.

DeepSeek believes that conditional memory will become an indispensable core modeling primitive in the next generation of sparse large models.

The Engram architecture is as follows. Its design goal is to enhance the Transformer backbone network by structurally separating static pattern storage and dynamic computation processes. This module performs two functional stages sequentially for each position in the sequence: retrieval and fusion.

During operation, DeepSeek first extracts and compresses the suffix N-gram of the current position and retrieves the corresponding static embedding vectors in a deterministic manner through a hashing mechanism. Subsequently, these retrieved embeddings are dynamically adjusted under the modulation of the current hidden state and further refined through a lightweight convolution operation. Finally, Engram is integrated with the multi-branch architecture.
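To make the two stages concrete, here is a minimal PyTorch sketch of the retrieve-then-fuse flow. It illustrates the idea only and is not the released implementation: the dimensions, hash family, gate form, and depthwise convolution are all assumptions.

```python
import torch
import torch.nn as nn

class EngramSketch(nn.Module):
    """Illustrative sketch: hashed suffix n-gram retrieval + context-gated fusion."""
    def __init__(self, num_slots: int = 1_000_000, dim: int = 512,
                 ngram: int = 2, num_heads: int = 4):
        super().__init__()
        self.ngram = ngram
        self.table = nn.Embedding(num_slots, dim)        # static memory entries
        self.gate = nn.Linear(2 * dim, 1)                # context-aware scalar gate
        self.conv = nn.Conv1d(dim, dim, kernel_size=3, padding=1, groups=dim)  # lightweight refinement
        self.primes = [1_000_003, 10_000_019, 100_000_007, 998_244_353][:num_heads]

    def hash_ngrams(self, token_ids: torch.Tensor) -> torch.Tensor:
        # Suffix n-gram at each position -> K deterministic row indices (O(1) lookup).
        # Wrap-around at the sequence start is ignored in this sketch.
        grams = torch.stack([torch.roll(token_ids, i, dims=1) for i in range(self.ngram)], dim=-1)
        heads = []
        for p in self.primes:                            # one hash head per prime
            weights = torch.tensor([p ** k for k in range(self.ngram)], device=token_ids.device)
            heads.append((grams * weights).sum(-1) % self.table.num_embeddings)
        return torch.stack(heads, dim=-1)                # [batch, seq, K]

    def forward(self, token_ids: torch.Tensor, hidden: torch.Tensor) -> torch.Tensor:
        mem = self.table(self.hash_ngrams(token_ids)).mean(dim=-2)      # retrieval
        g = torch.sigmoid(self.gate(torch.cat([hidden, mem], dim=-1)))  # modulated by the hidden state
        mem = self.conv((g * mem).transpose(1, 2)).transpose(1, 2)      # lightweight convolution
        return hidden + mem                                             # fused back into the backbone
```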

Sparse retrieval based on hashed N-grams

The goal of this stage is to map the local context to static memory entries. This process mainly includes tokenizer compression and retrieving the corresponding embedding representations through a deterministic hashing mechanism.

Tokenizer compression: To maximize the semantic density of memory units, DeepSeek introduced a layer of vocabulary projection: a pre-designed mapping function that maps the original token IDs to canonical identifiers based on text-normalization equivalence relations (e.g., NFKC normalization and case unification). In practice, for a tokenizer with a 128k vocabulary, this reduces the effective vocabulary size by approximately 23% (see Appendix C of the paper for details).
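To illustrate what such a vocabulary projection could look like, here is a small sketch that builds the token-ID-to-canonical-ID map offline from a tokenizer vocabulary via NFKC normalization and case folding. Only the normalization steps come from the text above; the function and the toy vocabulary are hypothetical.

```python
import unicodedata

def build_canonical_map(vocab: dict[str, int]) -> dict[int, int]:
    """Map each original token ID to a canonical ID shared by every token whose
    surface form normalizes identically (NFKC + case folding). Illustrative sketch."""
    canon_ids: dict[str, int] = {}
    token_to_canon: dict[int, int] = {}
    for token, tok_id in sorted(vocab.items(), key=lambda kv: kv[1]):
        key = unicodedata.normalize("NFKC", token).casefold()
        canon_ids.setdefault(key, len(canon_ids))
        token_to_canon[tok_id] = canon_ids[key]
    return token_to_canon

# Toy example: "Apple", "apple", and full-width "ａｐｐｌｅ" collapse to one canonical ID,
# shrinking the effective vocabulary the memory table has to cover.
print(build_canonical_map({"Apple": 0, "apple": 1, "ａｐｐｌｅ": 2, "banana": 3}))
# -> {0: 0, 1: 0, 2: 0, 3: 1}
```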

Multi-head hashing: Parametrizing the entire space of possible N-gram combinations is infeasible in both computation and storage. Drawing on the work of Svenstrup et al. (2017), DeepSeek adopted a hash-based approximation. To reduce the impact of hash collisions, K independent hash heads are introduced for each N-gram order n.
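The indexing step can be pictured as follows: for each n-gram order n, K independent hash heads map the (compressed) suffix n-gram to K rows of a fixed-size table, so the combinatorial n-gram space is never materialized and collisions are diluted across heads. The hash family, salts, and orders below are arbitrary choices for illustration.

```python
def ngram_rows(canon_ids: list[int], pos: int, orders=(2, 3), num_heads=4,
               table_size=10_000_000) -> list[int]:
    """Return the table rows addressed at `pos`: one row per (order, head) pair.

    Deterministic and input-only: the rows depend solely on the token prefix,
    which is what later makes prefetching possible at inference time.
    """
    head_salts = [0x9E3779B1, 0x85EBCA77, 0xC2B2AE3D, 0x27D4EB2F][:num_heads]  # arbitrary
    rows = []
    for n in orders:
        gram = tuple(canon_ids[max(0, pos - n + 1): pos + 1])   # suffix n-gram
        for salt in head_salts:
            h = salt
            for t in gram:                                      # simple rolling hash
                h = (h * 1_000_003 + t) & 0xFFFFFFFFFFFF
            rows.append(h % table_size)
    return rows

print(ngram_rows([17, 42, 42, 7], pos=3))   # 2 orders x 4 heads = 8 row indices
```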

Context-aware gating

The embedding vectors retrieved from the conditional memory via hashed N-grams in the previous stage essentially provide static prior information that is independent of the specific context. Because of this static nature, they cannot adapt to the current context and, in practice, may be polluted by noise from hash collisions or polysemy.

To address this issue, DeepSeek introduced a context-aware gating mechanism after retrieval, inspired by the attention mechanism.
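A slightly more explicit, attention-style version of such a gate might look like the sketch below: the current hidden state produces a query, each retrieved embedding a key, and a per-slot sigmoid score decides how much of that slot to keep, which is what allows collided or irrelevant entries to be suppressed. Shapes and the scoring function are assumptions, not the paper's exact formulation.

```python
import math
import torch
import torch.nn as nn

class ContextGate(nn.Module):
    """Attention-inspired scalar gate over retrieved memory slots (illustrative)."""
    def __init__(self, dim: int, head_dim: int = 64):
        super().__init__()
        self.q = nn.Linear(dim, head_dim)   # query from the current hidden state
        self.k = nn.Linear(dim, head_dim)   # key from each retrieved embedding
        self.scale = 1.0 / math.sqrt(head_dim)

    def forward(self, hidden: torch.Tensor, mem: torch.Tensor) -> torch.Tensor:
        # hidden: [batch, seq, dim]; mem: [batch, seq, K, dim] (K retrieved slots)
        score = torch.einsum("bsd,bskd->bsk", self.q(hidden), self.k(mem)) * self.scale
        gate = torch.sigmoid(score).unsqueeze(-1)   # independent gate per slot
        return (gate * mem).sum(dim=2)              # noisy or collided slots get down-weighted
```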

System efficiency: Decoupling computation and storage

In models with a memory mechanism, scaling is often constrained by the limited capacity of high-bandwidth memory (HBM) on GPUs. The deterministic retrieval mechanism adopted by Engram, however, naturally supports decoupling parameter storage from computational resources. Unlike MoE, which relies on the runtime hidden state for dynamic routing, Engram's retrieval indices are completely determined by the input token sequence. This predictability enables specialized optimization strategies for the training and inference stages, as shown in Figure 2.

During the training stage, to accommodate large-scale embedding tables, DeepSeek adopted a standard model parallelism scheme, distributing the embedding table shards across multiple GPUs. During the forward propagation process, the activated embedding rows are collected through the All-to-All communication primitive; during the backward propagation stage, the corresponding gradients are distributed back to each shard, enabling the total available memory capacity to expand linearly with the number of accelerators.
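The sharding arithmetic can be simulated in a single process: rows are block-partitioned across ranks, activated indices are bucketed by their owner, and each owner returns its rows. In the real system the Python loop below corresponds to All-to-All collectives (indices in the forward pass, gradients in the backward pass); everything here is illustrative.

```python
import torch

def sharded_lookup(full_table: torch.Tensor, indices: torch.Tensor, world_size: int) -> torch.Tensor:
    """Single-process stand-in for a model-parallel embedding lookup.

    full_table: [num_rows, dim], conceptually split row-wise across `world_size` ranks.
    indices:    [num_lookups] global row ids activated by the current batch.
    Total capacity grows linearly with world_size, since each rank stores only its shard.
    """
    num_rows, dim = full_table.shape
    rows_per_shard = (num_rows + world_size - 1) // world_size
    owner = indices // rows_per_shard                      # which rank holds each row
    gathered = torch.empty(indices.shape[0], dim, dtype=full_table.dtype)
    for rank in range(world_size):                         # stands in for the collective
        mask = owner == rank
        local = indices[mask] - rank * rows_per_shard      # shard-local row offsets
        shard = full_table[rank * rows_per_shard:(rank + 1) * rows_per_shard]
        gathered[mask] = shard[local]
    return gathered

table = torch.randn(1000, 8)
idx = torch.tensor([3, 999, 512, 3])
assert torch.allclose(sharded_lookup(table, idx, world_size=4), table[idx])
```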

During the inference stage, this deterministic characteristic further supports a prefetch-and-overlap strategy. Since the memory indices to be accessed are known before the forward computation starts, the system can asynchronously prefetch the embedding vectors from the much larger host memory over PCIe. To hide the resulting communication latency, the Engram module is placed at a specific depth in the backbone network, using the computation of the preceding Transformer layers as a buffer so that GPU computation does not stall.
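Assuming a pinned host-memory table and standard PyTorch CUDA streams (so a GPU is required), the prefetch-and-overlap idea can be sketched roughly as follows; the helper names are hypothetical and this is not the released inference engine.

```python
import torch

# Assumed, deliberately small sizes for the sketch; the real table is far larger and lives off-GPU.
host_table = torch.randn(100_000, 512).pin_memory()    # full Engram table in host memory
staging = torch.empty(4096, 512).pin_memory()          # pinned buffer so the H2D copy can be async
copy_stream = torch.cuda.Stream()

def prefetch(row_indices: torch.Tensor) -> torch.Tensor:
    """Start an asynchronous host->GPU copy of the rows the upcoming Engram layer needs."""
    n = row_indices.numel()
    torch.index_select(host_table, 0, row_indices, out=staging[:n])   # CPU-side gather
    with torch.cuda.stream(copy_stream):
        return staging[:n].to("cuda", non_blocking=True)              # PCIe copy on the side stream

def forward_with_overlap(token_ids, backbone_prefix, engram_layer):
    idx = engram_layer.row_indices(token_ids)             # hypothetical helper; needs no GPU work
    mem_rows = prefetch(idx)                              # copy overlaps with ...
    h = backbone_prefix(token_ids)                        # ... the preceding Transformer layers
    torch.cuda.current_stream().wait_stream(copy_stream)  # make sure the rows have arrived
    return engram_layer(h, mem_rows)
```

The Transformer layers executed before the Engram module act as the computation buffer that hides the PCIe transfer.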

This also requires hardware–algorithm co-design: On the one hand, placing Engram deeper can lengthen the computation window for hiding communication latency; on the other hand, from the perspective of modeling effects, earlier intervention to offload the reconstruction of local patterns is more beneficial. Therefore, the optimal insertion position of Engram must satisfy both the modeling performance and system latency constraints.

In addition, N-grams in natural language naturally follow a Zipfian distribution: a small number of high-frequency patterns account for the majority of memory accesses. This statistical property motivates a multi-level cache hierarchy: frequently accessed embeddings are cached in faster storage media (e.g., GPU HBM or host DRAM), while the large number of low-frequency, long-tail patterns are stored in larger but slower media (e.g., NVMe SSD). This hierarchical design lets Engram scale to extremely large memory capacities while keeping the effective access latency largely unaffected.
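A frequency-tiered store of this kind can be sketched as follows: a small hot cache serves the Zipf head, and misses fall through to a slower tier. The tier names, capacity, and admission policy are illustrative assumptions.

```python
from collections import Counter

class TieredEngramStore:
    """Illustrative two-level store: hot rows in a fast tier (standing in for HBM/DRAM),
    the long tail served from a slow tier (standing in for NVMe SSD)."""
    def __init__(self, slow_tier: dict[int, bytes], hot_capacity: int = 100_000):
        self.slow = slow_tier                   # row_id -> embedding bytes on the slow medium
        self.hot: dict[int, bytes] = {}         # small, fast cache for high-frequency patterns
        self.hot_capacity = hot_capacity
        self.freq = Counter()                   # access statistics (the Zipf head dominates)

    def get(self, row_id: int) -> bytes:
        self.freq[row_id] += 1
        if row_id in self.hot:                  # fast path: frequent n-gram
            return self.hot[row_id]
        row = self.slow[row_id]                 # slow path: long-tail pattern
        if len(self.hot) < self.hot_capacity:   # naive admission policy, enough for the sketch
            self.hot[row_id] = row
        return row
```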

U-shaped expansion rule and sparsity allocation

As a specific implementation of "conditional memory", Engram structurally complements the "conditional computation" provided by MoE experts. This section aims to explore the expansion properties of this duality and how to optimally allocate the sparse capacity.

Specifically, this research is driven by two core questions:

Allocation under limited constraints: Given a fixed total number of parameters and training computation (i.e., equal parameters and equal FLOPs), how should the sparse capacity be divided between MoE experts and Engram embeddings?

Infinite memory paradigm: Considering that Engram has a lookup cost that does not increase with scale, how will Engram itself behave in terms of expansion if the memory budget is relaxed or aggressively expanded?

First, let's look at the optimal allocation ratio between MoE and Engram. To analyze this trade-off under matched budgets, DeepSeek used the following three parameter metrics:

P_tot: The total trainable parameters, excluding vocabulary embeddings and the language model head.

P_act: The parameters activated per token. This metric determines the training cost (FLOPs).

P_sparse (= P_tot − P_act): The non-activated parameters, representing the "free" parameter budget that can be used to increase the model size without increasing the computation cost (e.g., unselected experts or non-retrieved embeddings).

DeepSeek kept P_tot and P_act fixed within each FLOPs budget, so the models have the same number of parameters and the same FLOPs per token. For MoE, P_act is determined by the selected top-k experts, and the parameters of the unselected experts contribute to P_sparse. For Engram, only a fixed number of slots are retrieved per token, so increasing the number of embedding slots increases P_tot but does not increase the FLOPs per token.
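The bookkeeping behind this sweep is simple enough to write down. The helper below checks that a candidate MoE/Engram split keeps P_act on budget and reports P_tot, P_sparse, and the MoE share ρ of the sparse capacity; the variable names and this exact definition of ρ are assumptions for illustration.

```python
def sparsity_allocation(p_act: float,
                        moe_total: float, moe_active: float,
                        engram_total: float, engram_active: float):
    """Return (P_tot, P_sparse, rho) for a candidate split; all counts in parameters.

    P_act is pinned by the FLOPs budget, so only non-activated parameters are traded
    between unselected MoE experts and non-retrieved Engram slots.
    """
    p_tot = moe_total + engram_total
    p_sparse = p_tot - (moe_active + engram_active)            # the "free" parameter budget
    assert abs((moe_active + engram_active) - p_act) < 1e-6, "FLOPs budget violated"
    rho = (moe_total - moe_active) / p_sparse                  # MoE's share of sparse capacity
    return p_tot, p_sparse, rho

# Toy numbers only: shifting sparse capacity from experts to Engram at equal P_tot and P_act.
print(sparsity_allocation(p_act=1.0, moe_total=8.0, moe_active=0.9,
                          engram_total=2.0, engram_active=0.1))
```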

Next is "Engram in the infinite memory mode". In addition to optimizing the allocation under a fixed parameter budget, DeepSeek explored the complementary setting: aggressive memory expansion. The motivation for this research comes from Engram's unique ability to decouple storage and computation.

DeepSeek used a fixed MoE backbone with P_tot ≈ 3B and P_act = 568M and trained on 100B tokens to ensure convergence. On this basis, an Engram table was added, and the number of slots M was adjusted from 2.58 × 10⁵ to 1.0 × 10⁷ (increasing by up to approximately 1.3 billion parameters).

The following Figure 3 (left) reveals a consistent U-shaped relationship between the validation loss and the allocation ratio 𝜌. Notably, even when the MoE allocation is reduced to only 𝜌 ≈ 40% (i.e., 46 experts for a 5.7B model and 43 experts for a 9.9B model), the Engram model still achieves performance comparable to that of the pure MoE baseline (𝜌 = 100%).

In addition, the pure MoE baseline proves to be suboptimal: reallocating approximately 20%-25% of the sparse parameter budget to Engram yields the best performance. Quantitatively, at the ~10B scale (compute budget C = 6 × 10²⁰), the validation loss improves from 1.7248 for the pure MoE baseline (ρ = 100%) to the optimum of 1.7109 at ρ ≈ 80% (Δ = 0.0139). Notably, the position of this optimum is stable across scales (ρ ≈ 75%-80%), indicating a robust allocation preference under fixed sparsity. This observed U-shape confirms the structural complementarity between the two modules.

Figure 3 (right) shows that increasing the number of memory slots significantly improves the validation loss, and this improvement remains stable throughout the range. The curve follows a strict power law (linear in the logarithmic space), indicating that Engram provides a predictable expansion knob: larger memory continues to bring benefits without additional computation.
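For readers who want to reproduce such a check, "linear in logarithmic space" means that fitting log(loss − floor) against log(M) yields a near-constant slope. A generic sketch of that fit, taking the measured points and an assumed irreducible-loss floor as inputs (no data reproduced here), is:

```python
import numpy as np

def power_law_exponent(slots: np.ndarray, losses: np.ndarray, loss_floor: float) -> float:
    """Fit log(L - floor) = log(a) - b * log(M) and return the exponent b.

    A near-constant slope across the swept range is what a power law looks like:
    each multiplicative increase in memory slots buys a predictable loss reduction
    at no additional per-token FLOPs. The floor estimate is supplied by the caller.
    """
    slope, _ = np.polyfit(np.log(slots), np.log(losses - loss_floor), deg=1)
    return -slope
```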

The key point concerns expansion efficiency: while OverEncoding also benefits from a larger memory table, Engram unlocks greater expansion potential under the same memory budget.

Together with the allocation rule above, these results validate the role of conditional memory as an independent, scalable axis of sparse capacity that complements the conditional computation of MoE.