
A new paper co-authored by Liang Wenfeng: is the architecture of DeepSeek V4 revealed for the first time? It takes direct aim at a fatal flaw of the Transformer.

新智元 2026-01-13 09:20
Late at night, a new DeepSeek paper co-authored by Liang Wenfeng appeared. This time, they propose a brand-new Engram module that tackles the Transformer's memory problem, so that model capacity no longer has to rely on piling on more parameters!

Just now, a new paper from DeepSeek was released, with Liang Wenfeng as one of the authors!

This time, in collaboration with Peking University, they take direct aim at the Transformer's most critical and fatal weakness: memory.

Currently, Mixture of Experts (MoE) has become the mainstream architecture for large models. In essence, however, it is still built on the Transformer: lacking a native "knowledge lookup" mechanism, it has to simulate much of its retrieval ability through heavy computation.

In the 33-page paper, the team proposed a sparsity axis of "conditional memory" complementary to MoE and implemented it through a brand-new Engram module:

Modernize the classic hashed N-gram embedding to provide deterministic knowledge lookup with approximately O(1) time complexity.

Paper link: https://github.com/deepseek-ai/Engram/blob/main/Engram_paper.pdf

Through the modeling of "Sparsity Allocation", they unexpectedly discovered a "U-shaped scaling law" between MoE and Engram.

This means that the resource ratio between the two needs to be adjusted to find an optimal trade-off between computation and static memory.

Following this law, the team scaled Engram up to 27B parameters, where it outperforms the MoE baseline under strictly matched parameter counts and FLOPs.

Put simply, MoE only solves the problem of "how to calculate less", while Engram directly solves the problem of "don't calculate blindly".

It delegates work that only needs a lookup to O(1) memory, freeing attention from local trivia. As a result, the model not only memorizes knowledge better but also improves at reasoning, coding, and mathematics.

This may become the next mainstream route for sparse LLMs. More importantly, the next-generation V4 may integrate this new method.

Stop laborious computation and insert an "electronic brain" into the Transformer

By now it has become practically an iron rule that LLMs keep getting larger and larger. A familiar approach is this:

Increase the number of parameters and make the computation "sparse".

The Mixture of Experts (MoE) model is a typical example. Each token only needs to activate a small number of experts, so through "conditional computation" the parameter count can soar while FLOPs stay under control.

As can be seen from the Artificial Analysis leaderboard, the mainstream of existing sparse large models is MoE.

However, the problem is that the Transformer lacks a native "knowledge lookup" ability. Many tasks that should be solvable with an O(1) lookup, such as retrieval, instead have to be simulated with large amounts of computation, which is very inefficient.

The new paper from Peking University and DeepSeek brings up a very interesting point: Sparsification can serve not only "computation" but also "memory".

Therefore, the team proposed Engram, which delegates a large number of "fixed, local, and stereotyped" patterns in language modeling to an extensible lookup table module.

In this way, the Transformer backbone can focus its attention and depth on tasks that require more "combination and reasoning".

Two types of tasks in language modeling

In the paper, the authors clearly divide language modeling into two types of subtasks:

Some tasks require "combination and reasoning": context relationships, long-range dependencies, logical reasoning, and chained reasoning.

The other type of tasks is more like "pattern retrieval": entity names, fixed collocations, common phrases, grammar fragments, and repeatedly occurring local structures.

The obvious common feature of the latter is that they are local, stable, and recur frequently.

If multi-layer attention and FFN are used to "compute" them, the model can manage it, but the cost is extremely high, and it also takes up representational capacity in the early layers.

To recognize the entity "Diana, Princess of Wales", an LLM has to spend multiple layers of attention and FFN gradually composing features, when in theory the whole process could be completed with a single knowledge lookup.

What Engram wants to do is straightforward:

Transfer these "local static patterns" to a cheap knowledge lookup primitive.

It quickly provides candidate information through deterministic lookup tables, and then the context decides whether to adopt it.

Core architecture of Engram: Brute-force lookup + Memory switch

The term "engram" comes from neuroscience, where it originally meant "memory trace". Here, it is an extensible, retrievable memory unit.

It can be used to store patterns and fragments of information that the LLM may encounter during inference.

Engram can be understood as a modernization of the classic "hashed N-gram embedding", turned into an extensible lookup-table module inserted into the middle layers of the Transformer.

As shown in Figure 1, Engram is a conditional memory module that aims to enhance the Transformer backbone by structurally separating the storage of static patterns from dynamic computation.

Formally, given an input sequence X = (x_1, ..., x_T) and the hidden state H^(l) ∈ R^(T×d) at layer l, the module processes each position t in two functional stages: Retrieval and Fusion.
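
To make the two stages concrete, here is a minimal, illustrative sketch of how an Engram-style module could slot into a Transformer layer. The class name, the sub-modules it receives, and the mean-pooling over retrieved entries are assumptions made for exposition, not the paper's actual code.

```python
import torch.nn as nn

class EngramSketch(nn.Module):
    """Illustrative two-stage flow: Retrieval maps each position's local
    context to static memory entries; Fusion gates them into H^(l)."""
    def __init__(self, retriever, embedding_table, gate):
        super().__init__()
        self.retriever = retriever        # deterministic N-gram hashing (no parameters)
        self.table = embedding_table      # static memory, e.g. a large nn.Embedding
        self.gate = gate                  # context-aware fusion module

    def forward(self, token_ids, hidden):
        # Stage 1: Retrieval -- indices depend only on token IDs, not on `hidden`
        indices = self.retriever(token_ids)          # (batch, seq, num_entries)
        memory = self.table(indices).mean(dim=2)     # aggregate retrieved entries into e_t
        # Stage 2: Fusion -- the context decides how much of e_t to adopt
        return self.gate(hidden, memory)             # updated hidden state H^(l)
```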

Next, let's take a look at the key design points of Engram.

Sparse retrieval based on hashed N-grams

The first stage is responsible for mapping the local context to static memory entries, which is achieved through tokenizer compression and deterministic hashed embedding retrieval.

Tokenizer compression

To maximize semantic density, the authors introduced a vocabulary projection layer.

They pre-computed a surjective function P: V → V', using normalized text equivalence (e.g., NFKC normalization and lowercasing) to collapse the original token IDs into canonical identifiers.

This process reduces the effective vocabulary size of a 128k-entry tokenizer by 23%.
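
As a rough illustration of how such a projection could be pre-computed, the sketch below collapses token IDs whose decoded text is identical after NFKC normalization and lowercasing. It assumes a Hugging Face-style tokenizer with `decode` and `vocab_size`; the exact equivalence rules used in the paper may differ.

```python
import unicodedata

def build_vocab_projection(tokenizer):
    """Pre-compute a surjective map P: V -> V' by collapsing token IDs whose
    surface forms coincide after normalization. proj[token_id] = canonical_id."""
    canonical = {}                              # normalized text -> canonical token ID
    proj = [0] * tokenizer.vocab_size
    for token_id in range(tokenizer.vocab_size):
        text = tokenizer.decode([token_id])
        key = unicodedata.normalize("NFKC", text).lower()
        if key not in canonical:
            canonical[key] = token_id           # first occurrence becomes the canonical ID
        proj[token_id] = canonical[key]
    return proj
```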

Multi-head hashing

Directly parameterizing the space of all possible N-gram combinations is computationally infeasible, so the authors adopted a hashing-based method.

To reduce collisions, K different hash heads are assigned to each N-gram order n.

Each head k maps the compressed context to an index into the embedding table E_n,k through a deterministic hash function φ_n,k.
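
The sketch below shows one way such deterministic multi-head hashing could look: each head uses a differently salted hash of the compressed N-gram, so a collision in one head rarely repeats in the others. The hash function (BLAKE2b) and the table size are illustrative assumptions, not the paper's exact φ_n,k.

```python
import hashlib

def ngram_indices(compressed_ids, t, n, num_heads, table_size):
    """Map the n-gram ending at position t to one row index per hash head k,
    i.e. an index into each embedding table E_{n,k}."""
    ngram = tuple(compressed_ids[max(0, t - n + 1): t + 1])
    indices = []
    for k in range(num_heads):
        payload = f"{n}|{k}|{ngram}".encode()                     # salt by order n and head k
        digest = hashlib.blake2b(payload, digest_size=8).digest()
        indices.append(int.from_bytes(digest, "little") % table_size)
    return indices
```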

Context-aware gating

The retrieved embeddings e_t serve as context-independent prior information, but they are susceptible to noise from hash collisions or polysemy.

To enhance expressiveness and resolve this ambiguity, the authors adopted a context-aware gating mechanism inspired by attention.

They use the current hidden state h_t as a dynamic query and the retrieved memory e_t as the source of the key and value projections, where W_K and W_V are learnable projection matrices.

To ensure gradient stability, they apply RMSNorm to the query and key before computing the scalar gate α_t ∈ (0, 1).
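
A minimal PyTorch sketch of such a gate is shown below (assuming a recent PyTorch that provides `nn.RMSNorm`). How the gated memory is merged back into the hidden state, here a gated residual add, is an assumption; only the query/key/value roles, the RMSNorm, and the scalar sigmoid gate follow the description above.

```python
import torch
import torch.nn as nn

class EngramGate(nn.Module):
    """Context-aware gate: h_t is the query, the retrieved memory e_t supplies
    key and value; RMSNorm on query and key stabilizes the dot product."""
    def __init__(self, d_model, d_mem):
        super().__init__()
        self.W_K = nn.Linear(d_mem, d_model, bias=False)
        self.W_V = nn.Linear(d_mem, d_model, bias=False)
        self.q_norm = nn.RMSNorm(d_model)
        self.k_norm = nn.RMSNorm(d_model)

    def forward(self, h, e):
        # h: (batch, seq, d_model) hidden states; e: (batch, seq, d_mem) retrieved memory
        q = self.q_norm(h)
        k = self.k_norm(self.W_K(e))
        v = self.W_V(e)
        # scalar gate alpha_t in (0, 1) for every position
        alpha = torch.sigmoid((q * k).sum(dim=-1, keepdim=True) / q.shape[-1] ** 0.5)
        return h + alpha * v              # gated residual update (an assumption)
```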

Finally, to expand the receptive field and enhance the model's non-linearity, the authors also introduced a short depth-causal convolution.
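
One plausible reading of this, sketched below, is a short depthwise convolution that is padded on the left so each position only sees the past; the kernel size and the depthwise choice are assumptions.

```python
import torch.nn as nn
import torch.nn.functional as F

class ShortCausalConv(nn.Module):
    """Depthwise 1-D convolution over the sequence that only looks at past
    positions, giving each channel a small local receptive field."""
    def __init__(self, d_model, kernel_size=4):
        super().__init__()
        self.kernel_size = kernel_size
        self.conv = nn.Conv1d(d_model, d_model, kernel_size, groups=d_model)  # depthwise

    def forward(self, x):
        # x: (batch, seq, d_model); left-pad so no future information leaks in
        x = x.transpose(1, 2)                       # (batch, d_model, seq)
        x = F.pad(x, (self.kernel_size - 1, 0))
        return self.conv(x).transpose(1, 2)         # back to (batch, seq, d_model)
```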

Visualization of gating

To empirically verify whether Engram behaves as expected, the authors visualized the gating scalars α_t of Engram-27B on various samples in Figure 7.

The results show a clear selective pattern: the gating mechanism is consistently activated (shown in red) on local, static patterns.

In English, strong activation is observed on multi-token named entities (such as "Alexander the Great", "the Milky Way") and fixed phrases (such as "By the way", "Princess of Wales").

The key point is that this behavior generalizes effectively across languages.

In the Chinese demo, Engram can identify and retrieve unique idiomatic expressions and historical entities, such as "Four Great Inventions" and "Zhang Zhongjing".

These qualitative results confirm that Engram successfully identifies and processes fixed language dependencies, effectively freeing the Transformer backbone from memorizing these static associations.

System efficiency: Decoupling computation and storage

Scaling memory-enhanced models is often limited by the capacity of GPU high-bandwidth memory (HBM).

However, Engram's deterministic retrieval mechanism naturally supports the decoupling of parameter storage and computing resources.
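
Because the retrieval indices are a deterministic function of the token IDs alone, they can be computed before the forward pass even starts. The sketch below shows one way this property could be exploited, keeping a large static table in host memory and moving only the rows a batch actually needs onto the GPU; the offloading scheme and sizes are illustrative assumptions, not the paper's system design.

```python
import torch

# Illustrative size only; a real Engram table could be far larger and can
# stay in host RAM because lookups are plain index gathers.
table = torch.randn(1_000_000, 64, dtype=torch.float16)   # static memory on CPU

def gather_memory(indices_cpu, device="cuda"):
    """Gather only the rows this batch needs and copy them to the accelerator.
    Since indices depend only on token IDs, this gather can run ahead of
    (or overlap with) the Transformer forward pass."""
    rows = table[indices_cpu]          # CPU-side gather of the needed rows
    return rows.to(device)             # move just those rows to the GPU
```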