Behind DeepSeek's back-to-back release of two papers lies an academic relay.

机器之心 · 2026-01-16 09:25
True technological breakthroughs often come from continuous openness and mutual inspiration.

As of mid-January 2026, we still haven't seen the release of DeepSeek V4, but its shape is coming into focus.

Recently, DeepSeek published two research papers in quick succession. One addresses how to ensure the stable flow of information, and the other focuses on how to efficiently retrieve knowledge.

When the first paper (mHC) came out, many readers were baffled and turned to AI assistants for explanations from every angle. After going through the online discussions, we found that the best way to understand it is to trace the research lineage and see how researchers have built on one another's work over the years. The same applies to the second paper (Conditional Memory).

So we began reading through analyses from various researchers, and in doing so noticed an interesting pattern: many works from DeepSeek and ByteDance's Seed team form a kind of "relay". mHC builds substantially on Seed's HC (Hyper-Connections), and Conditional Memory cites several Seed works, such as OverEncoding and UltraMem.

We believe that clarifying the relationships between these works not only deepens our understanding of DeepSeek's papers but also reveals the directions in which large-model architecture innovation is breaking through.

In this article, drawing on our own observations and comments from academic experts, we attempt to lay out the details for you.

A Decade-Long Relay of Residual Connections

To understand mHC, we need to go back to 2015.

That year, Kaiming He and his collaborators proposed ResNet, which used residual connections to solve a long-standing problem in training deep neural networks: as layers stack up, the information passed from early layers to later ones gradually degrades, and the last few layers can hardly learn anything. The idea behind residual connections is simple: each layer receives not only the processed output of the previous layer but also keeps a copy of the original input; the two are added together and passed on.
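
In code, the idea amounts to a single addition. Below is a minimal, illustrative PyTorch sketch of a residual block (not the original ResNet implementation; the inner transformation here is an arbitrary stand-in):

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """y = x + F(x): the input always has a direct path to the next layer."""
    def __init__(self, dim: int):
        super().__init__()
        # F(x): any learned transformation; a tiny MLP here purely for illustration
        self.f = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.f(x)  # add the original input back onto the processed result

x = torch.randn(2, 64)
y = ResidualBlock(64)(x)  # output shape is unchanged: (2, 64)
```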

This design became a cornerstone of deep learning. Over the past decade, almost all mainstream deep architectures have adopted residual connections by default, from CNNs in computer vision to the Transformer in natural language processing to today's large language models.

Over that period, researchers have made many improvements to attention mechanisms, normalization methods, and activation functions, but the basic form of residual connections has hardly changed.

It wasn't until September 2024 that ByteDance's Seed team proposed HC, and the paper was later accepted by ICLR 2025.

The core innovation of HC is to significantly increase the network's topological complexity without changing the FLOPs cost of a single compute unit. In other words, under the same computational budget, the model can explore a much wider range of ways to combine features.

Liu Yong, a tenured associate professor and doctoral supervisor at Renmin University of China, believes that HC breaks with the identity-mapping residual connection tradition dominated by ResNet and proposes a new paradigm of multi-channel concurrent connections. By introducing width dynamics and cross-layer feature aggregation, it shows that expanding the feature dimension of the residual path (Expansion) and introducing learnable Dynamic Hyper-Connections can effectively alleviate representation collapse and improve the pre-training efficiency of large language models. It provides a brand-new architectural foundation beyond traditional residual networks: rather than being limited to single-path feature addition, the model builds a higher-dimensional, more flexible feature-flow space through hyper-connections.

DeepSeek writes in the mHC paper that, in recent years, research represented by Hyper-Connections (HC) (Zhu et al., 2024) has introduced a new dimension to residual connections and experimentally verified its significant performance potential. The single-layer structure of HC is shown in Figure 1(b): by widening the residual stream and enriching the connection structure, HC substantially increases the network's topological complexity without changing the FLOPs cost of a single compute unit.
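
To make that description concrete, here is a heavily simplified sketch of the widened-residual-stream idea. The stream count, parameter shapes, and mixing scheme below are our own illustrative assumptions, not HC's actual formulation (which includes static and dynamic variants and a different parameterization); the point is only that the per-layer FLOPs stay the same while the connection topology becomes richer.

```python
import torch
import torch.nn as nn

class HyperConnectedBlock(nn.Module):
    """Toy block: n parallel residual streams with learnable read/write/mix weights."""
    def __init__(self, dim: int, n_streams: int = 4):
        super().__init__()
        self.layer = nn.Linear(dim, dim)               # stand-in for attention/FFN
        # learnable connections: how streams feed the layer, how its output is
        # written back, and how the streams mix with each other
        self.read = nn.Parameter(torch.full((n_streams,), 1.0 / n_streams))
        self.write = nn.Parameter(torch.ones(n_streams))
        self.mix = nn.Parameter(torch.eye(n_streams))

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (n_streams, batch, dim) -- the widened residual stream
        x = torch.einsum("s,sbd->bd", self.read, h)    # aggregate streams into one input
        y = self.layer(x)                              # layer runs once: FLOPs unchanged
        h = torch.einsum("st,tbd->sbd", self.mix, h)   # stream-to-stream connections
        return h + self.write[:, None, None] * y       # write the output back to each stream

h = torch.randn(4, 2, 64)
out = HyperConnectedBlock(64, n_streams=4)(h)          # (4, 2, 64)
```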

In other words, the paradigm proposed by ByteDance's Seed team, a widened residual stream plus learnable connection matrices, forms an important basis for the design of DeepSeek's subsequent method, and that follow-up work is carried out within this framework.

However, HC ran into bottlenecks in large-scale training: training became unstable and scalability was limited. Even so, it pointed the way for subsequent research. Liu Yong believes the HC paper provides three core ideas for mHC:

First, stream expansion: widening the residual stream (for example by 4x or more) significantly increases the model's capacity and learning ability.

Second, weighted multi-scale connections: a learnable matrix allocates the contributions of features at different levels, highlighting the importance of managing connection weights (which mHC later addresses with the Sinkhorn-Knopp algorithm).

Finally, the potential of dynamic topology: the paper shows that the model can dynamically adjust the direction of feature flow according to depth. This soft topology offers a new perspective on the training difficulties of deep networks. These explorations made the mHC team realize that although a more complex topology brings gains, the accompanying problems of training stability and engineering efficiency must also be solved.

Building on these explorations, the DeepSeek team set a clear direction for mHC: inherit the advantages of the HC architecture while targeting the bottlenecks in its large-scale deployment.

Liu Yong pointed out that mHC makes targeted improvements to the stability risks and memory-access overhead that HC exposes at large scale. In terms of research approach, mHC keeps HC's width expansion and multi-path aggregation, then applies manifold constraints, via techniques such as the Sinkhorn-Knopp algorithm, to project HC's unconstrained connection space back onto a specific manifold. This preserves HC's performance advantages while restoring the crucial identity-mapping property of residual networks, resolving the instability HC showed in ultra-large-scale training. At the engineering level, mHC proposes more efficient kernel optimizations (Infrastructure Optimization), moving the paradigm from theoretical experiments toward industrial deployment at trillion-parameter scale.
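
For reference, the Sinkhorn-Knopp algorithm mentioned here is a classical iteration that alternately normalizes the rows and columns of a positive matrix until it is (approximately) doubly stochastic, i.e. every row and column sums to 1. Below is a minimal sketch of that iteration only; how mHC actually parameterizes the constraint and fuses it into its kernels is not reproduced here.

```python
import torch

def sinkhorn_knopp(logits: torch.Tensor, n_iters: int = 20, eps: float = 1e-8) -> torch.Tensor:
    """Project a square matrix toward the doubly stochastic manifold."""
    m = logits.exp()  # ensure strictly positive entries
    for _ in range(n_iters):
        m = m / (m.sum(dim=1, keepdim=True) + eps)  # normalize rows
        m = m / (m.sum(dim=0, keepdim=True) + eps)  # normalize columns
    return m

w = torch.randn(4, 4)              # e.g. an unconstrained connection matrix
p = sinkhorn_knopp(w)
print(p.sum(dim=0), p.sum(dim=1))  # both approach vectors of ones
```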

With these improvements, mHC not only resolves the stability problem but also shows strong scalability in large-scale training (for example, on a 27B model).

In short, mHC clears HC's engineering bottlenecks in large-scale training: by introducing manifold constraints, it restores training stability while retaining the advantages of the HC architecture, making the new paradigm genuinely applicable to training mainstream large models.

Some online commenters see DeepSeek's mHC as a convincing advance on the training-architecture techniques behind ByteDance Seed's HC.

From the emergence of residual connections in 2015, to ByteDance Seed's HC in 2024, to DeepSeek's mHC in 2026, the evolution of residual connections is clearly the product of a continuous relay of refinement by different institutions and researchers.

And in another paper published by DeepSeek, we saw almost the same pattern repeat itself.

Both Building on N-grams, ByteDance's Seed and DeepSeek Reach New Conclusions in Turn

Unlike the rather abstract mHC paper, the problem tackled by the Conditional Memory paper is easy to grasp: many questions put to large models could be answered by simply looking them up in a table, for example "What is the capital of France?" But because the standard Transformer lacks a native primitive for knowledge lookup, even such simple questions force the model to compute the answer, much like having to re-derive formulas by hand during an exam. This is clearly wasteful.

The Conditional Memory paper's solution is to equip the model with a "cheat sheet" (Engram): common phrases are looked up directly in a table, and the computing power is saved for more complex reasoning.

Specifically, Engram works like this: the model is given a huge "phrase dictionary". When it reads a word (say, "Great"), it combines it with the preceding few words into an n-gram (such as "the Great" or "Alexander the Great"), converts that n-gram into a number with a hash function, and uses the number to look up the corresponding vector directly in the dictionary.
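
The mechanism can be sketched in a few lines. Everything below (table size, hash choice, n-gram length) is a made-up illustration of the general "n-gram hash lookup" pattern, not Engram's actual design:

```python
import torch
import torch.nn as nn

NUM_SLOTS = 1_000_003   # size of the hashed phrase dictionary (hypothetical)
DIM = 64                # width of each stored vector (hypothetical)

table = nn.Embedding(NUM_SLOTS, DIM)  # huge but sparsely accessed lookup table

def ngram_lookup(tokens: list[str], n: int = 3) -> torch.Tensor:
    """Hash the last n tokens into a slot and fetch its vector."""
    ngram = " ".join(tokens[-n:])       # e.g. "Alexander the Great"
    slot = hash(ngram) % NUM_SLOTS      # a real system would use a stable hash
    return table(torch.tensor(slot))    # only this one row of the table is touched

vec = ngram_lookup(["Alexander", "the", "Great"])  # vector for the whole phrase
```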

This "n-gram hash lookup" approach was used by ByteDance's Seed earlier. In the paper proposing the OverEncoding method ("Over-Tokenized Transformer: Vocabulary is Generally Worth Scaling"), they found that equipping the model with a huge n-gram dictionary gives an almost "free" performance improvement. Why free? As Liu Yong explains, because these massive embedding parameters are sparsely activated, only a tiny fraction of them is looked up at each inference step, so they consume little GPU memory or compute. More importantly, the paper found that the larger the dictionary, the better the performance, and the size of the improvement is predictable.

Paper link: https://arxiv.org/pdf/2501.16975

If ByteDance Seed's paper shows experimentally that "scaling up the input vocabulary improves performance", the DeepSeek paper opens a new track: it turns the n-gram table into an external memory (Engram), divides the labor with MoE, formally proposes "conditional memory" as a new axis, and shows how to allocate parameters most cost-effectively.

Back to the exam analogy: ByteDance's Seed found that handing students a formula handbook raises their grades and concluded that "a large vocabulary is a better input representation". DeepSeek asks a further question: why, mechanistically, does the handbook help? Through mechanism analysis with tools such as LogitLens, they found that the lookup mechanism frees the model from the heavy work of reconstructing local static patterns, letting the early layers obtain high-order semantics directly and thereby increasing the model's effective reasoning depth.

Building on this insight, DeepSeek no longer treats the n-gram table as a mere vocabulary expansion but elevates the experimental finding into "Conditional Memory", a new scaling-law axis parallel to conditional computation (MoE). On that basis, they pose the problem of sparsity allocation: under a fixed parameter budget, how should parameters be split between MoE experts and the static memory module? Experiments reveal a U-shaped scaling law: betting everything on MoE is not optimal, and allocating roughly 20%-25% of the parameters to Engram yields better results.

On the engineering side, Liu Yong noted, DeepSeek also made systematic improvements. At the architectural level, it lifts the earlier work's restriction of injecting information only at the input layer (Layer 0) and injects the Engram module into the model's middle layers, letting memory access and deep computation run and integrate in parallel. For the interaction mechanism, it drops simple embedding summation in favor of "context-aware gating", which uses hidden states to dynamically adjust the retrieved results. For system optimization, it improves storage efficiency through tokenizer compression and uses hardware-level prefetching to hide the latency of the massive parameter table, making the technique viable for large-scale industrial deployment.
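
As a rough illustration of the "context-aware gating" idea (our own minimal sketch; the paper's actual gating function and injection points are not reproduced here), the hidden state computes a gate that decides how much of the retrieved memory vector is added back:

```python
import torch
import torch.nn as nn

class GatedMemoryInjection(nn.Module):
    """Inject a retrieved memory vector into a hidden state via a learned gate."""
    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Linear(dim, dim)  # gate is conditioned on the hidden state

    def forward(self, hidden: torch.Tensor, retrieved: torch.Tensor) -> torch.Tensor:
        g = torch.sigmoid(self.gate(hidden))   # per-dimension weights in (0, 1)
        return hidden + g * retrieved          # keep only what the context accepts

h = torch.randn(2, 64)   # hidden state at some middle layer
r = torch.randn(2, 64)   # vector retrieved from the n-gram table
out = GatedMemoryInjection(64)(h, r)
```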

In Section 3.2 of the paper, DeepSeek compares Engram with ByteDance Seed's OverEncoding method and points out that although both benefit from a larger embedding table, Engram scales significantly more efficiently under the same parameter budget.

Leveling Up Together, Inspiring Each Other: The Value of Publishing Research Becomes Concrete

Every time DeepSeek publishes a paper, it causes a stir on Twitter. One blogger even claimed that 30% of the passengers on his flight were reading the newly released DeepSeek paper.

Ultimately, this reflects a problem: fewer and fewer leading large-model makers are willing to share their research openly and help the whole field "level up" together. The research relay between DeepSeek and ByteDance's Seed shows the value of publishing results openly.

At the same time, the way DeepSeek mines strong results from the community is itself instructive: leading Chinese large-model teams such as ByteDance's Seed have many ideas worth exploring further.

At the architectural level, for example, besides the OverEncoding work mentioned above, the DeepSeek paper cites several related studies from ByteDance's Seed, including the sparse model architecture UltraMem and its successor UltraMemV2. Through a distributed multi-level cascaded memory structure, Tucker-decomposition retrieval, and implicit parameter expansion, this architecture effectively tackles the heavy memory-access cost that traditional MoE incurs at inference time, while demonstrating scaling-law behavior superior to the traditional architecture.

In addition, ByteDance's Seed has published many bold explorations of new paradigms in basic research. Seed Diffusion Preview systematically verified the feasibility of the discrete-diffusion route as a foundational framework for next-generation language models; SuperClass dropped the text encoder for the first time and used the tokenized raw text directly as multi-class labels, beating the traditional CLIP approach on visual tasks; the team even proposed a new neural network architecture, FAN, which draws on Fourier analysis to make up for the weaknesses of mainstream models such as the Transformer in modeling periodicity.

Although this foundational research may not feed into commercial model training in the short term, it is precisely through countless researchers probing the unknown that the technology industry advances.

After all, what truly drives technological progress is never a single breakthrough but continuous accumulation and mutual inspiration.
