DeepSeek and ByteDance are in the same boat.
On the eve of the new year, DeepSeek published a major paper on neural-network architecture innovation, with Liang Wenfeng as the corresponding author. The paper proposes the manifold-constrained HyperConnection (mHC) architecture, aimed squarely at the stability problem in large-scale model training.
For Chinese AI companies working under hardware limitations, this work opens up a path that balances performance and efficiency. It also echoes ByteDance's earlier exploration of residual-flow optimization: both set out to rework the residual connection, the basic building block of model architecture.
DeepSeek's research fills the systemic gap left by ByteDance's HyperConnection technology. The result not only offers a new option for putting large-model architecture research into industrial practice but also confirms, once again, the evolutionary logic that hardware constraints can be turned into drivers of innovation.
Since ResNet was proposed in 2015, the residual connection has been the skeletal design of deep learning. By letting signals bypass stacks of non-linear transformations through shortcut connections, it fundamentally alleviates vanishing and exploding gradients, supporting ever-deeper model structures.
For a long time, industry innovation mostly focused on modules such as the attention mechanism and MoE (Mixture of Experts). The residual flow itself remained in a state of "silent stability" until ByteDance broke the pattern with its HyperConnection technology in 2024.
ByteDance's HyperConnection widens the residual flow into multiple parallel signal channels and lets the model learn how those channels interact, significantly improving expressive power. In large-scale training, however, the technique exposed a fatal weakness: signal divergence.
DeepSeek's tests show that when training a 27-billion-parameter model, the gradient norm begins to fluctuate violently after roughly 12,000 steps and training collapses. Worse, the signal magnitude at the 60th layer swells to 3,000 times the input value. The root cause is that, in pursuit of expressiveness, HyperConnection abandons the identity-mapping constraint of the original residual connection. At small scale this defect can be masked by parameter tuning; at large scale it is sharply magnified.
The core innovation of mHC is to constrain the learnable mixing matrix to the manifold formed by doubly stochastic matrices. This amounts to a "rigid budget" for signal propagation: every entry of the matrix is non-negative and each row and column sums to 1, so every output is a convex combination of the inputs and stays between the minimum and maximum values of the input signal, which prevents signal explosion.
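To make the "rigid budget" concrete, here is a toy numerical sketch in Python (illustrative only; the 4-stream matrix and signal values are invented for the example and are not from the paper). Because every row of a doubly stochastic matrix is a set of non-negative weights summing to 1, each output stream is a convex combination of the input streams and cannot leave the input's [min, max] range, however many such mixing layers are stacked.

```python
import torch

# Hypothetical 4x4 doubly stochastic mixing matrix over 4 parallel
# residual streams (illustrative values only, not from the paper).
P = torch.tensor([[0.40, 0.30, 0.20, 0.10],
                  [0.10, 0.40, 0.30, 0.20],
                  [0.20, 0.10, 0.40, 0.30],
                  [0.30, 0.20, 0.10, 0.40]])
assert torch.allclose(P.sum(dim=0), torch.ones(4))  # columns sum to 1
assert torch.allclose(P.sum(dim=1), torch.ones(4))  # rows sum to 1

x = torch.tensor([1.0, -2.0, 0.5, 3.0])  # per-stream signal values

h = x.clone()
for _ in range(60):          # stack 60 mixing layers
    h = P @ h
    # Each output is a convex combination of the inputs, so it can
    # never escape the [min(x), max(x)] range of the original signal.
    assert h.min() >= x.min() - 1e-5 and h.max() <= x.max() + 1e-5

print(h)  # values stay bounded instead of swelling thousands-fold
```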
More importantly, doubly stochastic matrices are closed under composition: the product of doubly stochastic matrices is itself doubly stochastic, so the constraint survives no matter how many layers are stacked. Experiments show that in the same scenario where HyperConnection amplified the signal 3,000-fold, mHC's peak signal amplification is only 1.6 times. To control the computational cost, DeepSeek performs the projection with the Sinkhorn-Knopp iteration, which converges in only about 20 iterations and keeps the additional training cost to 6.7%.
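The article names the projection method but does not give its implementation, so the following is a minimal sketch of the textbook Sinkhorn-Knopp procedure, under the assumption (ours, not the paper's) that the unconstrained learnable scores are first made positive with an exponential and then alternately row- and column-normalized for the roughly 20 iterations quoted above.

```python
import torch

def sinkhorn_knopp(scores: torch.Tensor, n_iters: int = 20) -> torch.Tensor:
    """Approximately project a square matrix of unconstrained scores onto
    the set of doubly stochastic matrices (toy sketch, not DeepSeek's kernel)."""
    m = torch.exp(scores)                      # make all entries strictly positive
    for _ in range(n_iters):
        m = m / m.sum(dim=1, keepdim=True)     # normalize each row to sum to 1
        m = m / m.sum(dim=0, keepdim=True)     # normalize each column to sum to 1
    return m

w = torch.randn(4, 4)                          # unconstrained learnable scores
p = sinkhorn_knopp(w)
print(p.sum(dim=0), p.sum(dim=1))              # both vectors are close to all-ones
```

On mixing matrices this small, each iteration is just two row/column reductions, which is at least consistent with the modest extra training cost cited above.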
Hardware constraints force not only algorithmic innovation but also system-level optimization across the whole stack. Once HyperConnection widens the residual flow, the volume of data read and written per layer doubles. With the limited interconnect bandwidth of A800/A100 GPUs, the chips easily fall into an efficiency trap in which they spend far more time waiting for data than computing. DeepSeek breaks the deadlock with three key techniques:
1. Operator fusion: Merge operations with similar memory access patterns into a single GPU kernel to reduce data transfer.
2. Recomputation in backpropagation: Instead of storing intermediate activations, recompute them on the fly during the backward pass, trading computation for memory (see the sketch after this list).
3. Pipeline parallel optimization: Overlap cross-GPU communication with local computation to mask communication latency with computation.
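DeepSeek's actual kernels are not included in the article. As a rough illustration of the second technique, recomputation, the same memory-for-compute trade can be written with PyTorch's activation-checkpointing utility; the block definition and sizes below are placeholders, not details from the paper.

```python
import torch
from torch.utils.checkpoint import checkpoint

class WideResidualBlock(torch.nn.Module):
    """Placeholder residual block standing in for one widened mHC block."""
    def __init__(self, dim: int):
        super().__init__()
        self.ff = torch.nn.Sequential(
            torch.nn.Linear(dim, 4 * dim),
            torch.nn.GELU(),
            torch.nn.Linear(4 * dim, dim),
        )

    def forward(self, x):
        return x + self.ff(x)

block = WideResidualBlock(512)
x = torch.randn(8, 512, requires_grad=True)

# Recomputation: intermediate activations inside the block are dropped after
# the forward pass and recomputed during backpropagation, trading extra
# compute for a bounded activation-memory footprint per block.
y = checkpoint(block, x, use_reentrant=False)
y.sum().backward()
```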
These optimizations turn a memory overhead that originally grows linearly with the number of layers into a bounded overhead controlled by module size. Combined with mixed-precision kernels written in TileLang (mainly bfloat16, with float32 where precision is critical), they deliver stable performance improvements across all parameter scales. In tests, models from 3 billion to 27 billion parameters all performed well once equipped with mHC; the 27-billion-parameter model improved by 2.1% on the BIG-Bench Hard complex-reasoning task and by 2.3% on the DROP reading-comprehension task.
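The TileLang kernels themselves are not public, so the snippet below is only a minimal numeric illustration of the bfloat16/float32 split; a real fused kernel would upcast inside its tile loop rather than materializing float32 copies of the operands.

```python
import torch

x = torch.randn(1024, 1024, dtype=torch.bfloat16)
w = torch.randn(1024, 1024, dtype=torch.bfloat16)

# Store weights and activations in bfloat16 to halve memory traffic, but
# accumulate the long matmul sums in float32 so precision is preserved,
# then cast the result back to bfloat16 for storage.
y = (x.float() @ w.float()).to(torch.bfloat16)
print(y.dtype)  # torch.bfloat16
```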
Previously, the V3 architecture paper corresponded to the V3 model, and the R1 reasoning paper corresponded to the R1 model. This time, the mHC paper was released three weeks before the Spring Festival in 2026, and the outside world generally expects the next-generation flagship model (R2) to debut soon.
This "papers first" strategy not only builds technical credibility through open scrutiny by the research community but also leaves a timestamp of originality in a complex geopolitical environment. It also sends a clear message to the world: the core competitiveness of Chinese AI companies does not depend on access to cutting-edge compute chips.
DeepSeek chose to publish its results on open platforms such as arXiv and Hugging Face rather than in traditional journals. It sacrificed some academic prestige but gained speed and reach in disseminating the technology. This open model accelerates the diffusion of knowledge and puts direct competitive pressure on peers: once mHC's performance gain can be quantified and its implementation reproduced, Western laboratories must either adopt similar techniques or demonstrate the superiority of their own paths.
Previously, the R1 model triggered a boom in reasoning-model research and development, and the mHC architecture is likely to push residual-flow optimization into a new round of iteration. More importantly, this approach sends a clear signal to technology regulators: hardware limitations have not stifled innovation. Instead, they have pushed Chinese AI companies toward the most fundamental path of solving problems at their mathematical root.
ByteDance and DeepSeek have successively stepped into the same innovative river of "breaking through the traditional residual flow". The former took the lead in exploring the way but stopped at the large-scale bottleneck. The latter, driven by hardware constraints, built a navigable technology bridge through mathematical constraints and system-level optimization.
There are only six weeks left until the Spring Festival in 2026, and the release of the R2 model will test how well the mHC architecture holds up in industrial practice. Whatever the final benchmark results, this path of innovating under constraints is a milestone in itself: it shows that the AI race has more than one track and is not only about burning money on compute. Hardware limitations are not a stumbling block to innovation but a catalyst for genuine core breakthroughs.
This article is written based on publicly available information and is only for information exchange purposes and does not constitute any investment advice.
This article is from the WeChat official account "Jinduan". Author: Mu Yang. Republished by 36Kr with permission.