
The truth about Kimi "breaking the Transformer architecture"

锦缎 (Jinduan), 2026-03-19 08:10
It does not overthrow the Transformer; it adds a crucial brick to the edifice.

This week, a paper titled "Attention Residuals" has thrust Kimi into the global AI spotlight. One of the paper's authors is a 17-year-old high school student. Elon Musk, CEO of xAI, and Shubham Saboo, a senior AI product manager at Google, have publicly congratulated Kimi; the latter even claimed that Kimi is touching "the part of the Transformer architecture that has gone untouched for a decade".

For a moment, public discussion was in an uproar. Headlines such as "Breaking the Transformer architecture", "Silicon Valley is shaken", and "Rewriting industry rules" quickly claimed the top spots.

Let's state the conclusion first: this is an ingenious idea and an extremely hardcore piece of research, but in essence it does not depart from the basic framework of the Transformer architecture. As for the sensational labels, most come from clickbait accounts and lack factual basis.

In fact, exploration of residual connections is not an isolated case. From DeepNorm in 2022 to DenseFormer in 2024, optimizing this cornerstone of deep neural networks has been a direction of sustained industry effort. The Kimi research team did not pioneer this technical route, but on this existing path they have contributed a solution that combines boldness, elegance, and engineering potential.

01

The Structural Dilemma of Deep Transformers

Driven by scaling laws, improving model performance increasingly relies on expanding parameters and scale, so a sharp increase in the number of network layers is inevitable. PreNorm, a normalization placement that effectively stabilizes training and accelerates convergence, has become the mainstream choice for modern architectures. Yet the research team noticed a key phenomenon: as data passes between layers, a "PreNorm dilution problem" emerges.

To build intuition, compare a large model to an assembly line of one hundred programmers. Each programmer corresponds to one layer of the network, and together they complete a large software project.

In the traditional standard-residual mode, the state update between layers follows (schematically):

x_{l+1} = x_l + f_l(x_l)

The output of the current layer equals the output of the previous layer plus this layer's "modification" (the output of its transformation function f_l). By analogy, each programmer receives the code from the previous one, appends their own changes, and passes it on to the next.

This simple accumulation causes a chain of problems in practice. Mathematically, it leads to two mutually reinforcing training dilemmas:

First, early information is diluted and buried. The original features extracted by the first layers, such as the initial semantics of tokens, have their relative weight gradually weakened after dozens of layers of accumulation, until their original form blurs. Programmers at the end of the assembly line have no idea what underlying logic was drafted at the source. The deeper the model, the harder it becomes to accurately retrieve and use early low-level features.

Second, numerical scale inflates and gradients become imbalanced. The continuous accumulation of residuals is like a codebase that never stops growing: if programmers added later want their changes to have visible impact, they must add ever larger amounts of code. In network terms, deep layers must output signals of larger numerical scale to have any say in the accumulation. This may be tolerable in the forward pass, but it hides a crisis in backpropagation: shallow-layer gradients may oscillate violently while deep-layer gradients shrink toward zero, leaving the gradient distribution across the network extremely uneven and training prone to instability.
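Both effects are easy to reproduce numerically. Below is a minimal, framework-free sketch in plain NumPy, with unit-scale random vectors standing in for layer outputs (an assumption for illustration, not the paper's setup): after 100 accumulations, the first layer's features are nearly orthogonal to the final residual stream, and the stream's norm has grown by roughly the square root of the depth.

```python
import numpy as np

rng = np.random.default_rng(0)
d, L = 256, 100                          # hidden size, number of layers

x0 = rng.normal(size=d)                  # "layer 1" features
x = x0.copy()
for _ in range(L):
    x = x + rng.normal(size=d)           # each layer adds a unit-scale residual

# How much of the original features survives in the final stream?
cos = (x0 @ x) / (np.linalg.norm(x0) * np.linalg.norm(x))
growth = np.linalg.norm(x) / np.linalg.norm(x0)

print(f"cosine(x0, x_final) = {cos:.3f}")    # ~1/sqrt(L+1): heavily diluted
print(f"norm growth         = {growth:.1f}")  # ~sqrt(L+1): scale inflation
```

The demo isolates the accumulation arithmetic only; real layers are correlated with their inputs, but the dilution and norm-growth trends are the same ones the text describes.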

The core research question thus condenses to: how can the "programmers" at the deepest layers of the network still clearly identify and call the basic code written by the first "programmer"?

02

The Duality Mapping between the Time Dimension and the Depth Dimension

The key insight of the Kimi research team lies in identifying a duality between time-series processing and network-depth construction in the evolution of neural networks.

The Transformer is not the original form of neural sequence models. Before the Transformer appeared in 2017, recurrent neural networks (RNNs) dominated sequence modeling. An RNN processes text token by token, compressing all history into a single hidden state passed forward, so later units receive only a "compressed package" of the past and early inputs are easily forgotten. This process is strikingly similar to the information transfer of standard residual connections.

The Transformer subverted this paradigm with its attention mechanism. In autoregressive decoding, each position can directly "look back" at all previous tokens in the sequence and focus on key information through weighting. In the time dimension, attention resolves the problems of information compression and forgetting.
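A minimal single-head causal attention sketch (plain NumPy; the shapes and names are illustrative, not taken from any particular implementation) makes the "look back" property concrete: the causal mask guarantees that position t aggregates information only from positions 0..t.

```python
import numpy as np

def causal_attention(Q, K, V):
    """Single-head scaled dot-product attention with a causal mask.
    Q, K, V: (T, d) arrays; position t may only attend to positions <= t."""
    T, d = Q.shape
    scores = Q @ K.T / np.sqrt(d)                   # (T, T) similarities
    mask = np.triu(np.ones((T, T), dtype=bool), 1)  # strictly-future positions
    scores[mask] = -np.inf                          # block them
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)           # row-wise softmax
    return w @ V                                    # weighted mix of history

rng = np.random.default_rng(0)
T, d = 6, 8
Q, K, V = (rng.normal(size=(T, d)) for _ in range(3))
out = causal_attention(Q, K, V)

# Position 0 can only see itself, so its output is exactly V[0].
print(np.allclose(out[0], V[0]))  # True
```

Every later row mixes all earlier value vectors with content-dependent weights, which is exactly the property the paper transplants from the time axis to the depth axis.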

A natural analogy emerges: can we abandon the "RNN-like thinking" implicit in residual connections and introduce the attention mechanism along the depth dimension of the network?

This is precisely the core innovation of the Kimi paper: Attention Residuals (AttnRes). The traditional residual accumulation is reshaped into a Softmax-weighted attention form, schematically:

x_{l+1} = f_l(x_l) + Σ_{i≤l} α_i · x_i,  where α = softmax(q_l · [x_0, …, x_l])

The new formula no longer simply sums the outputs of shallow layers. Instead, each layer carries a "pseudo-query vector" q_l that dynamically scans the outputs of all previous layers, assigning high Softmax weights to the layers holding key information while suppressing the weights of irrelevant layers toward zero.

This content-aware, input-dependent selection mechanism essentially transfers the core idea of the Transformer sideways into the design of the residual path. Residual connections thus change from passive "information transport" to active "on-demand retrieval", effectively avoiding the chronic problem of deep-layer information dilution.

03

From Theoretical Concept to System-Level Engineering

If it stopped here, attention residuals might remain an idealized laboratory construct. In real large-model engineering practice, especially the harsh environment of distributed training at hundreds of billions of parameters, applying this mechanism directly would blow up GPU memory and communication.

Distributed training generally relies on techniques such as activation recomputation and pipeline parallelism. Forcing full cross-layer connectivity on top of that would require deep layers to fetch the complete output tensors of all shallow layers across physical GPU nodes. As the layer count L grows, cross-stage traffic and GPU memory usage scale as O(L·d), a catastrophic burden on the compute cluster.

The block attention residuals the Kimi team proposed to solve this implementation problem therefore show considerable practical wisdom. To put the theory into practice, they designed a scheme whose core idea is "block-based dimensionality reduction".

Return to the programmer assembly line: requiring the last programmer to know the specific contribution of every previous colleague would mean every earlier programmer keeps a complete "draft box", which is physically infeasible. The solution is to divide the programmers into N departments. Within a department, standard residuals are used and the outputs of several layers are compressed into a single "block-level representation"; between departments, the attention-residual mechanism operates only over these N block-level representations, without tracing the output of every individual layer.

This simple, bold strategy directly reduces the GPU-memory and communication complexity from O(L·d) to O(N·d), removing the biggest obstacle to putting the theory into practice.
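Under the same schematic assumptions as before (plain pseudo-query vectors, earlier outputs as keys and values), the block-wise variant can be sketched as follows: standard residuals inside each block, one compressed representative per completed block, and cross-block attention only over those representatives, so only N vectors, not L, ever need to be kept or communicated.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def block_attn_res_forward(x, layers, block_size, queries):
    """Schematic Block AttnRes: O(N*d) cross-block state instead of O(L*d).
    layers:  list of per-layer transforms f(x) -> (d,)
    queries: one pseudo-query per block (illustrative parameterization)."""
    block_reps = []                          # only N block-level vectors kept
    for b in range(0, len(layers), block_size):
        if block_reps:                       # attend over finished blocks only
            B = np.stack(block_reps)
            w = softmax(B @ queries[b // block_size])
            x = x + w @ B                    # cross-block attention residual
        for f in layers[b:b + block_size]:
            x = x + f(x)                     # plain residual inside the block
        block_reps.append(x.copy())          # compress the block to one vector
    return x, block_reps

rng = np.random.default_rng(0)
d, L, block_size = 16, 12, 4
layers = [lambda x, W=rng.normal(size=(d, d)) / d: np.tanh(W @ x)
          for _ in range(L)]
queries = [rng.normal(size=d) for _ in range(L // block_size)]

x, reps = block_attn_res_forward(rng.normal(size=d), layers, block_size, queries)
print(len(reps))  # 3 blocks -> 3 cached representations, regardless of L
```

In a pipeline-parallel setting, `block_reps` is the only state that would cross stage boundaries, which is what makes the O(N·d) budget compatible with multi-node training.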

Second, a cross-stage cache design in the training phase further trims communication overhead. In the mainstream interleaved pipeline schedule, each physical GPU often handles multiple pipeline stages. The team designed a local cache so that block-level representations already received stay in local GPU memory, avoiding repeated cross-node transfers. This significantly compresses the communication peak of pipeline parallelism and lets cross-block communication be effectively hidden behind computation.

Finally, a two-stage computation with online-Softmax fusion in the inference phase alleviates the memory-bandwidth bottleneck. Repeatedly reading a large history of block-level representations during inference creates severe memory-bandwidth pressure, so the team adopted a two-stage strategy: first, cross-block attention is computed in batch to amortize the cost of memory reads; second, local attention within the block is computed sequentially. The two stages' results are merged exactly via online Softmax and kernel-fused with operators such as RMSNorm.
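Online softmax itself is a standard trick (the same one used in FlashAttention-style kernels). A minimal demo of the two-stage merge the text describes: softmax statistics (running max, normalizer, weighted value sum) computed separately over a "cross-block" part and a "local" part combine exactly into the full-softmax attention result.

```python
import numpy as np

def partial_stats(scores, values):
    """Softmax statistics over one chunk: (max, normalizer, weighted value sum)."""
    m = scores.max()
    e = np.exp(scores - m)
    return m, e.sum(), e @ values

def merge(stats_a, stats_b):
    """Combine two chunks' statistics into the statistics of their union."""
    (ma, sa, oa), (mb, sb, ob) = stats_a, stats_b
    m = max(ma, mb)
    ca, cb = np.exp(ma - m), np.exp(mb - m)   # rescale to the shared max
    return m, ca * sa + cb * sb, ca * oa + cb * ob

rng = np.random.default_rng(0)
scores = rng.normal(size=10)          # attention logits over 10 positions
values = rng.normal(size=(10, 4))     # their value vectors

# Stage 1: "cross-block" half; stage 2: "local" half; then merge exactly.
_, s, o = merge(partial_stats(scores[:6], values[:6]),
                partial_stats(scores[6:], values[6:]))
two_stage = o / s

full = np.exp(scores - scores.max()) / np.exp(scores - scores.max()).sum() @ values
print(np.allclose(two_stage, full))   # True: the merge is exact
```

Because the merge is exact rather than approximate, the two-stage split is purely a scheduling decision: it changes when memory is read, not what the model computes.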

The technical details need not be belabored; the results speak for themselves. With all these cross-layer attention mechanisms stacked on, the extra training overhead of Block AttnRes is almost negligible, and in a typical autoregressive inference scenario the end-to-end latency increase is under 2%. Achieving this degree of optimization while rewriting the underlying network topology of a large model is a feat of engineering.

04

Empirical Results and Industrial Significance

Finally, the Kimi research team deployed the architecture on a small MoE model of 48B total parameters (3B activated) and ran a real pre-training on up to 1.4 trillion tokens of data.

The scaling-law curves show that at equal compute, the model with Block AttnRes consistently reaches a lower loss. Put simply, the architecture delivers the performance a traditional baseline would need roughly 1.25× the compute to match. For a pre-training phase that routinely costs tens of millions of dollars, an effective 25% compute gain for free carries enormous commercial value.

In downstream ability tests, tasks that require multi - step logical reasoning benefit the most:

GPQA-Diamond rises by 7.5%, MATH by 3.6%, and HumanEval by 3.1%. The result is logically self-consistent: both mathematical derivation and code generation demand long-horizon reasoning and information retention, and the deep retrieval mechanism of AttnRes precisely meets this "never forget the starting point" requirement.

Yang Zhilin, founder of Dark Side of the Moon (Moonshot AI), indirectly corroborated the value of this line of work in his public talk at the 2026 NVIDIA GTC conference: "To keep pushing the ceiling of large-model intelligence, we must rebuild the underlying cornerstones: optimizers, attention mechanisms, and residual connections."

Of course, this technology is still a long way from truly subverting the Transformer architecture or rewriting industry rules. The core engineering code has not been fully open-sourced; the public repository offers only pseudocode-level demonstrations. Moreover, all the impressive experimental results come from Dark Side of the Moon's own model architecture and private data. Whether attention residuals can reproduce stable, significant gains on other mainstream large models still awaits independent third-party verification.

Objectively speaking, attempts to heuristically modify the underlying mechanisms of deep learning are not rare. That a paper could draw a near-instant thumbs-up from Musk is itself a measure of its weight.

Perhaps the most accurate conclusion is this: it is a redesign of the residual mechanism that combines academic elegance with engineering practicality, and it deserves close tracking by the whole industry. It does not overthrow the Transformer; it adds a key brick to the edifice.

And Dark Side of the Moon has shown the world that in the "deep water" of underlying architecture innovation, Chinese AI companies are also capable of submitting answers of high technical caliber and world-class standard.

This article is from the WeChat official account "Silicon-based Starlight" (author: Si Qi), published by 36Kr with authorization.