
Elon Musk reposted Kimi's paper, sparking a major discussion in Silicon Valley. What is the next battlefield for Attention?

Friends of 36Kr · 2026-03-20 07:12
Depth is the next battleground for attention.

On March 16, 2026, the Kimi team uploaded a paper titled "Attention Residuals" to arXiv, and things quickly got out of hand. Elon Musk retweeted it, Andrej Karpathy commented, "We haven't really taken the title 'Attention is All You Need' seriously," and OpenAI's Jerry Tworek simply said "deep learning 2.0." The last time an architecture paper from a Chinese team sparked this level of discussion in Silicon Valley may have been DeepSeek-V3.

However, despite all the buzz, most discussions stayed at the level of "Kimi has come up with something new, and the bigwigs are excited." What was overlooked is that on the same day, the ByteDance Seed team and Huazhong University of Science and Technology jointly published another paper, "Mixture-of-Depths Attention (MoDA)", which addresses the exact same problem from a completely different direction. The same week, a third paper, "When Does Sparsity Mitigate the Curse of Depth in LLMs" by Dilxat Muhtar of Nanjing University, Shiwei Liu of MPI, and others, provided the most precise pathological report from a theoretical perspective.

The emergence of these three papers in quick succession targeting the same issue is not a coincidence. A structural problem that has been overlooked for nearly a decade has finally reached a critical point where it has to be addressed.

The problem doesn't lie in the sequence dimension of attention. Attention has evolved through several generations over the past few years: from multi-head attention to grouped-query attention, to MLA in DeepSeek, and various sparse variants. Each generation has optimized how tokens interact with one another. This arms race has been spectacular, but it has obscured a fact: the way information is transmitted between layers has not changed since the Transformer paper was published in 2017. The residual connection, h = h + f(h), is an addition with no learnable parameters.

The outputs of all earlier layers are summed with equal weights. There is no selection, no forgetting, and no learning. Each layer's contribution is piled into the residual stream on equal footing, regardless of whether its features are crucial or just noise.
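A toy numpy sketch makes this concrete (the `layer_update` stand-in is purely illustrative, not any real model's sublayer): unrolling h = h + f(h) shows the final state is exactly the equal-weight sum of the embedding and every layer's output.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16          # toy hidden size
n_layers = 8    # toy depth

def layer_update(h):
    # stand-in for an attention/MLP block's output f(h)
    return 0.1 * rng.standard_normal(h.shape)

h = rng.standard_normal(d)
contributions = [h.copy()]          # the embedding itself
for _ in range(n_layers):
    f_h = layer_update(h)
    contributions.append(f_h)
    h = h + f_h                     # h = h + f(h): plain addition, no parameters

# Unrolled, the final state is the equal-weight sum of every contribution:
# no layer's output is weighted up or down, selected, or forgotten.
assert np.allclose(h, np.sum(contributions, axis=0))
```

Nothing in the loop can rescale or drop a past contribution; that is the "no selection, no forgetting" property in code.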

The residual connection is the most successful "temporary solution" in the history of deep learning.

The Most Successful Temporary Solution

The residual connection was proposed by Kaiming He and colleagues in ResNet in 2015. The idea is extremely simple. Once networks reached twenty-some layers, they became difficult to train: vanishing gradients left the parameters in the deep layers barely updated. So a "highway" was added to each layer, letting the input skip directly past it to the output. Even if a layer learned nothing, information and gradients could still pass through this shortcut. The effect was immediate: ResNet pushed network depth from twenty-some layers to over a hundred. Two years later, when the Transformer was introduced, the residual connection was adopted unchanged, and the design has remained untouched ever since.

It's not that no one has tried. ReZero, Fixup, and Highway Networks all proposed variants with learnable residual weights. None was adopted as the mainstream architecture for large models, because the plain residual connection is simply too good: simple, stable, and almost free computationally. At the model scales of the time, its side effects had not yet been exposed.

44% of Layers Are Idling

What are the side effects? At the beginning of 2025, Shiwei Liu's team, spanning Westlake University, Emory, and MPI, published "The Curse of Depth". In March this year, "When Does Sparsity Mitigate the Curse of Depth in LLMs" by Dilxat Muhtar and others from Nanjing University added a quantitative diagnosis. Under today's mainstream large-model architecture, the transformations in the deep layers drift ever closer to the identity mapping: what goes in comes out, which means the layer is practically useless.

The numbers are ugly. The researchers use a "usefulness score" to measure whether each layer performs a meaningful transformation. In a 12-layer model, every layer is working. In a 16-layer model, three layers are useless. In a 24-layer model, nine are. In a 32-layer model, 14 layers, or 44% of the total, have learned almost nothing. And when the parameter count grew from 900 million to 2.3 billion, a 156% increase in budget, the number of effective layers only rose from 12 to 18.

Figure 2 | Quantitative diagnosis of the curse of depth: the yield of effective layers declines as model scale grows. This image is AI-generated.

The reason traces directly to how the residual connection works. Each layer's output is added onto a "main road" through the residual connection. As layers stack up, the accumulated signal on the main road grows larger and larger (think of the "background volume" steadily rising), while the amplitude of each layer's new signal stays bounded. In the deep layers, the new signal drowns in this background, input and output become nearly identical, and the layer stops mattering.
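The dilution effect can be reproduced in a few lines of numpy (a hedged sketch, assuming each layer writes an independent unit-variance signal, which is a simplification, not the papers' setup): the stream's norm grows with depth, so each new layer's fixed-size update shrinks in relative terms.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_layers = 256, 64
h = rng.standard_normal(d)          # the residual stream ("main road")

rel_change = []
for _ in range(n_layers):
    f_h = rng.standard_normal(d)    # each layer writes a unit-variance signal
    # how big is this layer's new signal relative to the accumulated background?
    rel_change.append(np.linalg.norm(f_h) / np.linalg.norm(h))
    h = h + f_h                     # the "background volume" keeps rising

# A deep layer changes the stream far less, relatively, than an early one:
# its signal is drowned in the accumulated sum.
assert rel_change[-1] < 0.25 * rel_change[0]
```

With independent signals the background norm grows roughly like the square root of depth, which is why the relative contribution of layer 64 is only a fraction of layer 1's.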

The residual connection solved the problem of "letting the gradient through", but it created the problem of "making the deep layers matter".

In the era of large models, this cost is real. A single layer involves billions of floating-point operations. If 44% of the layers in a 128-layer model are idling, nearly sixty layers' worth of compute is doing useless work. The community has spent years optimizing inference efficiency, via quantization, distillation, pruning, sparse attention, and KV-cache compression, but all of these target the "useful computations".

The biggest efficiency black hole doesn't lie in the quadratic complexity of attention, but in an addition operation that hasn't changed since 2015.

Build a New Road Instead of Fixing the Old One

The starting point of the ByteDance Seed team and Huazhong University of Science and Technology is not "the residual connection is broken and needs replacing". Their question is more direct: since the attention mechanism already lets tokens interact with each other, why can't it also access information along the depth direction?

Traditional attention has only one dimension, the sequence dimension. When a token in the 20th layer computes attention, it can only see other tokens in the same layer. It cannot see its own state at the 3rd or 10th layer, even though the features learned in those shallow layers may be highly useful for the current computation. Those shallow features are still in the residual stream, but a dozen-plus rounds of residual updates have repeatedly superimposed and diluted them. When a deep layer wants shallow features, all it gets is this juice diluted a dozen times over.

MoDA's approach is to give attention a second dimension: depth. Each attention head performs both normal sequence attention (token looking at token) and depth attention (directly retrieving the original KV pairs from all previous layers). The two kinds of information are jointly normalized under a single Softmax, and the model decides for itself whether to focus on the current layer's context or look back at features learned in the shallow layers. The residual connection stays in place; the deep layers simply no longer depend on it alone for shallow information.
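The joint normalization can be sketched in numpy for a single query (a minimal illustration of the idea, not the paper's implementation; the array names and the 5-token/3-layer sizes are invented for the example): sequence keys and depth keys are simply concatenated before one Softmax, so both compete for the same attention budget.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
d = 8
q = rng.standard_normal(d)              # query of one token at the current layer
seq_k = rng.standard_normal((5, d))     # keys of other tokens, same layer
seq_v = rng.standard_normal((5, d))
depth_k = rng.standard_normal((3, d))   # this token's own keys from 3 earlier layers
depth_v = rng.standard_normal((3, d))

# one softmax over both axes: the model itself arbitrates sequence vs. depth
k = np.concatenate([seq_k, depth_k])
v = np.concatenate([seq_v, depth_v])
w = softmax(k @ q / np.sqrt(d))
out = w @ v

# sequence weights and depth weights share a single probability budget
assert np.isclose(w.sum(), 1.0)
```

Because both kinds of keys sit under one Softmax, raising the weight on a shallow-layer KV pair necessarily lowers the weight on same-layer context, which is exactly the trade-off the model learns to make.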

The idea is easy to understand, but the challenge lies in implementing it without slowing down the speed.

Figure 3 | MoDA's two-dimensional attention mechanism: the sequence dimension and the depth dimension are jointly normalized under the same Softmax.

Move the Scattered Files to the Workstation

The problem lies in the GPU's memory-access pattern. In normal attention, all KV pairs come from the same layer and sit contiguously in GPU memory, so the GPU can read them very efficiently. MoDA, however, needs KV pairs from every previous layer, and that data is scattered across different locations in GPU memory. Scattered random access is exactly what GPUs handle worst, and throughput collapses. Naively concatenating the KV pairs of all historical layers means that in a 48-layer model, each layer's attention has to rummage through the "files" of the previous 47 layers; nearly every memory access becomes random, dragging training speed down to unusable.

MoDA's solution is called Grouped Rearrangement. The core idea is that since random access is slow, the data should be rearranged into a continuous form before calculation.

The approach has two steps. First, divide the current layer's queries into fixed-size groups (say, 64 tokens per group). Second, for each group, copy the depth KV it needs (KV pairs from all previous layers) out of their scattered locations in GPU memory into one contiguous region, then run the attention computation in a single pass. Think of an assistant fetching all the files a worker needs and stacking them neatly on the desk beside his workstation, so he can leaf through them without getting up. The cost of one move is far lower than the cost of running back and forth.

The key to this design is the grouping granularity. If groups are too large, each group has to move too much depth KV, and the moving itself becomes the bottleneck. If groups are too small, the GPU's parallelism goes underused. MoDA picks the same block size as FlashAttention (the industry-standard high-speed attention kernel), so depth attention can directly reuse FlashAttention's underlying implementation without a brand-new GPU operator.
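The two steps can be mimicked on the CPU with numpy (a shape-level sketch only; real gathers happen in GPU memory inside a fused kernel, and the buffer layout here is an assumption for illustration): per-layer KV caches live in separate buffers, and each query group copies the slices it needs into one contiguous block before a dense attention kernel would run.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_prev_layers, seq_len, group = 8, 4, 256, 64

# per-layer KV caches sit in separate buffers ("scattered" in memory)
kv_per_layer = [rng.standard_normal((seq_len, d)) for _ in range(n_prev_layers)]

gathered_blocks = []
for start in range(0, seq_len, group):          # step 1: fixed-size query groups
    # step 2: copy the depth KV this group needs into one contiguous array;
    # a dense (FlashAttention-style) kernel can then scan it sequentially
    block = np.concatenate([kv[start:start + group] for kv in kv_per_layer])
    gathered_blocks.append(block)

# each group now owns a contiguous buffer of (layers x group) KV rows
assert gathered_blocks[0].shape == (n_prev_layers * group, d)
assert len(gathered_blocks) == seq_len // group
```

The trade-off in the text is visible here: a larger `group` means more rows copied per `np.concatenate`, while a smaller one yields more, tinier blocks for the kernel to launch over.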

At a sequence length of 64K, MoDA's operator efficiency reaches 97.3% of FlashAttention-2. With the entire depth-attention mechanism added, the slowdown is under 3%.

Figure 4 | The grouped rearrangement strategy: KV pairs from historical layers, scattered across GPU memory, are moved into one contiguous region.

The number matters because depth attention is not a lightweight plugin. It requires each layer to read the KV caches of all previous layers, and engineered poorly, this cross-layer data dependency can slow training severalfold. MoDA holds the extra overhead to a 3.7% increase in FLOPs, evidence that grouped rearrangement really does tame the random-access problem.

A 3.7% Cost for a 2.11% Gain

On a 1.5B-parameter model (trained with the OLMo2 recipe), MoDA improves average performance on 10 downstream tasks by 2.11%, at only 3.7% extra compute. That may not look dramatic, but it is an architectural improvement, not one bought with more data or longer training, and it should keep paying off as scale grows. The gains also differ sharply by task: 2.37% on commonsense reasoning (WinoGrande) and 4.35% on scientific reasoning (ARC-Challenge). Tasks that demand cross-layer feature integration benefit markedly more.

Figure 5 | Performance comparison of MoDA on 10 downstream tasks.

The Debt Owed by Pre-Norm

Perhaps the most valuable part of the MoDA paper is not MoDA itself, but an experiment on the normalization strategy.

Some background first. After each Transformer layer finishes its computation, the result passes through a "normalization" step that stabilizes the numerical range and keeps training values from exploding or vanishing. There are two mainstream placements: before each layer's computation, called Pre-Norm (or Pre-LN), or after it, called Post-Norm (or Post-LN). Since 2020, nearly all large models have used Pre-Norm because it makes training more stable and less prone to collapse. But the "deep-layer idling" problem described earlier is precisely Pre-Norm's side effect: to stabilize training, Pre-Norm continuously dilutes the signal strength of the deep layers.
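The two placements differ by one line, sketched here with a toy RMSNorm and a made-up stand-in sublayer (everything in this snippet is illustrative, not any production model's code): Pre-Norm normalizes the input but leaves the residual stream raw, so the stream's norm keeps growing with depth; Post-Norm normalizes the sum, keeping the stream bounded.

```python
import numpy as np

def rmsnorm(x):
    return x / np.sqrt(np.mean(x * x) + 1e-6)

def block(x):
    # stand-in for an attention/MLP sublayer with bounded output
    return 0.5 * x[::-1]

def pre_norm_step(h):
    return h + block(rmsnorm(h))      # normalize the input; stream stays raw

def post_norm_step(h):
    return rmsnorm(h + block(h))      # normalize the sum; stream stays bounded

h_pre = np.arange(1.0, 9.0)
h_post = h_pre.copy()
for _ in range(32):                   # stack 32 toy layers
    h_pre = pre_norm_step(h_pre)
    h_post = post_norm_step(h_post)

# Pre-Norm's residual stream grows with depth, which is exactly the rising
# "background volume" that dilutes each deep layer's contribution.
assert np.linalg.norm(h_pre) > 2 * np.linalg.norm(h_post)
```

The growing Pre-Norm stream is the stability/expressivity trade-off in miniature: the raw highway keeps gradients flowing, but it also swamps whatever the deep layers add.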

The MoDA paper ran a two-way comparison: a 48-layer model under Pre-Norm and under Post-Norm, with MoDA's depth attention added to each. Under Post-Norm, adding depth KV cut validation loss by 0.0409; under Pre-Norm, by only 0.0041, roughly a tenth as much.

This data reveals something bigger than MoDA itself: Pre-Norm is not merely "stabilizing training"; it is systematically suppressing the deep layers' ability to learn. People avoided Post-Norm because its training was unstable and its gradients prone to exploding. But MoDA's depth attention provides a brand-new gradient path, so gradients no longer have to travel entirely through the residual connection. With that new path, Post-Norm's old instability is no longer a fatal flaw.

The MoDA + Post-Norm combination opens a possibility: the compromise once made for training stability, adopting Pre-Norm, may yet be reversed.

Figure 6 | Difference in validation loss between Pre-Norm and Post-Norm after adding depth KV.

Renovate the Old Road Instead of Building a New One

MoDA didn't touch the residual connection; it built a new path alongside it. The Attention Residuals (AttnRes) paper published the same day by the Kimi team is more direct: it modifies the residual connection itself.

The standard residual connection sums the outputs of all previous layers with equal weights and piles them onto the main road, with no selection and no forgetting. AttnRes replaces this fixed equal-weight addition with an attention operation: each layer uses its own state as the query and the outputs of all previous layers as candidates, and attention decides which earlier features are useful to the current layer, and with what weights.
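The contrast can be sketched in a few lines of numpy (a reading of the paper's description, not its implementation; array names and sizes are invented): the fixed sum treats all six earlier layers identically, while the attention version scores each earlier output against the current state.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
d, n_prev = 8, 6
prev_outputs = rng.standard_normal((n_prev, d))  # outputs of all earlier layers
h = rng.standard_normal(d)                       # current state, used as the query

# standard residual: fixed equal-weight sum of the whole history
standard = prev_outputs.sum(axis=0)

# attention residual: the current state scores each earlier layer's output,
# so the mix of history is input-dependent rather than hard-wired
w = softmax(prev_outputs @ h / np.sqrt(d))
attn_residual = w @ prev_outputs

assert np.isclose(w.sum(), 1.0)                  # a weighting, not a raw pile-up
```

Where the standard sum gives every layer weight 1 forever, `w` changes with `h`, which is what turns the residual path into routing.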

The residual connection has changed from a fixed formula to a learnable dynamic routing.