Alibaba, Kimi, and Ant Group are all making bets. Has the hybrid attention mechanism shifted from an option to a necessity?
Yesterday, Xiaomi released the Mimo-V2 Pro large model, once again pushing the hybrid attention architecture into the industry spotlight.
This trillion-parameter model adopts a 1:7 hybrid attention ratio. While delivering capabilities close to Claude Opus 4.6, its API pricing is only one-fifth of the latter's.
In fact, Xiaomi's exploration of hybrid attention continues a technical consensus on efficiency optimization among leading domestic large-model makers. In recent months, several leading domestic players have demonstrated breakthrough progress in hybrid attention.
In February this year, Ant Group launched the world's first trillion-parameter thinking model based on a hybrid linear attention architecture. In September last year, Alibaba adopted hybrid linear attention in its next-generation model architecture, Qwen3-Next. Meanwhile, players such as Moonshot AI and MiniMax have introduced similar architectural optimizations in their respective model iterations.
Exploring hybrid attention has become all but a compulsory question for large-model makers. Only the technical paths differ; the shared goal is the balance point between efficiency and performance.
01. Leading players bet on hybrid attention, with multiple technical paths running in parallel
In deep learning, the attention mechanism lets a model selectively focus on the important parts of its input, and softmax attention has long been the core computation of mainstream architectures.
At every step, this mechanism "reviews" the complete context, accurately captures the associations between words, and gives the model strong expressive power and fine-grained alignment ability.
Its drawbacks, however, are just as clear: as text length grows, computational complexity grows quadratically, and the large KV cache it must store creates memory pressure. These deficiencies matter in commercial scenarios that increasingly prize inference efficiency and cost control.
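To make the quadratic cost concrete, here is a minimal NumPy sketch of standard (softmax) scaled dot-product attention for a single head. The N x N score matrix is exactly what grows quadratically with sequence length, and the full K and V histories are what the KV cache must hold:

```python
import numpy as np

def softmax_attention(Q, K, V):
    """Q, K, V: (N, d) arrays for a single head."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                 # (N, N) -- quadratic in N
    scores -= scores.max(axis=-1, keepdims=True)  # subtract row max for stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                            # (N, d)

N, d = 8, 4
rng = np.random.default_rng(0)
out = softmax_attention(rng.normal(size=(N, d)),
                        rng.normal(size=(N, d)),
                        rng.normal(size=(N, d)))
print(out.shape)  # (8, 4)
```

Doubling N quadruples the size of `scores`, which is the memory and compute pressure the rest of this article is about.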
Facing this common challenge, the industry has explored three main technical paths.
The first path is sparse attention. Its core idea is to improve efficiency by computing less and computing selectively; DeepSeek is a representative adopter.
The second path is sliding window attention. It still uses Softmax to calculate attention weights, but only focuses on neighboring tokens within a fixed window, thereby improving computational efficiency.
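As an illustration of the second path, the sketch below masks the softmax scores so each position attends only to the last `w` tokens. It is a conceptual sketch: for clarity it still builds the dense score matrix, whereas a real implementation computes only the windowed band, making per-step cost O(w):

```python
import numpy as np

def sliding_window_attention(Q, K, V, w=4):
    """Causal softmax attention restricted to a window of the last w tokens."""
    N, d = Q.shape
    scores = Q @ K.T / np.sqrt(d)        # dense here only for illustration
    i = np.arange(N)[:, None]
    j = np.arange(N)[None, :]
    mask = (j > i) | (j <= i - w)        # future tokens, or older than the window
    scores = np.where(mask, -np.inf, scores)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

N, d = 8, 4
rng = np.random.default_rng(1)
out = sliding_window_attention(rng.normal(size=(N, d)),
                               rng.normal(size=(N, d)),
                               rng.normal(size=(N, d)), w=4)
print(out.shape)  # (8, 4)
```

The trade-off is visible in the mask: anything older than `w` tokens is simply invisible to the current position.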
The third path is linear attention. Unlike the other solutions, it rewrites the softmax formula entirely, reducing complexity from quadratic O(N²) to approximately linear O(N) and significantly cutting inference cost.
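The trick behind linear attention can be sketched in a few lines (non-causal form, with an assumed simple feature map; real designs vary): replacing softmax with a feature map phi lets the products be reassociated as phi(Q) @ (phi(K).T @ V), so the fixed-size (d x d) summary replaces the N x N matrix:

```python
import numpy as np

def linear_attention(Q, K, V):
    # Assumed illustrative feature map (kept positive); papers use e.g. elu(x)+1.
    phi = lambda x: np.maximum(x, 0) + 1e-6
    Qf, Kf = phi(Q), phi(K)
    KV = Kf.T @ V                          # (d, d) summary of all keys/values
    Z = Kf.sum(axis=0)                     # normalizer replacing the softmax denominator
    return (Qf @ KV) / (Qf @ Z)[:, None]   # (N, d) -- no N x N matrix anywhere

N, d = 8, 4
rng = np.random.default_rng(2)
out = linear_attention(rng.normal(size=(N, d)),
                       rng.normal(size=(N, d)),
                       rng.normal(size=(N, d)))
print(out.shape)  # (8, 4)
```

Because `KV` and `Z` can be updated token by token, the causal version runs as a recurrence with constant state, which is why it sidesteps the growing KV cache.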
However, all three paths have their own limitations. Today, the industry's collective shift to the hybrid architecture is essentially a correction of a single technical path.
It is worth noting that more and more solutions are converging on hybrid linear attention, the only path that, in theory, breaks through the sequence-length limitation, because it rebuilds the attention computation paradigm outright. That thoroughness is both its risk and its potential.
02. How did hybrid linear attention become an industry consensus?
In China, many large model companies have started exploring the hybrid linear attention architecture.
Chronologically, MiniMax opened 2025 by releasing the MiniMax-Text-01 model, which adopted a 1:7 hybrid linear attention and was implemented at 456B parameters.
The subsequent MiniMax-M1 model adopted the same architecture. At the time, the MiniMax-M1 team judged that hybrid architectures would become mainstream in model design, but they still faced bottlenecks in infrastructure and elsewhere.
More explorations of hybrid linear attention broke out in the second half of 2025.
In September last year, Alibaba's Tongyi Lab released the next-generation base architecture Qwen3-Next and validated it on an 80B model. It replaces standard attention with a combination of linear attention and gated attention to model long contexts effectively; at a 1:3 hybrid ratio, its performance can exceed that of either architecture alone.
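A hybrid ratio like 1:3 can be pictured as a layer plan (an illustrative sketch, not Qwen3-Next's actual code): the stack interleaves linear-attention layers with full softmax layers at a fixed cadence.

```python
def hybrid_layer_plan(n_layers: int, linear_per_full: int = 3) -> list:
    """Return a per-layer attention type for a hybrid stack.

    A ratio of linear_per_full=3 means one full-attention layer
    after every three linear layers (the '1:3' pattern).
    """
    plan = []
    for i in range(n_layers):
        if (i + 1) % (linear_per_full + 1) == 0:
            plan.append("full")    # periodic full-attention layer
        else:
            plan.append("linear")  # cheap linear-attention layer
    return plan

print(hybrid_layer_plan(8, 3))
# ['linear', 'linear', 'linear', 'full', 'linear', 'linear', 'linear', 'full']
```

The intuition behind such designs is that the sparse full-attention layers preserve precise long-range retrieval, while the majority linear layers keep compute and cache costs near-linear.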
Alibaba's research team found that compared with the commonly used sliding window attention, linear attention has a stronger context learning ability.
Also in September last year, Ant Group's Bailing team open-sourced Ring-mini-linear-2.0 and Ring-flash-linear-2.0, verifying that their self-developed Lightning linear attention works in industrial-scale training and long-context inference.
These two models use a larger share of linear-attention layers, validating a 1:7 hybrid ratio; under a high-FLOP budget, their performance is significantly better than a pure softmax structure's.
In this research, Ant Group's Bailing team further explored the synergy between architectural innovation and infrastructure engineering. The FP8 fused operators they developed raised the computational efficiency of FP8 mixed-precision training to roughly 1.5-1.7x the original level.
On the inference side, they developed a more efficient fused linear-attention operator to further improve engine throughput.
With architecture optimization and high-performance operators working together, the two Ring-linear models cost only about 1/10 as much as same-size dense models in deep-inference scenarios, and more than 50% less than the original Ring series.
In October last year, Moonshot AI open-sourced the hybrid linear attention architecture Kimi Linear. Its core is Kimi Delta Attention (KDA), a new linear-attention module that refines the gated delta rule through fine-grained design. The architecture adopts a 1:3 hybrid ratio, cutting memory usage while surpassing the quality of full-attention models.
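The gated delta rule that KDA builds on can be sketched as a recurrent state update (a generic textbook form; the gating values and per-channel details here are assumptions, not Kimi's implementation). A fixed-size (d x d) state replaces the ever-growing KV cache: each step decays the state, reads out the value currently bound to a key, and corrects it toward the new value:

```python
import numpy as np

def gated_delta_step(S, k, v, alpha=0.95, beta=0.5):
    """One gated delta-rule update. S: (d, d) state; k, v: (d,) key and value."""
    S = alpha * S                            # forget gate decays the whole state
    pred = S.T @ k                           # value currently associated with key k
    S = S + beta * np.outer(k, v - pred)     # delta rule: move prediction toward v
    return S

d = 4
S = np.zeros((d, d))
k = np.eye(d)[0]                             # toy one-hot key
v = np.arange(d, dtype=float)                # target value
for _ in range(50):
    S = gated_delta_step(S, k, v, alpha=1.0, beta=0.5)  # no decay: converges to v
print(np.round(S.T @ k, 3))                  # read-out approaches [0. 1. 2. 3.]
```

The "delta" part is the `v - pred` correction: unlike plain additive linear attention, it overwrites stale associations rather than accumulating them, which is what the gated variants refine.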
Although these explorations have verified the potential of hybrid linear attention along multiple dimensions, most results remain at small and medium scale. In real applications, large models face engineering challenges such as trillions of parameters, million-token context windows, and high-concurrency inference.
Therefore, the key next step is to push these technical explorations to truly ultra-large models and systematically verify their reliability, scalability, and economic value in industrial-grade applications.
03. Trillion-parameter models become the touchstone for the ultimate verification of efficiency and cost
The engineering work of scaling hybrid linear attention to the trillion-parameter level is advancing steadily.
Yang Zhilin, founder and CEO of Moonshot AI, has expressed clear confidence in hybrid linear attention. He believes the linear architecture is well worth exploring, and his team has accumulated substantial research in projects such as Kimi Linear.
In its next-generation model Kimi K3, Moonshot AI plans to introduce more architecture-level optimizations on top of hybrid linear attention. Yang believes that even if Kimi K3 is not 10 times stronger than K2.5, it will certainly be "much stronger".
Ant Group's Bailing team, which bets on the same route, has successively delivered two trillion-parameter models: the very large hybrid linear attention model Ling-2.5-1T, and the world's first trillion-parameter thinking model built on hybrid linear attention, Ring-2.5-1T.
Building on its earlier research, Ant Group's Bailing team constructed the Ling 2.5 architecture through incremental training. It upgrades the GQA + Lightning linear-attention combination to a more efficient MLA + Lightning combination, further compressing the KV cache while retaining the model's expressive power.
The Ling 2.5 architecture adopts a 1:7 hybrid ratio and retains core mechanisms such as QK Norm and partial RoPE, ensuring performance does not degrade during the architecture migration.
On cost and efficiency, Ling-2.5-1T needs an average output of only about 6,000 tokens to complete complex tasks that leading models need 15,000-23,000 tokens to handle. Its memory-access footprint is compressed to 1/10 of a traditional architecture's, and generation throughput is tripled.
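Why a hybrid ratio translates directly into memory savings can be shown with back-of-the-envelope arithmetic (the model dimensions below are assumed for illustration, not Ling's published configuration). Only the full-attention layers keep a KV cache that grows with context length; the linear layers keep a fixed-size state, which this rough estimate ignores:

```python
def kv_cache_gib(n_layers, n_kv_heads, head_dim, seq_len,
                 bytes_per=2, full_every=None):
    """Approximate KV-cache size in GiB. full_every=None means every layer
    is full attention; full_every=8 models a 1:7 hybrid (1 full layer in 8)."""
    full_layers = n_layers if full_every is None else n_layers // full_every
    # K and V vectors, per token, per KV head, per full-attention layer
    total_bytes = 2 * full_layers * n_kv_heads * head_dim * seq_len * bytes_per
    return total_bytes / 2**30

args = dict(n_layers=64, n_kv_heads=8, head_dim=128, seq_len=128_000)
dense = kv_cache_gib(**args)                   # all layers full attention
hybrid = kv_cache_gib(**args, full_every=8)    # 1:7 hybrid ratio
print(f"dense: {dense:.1f} GiB, hybrid: {hybrid:.1f} GiB")  # cache shrinks 8x
```

Under these assumed dimensions the cache drops from roughly 31 GiB to under 4 GiB at a 128K context, which is the kind of memory-access reduction the throughput gains above rest on.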
The above explorations of the hybrid linear attention architecture are not only about performance improvement itself but also about redefining the application boundaries and business forms of large models.
Imagine: once inference cost falls significantly and token efficiency keeps improving, model-call cost may no longer be the core bottleneck to large-scale deployment.
A natural shift in the application paradigm would follow. Enterprises would no longer need to "call the model on demand" sparingly; they could embed it as a default capability across more business processes for broader and deeper efficiency gains.
The role of large models in high - frequency and real - time scenarios may change accordingly. In scenarios such as search, recommendation, and intelligent customer service, they are no longer just supplementary modules of traditional systems. Instead, they are expected to play the role of core driving engines and become underlying infrastructure like databases and operating systems.
04. Conclusion: From parameter stacking to engineering competition, large-scale implementation of large models is getting closer
The exploration of hybrid linear attention is still deepening, but the path is bound not to be smooth sailing. Different technical routes are still being tested against one another: after a phase of exploration, for example, MiniMax chose to return to a full-attention model, prioritizing stability and reliability in complex scenarios.
Yet the deeper signal is getting clearer: large-model competition is shifting from brute-force parameter stacking to the precise accounting of engineering efficiency. As the industry converges on this consensus, what decides the outcome will no longer be scale itself but the effective capability released per unit of compute.
Subtle differences at the architecture level will eventually be magnified into significant cost advantages and experience gaps in enterprise-level deployment, pushing large models from "usable" to "easy to use" and then to true widespread adoption.
This article is from the WeChat public account “Zhidongxi” (ID: zhidxcom), author: Chen Junda. Republished by 36Kr with permission.