
Alibaba, Kimi, and Ant Group are all placing the same bet. Has hybrid attention gone from optional to compulsory?

智东西 2026-03-20 20:38
The concepts are converging while the implementations diverge, and a new consensus on large-model architecture is taking shape.

Yesterday, Xiaomi introduced its large language model MiMo-V2 Pro, once again bringing hybrid attention architectures into the public eye.

This large language model, with billions of parameters, uses a hybrid attention ratio of 1:7. Its performance approaches that of Claude Opus 4.6, while its API price is only about one-fifth of Claude Opus 4.6's.

Xiaomi's work on hybrid attention, in fact, continues a technical consensus that leading Chinese model providers have reached on efficiency optimization. Over the past few months, several of them have shown breakthrough progress on hybrid attention.

In February this year, Ant Group introduced what it calls the world's first trillion-parameter large language model with a hybrid linear attention architecture. In September last year, Alibaba applied hybrid linear attention in its next-generation model architecture, Qwen3-Next. Meanwhile, providers such as Yuezhianmian (Moonshot AI, the company behind Kimi) and MiniMax have folded similar architectural optimizations into their models.

Exploring hybrid attention has become almost a compulsory course for large-model providers. The differences lie only in the choice of technical route; the shared goal is a balance between efficiency and performance.

01. Leading providers bet on hybrid attention, with multiple technical routes advancing in parallel

In deep learning, the attention mechanism lets a model selectively focus on the important parts of its input. To date, softmax attention has been the core of mainstream architectures.

This mechanism "reads" the entire context at every step and precisely captures the relationships between tokens, giving the model strong expressive power and fine-grained control.

The cost, however, is obvious: as text length grows, computational complexity grows quadratically, and a large KV cache must be kept in memory. That is a real disadvantage in commercial settings that increasingly stress inference efficiency and cost control.
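As a minimal illustration (not any provider's implementation), standard scaled dot-product attention materializes an N×N score matrix, which is exactly where the quadratic cost comes from:

```python
import numpy as np

def softmax_attention(q, k, v):
    """Standard scaled dot-product attention over one head.

    q, k, v have shape (N, d). The score matrix is (N, N), so compute
    and memory grow quadratically with N, and at inference time all
    past k/v rows (the KV cache) must be kept around.
    """
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)                    # (N, N): the O(N^2) term
    scores -= scores.max(axis=-1, keepdims=True)     # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # each row sums to 1
    return weights @ v

rng = np.random.default_rng(0)
N, d = 8, 4
q, k, v = (rng.normal(size=(N, d)) for _ in range(3))
out = softmax_attention(q, k, v)
print(out.shape)  # (8, 4)
```

Each output row is a convex combination of the value rows, which is what gives softmax attention its fine-grained expressiveness at the price of the full N×N computation.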

Facing this common challenge, the industry has developed three main technical routes.

The first route is sparse attention. The core idea is to improve efficiency by computing less, and only where it matters. A representative model is DeepSeek.

The second is sliding-window attention. Softmax is still used to compute the attention weights, but each token attends only to its neighbors within a fixed window, which improves computational efficiency.
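A sketch of the idea, assuming a causal local window (the function name and window size are illustrative, and a real kernel would never build the full matrix):

```python
import numpy as np

def sliding_window_attention(q, k, v, window=2):
    """Softmax attention where query i only sees keys i-window..i.

    The per-row work is bounded by the window size, so total cost
    grows linearly with sequence length instead of quadratically.
    """
    N, d = q.shape
    scores = q @ k.T / np.sqrt(d)
    i, j = np.indices((N, N))
    outside = (j > i) | (i - j > window)        # causal + local mask
    scores = np.where(outside, -np.inf, scores)
    scores -= scores.max(axis=-1, keepdims=True)
    w = np.exp(scores)                          # exp(-inf) = 0 outside window
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v

rng = np.random.default_rng(1)
q, k, v = (rng.normal(size=(6, 4)) for _ in range(3))
out = sliding_window_attention(q, k, v)
print(out.shape)  # (6, 4)
```

The trade-off is visible in the mask: anything beyond the window is simply never attended to, which is why long-range dependencies can suffer.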

The third is linear attention. Unlike the other routes, it rewrites the softmax formulation entirely, bringing complexity down from O(N²) to near-linear O(N) and with it a sharp drop in inference cost.
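A toy sketch of the reordering trick behind linear attention (the feature map here is a simple placeholder, not any provider's Lightning or KDA kernel):

```python
import numpy as np

def linear_attention(q, k, v):
    """Causal linear attention via a running state (sketch).

    With a positive feature map phi, softmax(q k^T) v is replaced by
    phi(q_i) S_i / (phi(q_i) z_i), where S_i = sum_{j<=i} phi(k_j) v_j^T.
    The state S has a fixed size (d x d_v), so each step costs O(d * d_v)
    and the whole sequence is O(N) -- no KV cache that grows with N.
    """
    phi = lambda x: np.maximum(x, 0.0) + 1e-3   # placeholder feature map
    N = q.shape[0]
    S = np.zeros((q.shape[1], v.shape[1]))      # running sum of phi(k) v^T
    z = np.zeros(q.shape[1])                    # running normalizer
    out = np.empty_like(v)
    for i in range(N):
        S += np.outer(phi(k[i]), v[i])
        z += phi(k[i])
        out[i] = (phi(q[i]) @ S) / (phi(q[i]) @ z)
    return out

rng = np.random.default_rng(2)
q, k, v = (rng.normal(size=(6, 4)) for _ in range(3))
out = linear_attention(q, k, v)
print(out.shape)  # (6, 4)
```

Because the state is fixed-size, memory no longer grows with context length; the open question, and the reason for hybrids, is whether such a compressed state can match softmax's precision.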

However, each of the three routes has its own limitations, and the industry's current turn toward hybrid architectures is essentially a correction of single-route solutions.

Notably, more and more of these solutions are converging on hybrid linear attention. It is the only route that, in theory, breaks free of the sequence-length limit, because it changes how attention is computed at the root. That radicalness carries both risk and potential.

02. How did hybrid linear attention become the industry consensus?

In China, many large-model providers have already begun exploring hybrid linear attention architectures.

As early as the start of 2025, MiniMax released its MiniMax-Text-01 model, which applies a hybrid linear attention ratio of 1:7 and was validated at 456 billion parameters.

MiniMax subsequently built the MiniMax-M1 model on the same architecture. At the time, the MiniMax-M1 team believed hybrid architectures would become the standard for model development, though they still faced bottlenecks in infrastructure and elsewhere.

In the second half of 2025, exploration of hybrid linear attention surged.

In September last year, Alibaba's Tongyi Lab introduced its next-generation base model architecture, Qwen3-Next, and validated it on an 80-billion-parameter model. The design replaces standard attention with a combination of linear attention and gated attention to handle long contexts efficiently; at a hybrid ratio of 1:3, it outperforms single-route solutions.
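Concretely, a hybrid ratio of 1:3 just means the layer stack interleaves one full-attention layer for every three linear-attention layers. A hypothetical sketch of such a layer plan (names and placement policy are illustrative; real models pick the exact interleaving empirically):

```python
def hybrid_layer_plan(n_layers, linear_per_full=3):
    """Return a layer-type list with one 'full' attention layer after
    every `linear_per_full` linear layers (a 1:3 hybrid ratio by default).
    """
    period = linear_per_full + 1
    return ["full" if (i + 1) % period == 0 else "linear"
            for i in range(n_layers)]

plan = hybrid_layer_plan(8)
print(plan)
# ['linear', 'linear', 'linear', 'full', 'linear', 'linear', 'linear', 'full']
```

The occasional full-attention layer restores precise global recall, while the linear majority keeps memory and compute growth close to O(N).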

Alibaba's researchers found that linear attention learns from context more effectively than the more common sliding-window attention.

In the same month, Ant Group's Bailing team open-sourced Ring-mini-linear-2.0 and Ring-flash-linear-2.0, validating its in-house Lightning linear attention for industrial-scale training and long-context inference.

The two models use a larger share of linear attention layers and validate a hybrid ratio of 1:7; at high FLOP budgets their performance clearly beats a pure softmax structure.

In this work, the Bailing team also dug into the synergy between architectural innovation and infrastructure optimization. The FP8 fused operators it developed raise the computational efficiency of FP8 mixed-precision training by a factor of 1.5 to 1.7.

For inference, it built more efficient fused operators for linear attention to lift the throughput of the inference engine.

Through this combination of architectural optimization and high-performance operators, the two Ring-linear models cost only about one-tenth as much as same-size dense models in deep-reasoning scenarios, and more than 50% less than the original Ring models.

In October last year, Yuezhianmian open-sourced the hybrid linear attention architecture Kimi Linear. Its core is Kimi Delta Attention (KDA), a new linear attention module that refines the gated delta rule. With a hybrid ratio of 1:3, the architecture beats full-attention models on quality while cutting memory requirements.

These explorations have validated the potential of hybrid linear attention from many angles, but most results remain at small or medium scale. In production, large models must cope with engineering challenges such as trillion-parameter scale, million-token context windows, and high-concurrency inference.

The next step, therefore, is to carry these explorations onto genuinely super-scale models and systematically validate their reliability, scalability, and economics in industrial applications.

03. Trillion-parameter models as the testing ground: the final validation of efficiency and cost

The deployment of hybrid linear attention at the trillion-parameter level is steadily advancing.

Yang Zhilin, founder and CEO of Yuezhianmian, has been openly optimistic about hybrid linear attention. He considers the linear architecture a direction well worth researching, and his team has accumulated extensive experience from projects such as Kimi Linear.

For its next model, Kimi K3, Yuezhianmian plans further architectural optimization on top of hybrid linear attention. Yang is convinced that K3 will be "significantly better" than K2.5, if not ten times better.

The Ant Group Bailing team, which has also bet on this route, has already shipped two trillion-parameter large models: Ling-2.5-1T, a giant model built on a hybrid linear attention architecture, and Ring-2.5-1T, which it bills as the world's first trillion-parameter large language model with a hybrid linear attention architecture.

Building on its earlier studies, the Bailing team arrived at the Ling 2.5 architecture through continued training. It upgrades the GQA + Lightning Linear combination to a more efficient MLA + Lightning Linear combination, further shrinking the KV cache while preserving the model's expressive power.

The Ling 2.5 architecture uses a hybrid ratio of 1:7 and retains core mechanisms such as QK Norm and partial RoPE to ensure that model performance does not degrade during the architecture migration.
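QK Norm, for reference, normalizes queries and keys before the dot product so attention logits stay bounded; this is a generic formulation, and the exact variant used in Ling 2.5 may differ:

```python
import numpy as np

def qk_norm(q, k, eps=1e-6):
    """L2-normalize each query and key vector before scoring (QK Norm).

    After normalization every dot product lies in [-1, 1], which keeps
    attention logits bounded and training numerically stable.
    """
    qn = q / (np.linalg.norm(q, axis=-1, keepdims=True) + eps)
    kn = k / (np.linalg.norm(k, axis=-1, keepdims=True) + eps)
    return qn, kn

q = np.array([[3.0, 4.0]])
k = np.array([[0.0, 2.0]])
qn, kn = qk_norm(q, k)
print(qn @ kn.T)  # dot products now lie in [-1, 1]
```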

On cost and efficiency, Ling-2.5-1T needs an average output of only about 6,000 tokens to complete complex tasks that take leading models 15,000 to 23,000 tokens. Its memory requirement drops to one-tenth that of the traditional architecture, and generation throughput triples.

All of these explorations of hybrid linear attention mean not just better performance but a redrawing of the application boundaries and business models of large language models.

If inference costs fall sharply and token-utilization efficiency keeps improving, the cost of model calls may no longer be the central bottleneck to wide deployment of large models.

That naturally changes the usage model: enterprises no longer need to ration calls and can instead build the model into more business processes as a standard capability, achieving broader and deeper efficiency gains.

The role of large models in high-frequency, real-time scenarios could change as well. In search, recommendation, and intelligent customer support, they would no longer be mere supplements to traditional systems but the core driving force, eventually becoming basic infrastructure like databases and operating systems.

04. Conclusion: From parameter stacking to engineering refinement, the wide deployment of large models draws near

Exploration of hybrid linear attention keeps deepening, but the road will not be smooth, and different technical routes will continue to compete and be tested. After an early exploration, for instance, MiniMax chose to return to full-attention models to ensure stability and reliability in complex scenarios.

A deeper signal, though, is becoming clear: competition among large models is shifting from blindly stacking parameters to precisely accounting for engineering efficiency. As this consensus takes hold, what matters will no longer be sheer model size but the effective performance squeezed out of each unit of compute.

Small architectural differences will ultimately translate into significant cost advantages and experience gaps in enterprise deployment, taking large models from "usable" to "good to use", and finally to wide adoption.

This article is from the WeChat account ...