
Xiaomi's large model fights its way into the top tier: No. 1 open-source in coding, with IQ and EQ both on point

量子位 | 2025-12-18 10:50
A million output tokens for just 2.1 yuan

Another Chinese model has quietly slipped into the top tier of open-source models.

This time it is neither DeepSeek nor Qwen, but Xiaomi's recently announced open-source model MiMo-V2-Flash.

With only 30.9 billion parameters, the model packs exceptional capability into a small footprint and has posted impressive results on several well-known comprehensive benchmarks.

Beyond the strong scores, it also delivers up to a 2.6x inference speed-up while maintaining first-class model performance at minimal deployment cost.

At Xiaomi's recently held partner conference for its "Human, Car, Home" full ecosystem, the company positioned the model as the "new language foundation for the agent era".

The model has also drawn attention abroad. A user on X commented that MiMo-V2-Flash would make agents even more practical.

Some users even asked for a GGUF release so they can run the model more easily in their own local frameworks.

The technical report also reveals a set of key techniques behind MiMo-V2-Flash:

  • 5:1 mixed attention mechanism: sliding-window attention (SWA) combined with global attention;
  • Learnable attention-sink bias: addresses the semantic discontinuity caused by local windows;
  • MTP (Multi-Token Prediction): predicts several future tokens at once, enabling up to a 2.6x inference speed-up;
  • MOPD (Multi-Teacher Online Policy Distillation): lets the student match the teacher models' performance with a fraction of the training cost.

Let's take a closer look.

A "Team of Tutors" for the Student Model

MiMo-V2-Flash uses an MoE architecture with 30.9 billion total parameters and 256 experts. Compared with trillion-parameter models, or with open-source models carrying twice its parameter count, it is a David among Goliaths.

Through dynamic activation, only 8 experts fire per token, corresponding to roughly 1.5 billion active parameters. Its inference cost is only about 2.5% of that of the proprietary competitor Claude 4.5 Sonnet.
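To make the "only a few experts fire per token" idea concrete, here is a minimal sketch of top-k expert routing in Python. The 256-expert / 8-active split comes from the report; the model dimension, the random weights and the function names are illustrative assumptions, not Xiaomi's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def moe_forward(x, experts_w, router_w, top_k=8):
    """Route one token through a top-k MoE layer (illustrative sketch).

    x: (d_model,) token representation
    experts_w: (num_experts, d_model, d_model) expert weight matrices
    router_w: (num_experts, d_model) router weights
    """
    logits = router_w @ x                      # score every expert for this token
    top = np.argsort(logits)[-top_k:]          # keep only the top-k experts
    gates = np.exp(logits[top] - logits[top].max())
    gates /= gates.sum()                       # softmax over the selected experts
    # only the selected experts run; the other 248 stay idle for this token
    return sum(g * (experts_w[e] @ x) for g, e in zip(gates, top))

d_model, num_experts = 64, 256                 # 256 experts as in the report; d_model is made up
x = rng.standard_normal(d_model)
experts = rng.standard_normal((num_experts, d_model, d_model)) * 0.02
router = rng.standard_normal((num_experts, d_model)) * 0.02
y = moe_forward(x, experts, router, top_k=8)   # 8 of 256 experts are active per token
print(y.shape)                                 # (64,)
```

Because the compute per token scales with the 8 active experts rather than all 256, the serving cost tracks the active parameter count, not the total.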

When processing long texts, MiMo-V2-Flash uses a 5:1 mix of sliding-window attention (SWA) and full attention.

SWA restricts each token's attention field to a narrow local window, much like reading with only the text directly in front of you in view. This cuts the complexity of the attention computation from quadratic in the full sequence length to linear.

This is a different route from DeepSeek's: DeepSeek has bet on sparse attention, while MiMo-V2-Flash has chosen the linear-complexity sliding-window route.
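The complexity argument is easy to see with a toy mask. The sketch below builds a full causal mask and a sliding-window mask and counts how many keys each query may attend to; the window size, sequence length and the exact layer interleave are illustrative assumptions, not the actual MiMo-V2-Flash configuration.

```python
import numpy as np

def causal_mask(seq_len):
    """Full causal attention: every token may look at all earlier tokens."""
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))

def sliding_window_mask(seq_len, window):
    """Sliding-window attention: each token only sees the last `window` tokens."""
    idx = np.arange(seq_len)
    dist = idx[:, None] - idx[None, :]
    return (dist >= 0) & (dist < window)

# Illustrative 5:1 interleave: five local layers for every global layer.
layer_pattern = ["swa"] * 5 + ["full"]

seq_len, window = 16, 4
swa = sliding_window_mask(seq_len, window)
print(swa.sum(axis=1))              # each row attends to at most `window` keys -> linear total cost
print(causal_mask(seq_len).sum())   # full attention grows quadratically with seq_len
```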

However, while SWA boosts efficiency, it can cause semantic discontinuity and a loss of the global picture in long texts. To address this, MiMo-V2-Flash introduces a learnable attention-sink bias.

Concretely, a learnable additive term is introduced into the denominator of the softmax normalization, so that when nothing in the local window matches, the attention mechanism can "divert" the surplus weight onto a virtual anchor point.

The design works like holding onto a "logical anchor" while skimming a long text: even with a very small sliding window, the model keeps its grip on the text as a whole.
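The report's description boils down to adding one extra, learnable logit to the softmax. A minimal sketch, assuming the sink is a single scalar bias and ignoring everything else about the attention layer:

```python
import numpy as np

def softmax_with_sink(scores, sink_bias):
    """Softmax over attention scores with one extra, learnable 'sink' logit.

    scores: (n_keys,) attention logits inside the local window
    sink_bias: scalar learnable parameter (a fixed number in this sketch)

    The sink adds exp(sink_bias) to the softmax denominator; when no key in the
    window matches well, most of the probability mass flows to the sink instead
    of being forced onto irrelevant local tokens.
    """
    logits = np.concatenate([scores, [sink_bias]])
    logits -= logits.max()                      # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return probs[:-1], probs[-1]                # weights on real keys, weight absorbed by the sink

weak_scores = np.array([0.1, -0.2, 0.0])        # nothing in the window really matches
strong_scores = np.array([4.0, -0.2, 0.0])      # one key matches strongly

for s in (weak_scores, strong_scores):
    w, sink = softmax_with_sink(s, sink_bias=1.0)
    print(np.round(w, 3), "sink:", round(float(sink), 3))
```

With weak scores most of the mass drains into the sink; with one strong match the matching key keeps almost all of it, so local tokens are never forced to absorb attention they do not deserve.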

With this architecture, MiMo-V2-Flash cuts the memory footprint of the KV cache to one-sixth while actually improving long-text understanding.
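The one-sixth figure also falls out of simple arithmetic under the 5:1 mix, assuming per-layer KV storage is otherwise identical and the window is much shorter than the context (all numbers below are illustrative, not MiMo-V2-Flash's settings):

```python
# Rough KV-cache comparison: 5 of every 6 layers use a short sliding window,
# 1 keeps full attention.  Numbers are illustrative.
context_len = 256_000      # long-context length in tokens
window = 4_096             # sliding-window size, much smaller than the context

full_stack = 6 * context_len            # cache if all 6 layers in a group were global
mixed_stack = 5 * window + context_len  # 5 local layers cache only the window

print(mixed_stack / full_stack)         # approaches 1/6 as context_len >> window
```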

Attentive readers also noticed in the technical report that Xiaomi's sliding window is only 128K tokens, yet it beats results obtained with 512K windows.

To speed up inference, MTP (Multi-Token Prediction) was introduced. During inference the MTP module doubles as a draft model for speculative decoding, predicting several future tokens in parallel and trading extra compute for relief from the memory-bandwidth bottleneck.

Put simply, where a conventional model emits one token at a time, MTP can propose several tokens at once, and the main model only has to verify the drafts in parallel.

This "single prediction and parallel review" mechanism significantly improves the efficiency of inference. After loading 3 MTP modules, a 2 - to 2.6 - fold acceleration of the actual inference can be achieved.

For training, a new paradigm called MOPD (Multi-Teacher Online Policy Distillation) was used.

The method assembles a set of domain-specific teacher models and uses the reverse KL divergence to give the student dense, token-level reward signals, which effectively resolves the sparse-reward and training-instability problems of conventional RL.

It is like hiring a team of star tutors for the student model: the tutors grade and correct every step in real time, so roughly one-fiftieth of the training budget of a traditional SFT + RL pipeline is enough to match or even exceed the teachers' performance.
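The core of that dense signal is a per-token reverse KL between student and teacher. A minimal sketch, assuming the teacher is simply picked per domain and ignoring the on-policy sampling loop; the tensor shapes and teacher names are illustrative:

```python
import numpy as np

def reverse_kl_per_token(student_logits, teacher_logits):
    """Dense, per-token reverse KL: KL(student || teacher).

    student_logits, teacher_logits: (seq_len, vocab) logits over the same
    sampled sequence.  Returns one value per token instead of a single sparse
    reward at the end of the trajectory.
    """
    def log_softmax(x):
        x = x - x.max(axis=-1, keepdims=True)
        return x - np.log(np.exp(x).sum(axis=-1, keepdims=True))

    log_p = log_softmax(student_logits)   # student distribution
    log_q = log_softmax(teacher_logits)   # teacher distribution
    p = np.exp(log_p)
    return (p * (log_p - log_q)).sum(axis=-1)   # (seq_len,)

rng = np.random.default_rng(0)
seq_len, vocab = 8, 50
student = rng.standard_normal((seq_len, vocab))
# Hypothetical "team of tutors": one teacher per domain, chosen by the prompt.
teachers = {"code": rng.standard_normal((seq_len, vocab)),
            "math": rng.standard_normal((seq_len, vocab))}
loss_per_token = reverse_kl_per_token(student, teachers["code"])
print(loss_per_token.shape, float(loss_per_token.mean()))
```

Minimizing this quantity per token pulls the student toward the teacher at every step of the generation, which is why the signal is so much denser than an end-of-episode reward.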

Thanks to these end-to-end optimizations, the model strikes a sweet spot between compute and memory utilization and occupies a standout, high-efficiency position in the "price vs. speed" chart published by the company.

This aggressive engineering translates directly into cost advantages: the API is priced at just 0.7 yuan per million input tokens and 2.1 yuan per million output tokens. Xiaomi has thereby pushed high-performance models from "luxury good" down to "daily necessity" prices.

High IQ and EQ: Coding Skills and Emotional Intelligence

According to the technical report, MiMo-V2-Flash performs extremely well: 86.2 on Arena-Hard, which measures general capability, and 84.9 on the complex-reasoning benchmark MMLU-Pro.

These headline numbers put it in the top tier of open-source models, keeping pace with the best.

Programming is its strongest suit: it reached 73.4% on SWE-Bench Verified, ahead of DeepSeek-V3.2 (73.1%) and Kimi-K2 Thinking (71.3%).

The model also shows strong generalization and robustness as an agent: it solves 71.7% of SWE-Bench Multilingual and scores 80.3 on Tau2-Bench, which measures tool use. Both figures are among the best in the world for open-source models.

The company's published numbers are impressive, but how does MiMo-V2-Flash hold up in practice? We ran our own tests.

Let's start with the coding ability Xiaomi promotes most. In concrete engineering scenarios, MiMo-V2-Flash produced very complete results.

For example, when asked to build a macOS-style web interface in front-end code, it generated a complete code framework in one go.

The prompt was as follows:

We opened the "File Manager", created and edited files, and then returned to the original directory. The files were still there and had the same content as when they were created.

When we viewed the files via the command line, the content was also unchanged.