A Detailed Look at Kimi K2 Thinking: Thank You, DeepSeek. Now I'll Go Beat GPT-5
“Is this another glorious moment in the DeepSeek style? Open-source software has once again outperformed closed-source software.”
On November 6, 2025, Hugging Face co-founder Thomas Wolf posted this on X, neatly capturing the discussion set off by the release of the Kimi K2 Thinking model.
Kimi K2 Thinking has posted remarkable results across multiple benchmarks, matching or even surpassing SOTA closed-source models. On the text-only subset of HLE (Humanity's Last Exam), for example, its tool-augmented configuration scored 44.9%, ahead of GPT-5's 41.7%.
Kimi K2 Thinking is trained on top of the Kimi K2 model, with a focus on improving agentic and reasoning ability. It is a Mixture-of-Experts (MoE) model with 1 trillion total parameters, of which roughly 32 billion are activated per inference step. It supports a 256k context window and uses native INT4 quantization. The design goal is to keep compute and training costs under control while preserving a very large model scale. According to a CNBC report citing people familiar with the matter, the model cost only $4.6 million to train. For comparison, DeepSeek's reported figures are $5.6 million for V3 (GPU rental price for the formal training run) and $294,000 for R1. These numbers mainly cover GPU pre-training cost and exclude spending on R&D, infrastructure, and so on.
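To put these scale figures in perspective, here is a rough back-of-envelope calculation (a sketch using only the numbers quoted above; the bytes-per-parameter values are generic properties of each format, not Kimi-specific disclosures) of why native INT4 matters for serving a 1-trillion-parameter MoE:

```python
# Back-of-envelope sizing for a 1T-parameter MoE with ~32B active parameters.
# Bytes-per-parameter are generic to each format; a real deployment also needs
# memory for activations, the KV cache, and framework overhead.

TOTAL_PARAMS = 1.0e12   # total parameters across all experts
ACTIVE_PARAMS = 32e9    # parameters activated per token

BYTES_PER_PARAM = {"FP16/BF16": 2.0, "INT8": 1.0, "INT4": 0.5}

for fmt, nbytes in BYTES_PER_PARAM.items():
    weight_gb = TOTAL_PARAMS * nbytes / 1e9
    print(f"{fmt:>9}: ~{weight_gb:,.0f} GB for the weights alone")

print(f"Active fraction per token: {ACTIVE_PARAMS / TOTAL_PARAMS:.1%}")
```

At INT4 the weights alone shrink from roughly 2 TB to about 500 GB, while only about 3% of the parameters do work on any given token, which is the combination that keeps serving costs manageable at this scale.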
A core feature of Kimi K2 Thinking is its agentic capability: according to the official material, it can execute 200 to 300 consecutive tool calls to work through complex problems. Closed-source models such as Grok-4 already rely heavily on RL to improve tool use and long-horizon planning, but this is the first time such behavior has shown up in an open-source model. It signals that the open-source community is rapidly closing in on the frontier of agent technology, and it also raises the bar for model hosting services.
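Kimi has not published its agent scaffolding, so the following is only a minimal sketch of what a long-horizon tool-calling loop generally looks like; the `call_model` and `run_tool` functions and the message format are hypothetical placeholders, not Kimi's actual API.

```python
# Minimal sketch of a long-horizon tool-calling loop (hypothetical interfaces).
# `call_model(messages)` returns a dict with "content" and an optional "tool_call";
# `run_tool(tool_call)` executes a search, code run, etc. Neither is Kimi's real API.

MAX_STEPS = 300  # the order of magnitude claimed for K2 Thinking

def agent_loop(task: str, call_model, run_tool) -> str:
    messages = [{"role": "user", "content": task}]
    for _ in range(MAX_STEPS):
        reply = call_model(messages)            # model reasons, may request a tool
        if reply.get("tool_call") is None:      # no tool requested -> final answer
            return reply["content"]
        try:
            result = run_tool(reply["tool_call"])
        except Exception as exc:                # robustness: feed failures back in
            result = f"tool error: {exc}"
        messages.append({"role": "assistant", "content": reply["content"],
                         "tool_call": reply["tool_call"]})
        messages.append({"role": "tool", "content": str(result)})
    return "step budget exhausted"
```

At 200 to 300 steps the loop itself is trivial; the hard parts are keeping the growing transcript inside the 256k context window and recovering cleanly from failed calls, which is where the engineering effort concentrates.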
As of now, Kimi K2 Thinking has no technical report, only a technical blog and usage documentation; its training data, RL details, and training recipe have not been disclosed. Shortly after the release, discussion of the model architecture itself began in the technical community: an architecture diagram placing it side by side with the DeepSeek model started circulating on X and Reddit, sparking debate about its technical lineage.
With DeepSeek's R2 long delayed and eagerly awaited by the community, Kimi has shipped an open-source SOTA reasoning model built on an inherited architecture, leaving many with the feeling that Kimi has released R2 on DeepSeek's behalf.
"Inheritance" of Architecture and "Magic" of Engineering
LLM research engineer Sebastian Raschka conducted a detailed analysis. In a thread, he laid out the specific similarities and differences between the two:
• Roughly 1.5x more experts per MoE layer (384 vs. 256)
• A larger vocabulary (160k vs. 129k)
• About 32 billion parameters activated per token in K2 (vs. 37 billion in DeepSeek R1)
• Fewer dense FFN blocks before the MoE layers
"In short, Kimi K2 is essentially a slightly adjusted version of DeepSeek V3/R1 in terms of scale. Its improvements are mainly reflected in data and training recipes."
Raschka's analysis points to a key fact: Kimi K2 Thinking's "inheritance" of DeepSeek's core architecture is obvious, including the MoE design and MLA (Multi-Head Latent Attention). On top of that proven foundation, the team made targeted adjustments for its own goals: reducing the number of attention heads and the activated parameter count lowers inference cost, while enlarging the expert count and the vocabulary increases the model's knowledge capacity and expressive power. This "standing on the shoulders of giants" approach is the open-source spirit in its most direct form.
Beyond inheriting the DeepSeek architecture, Kimi K2 Thinking's results also rest on extensive "adaptation" of work from the wider open-source ecosystem: FlashAttention at the bottom of the stack to accelerate attention computation, the MuonClip optimizer described in the K2 technical report, adapted to address training instability, and a range of data-processing and post-training methodologies, all of which draw on the collective output of the open-source community.
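As one concrete example of what adapting these open-source building blocks looks like in practice, FlashAttention-style fused kernels are today usually reached through a standard interface such as PyTorch's `scaled_dot_product_attention`; this is a generic illustration, not Kimi's training code:

```python
import torch
import torch.nn.functional as F

# Generic illustration: F.scaled_dot_product_attention dispatches to a fused
# FlashAttention-style kernel when hardware, dtypes, and shapes allow it,
# avoiding materialization of the full attention matrix.

device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.bfloat16 if device == "cuda" else torch.float32

batch, heads, seq, head_dim = 1, 8, 1024, 128
q = torch.randn(batch, heads, seq, head_dim, dtype=dtype, device=device)
k = torch.randn_like(q)
v = torch.randn_like(q)

out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)  # torch.Size([1, 8, 1024, 128])
```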
If the architecture and open-source technology form the model's skeleton, what puts flesh on it is Moonshot AI's own engineering execution. This shows up mainly in three areas:
1. Training stability: Across pre-training on as many as 15.5 trillion tokens, Kimi K2 Thinking achieved "zero loss spikes". The training run was stable throughout, with no need for costly rollbacks after a training collapse. For a model at the trillion-parameter scale, this is a significant engineering achievement.
2. Native quantized inference: Kimi K2 Thinking supports native INT4 inference, reportedly roughly doubling inference speed with minimal performance loss while sharply cutting the GPU memory needed for deployment. This is key to moving very large models out of the lab and into broad use.
3. Long-horizon task execution: The model can stably run 200 to 300 rounds of tool calls, which tests not only its reasoning but also the robustness of the surrounding system: across hundreds of interaction steps it has to handle all kinds of exceptions, which demands sophisticated engineering behind the scenes.
The Kimi team's specific choices in selecting and integrating these open-source technologies, together with its engineering team's execution, are what underpin Kimi K2 Thinking's current results. The technical route and the pattern of its success remind many people of the moment R1 was released: it inherits DeepSeek's efficient MLA + MoE architecture and the "verifiable tasks first" orientation in data and rewards, and it locks in those capabilities through engineering (MuonClip, long context, toolchains). The difference is that K2 Thinking's open form and goals lean more toward application delivery.
Trade-Offs Beyond SOTA
A full assessment of Kimi K2 Thinking cannot stop at benchmark scores; where those scores come from is an unavoidable question. Many of the SOTA numbers in its technical blog were obtained in a special "Heavy" mode. According to the official description on Hugging Face, this mode runs up to 8 inferences in parallel and then reflectively aggregates all the outputs into a final result. The approach is common in academia and in model competitions: at the Grok 4 launch event on July 9 this year, xAI announced that Grok 4 Heavy scored 44.4% on HLE and 50.7% on the text-only subset.
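Kimi has not published the exact aggregation procedure; the sketch below only illustrates the general pattern described on the Hugging Face card (several independent runs, followed by a reflective aggregation pass). `call_model` and the aggregation prompt are hypothetical.

```python
# Generic "Heavy mode" pattern: sample several trajectories in parallel, then
# have the model reflectively merge them into one answer. `call_model(prompt)`
# is a hypothetical single-call interface, not Kimi's real API.

from concurrent.futures import ThreadPoolExecutor

def heavy_mode(question: str, call_model, n_runs: int = 8) -> str:
    with ThreadPoolExecutor(max_workers=n_runs) as pool:
        drafts = list(pool.map(lambda _: call_model(question), range(n_runs)))

    aggregation_prompt = (
        f"Question: {question}\n\nIndependent attempts:\n"
        + "\n---\n".join(drafts)
        + "\n\nReflect on where they agree and where they err, then give one final answer."
    )
    return call_model(aggregation_prompt)
```

Each question costs at least n_runs + 1 full model calls, which is exactly where the reproducibility concerns discussed next come from.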
The Heavy mode also brings problems. First, it is extremely resource-intensive, and ordinary users have almost no way to reproduce the performance through the API or a local deployment. Second, it opens a gap between benchmark scores and the real capability of a single model instance: the standard mode users actually get is not the "beast mode" on the leaderboard.
The pursuit of efficiency also shows up in low-level engineering decisions, which frequently trade capability for cost. For example, although the team says the model's native INT4 quantization causes minimal performance loss, compressing precision from FP16 down to INT4 is a substantial step. The quantized model may hold up well on standard evaluation sets, but whether accumulated precision loss hurts task success rates over longer, more complex reasoning chains still has to be tested by broader real-world use.
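Kimi has not published the details of its INT4 scheme; as a generic illustration of where the precision goes, here is a minimal sketch of symmetric, group-wise, weight-only INT4 quantization (the group size and the round-trip error measurement are illustrative choices, not Kimi's):

```python
import numpy as np

# Generic weight-only INT4 quantization with per-group symmetric scales.
# Illustrates the precision trade-off; this is not Kimi's actual scheme.

def quantize_int4(weights: np.ndarray, group_size: int = 32):
    w = weights.reshape(-1, group_size)
    scale = np.abs(w).max(axis=1, keepdims=True) / 7.0   # signed int4 range: -8..7
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize_int4(q: np.ndarray, scale: np.ndarray, shape) -> np.ndarray:
    return (q.astype(np.float32) * scale).reshape(shape)

rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, size=(1024, 1024)).astype(np.float32)

q, scale = quantize_int4(w)
w_hat = dequantize_int4(q, scale, w.shape)
rel_err = np.linalg.norm(w - w_hat) / np.linalg.norm(w)
print(f"relative weight error after an INT4 round-trip: {rel_err:.3%}")
```

Per-weight errors of this size are usually invisible on short tasks; the open question raised above is whether they compound over hundreds of reasoning and tool-calling steps.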
Likewise, cutting the number of attention heads from 128 to 64 was a deliberate choice by the Kimi team to reduce memory bandwidth and compute overhead. Yet the K2 technical report itself acknowledges that more attention heads usually bring better model quality. In other words, Kimi K2 gives up some model capability in exchange for higher inference efficiency.
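A rough sense of that trade-off (generic numbers for illustration; MLA's latent projections change the exact accounting and are ignored here): at a fixed head dimension, the attention score and value-mixing work per token scales linearly with the number of heads.

```python
# Per-token attention compute at a fixed head dimension (generic estimate).
# QK^T and attention-weighted V each cost ~2 * seq_len * head_dim FLOPs per head.

head_dim = 128
seq_len = 4096  # illustrative context length

for n_heads in (128, 64):
    attn_flops_per_token = 4 * n_heads * head_dim * seq_len
    print(f"{n_heads} heads: ~{attn_flops_per_token / 1e6:.0f} MFLOPs of attention per token")
```

Halving the head count roughly halves this term (and the associated memory traffic), which is the efficiency gain the quality trade-off buys.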
Kimi K2 Thinking's bet on agentic capability also brings limitations along other dimensions. The official benchmarks show K2 Thinking ahead of OpenAI's and Anthropic's top models (GPT-5 and Sonnet 4.5 Thinking) on the "agentic reasoning" and "agentic search" metrics, but it has not yet reached the top in coding ability.
At a time when multimodality has become standard for frontier models, Kimi K2 Thinking remains a text-only model. The gap is most visible on tasks involving visual or spatial reasoning: asked to generate an SVG of a "pelican riding a bicycle", for instance, a text-only model can stumble because it lacks grounded visual understanding of the physical world:
SVG generated by Kimi K2 Thinking
The release of Kimi K2 Thinking feels like another collective celebration for the open-source AI community. It stands on DeepSeek's excellent open-source work, picks the performance goals that matter most at this stage, refines the details, and improves training efficiency, yielding a new open-source model that can beat the strongest closed-source models in today's most important directions. The model in turn feeds results and inspiration back into the open-source community. It is also a piece of the puzzle for Kimi's next-generation, larger and more complete model. Perhaps the next DeepSeek moment is not far off, and it may not be DeepSeek itself that delivers it.
This article is from the WeChat official account "Silicon Star People Pro", author: Zhou Yixiao. It is published by 36Kr with authorization.