Kimi K2 Thinking launches unannounced, with agentic and reasoning abilities that surpass GPT-5. Netizens: the gap between open-source and closed-source models has narrowed further.
Kimi K2 Thinking is now released and open-sourced!
It embodies the concept of "model as agent". Not only is it "Kimi's most capable open-source thinking model to date", it has also mastered the ability to think while using tools:
Without human intervention, it can execute 200–300 consecutive tool calls.
As one of the most closely watched open-source model series this year, Kimi K2's Thinking version became a hot topic as soon as it launched: it has once again narrowed the gap between open-source and closed-source models.
A quick overview of the technical details:
1T total parameters with 32B activated, using INT4 instead of FP8.
A 256K context window.
More experts, fewer heads, more thinking.
On evaluation benchmarks such as Humanity's Last Exam (HLE), BrowseComp (which tests autonomous web browsing), and SEAL-0 (a benchmark for complex information gathering and reasoning), Kimi K2 Thinking set new SOTA results, surpassing closed-source models such as GPT-5 and Claude Sonnet 4.5 (Thinking).
Both the code and weights of Kimi K2 Thinking are released under the permissive MIT license. The new model is already live on kimi.com and the latest version of the Kimi mobile app, so you can try it right away; the API is available through the Kimi Open Platform.
Technical Details
Officially, K2 Thinking represents Moonshot AI's latest progress in test-time scaling: by simultaneously scaling up thinking tokens and the number of tool-call rounds, the model achieves stronger agentic and reasoning performance.
Comprehensive Improvement of Agent and Reasoning Abilities
In testing on Humanity's Last Exam (HLE), under the same conditions that allow tools such as search, Python, and web browsing, Kimi K2 Thinking achieved a SOTA score of 44.9%.
Moonshot also released an example in which K2 Thinking solved a PhD-level math problem through 23 rounds of reasoning and tool calls.
Third-party tests also confirm the improvement in its agentic ability:
Artificial Analysis evaluated Kimi K2 Thinking on the τ²-Bench Telecom agentic tool-use benchmark.
The results show Kimi K2 Thinking reaching SOTA, a big step forward in agentic scenarios over the previously well-received K2 Instruct (from 73% to 93%).
Comprehensive Improvement of Autonomous Search and Browsing Abilities
Kimi K2 Thinking also excels in complex search and browsing scenarios.
On BrowseComp, where human testers average 29.2%, Kimi K2 Thinking demonstrated deep exploration ability and became the new SOTA model with a score of 60.2%.
Driven by long-horizon planning and autonomous search, Kimi K2 Thinking can run a dynamic "think → search → browse → think → program" cycle for up to hundreds of rounds, continuously proposing and refining hypotheses, verifying evidence, reasoning, and constructing logically consistent answers.
This ability to search actively while thinking continuously lets Kimi K2 Thinking decompose vague, open-ended questions into clear, executable subtasks.
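The "think while using tools" loop described above can be sketched as a simple controller that alternates model decisions with tool calls under a call budget. This is an illustrative toy, not Kimi's actual agent scaffold; the tool names (`search`, `browse`, `python`) and the planner are hypothetical stand-ins.

```python
# Minimal sketch of an interleaved "think -> act" agent loop with stubbed
# tools standing in for search / browsing / code execution (all hypothetical).

def search(query):           # stub: a real agent would call a search API
    return f"results for: {query}"

def browse(url):             # stub: a real agent would fetch the page
    return f"content of {url}"

def run_python(code):        # stub: a real agent would sandbox-execute code
    return eval(code, {"__builtins__": {}}, {})

TOOLS = {"search": search, "browse": browse, "python": run_python}

def agent_loop(task, plan_step, max_calls=300):
    """Alternate model 'thinking' (plan_step) with tool calls until the
    planner emits a final answer or the tool-call budget runs out."""
    history = [("task", task)]
    for _ in range(max_calls):
        action = plan_step(history)          # model decides the next step
        if action["tool"] == "final":
            return action["arg"], len(history) - 1
        result = TOOLS[action["tool"]](action["arg"])
        history.append((action["tool"], result))
    return None, max_calls

# Toy planner: search once, compute once, then answer with the last result.
def toy_planner(history):
    last = history[-1][0]
    if last == "task":
        return {"tool": "search", "arg": "6 * 7"}
    if last == "search":
        return {"tool": "python", "arg": "6 * 7"}
    return {"tool": "final", "arg": history[-1][1]}

answer, calls = agent_loop("what is 6 * 7?", toy_planner)
```

The key property mirrored here is that the planner sees the full tool history each round, so "thinking" and tool use genuinely interleave rather than running as a fixed pipeline.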
Enhanced Agentic Programming Ability
On programming benchmarks such as SWE-bench Multilingual, SWE-bench Verified, and LiveCodeBench, Kimi K2 Thinking also competes with the strongest closed-source models such as GPT-5 and Claude Sonnet 4.5.
Moonshot notes that Kimi K2 Thinking performs markedly better on HTML, React, and component-heavy front-end tasks, and can turn ideas into fully functional, responsive products.
In agentic coding scenarios, Kimi K2 Thinking can think while invoking various tools, integrate flexibly into software agents, and handle more complex, multi-step development workflows.
For example, replicating a real, usable Word-style text editor.
Or creating an elaborate piece of voxel art:
Upgraded General Capabilities
Beyond the main focus on agentic and reasoning abilities, Kimi K2 Thinking's general capabilities have also been upgraded.
Creative Writing: Kimi K2 Thinking's writing has improved significantly. It can turn rough inspiration into clear, moving, purposeful narratives that are both rhythmic and profound. It handles subtle differences in style and loosely specified structure, and maintains stylistic consistency across long-form texts. In creative writing, its imagery is more vivid and its emotional resonance stronger, combining precise expression with rich expressiveness.
Academic and Research: In academic and professional domains, Kimi K2 Thinking has improved markedly in analytical depth, factual accuracy, and logical structure. It can parse complex instructions and develop ideas clearly and rigorously, making it especially good at academic papers, technical abstracts, and long-form reports that demand high information integrity and reasoning quality.
Personal and Emotional: When responding to personal or emotional questions, Kimi K2 Thinking's answers are more empathetic and more balanced. It thinks more deeply and clearly, offers detailed perspectives and practical follow-up suggestions, and has a more human touch.
Native INT4 Quantization
Notably, Kimi K2 Thinking uses INT4 rather than FP8 precision.
The official explanation: thinking models produce extremely long decoding sequences, and conventional quantization methods often cause a significant drop in model performance on them. To overcome this, the team adopted quantization-aware training (QAT) in the post-training stage, applying INT4 weight-only quantization to the MoE components.
This lets Kimi K2 Thinking run native INT4 inference on complex reasoning and agentic tasks while roughly doubling generation speed.
Moreover, INT4 is better supported by inference hardware and friendlier to domestic accelerator chips.
p.s. NVIDIA GPUs before Blackwell do not support FP4.
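For intuition, here is what INT4 weight-only quantization does to a row of weights: a minimal pure-Python sketch of symmetric per-channel quantization. This is an illustration of the general technique, not Moonshot's QAT pipeline; real kernels also pack two 4-bit codes per byte and fuse dequantization into the matmul.

```python
# Symmetric INT4 weight-only quantization, per output channel (illustrative).
# Each row of weights is mapped to 4-bit integer codes in [-8, 7] plus one
# float scale; at inference time, codes are dequantized back to floats.

def quantize_int4(row):
    """Map a row of float weights to int4 codes in [-8, 7] and one scale."""
    scale = max(abs(w) for w in row) / 7.0 or 1.0   # guard all-zero rows
    q = [max(-8, min(7, round(w / scale))) for w in row]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from int4 codes."""
    return [v * scale for v in q]

weights = [0.7, -0.3, 0.1, 0.0]
q, scale = quantize_int4(weights)
recon = dequantize(q, scale)          # close to, but not exactly, weights
```

The quantization error visible in `recon` is exactly what QAT compensates for: training continues with the quantized weights in the loop, so the model learns to be robust to this rounding.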
First - hand Testing
More test examples are available on the official technical blog. We also ran a quick test of our own (long-thinking mode only, with internet access disabled).
A classic question:
How do you get a 7-meter-long sugar cane through a door that is 1 meter wide and 2 meters high?
After thinking for nearly 5 minutes, Kimi answered:
Although the thinking time was a bit long, Kimi K2 Thinking successfully avoided the trap in the question, recognizing that the door's width and height do not actually stop the cane: you simply carry it through lengthwise, end first.
In terms of programming, the question we tested was:
Write a Python program in which a ball bounces inside a rotating hexagon, with the ball's motion obeying the laws of physics
This time, Kimi K2 Thinking quickly started writing code.
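For reference, the physics core of this task can be sketched headlessly (no rendering): a ball under gravity inside a regular hexagon rotating at constant angular velocity, with elastic reflection computed in the moving wall's rest frame. This is our own minimal sketch, not Kimi's generated program; the timestep, gravity, and rotation rate are arbitrary choices.

```python
import math

# Headless physics core: a ball bouncing inside a rotating regular hexagon.
# The hexagon is represented by its six outward wall normals, rotated by
# omega * t each step; reflections are computed relative to the moving wall.

def step(p, v, t, dt=0.005, omega=1.0, apothem=1.0, g=9.8):
    """Advance one timestep; returns updated (position, velocity) tuples."""
    vx, vy = v[0], v[1] - g * dt                 # apply gravity
    x, y = p[0] + vx * dt, p[1] + vy * dt        # integrate position
    for k in range(6):                           # check all six walls
        ang = omega * t + k * math.pi / 3
        nx, ny = math.cos(ang), math.sin(ang)    # outward wall normal
        d = x * nx + y * ny
        if d > apothem:                          # ball crossed this wall
            # wall velocity near the contact point (rigid rotation)
            wx, wy = -omega * y, omega * x
            rvn = (vx - wx) * nx + (vy - wy) * ny  # relative normal speed
            if rvn > 0:                          # moving outward: reflect
                vx -= 2 * rvn * nx
                vy -= 2 * rvn * ny
            x -= (d - apothem) * nx              # push back inside the wall
            y -= (d - apothem) * ny
    return (x, y), (vx, vy)

# Simulate 10 seconds; the ball should stay confined to the hexagon.
p, v, t = (0.0, 0.0), (1.5, 2.0), 0.0
for _ in range(2000):
    p, v = step(p, v, t)
    t += 0.005
```

Reflecting the velocity relative to the wall (rather than in the lab frame) is the step that makes the rotation physically meaningful: a moving wall can speed the ball up or slow it down, just like a paddle.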
What do you think of this performance?
If you have also conducted first - hand testing, welcome to share more test results with us in the comment section!
Project address: https://huggingface.co/moonshotai/Kimi-K2-Thinking
Technical blog: https://moonshotai.github.io/Kimi-K2/thinking.html
Reference links:
[1]https://x.com/Kimi_Moonshot/status/1986449512538513505
[2]https://x.com/ArtificialAnlys/status/1986541785511043536
[3]https://mp.weixin.qq.com/s/oQp1kFpoYFhYQ8GzbwZLyA
This article is from the WeChat public account "QbitAI" (Quantum Bit), author: Yuyang. Republished by 36Kr with permission.