
K2 Thinking stole the show again. Yang Zhilin answered 21 questions early in the morning.

Deng Yongyi | 2025-11-11 18:28
Ask Kimi Anything.

Text by | Deng Yongyi

Edited by | Su Jianxun

After the release of K2 Thinking sparked global discussion once again last week, Yang Zhilin, founder of Moonshot AI, together with co-founders Zhou Xinyu and Wu Yuxin, held an online AMA (Ask Me Anything) on Reddit for several hours starting at 0:00 Beijing time on November 11, answering questions about the new model.

This was also the first time that several co-founders appeared together.

Having shifted from a high-profile marketing push to a focus on model technology, Kimi has stopped spending on advertising and grown quieter. Like the release of K2 three months ago, this launch also took a low-key route: there was no official offline press conference, and the model was released directly to the community.

The core members of the team chose to answer questions on Reddit and Zhihu, which fits Kimi's current open-source approach - these communities are where AI practitioners and geeks gather.

Foreign developers have been generous with their praise for Kimi K2. After the AMA began, dozens of questions quickly filled the thread, along with praise for Kimi's cost-effectiveness and the depth of its open-sourcing. "It's an absolutely great model!" many users said.

Many developers also pressed for more releases on the spot, hoping the Kimi team would quickly launch a smaller version of K2 Thinking for deployment on PCs or for use in enterprise production environments.

Yang Zhilin also addressed a series of rumors for the first time: Will Kimi continue to be open source? Is the reported $4.6 million training cost of K2 Thinking accurate? And what are the plans for the next-generation K3 model and the key training details of K2 Thinking?

△ Yang Zhilin responds to the question of training cost

△ Will a larger-scale closed-source model be released in the future? A subtle answer: If the model becomes more and more dangerous :)

The Kimi team also engaged candidly with technical discussions and even joked about the recent AI-bubble talk. "We don't know (why OpenAI is burning so much money); only Sam knows. We have our own rhythm," said Zhou Xinyu, co-founder of Moonshot AI.

△ Zhou Xinyu, co-founder of Moonshot AI

The newly released K2 Thinking is a sparse mixture-of-experts (MoE) model with up to 1 trillion parameters - a relatively large scale among open-source models.

In several benchmark tests representing cutting-edge capabilities, K2 Thinking has indeed achieved good results, especially in reasoning and task execution.

On agent leaderboards built on notoriously difficult test sets such as HLE (Humanity's Last Exam, over 3,000 hard questions written by human experts) and BrowseComp (autonomous web browsing), K2 Thinking even outscored GPT-5.

K2 Thinking inherits DeepSeek's architectural design but adds more innovative work on top of it - the parameter count has been increased, and new quantization approaches such as INT4 have been adopted.
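For readers unfamiliar with INT4, the sketch below is a minimal, generic illustration in Python of symmetric per-group INT4 weight quantization: each group of weights shares one floating-point scale, and each weight is stored as a 4-bit integer. It is not Kimi's actual pipeline (K2 Thinking's INT4 support is reportedly built with quantization-aware training), and the group size and function names are assumptions made for the example.

```python
import numpy as np

def quantize_int4(weights: np.ndarray, group_size: int = 32):
    """Generic symmetric per-group INT4 sketch: map each group of weights
    to integers in [-8, 7] that share one scale per group."""
    w = weights.reshape(-1, group_size)
    scale = np.maximum(np.abs(w).max(axis=1, keepdims=True) / 7.0, 1e-8)
    codes = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
    # codes are kept one-per-int8 here for clarity; real INT4 packs two per byte
    return codes, scale

def dequantize_int4(codes: np.ndarray, scale: np.ndarray, shape) -> np.ndarray:
    """Recover approximate FP32 weights from the codes and per-group scales."""
    return (codes.astype(np.float32) * scale).reshape(shape)

# Example: a 4096 x 4096 FP32 weight matrix (~64 MB) would shrink to roughly
# 8 MB of packed INT4 codes plus a small number of scales.
w = np.random.randn(4096, 4096).astype(np.float32)
codes, scale = quantize_int4(w)
w_hat = dequantize_int4(codes, scale, w.shape)
print("mean abs reconstruction error:", float(np.abs(w - w_hat).mean()))
```

The point of going to 4 bits is straightforward: smaller weights mean less memory traffic per token, which lowers serving cost and speeds up inference at a modest accuracy price.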

In terms of price, K2 Thinking has a clear cost advantage. Output is priced at $2.50 per million tokens, only a quarter of GPT-5's $10. Many people call it a "cost-effective alternative" to GPT-5 and Claude Sonnet 4.5.

"Is this another glorious moment like DeepSeek?" After the release of K2 Thinking, Thomas Wolf, the co-founder of Hugging Face, sighed on X.

Recent months have been a harvest season for domestic large models. Major players have open-sourced models one after another, as if by prior agreement, giving Silicon Valley a jolt: Zhipu released GLM-4.6 in September, MiniMax released M2 in October, and now, with K2 Thinking, they are competing fiercely on the global leaderboards.

(We have also compiled the complete Q&A of this AMA at the end of the article)

The talkative K2 Thinking is designed to better perform tasks

During the AMA and in communities such as Zhihu, many developers' first impression was that K2 Thinking is very talkative: ask it a question and it thinks for a long time, and although it is cheap, it burns through a large number of tokens.

The talkativeness actually serves the most important purpose of all: enabling AI to help humans complete more tasks.

From K2 to K2 Thinking, all designs revolve around this point: focusing on Agentic (intelligent agent) capabilities, so that AI can not only chat but also truly complete tasks.

Although K2 Thinking has up to one trillion parameters, the large scale is not for show; it lets the model hold more knowledge, which helps it understand and execute tasks - the equivalent of a "smarter brain". During actual operation, however, K2 Thinking's activated parameters are kept at around 30 billion, which keeps answering and task execution fast enough.
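To make the "huge total, small activated" idea concrete, here is a minimal sparse-MoE routing sketch in Python. The expert count, dimensions, and top-k value are illustrative assumptions, not K2 Thinking's actual configuration; the point is only that each token passes through a few experts, so the activated parameters are a small slice of the total.

```python
import numpy as np

def moe_forward(x, experts, router_w, top_k=2):
    """Sparse MoE layer: route each token to its top_k experts only.
    Total parameters = all experts; activated parameters = top_k experts per token."""
    logits = x @ router_w                                    # (tokens, n_experts) routing scores
    top = np.argsort(logits, axis=-1)[:, -top_k:]            # chosen expert ids per token
    gates = np.take_along_axis(logits, top, axis=-1)
    gates = np.exp(gates) / np.exp(gates).sum(-1, keepdims=True)  # softmax over the chosen experts
    out = np.zeros_like(x)
    for t in range(x.shape[0]):                              # only top_k experts run per token
        for slot, e in enumerate(top[t]):
            out[t] += gates[t, slot] * (x[t] @ experts[e])
    return out

# Illustrative numbers only: 64 experts, 2 active -> ~3% of expert weights used per token.
d, n_experts, tokens = 128, 64, 4
experts = [np.random.randn(d, d) / np.sqrt(d) for _ in range(n_experts)]
router_w = np.random.randn(d, n_experts) / np.sqrt(d)
x = np.random.randn(tokens, d)
print(moe_forward(x, experts, router_w).shape)               # (4, 128)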

A long thinking chain is K2 Thinking's strength. According to Kimi's official introduction, K2 Thinking can string together 200-300 consecutive tool calls to solve complex problems while keeping the task on track.

A Zhihu user @Ordinary conducted an experiment: he gave K2 Thinking a doctoral-level math problem, and K2 Thinking successfully solved the problem with only 23 tool calls.

The specific execution process of K2 Thinking is as follows:

  • Step 1: The model first understands the problem and plans a solution path.
  • Step 2: Call a search tool to find relevant solutions and theories.
  • Step 3: Analyze the search results and determine whether they are usable.
  • Step 4 to Step N: Repeatedly call the Python code executor to write code, perform calculations, and verify hypotheses.

Loop: Continuously iterate in the cycle of "thinking - calling tools - verifying results" until the problem is solved.

It is not hard to see that this imitates how humans solve problems; the loop can be sketched roughly as follows.
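The sketch below is a schematic Python rendering of such an agent loop. The helpers `call_model`, `run_tool`, and `is_solved` are hypothetical placeholders, not a real Kimi API; the only point is the "think - call tool - verify" cycle bounded by a step budget.

```python
# --- Hypothetical placeholders, not a real Kimi API --------------------------
def call_model(history):
    """Stub: pretend the model either requests a tool or returns a final answer."""
    if len(history) < 6:
        return {"content": "thinking...", "tool_call": {"name": "python", "args": "verify step"}}
    return {"content": "final answer", "tool_call": None}

def run_tool(tool_call):
    """Stub: pretend to execute the requested tool and return its output."""
    return f"result of {tool_call['name']}({tool_call['args']})"

def is_solved(history):
    """Stub: pretend to verify whether the latest result closes the task."""
    return False
# ------------------------------------------------------------------------------

def agent_loop(task: str, max_tool_calls: int = 300) -> str:
    """Schematic 'think -> call tool -> verify' loop with a step budget."""
    history = [{"role": "user", "content": task}]
    for _ in range(max_tool_calls):
        reply = call_model(history)               # model thinks, may request a tool
        if reply["tool_call"] is None:            # no tool requested: final answer
            return reply["content"]
        result = run_tool(reply["tool_call"])     # e.g. web search or Python executor
        history.append({"role": "assistant", **reply})
        history.append({"role": "tool", "content": result})
        if is_solved(history):                    # verify before continuing
            return result
    return "step budget exhausted"

print(agent_loop("solve a PhD-level math problem"))
```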

The "number of steps" measures the long-range execution ability and endurance of the model. The more steps, the more complex the tasks the model can handle and the more multi-round iterations are required. And in this process, how to prevent the model from deviating from the original goal is one of the main difficulties in training.

Many of K2 Thinking's design choices share one core goal: making sure the model can carry a complex task through to the end without losing information. To put performance first, the Kimi team is willing to sacrifice a little token efficiency - it doesn't matter if the model is a bit talkative, as long as the task gets done.

The team also shared its thoughts on DeepSeek's recently popular OCR line of research (a pure pixel-input model). "Personally, I think this route is a bit too resource-intensive. I prefer to keep working in the feature space to find more general, modality-agnostic ways to improve model efficiency," said Wu Yuxin, co-founder of Moonshot AI.


Beyond text models, the Kimi team said it is still working on other modalities such as visual understanding, though the timeline may slip.

After Claude was cut off, China's pace of innovation has only accelerated

Whether it is Kimi K2 Thinking, GLM, or MiniMax M2, the releases all point to a common trend: with infrastructure such as chips constrained and access to Claude cut off, domestic large models have accelerated their algorithmic innovation.

On training cost, Yang Zhilin stated plainly that the $4.6 million figure is "not an official number", adding that training cost is hard to quantify because most of it goes to research and experiments, which cannot be folded into a one-off training bill.

What is certain is that K2 Thinking was built under relatively constrained conditions. Yang Zhilin said K2 Thinking was trained on H800 GPUs connected with InfiniBand. Compared with the United States, Kimi is at a disadvantage in GPU count, but it has squeezed the most out of every card.

And it is not only Kimi: domestic teams still investing in base models have each found their own niches for algorithmic innovation.

A typical example is that MiniMax and Moonshot AI made different choices when facing the same question: how to process long contexts efficiently.

MiniMax's previous-generation model, M1, adopted a key technique called linear attention, but with M2 it reverted to full attention.

The difference comes down to MiniMax wanting more stable, proven technology that will not drop key content when processing long chains of information. As MiniMax explained in a recent technical blog, practice showed that although linear attention saves compute, the traditional method is more reliable for complex agent tasks involving multi-step reasoning, and within the current engineering system they value stability more.

Kimi has chosen a more aggressive path. The recently released Kimi Linear, for example, pursues the KDA + MLA route at a more fundamental hardware-and-architecture level, interleaving KDA and MLA layers in a 3:1 ratio.

The traditional Transformer architecture is like a secretary with a photographic memory - the model remembers every word and misses no detail; but as the amount of information grows, its computation time grows quadratically.

With the KDA architecture, the model is forced to learn to "grasp what matters": it can mark each word along dimensions such as importance and recency and selectively forget some details. The new architecture shows clear advantages in performance, speed, and GPU memory usage.
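To illustrate the trade-off in code, here is a hedged Python (PyTorch) sketch contrasting quadratic-cost softmax attention with a generic kernelized linear attention (in the style of Katharopoulos et al.), plus a helper that interleaves the two layer types 3:1. This is a structural illustration only; the actual KDA and MLA mechanisms in Kimi Linear are more sophisticated, and the dimensions here are made up.

```python
import torch
import torch.nn.functional as F

def full_attention(q, k, v):
    """Standard softmax attention: the (n x n) score matrix makes cost grow
    quadratically with sequence length n."""
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5    # (batch, n, n)
    return torch.softmax(scores, dim=-1) @ v

def linear_attention(q, k, v):
    """Generic kernelized (non-causal) linear attention: cost grows linearly
    with sequence length. A sketch, not the KDA mechanism itself."""
    q, k = F.elu(q) + 1, F.elu(k) + 1                        # positive feature map
    kv = torch.einsum("bnd,bne->bde", k, v)                  # summarize keys/values once
    z = 1.0 / (torch.einsum("bnd,bd->bn", q, k.sum(dim=1)) + 1e-6)
    return torch.einsum("bnd,bde,bn->bne", q, kv, z)

def hybrid_layer_plan(n_layers: int, ratio=(3, 1)):
    """Interleave 'linear' (KDA-style) and 'full' (MLA-style) attention layers
    in the given ratio, e.g. 3:1 -> linear, linear, linear, full, ..."""
    group = ["linear"] * ratio[0] + ["full"] * ratio[1]
    return [group[i % len(group)] for i in range(n_layers)]

# Toy forward pass over a 12-layer plan (9 linear + 3 full attention layers).
x = torch.randn(2, 1024, 64)                                 # (batch, seq_len, dim)
for kind in hybrid_layer_plan(12):
    attn = linear_attention if kind == "linear" else full_attention
    x = x + attn(x, x, x)                                    # residual connection
print(x.shape)                                               # torch.Size([2, 1024, 64])
```

The design intuition is that the cheap linear layers carry most of the sequence mixing, while the occasional full-attention layer preserves the exact recall that purely linear stacks tend to lose.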

The choice of technical route is also related to the different business goals of each company.

The companies' strategies have begun to diverge visibly. MiniMax M2 is positioned as cost-effective, with fast inference and a wide range of multimodal options, hoping to attract developers to build a rich application ecosystem on its platform;

Kimi has chosen to keep "climbing the mountain", focusing on maximizing the text model's capabilities and probing the upper limit of intelligence. Under this goal, the team puts performance first and makes the agent more usable, leaving token efficiency aside for now.

Zhipu's GLM has captured a large share of the market since Claude was cut off, especially in programming and reasoning scenarios. GLM-4.6 is fairly well-rounded across performance, efficiency, and price, so enterprises can adopt it quickly, and many application vendors can build wrapper products directly on top of it.

There is no right or wrong among these choices; they are simply different survival strategies for the current environment.

In fact, the application ecosystem around Chinese open-source models is building advantages of its own - many overseas developers have started building applications on Chinese open-source models and are actively giving feedback. It is foreseeable that this open-source wave will also spark an explosion of applications.

Appended below is the Q&A from the AMA session, edited and compiled by "Intelligent Emergence", with some questions merged:

Q: Is the $4.6 million training cost true?

Kimi: This is not an official figure. It is difficult to quantify the training cost because a large part of the work is research and experiments.

Q: What made you guys (said with affection) choose a relatively untested optimizer to train such a large model?

Kimi: Muon is an optimizer that has not been tested by others, but in our experiments, it passed the Scaling Laws Ladder verification process.

We have confidence in our research system. You may think that we were just lucky to choose Muon, but behind this choice, dozens of optimizers and architectures failed the test in the experiments.

Q: What is your training hardware configuration? We'd like to know how your infrastructure differs from that of top US companies.

Kimi: We use H800 GPUs connected with InfiniBand. They are not as advanced as the high-end GPUs in the US, and we are at a disadvantage in quantity, but we have made full use of every card!

Q: What are the most important indicators in your pre-training process? How does the process of ablating architectural changes work? At what scale should the tests be conducted, and which indicators need to be checked to ensure the model performs well?

Also, what have you done to make the data more conducive to model learning before and after pre-training? Are there any indicators that can predict whether the data is beneficial to the model? Can you share some experiences?

Kimi: The most important indicators are: Loss, Benchmarks, and internal stability indicators.

We have a Scaling Laws Ladder verification process that runs at multiple scales. Architecture ablations must pass small-scale validation before moving to the next step. All the indicators matter.

If there are any unexpected situations, we will pause the model scaling process until the problem is understood and resolved.

The most important hyperparameter is the learning rate (and the learning rate scheduler). There are too many variables, so it's best to understand the hyperparameters before delving into hyperparameter search.

A good dataset must show a good benchmark trend during training. If not, it's necessary to optimize the data or find a better benchmark to show progress.

I'd say that finding the right data mixture is an art. There are too many interactions and shared patterns between datasets. Start with your intuition, but ultimately trust the experiments.

Q: Is focusing only on pure text models a trade-off to achieve SOTA (state-of-the-art performance), or is it your long-term bet? Will you consider increasing the context window to 1M in the future?

Kimi: Developing a video understanding model takes time for data acquisition and training, so we chose to release the text model first.

We have worked on a 1M context window before, but it is too expensive to serve right now. We will revisit longer context windows and should be able to increase the context length in future versions.

Q: Will you release a small model suitable for MacBooks? Or do you have plans to create a 32B or 20B model?

Kimi: We've noticed this demand, but currently, we don't have specific plans for a MacBook-friendly model. Small models like Kimi Linear are very cute, and we're likely to release