Yang Zhilin, the only founder of a Chinese large-model company to take the stage, made his debut at NVIDIA's GTC in the US and publicly unveiled Kimi's technical roadmap.
According to a March 18th report by Zhidx, early this morning at the NVIDIA GTC conference, Yang Zhilin, founder of Dark Side of the Moon (Moonshot AI) and the only founder of an independent Chinese large-model company invited to give an on-site speech at the session, delivered a talk titled "How We Scaled Kimi K2.5", fully disclosing for the first time the technical roadmap behind Kimi K2.5.
Just two days earlier, on March 16th, Dark Side of the Moon had released a new paper previewing a key module of its next-generation model: Attention Residuals (AttnRes). The paper centers on redesigning the residual connection, one of the most basic and long-unquestioned structures in large models.
This progress quickly caught the attention of the overseas AI community. Elon Musk called it "impressive", while Andrej Karpathy, a founding member of OpenAI, remarked that people may still not fully appreciate the groundbreaking Transformer paper "Attention Is All You Need".
In his GTC speech, Yang Zhilin placed this research back into Kimi's broader technical framework and presented a more systematic roadmap. He summarized the evolution of Kimi K2.5 as the interplay of three dimensions: Token Efficiency, Long Context, and Agent Swarms.
In Yang Zhilin's view, today's scaling is no longer a simple stacking of resources, but a simultaneous pursuit of scale effects in computational efficiency, long-term memory, and automated collaboration. If the technical gains along these three dimensions multiply together, the model will exhibit an intelligence level far beyond today's.
This is also the first time Dark Side of the Moon has systematically disclosed this roadmap since releasing Kimi K2.5 at the end of January.
Yang Zhilin argued that many of the technical standards commonly used in the industry today are essentially products of eight or nine years ago and are gradually becoming bottlenecks for scaling. To address this, the Kimi team chose to start from three basic modules, the optimizer, the attention mechanism, and the residual connection, reconstructing each in turn and continuing to open-source the results.
01. Rewriting the training base: MuonClip, doubling Token efficiency over AdamW
The Kimi team treats Token Efficiency as the first priority. In his speech, Yang Zhilin focused mainly on the optimizer.
He mentioned that since 2014, the Adam optimizer has been the industry default. In ultra-large-scale training, however, alternatives with higher Token efficiency have become an important direction. The Kimi team verified experimentally that the Muon optimizer has a significant Token-efficiency advantage: under a similar computational budget, it converts training tokens into model capability at roughly twice the efficiency.
▲The Muon optimizer achieves approximately twice the Token efficiency with the same computing power.
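The core operation behind Muon, per the publicly released Muon optimizer work, is to approximately orthogonalize the momentum matrix with a Newton-Schulz iteration before applying it as the weight update. The sketch below uses the coefficients from the open-source Muon release; it is an illustration of the public algorithm, not Kimi's exact implementation.

```python
import numpy as np

def newton_schulz(G, steps=5, eps=1e-7):
    """Quintic Newton-Schulz iteration (as in the open-source Muon
    optimizer): pushes the singular values of the update matrix
    toward 1 using only matrix multiplications, no SVD."""
    a, b, c = 3.4445, -4.7750, 2.0315  # coefficients from the Muon release
    X = G / (np.linalg.norm(G) + eps)  # normalize so singular values <= 1
    transposed = G.shape[0] > G.shape[1]
    if transposed:                      # iterate on the "wide" orientation
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X  # X <- aX + bX^3 + cX^5 (in SVD terms)
    return X.T if transposed else X
```

In Muon, this orthogonalized momentum replaces Adam's per-coordinate scaling as the weight update, which is where the reported token-efficiency gain comes from.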
However, Yang Zhilin also noted that when training the trillion-parameter-scale K2 model with Muon, the Kimi team ran into stability issues: logits exploded during training, with the maximum value quickly exceeding 1,000 and causing the model to diverge.
In response, the Kimi team proposed the MuonClip optimizer. Yang Zhilin said this method combines the Newton-Schulz iteration with a QK-Clip mechanism to constrain numerical values during training. In actual training, the max logits of Kimi K2 were kept within 100 and gradually decreased, while model loss was unaffected, yielding stable training.
▲MuonClip controls the max logits within 100 to achieve stable training.
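The QK-Clip idea can be sketched as a post-step rescaling of the query and key projections whenever the observed max logit crosses the threshold. This is a simplified, global illustration; the real implementation reportedly does this bookkeeping per attention head.

```python
import numpy as np

def qk_clip(W_q, W_k, observed_max_logit, tau=100.0):
    """Sketch of QK-Clip: if the largest attention logit seen this
    step exceeds tau, shrink the Q and K projections so the QK^T
    logits are scaled down by exactly tau / observed_max_logit."""
    if observed_max_logit <= tau:
        return W_q, W_k                        # within bounds, no-op
    gamma = np.sqrt(tau / observed_max_logit)  # split the factor across Q and K
    return W_q * gamma, W_k * gamma
```

Because the logits are bilinear in W_q and W_k, scaling each by sqrt(tau / max) caps the worst observed logit at exactly tau while leaving the relative ordering of attention scores unchanged.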
He also mentioned that to make Muon scalable on large GPU clusters, the Kimi team designed "Distributed Muon", which partitions the optimizer state across data-parallel groups and gathers gradients on demand to complete the computation, improving both memory efficiency and overall training efficiency.
02. The second focus, long context: Kimi Linear speeds up decoding 5 to 6 times across 128K to 1M contexts
Long Context is the second main line of Kimi's roadmap this time.
In this part, Yang Zhilin mainly introduced Kimi Linear. This is a hybrid linear attention architecture based on KDA (Kimi Delta Attention).
Its core idea is to rearrange the composition of the attention layers rather than defaulting to full attention in every layer.
Specifically, Kimi Linear interleaves KDA and global attention layers at a ratio of approximately 3:1, which reduces memory overhead while preserving the model's expressive power.
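The 3:1 interleaving can be illustrated with a simple layer-layout helper. This is illustrative only; the exact placement pattern inside Kimi Linear may differ.

```python
def hybrid_layout(n_layers, kda_per_global=3):
    """Sketch of a hybrid attention stack: every (kda_per_global + 1)-th
    layer uses global full attention, the rest use linear KDA, so in the
    default 3:1 setup only 1/4 of the layers grow a full KV cache."""
    return [
        "global" if (i + 1) % (kda_per_global + 1) == 0 else "kda"
        for i in range(n_layers)
    ]
```

The memory saving follows directly: with a 3:1 ratio, KV-cache growth during decoding is driven by only a quarter of the layers, which is what makes the long-context speedups possible.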
Yang Zhilin mentioned in his speech that Kimi Linear has completed training at a scale of 1.4T tokens and outperforms full attention and other baselines on long-context, short-context, and reinforcement learning tasks.
The most direct change shows up in inference efficiency: within the 128K-to-1M context range, decoding speed increases by approximately 5 to 6 times while performance remains stable across different lengths.
This addresses a long-standing problem: as the context window keeps expanding, inference cost and latency grow with it, making long-task capability hard to put into practice. Kimi Linear turns long context from a "supportable" ability into an "efficiently usable" one.
03. Rewriting the residual connection: enabling each layer to actively retrieve information
Alongside the optimizer and linear attention, Attention Residuals is a particularly crucial attempt in this round of Kimi's technical roadmap.
The residual connection is an extremely basic layer design in deep networks and has been used for about ten years.
Yang Zhilin mentioned that the traditional residual connection uses a fixed additive accumulation. As the network deepens, the hidden state keeps growing and information from deep layers is easily diluted. The Kimi team's approach replaces the residual path with dynamic aggregation based on softmax attention, letting the model selectively retrieve information from earlier layers according to the input content.
This shifts the information flow from "layer-by-layer stacking" to "on-demand retrieval", maintaining more stable representations in deep networks.
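Conceptually, the description above amounts to the following single-vector sketch. The projection matrices here are hypothetical stand-ins for illustration; this is not the paper's exact formulation.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_residual(h_cur, layer_history, W_q, W_k):
    """Sketch of an attention-style residual. A standard residual would
    return h_cur + layer_history[-1] (fixed addition). Here the current
    layer instead attends over ALL earlier layers' hidden states and
    adds a softmax-weighted mixture: retrieval instead of accumulation."""
    H = np.stack(layer_history)               # (num_layers, d) earlier states
    q = W_q @ h_cur                           # query from the current layer
    K = H @ W_k.T                             # one key per earlier layer
    weights = softmax(K @ q / np.sqrt(len(q)))  # convex weights over layers
    return h_cur + weights @ H                # on-demand retrieved residual
```

Because the weights depend on the input, a deep layer can reach back to whichever earlier representation is most useful instead of receiving a sum in which that signal has been diluted.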
In this part, Yang Zhilin extended an idea from Ilya Sutskever, former chief scientist of OpenAI, at NeurIPS 2024: if the residual connection is viewed as a simplified LSTM unrolled along depth, then attention can be understood as a further widening of this information channel.
▲Ilya proposed that "rotating the LSTM by 90 degrees results in a residual connection", and Attention can be regarded as its expansion.
Based on this understanding, Kimi proposed Attention Residuals and has open - sourced the relevant code and technical reports.
04. Visual reinforcement learning feeds back into text capability: cross-modal cognitive gains
In addition to the model's underlying architecture, Yang Zhilin also shared an important observation from the cross-modality research direction in his speech.
He mentioned that in native vision-text joint pre-training, introducing Vision RL not only improves the model's performance on visual tasks but also feeds back into its pure-text capability. Ablation results show that after visual RL training, performance on text benchmarks such as MMLU-Pro and GPQA-Diamond rises by approximately 1.7% to 2.2%.
Yang Zhilin believes this indicates that spatial reasoning and visual logic can translate into deeper general cognitive ability. The work also points in a clear direction: the value of multimodal training has shifted from "expanding input forms" to "strengthening underlying reasoning".
He also mentioned that the Kimi team is building "the first open model with native, joint vision-text capabilities".
05. From single agent to swarm collaboration: Kimi bets on Agent Swarms
In the last part of the speech, Yang Zhilin focused on Agent Swarms.
He said that the future form of agents will shift from single agents to dynamically generated swarm systems. Kimi K2.5 introduces an Orchestrator that creates multiple sub-agents according to task requirements and breaks complex tasks into parallel sub-tasks for execution.
▲The Orchestrator dynamically generates sub-agents and executes tasks in parallel.
These sub-agents can take on different roles, such as AI Researcher, Physics Researcher, and Fact Checker, completing the overall task through division of labor and collaboration.
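Structurally, this orchestration pattern is a fan-out/fan-in over role-specialised workers. A minimal sketch, with `run_subagent` as a hypothetical callable standing in for a model-driven sub-agent (the real K2.5 Orchestrator is, of course, itself part of the model):

```python
from concurrent.futures import ThreadPoolExecutor

def orchestrate(task, roles, run_subagent):
    """Sketch of an orchestrator: fan one task out to a sub-agent per
    role, run them concurrently, and collect each role's partial
    result so the orchestrator can merge them into a final answer."""
    with ThreadPoolExecutor(max_workers=len(roles)) as pool:
        futures = {role: pool.submit(run_subagent, role, task) for role in roles}
        return {role: fut.result() for role, fut in futures.items()}
```

The key property is that the sub-agents run in parallel, so wall-clock time is bounded by the slowest role rather than the sum of all of them.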
Yang Zhilin further added that this type of system covers the entire process from input to output: large-scale information acquisition (Input at Scale), parallel operations (Actions at Scale), task orchestration (Orchestration at Scale), and long-form output generation (Output at Scale).
As task complexity increases, the efficiency advantage of an agent swarm over a single agent keeps widening; in experiments, execution time can be shortened several-fold.
He also pointed out that multi-agent systems are prone to "serial collapse": they look like multiple agents but in practice revert to single-agent execution. To address this, Kimi designed a parallel reinforcement learning reward mechanism, including an Instantiation reward, a Finish reward, and an Outcome reward, to guide the model to genuinely decompose tasks and execute them in parallel.
▲Three types of reward mechanisms are used to prevent "pseudo-parallelism" and serial collapse.
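A toy scalar combination of the three signals might look like the following. The weights and the exact forms of each term are my own illustrative assumptions; the talk named the three rewards but did not disclose how they are computed or combined.

```python
def swarm_reward(n_spawned, n_finished, outcome_score,
                 w_inst=0.2, w_finish=0.3, w_outcome=0.5):
    """Toy combination of the three reward signals described in the
    talk. Weights are illustrative assumptions, not Kimi's values."""
    # Instantiation reward: did the orchestrator actually spawn more
    # than one sub-agent? (Directly penalises serial collapse.)
    r_inst = 1.0 if n_spawned > 1 else 0.0
    # Finish reward: fraction of spawned sub-agents that completed.
    r_finish = n_finished / max(n_spawned, 1)
    # Outcome reward: end-task quality in [0, 1], e.g. from a verifier.
    return w_inst * r_inst + w_finish * r_finish + w_outcome * outcome_score
```

Under this shaping, a run that solves the task but never parallelises forfeits the instantiation term, so the model is pushed toward genuine decomposition rather than pseudo-parallel execution.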