Yang Zhilin, the only founder of a Chinese large-model company to take the stage, made his debut at the US GTC and unveiled Kimi's technical roadmap.
According to a March 18th report by Zhidongxi, early this morning at the NVIDIA GTC conference, Yang Zhilin, founder of Dark Side of the Moon and the only founder of an independent Chinese large-model company invited to speak on site at this session, delivered a talk titled "How We Scaled Kimi K2.5" and disclosed, for the first time in full, the technical roadmap behind Kimi K2.5.
Just two days earlier, on March 16th, Dark Side of the Moon released a new paper previewing a key module of its next-generation model: Attention Residuals (AttnRes for short). At its core, the paper redesigns the residual connection, one of the most basic yet longest-unquestioned structures in large models.
This progress quickly drew the attention of the overseas AI community. Elon Musk called it "impressive"; Andrej Karpathy, a founding member of OpenAI, bluntly remarked that people may not have fully understood "Attention Is All You Need", the groundbreaking Transformer paper.
In this GTC speech, Yang Zhilin placed that research back into Kimi's broader technical framework and presented a more systematic roadmap, summarizing the evolution of Kimi K2.5 as the resonance of three dimensions: token efficiency, long context, and Agent Swarms.
In Yang Zhilin's view, scaling today is no longer simply a matter of stacking resources; it means pursuing scale effects simultaneously in computational efficiency, long-range memory, and automated collaboration. If the technical gains along these three dimensions can be multiplied together, the model will exhibit a far higher level of intelligence than it does today.
This is also the first time since Kimi K2.5's release at the end of January that Dark Side of the Moon has systematically disclosed this technical roadmap.
Yang Zhilin argued that many of the technical defaults widely used in the industry today are essentially products of eight or nine years ago and are gradually becoming bottlenecks for scaling. To address this, the Kimi team chose to start with three basic modules, the optimizer, the attention mechanism, and the residual connection, reconstructing them one by one and continuing to open-source the results.
01. Rewriting the training base: MuonClip doubles token efficiency over AdamW
The Kimi team made token efficiency its first priority, and Yang Zhilin devoted this part of his speech to the optimizer.
He noted that the Adam optimizer has been the industry default since 2014, but in ultra-large-scale training, alternatives with higher token efficiency have become an important direction. The Kimi team verified experimentally that the Muon optimizer holds a significant token-efficiency advantage: under a similar computational budget, it converts training tokens into model capability at roughly twice the efficiency.
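Muon's core operation, per public descriptions of the optimizer, is to orthogonalize the momentum matrix with a Newton-Schulz iteration before applying the update. A minimal NumPy sketch, using the commonly published quintic-iteration coefficients; this is an illustration, not Kimi's production code:

```python
import numpy as np

def newton_schulz(G, steps=5, eps=1e-7):
    """Approximately orthogonalize G (push its singular values toward 1)."""
    a, b, c = 3.4445, -4.7750, 2.0315   # published quintic coefficients
    X = G / (np.linalg.norm(G) + eps)   # normalize so singular values <= 1
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X

def muon_step(W, grad, momentum, lr=0.02, mu=0.95):
    """One Muon update: momentum accumulation, then an orthogonalized step."""
    momentum = mu * momentum + grad
    W = W - lr * newton_schulz(momentum)
    return W, momentum
```

The intuition is that orthogonalizing the update spreads learning signal evenly across directions instead of letting a few dominant singular directions absorb most of the step, which is one proposed explanation for the token-efficiency gain.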
▲The Muon optimizer achieves about twice the Token efficiency with the same computing power.
However, Yang Zhilin also pointed out that when extending Muon to training the trillion-parameter K2 model, the Kimi team ran into stability issues: logits exploded during training, with the maximum value quickly exceeding 1,000 and causing the model to diverge.
In response, the Kimi team proposed the MuonClip optimizer. Yang Zhilin said the method combines Newton-Schulz iteration with a QK-Clip mechanism to constrain values during training. In actual training, Kimi K2's max logits were kept within 100 and gradually decreased, while model loss was unaffected, achieving stable training.
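The QK-Clip idea, as described in public material on MuonClip, is to rescale the query and key projections whenever a head's maximum attention logit exceeds a threshold, so logits stay bounded by construction. A hedged NumPy sketch; the single-head setup and the even split of the correction across Q and K are simplifying assumptions:

```python
import numpy as np

def attn_logits(X, Wq, Wk):
    """Scaled dot-product attention logits for one head."""
    d = Wq.shape[1]
    return (X @ Wq) @ (X @ Wk).T / np.sqrt(d)

def qk_clip(Wq, Wk, X, tau=100.0):
    """If the max logit exceeds tau, shrink Wq and Wk so it equals tau.

    Splitting the correction as sqrt(tau/m) on each of Q and K rescales
    the logits (which are bilinear in Wq, Wk) by exactly tau/m.
    """
    m = attn_logits(X, Wq, Wk).max()
    if m > tau:
        gamma = np.sqrt(tau / m)
        Wq, Wk = Wq * gamma, Wk * gamma
    return Wq, Wk
```

Because the clip acts on the weights rather than the activations, the constraint persists into subsequent steps instead of being re-violated immediately.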
▲MuonClip controls the max logits within 100, achieving stable training.
He also mentioned that, to make Muon scale across large GPU clusters, the Kimi team designed "Distributed Muon", which shards the optimizer state across data-parallel groups and gathers gradients on demand to complete the computation, improving both memory efficiency and overall training efficiency.
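The described sharding can be simulated in plain Python: each data-parallel rank holds momentum state for only a slice of the parameters and gathers the full gradient for that slice when updating it. The round-robin partitioning below is a hypothetical scheme for illustration, not Distributed Muon's actual layout:

```python
def shard_state(param_names, world_size):
    """Round-robin assignment: each rank keeps momentum only for its shard."""
    shards = [param_names[r::world_size] for r in range(world_size)]
    # Momentum is instantiated once per parameter, on exactly one rank,
    # rather than replicated on every rank as in plain data parallelism.
    momentum = [{name: 0.0 for name in shard} for shard in shards]
    return shards, momentum

params = [f"layer{i}.weight" for i in range(8)]
shards, momentum = shard_state(params, world_size=4)
```

With the state held once instead of `world_size` times, per-GPU optimizer memory drops by roughly a factor of the data-parallel degree, at the cost of the gather step.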
02. The second focus is long context: Kimi Linear speeds up decoding by 5 to 6 times across the 128K to 1M range
Long context is the second main line of Kimi's roadmap this time.
In this part, Yang Zhilin mainly introduced Kimi Linear, a hybrid linear-attention architecture built on KDA (Kimi Delta Attention). Its core idea is to rearrange the composition of the attention layers rather than defaulting to full attention in every layer.
Specifically, Kimi Linear interleaves KDA and global attention at a ratio of roughly 3:1, which reduces memory overhead while preserving the model's expressive power.
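The exact interleaving pattern is Kimi's internal design, but a 3:1 ratio implies something like a repeating block of three KDA layers followed by one global-attention layer, where only the global layers keep a KV cache that grows with sequence length. A hypothetical sketch of that schedule and its memory implication:

```python
def layer_schedule(n_layers, kda=3, full=1):
    """Repeat: `kda` linear-attention (KDA) layers, then `full` global layers."""
    period = kda + full
    return ["KDA" if i % period < kda else "global" for i in range(n_layers)]

def kv_cache_entries(schedule, seq_len):
    # Only global-attention layers store a KV cache proportional to seq_len;
    # KDA layers carry constant-size recurrent state instead.
    return sum(seq_len for layer in schedule if layer == "global")
```

Under this assumed pattern, a hybrid stack caches a quarter of what an all-global stack would, which is where the long-context decoding speedup comes from.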
Yang Zhilin mentioned in his speech that Kimi Linear has completed training at a scale of 1.4T tokens and outperforms full attention and other baselines on long-context, short-context, and reinforcement-learning tasks.
The most direct change shows up in inference efficiency: within the 128K to 1M context range, decoding speed improves by roughly 5 to 6 times while performance remains stable across different lengths.
This addresses a long-standing problem: as the context window keeps expanding, inference cost and latency grow with it, making long-task capabilities hard to deploy in practice. Kimi Linear turns long context from a capability that is merely "supported" into one that is "efficiently usable".
03. Rewriting the residual connection: enabling each layer to actively obtain information
Alongside the optimizer and linear attention, Attention Residuals is a particularly crucial attempt in this iteration of Kimi's technical roadmap.
The residual connection is one of the most basic layer designs in deep networks and has been in use for roughly a decade.
Yang Zhilin noted that the traditional residual connection accumulates by fixed addition: as the network deepens, the hidden state keeps growing and deep-layer information is easily diluted. The Kimi team's approach replaces the residual path with dynamic aggregation based on softmax attention, letting the model selectively retrieve information from earlier layers according to the input.
This shifts the information flow from "layer-by-layer accumulation" to "on-demand reading", maintaining a more stable information representation in deep networks.
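The AttnRes paper has the authoritative formulation; as a toy contrast, a classic residual blindly adds each layer's output onto the stream, whereas a depth-attention read lets the current layer take a softmax-weighted combination of all earlier hidden states:

```python
import numpy as np

def residual_update(history, layer_out):
    # Classic residual: fixed addition onto the most recent state.
    return history[-1] + layer_out

def attn_residual_read(history, query):
    """Softmax read over earlier layers' states instead of fixed addition."""
    H = np.stack(history)                     # (depth, d): one row per earlier layer
    scores = H @ query / np.sqrt(query.size)  # relevance of each earlier layer
    w = np.exp(scores - scores.max())         # numerically stable softmax
    w = w / w.sum()
    return w @ H                              # convex combination over depth
```

Note the output is a convex combination, so its magnitude cannot grow without bound with depth, unlike the additive stream, which matches the stability motivation described above.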
Here, Yang Zhilin extended an idea that Ilya Sutskever, OpenAI's former chief scientist, raised at NeurIPS 2024: if the residual connection is viewed as a simplified LSTM unrolled along the depth dimension, then attention can be understood as a further widening of this information channel.
▲Ilya proposed that "rotating the LSTM by 90 degrees results in a residual connection", and Attention can be regarded as its extension.
Based on this understanding, Kimi proposed Attention Residuals and has open - sourced the relevant code and technical reports.
04. Visual reinforcement learning feeds back into text capabilities; cross-modality brings cognitive gains
In addition to the model's underlying architecture, Yang Zhilin also shared an important observation in the cross - modality research direction in his speech.
He mentioned that in native vision-text joint pre-training, introducing visual reinforcement learning (Vision RL) not only improves the model's performance on visual tasks but also feeds back into its pure-text capabilities. Ablation results show that after Vision RL training, the model's scores on text benchmarks such as MMLU-Pro and GPQA-Diamond rise by about 1.7%-2.2%.
Yang Zhilin believes this indicates that spatial reasoning and visual-logic abilities can transfer into deeper general cognitive abilities. The work also points to a direction: the value of multimodal training has shifted from "expanding input modalities" to "improving underlying reasoning".
He added that the Kimi team is working toward "the first open model with native, joint vision-text capabilities".
05. From single agent to swarm collaboration: Kimi bets on Agent Swarms
In the last part of the speech, Yang Zhilin focused on Agent Swarms.
He said the future form of agents will shift from single agents to cluster systems that can be generated dynamically. Kimi K2.5 introduces an Orchestrator that creates multiple sub-agents based on task requirements and breaks complex tasks into parallel sub-tasks for execution.
▲The Orchestrator dynamically generates sub - agents and executes tasks in parallel.
These sub-agents can take on different roles, such as AI Researcher, Physics Researcher, and Fact Checker, completing the overall task through division of labor and collaboration.
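Kimi's actual Orchestrator is not public; in spirit, though, it resembles a coordinator that assigns each sub-task to a role-specialized sub-agent and runs them concurrently. A minimal sketch with Python threads, where the roles and stub agents are made up for illustration (real sub-agents would be LLM calls):

```python
from concurrent.futures import ThreadPoolExecutor

# Toy role-specialized sub-agents (stand-ins for LLM calls).
AGENTS = {
    "researcher":   lambda q: f"findings on {q}",
    "fact_checker": lambda q: f"verified: {q}",
}

def orchestrate(subtasks):
    """Fan sub-tasks out to sub-agents in parallel; collect results in order."""
    with ThreadPoolExecutor(max_workers=len(subtasks)) as pool:
        futures = [pool.submit(AGENTS[role], query) for role, query in subtasks]
        return [f.result() for f in futures]
```

Because the sub-tasks are independent, wall-clock time is bounded by the slowest sub-agent rather than the sum of all of them, which is the efficiency argument made in the speech.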
Yang Zhilin added that such a system covers the entire pipeline from input to output: large-scale information acquisition (Input at Scale), parallel operations (Actions at Scale), task orchestration (Orchestration at Scale), and long-form result generation (Output at Scale).
As task complexity increases, the efficiency advantage of an agent cluster over a single agent keeps widening; in experiments, execution time was shortened severalfold.
He also pointed out that multi-agent systems are prone to "serial collapse": what looks like multiple agents in fact degenerates into single-agent execution. To counter this, Kimi designed a parallel reinforcement-learning reward scheme, comprising an Instantiation reward, a Finish reward, and an Outcome reward, to push the model to genuinely decompose tasks and execute them in parallel.
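The three reward terms could plausibly be combined along the following lines; the weights and exact definitions here are illustrative guesses, not Kimi's actual reward function:

```python
def parallel_reward(trace, weights=(0.2, 0.2, 0.6)):
    """Score one rollout trace.

    trace: {"spawned": #sub-agents created,
            "finished": #sub-tasks completed,
            "score": final answer quality in [0, 1]}
    """
    w_inst, w_fin, w_out = weights
    instantiation = 1.0 if trace["spawned"] > 1 else 0.0   # really decomposed?
    finish = trace["finished"] / max(trace["spawned"], 1)  # sub-tasks actually ran?
    outcome = trace["score"]                               # final result quality
    return w_inst * instantiation + w_fin * finish + w_out * outcome
```

The key property is that a serially collapsed rollout forfeits the instantiation term even when its final answer is equally good, so genuine parallel decomposition is always rewarded more.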
▲Three types of reward mechanisms are used to prevent "pseudo - parallel" and serial collapse.