
Liang Wenfeng and Yang Zhilin: the fourth collision

China Entrepreneur Magazine, 2026-01-29 16:19
Why do both of them target visual understanding?

This is already the fourth time since 2025 that Liang Wenfeng, the founder of DeepSeek, and Yang Zhilin, the founder of Dark Side of the Moon, have precisely "collided" in their technical approaches.

On January 27th, Dark Side of the Moon released and open-sourced a new model, Kimi K2.5, which evolved from the previously separate K2 and K2-Thinking. In the official video, Yang Zhilin described it as an "all-around model," with visual understanding, coding, multimodality, thinking and non-thinking modes, Agent, and Agent-cluster capabilities all packed into a single model.

In addition to a significant improvement in coding ability, a major highlight of K2.5 lies in the substantial enhancement of its "visual understanding ability." It can analyze pictures and videos uploaded by users and then program or answer questions accordingly.

Coincidentally, on the same day that K2.5 was released, DeepSeek launched its new-generation model, OCR-2. This model also made a major breakthrough in visual understanding, with an even more inventive approach. DeepSeek introduced a "visual causal flow" mechanism: instead of scanning images in a fixed order, the model dynamically adjusts its reading order according to the semantics and logic of the image content, much as a human would.

Exploring the same technical paths again and again, and releasing results on the same day several times over, the tacit alignment between Liang Wenfeng and Yang Zhilin is hard to dismiss as coincidence. Why are both aiming at the peak of visual understanding?

Four "Collisions"

In fact, the reason why Liang Wenfeng and Yang Zhilin always choose to release their model products and papers at the same time is not because of "involution." By dissecting their achievements, we can find that their "harmonious yet different" innovations in key technical paths are based on similar judgments of the pain points of large models and the industry.

On January 20th, 2025, DeepSeek-R1 quickly became popular after its launch. Kimi 1.5 was released shortly after, and it likewise adopted the "reinforcement learning based on outcome rewards" approach.

On February 18th, 2025, Liang Wenfeng and Yang Zhilin successively published their latest papers on attention architectures, focusing on the industry pain points of inefficient long-context processing and high computing power consumption under the Transformer attention mechanism.

Among them, as a co-author, Liang Wenfeng proposed the DeepSeek NSA (Native Sparse Attention) architecture. Through a strategy that combines hierarchical compression, key-token selection, and a sliding window, it significantly reduces the computing power consumed by long-context processing.

On the same day, as a co-author of a second paper, Yang Zhilin proposed the MoBA (Mixture of Block Attention) architecture, which takes a different optimization path from NSA. Building on the Mixture of Experts (MoE) principle, it uses block-wise processing and a dynamic gating mechanism so the model can switch autonomously between full attention and sparse attention.

NSA leans toward hardware-level optimization, while MoBA makes more flexible innovations within the Transformer framework. Although the paths differ, the core goal is the same: solve the efficiency bottleneck and make models more practical in complex tasks.
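Neither paper's implementation is reproduced here, but the shared intuition behind block-sparse attention can be shown in a toy Python sketch: group keys into contiguous blocks, score each block against the query (a stand-in for MoBA's gating), and run softmax attention only over the tokens in the top-scoring blocks instead of the full context. All numbers and the scoring rule are illustrative assumptions, not the published method.

```python
# Toy block-sparse attention sketch (illustrative only; not the NSA or
# MoBA implementation). Keys are partitioned into blocks, blocks are
# scored against the query, and attention runs only over top-k blocks.
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def block_sparse_attention(query, keys, values, block_size=2, top_k=1):
    # 1. Partition the key positions into contiguous blocks.
    blocks = [range(i, min(i + block_size, len(keys)))
              for i in range(0, len(keys), block_size)]
    # 2. Score each block by its mean query-key similarity (toy gating).
    scores = [sum(dot(query, keys[j]) for j in blk) / len(blk)
              for blk in blocks]
    # 3. Keep only the top-k blocks; the rest are never attended to.
    chosen = sorted(range(len(blocks)), key=lambda b: -scores[b])[:top_k]
    idx = [j for b in sorted(chosen) for j in blocks[b]]
    # 4. Standard softmax attention restricted to the selected tokens.
    logits = [dot(query, keys[j]) for j in idx]
    m = max(logits)
    weights = [math.exp(l - m) for l in logits]
    z = sum(weights)
    dim = len(values[0])
    return [sum(w * values[j][d] for w, j in zip(weights, idx)) / z
            for d in range(dim)]

out = block_sparse_attention(
    [1.0, 0.0],                                  # query
    [[1.0, 0], [0.9, 0], [0, 1.0], [0, 0.9]],    # keys: block 0 matches
    [[1.0, 0], [2.0, 0], [10.0, 10], [20.0, 20]],
    block_size=2, top_k=1)
print(out)  # ≈ [1.475, 0.0]: only the first block's values contribute
```

The payoff is that the expensive softmax is computed over `top_k * block_size` tokens rather than the whole sequence, which is the efficiency bottleneck both papers target.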

In April 2025, DeepSeek released the mathematical reasoning model DeepSeek-Prover-V2. Using reinforcement learning with subgoal decomposition to drive theorem proving, the model can "self-verify" the soundness of its reasoning process. Almost at the same time, Dark Side of the Moon launched a dedicated mathematical reasoning model built on the same core method of "self-verification," greatly improving the stability and accuracy of theorem proving.

This "collision" originated from the fact that at that time, deep AI reasoning was still a technical difficulty in the industry. As a core scenario, mathematical reasoning is directly related to the implementation ability of large models in fields such as scientific research, finance, and engineering. Their simultaneous focus on this direction is based on a consistent exploration of verifying the implementation value of AI.

In the most recent contest, DeepSeek's OCR-2 and Dark Side of the Moon's K2.5 both aimed at visual understanding. That, too, is no coincidence.

Several months ago, "China Entrepreneur" learned from relevant sources that DeepSeek and Dark Side of the Moon have been secretly competing to see who can be the first to develop a visual - language model with cutting - edge capabilities, so that large models will no longer be "smart blind people."

Combined with the multimodal evaluation report released by SuperCLUE in July 2025, we may find the answer behind their push into vision-language models.

The report points out that vision-language models generally face three major pain points:

1. Lack of professional knowledge, scoring low in specialized fields such as medical image analysis and industrial applications.
2. Poor adaptation to complex scenarios, with weak performance in tasks such as autonomous driving and spatial reasoning.
3. Shallow multimodal fusion, with judgment accuracy below 65% when text and images conflict.

It can be seen that visual understanding is the inevitable path for large models to move from "language interaction" to "full-scenario interaction," and it has become a bottleneck restricting the commercial deployment of models. Liang Wenfeng and Yang Zhilin's simultaneous focus on this field stems from a similar insight into industry pain points: whoever breaks through first takes the initiative in the multimodal commercial competition.

How to Scale the Peak of Visual Understanding?

In fact, at the level of large language models, domestic models are gradually narrowing the gap with overseas models. However, industry insiders told "China Entrepreneur" that in visual understanding, the so-called overseas "Big Three" (Google Gemini, OpenAI GPT 5.2, and Claude) have advanced to the next stage, while domestic large models are still catching up and "making up for deficiencies."

For example, several months ago there was an online test in which large models were asked to identify car models. A Tesla whose owner had pasted a Xiaomi logo on it fooled many of them. "This shows that comprehensively processing visual information is still difficult for current multimodal models," said the aforementioned insider.

In this release, Yang Zhilin demonstrated a video in which K2.5 reproduces the functionality of a website just by recognizing pictures or videos. Previously, domestic large models mostly had to rely on language and instructions to achieve this: "You need to tell the model precisely that there is a button in the upper-left corner; every requirement has to be described in instructions."

"A picture is worth a thousand words," said tech blogger Hailalu to "China Entrepreneur." In most cases, it is difficult for users to describe the front - end interface they want to code in words at once. The core significance of visual understanding is to upgrade large models from "reading text" to "understanding and using information."

K2.5 is Dark Side of the Moon's first answer in visual understanding. The team combined a native multimodal architecture design with joint pre-training on large-scale visual and text data, using approximately 15 trillion tokens of continued training. On this foundation, they built a Visual Agentic Intelligence system: in short, K2.5 starts from visual-understanding encoding, decomposes Agent tasks, and strengthens coding ability.

A person close to Dark Side of the Moon told "China Entrepreneur" that the most practical difficulties in training lie in the scarcity of multimodal data and in data processing. "The photos ordinary people take every day are of little use to the model. High-quality data is needed for the model to learn; even Wikipedia only provides medium-quality data."

In addition, Dark Side of the Moon also adhered to the pursuit of "technical taste" in K2.5. "If you want the model to be more romantic and proficient in software UI interfaces and aesthetic design, you need to match it with appropriate data, which requires more aesthetic awareness of the world," said the aforementioned person.


Early on January 29th, the Dark Side of the Moon team answered questions from netizens on Reddit. Yang Zhilin said, "The core of the model lies in taste, because intelligence itself is non-fungible."

Hailalu commented that Kimi is the first domestic model with good coding ability to truly "open its eyes." AI practitioner Xu Zaishi added that the biggest difference between K2.5 and other multimodal models is how tightly it combines vision, coding, and Agent capabilities, which lowers the development threshold and lets non-programmers create prototypes from screenshots and screen recordings.

In addition to front-end design, alongside K2.5 Kimi also launched Kimi Code, which runs directly in the terminal and integrates into mainstream editors such as VSCode and Cursor. Simply put, K2.5 can automatically detect the user's coding process and migrate the user's existing Skills (skill packages for AI Agents) into a new workflow.

While K2.5 focuses on solving problems at the engineering level, DeepSeek has made more innovations at the source of visual technology.

Traditional vision-language models (VLMs) usually scan images in a fixed order, left to right and top to bottom. Humans, by contrast, read images in their own semantic order, such as looking at the title first and then the body text.

OCR-2 mimics this human logic. It replaces the original CLIP encoder with a brand-new visual encoder, DeepEncoder V2. The architecture breaks the constraint of scanning images in a fixed order (top-left to bottom-right) and imitates the "causal flow" of human vision.
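DeepSeek has not published DeepEncoder V2's internals, so the following toy Python sketch only contrasts the two ideas described above: a fixed raster scan of image patches versus a content-driven ordering. The patch labels and saliency scores are entirely made-up stand-ins for whatever semantic signal the real encoder learns.

```python
# Illustrative sketch only (not DeepEncoder V2): fixed raster order vs.
# a dynamic, content-driven "causal flow" order over image patches.

def raster_order(patches):
    """Fixed left-to-right, top-to-bottom scan: just the index order."""
    return list(range(len(patches)))

def causal_flow_order(patches, saliency):
    """Hypothetical dynamic order: visit semantically salient patches
    (e.g. a title region) before less informative ones."""
    return sorted(range(len(patches)), key=lambda i: -saliency[i])

# Toy page of 4 patches; patch 2 is the "title", patch 1 is the margin.
patches = ["body", "margin", "title", "figure"]
saliency = [0.6, 0.1, 0.9, 0.4]  # assumed scores, highest = read first

print(raster_order(patches))                 # [0, 1, 2, 3]
print(causal_flow_order(patches, saliency))  # [2, 0, 3, 1]
```

The point of the contrast: the raster order is identical for every image, while the causal-flow order changes with the content, which is the behavior the article attributes to OCR-2.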

From this perspective, although both DeepSeek and Dark Side of the Moon are shoring up the shortcoming of visual understanding, their innovations sit in different links of the chain. K2.5 improves engineering performance on top of multimodal models and is closer to commercial deployment, while DeepSeek focuses more on innovating at the source of the technology.

Clusters Redefine Agents

In addition to visual understanding, the Agent cluster function of K2.5 has also been praised by many industry insiders.

Xu Zaishi works on pre-training for large language models. He noted that Anthropic's Claude Opus performs outstandingly in programming scenarios, partly because it is very good at executing tasks through tool calls, whereas many language models have a high error rate in tool calls. The Agent Swarm architecture introduced by K2.5, which evolves from a single Agent to an Agent cluster, marks a key improvement in the model's capabilities.

In Yang Zhilin's introduction of the Agent cluster, K2.5 is no longer a lone agent that handles everything itself but an "agent team" assembled on the fly. When a task is assigned, the main Agent can generate hundreds of "sub-Agents" and coordinate them. Compared with the single-agent mode, task execution efficiency can improve by up to 4.5 times.

The Dark Side of the Moon team demonstrated a video in which they fed 40 papers on psychology and AI to the Kimi Agent cluster. K2.5 first read all the papers in order through multiple tool calls, then spawned several sub-Agents to write different chapters. Finally, the main Agent reviewed and accepted the results, compiling everything into a PDF review dozens of pages long.
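The coordination pattern described above (a main agent fanning out parallel sub-agents with isolated working memory, then reviewing only their returned results) can be sketched with Python's asyncio. This is a structural illustration under assumed names (`main_agent`, `sub_agent`), not Dark Side of the Moon's implementation; real sub-agents would make model and tool calls where this sketch merely yields.

```python
# Structural sketch of a main-agent/sub-agent fan-out (not K2.5's
# actual Agent Swarm code). Each sub-agent works in its own scope and
# returns only a result string to the main agent.
import asyncio

async def sub_agent(chapter, papers):
    # A sub-agent's locals act as its private "working memory"; only
    # the final summary crosses back to the scheduler.
    await asyncio.sleep(0)  # stand-in for model inference / tool calls
    return f"{chapter}: synthesized from {len(papers)} papers"

async def main_agent(papers, chapters):
    # Fan out one sub-agent per chapter and run them concurrently.
    drafts = await asyncio.gather(
        *(sub_agent(ch, papers) for ch in chapters)
    )
    # The main agent's "review and assembly" step: join the drafts.
    return "\n".join(drafts)

papers = [f"paper_{i}" for i in range(40)]
chapters = ["Background", "Methods", "Findings"]
print(asyncio.run(main_agent(papers, chapters)))
```

Because `asyncio.gather` collects results in task order, the main agent sees a deterministic set of drafts even though the sub-agents ran concurrently, which mirrors the article's point that sub-agents return results to the scheduler only when needed.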

Running hundreds of Agents concurrently is not easy, and balanced scheduling is a major difficulty. Early in training, sub-Agents may abandon the parallel strategy when coordination fails. The Dark Side of the Moon team adopted the PARL (Parallel Agent Reinforcement Learning) training method, guiding the model to establish stable preferences through staged rewards.

In addition, when 100 Agents work simultaneously, communication and compute generate a huge load, and the Agents may duplicate information and interfere with each other, ending up less efficient than a single model. The team has to teach the model to communicate autonomously and to adjust the number of agents and the resource allocation dynamically.

According to "China Entrepreneur," the entire Agent cluster of K2.5 is automatically created and coordinated by the K2.5 model, and users do not need to pre - define sub - agents or workflows. Even if a sub - agent fails, the main Agent can quickly sense and re - schedule.

Xu Zaishi explained that the absence of pre-definition means the K2.5 Agent cluster divides labor dynamically: the model itself decides what roles the task needs and automatically creates sub-Agents to work in parallel.

On January 29th, answering netizens' questions about how K2.5's "agent swarm" handles latency and context loss when running 100 parallel inference streams, Wu Yuxin, co-founder of Dark Side of the Moon, said that each sub-agent of K2.5 can execute its sub-task independently without "corroding" or polluting the main scheduler's context. That is, sub-Agents essentially have their own working memories and return only their results to the scheduler when needed.

"Since K2, the Dark Side of the Moon team has taken every step steadily," said Xu Zaishi. Although he believes that the product form of Dark Side of the Moon still needs time to be polished, in the long run, the breakthrough in Agent cluster technology is of great value. "This means that future agents will no longer require manual design of workflows, truly realizing the liberation of human labor."

This article is from the WeChat official account "China Entrepreneur Magazine" (ID: iceo-com-cn), author: Sun Xin, published by 36Kr with authorization.