Domestic large models turn in different directions on the same day: DeepSeek goes left, Kimi goes right. Has the era of competing on real-world deployment begun?
On January 27th, two of the most closely watched domestic large-model startups almost simultaneously released their latest and most significant open-source updates:
DeepSeek released and open-sourced DeepSeek-OCR 2, a major upgrade to DeepSeek-OCR, the model that shook the industry last year; Kimi released and open-sourced K2.5, continuing to advance its long-context, multi-modal, "agent-enabled" approach.
On the surface, these are two model iterations in different directions.
DeepSeek-OCR 2 re-answers the question of how a model should "read" information. Through a new visual encoding mechanism, it lets the large model learn something like human visual logic, compressing what used to be expensive, lengthy text input into higher-density "visual semantics."
Put simply, it attempts to change the way AI "reads documents." Instead of breaking an entire document into thousands of individual characters and "force-reading" them, the model can first look at the layout, grasp the key points, and then understand the meaning, just as a human would. This means that in the future, having AI read long documents, search for information, and extract tables may be faster, cheaper, and more reliable.
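To make the contrast concrete, here is a minimal back-of-the-envelope sketch in Python. The helper names and numbers are purely illustrative assumptions, not DeepSeek's actual interface or figures; the point is only that a page rendered as an image can be summarized by a small, fixed budget of "vision tokens," while the raw text of the same page can expand into far more text tokens.

```python
# Illustrative sketch only: the numbers and helper names are hypothetical,
# not DeepSeek-OCR 2's actual interface. The idea: a page rendered as an
# image can be represented by a small, fixed grid of vision tokens, while
# the raw text of the same page may expand into thousands of text tokens
# that all compete for the context window.

def count_text_tokens(page_text: str, chars_per_token: float = 3.5) -> int:
    """Rough text-token estimate: characters divided by an average token length."""
    return max(1, round(len(page_text) / chars_per_token))

def count_vision_tokens(grid_h: int = 16, grid_w: int = 16) -> int:
    """A vision encoder typically emits a fixed grid of patch embeddings per page."""
    return grid_h * grid_w  # e.g. 256 vision tokens, however dense the text is

if __name__ == "__main__":
    dense_page = "x" * 4000              # a dense report page, ~4,000 characters
    text_budget = count_text_tokens(dense_page)   # ~1,143 text tokens
    vision_budget = count_vision_tokens()          # 256 vision tokens
    print(f"text tokens:   {text_budget}")
    print(f"vision tokens: {vision_budget}")
    print(f"compression:   {text_budget / vision_budget:.1f}x fewer tokens for the same page")
```

Under these toy assumptions, reading the page visually costs roughly four to five times fewer tokens, which is the kind of saving that makes long-document workloads cheaper.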
Image source: DeepSeek
Kimi K2.5 heads in a different direction: rather than just answering questions, it pushes AI one step closer to "being able to do work." Longer memory, stronger multi-modal understanding, and the ability to break down and execute complex tasks all point to an experience closer to a "digital assistant" than to a chat window that only makes conversation.
Kimi claims it is its most intelligent and versatile model to date, supporting both visual and text input, thinking and non-thinking modes, and both conversation and Agent tasks.
Image source: Kimi
One focuses on transforming the input efficiency of language models; the other focuses on general intelligence and the ability to collaborate on complex tasks. Yet together they point to a more important question: the upgrading of large-model capabilities is shifting from "parameters and dialogue ability" toward a more fundamental, engineering-oriented reconstruction of what these models can do.
What AI is upgrading is no longer just a smarter brain.
The Evolution from Input to Execution: Two Upgrade Routes for Domestic AI
The release of DeepSeek-OCR last year made the industry realize for the first time that the way large models take in information, word by word and token by token, could be redesigned. The newly released DeepSeek-OCR 2 tackles an even more specific and difficult problem: how exactly should a model "read" a complex document?
In the past, the way AI processed documents was very mechanical. Whether it was a PDF, a contract, or a financial report, the document was essentially broken into paragraphs of text and fed into the model sequentially. The problems with this approach are obvious:
On the one hand, long documents quickly consume the context window, resulting in high costs and low efficiency; on the other hand, the relationships among tables, multi-column layouts, annotations, and the main text are often destroyed in the process of being chopped into characters.
DeepSeek's answer in OCR 2 is to push its "visual encoding" idea further: instead of treating the document as a string of text, it regards the document as a visual object to be "read."
Compared with the first generation, the key change in OCR 2 is not just the compression ratio but the introduction of a logic closer to how humans read, shifting from the previous CLIP (slicing) architecture to an LM (language model) visual encoder based on Qwen2. The model no longer processes the whole page uniformly and all at once; it learns to distinguish the structure:
where the title is, where the tables are, which pieces of information relate to one another, which should be read first, and which can wait.
Schematic diagram of operation. Image source: DeepSeek
In other words, it begins to understand that "the layout itself is part of the information."
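A toy sketch can make "layout as information" concrete: assign each detected region of a page a structural role and a reading priority, then process regions in that order instead of as one flat character stream. The roles, priorities, and data structures below are hypothetical illustrations, not DeepSeek-OCR 2's actual pipeline.

```python
# Toy illustration of "layout as information": instead of reading a page as
# one flat character stream, give each detected region a role and a reading
# priority, then process regions in that order. Conceptual sketch only,
# not DeepSeek-OCR 2's real pipeline.
from dataclasses import dataclass

@dataclass
class Region:
    role: str       # e.g. "title", "table", "body", "footnote"
    text: str
    y: float        # vertical position on the page, 0.0 = top

# Hypothetical priorities: titles anchor the page, footnotes can wait.
PRIORITY = {"title": 0, "table": 1, "body": 2, "footnote": 3}

def reading_order(regions: list[Region]) -> list[Region]:
    """Order regions by structural role first, then by position on the page."""
    return sorted(regions, key=lambda r: (PRIORITY.get(r.role, 9), r.y))

if __name__ == "__main__":
    page = [
        Region("footnote", "Source: annual filing", 0.95),
        Region("body", "Revenue grew 12% year over year...", 0.40),
        Region("table", "Q1 | Q2 | Q3 | Q4", 0.25),
        Region("title", "FY2024 Financial Summary", 0.05),
    ]
    for r in reading_order(page):
        print(f"{r.role:9s} -> {r.text}")
```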
The direct value of this change shows up not in abstract judgments like "the model is smarter" but in a series of very concrete improvements to the user experience. For example, when you ask AI to skim a report dozens of pages long, it no longer has to read every single character before reaching a conclusion; when it processes complex tables, column misalignment and field mismatches no longer occur as often.
More importantly, because the input is highly compressed, the same task can be completed at lower cost and in less time. This is why DeepSeek-OCR 2 matters so much for real-world AI applications: it has the potential to make AI fit naturally into real document workflows, whether for retrieval, comparison, summarization, or structured information extraction.
In this sense, OCR 2 is solving not a problem of model capability but a long-standing problem of usability.
While DeepSeek-OCR 2 redesigns the "input end" of AI, Kimi K2.5 focuses on the ability of AI agents to complete complex tasks.
Today, AI can answer almost any question, however complex. But once a task involves multiple steps and multiple materials, and requires repeatedly referring back to the context, the model often loses track of what came before or stops at the level of suggestions. Even though AI capabilities are already fairly mature, many users still run into exactly this experience.
In K2.5, Kimi keeps betting on "long memory + multi-modality + agents." In essence, it is trying to shift AI from "question-answering mode" to "execution mode."
On the one hand, long context allows the model to remember conversations, materials, and intermediate conclusions for longer, reducing the cost of repeated explanation; on the other hand, multi-modal capabilities let AI process not only text but also pictures, interface screenshots, and even more complex forms of input.
More crucially, it keeps strengthening its "agent" capabilities. Kimi no longer just tells you "what should be done"; it tries to break the task down into multiple steps and has implemented an "Agent cluster" that can call different capabilities at different stages and finally deliver a relatively complete result. This ability determines whether AI can truly enter the workflow rather than remaining a consulting assistant.
Image source: Kimi
This is also why Kimi K2.5 emphasizes that it is "more versatile." What it pursues is not the limit of any single capability but the ability to take on longer, more complex, more realistic chains of tasks.
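The "execution mode" described above can be illustrated with a generic agent loop: break a task into steps, call a different capability at each step, and carry intermediate results forward so later steps can build on earlier ones. The sketch below is a deliberately simplified, hypothetical pattern, not Kimi's actual implementation; every name in it is a placeholder.

```python
# Minimal, generic sketch of an "execution mode" agent loop: decompose a task
# into steps, call a different capability for each step, and keep intermediate
# results in memory. NOT Kimi K2.5's implementation; all names are placeholders.
from typing import Callable

# Hypothetical capability registry: each "tool" is just a function here.
TOOLS: dict[str, Callable[[str], str]] = {
    "search":    lambda q: f"[search results for: {q}]",
    "read_doc":  lambda q: f"[key points extracted from: {q}]",
    "summarize": lambda q: f"[summary of: {q}]",
}

def plan(task: str) -> list[tuple[str, str]]:
    """Stand-in planner: break a task into (tool, input) steps.
    A real agent would ask the model itself to produce this plan."""
    return [
        ("search", task),
        ("read_doc", task),
        ("summarize", task),
    ]

def run_agent(task: str) -> str:
    memory: list[str] = []                 # intermediate conclusions the agent "remembers"
    for tool_name, tool_input in plan(task):
        context = " | ".join(memory)       # earlier results stay available at every step
        prompt = f"{tool_input} (context: {context})" if context else tool_input
        memory.append(TOOLS[tool_name](prompt))
    return memory[-1]                      # the final step's output is the deliverable

if __name__ == "__main__":
    print(run_agent("compare Q3 revenue across the three attached reports"))
```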
In This Round, Large Models Are Competing on "Whether They Can Really Be Used"
Looking beyond DeepSeek-OCR 2 and Kimi K2.5, the upgrade directions of mainstream large models over the past six months turn out to be surprisingly consistent. Whether it is OpenAI's GPT-5.2, Anthropic's Claude 4.5, Google's Gemini 3, ByteDance's Doubao 1.8, or Alibaba's Qianwen Qwen3-Max-Thinking, they have all shifted their focus from "how powerful the model is" to a more practical question:
how to make AI penetrate deeper into real-world work environments.
This is why, in this round of upgrades, there is less emphasis on parameter scale and single-point capabilities and more focus on a few things: the ability to remember, to understand, to handle processes, and to complete tasks.
First, the ability of "memory" has been collectively enhanced.
In the past, large models were more like short-term conversation experts: good at answering questions on the spot but hard to collaborate with over time. Once a task grew longer and the materials piled up, users had to keep repeating the background information. The upgrades in this recent batch of models are almost all aimed at this pain point: longer context and more stable state maintenance let the model follow a task through without "losing its memory" after a few steps.
GPT-5.2 has directly productized long context and different inference modes, while Kimi K2.5 has embedded long context into the agent process, so the model can remember intermediate results during multi-step execution. These changes are making AI not just answer a question but actually help users complete a task.
Second, there is a new understanding of the concept of "seeing."
If multi-modality used to mean mainly "being able to recognize images," the focus of this round of upgrades is "whether the model can understand them." DeepSeek-OCR 2 represents a more radical and practical direction: instead of treating vision as a preliminary step toward text, it treats vision as information in its own right, letting the model first grasp structure, layout, and relationships, as a human would, before moving to the semantic level.
This change is not limited to the document scenario. Whether it is GPT, Claude, or Gemini, they are all strengthening their ability to understand screenshots, interfaces, and complex images.
Image source: Gemini
Information in the real world does not naturally arrive as line-by-line text. When a model begins to truly understand how information is organized within an image, AI can be woven more naturally into real environments rather than living only in a pure-text dialogue box.
Third, and perhaps the most easily overlooked yet crucial change in this round of upgrades, is the shift in AI's role.
In the past, large models were more like "advisors": they offered suggestions and answers but were not responsible for results. Now, more and more models are designed as "executors." Kimi K2.5's emphasis on agents is essentially about teaching the model to break down tasks, use tools, and run processes; GPT-5.2's combination of different inference modes with tool calls likewise aims to narrow the gap between "suggestion" and "execution."
When AI starts to take over an entire process rather than just a single question, the criteria for evaluating its value change as well. What matters is no longer "whether it answers correctly" but "whether it can carry the process through, and do so reliably." This is why "engineering" has come up again and again in this round of upgrades.
Domestic AI has been particularly active here. DeepSeek, Kimi, Qianwen, and Doubao all emphasize whether the model is easy to deploy, easy to integrate into existing systems, and able to run in real businesses. Meanwhile, AI vendors at home and abroad have spent the past year on stronger product packaging, hiding complex capabilities behind interfaces and services. The goal is the same: to make AI not merely "demonstrable" but "usable" and "easy to use."
Conclusion
No model has yet achieved Artificial General Intelligence (AGI). But over a longer timeline, more of the change is happening in less eye-catching places: input methods are being redesigned, tasks are being broken down and taken over, and models are being asked to stay stable through longer and more complex processes.
When models are seriously integrated into real daily life and work, and are repeatedly verified and called, the criteria for measuring their value change as well. It is no longer about who has more parameters or more impressive answers, but about who is cheaper, makes fewer mistakes, and is more worth relying on over the long term.
From this perspective, the significance of DeepSeek-OCR 2 and Kimi K2.5 lies not only in the specific problems each solves but in the more practical consensus they represent: the next step for AI to enter the real world must go beyond question answering.
This article is from "Lei Technology" and is published by 36Kr with permission.