DeepSeek's latest blockbuster model: A major breakthrough in the VLM architecture enables AI to read images like humans.
▲ The cover image is generated by AI.
According to a January 27 report by Zhidongxi, DeepSeek has just open-sourced DeepSeek-OCR 2, its dedicated model for OCR scenarios, together with its technical report. The model is an upgrade of last year's DeepSeek-OCR. Its new encoder lets the model view images and read documents more the way humans do, rather than like a mechanical scanner.
Put simply, previous models read an image in a fixed raster order, sweeping from the top-left to the bottom-right. DeepSeek-OCR 2, in contrast, first grasps the document's structure and then reads through it step by step. This new mode of visual understanding lets it handle complex layouts and reading orders, formulas, and tables more accurately.
On the document understanding benchmark OmniDocBench v1.5, DeepSeek-OCR 2 scored 91.09%. With the training data and decoder essentially unchanged, that is a 3.73-percentage-point improvement over DeepSeek-OCR. Among end-to-end OCR models this is a SOTA result, though it still trails Baidu's PaddleOCR-VL pipeline (92.86%).
Meanwhile, under a similar visual-token budget, DeepSeek-OCR 2 achieves a lower edit distance in document parsing (the number of edits needed to turn the output into the correct text) than Gemini-3 Pro, showing that it maintains a high visual-token compression ratio while still delivering strong performance.
DeepSeek-OCR 2 offers dual value: it can serve exploratory research as a new VLM (vision-language model) architecture, and it can serve as a practical tool for generating high-quality pre-training data for large language model training.
Paper link: https://github.com/deepseek-ai/DeepSeek-OCR-2/blob/main/DeepSeek_OCR2_paper.pdf
Open-source address: https://github.com/deepseek-ai/DeepSeek-OCR-2?tab=readme-ov-file
01. Large models can't understand complex document structures? Observe the whole before reading
In terms of architecture, DeepSeek-OCR 2 inherits the overall architecture of DeepSeek-OCR, which consists of an encoder and a decoder. The encoder discretizes the image into visual tokens, and the decoder generates output based on these visual tokens and text prompts.
The key difference lies in the encoder: DeepSeek upgraded the previous DeepEncoder to DeepEncoder V2. It retains all the original capabilities but replaces the original CLIP-based encoder with an LLM-based one. At the same time, it introduces causal reasoning through a new architectural design.
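To make the division of labor concrete, here is a minimal sketch of an encoder-decoder OCR pipeline of this kind; the module choices, dimensions, and names below are illustrative assumptions, not DeepSeek's actual implementation.

```python
import torch
import torch.nn as nn

class OCRPipelineSketch(nn.Module):
    """Illustrative encoder-decoder split: the encoder turns pixels into
    visual tokens, the decoder generates text conditioned on those tokens
    plus a text prompt. All modules and dimensions are placeholders."""
    def __init__(self, d_model=1024, vocab_size=32000):
        super().__init__()
        self.encoder = nn.Sequential(                          # stands in for DeepEncoder V2
            nn.Conv2d(3, d_model, kernel_size=16, stride=16),  # patchify the image
            nn.Flatten(2),                                     # (B, d_model, N_patches)
        )
        self.decoder = nn.TransformerDecoder(                  # stands in for the MoE LLM decoder
            nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True),
            num_layers=2,
        )
        self.embed = nn.Embedding(vocab_size, d_model)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, image, prompt_ids):
        vis_tokens = self.encoder(image).transpose(1, 2)   # (B, N, d_model) visual tokens
        txt = self.embed(prompt_ids)                        # (B, T, d_model) text prompt
        hidden = self.decoder(tgt=txt, memory=vis_tokens)   # decoder cross-attends to visual tokens
        return self.lm_head(hidden)                         # next-token logits

# usage: logits = OCRPipelineSketch()(torch.randn(1, 3, 224, 224),
#                                     torch.zeros(1, 8, dtype=torch.long))
```

The point of the split is that the decoder never sees raw pixels; it only consumes the compact visual tokens the encoder emits.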
The core problem DeepEncoder V2 targets is that when a two-dimensional structure is mapped onto a one-dimensional sequence with a fixed linear order, the model is inevitably constrained by that order when modeling spatial relationships.
This may be acceptable for natural images, but in scenarios with complex layouts, such as OCR, tables, and forms, the linear order often mismatches the real semantic organization badly, limiting the model's ability to express visual structure.
How does DeepEncoder V2 alleviate this problem? First, it uses a visual tokenizer to represent the image efficiently. Through window attention it achieves roughly 16x token compression, which sharply cuts the subsequent global-attention compute and GPU memory overhead while retaining sufficient local and medium-scale visual information.
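As a rough illustration of what ~16x compression via windowed processing can look like, the sketch below attends within non-overlapping 4x4 windows of patch tokens and merges each window into one token; the window size, merge rule, and dimensions are assumptions for illustration only, not the paper's design.

```python
import torch
import torch.nn as nn

class WindowCompressSketch(nn.Module):
    """Hypothetical ~16x token compression: attend locally within 4x4 windows
    of patch tokens, then merge each window into a single token."""
    def __init__(self, dim=1024, window=4):
        super().__init__()
        self.window = window
        self.local_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.merge = nn.Linear(dim * window * window, dim)

    def forward(self, patch_tokens, grid_hw):
        B, N, D = patch_tokens.shape
        H, W = grid_hw                              # patch grid, H * W == N
        w = self.window
        # group tokens into non-overlapping w x w windows
        x = patch_tokens.view(B, H // w, w, W // w, w, D)
        x = x.permute(0, 1, 3, 2, 4, 5).reshape(B * (H // w) * (W // w), w * w, D)
        x, _ = self.local_attn(x, x, x)             # window attention: cost scales with w^2, not N
        x = self.merge(x.reshape(x.shape[0], -1))   # 16 tokens -> 1 token per window
        return x.view(B, (H // w) * (W // w), D)    # N/16 compressed visual tokens

# e.g. a 32x32 patch grid (1024 tokens) compresses to 64 tokens:
# out = WindowCompressSketch()(torch.randn(1, 1024, 1024), (32, 32))
```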
It does not rely on positional encoding to impose a semantic order on the visual tokens. Instead, it introduces causal flow queries that reorder and distill the visual tokens in a content-aware manner. The resulting order is not dictated by a spatial unrolling rule; the model generates it step by step after observing the global visual context, avoiding a strong dependence on any fixed one-dimensional order.
Each causal query can attend to all visual tokens and to all previous queries. While keeping the number of tokens unchanged, this semantically reorders the visual features and distills their information. In the end, only the outputs of the causal queries are sent to the downstream LLM decoder.
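Based only on that description, the following sketch shows one plausible way such causal queries could be wired: the queries self-attend under a causal mask, cross-attend to every visual token, and only their outputs move on. All names and implementation details are assumed.

```python
import torch
import torch.nn as nn

class CausalQuerySketch(nn.Module):
    """Rough sketch of the causal-query idea: N learnable queries self-attend
    with a causal mask (each query sees only earlier queries) and cross-attend
    to ALL visual tokens, yielding a semantically reordered sequence of the
    same length. Hyperparameters here are placeholders."""
    def __init__(self, dim=1024, num_queries=256, num_layers=2):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        layer = nn.TransformerDecoderLayer(dim, nhead=8, batch_first=True)
        self.blocks = nn.TransformerDecoder(layer, num_layers=num_layers)

    def forward(self, visual_tokens):
        B = visual_tokens.shape[0]
        q = self.queries.unsqueeze(0).expand(B, -1, -1)        # (B, N_q, dim)
        causal = nn.Transformer.generate_square_subsequent_mask(q.shape[1])
        # self-attention among queries is causal; cross-attention sees every visual token
        out = self.blocks(tgt=q, memory=visual_tokens, tgt_mask=causal)
        return out   # only these query outputs would be passed to the LLM decoder

# out = CausalQuerySketch()(torch.randn(1, 256, 1024))  # same token count in and out
```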
This design essentially forms a two-level cascaded causal reasoning process: first, the encoder semantically orders the disordered visual tokens through causal queries. Then, the LLM decoder performs autoregressive reasoning on this ordered sequence.
Compared with the method of forcibly imposing a spatial order through positional encoding, the order induced by causal queries is more in line with the visual semantics themselves, that is, in line with the normal habits of human reading.
Because DeepSeek-OCR 2 focuses on improving the encoder, the decoder component is not upgraded. Following this design principle, DeepSeek retains the decoder of DeepSeek-OCR: a 3B-parameter MoE model with about 500 million active parameters.
02. Achieved a score of 91.09% in OmniDocBench, with an edit distance lower than Gemini-3 Pro
To verify the effectiveness of the above design, DeepSeek conducted experiments. The research team trained DeepSeek-OCR 2 in three stages: encoder pre-training, query enhancement, and decoder specialization.
In the first stage, the visual tokenizer and the LLM-style encoder acquire the basic abilities of feature extraction, token compression, and token reordering. In the second stage, the encoder's token reordering ability is further enhanced, and visual knowledge compression is also strengthened. In the third stage, the encoder parameters are frozen, and only the decoder is optimized, thus achieving higher data throughput under the same FLOPs.
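For the third stage, the recipe reduces to a standard freeze-and-finetune setup. Below is a minimal sketch under the assumption that the model exposes `encoder` and `decoder` submodules (hypothetical names); the optimizer choice and learning rate are assumptions as well.

```python
import torch

def stage3_optimizer(model, lr=1e-4):
    """Minimal sketch of the third training stage as described in the article:
    freeze the encoder parameters and optimize only the decoder."""
    for p in model.encoder.parameters():
        p.requires_grad = False          # encoder is frozen in stage 3
    decoder_params = [p for p in model.decoder.parameters() if p.requires_grad]
    return torch.optim.AdamW(decoder_params, lr=lr)

# optimizer = stage3_optimizer(model)  # then run the usual training loop, updating the decoder only
```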
To evaluate the model's effectiveness, DeepSeek selected OmniDocBench v1.5 as the main evaluation benchmark. This benchmark includes 1355 document pages, covering 9 major categories in Chinese and English (including magazines, academic papers, research reports, etc.).
DeepSeek-OCR 2 achieved 91.09% using only the smallest visual-token budget. Compared with the DeepSeek-OCR baseline trained on similar data sources, that is an improvement of 3.73 percentage points, verifying the effectiveness of the new architecture.
In addition to the overall improvement, the edit distance (ED) of the reading order (R-order) also decreased significantly (from 0.085 to 0.057). This indicates that the new DeepEncoder V2 can effectively select and arrange initial visual tokens based on image information.
Under a similar visual token budget (1120), DeepSeek-OCR 2 (0.100) has a lower edit distance in document parsing than Gemini-3 Pro (0.115), further proving that the new model maintains a high compression ratio of visual tokens while ensuring performance.
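For reference, the edit-distance metric behind these numbers is typically a character-level Levenshtein distance normalized by the reference length, so 0.0 means a perfect parse; the benchmark's exact protocol may differ. A compact implementation looks like this.

```python
def edit_distance(pred: str, ref: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, and substitutions
    needed to turn `pred` into `ref` (space-optimized dynamic programming)."""
    m, n = len(pred), len(ref)
    dp = list(range(n + 1))
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,                            # deletion
                        dp[j - 1] + 1,                        # insertion
                        prev + (pred[i - 1] != ref[j - 1]))   # substitution
            prev = cur
    return dp[n]

def normalized_ed(pred: str, ref: str) -> float:
    """Edit distance divided by reference length; lower is better."""
    return edit_distance(pred, ref) / max(len(ref), 1)

# normalized_ed("DeepSeek-OCR 2", "DeepSeek-OCR2")  # small non-zero value
```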
However, DeepSeek-OCR 2 is not all-powerful. On newspapers with extremely high text density, its recognition quality falls short of its results on other document types. This could be addressed in the future by increasing the number of local crops or by providing more such samples during training.
03. Conclusion: May become the beginning of a new VLM architecture
DeepEncoder V2 provides preliminary verification of the feasibility of the LLM-style encoder in visual tasks. More importantly, DeepSeek's research team believes that this architecture has the potential to evolve into a unified full-modal encoder. Such an encoder can compress text, extract speech features, and reorganize visual content within the same parameter space.
DeepSeek says the optical compression in DeepSeek-OCR represents a preliminary exploration toward native multimodality. Going forward, the team will continue to explore integrating additional modalities through this shared encoder framework, which could mark the beginning of a new VLM architecture for research and exploration.
This article is from the WeChat official account “Zhidongxi” (ID: zhidxcom), written by Chen Junda, and is published by 36Kr with authorization.