
DeepSeek wins first place again: pioneering "causal flow" visual reasoning and surpassing Gemini.

新智元 · 2026-01-27 19:50
DeepSeek-OCR2 is open-sourced, introducing a causal-flow vision encoder and setting a new SOTA.

[Introduction] DeepSeek has open-sourced DeepSeek-OCR2, introducing a brand-new visual encoder, DeepEncoder V2. The architecture breaks with the fixed scanning order of traditional models (top-left to bottom-right) and instead mimics the "Causal Flow" logic of human vision.

DeepSeek has updated again!

This time, it's a major upgrade of the DeepSeek-OCR model: DeepSeek-OCR2.

Remember the previous generation, DeepSeek-OCR? The model that compressed everything into visual tokens.

This time, DeepSeek goes a step further and targets the visual encoder, proposing the brand-new DeepEncoder V2 architecture and shifting the visual-encoding paradigm from "fixed scanning" to "semantic reasoning"!

DeepSeek-OCR2 not only reads complex documents in a logical, human-like order, but also sets a new state of the art on multiple benchmarks.

Of course, following DeepSeek's convention, the Paper, Code, and Model are all open-sourced!

Project address: https://github.com/deepseek-ai/DeepSeek-OCR-2

Model download: https://huggingface.co/deepseek-ai/DeepSeek-OCR-2

Paper address: https://github.com/deepseek-ai/DeepSeek-OCR-2/blob/main/DeepSeek_OCR2_paper.pdf
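For readers who want to try it, here is a minimal loading sketch, assuming the Hugging Face repo exposes a transformers-compatible interface via trust_remote_code, as the first-generation DeepSeek-OCR did; the actual inference entry point may differ, so the final call is left as a hypothetical comment.

```python
# Minimal loading sketch. Assumption: the repo ships a custom modeling file
# loadable via trust_remote_code, as the first-generation DeepSeek-OCR did;
# the exact inference entry point may differ.
from transformers import AutoModel, AutoTokenizer

model_id = "deepseek-ai/DeepSeek-OCR-2"

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModel.from_pretrained(model_id, trust_remote_code=True).eval()

# Hypothetical call (name and signature are assumptions, check the repo README):
# result = model.infer(tokenizer, prompt="<image>\nConvert the document to markdown.",
#                      image_file="page.png")
```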

The core innovation of DeepSeek-OCR2 lies in endowing the model with causal reasoning ability through DeepEncoder V2.

This is like equipping the machine with human reading logic: instead of rigidly scanning images from top-left to bottom-right, the AI flexibly adjusts its reading order according to the semantics of the content.

DeepSeek-OCR2, Visual Causal Flow

DeepSeek points out in the paper that traditional vision-language models (VLMs) usually process images in a fixed raster-scan order: left to right, top to bottom.

This method forcibly flattens a 2D image into a 1D sequence, ignoring the semantic structure inside the image.
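As a concrete illustration (mine, not code from the paper), this is what ViT-style raster-scan flattening looks like in plain PyTorch: the 2D patch grid becomes a 1D sequence ordered purely by position, with no regard for titles, columns, or tables.

```python
# Illustration only (not from the paper): ViT-style raster-scan flattening.
# The 2-D patch grid becomes a 1-D sequence ordered strictly left-to-right,
# top-to-bottom, regardless of where titles, columns, or tables sit.
import torch

image = torch.randn(1, 3, 224, 224)   # (batch, channels, H, W)
patch = 16                            # patch side length in pixels

patches = image.unfold(2, patch, patch).unfold(3, patch, patch)
# -> (1, 3, 14, 14, 16, 16): a 14x14 grid of 16x16 patches
sequence = patches.permute(0, 2, 3, 1, 4, 5).reshape(1, 14 * 14, -1)
# -> (1, 196, 768): 196 tokens whose order encodes position, not semantics
print(sequence.shape)
```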

This is obviously contrary to human visual habits.

When humans look at pictures or read documents, their eyes follow a logical flow: first the title, then the body text; tables are scanned row by row or column by column; and in multi-column layouts, the eyes automatically jump to the next column.

To solve this problem, DeepSeek-OCR2 introduces DeepEncoder V2.

Its biggest feature is replacing the original CLIP encoder with a lightweight large language model (Qwen2-0.5B) and designing a unique "Causal Flow Query" mechanism.

Detailed Explanation of the DeepEncoder V2 Architecture

DeepEncoder V2 mainly consists of two parts:

1. Vision Tokenizer

It follows the design of SAM-base (80M parameters) plus a convolutional layer, converting images into visual tokens.

2. LLM as the Visual Encoder

Here, DeepSeek uses a Qwen2-0.5B model.

It not only processes visual tokens but also introduces a set of learnable "Query Tokens".

The key innovation lies in the design of the attention mask:

Bidirectional attention is used among the visual tokens to maintain global perception, similar to a ViT.

Causal attention is used for the query tokens: each query token can only see the tokens before it (see the sketch below).
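A small sketch (my own PyTorch illustration, not code from the release) makes this hybrid mask concrete. Whether visual tokens may attend back to the queries is not stated in the article; the sketch assumes they may not.

```python
# Illustration (not code from the release) of the hybrid attention mask:
# visual tokens attend bidirectionally among themselves; each query token
# attends to all visual tokens (which precede it) and to earlier queries.
import torch

def deepencoder_v2_style_mask(n_vis: int, n_query: int) -> torch.Tensor:
    """Boolean mask where True means 'row token may attend to column token'."""
    n = n_vis + n_query
    mask = torch.zeros(n, n, dtype=torch.bool)

    # Bidirectional block: every visual token sees every visual token.
    mask[:n_vis, :n_vis] = True

    # Causal block: query i sees all visual tokens plus queries 0..i.
    # (Assumption: visual tokens do not attend back to the queries.)
    mask[n_vis:, :n_vis] = True
    mask[n_vis:, n_vis:] = torch.tril(torch.ones(n_query, n_query, dtype=torch.bool))
    return mask

print(deepencoder_v2_style_mask(n_vis=4, n_query=3).int())
```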

Through this design, DeepEncoder V2 achieves two-level cascaded causal reasoning:

The encoder semantically rearranges visual tokens through learnable queries, and then the LLM decoder performs autoregressive reasoning on this ordered sequence.

This means that DeepSeek-OCR2 has "sorted out" the information in the image during the encoding stage instead of throwing it all to the decoder at once.
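Putting the pieces together, the data flow reads roughly as below. The modules are lightweight placeholders (the real tokenizer is SAM-based and the encoder is Qwen2-0.5B); only the order of operations reflects the description above.

```python
# Schematic data flow only. The real components are a SAM-based tokenizer,
# a Qwen2-0.5B encoder, and the DeepSeek decoder; the tiny modules and
# hidden size below are placeholders that keep the sketch runnable.
import torch
import torch.nn as nn

d = 256                                                          # placeholder hidden size
vision_tokenizer = nn.Conv2d(3, d, kernel_size=16, stride=16)    # stands in for SAM + conv
encoder = nn.TransformerEncoderLayer(d_model=d, nhead=4, batch_first=True)
query_tokens = nn.Parameter(torch.randn(1, 64, d))               # learnable "causal flow" queries

image = torch.randn(1, 3, 224, 224)

# 1) Image -> visual tokens (still in raster order at this point).
vis = vision_tokenizer(image).flatten(2).transpose(1, 2)         # (1, 196, d)

# 2) The encoder "reads" the visual tokens through the queries; the hybrid
#    mask from the previous sketch would be applied inside its attention.
x = torch.cat([vis, query_tokens.expand(image.size(0), -1, -1)], dim=1)
semantic_tokens = encoder(x)[:, vis.size(1):, :]                 # (1, 64, d), semantically ordered

# 3) The decoder then reasons autoregressively over this compact, ordered
#    sequence instead of over 196 raster-ordered patches.
print(semantic_tokens.shape)
```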

Fewer Tokens, Higher Accuracy

Experimental data shows that DeepSeek-OCR2 significantly improves performance while maintaining a very high compression rate.

In the OmniDocBench v1.5 benchmark, DeepSeek-OCR2 achieved an overall score of 91.09% with the fewest visual tokens (only 256–1120), an improvement of 3.73% over the previous generation.

Notably, on the reading-order (R-order) edit-distance metric, DeepSeek-OCR2 cuts the previous generation's 0.085 down to 0.057.

This directly proves that the new model is more logical and better understands the "reading order" when processing complex layouts.
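For reference, the edit distance reported here is, as is typical for such benchmarks, the Levenshtein distance normalized by the reference length, where lower is better. A minimal sketch of that generic computation (not OmniDocBench's own scoring script):

```python
# Generic normalized edit distance (Levenshtein / reference length), lower is
# better; this is the usual definition, not OmniDocBench's exact scoring code.
def normalized_edit_distance(pred: str, ref: str) -> float:
    m, n = len(pred), len(ref)
    dp = list(range(n + 1))                 # dp[j] = distance(pred[:0], ref[:j])
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,                              # delete from pred
                        dp[j - 1] + 1,                          # insert into pred
                        prev + (pred[i - 1] != ref[j - 1]))     # substitute
            prev = cur
    return dp[n] / max(n, 1)

# 0.0 means the predicted text (or reading order) matches the reference exactly.
print(normalized_edit_distance("title body table", "title table body"))
```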

Compared with powerful closed-source models such as Gemini-3 Pro, DeepSeek-OCR2 holds its own.

When both use about 1120 visual tokens, DeepSeek-OCR2's document-parsing edit distance (0.100) beats Gemini-3 Pro's (0.115).

DeepSeek-OCR2 is not only at the top of the leaderboard; it also performs strongly in real production environments.

DeepSeek revealed that when processing online user log images, the repetition rate of OCR results decreased from 6.25% to 4.17%; in the PDF data production scenario, the repetition rate decreased from 3.69% to 2.88%.

This means the text the model generates is cleaner and more accurate, which is of great value for LLM training-data cleaning pipelines.

Towards True Multimodal Unification

DeepSeek mentions at the end of the paper that DeepSeek-OCR2 verifies the feasibility of "using an LLM as a visual encoder" through DeepEncoder V2.

This is not only an upgrade of an OCR model but also an important step towards native multimodality (Native Multimodality).

In the future, the same encoder, equipped with query embeddings for different modalities, could process text, images, audio, and other modalities alike, truly realizing "everything can be tokenized, and everything can be causally reasoned over".

DeepSeek says that although optical character recognition (OCR) is one of the most practical visual tasks in the LLM era, it is only a small part of the grand picture of visual understanding.

DeepSeek will continue to explore and move towards more general multimodal intelligence.

Reference: https://huggingface.co/deepseek-ai/DeepSeek-OCR-2 

This article is from the WeChat official account "New Intelligence Yuan", edited by Dinghui Haokun, and published by 36Kr with authorization.