
Breaking: DeepSeek open-sources again. Vision is compression: 100 tokens outperform 7,000 tokens.

新智元 · 2025-10-21 09:30
DeepSeek-OCR decodes more than 10 times as much text from a small number of visual tokens, compressing textual information efficiently through the visual modality.

A picture is worth a thousand words! The DeepSeek-OCR model boldly explores the boundary of visual-text compression. By decoding more than 10 times the text information from a small number of visual tokens, this end-to-end VLM architecture not only outperforms GOT-OCR2.0 on the OmniDocBench benchmark but also provides an efficient solution to the long-context problem of LLMs.

DeepSeek has released a new model!

On Github, DeepSeek has created a new repository for DeepSeek-OCR, aiming to explore the boundary of visual-text compression.

As the saying goes, a picture is worth a thousand words. This is also true for LLMs!

Theoretically, the DeepSeek-OCR model has preliminarily verified the feasibility of "context optical compression" —

The model can effectively decode more than 10 times the number of text tokens from a small number of visual tokens.

That is to say, a single image containing document text can represent rich information with far fewer tokens than the equivalent text.

This indicates that optical compression through visual tokens can achieve a higher compression ratio.

As an intermediate modality connecting vision and language, the OCR task is an ideal testbed for the visual-text compression paradigm —

It establishes a natural compression-decompression mapping relationship between visual and text representations and provides quantifiable evaluation metrics.

DeepSeek-OCR has high practical value in the OCR task: In the OmniDocBench benchmark test, it surpasses GOT-OCR2.0 (256 tokens per page) with only 100 visual tokens; with fewer than 800 visual tokens, it outperforms MinerU2.0 (more than 6000 tokens per page on average).

Figure (a) shows the compression ratio (the number of real text tokens / the number of visual tokens used by the model) in the Fox benchmark test; Figure (b) shows the performance comparison on the OmniDocBench.
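As a quick illustration of the ratio defined in Figure (a), the arithmetic is straightforward; the 1,000-token page below is a hypothetical example, not a number taken from the benchmark.

```python
# Compression ratio as defined above: real text tokens / visual tokens used by the model.
def compression_ratio(text_tokens: int, vision_tokens: int) -> float:
    return text_tokens / vision_tokens

# Hypothetical page: ~1,000 ground-truth text tokens decoded from 100 visual tokens.
print(compression_ratio(1000, 100))   # 10.0 -> a 10x optical compression
```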

In practical applications, a single A100-40G graphics card can support the generation of more than 200,000 pages of training data for large language models/visual language models per day.

The new model can also parse charts, chemical equations, simple geometric figures, and natural images:

For historical context at different stages (for example, older rounds of a dialogue), DeepSeek-OCR's visual-text compression can reduce the number of tokens by 7-20 times, providing a feasible direction for solving the long-context problem of large language models.

This paradigm opens up new possibilities for rethinking the collaborative fusion of visual and language modalities and further improving the computational efficiency of large-scale text processing and intelligent agent systems.

This discovery will strongly promote the future development of visual language models and large language models.

Github: https://github.com/deepseek-ai/DeepSeek-OCR

HuggingFace: https://huggingface.co/deepseek-ai/DeepSeek-OCR
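For readers who want to try the released weights, the sketch below uses the standard Hugging Face `trust_remote_code` loading pattern. The prompt string and the `infer` helper follow the pattern described on the model card at the time of writing; treat the exact arguments as assumptions and verify them against the repository README.

```python
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_ID = "deepseek-ai/DeepSeek-OCR"

# The model ships custom code, so trust_remote_code=True is required.
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModel.from_pretrained(MODEL_ID, trust_remote_code=True, use_safetensors=True)
model = model.eval().cuda().to(torch.bfloat16)

# Assumption: the repository's remote code exposes an `infer` helper for OCR-style prompts;
# check the README/model card for the exact signature and supported options.
prompt = "<image>\n<|grounding|>Convert the document to markdown."
result = model.infer(tokenizer, prompt=prompt, image_file="page.png", output_path="out/")
```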

The open-source DeepSeek-OCR explores context optical compression

Currently, open-source VLMs (visual language models) adopt three main visual encoder architectures, but each has its own drawbacks.

With the progress of VLMs, many end-to-end OCR models have emerged, fundamentally changing the traditional pipeline architecture and simplifying the OCR system.

But there is a core question:

For a document containing 1,000 characters, how many visual tokens, at minimum, are needed to decode it?

This question is of great significance for studying the principle of "a picture is worth a thousand words".

DeepSeek-OCR aims to answer this question. It adopts a unified end-to-end VLM architecture, consisting of an encoder and a decoder.

The encoder (i.e., DeepEncoder) is responsible for extracting image features, tokenizing and compressing the visual representation. The decoder generates the required results based on the image tokens and prompt information.

Encoder: The innovative architecture of DeepEncoder

To verify the feasibility of "context optical compression", the visual encoder needs to meet the following characteristics:

  1. Be able to process high-resolution images;
  2. Maintain low activation overhead at high resolutions;
  3. Generate fewer visual tokens;
  4. Support multi-resolution input;
  5. Have a moderate parameter scale.

The researchers proposed a brand-new visual encoder, DeepEncoder. DeepEncoder has approximately 380 million parameters and is mainly composed of SAM-base and CLIP-large connected in series.

The visual perception feature extractor mainly uses window attention, and its main architecture is SAM-base (patch size 16) with 80 million parameters;

The visual knowledge feature extractor uses dense global attention, and its main architecture is CLIP-large with 300 million parameters.

Between these two components is a 2-layer convolutional module that performs 16× downsampling on the visual tokens.

DeepEncoder keeps the visual token count under control. For example, an input image of 1024×1024 is first divided into (1024/16) × (1024/16) = 4096 patch tokens.

The first half of the encoder is dominated by window attention and has only 80M parameters, so the activation memory consumption is acceptable.

Before entering the global attention module, the 4096 tokens pass through the compression module, and the final number of tokens will be reduced to 4096/16 = 256, making the overall activation memory consumption controllable.
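The token bookkeeping above can be written down as a schematic pipeline. The sketch below mirrors the described structure (window-attention stage → 16× convolutional compressor → global-attention stage), but the channel widths, kernel sizes, and placeholder stages are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn as nn

class DeepEncoderSketch(nn.Module):
    """Schematic only: window-attention stage -> 16x token compressor -> global-attention stage."""

    def __init__(self, dim_local: int = 768, dim_global: int = 1024, patch: int = 16):
        super().__init__()
        self.patch = patch
        self.dim_local = dim_local
        # Stand-ins for SAM-base (window attention, ~80M) and CLIP-large (global attention, ~300M).
        self.local_stage = nn.Identity()
        self.global_stage = nn.Identity()
        # Two stride-2 convolutions: each halves H and W, so the token count drops 4x per layer, 16x in total.
        self.compressor = nn.Sequential(
            nn.Conv2d(dim_local, dim_global, kernel_size=3, stride=2, padding=1),
            nn.Conv2d(dim_global, dim_global, kernel_size=3, stride=2, padding=1),
        )

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        b, _, h, w = image.shape                          # e.g. 1 x 3 x 1024 x 1024
        gh, gw = h // self.patch, w // self.patch         # 64 x 64 grid -> 4096 patch tokens
        feats = torch.randn(b, self.dim_local, gh, gw)    # placeholder for patchified window-attention features
        feats = self.local_stage(feats)
        feats = self.compressor(feats)                    # 64x64 -> 32x32 -> 16x16, i.e. 4096 -> 256 tokens
        tokens = feats.flatten(2).transpose(1, 2)         # (b, 256, dim_global)
        return self.global_stage(tokens)


print(DeepEncoderSketch()(torch.randn(1, 3, 1024, 1024)).shape)   # torch.Size([1, 256, 1024])
```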

Suppose there is an image containing 1000 optical characters. To test how many visual tokens are needed for decoding, the model needs to support a variable number of visual tokens.

That is to say, DeepEncoder needs to support multiple resolutions.

Dynamic interpolation of position encodings can meet this requirement.
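In generic ViT terms, dynamically interpolating position encodings means resampling the learned 2-D position-embedding grid to match the token grid of a new resolution. The sketch below shows the common bicubic recipe under that assumption; it is not DeepSeek-OCR's exact code.

```python
import torch
import torch.nn.functional as F

def interpolate_pos_embed(pos_embed: torch.Tensor, new_grid: int) -> torch.Tensor:
    """pos_embed: (1, N, C) learned for an old sqrt(N) x sqrt(N) token grid."""
    n, c = pos_embed.shape[1], pos_embed.shape[2]
    old_grid = int(n ** 0.5)
    grid = pos_embed.reshape(1, old_grid, old_grid, c).permute(0, 3, 1, 2)   # (1, C, g, g)
    grid = F.interpolate(grid, size=(new_grid, new_grid), mode="bicubic", align_corners=False)
    return grid.permute(0, 2, 3, 1).reshape(1, new_grid * new_grid, c)

# e.g. resample a 64x64 embedding grid to 80x80 for a 1280x1280 input with patch size 16
resized = interpolate_pos_embed(torch.randn(1, 64 * 64, 768), new_grid=80)
print(resized.shape)  # torch.Size([1, 6400, 768])
```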

The researchers designed multiple resolution modes that are trained jointly, enabling a single DeepSeek-OCR model to support multiple resolutions.

As shown in Figure 4 below, DeepEncoder mainly supports two input modes: native resolution and dynamic resolution. Each mode contains multiple sub-modes.

The native resolution supports four sub-modes: Tiny, Small, Base, and Large.

The dynamic resolution combines two native resolutions (tiled local views plus a global view).

Supporting dynamic resolution is mainly to meet the application requirements of ultra-high-resolution input (such as newspaper images). Tiling acts as a secondary form of window attention that can further reduce activation memory consumption.

In Gundam mode, the number of visual tokens output by DeepEncoder is n×100 + 256, where n is the number of tiles.
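As a quick sanity check of that formula, the visual-token count for a page split into n tiles plus one global view can be computed directly; the tile counts below are illustrative.

```python
def gundam_tokens(n_tiles: int) -> int:
    # n local tiles at 100 tokens each, plus a 256-token global view.
    return n_tiles * 100 + 256

for n in (2, 4, 6):
    print(n, "tiles ->", gundam_tokens(n), "visual tokens")
# 2 tiles -> 456, 4 tiles -> 656, 6 tiles -> 856
```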

Gundam mode is trained together with the four native resolution modes to achieve the goal of a single model supporting multiple resolutions.

It is worth noting that the Gundam-master mode (a local view of 1024×1024 + a global view of 1280×1280) is obtained by continuing to train the already trained DeepSeek-OCR model.

Table 1 below summarizes the resolutions and the number of tokens in each mode.

Decoder: DeepSeek-3B-MoE

The decoder uses DeepSeekMoE, specifically DeepSeek-3B-MoE.

During inference, the model activates 6 routed experts and 2 shared experts, with approximately 570 million parameters activated in total.

The 3B DeepSeekMoE is very suitable for domain-centered visual language model (VLM) research:

It obtains the expressive power of a 3B model while enjoying inference efficiency similar to that of a 500M-scale small model.
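To see why a 3B-parameter MoE can run roughly like a ~500M dense model, the sketch below shows generic top-k routing with a couple of always-on shared experts. The total number of routed experts and the expert sizes here are illustrative assumptions, not DeepSeekMoE's actual configuration; only the "6 routed + 2 shared active" pattern comes from the description above.

```python
import torch
import torch.nn as nn

class TinyMoE(nn.Module):
    """Generic illustration: per token, only the top-k routed experts plus the shared experts run."""

    def __init__(self, dim: int = 512, n_routed: int = 32, n_shared: int = 2, k: int = 6):
        super().__init__()
        self.k = k
        self.router = nn.Linear(dim, n_routed)
        # Experts are plain linear layers here purely for illustration.
        self.routed = nn.ModuleList(nn.Linear(dim, dim) for _ in range(n_routed))
        self.shared = nn.ModuleList(nn.Linear(dim, dim) for _ in range(n_shared))

    def forward(self, x: torch.Tensor) -> torch.Tensor:       # x: (tokens, dim)
        weights, idx = self.router(x).softmax(-1).topk(self.k, dim=-1)
        shared_out = sum(e(x) for e in self.shared)            # shared experts: always active
        routed_out = torch.stack([
            sum(w * self.routed[int(i)](x[t]) for w, i in zip(weights[t], idx[t]))
            for t in range(x.size(0))
        ])                                                      # each token touches only k routed experts
        return shared_out + routed_out


print(TinyMoE()(torch.randn(4, 512)).shape)                     # torch.Size([4, 512])
```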

Specific results

On the Fox benchmark set, the researchers verified the compression and decompression ability of DeepSeek-OCR on text-dense documents and preliminarily explored the feasibility and boundary of "context optical compression".

As shown in Table 2 below, within a 10× compression ratio, the decoding accuracy of the model can reach approximately 97%, which is very promising.

Moreover, the output format is not exactly the same as that of the Fox benchmark, so the actual performance may be slightly higher than the test result.