DeepSeek has just delivered a notable breakthrough, loosening the constraint that context length places on large models.
While large language models race to extend their context windows ever further, DeepSeek has proposed a distinctly different technical route.
According to a report by Zhidx on October 20th, this morning DeepSeek open-sourced the DeepSeek-OCR model and, for the first time, introduced the concept of “Contexts Optical Compression”, achieving efficient information compression by converting text into images.
The feasibility of this method has been verified: at a 10x compression ratio, the decoding accuracy of DeepSeek-OCR reaches 97%, close to lossless compression; at a 20x compression ratio, accuracy still remains at roughly 60%.
By rendering text as images and encoding them as visual tokens, DeepSeek-OCR can represent the same text content with far fewer tokens, offering a new way to tackle the heavy compute overhead that long-text processing imposes on large language models.
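As a quick illustration of what these compression ratios mean in practice, the sketch below computes the ratio of text tokens to visual tokens; the specific token counts are made-up examples, grounded only in the ratios and accuracies quoted above.

```python
# Compression ratio = text tokens a page would need / visual tokens actually used.
def compression_ratio(text_tokens: int, vision_tokens: int) -> float:
    return text_tokens / vision_tokens

# ~1000 text tokens encoded as 100 visual tokens is the ~10x regime (≈97% decoding accuracy).
print(compression_ratio(1000, 100))  # 10.0
# The same content squeezed into 50 visual tokens is the ~20x regime (≈60% accuracy).
print(compression_ratio(1000, 50))   # 20.0
```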
In addition, DeepSeek-OCR also shows high practical value. On OmniDocBench, it surpasses GOT-OCR2.0 (256 tokens per page) using only 100 visual tokens, and with fewer than 800 visual tokens its performance exceeds that of MinerU2.0 (nearly 7,000 tokens per page on average).
In a production environment, DeepSeek-OCR can generate more than 200,000 pages of training data per day on a single A100-40G GPU, supporting large-scale document understanding and multimodal model training.
The model is now open-sourced on Hugging Face, and the technical report covering its details and underlying theory has also been publicly released. The DeepSeek-OCR team said the model is a preliminary exploration of one possible direction: using the visual modality as an efficient compression medium for text information.
It is worth noting that, unlike DeepSeek's previous model releases, whose papers often listed dozens of authors, this paper has only three: Haoran Wei, Yaofeng Sun, and Yukun Li. Haoran Wei, the first author of the DeepSeek-OCR paper, is also the first author of the GOT-OCR2.0 paper; GOT-OCR2.0 is an OCR model released by Jieyue Xingchen last September.
Open-source address: https://huggingface.co/deepseek-ai/DeepSeek-OCR
Paper link: https://github.com/deepseek-ai/DeepSeek-OCR/tree/main
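For readers who want to try the released checkpoint, the sketch below shows a minimal way to load it with Hugging Face transformers. The repository ships custom modeling code, so trust_remote_code is needed; the inference call (here named infer) and its prompt format are assumptions for illustration and should be checked against the model card.

```python
from transformers import AutoModel, AutoTokenizer

model_id = "deepseek-ai/DeepSeek-OCR"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModel.from_pretrained(model_id, trust_remote_code=True).eval().cuda()

# Hypothetical entry point: OCR a document image into markdown-style text.
# Check the model card for the actual method name and prompt template.
result = model.infer(
    tokenizer,
    prompt="<image>\nConvert the document to markdown.",
    image_file="page.png",
)
print(result)
```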
01. Optical compression achieves high compression ratios: how many visual tokens are needed for decoding?
Over the past few years, the context length of AI models has kept growing, from 4K to 128K and then to millions of tokens, but at the cost of steeply increasing compute and GPU memory consumption.
Yet text is actually a redundant form of information. The DeepSeek-OCR team argues that “an image containing document text can represent rich information with far fewer tokens than the equivalent digital text. This indicates that optical compression through visual tokens can achieve a higher compression ratio.”
The industry has already explored VLM visual encoders and end-to-end OCR models. Building on prior research, the DeepSeek-OCR team identified a key question that remains unanswered: for a document containing 1,000 words, how many visual tokens are needed, at minimum, to decode it? This question matters for studying the principle that “a picture is worth a thousand words.”
To answer this question, DeepSeek built a verification system, DeepSeek-OCR. The model “opticalizes” text, compressing thousands of text tokens into hundreds of visual tokens, which the language model then decodes back into the original text.
The architecture of DeepSeek-OCR has two parts: DeepEncoder, a visual encoder designed specifically for high-compression, high-resolution document processing, and DeepSeek3B-MoE, a lightweight mixture-of-experts language decoder.
DeepEncoder: sharply reducing the number of vision tokens
DeepEncoder adopts a dual-structure SAM + CLIP design, achieving high-fidelity visual understanding by combining local window attention with global attention, and cutting the number of vision tokens with a two-layer 16× convolutional compression module.
For example, for a 1024×1024 document image, a conventional vision model would produce 4096 tokens, whereas DeepEncoder compresses them down to just 256, keeping activation memory under control.
It also supports multiple “resolution modes”: from the lightweight Tiny mode (64 tokens) to the high-fidelity Gundam mode (795 tokens), the compression level can be selected according to task complexity.
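The token arithmetic behind these numbers can be reproduced in a few lines. The sketch below assumes a ViT-style 16-pixel patch size together with the 16× convolutional compression described above; the patch size is an assumption for illustration, not a figure from the article.

```python
# Visual tokens = (patch tokens from the image grid) / (16x convolutional compression).
def vision_tokens(width: int, height: int, patch: int = 16, compression: int = 16) -> int:
    patch_tokens = (width // patch) * (height // patch)
    return patch_tokens // compression

print(vision_tokens(1024, 1024))  # 4096 patch tokens -> 256 visual tokens
print(vision_tokens(640, 640))    # 1600 patch tokens -> 100 visual tokens
print(vision_tokens(512, 512))    # 1024 patch tokens -> 64 visual tokens (Tiny-scale)
```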
The paper illustrates the compression effects at different resolutions: to the naked eye, text in a Tiny-mode image is slightly blurred but still basically readable, while in the high-fidelity Gundam mode reading the text in the image is essentially the same as reading the original file.
(For the actual visual effect, refer to the figures in the original paper.)
In actual use, only 100 visual tokens are needed to accurately recognize an ordinary paper or slide page; for text-dense newspapers or scientific papers, high-precision restoration can be achieved with the Gundam mode.
DeepSeek3B-MoE: only about 570 million activated parameters
On the decoding side, DeepSeek uses its self-developed DeepSeek3B-MoE architecture: during inference only 6 expert modules are activated, for a total of roughly 570 million activated parameters.
This “on-demand activation” mechanism gives the model strong expressive power while keeping latency low and energy efficiency high, making it well suited to scenarios such as document OCR and image-text generation.
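To make the “on-demand activation” idea concrete, here is a generic top-k mixture-of-experts layer in PyTorch. It is an illustrative sketch, not DeepSeek3B-MoE's actual implementation; the hidden size and expert count are invented, and only the figure of 6 activated experts is taken from the description above.

```python
import torch
import torch.nn as nn

class TopKMoE(nn.Module):
    """Generic MoE layer: a gate picks k experts per token, so only a small
    fraction of the total parameters is activated for any given token."""
    def __init__(self, dim: int = 1024, num_experts: int = 64, k: int = 6):
        super().__init__()
        self.k = k
        self.gate = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (tokens, dim)
        weights, idx = self.gate(x).topk(self.k, dim=-1)  # per-token expert choice
        weights = weights.softmax(dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.k):                        # run only the chosen experts
            for e in idx[:, slot].unique().tolist():
                mask = idx[:, slot] == e
                out[mask] += weights[mask, slot, None] * self.experts[e](x[mask])
        return out

moe = TopKMoE()
print(moe(torch.randn(8, 1024)).shape)  # torch.Size([8, 1024])
```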
Data engine: From documents to charts, chemical formulas, and geometric figures
DeepSeek has also built a large-scale dataset comprising four major data types:
(1) OCR 1.0 data: 30 million pages of multilingual documents and natural-scene text;
(2) OCR 2.0 data: parsing data for charts, chemical formulas, and geometric figures;
(3) General visual data: to give the model basic image-understanding ability;
(4) Pure text data: to maintain language fluency and context modeling.
Thanks to this data system, DeepSeek-OCR can not only recognize and segment text but also understand charts, interpret chemical formulas, identify geometric figures, and handle common interleaved text-image documents.
02. 10x compression is nearly lossless, and a few hundred tokens represent documents better than 7,000
The overall training process of DeepSeek-OCR is fairly simple, with two main stages: training DeepEncoder on its own, then training the complete DeepSeek-OCR model.
In addition, the “Gundam-master mode” (ultra-high resolution) is obtained by fine-tuning the pre-trained DeepSeek-OCR model on 6 million sampled examples. Since its training protocol is the same as that of the other modes, the DeepSeek-OCR team omits the details.
The training of DeepEncoder follows Vary's approach, using a lightweight language model and training under the next-token-prediction framework. At this stage, the model uses the aforementioned OCR 1.0 and OCR 2.0 data, as well as 100 million general images sampled from the LAION dataset.
Once DeepEncoder is trained, the DeepSeek-OCR team trains the complete model on multimodal and pure-text data using a pipeline-parallel strategy.
To verify DeepSeek-OCR's ability to compress and decompress text-dense documents, the research team ran experiments on the Fox benchmark. The results show that at a 10x compression ratio, the decoding accuracy of DeepSeek-OCR reaches approximately 97%, suggesting that near-lossless 10x text compression is within reach.
When the compression ratio exceeds 10x, performance declines. The main reasons are the growing complexity of document layouts and the blurring of long text at 512×512 or 640×640 resolution. The former can be addressed by rendering text into a unified layout, while the latter may in the future be treated as a feature for studying “forgetting mechanisms”.
Even at nearly 20x compression, the model still maintains roughly 60% accuracy. These results show that optical context compression is a promising research direction, and it adds no extra computational overhead because multimodal systems already include a visual encoder.
Beyond these experiments, DeepSeek-OCR also performs well in real-world scenarios and can build high-quality data for LLM/VLM pre-training. On OmniDocBench, with only 100 visual tokens (640×640 resolution), DeepSeek-OCR surpasses GOT-OCR 2.0, which uses 256 tokens; with fewer than 800 tokens (Gundam mode), it even surpasses MinerU 2.0, which requires approximately 7,000 visual tokens.
Further analysis shows that different document types need different numbers of tokens: slide-style documents achieve good results with only about 64 visual tokens; books and reports reach stable performance with 100 visual tokens; newspapers, with their dense text, require the Gundam or Gundam-master mode to reach acceptable results.
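Translated into code, these per-document-type budgets amount to a simple lookup. The helper below is a toy sketch based on the figures just quoted; the mapping itself is an assumption, not something shipped with the model.

```python
# Rough visual-token budgets per document type, taken from the numbers above.
TOKEN_BUDGET = {"slide": 64, "book": 100, "report": 100, "newspaper": 795}

def pick_budget(doc_type: str) -> int:
    # Default to the 100-token setting used for books and reports.
    return TOKEN_BUDGET.get(doc_type, 100)

print(pick_budget("newspaper"))  # 795 (Gundam-scale budget)
```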
03. From financial charts to chemical formulas, diverse document types can be deeply parsed
In the paper, the DeepSeek-OCR team demonstrated the model's capabilities in concrete scenarios. DeepSeek-OCR has layout-recognition and OCR 2.0 capabilities and can further parse document images through a second round of model calls, a function DeepSeek calls “Deep Parsing.” The model can recognize different types of content within an image, including charts, geometric figures, chemical structural formulas, and natural images.
In financial research reports, DeepSeek-OCR can automatically extract structured information from charts in the document, which is particularly valuable for the financial and scientific fields.
For books and papers, the deep-parsing mode can generate dense image descriptions, enabling automatic recognition and transcription of text-image content.
For chemical literature, the model can not only recognize chemical structural formulas but also convert them into SMILES format, showing potential application value in STEM (science, technology, engineering, and mathematics) fields.
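A SMILES string produced this way can be validated and reused with standard cheminformatics tooling. The snippet below is a minimal sketch using RDKit; the aspirin SMILES is just an example input, not output from DeepSeek-OCR.

```python
from rdkit import Chem

smiles = "CC(=O)Oc1ccccc1C(=O)O"   # e.g., a structure transcribed from a figure (aspirin)
mol = Chem.MolFromSmiles(smiles)    # returns None if the string is not valid SMILES
if mol is not None:
    print(Chem.MolToSmiles(mol))    # canonical form, ready for downstream use
```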
In addition, DeepSeek-OCR can parse the structure of plane geometric figures. Although this task remains quite difficult, the model shows a preliminary understanding of geometric elements and spatial relationships.
PDF data on the Internet spans many languages; beyond Chinese and English, it includes a large amount of other multilingual content.