DeepSeek's ultimate ambition: to make images the basic language of large language models.
When we talk about DeepSeek, multimodality is rarely mentioned.
However, on October 20th, DeepSeek suddenly open-sourced DeepSeek-OCR. On the surface it is an OCR (Optical Character Recognition) model, and it has achieved SOTA (state-of-the-art) results on authoritative benchmarks such as OmniDocBench.
So why the sudden venture into OCR? The answer lies in the biggest challenge currently facing large language models: the compute bottleneck of long-context processing.
The core argument of the paper is this: text can be efficiently compressed through optical 2D mapping (i.e., rendered into an image), and a VLM (Vision-Language Model) can then recover the original information from that image.
In simple terms, it converts text content into an image, representing the same information with far fewer visual tokens than the equivalent digital text would require.
In this regard, the great Andrej Karpathy also said that he was deeply inspired by this paper and felt that pixels might be a better input for LLMs than text.
He also listed four major benefits of doing this:
Information compression: he explicitly cites the DeepSeek-OCR paper's finding that this brings "shorter context windows and higher efficiency".
More general information flow: The input is no longer limited to pure text and can also include "bold, colored text, and arbitrary images".
Stronger processing method: Images can be easily processed by "bidirectional attention", which is "much more powerful" than the autoregressive attention commonly used for text.
Eliminating the tokenizer (at the input end): this is what excites him most. He has long and strongly criticized the existing tokenizer as a persistent pain point.
This article takes an in-depth look at this idea and explores why DeepSeek is using a "visual hammer" to drive the "text nail". In a very literal sense of "a picture is worth a thousand words", this paper may well change the entire input paradigm of future LLMs.
01 Named OCR, Actually for Long Context
In the world of LLMs, the end of all competition seems to be the pursuit of "longer context". From a few thousand tokens to tens of thousands, and now to millions or even tens of millions of token windows, this arms race has never stopped.
The fundamental constraint behind this stems from the soul of the Transformer architecture: the attention mechanism.
Standard global attention allows each token in the sequence to see every other token, which gives the model strong context-understanding ability. In current mainstream autoregressive models, however, this ability comes at a high cost: because each token must attend to all previous tokens in order to predict the next one, computational complexity and memory usage grow quadratically with sequence length.
Although the industry has proposed optimizations such as grouped-query attention and multi-query attention (which shrink the number of key/value heads) and RoPE positional encoding, these methods essentially work around the quadratic cost of attention; they never actually reduce the number of tokens themselves.
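To make the quadratic growth concrete, here is a back-of-the-envelope sketch; the layer and head counts are hypothetical stand-ins for a generic dense-attention model, not any specific system:

```python
# Rough cost estimate: attention scores need one value per token pair,
# so memory and compute for this term grow with the square of sequence length.
# Layer and head counts below are illustrative assumptions.

def attention_score_floats(seq_len: int, num_layers: int = 32, num_heads: int = 32) -> int:
    """Number of attention-score values materialized across all layers and heads."""
    return num_layers * num_heads * seq_len * seq_len

for n in (4_000, 32_000, 1_000_000):
    print(f"{n:>9} tokens -> {attention_score_floats(n):.2e} score values")
# Going from 4K to 1M tokens (250x longer) inflates this term by 250^2 = 62,500x.
```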
The engineers on the DeepSeek-AI team clearly noticed this elephant in the room. They stepped outside the race to optimize attention computation and asked a more fundamental question: can we compress the number of tokens themselves?
This is the logical starting point of optical compression (Contexts Optical Compression).
To understand this, we first need to understand the differences between visual tokens and text tokens.
Visual tokens are the basic information units that vision models use when processing images. Text models (LLMs) read text tokens (words or sub-words), while vision-language models (VLMs) consume visual tokens.
In the DeepSeek-OCR paper, visual tokens are obtained by first cutting a high-resolution image into small patches. During encoding, each patch is converted into a vector (a token) that carries the information of that patch. A 1024×1024 image is thus divided into 4096 visual tokens (a 64×64 grid of patches).
A single page image can carry the content of roughly ten thousand text tokens. For a document containing about 10,000 words, feeding it in as text would cost more than 10,000 text tokens, whereas rendering it as an image and compressing it may require only a few hundred visual tokens. This realization, that the visual modality is naturally an efficient compression medium for text information, gave birth to the DeepSeek-OCR project.
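A minimal sketch of this token arithmetic, using the patch and compression figures quoted above; the 10,000-token page is an assumed example, and a page this dense would land toward the lossier end of the paper's measured range:

```python
# Token arithmetic behind optical compression (illustrative numbers).

def visual_token_count(width: int, height: int, patch: int = 16) -> int:
    """Patch-grid token count before any learned compression."""
    return (width // patch) * (height // patch)

raw_tokens = visual_token_count(1024, 1024)   # 64 x 64 = 4096 patch tokens
compressed = raw_tokens // 16                 # after DeepEncoder's 16x downsampling -> 256
assumed_text_tokens = 10_000                  # assumed text-token count of the same page

print(raw_tokens, compressed)                              # 4096 256
print(f"ratio ~ {assumed_text_tokens / compressed:.0f}x")  # ~39x here; the paper validates ~10x as near-lossless
```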
DeepSeek-OCR is essentially a proof of concept for an "optical compression-decompression" system. It attempts to answer a fundamental question: how many text tokens can be decompressed from how many visual tokens?
The answer to this question will directly determine the feasibility of "visual compression" as a solution for long context. DeepSeek's current solution achieves 10x compression with almost no loss and 20x compression that is basically usable.
02 DeepEncoder, the Art of Compression
To achieve optical compression, the team needed an unprecedented visual encoder. It must handle high-resolution input (text images are dense with detail), generate as few visual tokens as possible, and keep activation memory low throughout (otherwise the optimization is pointless).
The paper states plainly that none of the current mainstream VLM architectures (such as Vary, InternVL2, and Qwen2-VL) can meet these three requirements simultaneously.
For this reason, DeepSeek-AI designed the first core technical innovation of the paper: DeepEncoder.
DeepEncoder is a cascaded architecture with roughly 380 million parameters. Its workflow resembles an intelligence-processing team organized into three levels.
The first level is an 80M-parameter SAM-base perception module. Like an intelligence-gathering agent, it handles the local details of the high-resolution input. Given a 1024×1024 image, it splits it into 4096 patches, and its window attention keeps computation strictly inside small windows, so activation memory stays very low even while processing this large number of local tokens.
The second level is the key to the whole architecture: a 16x compressor (Conv 16x), a 2-layer convolutional module that acts as an information aggregator. It receives the 4096 pieces of "raw intelligence" from the first stage and, through learnable 16x downsampling, condenses them into a "summary briefing" of only 256 visual tokens. During training it learns which features matter most for later "decompressing" the text.
The third level is a 300M-parameter CLIP-large knowledge layer, the general commander. It never looks at the 4096 pieces of raw intelligence, only at the 256-token summary briefing. Because the briefing is short enough, it can afford expensive global attention, carefully cross-comparing all 256 condensed tokens to capture the long-range relationships and global semantic structure among them.
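The cascade can be sketched as a PyTorch module. This is only a structural sketch under the description above, not the released implementation: the SAM-base and CLIP-large stages are stand-in modules, and only the token bookkeeping (4096 local tokens, a 2-layer 16x convolutional compressor, 256 global tokens) follows the paper's design.

```python
import torch
import torch.nn as nn

class Conv16xCompressor(nn.Module):
    """2-layer convolutional compressor: two stride-2 convs halve each spatial side twice,
    i.e. 4x per dimension = 16x fewer tokens (a 64x64 grid becomes 16x16 = 256 tokens)."""
    def __init__(self, dim_in: int = 768, dim_out: int = 1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(dim_in, dim_out, kernel_size=3, stride=2, padding=1),
            nn.GELU(),
            nn.Conv2d(dim_out, dim_out, kernel_size=3, stride=2, padding=1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # (B, dim_in, 64, 64)
        return self.net(x)                               # (B, dim_out, 16, 16)

class DeepEncoderSketch(nn.Module):
    """Cascade: window-attention local stage -> 16x compressor -> global-attention stage."""
    def __init__(self, local_stage: nn.Module, global_stage: nn.Module):
        super().__init__()
        self.local_stage = local_stage      # stand-in for the 80M SAM-base stage (window attention)
        self.compress = Conv16xCompressor()
        self.global_stage = global_stage    # stand-in for the 300M CLIP-large stage (global attention)

    def forward(self, image: torch.Tensor) -> torch.Tensor:   # (B, 3, 1024, 1024)
        feats = self.local_stage(image)                       # (B, 768, 64, 64): 4096 cheap local tokens
        feats = self.compress(feats)                          # (B, 1024, 16, 16): 256 compressed tokens
        tokens = feats.flatten(2).transpose(1, 2)             # (B, 256, 1024)
        return self.global_stage(tokens)                      # global attention over only 256 tokens
```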
The 256 tokens output by DeepEncoder are only a global visual summary. What actually reproduces the full context, in order, is the downstream decoder, DeepSeek-3B-MoE. It receives this summary of visual tokens and generates text, consulting the visual evidence in DeepEncoder's global summary while relying on its own language modeling to keep the reconstructed context coherent.
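Glueing the two halves together might look roughly like the snippet below. This is a hypothetical interface sketch: the checkpoint name, prompt string, and projector are assumptions, and the encoder output is faked with a random tensor.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder checkpoint name; the released decoder is a 3B MoE model, but this string is assumed.
decoder = AutoModelForCausalLM.from_pretrained("placeholder/3b-moe-decoder")
tok = AutoTokenizer.from_pretrained("placeholder/3b-moe-decoder")

vision_tokens = torch.randn(1, 256, 1024)                       # stand-in for DeepEncoder's 256-token summary
projector = torch.nn.Linear(1024, decoder.config.hidden_size)   # maps visual tokens into the decoder's space
vision_embeds = projector(vision_tokens)

prompt_ids = tok("Free OCR.", return_tensors="pt").input_ids    # task prompt; exact wording assumed
prompt_embeds = decoder.get_input_embeddings()(prompt_ids)

inputs = torch.cat([vision_embeds, prompt_embeds], dim=1)       # visual summary prefixed to the prompt
out_ids = decoder.generate(inputs_embeds=inputs, max_new_tokens=2048)  # "decompress" the page back to text
print(tok.decode(out_ids[0], skip_special_tokens=True))
```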
DeepEncoder's cascaded design of local processing first, then compression, and finally global understanding perfectly avoids the problems of all previous solutions.
Vary is like two independent experts, one looking at details and the other at the outline, with the LLM left to reconcile them on its own. DeepEncoder, by contrast, is a single-path cascade in which information is refined step by step, a cleaner architecture.
InternVL2 cuts large images into a large number of tiles, generating thousands of tokens while lacking a global view and any real compression. DeepEncoder, through internal compression, produces only a few hundred tokens.
Qwen2-VL tries to apply global attention directly to thousands of tokens, which easily runs out of memory. DeepEncoder applies global attention only to the compressed 256 tokens, keeping the cost under control.
This design philosophy of local perception first, then compression and refinement, and finally global understanding resolves the tension between high-resolution processing and low computational cost.
The experimental results prove the effectiveness of this design:
10x compression: using 64 visual tokens (Tiny mode) to decode 600-700 text tokens, the compression ratio reaches 10.5x and OCR accuracy is as high as 96.5%.
20x compression: when the ratio climbs to nearly 20x (e.g., 64 tokens decoding 1,200+ text tokens), accuracy drops but still holds at roughly 60%, which remains usable.
The number of tokens DeepSeek-OCR needs varies with document type: simple presentations take about 64 tokens, books and reports about 100, and complex newspaper layouts require the so-called "Gundam mode", using up to 800 tokens.
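Tying these figures together, a small helper can map a planned compression ratio to the fidelity regimes reported above; the thresholds are approximate readings of those numbers, not exact boundaries from the paper.

```python
# Compression ratio = text tokens recovered / visual tokens spent.
def compression_ratio(text_tokens: int, vision_tokens: int) -> float:
    return text_tokens / vision_tokens

def expected_fidelity(ratio: float) -> str:
    """Rough regimes from the figures above (approximate thresholds)."""
    if ratio <= 10.5:
        return "near-lossless (~96-97% OCR precision)"
    if ratio <= 20:
        return "degraded but usable (~60% precision)"
    return "beyond the reported range"

print(expected_fidelity(compression_ratio(650, 64)))    # ~10.2x -> near-lossless
print(expected_fidelity(compression_ratio(1250, 64)))   # ~19.5x -> degraded but usable
```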
On the actual OCR benchmark OmniDocBench, it achieved a crushing victory:
DeepSeek-OCR in Small mode, with only 100 visual tokens, outperformed GOT-OCR2.0, which uses 256 tokens.
DeepSeek-OCR in Gundam mode, using fewer than 800 visual tokens, comprehensively outperformed MinerU2.0, which needs nearly 7,000 tokens.
This means that, with this method, it is realistic to reach roughly ten times today's context limits without sacrificing accuracy. And thanks to its compression efficiency, a single NVIDIA A100 GPU can process more than 200,000 pages of documents per day; with 20 servers (each carrying 8 A100s), daily throughput rises to about 33 million pages.
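The throughput claim is straightforward arithmetic; taking the per-GPU figure quoted above at face value:

```python
# Throughput arithmetic behind the figures quoted above.
pages_per_a100_per_day = 200_000              # "more than 200,000" pages per A100 per day
servers, gpus_per_server = 20, 8

total_gpus = servers * gpus_per_server                 # 160 A100s
daily_pages = total_gpus * pages_per_a100_per_day      # 32,000,000 pages/day at the lower bound
print(f"{daily_pages:,} pages/day across {total_gpus} GPUs")
# Since the per-GPU figure is a lower bound, this is consistent with the cited ~33 million pages/day.
```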
DeepSeek - OCR can also recognize and process various document types, including pure text, charts, chemical formulas, and geometric figures. It supports about 100 languages, can maintain the original layout, and can output pure text or generate descriptions of image content.
More importantly, this method does not require any additional infrastructure cost because multimodal systems already need visual encoders.
DeepSeek-OCR in effect implements a brand-new text compression paradigm on top of existing VLM infrastructure.
03 The Echo of Window Attention
DeepSeek's approach may actually feel a bit familiar. Before today's large decoder-only Transformers dominated everything, in the era of BERT and early RNNs, windowing was the mainstream compromise.
Whether it is truncated BPTT (Backpropagation Through Time) for RNNs or the sliding window used by models in the BERT lineage, the logic is the same: if the model cannot process the entire context at once, it only looks at a fixed-size window.
This is somewhat similar to DeepSeek-OCR, which likewise condenses the context, rendering it into compressed image information for processing.
However, the older methods could not overcome the information-island effect. When attention is strictly confined to local windows, the model loses the ability to capture long-distance dependencies: the title of a document may never be associated with its corresponding chart because the two are trapped in different attention windows. This limitation made traditional windowed methods perform poorly on tasks that require global understanding.
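The information-island effect is visible in the attention mask itself. A minimal sketch (window size chosen arbitrarily) shows that two positions farther apart than the window can never attend to each other directly:

```python
import numpy as np

def sliding_window_mask(seq_len: int, window: int) -> np.ndarray:
    """1 where token i may attend to token j, 0 otherwise (symmetric local window)."""
    idx = np.arange(seq_len)
    return (np.abs(idx[:, None] - idx[None, :]) <= window).astype(int)

mask = sliding_window_mask(seq_len=12, window=2)
print(mask[0, 11])   # 0: a "title" at position 0 never sees a "chart" at position 11
```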
In the Transformer era, however, the internal mechanics of this approach have changed qualitatively. It keeps the computational efficiency of window attention while cleverly solving the information-island problem through a hybrid architectural design.
Here, prior knowledge is the most crucial difference. A BERT-style window carries no prior knowledge and knows nothing about the content it processes. By contrast, DeepSeek-OCR's encoder and decoder are large-scale pre-trained models with a deep understanding of visual structure, text layout, and the regularities of language. This rich prior knowledge lets the model reconstruct the original information through "intelligent reasoning" even under aggressive compression.