
DeepSeek-OCR: Large model technology is now at a new crossroads.

锦缎 · 2025-10-23 07:12
Is the image truly the king of information processing?

Imagine that, in an era when AI technology is surging forward like a tide, we suddenly discover that a single image can carry a vast amount of text with astonishing efficiency. This is no longer imagination but a reality that has just arrived.

This week, DeepSeek open-sourced a model called "DeepSeek-OCR", which for the first time proposes the concept of "Contexts Optical Compression"; the technical details and the underlying paper have also been made public.

Although there isn't much discussion in the market yet, this may be a quiet but profound turning point in the evolution of AI. It forces us to ask: is the image the true king of information processing?

01 The Hidden Power of Images: Why Images Might Outperform Text

Recall that the documents, reports, and books we handle daily are usually broken down into countless text tokens. These tokens are the bricks from which the model builds its wall of understanding.

However, DeepSeek-OCR takes a different approach: it treats text as images. Through visual encoding, it compresses an entire page into a small number of "visual tokens" and then decodes them back into text, tables, or even charts.

The result? Efficiency improves by more than ten times, with accuracy as high as 97%.

This is not just a technical optimization but an attempt to prove that images are not slaves to information but efficient carriers of it.

Take a thousand-word article as an example. Traditional methods may need over a thousand tokens to process it, while DeepSeek needs only about 100 visual tokens to restore everything at 97% fidelity. This means the model can handle extremely long documents without straining computing resources.
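To make the arithmetic concrete, here is a back-of-envelope sketch using the article's headline numbers; this is my own illustration, not code from DeepSeek:

```python
# Back-of-envelope token arithmetic using the article's headline numbers.

def compression_ratio(text_tokens: int, vision_tokens: int) -> float:
    """How many text tokens each visual token stands in for."""
    return text_tokens / vision_tokens

text_tokens = 1000    # a ~1000-token article, tokenized conventionally
vision_tokens = 100   # the same page re-encoded as visual tokens

print(f"compression: {compression_ratio(text_tokens, vision_tokens):.0f}x")
# -> compression: 10x, reportedly decoded back at ~97% fidelity
```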

02 Architecture and Working Principle

The system design of DeepSeek-OCR is like a precision machine with two modules: the powerful DeepEncoder captures page information, and a lightweight text generator acts as a translator, converting visual tokens into readable output.

The encoder combines the local analysis of SAM with the global understanding of CLIP, then runs a 16-fold convolutional compressor that cuts the initial 4096 patch tokens down to just 256. This is the core secret of its efficiency.
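As a rough illustration of that shape flow, the sketch below mimics the described pipeline with stand-in PyTorch modules, assuming a 1024×1024 input cut into 4096 patches; none of this is DeepSeek's actual code:

```python
import torch
import torch.nn as nn

class DeepEncoderSketch(nn.Module):
    """Illustrative shape flow only: a SAM-style local stage, a 16x
    convolutional compressor, then a CLIP-style global stage."""

    def __init__(self, dim: int = 1024):
        super().__init__()
        # Stand-in for SAM's windowed (local) attention over raw patches.
        self.local_stage = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        # 16x token reduction: a strided conv over the 64x64 patch grid.
        self.compress = nn.Conv2d(dim, dim, kernel_size=4, stride=4)
        # Stand-in for CLIP's global attention over the compressed tokens.
        self.global_stage = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)

    def forward(self, patches: torch.Tensor) -> torch.Tensor:
        x = self.local_stage(patches)            # (B, 4096, dim): cheap, local
        g = x.transpose(1, 2).reshape(x.size(0), -1, 64, 64)
        g = self.compress(g)                     # (B, dim, 16, 16): 4096 -> 256
        x = g.flatten(2).transpose(1, 2)         # (B, 256, dim)
        return self.global_stage(x)              # global attention is now cheap

enc = DeepEncoderSketch()
out = enc(torch.randn(1, 64 * 64, 1024))         # one 1024x1024 page, 16x16 patches
print(out.shape)                                 # torch.Size([1, 256, 1024])
```

The point of the ordering is that the expensive global-attention stage only ever sees 256 tokens, while the 4096-token stage gets by with cheap local attention.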

What's even smarter is that it adjusts automatically to document complexity: a simple slide needs only 64 tokens, a page of a book or report about 100, and a dense newspaper at most 800.
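In code, such a dispatcher could be as simple as the following; the tier names, the heuristic, and the default are my own stand-ins for the multi-resolution modes the article describes, not DeepSeek's actual API:

```python
# Hypothetical budget picker mirroring the article's numbers.

def pick_token_budget(doc_type: str) -> int:
    budgets = {
        "slides": 64,      # simple layouts, e.g. a presentation page
        "book": 100,       # a typical book or report page
        "newspaper": 800,  # dense multi-column layouts (the upper bound)
    }
    return budgets.get(doc_type, 256)  # assume a mid-range default otherwise

for doc in ("slides", "book", "newspaper"):
    print(f"{doc}: {pick_token_budget(doc)} visual tokens")
```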

By comparison, it beats GOT-OCR 2.0 (256 tokens per page) and MinerU 2.0 (over 6,000 tokens per page), cutting token counts by 90% or more. The decoder is a Mixture of Experts (MoE) model with about 3 billion total parameters, of which only about 570 million are activated per token, so it can quickly generate text, Markdown, or structured data.

In practical tests, a single A100 GPU can process over 200,000 pages of documents per day. Scaled out to 20 eight-GPU servers, daily throughput reaches 33 million pages. This is no longer a laboratory toy but an industrial-grade tool.
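The scaling claim is essentially linear multiplication, which is easy to sanity-check (assuming throughput scales cleanly across GPUs):

```python
# Sanity check of the scaling claim, assuming throughput is linear in GPU count.
pages_per_gpu_per_day = 200_000    # a single A100, per the article ("over 200,000")
gpus = 20 * 8                      # twenty eight-GPU servers
total = pages_per_gpu_per_day * gpus
print(f"{total:,} pages/day")      # 32,000,000, consistent with the ~33M cited
```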

03 A Profound Paradox: Why Are Images More "Economical"?

There is an interesting paradox here: images obviously contain more raw data, so why can they be represented with fewer tokens inside the model? The answer lies in information density.

Text tokens look concise on the surface, but inside the model each one is expanded into a vector of thousands of dimensions; image tokens, like a continuous scroll, pack information more compactly. It is like human memory: recent events stay as clear as yesterday, while distant memories gradually fade yet keep their essence.
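One way to see why the token count, rather than the raw pixel count, is what matters: the decoder's attention and KV-cache costs grow with sequence length. A rough calculation, with every dimension assumed purely for illustration:

```python
# Illustrative KV-cache arithmetic: attention memory scales with token count,
# so fewer, denser tokens are cheaper to attend over.
# The hidden size and layer count are assumed, not DeepSeek's actual figures.

D_MODEL = 4096                     # assumed decoder hidden size
BYTES_PER_VALUE = 2                # fp16

def kv_cache_bytes(tokens: int, layers: int = 32) -> int:
    """Bytes of K and V cached for `tokens` positions across all layers."""
    return 2 * layers * tokens * D_MODEL * BYTES_PER_VALUE

text = kv_cache_bytes(1000)        # ~1000 text tokens for a page
vision = kv_cache_bytes(100)       # ~100 visual tokens for the same page
print(f"text: {text / 1e6:.0f} MB, vision: {vision / 1e6:.0f} MB "
      f"({text // vision}x less to cache)")
```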

DeepSeek-OCR has demonstrated the feasibility of visual tokens, but how to train a purely visual foundation model remains an open question. Traditional large models succeed because "predict the next word" is a clear objective, whereas the objective for text rendered as images is murky: predicting the next image patch is hard to evaluate, and converting back to text simply returns us to the old path.

So for now it is an enhancement to the existing system, not a replacement. We stand at a crossroads: the possibilities ahead are vast, but we must wait patiently for the breakthrough.

If this technology matures and is popularized, its influence will spread like ripples:

First, it changes the "token economy": long documents are no longer constrained by the context window, and processing costs drop sharply. Second, it improves information extraction: financial charts and technical drawings can be converted directly into structured data, accurately and efficiently. Finally, it adds flexibility: the model can still run stably on non-ideal hardware, democratizing AI applications.

Even better, it could improve the long-conversation memory of chatbots through "visual attenuation": old conversations are converted into low-resolution images for storage, simulating the decay of human memory to extend the context without exceeding the token limit.
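The article only sketches this idea; a toy version of the tiering might look like the following, with a Pillow-based downscaler and an invented decay schedule:

```python
# Toy sketch of "visual attenuation": older conversation turns, already
# rendered as page images, are stored at progressively lower resolution.
# The decay schedule below is invented; the article only outlines the idea.
from PIL import Image

def attenuate(page: Image.Image, age_in_turns: int) -> Image.Image:
    """Downscale an archived conversation page the older it gets."""
    if age_in_turns < 10:
        scale = 1.0    # recent: keep full resolution
    elif age_in_turns < 50:
        scale = 0.5    # mid-term: a quarter of the pixels
    else:
        scale = 0.25   # distant: gist only, 1/16 of the pixels
    w, h = page.size
    return page.resize((max(1, int(w * scale)), max(1, int(h * scale))))

page = Image.new("RGB", (1024, 1024))  # stand-in for a rendered chat page
print(attenuate(page, 60).size)        # (256, 256): fewer pixels, fewer tokens
```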

04 Conclusion

The significance of DeepSeek-OCR's exploration lies not only in the tenfold efficiency gain but also in how it redraws the boundaries of document processing. It challenges the context limit, optimizes the cost structure, and reshapes enterprise processes.

Although the dawn of purely visual training is still far off, optical compression is undoubtedly a new option for the road ahead.

Related frequently asked questions:

Q: Why can't we train a foundation model directly on text images?

A: The success of large models relies on the clear, easily evaluated objective of "predicting the next word". For text images, predicting the next image patch is hard and slow to evaluate, while converting back to text tokens returns us to the traditional path. DeepSeek therefore chooses to fine-tune on top of existing models to decode visual representations, rather than replacing the token foundation.

Q: How does its speed compare with traditional OCR systems?

A: On a 3503×1668-pixel image, basic text extraction takes 24 seconds, structured Markdown takes 39 seconds, and a complete analysis with bounding boxes takes 58 seconds. Traditional OCR is faster, but at comparable accuracy it needs thousands of tokens: MinerU 2.0, for example, needs over 6,000 tokens per page, while DeepSeek needs fewer than 800.

Q: Can this technology improve the long-conversation memory of chatbots?

A: Yes. Through "visual attenuation", old conversations are converted into low-resolution images that simulate memory decay, extending the context without increasing token consumption. It suits long-term memory scenarios, though the details of a production implementation remain to be spelled out.

This article is from the WeChat official account "Silicon - based Starlight", author: Garcia, published by 36Kr with authorization.