
AI has evolved again. DeepSeek has launched another "game-changing" model.

Technology Fox · 2025-10-24 19:47
Does AI need to lose weight too? DeepSeek pulls off an "optical slimming" trick.

Recently, a new trend has been brewing in AI circles: the DeepSeek team quietly open-sourced a small model with 3 billion parameters, named DeepSeek-OCR.

Don't be fooled by its small size; the idea behind it is quite radical: they actually want AI to read text by looking at images.

Yes, it's truly "recognizing characters by looking at pictures".

Moreover, it's not just about recognizing characters. They aim to make the "visual modality" a text compression medium, using images to represent text and replacing "text tokens" with "visual tokens" to achieve so-called Optical Compression.

To be honest, when I first saw the content of this paper, my first thought was: Do they want the language model to take art classes?

However, after careful consideration, it actually makes a lot of sense.

What's the biggest pain point of large language models (LLMs)? Processing long texts consumes a huge amount of computing power.

As we all know, the attention mechanism in large models has quadratic complexity in sequence length: double the input, and it has to do four times the computation. Ask it to remember an entire long document, and it will burn through computing resources in no time.
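
To see the scaling concretely, here's a back-of-the-envelope sketch; it counts pairwise token interactions rather than exact FLOPs:

```python
# Rough sketch of why attention cost is quadratic in sequence length.
def attention_cost(num_tokens: int) -> int:
    # Self-attention compares every token against every other token.
    return num_tokens * num_tokens

print(attention_cost(1000))  # 1,000,000 pairwise interactions
print(attention_cost(2000))  # 4,000,000 -> doubling the input quadruples the work
```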

So, can we think differently? The DeepSeek team says: Since a single image can contain a lot of text, why not convert the text directly into an image and then let the model look at the image!

The paper provides a very intuitive example: The content that originally required 1000 tokens to express can now be handled with just 100 visual tokens, achieving a tenfold compression while maintaining 97% OCR accuracy.

Even more impressively, at 20-fold compression it can still retain about 60% accuracy. This means the model is actually more efficient at "reading images" than at "reading text".

In other words, the model doesn't lose much information, but the computing power burden is reduced by ten times.
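
To make the idea concrete, here's a minimal sketch of the "render text as an image" step, assuming Pillow is installed; the token numbers are illustrative, not the paper's exact accounting:

```python
# Turn a chunk of text into a page image the model can "look at".
import textwrap
from PIL import Image, ImageDraw

text = "A long document paragraph. " * 150        # pretend this is ~1000 text tokens
img = Image.new("RGB", (1024, 1024), "white")     # one blank "page"
ImageDraw.Draw(img).multiline_text(
    (10, 10), textwrap.fill(text, width=100), fill="black"
)
img.save("page.png")  # the model now reads this image instead of the raw text
```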

Many netizens were stunned when they saw this: Does AI consume less computing power in processing images than long texts? This goes against human intuition!

Some netizens sighed: DeepSeek wants the model to "read documents as easily as scrolling through WeChat Moments".

I think this operation can be called a "reverse dimensionality reduction strike".

In the past, we've been trying to make models understand text better and see further across long contexts. DeepSeek does the opposite: it has the model convert text into images and then "recognize the text by looking at the images". It's a bit like going back to the most primitive form of human communication: pictographs.

Now, let's talk about how this model is made. DeepSeek-OCR consists of two parts: DeepEncoder (image-based compression) + DeepSeek3B-MoE (decoding and restoration).

The former is the "compression engine" of the entire system. It combines two powerful visual models, SAM-base and CLIP-large:

SAM is responsible for the "window attention" that focuses on details, and CLIP is responsible for the "global attention" that captures the overall picture. There's also a 16× convolutional compression module in the middle, specifically designed to reduce tokens.

For example, a 1024×1024 image, which theoretically needs to be divided into 4096 blocks for processing, can be compressed into just a few hundred tokens by this compression module.

In this way, it not only preserves clarity but also avoids overloading GPU memory.
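
The arithmetic is easy to check; here's a quick sketch (the 16×16 patch size is my assumption, a common ViT convention, and the layout of the 16× compressor is simplified):

```python
# Token-budget arithmetic for a 1024x1024 page.
image_side = 1024
patch_side = 16                            # assumed ViT-style patch size
raw_patches = (image_side // patch_side) ** 2
print(raw_patches)                         # 4096 blocks before compression

visual_tokens = raw_patches // 16          # the 16x convolutional compressor
print(visual_tokens)                       # 256 -> "just a few hundred tokens"
```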

Moreover, it supports multiple resolution modes: Tiny, Small, Base, Large, and a dynamic mode code-named "Gundam".

You read that right; even the naming of this model has a bit of a "nerdy spirit".
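
For reference, here's a hypothetical sketch of those modes as a config table; the resolutions and token budgets are my best recollection of the paper's settings, so treat them as illustrative rather than authoritative:

```python
# Hypothetical mode table; numbers are assumptions, not confirmed specs.
MODES = {
    "Tiny":   {"resolution": (512, 512),   "visual_tokens": 64},
    "Small":  {"resolution": (640, 640),   "visual_tokens": 100},
    "Base":   {"resolution": (1024, 1024), "visual_tokens": 256},
    "Large":  {"resolution": (1280, 1280), "visual_tokens": 400},
    "Gundam": {"resolution": "dynamic tiling", "visual_tokens": "varies"},
}
```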

The decoder part is DeepSeek's forte: the MoE (Mixture of Experts) architecture.

Only 6 out of 64 experts are activated each time, plus two shared experts. In effect, only about 570 million parameters are active, yet its performance is comparable to that of a 3-billion-parameter model. It's fast and resource-efficient: a true champion of energy efficiency.
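
For readers unfamiliar with MoE routing, here's a minimal sketch of top-k expert selection, assuming PyTorch; the layer sizes are placeholders, not DeepSeek's actual configuration:

```python
# Toy top-k MoE layer: route each token to 6 of 64 experts plus 2 shared ones.
import torch
import torch.nn as nn

class TinyMoE(nn.Module):
    def __init__(self, dim=256, n_experts=64, top_k=6, n_shared=2):
        super().__init__()
        self.router = nn.Linear(dim, n_experts)          # scores each expert per token
        self.experts = nn.ModuleList(nn.Linear(dim, dim) for _ in range(n_experts))
        self.shared = nn.ModuleList(nn.Linear(dim, dim) for _ in range(n_shared))
        self.top_k = top_k

    def forward(self, x):                                # x: (tokens, dim)
        weights, idx = self.router(x).softmax(-1).topk(self.top_k, dim=-1)
        out = sum(s(x) for s in self.shared)             # shared experts always run
        for t in range(x.size(0)):                       # only top_k experts fire per token
            for w, i in zip(weights[t], idx[t]):
                out[t] = out[t] + w * self.experts[int(i)](x[t])
        return out

print(TinyMoE()(torch.randn(4, 256)).shape)              # torch.Size([4, 256])
```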

Its task is not complicated. It just needs to "decode" the text from those compressed visual tokens.

The whole process is a bit like an upgraded version of OCR. However, this time, the model is "guessing characters by looking at images" on its own, rather than being taught by humans, and it guesses very accurately.

Of course, to train this model well, it needs a large amount of data. DeepSeek really spared no expense this time: a total of 30 million pages of PDF documents in 100 languages, with Chinese and English accounting for 25 million pages.

They also created a "model flywheel": first, use a layout-analysis model to roughly label the data, then use models like GOT-OCR for precise labeling, train the model once, and then use it to label even more data in turn.

Through this cycle, the model trains itself and grows.
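
In loop form, the flywheel looks roughly like this (a toy sketch; every function here is a hypothetical stand-in for DeepSeek's actual tooling):

```python
# Toy "model flywheel": coarse labels -> fine labels -> train -> relabel, repeated.
def coarse_labels(pages):                 # stand-in for a layout-analysis model
    return [f"layout({p})" for p in pages]

def refine_labels(pages, labels):         # stand-in for GOT-OCR-style fine labeling
    return [f"refined({l})" for l in labels]

def train_and_relabel(pages, labels):     # stand-in: train the model, relabel with it
    return [f"self-labeled({l})" for l in labels]

pages = ["page_001.pdf", "page_002.pdf"]
labels = coarse_labels(pages)
for _ in range(3):                        # each cycle improves both labels and model
    labels = refine_labels(pages, labels)
    labels = train_and_relabel(pages, labels)
print(labels[0])
```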

In addition, there are 3 million Word documents used specifically to train formula recognition and HTML table extraction; even unusual visual structures such as financial charts, chemical structural formulas, and geometric figures made it into the training set.

DeepSeek also collected 10 million scene-text images each in Chinese and English from open-source datasets like LAION and Wukong, labeling them with PaddleOCR.

It can be said that this training truly "covers everything from science and engineering to the arts": an intelligent model genuinely built out of data.

So, what's the result? The paper presents several sets of results, which are very impressive.

In the OmniDocBench test, DeepSeek-OCR outperformed GOT-OCR2.0 (256 tokens per page) while using only 100 visual tokens, and with fewer than 800 visual tokens it surpassed MinerU2.0 (over 6000 tokens per page).

It has stronger performance, shorter input, and faster inference.

This speed is simply like an "AI printing press".

However, what impressed me the most is the mind-blowing idea at the end of the paper: Can optical compression simulate human forgetting?

Human memory fades over time, with old memories becoming fuzzy and new ones remaining clear. The DeepSeek team wondered: Can AI also learn to "forget"?

If AI can "selectively remember" like humans, will it be more relaxed in long - term conversations?

They designed an experimental concept: Render the conversation history beyond the k-th round into an image. First, compress it to cut the tokens tenfold; as time goes on, shrink the image further. The smaller the image, the fuzzier the information, until eventually it is "forgotten".
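
Here's a toy sketch of that downscaling schedule, assuming Pillow; the halve-per-step rule is my own illustrative choice, not the paper's exact design:

```python
# Render conversation rounds as images; older rounds get progressively smaller.
from PIL import Image, ImageDraw

def render(text, side=1024):
    img = Image.new("RGB", (side, side), "white")
    ImageDraw.Draw(img).text((10, 10), text, fill="black")
    return img

history = [f"round {k}: ..." for k in range(1, 6)]  # round 5 is the newest
for age, text in enumerate(reversed(history)):      # age 0 = newest round
    side = max(64, 1024 >> age)                     # halve the resolution per step of age
    img = render(text).resize((side, side))         # fewer pixels -> fewer visual tokens
    print(text, "->", img.size)                     # the oldest rounds end up fuzziest
```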

Some netizens sighed after reading this: Isn't this simulating the human brain's memory mechanism!

Of course, some people poured cold water on it: DeepSeek's hallucination rate is already surprisingly high; if it learns to "forget", it might forget even faster than humans.

After reading this part, I really felt there was a philosophical question lurking here: Should an AI's memory be extended indefinitely, or should it learn to forget?

DeepSeek's answer is the latter. It uses vision to make the model "filter" out redundancy as it "compresses", just like the human brain: retaining only the useful information.

The significance behind this is greater than OCR itself. It's redefining the concept of "context": it's not about remembering a lot, but remembering precisely.

After all, although DeepSeek-OCR seems to be an OCR model, it's actually exploring a new paradigm: Can the visual modality efficiently carry language information?

While everyone is competing in the direction of "bigger, longer, and more expensive", DeepSeek has created a "smaller, faster, and more ingenious" model.

This is very much in line with DeepSeek's style.

Finally, I want to say: The evolution of AI may not always be about addition. Sometimes, subtraction is more elegant.

DeepSeek-OCR is a living example: a small 3B model has come up with a new idea for long-text compression and even touched on the boundary between "memory and forgetting".

If last year was about "who can remember more", then this year, it might be about "who can forget more smartly". And DeepSeek is leading the way again this time.

This article is from the WeChat official account "Technology Fox" (科技狐, ID: kejihutv), written by Lao Hu, and is republished by 36Kr with permission.