
The new model DeepSeek open-sourced yesterday is a bit odd.

差评 · 2025-10-22 08:59
If even a small update is this interesting, what would R2 look like...

DeepSeek has come up with something new again. It can store almost the same text information using only one-tenth of the original tokens. With such a compression ratio, even Shannon would shed tears, and von Neumann would fall silent.

And it has instantly drawn a crowd of onlookers from overseas.

Yesterday, DeepSeek released a new model called DeepSeek-OCR. We are all familiar with OCR, which is used to recognize text in images.

Anyone who uses WeChat a lot will know that when you open an image in the chat window, you can copy the text straight out of it.

Yes, this is an application of OCR technology.

This time, though, DeepSeek-OCR goes in the opposite direction: it converts large amounts of text into an image, and that image serves as the AI's "memory carrier".

Yes, using text to store information is no longer sufficient for it.

Until now, every large model, whether ChatGPT, Gemini, Llama, Qwen, or DeepSeek's own earlier models, has read data the same way: as text, i.e., the tokens we keep hearing about.

The prompts we write are converted into piles of tokens and handed to the model. The reference material we provide is also converted into piles of tokens and handed to the model. Even multimodal models that can recognize images have first had to turn the image into a text description before the language model gets to work on it.
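To make "tokens" a bit more concrete, here is a minimal sketch of how a prompt gets chopped up before the model ever sees it, using the Hugging Face transformers tokenizer (the gpt2 tokenizer is just an example; the prompt text is made up).

```python
# Minimal sketch: how a prompt becomes text tokens before the model sees it.
# Requires `pip install transformers`; the gpt2 tokenizer is only an example.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

prompt = "DeepSeek-OCR compresses long documents into vision tokens."
token_ids = tokenizer.encode(prompt)

print(len(token_ids))                               # how many tokens the model must process
print(tokenizer.convert_ids_to_tokens(token_ids))   # the actual sub-word pieces
```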

But are text tokens really the only way for large models to understand the world?

DeepSeek decided to try a new approach. After all, put an image and a block of text side by side, and the image can obviously hold more information.


Since that's the case, can we directly use images to train large models?

So DeepSeek went ahead and tried it, and found that a model fed with images this way both looks good on paper and works well in practice.

On the one hand, it can remember more content with fewer tokens.

On a document-understanding benchmark, DeepSeek-OCR used only 100 vision tokens and still outperformed GOT-OCR 2.0, which needs 256 tokens.

Even more impressively, with fewer than 800 vision tokens it beat MinerU 2.0, which needs an average of more than 6,000 tokens.

In other words, once large models start using images to remember data, they can get better results out of far fewer tokens.

In addition, DeepSeek-OCR also supports multiple resolutions and compression modes to adapt to documents of different complexities:

For example, a presentation slide with nothing but a background image and a title may need only 64 vision tokens.

If the page carries more text, it automatically switches to Large mode and uses up to 400 vision tokens to record it.

If even that is not enough, DeepSeek-OCR also offers a dynamically adjustable Gundam mode, which spends its budget on what actually needs remembering and treats the rest as lower priority (a rough sketch of the mode idea follows below).
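As a rough, purely illustrative picture of the mode idea (this is not DeepSeek's actual API; the helper function and its threshold are invented), choosing a mode amounts to picking a vision-token budget per page:

```python
# Purely illustrative sketch of the "mode = vision-token budget" idea.
# The mode names and the 64/400-token figures come from the article above;
# the pick_mode() helper and its threshold are invented for this example.

MODES = {
    "Tiny": 64,     # e.g. a slide with just a background image and a title
    "Large": 400,   # a dense, text-heavy page
}

def pick_mode(chars_on_page: int) -> str:
    """Crude stand-in for whatever complexity estimate a real system would use."""
    return "Tiny" if chars_on_page < 200 else "Large"

for chars in (80, 3500):
    mode = pick_mode(chars)
    print(f"{chars} characters -> {mode} mode, {MODES[mode]} vision tokens")
```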

And compared with traditional OCR models, which could only recognize plain text, DeepSeek-OCR captures far more kinds of data.

DeepSeek-OCR can automatically recognize a bar chart in a paper and save it in Excel format.

Diagrams of organic molecular structures in an article can likewise be converted automatically into the standard SMILES (Simplified Molecular Input Line Entry System) notation for storage.
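For anyone who has never seen SMILES, it is simply a molecular structure written out as a single line of text; the molecules below are generic examples picked for illustration, not ones taken from the paper.

```python
# SMILES writes a molecular structure as a plain text string.
# These molecules are generic illustrations, not examples from the paper.
examples = {
    "ethanol": "CCO",
    "benzene": "c1ccccc1",
    "caffeine": "CN1C=NC2=C1C(=O)N(C(=O)N2C)C",
}

for name, smiles in examples.items():
    print(f"{name}: {smiles}")
```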

Not only can it remember the image itself, but DeepSeek-OCR will also remember the position of the image and what the text near the image is about...

A lot of two-dimensional information that used to be invisible to models can now be captured by DeepSeek-OCR.

Many people may not realize the value of this thing yet.

Over the past two years, apart from the shortage of graphics cards, the biggest problem in building large models has been the lack of training data.

The publicly available datasets have long since been picked clean. To get high-quality data, you either quietly crawl it from the internet, spend a lot of money to buy it, or find a way to synthesize it.

Now, though, a lot of data that was never collected before can be harvested from this two-dimensional information.

Take research papers: in the past, large models could only learn from the body text, while the charts and illustrations were completely invisible to them.

With DeepSeek-OCR, that missing piece can be filled in painlessly.

DeepSeek clearly had the same thought. The paper specifically notes that the new model can produce more than 200,000 pages of LLM training data per day on a single A100.

So with DeepSeek-OCR in hand, all of the old data is worth running through it again.

Undoubtedly, this data will become the nourishment for the next large model.

On the other hand, once data is stored in this two-dimensional way, running the whole model becomes far more resource-efficient.

We all know that the longer a chat goes on and the longer the context grows, the more likely a large model is to slip up.

This is because when the large model is running, it has to process the relationship between each word and all other words.

If you double the length of the conversation, the computational load of the entire model will quadruple. If you triple the length, the computational load will become nine times the original.
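That "double the length, quadruple the work" rule is just the quadratic cost of attention: every token has to be compared with every other token. A quick back-of-the-envelope check (this assumes plain full attention; real deployments layer various optimizations on top):

```python
# Back-of-the-envelope: full self-attention compares every token with every
# other token, so the pairwise work grows with the square of the context length.
# (Assumes vanilla full attention; real systems add all sorts of optimizations.)

base_len = 1_000  # arbitrary reference context length

for factor in (1, 2, 3, 10):
    n = base_len * factor
    pairwise = n * n  # number of token-to-token comparisons
    print(f"{factor}x context -> {pairwise / base_len**2:.0f}x the pairwise work")
```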

This is also one of the reasons why large model manufacturers are now limiting the context length. If you chat too much in a single conversation, the cost will skyrocket.

After using image memory, DeepSeek can compress the number of tokens to one-tenth of the original...

At the same time, there won't be a significant loss in performance.

The paper shows that the newly released DeepSeek-OCR can retain 96.5% of the original accuracy while using only one-tenth of the original number of tokens.

Even if we compress it by 20 times, the accuracy of the model can still be maintained at about 60%...

At the same time, the researchers at DeepSeek also found something interesting.

They noticed that the way the model stores images at different levels of clarity looks a lot like the way we humans forget things.

For us humans, forgetting is a gradual process.

What just happened is a bit like the data stored by DeepSeek in Gundam mode, which is the clearest.

As time passes, the event matters less and less, so its storage gets downgraded from the largest Gundam mode all the way down to the smallest Tiny mode, taking up fewer and fewer tokens.

If we bring this idea into large models, recent chat history could be stored in "4K HDR Blu-ray" quality, while older, less important history gets compressed down to a "480p" file.

Could this kind of active forgetting be a way to extend the context capacity of large models?

This idea is very interesting, but even DeepSeek itself hasn't given a clear answer yet.
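To make the "active forgetting" idea a little more concrete, here is a purely speculative sketch (nothing here comes from DeepSeek's paper; the tiers, thresholds, and token budgets are all invented): older turns of a conversation get re-stored at ever smaller vision-token budgets.

```python
# Purely speculative sketch of "active forgetting": re-store older context at
# ever smaller vision-token budgets. Tiers, thresholds, and budgets are invented.

TIERS = [           # (max age in conversation turns, vision tokens per stored page)
    (5, 800),       # very recent history kept sharp (a Gundam-like budget)
    (50, 400),      # older history at a medium budget (Large-like)
    (None, 64),     # everything older shrinks to a Tiny-like budget
]

def token_budget(age_in_turns: int) -> int:
    """Return how many vision tokens to spend on context of a given age."""
    for max_age, budget in TIERS:
        if max_age is None or age_in_turns <= max_age:
            return budget
    return TIERS[-1][1]

for age in (1, 20, 500):
    print(f"context from {age} turns ago -> stored at {token_budget(age)} vision tokens")
```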