
After reading the DeepSeek OCR paper closely, I caught a faint glimpse of the outline of the "world model".

X研究媛 · 2025-10-27 11:21
Large language models should make a leap to a higher dimension.

DeepSeek OCR is a decent small OCR model, but it's overrated.

Some Zhihu users pointed out that even when compared horizontally with the best OCR models, it's not among the top ones.

In the following two cases, the 3B (3 billion parameter) DeepSeek OCR failed to recognize "polar coordinates" when expanding a mathematical formula, and it also got the table structure wrong. In contrast, PaddleOCR-VL, from Baidu's open-source PaddlePaddle and only 0.9B (900 million parameters) in size, outperformed it.

DeepSeek OCR is not entirely original, either. It is speculated that Google's Gemini, which supports a million-token context, may have used compressed visual tokens early on. And Glyph, released on the same day by a team from Tsinghua University and Zhipu with a similar idea of "compressing long text into visual tokens as model input," drew no such "extended interpretations."

Every move of DeepSeek attracts huge attention. But thinking about it carefully, it's normal.

Among the large Internet giants in China, it's almost impossible to find an example of a company that, after monopolizing a profitable market, still explores cutting-edge technologies for the well-being of humanity. DeepSeek is described by its American counterparts as "having unfathomable strength." Its intrinsic values and organizational form are rare exceptions among Chinese enterprises.

Under the leadership of Liang Wenfeng, DeepSeek doesn't lack money and is overflowing with extreme, romantic technological idealism. It open-sources the most cutting-edge and valuable model training details. After V3 and R1 caused a global sensation, it almost voluntarily gave up huge traffic and didn't imitate OpenAI by building an easily achievable AI business empire... It doesn't follow the normal path to "grow and strengthen." It lives in the future rather than the present, and pursues highly uncertain AGI in every word and deed. In a China full of trend-chasing, involution, plagiarism, and money worship, the birth of such a company is truly a "blessing of the nation."

Laypeople watch the excitement, while experts read the papers. Getting back to the point: the profound value of DeepSeek OCR is not "true infinite context," nor breaking OCR records on various evaluation sets and large-model arenas. Rather, it explores "continuous visual representation compression," which subtly points toward an ultimate pursuit: the "world model."

What Karpathy wanted to say but didn't clearly state in his evaluation: It has "redirected" the frontier focus of large models from discrete language tokens to continuous visual tokens.

Compression is Intelligence

Ilya has said that if we regard the brain as a biological computer, we will eventually make the breakthrough: the most mysterious part of human thinking may turn out to be surprisingly "simple in essence."

Ilya holds a belief: "If you can compress information efficiently, you must have gained knowledge, otherwise you couldn't compress it. When you achieve efficient information compression, you must already have some knowledge."

Compression represents information efficiently by identifying patterns and regularities, which is closely related to intelligent behavior. A considerable number of researchers believe that compression may be the foundation of general intelligence, or even equivalent to it, which is what Ilya firmly believes: "compression is intelligence."

Ilya may be only half right. The compression of one-dimensional, discrete language information succeeded and gave birth to the world-famous ChatGPT. Vision, however, is higher-dimensional, continuous information, and its end-to-end compression and unified representation extraction are extremely difficult.

Today's powerful pre-trained large language models are highly unified in their underlying principles: they use vast amounts of Internet text to train a super-large neural network, which can be regarded as an enormous set of parameters. When a user provides input, particular network parameters are activated and participate in the computation to "predict the output token with the highest probability." Concretely, the user's input text is converted into vectors through tokenization (splitting the text into words and symbols), and these vectors are matched against patterns in a high-dimensional vector space, where the activated parameters compute the most probable next word.
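As a minimal sketch of this "predict the highest-probability token" step (my own toy illustration, with a made-up five-word vocabulary and made-up logits, not any real model's parameters):

```python
import numpy as np

# Hypothetical toy vocabulary and scores -- illustrative only,
# not real model parameters.
vocab = ["the", "cat", "sat", "on", "mat"]
logits = np.array([1.2, 0.3, 2.7, 0.1, 1.9])  # raw scores produced by the network

# Softmax turns raw scores into a probability distribution over the vocabulary.
probs = np.exp(logits - logits.max())
probs /= probs.sum()

# Greedy decoding: pick the token with the highest probability.
next_token = vocab[int(np.argmax(probs))]
print(dict(zip(vocab, probs.round(3))), "->", next_token)
```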

In simple terms, an LLM guesses the next word from its parameters and the context. Looking back at the development of large language models, the discovery of general-purpose algorithms and the Transformer architecture made scaling truly feasible; combined with simple algorithms, massive data, and the explosion of GPGPU computing power, this successfully compressed almost all the text on the Internet into a very intelligent "token predictor."

The output of an LLM is produced "token by token" in an autoregressive manner, which means each token must "interact" with all of the preceding text. If the input has 100,000 tokens, the model needs roughly 100,000 × 100,000 = 10 billion "interactions." The longer the input context, the more computation is needed to predict the next word, growing quadratically with length.
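A back-of-the-envelope sketch of that quadratic cost (plain Python; the sequence lengths are arbitrary examples):

```python
def attention_pairs(context_len: int) -> int:
    """Number of token-to-token interactions in full self-attention."""
    return context_len * context_len

for n in (10_000, 100_000, 1_000_000):
    print(f"{n:>9,} tokens -> {attention_pairs(n):,} interactions")
# 100,000 tokens -> 10,000,000,000 interactions (10 billion);
# a 10x longer context costs 100x more interactions.
```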

No matter how large GPU memory bandwidth and capacity become, they cannot hold the huge intermediate matrices generated during computation all at once, and inference latency grows longer and longer. Recent LLM innovations, such as sparsifying and optimizing attention computation (which gave rise to MTP, NSA, and DSA), sparsely activating the dense FFN layers, and routing activation in super-large MoE expert networks, are essentially aimed at solving this computational problem.

Take DeepSeek as an example. Apart from R1, which for the first time openly disclosed the pre-training plus post-training reinforcement-learning recipe of an open-source model, reproduced the effect of o1's reasoning chain of thought, and caused a sensation, almost all of its other innovations focus on improving the efficiency of the attention mechanism, activation-parameter computation, and inference decoding, as well as reducing hardware costs and improving the reliability of data communication during training.

On the surface, DeepSeek OCR is an OCR model, but in fact, it also aims at computational efficiency and tries to achieve efficient compression of long input contexts for the model.

The core of DeepSeek OCR is DeepEncoder, an encoder that represents the input context with vision tokens. It achieves over 96% OCR decoding accuracy at 9-10x text compression, about 90% accuracy at 10-12x compression, and still maintains about 60% accuracy at 20x compression.

When the compression ratio is 10 times, it can almost achieve lossless compression. This means that for a model context that originally required 100,000 tokens, only 10,000 tokens are needed after visual encoding.
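The trade-off can be put into a few lines of arithmetic (a hypothetical helper of mine; the accuracy figures are the approximate bands quoted from the paper above):

```python
# Approximate accuracy bands quoted from the DeepSeek-OCR paper.
ACCURACY_BY_RATIO = {
    10: 0.96,   # ~9-10x compression: >96% decoding accuracy
    12: 0.90,   # ~10-12x compression: ~90%
    20: 0.60,   # ~20x compression: ~60%
}

def vision_token_budget(text_tokens: int, ratio: int) -> int:
    """Vision tokens needed to represent a given number of text tokens."""
    return text_tokens // ratio

print(vision_token_budget(100_000, 10))  # 100k text tokens -> 10k vision tokens
```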

Moreover, DeepSeek's paper states that the compression ratio can be adjusted continuously, making a smooth trade-off between compression ratio and recognition accuracy. The key point is that DeepSeek compares this dynamic visual compression with human memory and forgetting.

DeepSeek proposes a compression strategy similar to the biological forgetting mechanism (sketched in code after the list below):

Recent context: Maintain high resolution, consume more tokens, and the information is clear;

Distant context: Gradually reduce the resolution, consume fewer tokens, and the information is blurred;

This mechanism simulates the natural attenuation of human memory:

The longer the time, the fuzzier the memory;

The farther the distance, the weaker the visual perception;

Both show a pattern of gradual information loss (as illustrated in the paper's figure).
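Here is one way such a forgetting schedule might look in code (my own hypothetical sketch; the age thresholds, resolutions, and token budgets are invented for illustration and are not values from the paper):

```python
# Hypothetical forgetting schedule: older context is re-rendered at lower
# resolution, so it costs fewer vision tokens and preserves less detail.
def render_resolution(age_in_rounds: int) -> tuple[int, int]:
    """Return (image_side_px, vision_token_budget) for a context segment."""
    if age_in_rounds <= 2:        # recent context: high resolution, clear
        return 1024, 400
    if age_in_rounds <= 8:        # mid-range context: reduced resolution
        return 640, 100
    return 256, 25                # distant context: blurred, nearly forgotten

for age in (1, 5, 20):
    side, budget = render_resolution(age)
    print(f"round age {age:>2}: render at {side}px, ~{budget} vision tokens")
```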

In the paper, DeepSeek explains that the OCR work is a preliminary exploration of the boundary of vision-text compression, studying the core question of how many vision tokens are needed to decode N text tokens. The preliminary results are encouraging:

Optical context compression is not only technically feasible but also biologically plausible. It offers a new perspective on long-context modeling, and DeepSeek believes this direction will become an important breakthrough in future LLM and VLM research.

DeepSeek-OCR achieves nearly lossless OCR compression at a ratio of about 10x and still maintains 60% accuracy at 20x. These findings imply that, in multi-round conversations, history beyond the last k rounds can be processed optically for roughly 10x compression; the rendered images of older context can be progressively downscaled to cut token consumption further; and, by simulating the human forgetting mechanism, the older the content, the higher the compression ratio, the blurrier the image, and the more the information fades away.
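A rough sketch of what this could save in a multi-round conversation, assuming a hypothetical 10x ratio applied to rounds older than the last k, with made-up round sizes:

```python
def context_cost(round_tokens: list[int], k: int, ratio: int = 10) -> int:
    """Token cost if rounds older than the last k are optically compressed."""
    recent = round_tokens[-k:]                 # kept as ordinary text tokens
    old = round_tokens[:-k]                    # rendered to images, compressed ~10x
    return sum(recent) + sum(t // ratio for t in old)

rounds = [2_000] * 50                          # hypothetical: 50 rounds x 2k tokens
print(context_cost(rounds, k=5))               # 10,000 + 9,000 = 19,000 tokens
print(sum(rounds))                             # vs. 100,000 without compression
```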

In the paper, DeepSeek emphasizes that optical context compression is still a nascent and promising research direction. DeepSeek-OCR is not only an excellent, general-purpose OCR tool with real practical value; it can also produce large-scale pre-training data, making it an indispensable assistant in the LLM training pipeline. In practical use, the model can generate tens of millions of pages of training data per day, significantly improving the efficiency of multimodal data construction.

The "Outline" of the World Model

From the perspective of a "biological computer," the human brain can be roughly summarized like this: it uses multimodal, somewhat unified representations to compress information with extreme efficiency, thereby modeling and predicting the real world.

An LLM, on the other hand, "models and predicts the real world through the single modality of language."

If large language models can lead to AGI, does that mean humans understand everything through language and can model the world through language alone? There is an obvious flaw here: humans have nothing like LLM tokenization, a tokenizer that is "acquired rather than innate." Karpathy has described the tokenization process as ugly and clumsy.

When a user's text input is converted into content an AI can "read," it passes through a "tokenizer" that cuts the sentence into "tokens." For example, "Hello, world!" might be cut into four tokens: [Hello], [,], [world], [!]. Tokenization standards are not unified: different vocabularies and tokenizers mean each model splits text differently, which has some impact on the model's final performance.
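For a concrete sense of this step, the snippet below uses the open-source tiktoken library with one particular vocabulary; other tokenizers would split the same string differently, so treat the output as illustrative:

```python
import tiktoken  # pip install tiktoken

enc = tiktoken.get_encoding("cl100k_base")   # one particular BPE vocabulary
ids = enc.encode("Hello, world!")
pieces = [enc.decode([i]) for i in ids]      # decode each id back to its piece
print(ids)     # integer token ids
print(pieces)  # e.g. ['Hello', ',', ' world', '!'] -- boundaries vary by tokenizer
```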

Is the tokenization step that turns LLM text input into tokens really necessary? The DeepSeek-OCR paper inadvertently provides evidence that it may not be: it shows that an AI can use only 100 "vision tokens" to "decompress," with high accuracy, original text containing 1,000 "text tokens," with no text tokenization at all.

Language deeply depends on visual experience and multi - modal foundations. Words themselves are a secondary abstraction of the perceived world. Why should our AI systems bypass the more primitive and rich representation layer? When a model understands text directly at the pixel level, it doesn't just see language but acquires a richer and deeper learning mechanism.

As mentioned before, compared with one-dimensional, discrete language information, the end-to-end compression, unified representation extraction, and prediction of higher-dimensional, continuous visual information remain difficult and have seen little progress.

Yann LeCun, who constantly talks about world models, once described in a public interview how difficult it is to process continuous visual information:

"A typical large language model is trained on about 20 billion to 2 trillion tokens. A token is roughly a word. Usually, a token is represented by three bytes. So, 20 billion to 2 trillion tokens are about 10^14 bytes in total, which is 1 followed by 14 zeros. This is almost the sum of all public text on the Internet.

It would take a person hundreds of thousands of years to read all this material. It's a huge amount of information. Now, let's compare this data volume: a four-year-old child has been awake for a total of 16,000 hours, and about 2 megabytes of information enter our visual cortex through the optic nerve every second. At 2 megabytes per second, over four years, the visual input amounts to about 10^14 bytes of data. The amount of data a four-year-old child "sees" equals the text that would take you 400,000 years to read.

This shows that we can never achieve human-level AI by relying on text training alone. We must teach AI to understand the real world, which is very difficult. If, instead of words, we take the frames of a video, convert them into tokens similar to words, and then try to train the system to predict what will happen next in the video, it doesn't work.

We may not be able to predict exactly which word will appear next in a text, but we can predict the probability distribution over all possible words. For video we cannot do this: we cannot represent the probability distribution over all possible video frames. So the techniques that work very well for text, DNA sequences, or proteins do not work for video or other natural signals."
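The comparison checks out with quick arithmetic (using the figures given in the quote above):

```python
# Text side: ~2e13 tokens x ~3 bytes per token, as in the quote.
text_bytes = 2e13 * 3                 # ~6e13 bytes, on the order of 1e14

# Vision side: 16,000 waking hours at ~2 MB/s through the optic nerve.
vision_bytes = 16_000 * 3600 * 2e6    # ~1.15e14 bytes

print(f"text   ~{text_bytes:.1e} bytes")
print(f"vision ~{vision_bytes:.1e} bytes")   # same order of magnitude, ~1e14
```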

Looking back, the real value of the DeepSeek-OCR paper doesn't lie in providing a good OCR tool. It serves as a proof of concept: it uses experimental data to show that the main information entry point for AI can shift from language to vision, which is not only more efficient but also seems more in line with biology.

Karpathy also provided a key insight:

The task space of Vision→Text fully encompasses the task space of Text→Text. Any text can be "rendered" into an image without loss, but the reverse process, from image back to text, loses a great deal of information. This asymmetry implies a radical direction: unify all inputs into the visual modality while keeping the output as text.
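The "render text into pixels" half of that asymmetry is easy to demonstrate with a minimal Pillow sketch; the canvas size, default font, and filename here are arbitrary choices of mine:

```python
from PIL import Image, ImageDraw  # pip install pillow

text = "Any text can be rendered into an image without loss."
img = Image.new("RGB", (640, 48), "white")             # blank canvas
ImageDraw.Draw(img).text((8, 12), text, fill="black")  # draw with the default font
img.save("rendered_context.png")
# The pixels now carry the full text; going back from pixels to text
# is the lossy, OCR-style direction the asymmetry refers to.
```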

This is not just a change from a "text - to - text" task to a "vision - to - text" task. It's a more fundamental transformation.

If the input end shifts entirely to pixels, what we are building is no longer a traditional "large language model" but a text-generation system conditioned on vision. The model no longer sees pre-segmented characters but more chaotic, disordered, information-rich raw signals. Along this new path, the outline of the world model seems to come into view.

Reading the conclusion of