
DeepSeek gives AI a cyber finger, and then it can see.

字母AI · 2026-05-01 12:41
OpenAI, Google, and Anthropic are all competing to make their models see more clearly, while DeepSeek is researching how to make AI understand what it is looking at.

One day before the May Day holiday, DeepSeek suddenly released a technical report on visual multimodal technology.

Before opening it, I had a rough expectation of what it would say: something about how far and how clearly the model could see.

After all, in the past year, most multimodal models have been competing in this direction. OpenAI talks about "thinking with images," allowing the model to crop, zoom in, and rotate images during the reasoning process. Gemini and Claude are also trying to enable the model to handle higher-resolution and more complex visual inputs.

The common assumption is that as long as the model can see more details, its visual reasoning ability will naturally be stronger.

However, after reading DeepSeek's report, you'll find that they have taken a completely different path.

DeepSeek doesn't focus on "letting the model see more pixels." Instead, they've placed their attention on a more fundamental problem.

Even if the model can see clearly, how can you ensure that it refers to the same thing as you do during the reasoning process?

Actually, this is the most easily overlooked Achilles' heel in multimodal reasoning.

When humans look at pictures, they can use their fingers to mark objects. For example, "This person is so-and-so," "That person is so-and-so." But how does the model know which one you're referring to?

The model can only use language to say "the one on the left," "the one on the top," "this line." Once the picture becomes complex, the language reference will drift, and the reasoning will collapse accordingly.

So DeepSeek said, why not give the model a "finger"?

It turns points and bounding boxes into the basic units of the model's thinking, so that the model can point at objects with this cyber finger while it reasons.

01 From Continuous Vision to Discrete Symbols

In this technical report, DeepSeek raised a very interesting question. They believe that the real difficulty for multimodal models lies not in seeing the image, but in stably referring to the same visual object during the continuous reasoning process.

For example, you tell your friend, "The vegetables sold at Granny Zhang's stall in the market are the freshest." But the market is full of elderly stallholders. Which one is Granny Zhang?

But if you directly point and say, "It's that one," your friend will immediately understand.

DeepSeek named this problem the "Reference Gap."

In the past year, almost all cutting-edge multimodal models have been trying to solve the "Perception Gap" problem.

Suppose there is a photo in front of you. If the photo is too blurry or has a low resolution, you may not be able to see the small words or distant details clearly. The same goes for AI. If the input image quality is insufficient or the processing method is incorrect, it will "fail to see clearly." This is the perception gap.

Models like GPT, Claude, and Gemini keep increasing the resolution and introducing high-resolution cropping, dynamic block partitioning, and multi-scale processing. The purpose is to enable the model to see more details.

This direction is of course valuable, but DeepSeek pointed out in the report that even if the model can see clearly, logical breakdowns will still occur in complex spatial reasoning tasks.

The problem lies in natural language itself.

If there are a dozen dogs in the photo and you say "the dog on the left," the model won't be able to understand which one you're specifically referring to.

Even more extreme, if you ask the model to count the number of dogs in the photo, it's very easy for the model to lose track of which dogs it has counted and which ones it hasn't during the reasoning process.

The report also mentioned extreme situations like maze navigation. Pure language simply cannot accurately describe irregular paths and complex topological relationships.

As a referential tool, language is inherently vague in the continuous visual space. It is good at abstract concepts and causal relationships, but there are fundamental limitations in its ability to express spatial positioning and topological relationships.

But DeepSeek is a general language model. So how should it solve this problem?

That's where the "finger" mentioned at the beginning of the article comes in.

The core concept they proposed is "Visual Primitives." Specifically, it elevates the two most basic spatial markers in computer vision, bounding boxes and points, to the "smallest units of thinking."

Although previous multimodal models could also draw boxes to mark objects, they only showed you the final result to prove "I've found it." It's like handing in only the answer on an exam without showing the problem-solving process.

Some studies have also made AI draw boxes during the thinking process, but the purpose is only to "see more accurately." The boxes are just auxiliary tools. It's like using scratch paper for math problems: the scratch paper helps you calculate, but it isn't part of the solution itself.

What DeepSeek wants to do is completely different.

They directly embed these spatial markers into the model's reasoning process, making them an organic part of the reasoning. When the model is thinking, it not only uses language to describe "I see a dog," but also outputs "I see a dog, and it's here: [[x1,y1,x2,y2]]."

DeepSeek calls this mechanism "point while it reasons."

Each step of the model's thinking is anchored to specific coordinates in the image.

The technical report gave such an example: The model starts from the starting point, explores, backtracks, and tries again all the way. Finally, it outputs a complete string of coordinate paths, and each coordinate corresponds to a point passed through in the maze.

In this way, the model won't "get lost" during the reasoning process. It won't be confused about what it's saying or referring to. Each visual object has a clear spatial anchor point, and the reasoning process becomes traceable and verifiable.
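To make this concrete, here is a minimal sketch of what such an anchored trace could look like and how the coordinates could be pulled back out for checking. The [[x1,y1,x2,y2]] notation follows the report's example as quoted above; the trace text itself and the parsing code are my own illustration, not DeepSeek's actual output or tooling.

```python
import re

# A hypothetical reasoning trace in which every referenced object carries
# an explicit bounding box [[x1, y1, x2, y2]] in image coordinates.
trace = (
    "I see a dog near the door: [[120, 340, 210, 455]]. "
    "There is a second dog under the table: [[412, 380, 498, 470]]. "
    "The dog on the left is the one near the door: [[120, 340, 210, 455]]."
)

# Extract every coordinate anchor so each reasoning step can be checked
# against the image instead of against vague phrases like "the one on the left".
BOX = re.compile(r"\[\[(\d+),\s*(\d+),\s*(\d+),\s*(\d+)\]\]")
anchors = [tuple(map(int, m.groups())) for m in BOX.finditer(trace)]

print(anchors)
# A repeated anchor shows the model is still pointing at the same object:
# the box (120, 340, 210, 455) appears twice, so "the dog on the left"
# stays grounded to the region it was introduced with.
print(len(set(anchors)), "distinct objects referenced")
```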

This technical route forms an interesting contrast with OpenAI's direction.

OpenAI clearly mentioned the concept of "thinking with images" in the official introductions of o3 and o4-mini. That is, the model can incorporate images into the reasoning chain and process images through cropping, zooming in, rotating, etc. The focus of this direction is to make the image itself a part of the thinking chain. The model can generate new images, modify images, and operate on images during the reasoning process.

OpenAI's route emphasizes general capability, with vision, code, search, files, and tool calls working together. The model has a powerful "visual workbench" that can flexibly handle various visual tasks.

DeepSeek's route is a bit more "symbolic." It allows coordinates to enter the thinking chain. The model explicitly writes the coordinates of bounding boxes and points in the reasoning text, turning visual objects into reusable anchors during reasoning.

As a result, OpenAI's visual reasoning occurs internally. Users can only see the final answer and necessary explanations, and the intermediate visual processing process is a black box. DeepSeek deliberately makes the intermediate visual anchors explicit, making the reasoning process completely transparent.

The advantage of DeepSeek's approach is that the reasoning process is easier to train, check, and score. It also makes it easier to design rewards at the format, quality, and task levels. In tasks like maze navigation and path tracking especially, more detailed feedback can be given on path legitimacy and trajectory coverage.
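To show why explicit coordinates make such feedback easy to compute, here is a deliberately simple, rule-based check of path legitimacy for a grid maze. The grid encoding, coordinate convention, and scoring are assumptions made for this sketch; the report does not publish its reward functions.

```python
# Minimal sketch of a path-legitimacy check for a grid maze.
# 0 = open cell, 1 = wall. Coordinates are (row, col) pairs; this encoding
# is assumed for the example and may differ from the model's actual traces.
MAZE = [
    [0, 0, 1, 0],
    [1, 0, 1, 0],
    [0, 0, 0, 0],
    [0, 1, 1, 0],
]

def path_reward(path, start, goal):
    """Return 1.0 for a legal path from start to goal, else 0.0."""
    if not path or path[0] != start or path[-1] != goal:
        return 0.0
    for (r1, c1), (r2, c2) in zip(path, path[1:]):
        # Every step must move to an adjacent, in-bounds, open cell.
        if abs(r1 - r2) + abs(c1 - c2) != 1:
            return 0.0
        if not (0 <= r2 < len(MAZE) and 0 <= c2 < len(MAZE[0])):
            return 0.0
        if MAZE[r2][c2] == 1:
            return 0.0
    return 1.0

# A coordinate path as the model might emit it, step by step.
path = [(0, 0), (0, 1), (1, 1), (2, 1), (2, 2), (2, 3), (3, 3)]
print(path_reward(path, start=(0, 0), goal=(3, 3)))  # 1.0
```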

The model not only learns to output the correct answer but also learns the method of reasoning with visual primitives.

02 Efficiency Is the Core

There is a very easily overlooked but extremely important detail in DeepSeek's report. When their model processes images, the number of tokens it uses is far less than that of other cutting-edge models.

The report has a comparison chart showing the number of tokens consumed by different models when processing an 800×800 resolution image.

Gemini-3-Flash uses about 1,100 tokens, Claude-Sonnet-4.6 about 870, GPT-5.4 about 740, and Qwen3-VL about 660, while DeepSeek uses about 361 tokens and retains only about 90 entries in the KV cache.

This gap is not small. DeepSeek uses only one-third of the tokens that Gemini uses, and the number of KV cache entries is only about one-tenth.

How is this extreme efficiency achieved?

DeepSeek uses a mechanism called "Compressed Sparse Attention (CSA)."

You can understand it like this. If you show a family photo to a friend, you won't say, "There is a red area starting from the 237th pixel from the left..." You'll directly say, "My mom is on the left, and my dad is on the right."

DeepSeek-ViT first compresses the image into fewer visual tokens, and then CSA further compresses the representation of these visual tokens in the KV cache.

This mechanism was used in the DeepSeek-V4-Flash model and is now applied to visual multimodality.

The specific compression process is as follows. A 756×756 image contains 571,536 pixels. These pixels are first processed by ViT and divided into patches with a size of 14×14, generating 2,916 patch tokens. Then, a 3×3 spatial compression is performed, compressing every 9 adjacent tokens into 1 along the channel dimension, resulting in 324 visual tokens.

These 324 tokens enter the large language model for pre-filling. Finally, the CSA mechanism compresses these visual tokens in the KV cache by a factor of 4, ultimately only retaining 81 entries.

From 571,536 pixels to 81 KV cache entries, the overall compression ratio reaches 7,056 times.
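The arithmetic of that pipeline can be reproduced in a few lines. The numbers simply restate the report's 756×756 example; the stage names are descriptive labels, not API calls.

```python
# Reproducing the compression arithmetic from the report's 756x756 example.
side = 756
pixels = side * side                      # 571,536 pixels
patch = 14
patch_tokens = (side // patch) ** 2       # 54 x 54 = 2,916 ViT patch tokens
visual_tokens = patch_tokens // (3 * 3)   # 3x3 spatial merge -> 324 visual tokens
kv_entries = visual_tokens // 4           # CSA compresses the KV cache 4x -> 81

print(pixels, patch_tokens, visual_tokens, kv_entries)
print("overall compression:", pixels // kv_entries, "x")  # 7,056x
```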

Large AI companies generally rely on brute force, piling up computing resources, while DeepSeek makes trade-offs at the level of information itself, keeping only what the model actually needs for reasoning.

The most direct result is that the reasoning speed has increased significantly.

The number of image tokens directly affects the model's reasoning latency. During the autoregressive generation process, every time a new token is generated, the model needs to perform attention calculations on the KV caches of all previous tokens. If the image occupies 1,000 tokens, then attention needs to be calculated for these 1,000 tokens every time. If it only occupies 90 tokens, the computational load is significantly reduced.
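A back-of-the-envelope comparison makes the point. This assumes the per-step cost scales roughly linearly with the number of cached image entries, a simplification that ignores text tokens and other overheads.

```python
# Rough relative attention work over the image cache when decoding one new
# token: each step attends to every cached image entry (a simplification
# that ignores text tokens and other overheads).
dense_entries = 1_000      # image kept as ~1,000 cached entries
compressed_entries = 90    # DeepSeek's compressed image cache
print(dense_entries / compressed_entries)  # ~11x less attention work per step
```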

For application scenarios that require real-time response, such as robot vision, autonomous driving, and real-time video analysis, the improvement in reasoning speed plays a decisive role.

It also reduces memory usage.

The KV cache is the memory bottleneck of large model inference. Especially when processing long contexts or running batch inference, the KV cache can occupy a large amount of GPU memory. DeepSeek compresses the KV cache of visual tokens to about 90 entries, which means more images can be processed on the same hardware, or longer multi-round conversations can be handled.

This is very important for actual deployment. Many companies' multimodal models perform well in the laboratory, but encounter cost problems when it comes to actual deployment. The more tokens each picture consumes, the higher the reasoning cost and the fewer concurrent users can be supported. DeepSeek's efficiency advantage will be magnified during large-scale deployment.

It also indirectly increases the model's context capacity.

If a picture occupies 1,000 tokens, then only more than 100 pictures can be placed in a 128k context window. If it only occupies 300 tokens, more than 400 pictures can be placed. This is crucial for scenarios that require handling multi-image conversations, long video analysis, and understanding a large number of documents.
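The same trade-off can be checked with a quick calculation under the article's rough per-image token counts:

```python
# How many images fit in a 128k-token context window at different
# per-image token costs (figures follow the article's rough numbers).
window = 128_000
for tokens_per_image in (1_000, 300):
    print(tokens_per_image, "tokens/image ->", window // tokens_per_image, "images")
# 1,000 tokens/image -> 128 images; 300 tokens/image -> 426 images
```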

DeepSeek's model can handle more images in a single conversation, can compare and analyze dozens or even hundreds of pictures, and can track long-term changes in videos.

The most critical factor is the training cost.

Although the report mainly focuses on reasoning efficiency, this compression mechanism is also effective during the training phase. Fewer visual tokens mean a smaller computational graph, faster training speed, and lower hardware requirements.

DeepSeek has always been known for "achieving better results with fewer resources." From the reinforcement learning training of R1, to the MoE architecture of V4, and now to visual multimodality, this efficiency-first philosophy runs through it all.

But there is a key question. Will compression result in information loss?

DeepSeek doesn't deny that compression causes information loss. Their claim is that, for this set of spatial reasoning and counting tasks, the compressed representation is still effective enough.

Each step of compression retains the most important information for reasoning and discards redundancy and noise.

Actually, the visual primitive mechanism of DeepSeek mentioned earlier is also a form of information compression. A bounding box can accurately locate an object with 4 numbers, and a point can mark a position with 2 numbers. The information density carried by these discrete symbols is much higher than that of the original pixels.

From the experimental results, this compression doesn't damage performance. Instead, it brings improvements in some tasks.

This shows that for many visual reasoning tasks, the bottleneck doesn't lie in not seeing clearly enough, but in not finding the right representation method.

This efficiency advantage also proves that multimodal intelligence doesn't necessarily require a larger model, more computing power, or higher costs.

Since the birth of DeepSeek, there has always been an underlying thread in this company: "True intelligence doesn't lie in computing power, but in understanding the essence of the problem."

When you truly understand what visual reasoning needs, you don't need so many tokens. When you find the right representation method, you don't need such a large model.

From this perspective, DeepSeek's extreme efficiency is not the goal but a by-product. The real goal is to find the correct paradigm for visual reasoning. Efficiency only proves that this paradigm is correct.

03 Unfinished Business

In the limitations section of the report, DeepSeek candidly listed several problems with the current method. These problems are not minor flaws in technical details but point to the next stage of visual reasoning.

The first problem is trigger word dependency.

The report clearly states that the current ability to "think with visual primitives" requires explicit trigger words to be activated. That is, the model still can't naturally and autonomously decide "when to draw boxes and mark points."

In other words, the model hasn't really learned to judge when to use visual primitives and when language alone is enough.

Ideally, the model should decide autonomously based on the nature of the task. When the user asks, "Count how many dogs are in the picture," for example, it should automatically switch to the visual primitive mode and use bounding boxes to assist the count.

Technically, this requires establishing a metacognitive layer in the model. This metacognitive layer can evaluate the complexity of the current task, judge whether pure language reasoning is sufficient, and decide whether to call visual primitives.
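To make the idea of such a layer concrete, here is a deliberately naive sketch of a router that decides whether a query needs visual primitives. Everything in it, the keyword heuristic included, is a hypothetical illustration of the concept, not DeepSeek's design, which, as noted below, does not exist yet.

```python
# Naive sketch of a "metacognitive" router: decide whether a query calls for
# grounded visual primitives or whether plain language reasoning suffices.
# The keyword heuristic is purely illustrative; a real router would be learned.
SPATIAL_CUES = ("count", "how many", "which one", "where", "path", "left", "right")

def needs_visual_primitives(question: str) -> bool:
    q = question.lower()
    return any(cue in q for cue in SPATIAL_CUES)

print(needs_visual_primitives("Count how many dogs are in the picture"))  # True
print(needs_visual_primitives("What season does this photo evoke?"))      # False
```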

DeepSeek hasn't implemented this metacognitive layer yet, but they've clearly defined the direction. Future versions may enable the model to learn to autonomously decide the reasoning strategy instead of relying on external triggers.

The second problem is resolution limitation.

The report mentions that due to the input resolution limitation, the model's performance in fine-grained scenarios is not good enough, and the output visual primitives are sometimes not accurate enough.

This problem is related to DeepSeek's efficiency-first strategy. To control the number of tokens, they limit the range of visual tokens to between 81 and 384. For images beyond this range, scaling processing will be performed.

This design is reasonable in most scenarios, but it will encounter bottlenecks in some tasks that require extremely high precision. For example, medical image analysis needs to identify tiny lesions, and industrial quality inspection needs to detect subtle defects. These scenarios have high requirements for resolution.

DeepSeek mentioned in the report that this problem can be solved by integrating existing high-resolution methods. That is, their visual