
DeepSeek "Open Eye" Sets the AI Community Ablaze: I Tested Its Capabilities with 12 Tricky Images to Find Its Limits

Lei Technology, 2026-04-30 09:12
DeepSeek has completed the final piece of the puzzle!

Five days after DeepSeek landed a heavy punch with V4 and set the tech world ablaze, Chen Xiaokang, the researcher in charge of multimodality at DeepSeek, posted the following on X:

Now, we see you.

(Image source: Lei Technology)

Yes, it means exactly what it says.

While everyone was still marveling at V4's pricing and coding ability, DeepSeek quietly began testing an image recognition mode. The multimodal capability the whole internet had been discussing for a year has finally arrived.

The pace of this update makes you wonder whether Liang Wenfeng locked the development team in the server room overnight to keep netizens from turning him into a "slacking off" meme.

Note that this is not a full rollout but a small-scale gray test: only some users can see it in the official DeepSeek app or on the web. For those users, alongside the existing quick mode and expert mode buttons above the input field, a new image recognition mode button appears, labeled "Image understanding is in internal testing".

(Image source: Lei Technology)

Unluckily, none of my colleagues were selected for the gray test. The number of them DeepSeek picked was, in fact, zero!

Luckily, I turned out to be the one-in-ten-thousand chosen one.

Since fate arranged it this way, I'd feel guilty not testing it on everyone's behalf. So I carefully selected 12 images to show what DeepSeek can actually "see".

Strong understanding, but the knowledge base needs updating

Without further ado, let's start the test with image description.

I put this first because it is the most widely used visual-understanding feature in real-world scenarios.

Take daily life as an example: when we spot an unfamiliar plant by the roadside, want to find a shopping link for a piece of clothing we like, or struggle with a menu full of foreign words while abroad, our first instinct is usually to take a photo and ask an AI, "What is this?"

This kind of "seeing is asking" interaction essentially tests the model's visual understanding ability.

This time I prepared three test images: a cosplay photo, a photo from my museum visit, and a shot of a busy event venue.

(Image source: Lei Technology)

Prompt: Please describe this picture in detail, keeping the description within 250 words.

For the first picture, DeepSeek's answer is as follows:

(Image source: Lei Technology)

Yes, DeepSeek not only described every detail of the image but also recognized the characters in it, faithfully capturing the background, lighting, and other elements. Feed this text into a text-to-image model and you could regenerate a highly similar picture.

Bear in mind, this was achieved without thinking mode turned on.

For the second picture, DeepSeek's answer is as follows:

(Image source: Lei Technology)

Without thinking mode, the answer is just a plain description of the picture with no analysis of the object itself, but the description is quite accurate. It still concludes that the object carries strong Middle Eastern or Central Asian artistic influences and is likely a precious court or religious ritual vessel.

So, what if I turn on the thinking mode?

Now it starts to analyze. First it breaks the object down: what it is, what its features are, and what kind of environment it sits in.

(Image source: Lei Technology)

Then it moves on to identification, concluding that the piece is in the Qing-dynasty Hindustan style.

(Image source: Lei Technology)

So what is the Qing-dynasty Hindustan style? According to Wikipedia, it refers to jadeware in a Central Asian Islamic style introduced during the Qianlong period of the Qing dynasty in the 18th century, originating mainly from the Mughal Empire in northern India.

Coincidentally, the photo was taken at a Mughal Empire exhibition I visited, so it really got it right.

(Image source: Lei Technology)

For the third picture, DeepSeek's answer is as follows:

(Image source: Lei Technology)

Besides describing the scene and reading the text in the image, this time it confidently identified the photo as being taken at the China International Building Materials Fair in Guangzhou. There is really nothing to fault in its image understanding.

Of course, everything above is describing pictures based on what can be seen. How does it do at recognizing newer information?

This time I prepared three images from the past few years. Prompt: What is shown in the picture? State your reasoning, keeping the answer within 200 words.

(Image source: Lei Technology)

For the first picture, DeepSeek's answer is as follows:

(Image source: Lei Technology)

Well... at least it can tell from the picture that this is Pokémon-related, but the game "Pokopia" is too new and clearly isn't in DeepSeek's knowledge base.

For the second picture, DeepSeek's answer is as follows:

(Image source: Lei Technology)

This time its judgment was spot on: it is indeed an FM24 tactics diagram downloaded from 3dm.

For the third picture, DeepSeek's answer is as follows:

(Image source: Lei Technology)

It clearly lacks the latest product information, but it pinned the phone down as the Xiaomi 11 Ultra based on the secondary screen. At the very least, DeepSeek's image recognition follows the right logic.

Logic puzzles are still unsolvable

Next, let's try element recognition.

Simply put, this part tests the AI's eye for detail. Some of the questions are so hard that even real people might not get them right.

And hey, we can also check whether DeepSeek is color-blind.

There are plenty of such pictures on the internet; I simply searched Google for a few to use in the test. You're welcome to take a look too.

(Image source: Lei Technology)

Let's start with the first one. Prompt: Please tell me directly how many tigers are in this picture.

Unexpectedly, this question sent DeepSeek into an argument with itself, repeatedly rejecting its previous count. After counting six tigers twice, it finally and firmly answered seven.

(Image source: Lei Technology)

The problem is that there are ten tigers in this picture, which is really embarrassing.

Let's test the second one. Prompt: There is a set of numbers hidden in this picture. Please directly tell me how many numbers there are and what they are.

(Image source: Lei Technology)

Well, this picture has stumped every AI before it, and DeepSeek failed to recognize it as well.

The same goes for the third picture. Images built on inverted colors and fragmentation remain the sworn enemies of visual understanding.

(Image source: Lei Technology)

Finally, there are three graphic logic questions. Deep