Is DeepSeek's image recognition mode a new model? Here is a first-hand test.
Today, have you been included in the limited release of DeepSeek's image recognition mode?
People have been looking forward to DeepSeek's multimodal capabilities for a long time, and now, following the release of V4, there's a pleasant surprise. Even before DeepSeek officially releases more information, people have started digging up clues about the "image recognition" feature from every angle.
There are actually quite a few discoveries.
For example, behind DeepSeek's image recognition mode, there seems to be a new model independent of V4 flash/pro.
Another example is that what's mentioned in the "Future Outlook" section of DeepSeek's V4 technical report might actually be almost done...
When I woke up today, I was also included in the limited release. Now, let me show you the results of my actual tests.
Actual Test of DeepSeek's Image Recognition Mode
In the image recognition mode, you can choose whether to enable in-depth thinking.
In non-thinking mode, this DeepSeek visual model is extremely fast, even faster than a lightning-fast five-strike whip.
When you click the send button, you hardly need to wait, and the answer pops up right away.
So, how does the reasoning ability of DeepSeek's image recognition mode differ between thinking and non-thinking modes?
Reasoning Ability
Let's start with a spatial reasoning question: To form the cube on the left by assembling the figures on the right without rotation, which figure should be added at the question mark?
In non-thinking mode, it gives an answer instantly, and then... it's instantly wrong.
After enabling in - depth thinking, DeepSeek successfully solves the problem and gives the correct answer D.
However, as you can see, it takes more than four minutes to think about this problem.
You can get an intuitive feel for just how long that thinking process is:
In the middle of the thinking process, DeepSeek actually finds the correct answer:
But then it says "wait a minute", and then... it goes on and on.
Someone has also reported this problem under DeepSeek researcher Chen Xiaokang's tweet.
Let's try finding differences in pictures: Find all the differences between the two pictures.
In non-thinking mode, DeepSeek quickly finds 7 differences.
The hallucinations are quite obvious, though: it's unclear where the "key in the tray" in point 5 comes from, and there is no white empty plate between the apple and the banana as claimed in point 7.
In thinking mode, it only takes 16 seconds to find 12 differences.
But... whether because of the picture itself or not, there are even more hallucinations this time.
Practical Functions
There's still room for improvement in the reasoning part. So, is DeepSeek's image recognition mode reliable in terms of practical functions?
Let's try OCR.
Feed the abstract of the DeepSeek V4 technical report into the image recognition mode without enabling in-depth thinking, and it still returns results lightning-fast, even helpfully hyperlinking the open-source link.
The plain text seems fine. Let's see if DeepSeek can handle tables.
No problem there either, and it lays the table out neatly in Markdown.
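If you want to script this kind of OCR-to-Markdown test yourself, a request would look roughly like the sketch below. This is only an illustration: DeepSeek has not published official API details for the image recognition mode at the time of writing, so the model name (`deepseek-vision`) and the OpenAI-compatible message format used here are assumptions, not a confirmed interface.

```python
import base64

def build_ocr_request(image_path: str, model: str = "deepseek-vision") -> dict:
    """Build a chat-completions-style payload asking a vision model to
    transcribe an image into Markdown. The content-part schema follows
    the common OpenAI-compatible convention; the model name is a
    placeholder, not a confirmed DeepSeek identifier."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("ascii")
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text",
                     "text": "Transcribe this image. Render any tables as Markdown."},
                    # Inline the image as a base64 data URL, a widely
                    # supported way to attach images to chat requests.
                    {"type": "image_url",
                     "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
                ],
            }
        ],
    }
```

The payload would then be POSTed to whatever chat-completions endpoint the mode is eventually exposed on; only the request shape is sketched here.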
Another popular new trick is sending a screenshot of a web page to DeepSeek, which can reconstruct the HTML directly (this works even in non-thinking mode).
The buttons it generates are all usable; for example, given the link to the API documentation, it automatically wires up the navigation.
DeepSeek can also successfully pass the "hidden image" test.
But it occasionally fails the color-blindness test.
According to the image recognition mode's own answer, its knowledge is the same as that of DeepSeek V4 flash/pro, up to May 2025.
Digging into its world knowledge, a blogger noticed something strange: the visual model knows who "Ta" is, while V4 flash/pro doesn't.
Does it mean that the visual model in the image recognition mode is independently trained?
Verification confirms it: with internet access cut off, flash really has no knowledge of this person, while the image recognition mode surfaces information from April 2026.