
Humans Understand Instantly, AI Collapses: A Simple Test Causes Top Models Like GPT-5 and Gemini to Fail Collectively

QbitAI 2025-09-10 10:36
AI relies on pattern matching and doesn't understand the structure of text.

Text that humans can understand at a glance has completely stumped AI.

A research team from institutions such as A*STAR, NUS, NTU, Tsinghua University, and Nankai University has recently made a new discovery:

Whether it's OpenAI's GPT-5 and GPT-4o, Google's Gemini, Anthropic's Claude, or models such as Qwen and LLaVA, they all perform extremely poorly, failing outright when faced with text that is "visible yet unreadable".

Cut and Stack: AI at a Loss

The VYU (Visible Yet Unreadable) team designed two small experiments (a construction sketch in code follows the list):

1. They selected 100 four-character Chinese idioms, cut each character horizontally, vertically, and diagonally, and then reassembled the fragments.

Humans can read them effortlessly, but AI gets almost all of them wrong.

2. They picked 100 eight-letter English words, colored the first and second halves red and green respectively, and then overlaid them.

For humans, this poses almost no challenge: our visual system is extremely sensitive to the red and green channels, and our brains automatically separate the colors and then piece together the complete words.
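A minimal sketch of how stimuli like these could be constructed with Pillow. The font path, sizes, and offsets are placeholder assumptions for illustration, not the paper's actual pipeline:

```python
from PIL import Image, ImageChops, ImageDraw, ImageFont

FONT = "NotoSansCJK-Regular.ttc"  # placeholder; any font covering the glyphs works

def render(text, size=96, color="black"):
    """Render text on a white canvas and return an RGB image."""
    font = ImageFont.truetype(FONT, size)
    img = Image.new("RGB", (size * len(text), int(size * 1.3)), "white")
    ImageDraw.Draw(img).text((0, 0), text, fill=color, font=font)
    return img

def cut_and_shift(img, offset=10):
    """Experiment 1 (horizontal variant): slice the rendered idiom into
    top and bottom halves, then shift the halves against each other."""
    w, h = img.size
    out = Image.new("RGB", (w + offset, h), "white")
    out.paste(img.crop((0, 0, w, h // 2)), (offset, 0))  # top half, nudged right
    out.paste(img.crop((0, h // 2, w, h)), (0, h // 2))  # bottom half stays put
    return out

def red_green_overlay(word):
    """Experiment 2: first half in pure red, second half in pure green,
    overlaid in place. Multiply-blending two text-on-white images keeps
    both halves legible (overlapping strokes go black)."""
    half = len(word) // 2
    a = render(word[:half], color=(255, 0, 0))
    b = render(word[half:], color=(0, 255, 0))
    return ImageChops.multiply(a, b)

cut_and_shift(render("画蛇添足")).save("idiom_cut.png")
red_green_overlay("hardline").save("overlay.png")
```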

But for AI models, the results are completely different:

Even the latest large models often stumble on these problems.

Whether it is Gemini 2.5 Pro, Kimi 2 (switched to Kimi 1.5 for visual understanding; its final inferred answer was "hardline"), or Qwen3-Max-Preview, none of them gets the right answer.
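A rough sketch of how such a probe could be scripted against a vision-language model, here via OpenAI's Python SDK. The model name and prompt are illustrative assumptions, not the team's exact protocol:

```python
import base64

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Encode the overlaid-word image produced earlier as a data URL.
with open("overlay.png", "rb") as f:
    b64 = base64.b64encode(f.read()).decode()

resp = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "Two overlaid word halves are shown, one red and one green. "
                     "What eight-letter English word do they spell?"},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{b64}"}},
        ],
    }],
)
print(resp.choices[0].message.content)
```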

AI Doesn't Understand Symbol Segmentation and Combination

After analyzing this phenomenon, the VYU team believes that the root cause is that AI relies on pattern matching and doesn't understand the structure of text.

The reason humans can "understand" is that we rely on structural priors: we know that Chinese characters are composed of radicals, and that English words are built from letters.

Large models simply recognize text as "image patterns" and lack any mechanism for segmenting and recombining symbols.

As a result, as soon as the text is slightly perturbed (in ways humans can still read), AI collapses completely.
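For contrast, the color separation that human vision performs effortlessly takes only a few lines once the structure is made explicit. A minimal sketch, assuming the pure-red/pure-green multiply-blended overlay.png produced earlier:

```python
from PIL import Image

# In the blend above, each half's letters darken exactly one channel of
# the white page, so splitting the channels recovers each half on its own.
img = Image.open("overlay.png")
r, g, b = img.split()
g.save("first_half.png")   # green channel: red letters show as dark, green letters vanish
r.save("second_half.png")  # red channel: green letters show as dark, red letters vanish
```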

This problem is worth studying because it touches on core challenges for deploying AI in the real world:

In education and accessibility applications, AI may fail to correctly recognize "non-standard" text.

In the digitization of historical documents and scientific notes, AI cannot recover meaning from incomplete text the way humans can.

In security settings, attackers could even exploit this "blind spot" to slip past AI content moderation.

The VYU team believes that giving AI human-like resilience requires rethinking how Vision-Language Models (VLMs) integrate vision and text:

Perhaps we need new training data, structural priors that put more weight on segmentation, or an entirely new approach to multimodal fusion.

More importantly, this result reminds us that human reading comprehension has never been a single-modality process; it relies on the combined power of multiple senses and reasoning.

Paper link:

https://zjzac.github.io/publications/pdf/Visible_Yet_Unreadable__A_Systematic_Blind_Spot_of_Vision_Language_Models_Across_Writing_Systems__ArXiv.pdf

Zhang Yang

This article is from the WeChat official account "QbitAI". The author focuses on cutting-edge technology. 36Kr has obtained authorization to publish it.