
Anticipation for Google's Gemini 3 release is at an all-time high, and a historian claims the model has already solved two of the oldest problems in the AI field.

Friends of 36Kr · 2025-11-13 11:16
A Google AI model, possibly the upcoming Gemini 3, shows breakthroughs in handwriting recognition and symbolic reasoning.

On November 12th, an article titled "Has Google Quietly Solved Two of AI’s Oldest Problems?" began circulating rapidly within the artificial intelligence community.

The author is Mark Humphries, an associate professor of history at Wilfrid Laurier University in Waterloo, Canada. A scholar whose original specialty is twentieth-century North American history, he has recently shifted his focus to digital humanities and the application of artificial intelligence. In his Substack column "Generative History," he revealed that a mysterious model he tested in Google AI Studio demonstrated "almost perfect" handwritten text recognition and signs of "spontaneous, abstract, and symbolic reasoning."

Note: The AI Studio interface shows an A/B test.

Google's AI Studio is an open experimental platform where users can test prompts and compare the performance of different models. In the past week, some users have noticed that the system randomly generates two answers and asks them to choose the better one. This is a common A/B testing method used by large AI laboratories before launching a model (to compare the effects of two or more solutions and determine which one is better). Based on this, the outside world speculates that the model being tested might be the upcoming Gemini-3.

Humphries' experiment was originally intended to measure the model's performance on the task of transcribing handwritten historical documents. However, he unexpectedly observed a deeper phenomenon: the model not only matched human experts in transcription accuracy but could also reason, self-correct, and explain its choices when faced with ambiguous or incomplete information, as if it were "understanding" the historical materials rather than merely recognizing characters.

He wrote, "I thought it would take several more years for AI to make a breakthrough in the field of historical documents. However, the capabilities demonstrated by this model are close to those of real human experts, and even exceed expectations in logical judgment and context restoration."

If these results are confirmed, they will mark a crucial moment in the history of AI: a machine can not only "read" handwritten symbols but also "think" about the logic behind them like a scholar. It would mean that AI may have simultaneously overcome two of the oldest problems in the field - handwritten text recognition and symbolic reasoning.

01. From "Prediction Machine" to "Understander"

Handwritten Text Recognition (HTR) is one of the earliest topics in the history of AI research. As early as the 1940s, researchers tried to make computers recognize human handwriting. In 1966, IBM released the IBM 1287, a machine that could read handwritten numbers and some Latin letters and is regarded as the starting point of machine handwriting recognition. For decades, researchers have continuously improved algorithms and visual models, but they have always been limited by one problem: machines can only recognize patterns; they cannot understand semantics.

Note: Recognizing historical manuscripts

Humphries pointed out that recognizing historical manuscripts is much more complex than ordinary text. This is not only a visual problem but also a challenge in language and cultural understanding. Manuscripts from the 18th and 19th centuries are full of spelling mistakes, inconsistent grammar, ambiguous symbols, and semantic ambiguities. Understanding these contents requires the simultaneous use of linguistics, historical background, social common sense, and logical reasoning.

He explained, "People think the difficulty of ancient documents lies in handwriting recognition, but the real challenge is inferring the author's intention - that is a combination of visual recognition and logical reasoning."

In his research, handwritten text recognition has become an ideal scenario for testing the limits of LLM (Large Language Model) capabilities. This is because it requires the model to integrate perception (Vision), language (Language), world knowledge (World Knowledge), and logic (Reasoning) into the same task. If the model can make a breakthrough in such a complex task, it may indicate the emergence of more extensive intelligent capabilities.

Note: The performance evolution of Transkribus, humans, and Google models in handwritten text recognition (HTR) over time

From GPT-4 to Gemini-2.5-Pro, the accuracy of AI on HTR tasks has improved steadily. Earlier this year, Gemini-2.5-Pro could achieve a Character Error Rate (CER) of 4% and a Word Error Rate (WER) of 11% on complex manuscripts, reaching the level of a professional human transcriber. The new model tested by Humphries reduced the CER further to 0.56% and the WER to 1.22% - roughly one wrong letter or punctuation mark in every 180 characters.
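To make these metrics concrete, here is a minimal Python sketch (not the evaluation code used in the study) of how CER and WER are typically computed: the edit distance between the model's transcript and a ground-truth transcript, divided by the length of the ground truth. The sample strings are invented for illustration.

```python
# Minimal sketch of CER/WER; the sample strings below are invented, not from the study.

def levenshtein(ref, hyp):
    """Minimum number of insertions, deletions, and substitutions to turn hyp into ref."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution
            ))
        prev = curr
    return prev[-1]

def cer(reference, hypothesis):
    """Character Error Rate: character-level edit distance / reference length."""
    return levenshtein(list(reference), list(hypothesis)) / len(reference)

def wer(reference, hypothesis):
    """Word Error Rate: word-level edit distance / number of reference words."""
    ref_words, hyp_words = reference.split(), hypothesis.split()
    return levenshtein(ref_words, hyp_words) / len(ref_words)

reference = "To 1 loaf Sugar 14 lb 5 oz @ 1/4"
hypothesis = "To 1 loaf Sugar 145 @ 1/4"
print(f"CER = {cer(reference, hypothesis):.2%}, WER = {wer(reference, hypothesis):.2%}")
```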

He pointed out that this generational improvement is highly consistent with the "Scaling Laws": as a model's parameters, data, and compute grow, its loss falls along a smooth, predictable power-law curve, and its ability to handle complex tasks improves with it. If this law continues to hold, models may eventually cross the boundary of logical reasoning that was previously considered "unique to humans."
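As a rough illustration of what such a law looks like, the sketch below uses the power-law form and constants reported by Kaplan et al. (2020) for language-model loss; the numbers are illustrative and are not figures from Humphries' article.

```python
# Illustrative only: Kaplan-style power law L(N) = (N_c / N) ** alpha,
# with constants from Kaplan et al. (2020); not data from this article.
N_C = 8.8e13   # reference (non-embedding) parameter count
ALPHA = 0.076  # exponent governing how fast loss falls with scale

def loss(n_params):
    """Predicted held-out loss as a function of parameter count."""
    return (N_C / n_params) ** ALPHA

for n in (1e9, 1e10, 1e11, 1e12):
    print(f"{n:.0e} params -> predicted loss {loss(n):.3f}")
```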

02. From Transcription to Reasoning: Unexpected Findings from the Experiment

To verify the performance of the model, Humphries uploaded a set of 18th-century handwritten account books and letters. These materials are often full of spelling mistakes, scribbled handwriting, and inconsistent formats. The testing process was extremely cumbersome - he had to refresh the interface repeatedly and wait for the system to provide an A/B comparison opportunity, sometimes trying more than thirty times.

The results were unexpected. The model was not only nearly perfect in word and character recognition but also demonstrated active reasoning that went beyond the requirements of the task.

Note: A page from the journal of an Albany merchant

The most representative example comes from the journal of a merchant in Albany, New York, in 1758. The ledger records, "To 1 loaf Sugar 145 @1/4 0 19 1." Human scholars know that this means "purchasing a sugar cone, 1 shilling and 4 pence per pound, with a total price of 0 pounds, 19 shillings, and 1 penny." However, the manuscript is extremely blurry, and it is not clear whether the number "145" is "14.5" or "1.45."

Note: A close-up of the transcription

Note: A close-up of the original document

Almost all AI models would make mistakes in this case - they might misread "145" as 145 pounds or misarrange the numbers and units. However, the new Gemini model reasoned out the correct answer on its own: "14 lb 5 oz."

It didn't make a blind guess but arrived at the answer through logical calculation:

Note: The results of the test

1 shilling and 4 pence = 16 pence, and the total price of 0 pounds, 19 shillings, and 1 penny = 229 pence. 229 ÷ 16 = 14.3125 pounds by weight, which is 14 pounds 5 ounces. The model not only calculated correctly but also standardized the notation by adding the "lb" and "oz" units in its output.
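The same check can be reproduced in a few lines. The following sketch illustrates the arithmetic described above (it is not the model's actual internal computation), converting the pre-decimal amounts to pence and back:

```python
# Pre-decimal British money: 12 pence (d) = 1 shilling (s), 20 shillings = 1 pound (£).
# Weight: 16 ounces (oz) = 1 pound (lb). Illustration of the ledger check described above.
PENCE_PER_SHILLING = 12
SHILLINGS_PER_POUND = 20
OUNCES_PER_LB = 16

def to_pence(pounds, shillings, pence):
    return (pounds * SHILLINGS_PER_POUND + shillings) * PENCE_PER_SHILLING + pence

unit_price = to_pence(0, 1, 4)    # "@ 1/4" -> 1s 4d per lb = 16d
total_price = to_pence(0, 19, 1)  # "0 19 1" -> £0 19s 1d = 229d

weight_lb = total_price / unit_price          # 229 / 16 = 14.3125 lb
whole_lb = int(weight_lb)
ounces = round((weight_lb - whole_lb) * OUNCES_PER_LB)

print(f"{total_price}d / {unit_price}d per lb = {weight_lb} lb = {whole_lb} lb {ounces} oz")
# -> 229d / 16d per lb = 14.3125 lb = 14 lb 5 oz
```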

Humphries was surprised to find that "it seems to know that the accounts don't balance and actively conducts reverse calculations and corrects the units. This is not prediction; this is reasoning."

This means that when faced with ambiguous or incomplete input, the model can build an "internal problem representation" and arrive at a reasonable conclusion through multi-step logical calculation. This is exactly the core characteristic of "symbolic reasoning" that AI has long been considered incapable of.

In the past, GPT- and Gemini-series models often hallucinated or made numerical errors in similar tasks. This model, however, not only calculated correctly but also maintained contextual consistency and semantic stability. It was never asked to perform mathematical verification; it spontaneously completed the reasoning while "understanding the text" - a behavior that shocked researchers.

03. From Emergent Intelligence to Theoretical Shock

Symbolic Reasoning is considered the core of human cognition. It means that an individual can manipulate abstract symbols in the mind and execute logical rules, rather than relying solely on pattern matching. Since the 1950s, artificial intelligence has been trying to enable machines to master this ability. However, in the era of deep learning, symbolic reasoning was considered an area that neural networks could not reach. Humphries' discovery has broken this assumption.

He pointed out, "Strictly speaking, this model is not designed as a symbolic system, and it does not have an explicit logical module. However, its behavior is consistent with symbolic reasoning - it can detect ambiguities, propose hypotheses, conduct verification, and output correct explanations."

In other words, this is emergent implicit reasoning. The model doesn't really "know" what it is doing, but its internal high-dimensional representations are rich enough to form structures equivalent to reasoning. It doesn't explicitly manipulate rules, yet a sufficiently complex statistical network can naturally give rise to logical patterns.

This phenomenon has profound implications for AI theory. In the past, people treated "statistical learning" (pattern recognition) and "symbolic reasoning" (symbolic manipulation) as two completely different forms of intelligence. Now the two seem to be starting to merge: at sufficient scale, machines may "learn to reason" without explicit rules.

What's even more remarkable is that this ability is not an isolated case. Multiple users in the AI community have reported similar experiences: the new model can spontaneously demonstrate multi-step logical thinking in tasks such as chemical formula derivation, manuscript date inference, and ancient currency conversion.

This forces researchers to re-examine the definition of "understanding": when AI can propose and solve problems on its own without external instructions, is it still a "prediction model"? Or has it started to form a primitive "cognitive structure"?

The academic discussion has spread rapidly. Some AI theorists believe that this proves that the Scaling Laws can indeed bring "symbol-like intelligence," meaning that reasoning ability may stem from statistical complexity itself. Others are more cautious, believing that these phenomena may just be coincidental contextual associations rather than true understanding.

In any case, this experiment reveals a fact: the understanding ability of AI may be moving from "probability" to "concept."

04. Echoes of History and the Threshold of the Future

For the historical research community, this breakthrough is revolutionary. If AI can transcribe and understand handwritten historical materials with expert-level accuracy, it will completely change the way of archival research. Hundreds of millions of historical letters, account books, diaries, and manuscripts will be quickly digitized and automatically analyzed in a structured manner.

This means that historical research may no longer rely on manual page-by-page input but enter the "era of machine co-reading." In the future, AI can not only help you read but also help you interpret.

However, this also brings new ethical and methodological challenges: when AI replaces human understanding with probabilistic reasoning, is the "right of interpretation" of history also being reshaped? Will its "corrections" instead create biases? If the reasoning basis of the model is not transparent, its "historical reconstruction" may incorporate machine biases.

Humphries wrote at the end of the article, "If AI can read history like a human, it will also make mistakes like a human. We must learn to co-read and co-think with it rather than rely on it completely."

From a technical perspective, this achievement is also shocking. Handwritten text recognition and symbolic reasoning were once the two most difficult problems in AI research: one belongs to the visual field, and the other belongs to cognitive logic. Now, they have been simultaneously overcome in the same model, which may mean the dawn of general intelligence.

The experiments on Gemini also show another trend: general large models are gradually surpassing dedicated systems. In the past few decades, AI researchers tended to design specialized architectures for specific tasks (such as OCR and speech recognition). However, today's multi-modal LLMs can achieve higher accuracy with less training and stronger generalization. This indicates that AI research is shifting from "specialization" to "unification."

Humphries summarized it in one sentence: "Handwritten text recognition may just be an excuse. What we are really witnessing is the moment when AI starts to understand the world." Perhaps, as he said, "It took humans eighty years to make machines read handwriting, and now machines are starting to understand humans."

This article is from "Tencent Technology," author: Wuji. Republished by 36Kr with permission.