It's 2025, yet AI still can't read a clock. About 90% of people answer correctly, while top AI models all fail.
Humans average 89.1% accuracy, while the best AI result is only 13.3%. On the new visual benchmark ClockBench, the "elementary-school problem" of reading an analog clock stumped 11 large models. Why can't AI read a clock accurately? Is the test flawed, or is AI really not up to the task?
90% of people can solve the clock-reading problem, but top-tier AI models all failed!
Alek Safar, the creator of the AI benchmark and a serial entrepreneur, launched the visual benchmark test ClockBench, which focuses on testing AI's ability to "understand" analog clocks.
The results are astonishing:
The average accuracy rate of humans is 89.1%, while the best result among the 11 mainstream large models participating in the test is only 13.3%.
In difficulty, it is comparable to ARC-AGI-2, the so-called "ultimate AGI test", and harder than Humanity's Last Exam.
ClockBench contains a total of 180 clocks and 720 questions, and it exposes the limitations of current cutting-edge large language models (LLMs).
Paper link: https://clockbench.ai/ClockBench.pdf
Although these models have shown amazing reasoning, mathematical, and visual understanding abilities on multiple benchmarks, these abilities have not been effectively transferred to "reading the clock". Possible reasons:
The training data does not cover enough clock-face and time combinations for memorization, so the model has to reason out the mapping between the hands, the dial markings, and the reading.
The visual structure of the clock is difficult to fully map to the text space, resulting in limited text - based reasoning.
There is also good news: the best-performing model shows some (albeit limited) visual reasoning ability. Its clock-reading accuracy and median error are significantly better than chance.
More research is needed to determine whether these abilities can be obtained by expanding the existing paradigms (data, model scale, computing/reasoning budget) or if a completely new approach is necessary.
How does ClockBench test AI?
In the past few years, large language models (LLMs) have made significant progress in multiple fields, and cutting-edge models have quickly reached "saturation" on many popular benchmarks.
Even the latest benchmarks specifically designed to test both "expert knowledge and strong reasoning ability" have seen rapid breakthroughs.
A typical example is Humanity’s Last Exam:
On this benchmark, OpenAI's GPT-4o scored only 2.7%, while xAI's Grok 4 reached 25.4%;
With optimizations such as tool use, results can even reach the 40–50% range.
However, we still find that AI performs poorly on some tasks that are easy for humans.
Therefore, benchmarks such as SimpleBench and ARC-AGI have emerged, which are specifically designed to be easy for ordinary people but difficult for LLMs.
ClockBench was designed around this same idea of "easy for humans, difficult for AI".
The research team started from a key observation: reading an analog clock is equally difficult for reasoning and non-reasoning models.
Therefore, ClockBench constructs a robust dataset that requires high visual accuracy and reasoning ability.
What exactly does ClockBench contain?
- 36 newly designed custom clock faces, with 5 sample clocks generated for each face
- A total of 180 clocks, with 4 questions set for each clock, resulting in 720 test questions
- 11 models with visual understanding ability from 6 laboratories were tested, and 5 human participants were recruited for comparison
The questions are divided into 4 major categories:
1. Determine if the time is valid
There is a clock 🕰️, and the large model needs to determine if the time shown on the clock is valid.
If the time is valid, the large model must break it down into parts and output them in JSON format:
Hours, minutes, seconds, date, month, day of the week
Whenever the clock face displays any of this information, the LLM is required to output it.
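For illustration, a hypothetical answer in this format might look like the following sketch (the field names and values here are assumptions for illustration; the paper defines the exact schema):

```python
import json

# A hypothetical valid-clock answer. Field names and values are illustrative,
# not the benchmark's actual schema.
answer = {
    "valid": True,
    "hours": 10,
    "minutes": 9,
    "seconds": 37,
    "date": 14,
    "month": 6,
    "day_of_week": "Saturday",
}
print(json.dumps(answer))
```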
2. Addition and subtraction of time
This task requires the LLM to perform addition and subtraction on the given time to get a new time.
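The arithmetic itself is trivial to express in code, which underlines that the hard part for the models is reading the dial, not the math. A minimal sketch with illustrative values (not taken from the benchmark):

```python
from datetime import datetime, timedelta

# Start from a reading of 10:09:37 and add 2 hours 53 minutes
# (illustrative values, not from the benchmark).
start = datetime(2025, 1, 1, 10, 9, 37)
shifted = start + timedelta(hours=2, minutes=53)
print(shifted.strftime("%H:%M:%S"))  # 13:02:37
```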
3. Rotate the clock hands
This task operates on the clock hands: the model must select the hour, minute, or second hand and rotate it clockwise or counterclockwise by a specified angle.
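Converting a rotation angle back into a time only requires each hand's angular speed; the minute hand, for instance, sweeps 6° per minute. A minimal sketch (the helper function and values are illustrative, not the benchmark's implementation):

```python
def rotate_minute_hand(hour, minute, degrees):
    """Return (hour, minute) after rotating the minute hand clockwise by
    `degrees` (negative degrees rotate counterclockwise).

    The minute hand sweeps 360 degrees per 60 minutes, i.e. 6 degrees/minute.
    """
    total = hour * 60 + minute + degrees / 6  # total minutes after rotation
    total %= 12 * 60                          # wrap around the 12-hour dial
    return int(total // 60), int(total % 60)

# Rotating the minute hand clockwise by 90 degrees advances time by 15 minutes.
print(rotate_minute_hand(10, 9, 90))
```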
4. Time zone conversion
This task is about times in different places 🌍. For example, given the daylight saving time in New York, the model must calculate the local time elsewhere.
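Here too the underlying computation is routine once the time has been read, as in this sketch using Python's zoneinfo (the cities and date are assumptions, not from the benchmark):

```python
from datetime import datetime
from zoneinfo import ZoneInfo

# Convert 09:30 New York daylight time (EDT, UTC-4 on July 1) to Tokyo
# local time (JST, UTC+9). Cities and date are illustrative.
ny = datetime(2025, 7, 1, 9, 30, tzinfo=ZoneInfo("America/New_York"))
tokyo = ny.astimezone(ZoneInfo("Asia/Tokyo"))
print(tokyo.strftime("%H:%M"))  # 22:30
```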
The results are unexpected
What surprising findings emerged from the results?
- There is not only a huge accuracy gap between the models and humans, but also completely different error patterns:
- The median human error is only 3 minutes, while the best model's is as high as 1 hour
- Weaker models err by about 3 hours; given the 12-hour cycle of the dial, that is equivalent to random noise
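That claim is easy to sanity-check: a uniformly random guess on a 12-hour dial lies a circular distance of at most 6 hours from the truth, uniformly distributed, so its median error is about 3 hours. A quick simulation (a sketch of this argument, not from the paper):

```python
import random

# Simulate pure guessing on a 12-hour dial: the circular (wrap-around)
# distance between a uniform guess and the true time is uniform on
# [0, 6] hours, so the median error of guessing is ~3 hours.
random.seed(0)
errors = []
for _ in range(100_000):
    truth = random.uniform(0, 720)  # true time, in minutes on a 12-hour cycle
    guess = random.uniform(0, 720)  # uniformly random guess
    diff = abs(truth - guess)
    errors.append(min(diff, 720 - diff))  # circular distance
errors.sort()
median_hours = errors[len(errors) // 2] / 60
print(round(median_hours, 2))  # close to 3.0
```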
Another interesting finding is that there are significant differences in the difficulty of reading certain clock features:
- The models perform worst when reading uncommon and complex clocks and in scenarios with high - precision requirements
- The orientation of Roman numerals and circled numerals is the hardest to recognize, followed by the second hand, cluttered backgrounds, and mirrored clocks
Aside from reading the clock itself, the other question types are actually easier for the models:
- The best - performing model can answer questions about time addition and subtraction, hand rotation angles, or time zone conversion with high precision, and the accuracy rate can reach 100% in some scenarios
In the comparison of the performance of different models, the general trend is: Larger - scale reasoning models are generally better than smaller - scale or non - reasoning models.
However, there are also some noteworthy phenomena:
- Google's Gemini 2.5 series models often lead other models in their respective categories;
- Anthropic series models generally lag behind similar models;
- Grok 4's performance is far below expectations, which is not commensurate with its scale and general ability.
GPT-5 ranks third, and its reasoning budget has little impact on the results (medium and high budgets score nearly the same). It is worth asking what limits GPT-5's performance in such visual reasoning tasks.
In the original dataset, 37 out of 180 clocks represent invalid (impossible) times. Both humans and models have a higher success rate in identifying "invalid times":
- The difference among humans is not significant: The accuracy rate on invalid clocks is 96.2%, while on valid clocks it is 89.1%;
- The difference among models is obvious: The average accuracy rate on invalid clocks is 349% higher, and all models perform better in this type of task;
- Gemini 2.5 Pro is still the overall best model, with an accuracy rate of 40.5%;
- Grok 4 is an outlier: It has the highest accuracy rate in identifying invalid clocks, reaching 64.9%, but the problem is that it marks 63.3% of the clocks in the entire dataset as invalid, which means the result is likely to be a "random guess".
On the clock faces the models can read correctly, there is obvious overlap across models:
- 61.7% of the clocks were not correctly read by any model;
- 38.3% of the clocks were correctly read by at least 1 model;
- 22.8% of the clocks were correctly read by at least 2 models;
- 13.9% of the clocks were correctly read by at least 3 models;
- 8.9% of the clocks were correctly read by at least 4 models.
Overall, the distribution and validity data show that the correct answers of the models are concentrated on a small number of clocks rather than being evenly distributed.
References
https://x.com/alek_safar/status/1964383077792141390
https://clockbench.ai/
This article is from the WeChat public account "New Intelligence Yuan", author: KingHZ. Republished by 36Kr with permission.