It's 2025, yet AI still can't read a clock. About 90% of people answer correctly, while top AI models all fail.
Humans average 89.1% accuracy, while the best AI result is only 13.3%. On the new visual benchmark ClockBench, the "elementary-school problem" of reading an analog clock stumped 11 large models. Why can't AI read a clock accurately? Is the test flawed, or is AI really not up to the task?
90% of people can solve the clock-reading problem, but top-tier AI models all failed!
Alek Safar, the creator of the AI benchmark and a serial entrepreneur, launched the visual benchmark test ClockBench, which focuses on testing AI's ability to "understand" analog clocks.
The results are astonishing:
The average accuracy rate of humans is 89.1%, while the best result among the 11 mainstream large models participating in the test is only 13.3%.
In difficulty, it is comparable to ARC-AGI-2, the so-called "ultimate AGI test", and harder than Humanity's Last Exam.
ClockBench contains a total of 180 clocks and 720 questions, and it exposes the limitations of current cutting-edge large language models (LLMs).
Paper link: https://clockbench.ai/ClockBench.pdf
Although these models have shown amazing reasoning, mathematical, and visual understanding abilities on multiple benchmarks, these abilities have not been effectively transferred to "reading the clock". Possible reasons:
The training data does not cover enough clock-face and time combinations for memorization, so the model has to reason out the mapping between the hands, the dial markings, and the reading.
The visual structure of the clock is difficult to fully map to the text space, resulting in limited text - based reasoning.
There is also good news: the best-performing model shows some (albeit limited) visual reasoning ability. Its clock-reading accuracy and median error are significantly better than chance.
More research is needed to determine whether these abilities can be obtained by expanding the existing paradigms (data, model scale, computing/reasoning budget) or if a completely new approach is necessary.
How does ClockBench test AI?
In the past few years, large language models (LLMs) have made significant progress in multiple fields, and cutting-edge models have quickly reached "saturation" on many popular benchmarks.
Even the latest benchmarks specifically designed to test both "expert knowledge and strong reasoning ability" have seen rapid breakthroughs.
A typical example is Humanity’s Last Exam:
On this benchmark, OpenAI's GPT-4o scored only 2.7%, while xAI's Grok 4 reached 25.4%;
With optimizations such as tool use, results can even reach the 40–50% range.
However, we still find that AI performs poorly on some tasks that are easy for humans.
Therefore, benchmarks such as SimpleBench and ARC-AGI have emerged, which are specifically designed to be easy for ordinary people but difficult for LLMs.
ClockBench was designed around this same idea of "easy for humans, difficult for AI".
The research team started from a key observation: reading an analog clock is equally difficult for reasoning and non-reasoning models.
Therefore, ClockBench constructs a robust dataset that requires high visual accuracy and reasoning ability.
What exactly does ClockBench contain?
- 36 newly designed custom clock faces, with 5 sample clocks generated for each face
- A total of 180 clocks, with 4 questions set for each clock, resulting in 720 test questions
- 11 models with visual understanding ability from 6 laboratories were tested, and 5 human participants were recruited for comparison
The questions are divided into 4 major categories:
1. Determine if the time is valid
There is a clock 🕰️, and the large model needs to determine if the time shown on the clock is valid.
If the time is valid, the large model must break it down into parts and output them in JSON format:
Hours, minutes, seconds, date, month, day of the week
Whenever the clock face displays any of this information, the LLM is required to output it.
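For illustration, a hypothetical answer in this format might look like the following sketch (the field names and values here are assumptions for illustration; the paper defines the exact schema):

```python
import json

# A hypothetical valid-clock answer. Field names and values are illustrative,
# not the benchmark's actual schema.
answer = {
    "valid": True,
    "hours": 10,
    "minutes": 9,
    "seconds": 37,
    "date": 14,
    "month": 6,
    "day_of_week": "Saturday",
}
print(json.dumps(answer))
```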
2. Addition and subtraction of time
This task requires the LLM to perform addition and subtraction on the given time to get a new time.
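The arithmetic itself is trivial to express in code, which underlines that the hard part for the models is reading the dial, not the math. A minimal sketch with illustrative values (not taken from the benchmark):

```python
from datetime import datetime, timedelta

# Start from a reading of 10:09:37 and add 2 hours 53 minutes
# (illustrative values, not from the benchmark).
start = datetime(2025, 1, 1, 10, 9, 37)
shifted = start + timedelta(hours=2, minutes=53)
print(shifted.strftime("%H:%M:%S"))  # 13:02:37
```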
3. Rotate the clock hands
This task operates on the clock hands: the model must select the hour, minute, or second hand and rotate it clockwise or counterclockwise by a specified angle.
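Converting a rotation angle back into a time only requires each hand's angular speed; the minute hand, for instance, sweeps 6° per minute. A minimal sketch (the helper function and values are illustrative, not the benchmark's implementation):

```python
def rotate_minute_hand(hour, minute, degrees):
    """Return (hour, minute) after rotating the minute hand clockwise by
    `degrees` (negative degrees rotate counterclockwise).

    The minute hand sweeps 360 degrees per 60 minutes, i.e. 6 degrees/minute.
    """
    total = hour * 60 + minute + degrees / 6  # total minutes after rotation
    total %= 12 * 60                          # wrap around the 12-hour dial
    return int(total // 60), int(total % 60)

# Rotating the minute hand clockwise by 90 degrees advances time by 15 minutes.
print(rotate_minute_hand(10, 9, 90))
```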
4. Time zone conversion
This task is about times in different places 🌍. For example, given the daylight saving time in New York, the model must calculate the local time elsewhere.
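Here too the underlying computation is routine once the time has been read, as in this sketch using Python's zoneinfo (the cities and date are assumptions, not from the benchmark):

```python
from datetime import datetime
from zoneinfo import ZoneInfo

# Convert 09:30 New York daylight time (EDT, UTC-4 on July 1) to Tokyo
# local time (JST, UTC+9). Cities and date are illustrative.
ny = datetime(2025, 7, 1, 9, 30, tzinfo=ZoneInfo("America/New_York"))
tokyo = ny.astimezone(ZoneInfo("Asia/Tokyo"))
print(tokyo.strftime("%H:%M"))  # 22:30
```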
The results are unexpected
What surprising findings emerged from the results?
- There is not only a huge accuracy gap between the models and humans, but also completely different error patterns:
- The median human error is only 3 minutes, while the best model's is as high as 1 hour
- Weaker models err by about 3 hours; given the 12-hour cycle of the dial, that is equivalent to random noise
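That claim is easy to sanity-check: a uniformly random guess on a 12-hour dial lies a circular distance of at most 6 hours from the truth, uniformly distributed, so its median error is about 3 hours. A quick simulation (a sketch of this argument, not from the paper):

```python
import random

# Simulate pure guessing on a 12-hour dial: the circular (wrap-around)
# distance between a uniform guess and the true time is uniform on
# [0, 6] hours, so the median error of guessing is ~3 hours.
random.seed(0)
errors = []
for _ in range(100_000):
    truth = random.uniform(0, 720)  # true time, in minutes on a 12-hour cycle
    guess = random.uniform(0, 720)  # uniformly random guess
    diff = abs(truth - guess)
    errors.append(min(diff, 720 - diff))  # circular distance
errors.sort()
median_hours = errors[len(errors) // 2] / 60
print(round(median_hours, 2))  # close to 3.0
```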
Another interesting finding is that there are significant differences in the difficulty of reading certain clock features:
- The models perform worst when reading uncommon and complex clocks and in scenarios with high - precision requirements
- The orientation of Roman numerals and circled numerals is the hardest to recognize, followed by the second hand, cluttered backgrounds, and mirrored clocks
Aside from reading the clock itself, the other question types are actually easier for the models:
- The best - performing model can answer questions about time addition and subtraction, hand rotation angles, or time zone conversion with high precision, and the accuracy rate can reach 100% in some scenarios
In the comparison of the performance of different models, the general trend is: Larger - scale reasoning models are generally better than smaller - scale or non - reasoning models.
However, there are also some noteworthy phenomena:
- Google's Gemini 2.5 series models often lead other models in their respective categories;
- Anthropic series models generally lag behind similar models;
- Grok 4's performance is far below expectations, which is not commensurate with its scale and general ability.
GPT-5 ranks third, and its reasoning budget has little impact on the results (medium and high budgets score nearly the same). It is worth asking what limits GPT-5's performance in such visual reasoning tasks.
In the original dataset, 37 out of 180 clocks represent invalid (impossible) times. Both humans and models have a higher success rate in identifying "invalid times":
- The difference among humans is not significant: The accuracy rate on invalid clocks is 96.2%, while on valid clocks it is 89.1%;
- The difference among models is obvious: The average accuracy rate on invalid clocks is 349% higher, and all models perform better in this type of task;
- Gemini 2.5 Pro is still the overall best model, with an accuracy rate of 40.5%;
- Grok 4 is an outlier: It has the highest accuracy rate in identifying invalid clocks, reaching 64.9%, but the problem is that it marks 63.3% of the clocks in the entire dataset as invalid, which means the result is likely to be a "random guess".
On the clock faces the models can read correctly, there is obvious overlap across models:
- 61.7% of the clocks were not correctly read by any model;
- 38.3% of the clocks were correctly read by at least 1 model;
- 22.8% of the clocks were correctly read by at least 2 models;
- 13.9% of the clocks were correctly read by at least 3 models;
- 8.9% of the clocks were correctly read by at least 4 models.
Overall, the distribution and validity data show that the correct answers of the models are concentrated on a small number of clocks rather than being evenly distributed.
References
https://x.com/alek_safar/status/1964383077792141390
https://clockbench.ai/
This article is from the WeChat public account "New Intelligence Yuan", author: KingHZ. Republished by 36Kr with permission.