
The latest global AI IQ rankings have been announced. Fortunately, no one has surpassed Einstein.

直面AI · 2025-08-19 13:20
Who is a real genius and who is just pretending to work hard?

If one day AI had to sit down and take a standard IQ test just like humans do, what would the result be? It sounds like a plot from a science-fiction novel, but a fun project called "Trackingai.org" has turned it into reality.

This project doesn't rely on the technical jargon and benchmark scores that dazzle ordinary readers. Instead, it built a set of test papers modeled on human IQ tests and put the world's top large language models through a direct, pure "IQ" showdown.

The appeal of this showdown goes well beyond a simple comparison of technical performance. It is more like a "Most Powerful Brain" challenge for the AI world, an attempt to measure how "smart" these digital brains are in the terms we know best.

There are two test tracks. The first is the Mensa IQ test, the most widely recognized in the world: score above roughly 130 and you qualify for Mensa, the club of the global high-IQ elite. The second is a question-and-answer test set designed specifically to probe model performance.

In this challenge, the newly released GPT-5 Pro, Google's painstakingly developed Gemini 2.5 Pro, and Elon Musk's famously opinionated Grok 4 staged an exciting intellectual contest together. Meanwhile, some former champions and unexpected "dark horses" also left their marks on the list, and their performances carry stories and lessons of their own. This is not just a game of numbers and rankings; it is a unique window for observing how AI's cognitive abilities are evolving and how its thinking resembles, and differs from, our own.

01

The IQ Show of the "Big Three"

In this highly anticipated AI IQ test, three "candidates" are undoubtedly the center of attention: OpenAI's GPT-5 Pro, Google's Gemini 2.5 Pro, and xAI's Grok 4. These three models represent the highest level of today's closed-source commercial large models, and every one of their updates and releases sets the entire tech community on edge. So when they meet on the stage of the same IQ test, everyone wants to know whose "brain" is the smartest.

Let's look at the Mensa group first. The highest-ranked is Google's Gemini 2.5 Pro, with an IQ of 137.

As mentioned earlier, in the human IQ system a score above 130 is classed as "extremely superior", which is what we usually call genius, while a score above 140 marks the best among geniuses. Einstein's IQ has been retrospectively estimated at around 160.

This score suggests that Gemini 2.5 Pro's handling of complex tasks such as logical reasoning, abstract thinking, and pattern recognition is already comparable to that of the small top tier of human society. It is no longer just a program that imitates and repeats; it shows a degree of problem-solving ability approaching high-order human intelligence.
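To put that 137 in perspective, here is a back-of-the-envelope sketch (my own illustration, not something published by Trackingai.org) of how rare such a score is among humans, assuming the common convention of a normal distribution with mean 100 and standard deviation 15:

```python
# Back-of-the-envelope rarity of an IQ score, assuming the usual normal
# model with mean 100 and standard deviation 15 (an assumption of this
# sketch, not a detail published by Trackingai.org).
import math

def fraction_above(iq, mean=100.0, sd=15.0):
    """Return P(score > iq) under a Normal(mean, sd) distribution."""
    z = (iq - mean) / sd
    return 0.5 * math.erfc(z / math.sqrt(2))

for score in (130, 137):
    p = fraction_above(score)
    print(f"IQ {score}: roughly the top {p:.2%}, about 1 in {round(1 / p)}")
```

Under that convention, 130 sits at roughly the top 2% of the population (the usual Mensa cutoff) and 137 at roughly the top 0.7%.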

Next comes OpenAI's o3. Oddly enough, o3 scores a higher IQ here than o3 Pro, even though o3 Pro is the stronger model on conventional performance measures. And ChatGPT-5, the newest member of the GPT family, manages an IQ of only 121.

The last protagonist is Grok 4, from Elon Musk's xAI. Since its release, Grok has been known for its distinctive style and unfiltered answers and is seen as an AI with a strong personality, so its IQ result naturally drew plenty of attention. The test puts Grok 4 at an IQ of 125. That is not as dazzling as the other two contestants' scores, but it is well above the human average and lands in the "superior" range.

Common sense says the newest large models should have the highest IQ. Here the ranking runs the other way: Gemini 2.5 Pro, the oldest model of the three, scores highest, followed by Grok 4, with ChatGPT-5, the newest, last. The likely reason is that their developers made different trade-offs in how the models handle this kind of question. Let's look at how they actually answer to see why their measured intelligence runs against expectations.

Take one question as an example. The Mensa test consists of a series of graphical reasoning items. In question 18, a 3x3 grid is shown with eight cells already filled by patterns made of different lines. The AI must find the rule and pick, from six options, the pattern that belongs in the ninth, lower-right cell. By the rule, the correct answer is option C.

GPT-5 Pro worked systematically through the pattern changes along each row and column and identified the logical progression between them. By analyzing how the existing patterns evolve, it inferred what pattern the empty cell would need in order to satisfy both the horizontal and the vertical rules, and from that grasp of both the whole and the details it pinpointed the option that completes the logical puzzle.

Gemini 2.5 Pro also answered correctly, but by a completely different route. It keenly spotted a clean "rotational symmetry" rule: the third row of the grid is simply the first row rotated 90 degrees clockwise. From this simple, elegant rule it deduced that the empty cell should be the first row's third pattern rotated 90 degrees, and so reached the correct answer. This shows powerful pattern recognition: it can uncover the internal logic of a problem from a different angle and find an equally effective solution along a different line of thinking.

Grok 4's process looks more exploratory. It first surveyed the possibilities across rows and columns, hunting for rules along several dimensions such as the type of line (horizontal, vertical, crossed) and the number of lines. After some analysis and elimination it, too, locked onto the heart of the problem: the figure has a 90-degree rotational symmetry. It stated explicitly that the third row is the first row rotated 90 degrees, rotated the first row's third-column pattern accordingly, and arrived at the correct answer, C. Its path was more circuitous, but the multi-angle probing still led to the right result, showing a less direct yet equally effective kind of logical reasoning.
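For readers who want to see the rule in concrete form, here is a minimal sketch (my own illustration; the actual Mensa item's graphics are not reproduced, and the horizontal/vertical/cross patterns below are hypothetical stand-ins) of the "third row equals the first row rotated 90 degrees clockwise" rule the models describe, with each cell encoded as a small boolean matrix:

```python
# A toy encoding of the "third row = first row rotated 90 degrees clockwise"
# rule. The real Mensa item's graphics aren't reproduced here; the patterns
# below are hypothetical stand-ins.

def rotate_cw(pattern):
    """Rotate a square boolean matrix 90 degrees clockwise."""
    n = len(pattern)
    return [[pattern[n - 1 - c][r] for c in range(n)] for r in range(n)]

horizontal = [[0, 0, 0],
              [1, 1, 1],
              [0, 0, 0]]
vertical = rotate_cw(horizontal)   # a horizontal bar rotates into a vertical bar
cross = [[0, 1, 0],
         [1, 1, 1],
         [0, 1, 0]]

first_row = [horizontal, vertical, cross]  # stand-ins for the grid's top row

# Under the rule, the missing lower-right cell is simply the rotation of the
# first row's third pattern.
missing_cell = rotate_cw(first_row[2])
print(missing_cell)  # the cross is rotation-symmetric, so it comes back unchanged
```

Once each cell is encoded this way, the symmetry rule collapses to a single rotation of the corresponding first-row cell, which is essentially the shortcut Gemini 2.5 Pro and Grok 4 describe.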

This small example shows that the IQ score is not just a cold number. Behind it lie differences in the paths the AIs take when they "think", in how rigorous their logic is, and in the results they reach. GPT-5 Pro shows strong abstract, systematic thinking; Gemini 2.5 Pro shows efficient pattern recognition; Grok 4 gets there through a more exploratory route. This IQ showcase of the "Big Three" sketches a clear gradient across today's top-tier AI intelligence.

In the dataset group the result shifts again, and this time the ranking matches common sense: GPT-5 Pro first, Gemini 2.5 Pro second, o3 Pro third, and Grok 4 fourth. Compared with the Mensa test, the dataset group is harder and contains far more questions.

02

"Regrets" and "Little Surprises"

Beyond the headline stars, the positions of the other models on this AI IQ list are just as thought-provoking, and their stories may better reveal some of the deeper trends and challenges in today's AI development. The most regrettable among them is Meta's Llama series.

The Llama series, especially its later versions, was once the banner of open-source large models. While giants such as OpenAI and Google pushed ahead on the closed-source path, Meta chose to open its powerful models to researchers and developers worldwide, greatly fueling the prosperity of the whole AI ecosystem. Llama was once seen as the hope of the open-source camp, able to go toe to toe with top closed-source models. Yet on this IQ ranking, Llama 4 Maverick scored only 98.

The number 98 is not low in itself; it sits very close to the human average of 100, which means Llama 4 Maverick already has problem-solving ability on par with an ordinary person. The problem is that its competitors scored 121, 125, and even 137. In a contest of top-tier contestants, merely reaching "average" is far from enough. The former open-source king now trails the top closed-source models by a large and obvious margin in a pure test of intellect.

Meta has already begun to act. A wave of recent reports says Meta is sparing no expense to recruit top AI researchers and engineers away from competitors such as Google and OpenAI with generous salaries and resources. This talent-poaching war is a crucial step in closing the gap and rebuilding Meta's strength, and Llama's future showing will depend heavily on how it turns out.

The list is not all disappointments, though; there are also "little surprises" that should not be underestimated. The DeepSeek R1 result dates from the end of May, meaning a relatively old version was tested, and even so its IQ score reached 102.

The number 102 is itself only slightly above average, but its significance depends on context. It beats the much-hyped Llama 4 Maverick, and, more importantly, a model tested in a somewhat dated version is already closing in on the freshly released top models built on the latest advances. The presence of this "dark horse" sends a very positive signal.

DeepSeek R1's persistence and results make a point: when it comes to raising AI "IQ", blindly chasing the newest data and ever-larger model sizes is not the only road. Model architecture, training methods, and algorithmic optimization also play a decisive role. A well-designed, efficiently trained architecture can do better at underlying logical reasoning and problem solving even without "absorbing" the very latest knowledge.

It is like a student: being smart depends not only on how many books one has read, but also on whether one has mastered efficient study methods and a clear framework for thinking. DeepSeek R1's performance points to another possibility: achieving a higher "IQ cost-performance ratio" through smarter algorithms and architectures. For research teams and open-source communities with limited resources, this is real encouragement, and it reminds the whole industry that while chasing scale and data, we should not overlook the more fundamental innovations in model design and training themselves.

03

Don't Take This Test Result Too Seriously

The greatest value of simulating human IQ tests is that it builds a bridge. For a long time, the benchmarks used to evaluate AI models, MMLU, HellaSwag, ARC, and the like, have mattered enormously in academia and industry yet read like a high wall to the general public; the abbreviations and the technical details behind them make it hard to grasp in what sense an AI is "smart". The concept of IQ, by contrast, has long been part of everyday understanding.

When we can say "this AI has an IQ of 137", its intelligence level immediately becomes concrete, perceptible, and comparable. This popularized measurement greatly lowers the threshold for the public to understand the abilities of AI and allows us to discuss and think about the development of artificial intelligence in a more intuitive way. It tells us that the "smartness" of AI is no longer just the result of programmers' code performance but is truly reflected in the ability to solve those puzzles and problems that require our brains.

That large models can now exceed an IQ of 130 means more than that AI keeps getting better at standardized test questions. At a deeper level it marks a qualitative leap in AI's cognitive ability: models are evolving from simple information retrieval and pattern matching toward complex logical reasoning, abstract thinking, and multi-step problem solving. They have come a long way down the road of imitating human intelligence and are even starting to exceed ordinary humans in some respects.

Trackingai.org itself notes on its website that IQ-testing large models is mostly for entertainment, because a large model's IQ cannot be fully equated with human IQ.