With a new standard for measuring AI capabilities, Lexin Zhou, a young researcher born in the 2000s, has published his second Nature paper in two years.
Nature recently published a research paper titled "General scales unlock AI evaluation with explanatory and predictive power". The team comprises 26 scholars and engineers from institutions including Princeton University, the University of Cambridge, Microsoft Research, OpenAI, DeepSeek, Meta, and the Polytechnic University of Valencia.
The first and corresponding author, Lexin Zhou, lists four affiliations: Princeton University, the University of Cambridge, Microsoft Research Asia, and the Polytechnic University of Valencia. The corresponding authors also include Xing Xie of Microsoft Research Asia and José Hernández-Orallo of the University of Cambridge.
This is one of the largest-scale and most systematic studies of AI evaluation methodology in recent years.
Two Nature papers in two years, with a first author born after 2000
This is the second time Lexin Zhou has published in Nature within two years.
In September 2024, at age 23, Lexin Zhou published his first Nature paper as first author: "Larger and more instructable language models become less reliable".
This paper put forward a conclusion that shocked the AI community at the time: larger and newer AI models are actually less reliable. He and his team analyzed several mainstream model families, including GPT, LLaMA, and BLOOM, and found that as models scaled up and received more human-feedback training, they became more likely to give wrong answers to users' questions. Strikingly, newer-generation models (such as GPT-4) no longer "know what they don't know": where older models would decline to answer questions beyond their capabilities, newer ones confidently give wrong answers instead. The researchers called this phenomenon "overconfidence".
The paper sparked heated discussion as soon as it was published; on Reddit alone, the conversation drew more than 200,000 participants.
What does the newly published paper say?
Less than a year after that paper, Lexin Zhou is back with his second Nature paper. This time he not only points out problems but proposes a complete solution.
The paper, titled "General scales unlock AI evaluation with explanatory and predictive power", opens by identifying a fundamental problem: the prevailing AI evaluation method, having an AI solve problems and then scoring the results, cannot actually explain what abilities the AI really has.
For example, if you see an AI getting 90 points in a math test, what can this number tell you? Nothing.
You can't infer from it whether the AI can solve another math problem, let alone predict whether it can handle other tasks such as reading comprehension, code writing, or image analysis. The reason is simple: a score is just a score. It is the product of multiple factors, including ability, test difficulty, and question type, and those factors cannot be disentangled from the number alone.
This is why many people say that "AI evaluation is a black box": you don't know why an AI is right or why it is wrong.
Lexin Zhou's team's solution is to label every question and every AI, establishing a unified "measurement standard".
Specifically, they designed a "general scale" with 18 dimensions. These 18 "rulers" can be roughly divided into three categories:
Elemental ability scales (11): basic capabilities such as attention scanning, content expression, concept learning and abstraction, logical reasoning, metacognition (knowing whether one can do something), and mind modeling.
Knowledge scales (5): knowledge of fields including common sense, natural science, applied science, formal science, and social science.
Auxiliary difficulty scales (2): how "non-mainstream" the question is (the more unusual, the harder) and how long it is.
For example, under their method a math question is labeled with how much logical reasoning it requires, what field of knowledge it needs, whether it is "non-mainstream", and how long it is. The AI model is then labeled along the same dimensions to form an "ability portrait": for instance, a given model might have a logical reasoning level of 4.5 and a knowledge level of 3.8. Comparing the two predicts whether the AI can solve the question.
The core idea of this method is not only to score the AI's abilities but also to label the difficulty of each test question and then compare the two under the same set of standards.
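The labeling-and-comparison idea can be sketched in a few lines of Python. This is a toy illustration, not the paper's actual predictor: the dimension names, the minimum-margin rule, and the logistic slope below are all assumptions made for the example.

```python
import math

# Hypothetical dimension names; the actual ADeLe suite defines 18 scales.
DIMENSIONS = ["logical_reasoning", "domain_knowledge"]

def predict_success(ability: dict, demand: dict) -> float:
    """Toy version of the ability-vs-demand idea: the probability that a
    model solves a task rises with the gap between its ability level and
    the task's demand level (both on a 0-5 scale per dimension)."""
    # Treat the smallest margin across dimensions as the binding constraint.
    margin = min(ability[d] - demand[d] for d in DIMENSIONS)
    # A logistic link turns the margin into a probability (slope is illustrative).
    return 1.0 / (1.0 + math.exp(-2.0 * margin))

model = {"logical_reasoning": 4.5, "domain_knowledge": 3.8}
easy_task = {"logical_reasoning": 3.0, "domain_knowledge": 3.0}
hard_task = {"logical_reasoning": 5.0, "domain_knowledge": 4.0}

print(predict_success(model, easy_task))  # well above 0.5
print(predict_success(model, hard_task))  # well below 0.5
```

The point of the sketch is the comparison itself: once questions and models are measured on the same scales, prediction reduces to comparing two profiles.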
The researchers ran large-scale experiments with 15 mainstream AI models and 20 benchmarks (spanning math, reading comprehension, science, language, and other fields), analyzing more than 16,000 questions and nearly 300,000 labeled data points. The results are striking:
In-distribution prediction (test questions drawn from the same sources as the training questions): the scale-based predictor achieved an AUROC of 0.84 (a measure of how well it separates successes from failures) with a calibration error of only 0.01. In other words, its predictions of whether an AI will answer a question correctly are not only accurate, the probability estimates themselves are highly reliable.
Out-of-task-distribution prediction (predicting an AI's performance on a brand-new task): accuracy dipped only slightly, to 0.81, still far better than other methods.
Out-of-benchmark-distribution prediction (predicting an AI's performance on a benchmark it has never seen): accuracy remained at 0.75.
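Both reported metrics are standard and can be computed from scratch. A minimal sketch (the labels and scores below are made up purely for illustration):

```python
def auroc(labels, scores):
    """AUROC: the probability that a randomly chosen success (label 1)
    receives a higher predicted score than a randomly chosen failure
    (label 0); ties count as half a win."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def expected_calibration_error(labels, scores, n_bins=10):
    """ECE: the weighted average gap between predicted probability and
    observed success rate, over equal-width probability bins."""
    total, err = len(labels), 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        in_bin = [(y, s) for y, s in zip(labels, scores)
                  if lo <= s < hi or (b == n_bins - 1 and s == 1.0)]
        if not in_bin:
            continue
        avg_conf = sum(s for _, s in in_bin) / len(in_bin)
        avg_acc = sum(y for y, _ in in_bin) / len(in_bin)
        err += len(in_bin) / total * abs(avg_conf - avg_acc)
    return err

# Toy data: 1 = question answered correctly, scores = predicted probabilities.
labels = [1, 1, 1, 0, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.6, 0.65, 0.2]
print(auroc(labels, scores))  # 0.9375
print(expected_calibration_error(labels, scores))
```

An AUROC of 0.5 means the predictor is no better than coin-flipping and 1.0 means perfect separation, which is why 0.84 in-distribution (and 0.75 on entirely unseen benchmarks) is a strong result.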
By contrast, predictors based on text embeddings (such as GloVe) or on directly fine-tuning language models performed significantly worse on these tasks, especially out of distribution. This suggests the new method generalizes better rather than simply memorizing patterns in the training data.
Figure: the pipeline for explaining and predicting the performance of new AI systems and benchmarks. Top (system pipeline): run a new AI system on the ADeLe suite, plot its characteristic curves along each dimension, extract its ability portrait, and optionally train a simple evaluator. Bottom (task pipeline): apply the DeLeAn rules to a new task with a standard large model, generate the demand histogram and profile, and use the evaluator to predict the system's performance on the task.
What else was discovered?
In addition to proposing the evaluation method, the paper also revealed some unexpected conclusions.
First, many benchmark tests are "cheating". The researchers analyzed 20 mainstream AI benchmarks and found that most did not actually measure what they claimed to measure. For example, a math test claimed to test "mathematical reasoning ability", but in fact demanded little reasoning and a great deal of domain-specific knowledge. In other words, these tests may only measure whether an AI can solve particular questions, not whether it has real abilities. More seriously, many tests suffer from "contamination": an AI may have seen similar questions during training, inflating its scores.
Second, bigger is not necessarily better. The researchers found a diminishing-marginal-returns effect in large-model scaling. Compared with the 2024 paper's conclusion that larger models are less reliable, Lexin Zhou refined the claim: the larger the model, the smaller the gains, and the training method may matter more than scale. Once a model's parameter count is already very large (e.g., beyond 7 billion parameters), further scaling yields smaller and smaller ability improvements. More importantly, models using chain-of-thought techniques (such as OpenAI o1 and DeepSeek-R1, which show their reasoning before answering) gain far more in logical reasoning than models that simply add parameters.
Why is this paper important?
This paper addresses a problem that everyone knows but no one can solve: how can we really "see" an AI's abilities? This problem is directly related to whether AI can enter real - world application scenarios safely and reliably.
Current industry practice is to pick a benchmark (say, a math question bank), have the AI take the test, get a score, and announce that "our model wins again". But this style of evaluation has three fatal problems:
First, it can't explain why an AI fails. A score can't tell you which abilities an AI lacks.
Second, different tests can't be compared. Is a score of 90 in math the same as a score of 90 in reading comprehension?
Third, it can't predict the performance on new tasks. You know an AI can solve math problems, but do you know if it can write code?
The method proposed by Lexin Zhou's team acts as a "ruler" for AI abilities and substantially addresses all three problems. The researchers even used it to uncover the diminishing-marginal-returns effect in large-model scaling.
This method can not only be used to evaluate AI more scientifically but also play a role in actual deployment: enterprises can judge in advance whether an AI is suitable for a certain task, and security departments can predict where an AI may "fail".
What makes this paper remarkable?
This is not a casual "AI leaderboard-chasing" study.
First, it solves a real and significant problem. The dilemma of AI evaluation is not just theoretical. The credibility and interpretability of AI are issues of concern across the industry. Governments, enterprises, and regulatory agencies around the world are asking: How can we know whether an AI system can be trusted? This paper provides a possible answer framework.
Second, it provides actionable tools. The paper offers not just concepts but tangible resources: a detailed 18-dimension scoring standard (DeLeAn), a database of 16,000 labeled questions (ADeLe), open-source code, and a platform. These resources are now open source, and other teams can use them directly after reading the paper. The code and data are available at: https://github.com/Kinds-of-Intelligence-CFI/ADELE
Meanwhile, its empirical results are convincing: agreement between human and AI labeling reaches 0.86, and the prediction model far outperforms baselines on new test sets. The paper is not without limitations, though. Are the 18 dimensions complete? Does GPT-4o as a "scorer" introduce systematic biases? How should the scale be extended once future AI exceeds its current ceiling (5+)? The authors discuss these issues frankly in the paper and provide an open platform for the community to iterate on together.
Lexin Zhou, the first author. Image source: Lexin Zhou's personal webpage
Lexin Zhou, the first and corresponding author, is currently a doctoral student in the Department of Computer Science at Princeton University. He is supervised by Professor Peter Henderson and collaborates closely with Professor Tom Griffiths, an expert in cognitive science. His research interests span computer science and cognitive science. He has interned at several top-tier institutions, including Microsoft Research Asia, OpenAI, Meta AI, and the European Commission, experiences that have given him an understanding of both the academic frontier and the practical needs of industry and policymakers.
In an era of rapid AI iteration, this is the first time anyone has systematically, at scale, and reproducibly turned AI evaluation from a "competitive sport" into "standard measurement". In the past, reading a leaderboard was like reading Olympic results: it told you who was faster, not why. Now we finally have something like a standardized physical examination report.
For users, this means that in the future, when you see an evaluation report of an AI product, it may no longer be "an overall score of 92.3", but a clear portrait:
"This model has a logical reasoning ability equivalent to a demand level of 4.1 and is suitable for analyzing legal documents of medium complexity; its open - domain knowledge ability level is 3.8, and it is not recommended for high - precision medical diagnosis."
Isn't this the first step towards the "trustworthy AI" that we've always wanted?
Paper information
Article title: General scales unlock AI evaluation with explanatory and predictive power
Published journal: Nature
Publication date: April 1, 2025
This article is from the WeChat official account "Guokr Hard Technology". Editor: Ou Wu. Republished by 36Kr with permission.