Large models "fight" the college entrance examination again: from meeting the undergraduate admission line to getting into top-tier (985) universities
In the past year, the field of large models has become nearly synonymous with "rapid advancement." The technology has iterated on a weekly basis, and its capabilities have expanded from writing poetry and painting to video generation and scientific discovery.
But beyond the grand narratives, where can we find a precise, objective yardstick for AI's capabilities?
Perhaps nothing resonates more directly with Chinese people than the National College Entrance Examination (Gaokao).
Last year, GeekPark put AI through a mock Gaokao. Continuing that tradition, we once again set up an "AI Gaokao" examination room this year and sent mainstream large models from China and abroad back into the exam hall.
The returning "AI examinees" not only overcame last year's lopsided subject performance but also achieved scores that would place them among the top 1,000 examinees in Shandong Province.
However, just when we thought it had "evolved," it would expose its true "IQ" in unexpected places.
Some key findings are as follows:
AI can crack top universities for the first time: this year, AI's overall capability showed, for the first time, the potential to get into China's most prestigious universities. Compared with 2024, every large model in the test achieved a significant leap in both liberal-arts and science scores. Because Shandong Province converts raw scores for admission, a direct comparison with published score bands is impossible; we estimate that Doubao, the top scorer in this Gaokao, would rank roughly 500th to 900th in the province, enough for the humanities and social-science programs of prestigious universities such as Renmin University, Fudan University, Shanghai Jiao Tong University, and Zhejiang University.
Large models are no longer badly lopsided, and science is catching up faster: each model's average liberal-arts total rose by 115.6 points, while its science total rose by 147.4 points on average. Science is growing faster, but from a much lower base: its average total of 181.75 points still trails the 228.33 points of liberal arts. Overall, this year's total scores are no longer severely "imbalanced."
Mathematical ability has improved dramatically, overtaking Chinese and English: mathematics is the subject with the most significant improvement this year, with the average score up 84.25 points from last year. AI now performs better in mathematics than in Chinese or English, suggesting that AI may come to excel at questions with strong internal logic and standardized solution paths.
Multimodal ability is the key differentiator: from last year to this year, the models' visual understanding improved markedly, especially in subjects heavy with image-based questions. Compared with last year, the average scores in physics and geography rose by about 20 points, and biology by 15 points. Chemistry remains slightly weaker overall: only Doubao passed, though the all-model average still rose by 12.6 points year over year. As an Easter egg, we also tried letting AI answer questions from a video stream this year.
01 From getting into a key university to aiming for top universities
If last year's AI was merely a good student who barely cleared the admission line of a key university, this year it has grown into a top student with a shot at China's best universities.
What kind of transformation has taken place behind this?
Before delving into the specific changes, let's first introduce the domestic and foreign "examinees" participating in this exam:
Doubao, DeepSeek (R1-0528), ChatGPT (o3), Yuanbao (Hunyuan t1), Kimi (k1.5), Wenxin Yiyan, Tongyi Qianwen.
To better reflect the ordinary user experience, the evaluation was conducted on the public PC version of each model; each paper was run twice and the two scores were averaged.
The goal is to examine the models' all-around capabilities, so this time the models were asked to read the questions directly as images and answer them. DeepSeek-R1 still cannot answer from images, so it was tested on text-only questions, and its final results are of limited reference value.
Other test details are as follows:
The 2025 new-Gaokao Shandong paper was chosen for two reasons. First, the Shandong paper is among the quickest to appear online, which keeps the evaluation timely. Second, its overall difficulty ranks near the top across provinces: its Chinese, mathematics, and English use National Paper I, while the remaining subjects are set by the province itself. Such a high-difficulty "ruler" better probes the ceiling of current large models' capabilities.
To ensure fairness and test general, built-in capability, web access was switched off wherever the product allows it, eliminating the possibility of "searching for answers." o3 and Wenxin cannot disable web access; however, inspecting their reasoning traces showed that Wenxin never searched online, while o3 searched a handful of times with no obvious gain (its scoring rate was actually lower than when answering offline). We also enabled deep-thinking mode by default but left research mode off, to simulate a user's standard, immediate Q&A interaction.
For non-multiple-choice questions, two students specializing in each subject graded independently. If their scores on a question differed by more than one-sixth of that question's full marks, a third grader was brought in to discuss and settle the score (mirroring the real Gaokao grading process), and high-school teachers who had graded real Gaokao papers spot-checked the results to standardize scoring on disputed questions.
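To make that threshold concrete, here is a minimal sketch of the adjudication rule in Python (the function names are ours, and the averaging of in-tolerance scores is an assumption borrowed from real Gaokao practice rather than something stated above):

```python
def needs_third_grader(score_a: float, score_b: float, full_marks: float) -> bool:
    """True when two independent scores differ by more than one-sixth of
    the question's full marks, the escalation threshold described above."""
    return abs(score_a - score_b) > full_marks / 6


def settle_score(score_a: float, score_b: float, full_marks: float) -> float:
    """Settle one question's score under the two-grader protocol.

    Assumption (ours, not stated in the article): scores within tolerance
    are averaged, as in real Gaokao grading; divergent scores must go to
    a third grader, which we signal here by raising an error.
    """
    if needs_third_grader(score_a, score_b, full_marks):
        raise ValueError("scores diverge by more than 1/6 of full marks: "
                         "escalate to a third grader")
    return (score_a + score_b) / 2


# Example: on a 12-point question, grades of 9 and 6 differ by 3 > 12/6 = 2,
# so a third grader is needed; grades of 9 and 8 settle at 8.5.
print(settle_score(9, 8, 12))  # -> 8.5
```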
We made two special arrangements during scoring: senior teachers were invited to grade the AI essays anonymously, to keep the evaluation objective and fair; and since the audio for the English listening section could not be obtained, every model was credited with full marks for that section.
Finally, the scores of each examinee are as follows:
In the past year, deep-thinking ability has brought obvious improvements to large models' capabilities.
Models no longer blurt out answers; they analyze and decompose problems step by step, check intermediate results, and even self-correct, which has markedly improved their performance on the mathematics and physics papers.
On the mathematics paper, out of a full score of 150, even the worst-performing model in this test scored 128.75 points, an excellent result even among human examinees.
Looking back at last year, the best-performing model scored only 70 points, short of even the passing line.
The improvement in mathematical ability has directly driven a significant increase in the overall Gaokao scores of large models this year.
Multimodal ability has become another key factor determining the performance differences of large models.
In last year's Gaokao test, many models lacked mature image recognition. GeekPark's method then was: models that could read images received a mix of image and text input, while models that could not received text only, supplemented with Markdown/LaTeX so they could parse the formulas.
This year, multimodal ability is standard in mainstream models, so for the first time we fed the questions purely as images (except to DeepSeek).
Among the contestants, the flagship versions of Doubao and ChatGPT are natively multimodal and showed a clear advantage on image-based questions.
Qwen3 and Wenxin X1, by contrast, are language models: faced with an image, they answer after running OCR on the text or calling out to a separate vision model, and their performance on image-based questions is correspondingly weaker.
Yet even for Doubao and ChatGPT, the top scorers on image-based questions, the scoring rate there is only about 70%, a significant gap from the roughly 90% achieved on text-only questions. Clearly, large models' multimodal understanding and reasoning still have plenty of room to improve.
It is safe to predict that as multimodal ability keeps improving, AI's Gaokao scores will climb again next year. Being outscored by AI will eventually become the norm for most humans.
Still, no AI achieved a perfect score. What held back the top models? The answer may be more interesting than you expect.
02 AI geniuses approaching full marks in mathematics all failed the same basic question
Across the whole AI Gaokao evaluation, after "repeating a year," the "AI examinees" made the most remarkable progress in mathematics.
In the 2024 evaluation, the AI examinees of the time performed poorly on fill-in-the-blank and free-response questions, with scores generally hovering between 0 and 2 points; the 9 participating models averaged only 47 points in mathematics.
This year, it is completely different.
Whether on objective multiple-choice questions or complex subjective free-response questions, the new generation's accuracy is far better than before, a clear sign that large models' capabilities, above all their core reasoning ability, have made a fundamental breakthrough.
If last year's models were "beginners" who could barely apply basic formulas such as derivatives and trigonometric identities, this year's have evolved into "problem-solving experts" who calmly handle complex derivations and proofs.
To some extent, such a result is expected: since AI entered the era of reasoning models, one landmark advance has been the marked improvement in mathematical and physical ability.
Once a model can reflect on and correct its own work, it is like a child who used to blurt out answers growing into an adult who thinks before speaking: its logical ability takes a qualitative leap.
Note that examinees widely considered the mathematics questions of the 2025 National Paper I extremely difficult, "like a competition paper": the final derivative and conic-section problems demanded obscure insights and heavy computation, and there were even reports of top students leaving the exam in tears.
However, in the face of such a high-difficulty paper, the top-level large models still performed with ease.
By comparison, progress in multimodal ability played only a secondary role here: the mathematics paper carries just 20 points of image-based questions, so they were not what drove this year's score jump. Most models, incidentally, scored 15 of those 20 points.
Why 15 points?
This is the interesting part. These models, each scoring above 130 overall and top math students by any human standard, all got the same multiple-choice question wrong.
What stumped them was not one of the final, hardest problems but a single-answer multiple-choice question, and not even a particularly difficult one.
The mathematics involved is very simple: a basic vector addition and subtraction problem. Connect the points (0, 2) and (2, 0) on the figure and you obtain the target vector, whose modulus is 2√2.
Even someone with little mathematics, just eyeballing the line segment in the figure, could estimate that its length cannot exceed 3.3.
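As a one-line sanity check of that claim (assuming, per the description above, the relevant endpoints read off the figure are (0, 2) and (2, 0)):

```latex
\vec{v} = (2,0) - (0,2) = (2,-2), \qquad
\lvert \vec{v} \rvert = \sqrt{2^2 + (-2)^2} = \sqrt{8} = 2\sqrt{2} \approx 2.83
```

which sits comfortably below the 3.3 bound that mere eyeballing already suggests.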
However, such a question stumped all the top-level AI in mathematics.
The crux is this: the question is not difficult, but the figure is.
To a large model, the visual information in this figure is chaotic in the extreme: dotted lines, solid lines, coordinate axes, numbers, and text are interleaved, with labels even overlapping key line segments in several places. This kind of visual "dirty data" is a nightmare for accurate machine reading.
Take Doubao, the best performer in mathematics this time: its problem-solving trace exposes the root cause. It misread the question's information from the very first step.
And if the question is misread at the start, then no matter how powerful the underlying mathematical reasoning, the result is water without a source, a tree without roots.
03 AI essay writing: good at citing examples, weak at reflective elevation
For models whose very name is "large language model," Chinese and English have always been traditional strengths.
Interestingly, after the big jump in mathematical and logical ability, the models' Chinese and English now look comparatively weak.
This mirrors the real world: a top examinee may well score full marks in mathematics, but full marks in Chinese is extremely hard to achieve. AI seems to have hit the same bottleneck.
A close reading of the Chinese paper shows where AI dropped marks, and the pattern is telling. On the multiple-choice section, every model except Doubao and DeepSeek-R1 had an error rate above 20%.
This phenomenon may reveal a dilemma peculiar to AI: human examinees tend to lose points through omissions when organizing language and presenting their views; for AI, by contrast, reading a long passage and accurately distinguishing every subtle semantic