AI Writes the College Entrance Examination Essay: Hunyuan Gives DeepSeek-V4 a Full Score
Here's what happened. The 2026 college entrance examination is taking place these days, and Anthropic released its Mythos-level large model yesterday. So I wondered whether I could ask several of today's high-profile large models to write this year's college entrance examination essay.
I selected two overseas models and two domestic ones: GPT-5.5, Fable-5, DeepSeek-V4, and Hunyuan 3 Preview.
The topic is this year's college entrance examination composition topic in Beijing:
Choose one of the following two topics and answer according to the requirements. The composition should be no less than 700 words.
(1) Knowledge is infinite, and there are methods for reading. Cheng Duanli, a scholar in the Yuan Dynasty, compiled the "Reading Schedule by Years", which in detail specified the reading order and intensive reading methods of core classics in different stages, accompanying scholars from childhood to youth. Whether it is personal reading and growth or the development of the country and society, it is necessary to make good plans and proceed step by step; it also requires putting in efforts and doing solid work.
Please write an argumentative essay with the title "Making Plans and Putting in Efforts".
Requirements: The argument should be clear, the arguments should be substantial, and the reasoning should be reasonable; the language should be fluent, and the writing should be clear.
(2) "Savoring the essence" means holding a flower and chewing it carefully to taste its fragrance, which is a metaphor for carefully pondering and comprehending the essence of poems and essays. This process of repeated savoring and earnest understanding is very important in many aspects, such as reading classics, appreciating art, and perceiving life. The process of savoring the essence is often an unforgettable experience...
Please write a narrative essay with the title "Savoring the Essence".
Requirements: The ideas should be healthy; the content should be substantial and reasonable, with detailed descriptions; the language should be fluent, and the writing should be clear.
However, I thought that if I were the judge, it would be too subjective. So I created a loop. After these four models completed their answers, I let them play the role of examiners in turn to blindly score all the papers.
The scoring criteria are as follows:
First-class essays: 42-50 points. The theme is accurately and profoundly presented, the content is substantial, the structure is well-developed, and the language is appealing.
Second-class essays: 34-41 points. They meet the requirements of the topic, are clearly expressed, and the content is relatively complete, but they are slightly lacking in depth or language.
Third-class essays: 25-33 points. They basically meet the requirements of the topic, but the content is vague, the structure is ordinary, or the expression is mediocre.
Fourth-class essays: 16-24 points. They deviate significantly from the topic, the content is weak, the logic is chaotic, or there are many language problems.
Fifth-class essays: 0-15 points. They seriously deviate from the topic, are incomplete, are obviously plagiarized, or are basically unreadable.
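As a small illustration, the five bands above can be written as a lookup. This is only a sketch: the names `BANDS` and `band_for` are made up for this example and are not part of the prompt given to the models.

```python
# Bands copied from the scoring criteria above: (lowest score, highest score, label).
BANDS = [
    (42, 50, "First-class"),
    (34, 41, "Second-class"),
    (25, 33, "Third-class"),
    (16, 24, "Fourth-class"),
    (0, 15, "Fifth-class"),
]


def band_for(score: int) -> str:
    """Return the class label for an integer score between 0 and 50."""
    for low, high, label in BANDS:
        if low <= score <= high:
            return label
    raise ValueError(f"score out of range: {score}")


print(band_for(41))  # Second-class
print(band_for(46))  # First-class
```

Under these bands, a score of 41 sits at the very top of the second class, one point short of first class.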
Moreover, each score should be accompanied by a brief review, including the advantages and disadvantages of the essay.
The examiners cannot see the names of the students, only anonymous essays.
The loop exits only when the examiner passes a self-check of its own scoring strictness.
The prompt for the self-check is "Please state whether you find that you may be affected by factors such as writing style, familiarity, or speculation about the author. If so, please re-calibrate the score."
After giving an evaluation, each examiner must also run this self-check on it. In other words, the final answer is output only once the self-check passes.
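The loop described above might look roughly like the sketch below. It is not the script used in the experiment: the `Review` type, the `Examiner` callable signature, and the `max_rounds` cap are all assumptions made for illustration, and a real run would replace the examiner callables with calls to each model.

```python
from dataclasses import dataclass
from typing import Callable, Dict

# The self-check prompt quoted in the article.
SELF_CHECK_PROMPT = (
    "Please state whether you find that you may be affected by factors such as "
    "writing style, familiarity, or speculation about the author. "
    "If so, please re-calibrate the score."
)


@dataclass
class Review:
    score: int           # 0 to 50
    comment: str         # brief review: advantages and disadvantages
    self_check_ok: bool  # True once the examiner passes its own self-check


# Hypothetical interface: (anonymous label, essay text, attempt number) -> Review.
# A real examiner would send the essay, the criteria, and SELF_CHECK_PROMPT to a model.
Examiner = Callable[[str, str, int], Review]


def grade_blind(
    essays: Dict[str, str],
    examiners: Dict[str, Examiner],
    max_rounds: int = 3,  # assumed safety cap, not mentioned in the article
) -> Dict[str, Dict[str, Review]]:
    """Have every examiner score every essay without seeing author names."""
    # Examiners only ever see "Essay 1", "Essay 2", ... instead of the author.
    labels = {author: f"Essay {i}" for i, author in enumerate(essays, start=1)}
    results: Dict[str, Dict[str, Review]] = {}
    for examiner_name, examiner in examiners.items():
        results[examiner_name] = {}
        for author, text in essays.items():
            for attempt in range(1, max_rounds + 1):
                review = examiner(labels[author], text, attempt)
                if not 0 <= review.score <= 50:
                    raise ValueError(f"score out of range: {review.score}")
                if review.self_check_ok:
                    break  # exit condition: the self-check passed
            # If the cap is hit, the last review is kept with self_check_ok False.
            results[examiner_name][author] = review
    return results
```

Keeping the author-to-label mapping outside the examiner call is what makes the scoring blind: the mapping is only used afterwards to attribute scores.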
This is an exam of AI against AI and also a review of AI by AI.
Both GPT-5.5 and Fable-5 chose to write argumentative essays.
Their papers are highly similar: both opened by citing "If you plan ahead, you will succeed; if you don't, you will fail", argued that "plans determine the direction, and efforts determine the distance", gave the examples of Wang Xizhi, Yuan Longping, and the reform and opening-up, and ended by elevating the theme to "youth in the new era" and "the other shore of ideals".
The structure is complete, the logic is clear, and the language is fluent. However, they also have a common problem: the materials are too common, and the expressions are too formulaic.
DeepSeek-V4 chose to write a narrative essay. It wrote about the "Book of Songs" in the grandfather's study, the afternoon when the phoenix tree leaves were falling, the epiphany of "The peach tree is young and lovely, with blossoms bright and gay" in the setting sun, and the evening when it opened the "Book of Songs" because of a misunderstanding in friendship. The narrative has plot, details, and a sense of growth.
Hunyuan 3 Preview also chose to write an argumentative essay. Compared with the previous two argumentative essays, its materials are slightly different: it added the examples of Huawei chips and Qian Xuesen. The overall framework, however, is still the syllogism of "plans are important + efforts are important = success".
As mentioned before, each examiner cannot see who the author is, only "Essay 1", "Essay 2", "Essay 3", and "Essay 4".
Finally, the report cards of the four students are as follows:
The average score of GPT-5.5's argumentative essay given by the four examiners is 43.25 points.
The average score of Fable-5's argumentative essay is 44 points.
The average score of DeepSeek-V4's narrative essay is 46 points.
The average score of Hunyuan 3 Preview's argumentative essay is 43.25 points.
The narrative essay is slightly better than the argumentative essays, but the difference is not significant. The average scores of the three argumentative essays are almost the same because their evaluations are also almost the same: the topic is accurately understood, the structure is complete, the logic is clear, but the materials are common, the expressions are formulaic, and the depth of thought is insufficient.
What's more interesting is the dispersion of the scores.
For the same essay, the scores given by different examiners can differ by 8 points. This shows that even for AI, when facing the highly subjective task of scoring compositions, the standards can vary.
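To make "dispersion" concrete, here is a minimal sketch that computes the average and the spread (highest minus lowest score) for one essay. The four scores are hypothetical placeholders chosen to show an 8-point gap; they are not the scores from this experiment, and `summarize` is a name invented for the example.

```python
from statistics import mean


def summarize(scores):
    """Return (average, spread), where spread is the highest minus the lowest score."""
    return mean(scores), max(scores) - min(scores)


# Hypothetical scores from four examiners for one essay (not the experiment's data).
example_scores = [40, 44, 45, 48]
average, spread = summarize(example_scores)
print(average, spread)  # 44.25 8
```

Two essays can share the same average while one of them has a much wider spread, which is why the spread is worth reporting next to the mean.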
Some examiners value the depth of thought more, some value the language expression more, some have a higher tolerance for clichés, and some have stricter requirements for details.
The self-check mechanism is designed precisely to make each examiner aware of its own preferences and try to return to objective standards.
Hunyuan 3 Preview is the kindest.
The average score it gave to the four essays is 48 points, which is higher than that of the other three examiners.
It gave 48 points to GPT-5.5's argumentative essay and a full score of 50 points to DeepSeek-V4's narrative essay. Its comments are also extremely gentle: "The topic is fully grasped, the structure is clear and progressive... The arguments are appropriate, the reasoning is coherent, and the language is fluent and expressive."
In contrast, Claude Fable-5 is the strictest examiner. The average score it gave to the four essays is only 42.25 points, nearly 6 points lower than that of Hunyuan 3 Preview. It has the lowest tolerance for clichés and repeatedly wrote in its comments that "the language contains many clichés" and "the content lacks personalized thinking".
Even more interesting, GPT-5.5 gave its own essay 41 points, which is at the top of the second-class range. Its comments are merciless: "The arguments are relatively common, the discussion mostly stays at the level of positive interpretation and familiar examples, the distinctiveness of thought is not strong enough, and some sentences are slightly clichéd."
During the self-check, it wrote: "I did not make judgments based on the author's identity, writing tool, or 'whether it is like AI'... I should not give excessive points because of the neat language, nor should I deliberately lower the score because of the relatively conventional expression. 41 points is more appropriate."
It shows itself no mercy in self-criticism.
Among the four essays, the most distinctive one is DeepSeek-V4's narrative essay.
It wrote about the "Book of Songs" in the grandfather's study, and the language is very beautiful: "The dark-yellow pages are like autumn leaves, emitting the mellow fragrance after the fermentation of time." "Those sentences are like fireflies on a summer night, flickering on and off."
This dense use of metaphor led DeepSeek-V4, acting as examiner, to complain while evaluating its own essay: "Some of the language is a bit deliberate... Although the metaphors are beautiful, they seem a bit artificial when arranged intensively."
However, Hunyuan 3 Preview believes that "the details are rich, the whole essay echoes the theme with the images of 'flowers' and 'fragrance' throughout, and the emotions are sincere... There are no obvious flaws."
The three argumentative essays expose another problem: they are all too similar.
The argumentative essays of GPT-5.5, Fable-5, and Hunyuan 3 Preview all opened by citing "If you plan ahead, you will succeed; if you don't, you will fail", all gave the example of Wang Xizhi, all used clichés like "the other shore of ideals" and "steady progress", and even share the same structure: the importance of plans, the importance of efforts, and the unity of the two.
Examiner Claude Fable-5 repeatedly mentioned this problem in its comments: "The examples are mostly well-known celebrity stories", "The discussion stays at a conventional level", and "The language contains many clichés."
However, Hunyuan 3 Preview still adhered to its principle of kindness and gave these "formulaic essays" high scores of 47-48 points.
The final statistics are even more interesting: the average score of DeepSeek-V4's narrative essay is 46 points, the highest among the four students. The average scores of the three argumentative essays are almost the same, all between 43 and 44 points.
Generally speaking, narrative essays are more likely to stand out, while argumentative essays are prone to falling into formulas.
Especially when AI writes argumentative essays, they will all choose the "safest" writing style. They can accurately understand the topic, have a complete structure, and clear logic, but they also lack "personality" the most.
Score Summary Table
Explanation of Scoring Criteria
The college entrance examination composition is scored using a five-dimension scoring system:
• Understanding the topic and establishing the theme (15 points): Whether the topic is accurately understood and the theme is clearly and profoundly presented.
• Content and thought (15 points): Whether the materials are substantial and the thought has depth and individuality.
• Structure and logic (8 points): Whether the structure is complete and the argumentation/narration is rigorous.
• Language expression (8 points): Whether the language is fluent and has expressiveness and appeal.
• Standardization and completion (4 points): Whether the format is standard and the word count meets the requirement.
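As a quick sanity check, the five dimension maxima listed above add up to the 50-point total. The sketch below encodes them and sums a per-dimension breakdown; the dictionary keys, the `total_score` helper, and the sample breakdown are illustrative assumptions, not the examiners' actual output.

```python
# Maximum points per dimension, as listed above.
RUBRIC = {
    "topic_and_theme": 15,
    "content_and_thought": 15,
    "structure_and_logic": 8,
    "language_expression": 8,
    "standardization_and_completion": 4,
}


def total_score(points: dict) -> int:
    """Validate a per-dimension breakdown against RUBRIC and return its sum."""
    for name, maximum in RUBRIC.items():
        value = points[name]
        if not 0 <= value <= maximum:
            raise ValueError(f"{name} must be between 0 and {maximum}, got {value}")
    return sum(points[name] for name in RUBRIC)


print(sum(RUBRIC.values()))  # 50

# Hypothetical breakdown for one essay (not taken from the experiment).
sample = {
    "topic_and_theme": 13,
    "content_and_thought": 12,
    "structure_and_logic": 7,
    "language_expression": 7,
    "standardization_and_completion": 4,
}
print(total_score(sample))  # 43
```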
Grade division:
• First-class essays: 42-50 points
• Second-class essays: 33-41 points
• Third-class essays: 25-32 points
• Fourth-class essays: below 24 points
Essays
GPT-5.5
Making Plans and Putting in Efforts
The ancients said, "If you plan ahead, you will succeed; if you don't, you will fail." Life is like a long journey. Without a direction, even the most urgent steps may deviate from the destination; if you only have a map but are reluctant to take a step, you will never reach the distance. Therefore, whether it is personal study and growth or the development of the country and society, two things are indispensable: first, make good plans and clarify the path; second, put in sufficient efforts and implement them steadily. Plans determine where we go, and efforts determine how far we can go.
Making plans is to make our efforts more targeted. Cheng Duanli, a scholar in the Yuan Dynasty, compiled the "Reading Schedule by Years", which divided the study of scholars into different stages and arranged the order of classic reading and the methods of intensive reading. This arrangement seems strict, but it actually reflects the law of learning: laying a foundation in childhood and delving into the principles in youth. Proceeding step by step can lead to a solid accumulation and a sudden outburst. It is the same for doing things as for reading. If a student wants to improve their grades, they cannot just shout "I will work hard" but should be clear about their weak subjects, daily study tasks, and weekly review rhythm. When the goal is clear, time will not be wasted; when the steps are reasonable, efforts will not become blind consumption.
However, no matter how good the plan is, if it is not put into practice, it is just a piece of paper. What really makes a change is not the plan written on paper but the daily action of adhering to completing the plan. Wang Xizhi practiced calligraphy by the pond, and the pond water turned black. That's why he became known as the "Sage of Calligraphy". Yuan Longping walked in the fields for a long time and conducted repeated experiments, which enabled hybrid rice to benefit the world. Their achievements were not obtained by chance. After clarifying their goals, they put in efforts in the most simple and