GPT-4o scores only 24% on an authoritative Chinese education benchmark: a dual test of knowledge and emotional intelligence
The School of Intelligent Education at East China Normal University has released OmniEduBench, the first benchmark to evaluate the educational capabilities of large models along two dimensions: knowledge and cultivation. Across 24,602 Chinese-language questions, the experiments show that top models such as GPT-4o can solve problems but fall far short of humans in cultivation capabilities such as inspiring thinking and providing emotional support, exposing a key weakness of AI as a teacher.
In recent years, large models have made astonishing progress in knowledge answering, mathematical reasoning, and other aspects.
However, when these technologies enter complex educational environments, key questions arise: are existing evaluation methods sufficient to comprehensively assess their capabilities? Is a good "AI teacher" just a "problem-solving expert"?
Current evaluation benchmarks, especially in the Chinese-language field, have two major limitations:
Single dimension: The vast majority of benchmarks (such as C-Eval and MMLU) focus mainly on a model's knowledge reserve and comprehension, that is, the "knowledge dimension". Most also rely on simple question formats that fail to cover the variety of question types found in real exams.
Ignored capabilities: They largely overlook the "cultivation dimension" (cultivation capabilities) that is indispensable in educational scenarios, such as heuristic teaching, emotional support, moral and value cultivation, and guidance of critical thinking.
Recently, researchers from East China Normal University have launched OmniEduBench, a brand-new benchmark specifically designed to evaluate the "comprehensive educational qualities" of Chinese large models, which contains 24,602 high-quality question-and-answer pairs.
The research points out that most existing benchmarks focus on the knowledge dimension and seriously ignore the crucial "cultivation capabilities" in real educational scenarios.
Project homepage: https://mind-lab-ecnu.github.io/OmniEduBench/
Paper link: https://arxiv.org/pdf/2510.26422
Code repository: https://github.com/remiMZ/OmniEduBench-code/tree/main
The first author of the paper is Zhang Min, an associate researcher at the School of Intelligent Education of East China Normal University, whose main research direction is multimodal large models and AI-enabled education. The research team found that even top closed-source models such as Gemini performed poorly in specific evaluation dimensions of OmniEduBench, indicating that current large models still have a significant gap in truly "understanding education".
OmniEduBench, covering all educational stages and all disciplines
The core innovation of OmniEduBench lies in its unique dual-dimensional evaluation system.
Dimension one: Knowledge Dimension
This part contains 18,121 items, aiming to comprehensively examine the model's mastery of subject knowledge.
Covering all educational stages: It spans five levels of difficulty: primary school, middle school, high school, university, and professional exams.
Covering all disciplines: It includes 41 different disciplines, from humanities and history (such as the history of ancient Chinese literature) and science and engineering (such as advanced mathematics and plant physiology) to professional fields (such as law and comprehensive medicine).
Diverse question types: It includes 11 common exam question types, such as single-choice, multiple-choice, fill-in-the-blank, short-answer, term explanation, case analysis, and essay questions.
Dimension two: Cultivation Dimension
This part is the essence of OmniEduBench, containing 6,481 items, focusing on evaluating the model's "soft power" in real teaching interactions.
Focusing on core competencies, it revolves around 6 sub-fields and 20 specific teaching topics, such as:
Thinking & Cognitive Skills: Critical thinking and problem-solving abilities.
Personalized Development: Heuristic teaching and interest-driven learning.
Emotional & Mental Health: Empathy and growth mindset.
Character & Values: Sense of responsibility and integrity.
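To make the dual-dimension structure concrete, here is a minimal sketch of how a benchmark item might be represented. The field names and values are illustrative assumptions, not the released dataset schema.

```python
# Hypothetical sketch of an OmniEduBench-style item.
# All field names/values are assumptions for illustration only.
from dataclasses import dataclass, field

@dataclass
class EduItem:
    dimension: str        # "knowledge" or "cultivation"
    subject: str          # one of the 41 disciplines, e.g. "advanced_mathematics"
    stage: str            # primary / middle / high / university / professional
    question_type: str    # one of 11 exam formats, e.g. "single_choice"
    question: str
    answer: str
    options: list = field(default_factory=list)  # empty for open-ended types

knowledge_item = EduItem(
    dimension="knowledge",
    subject="advanced_mathematics",
    stage="university",
    question_type="single_choice",
    question="...",
    answer="B",
    options=["A", "B", "C", "D"],
)

cultivation_item = EduItem(
    dimension="cultivation",
    subject="emotional_mental_health",
    stage="middle",
    question_type="single_choice",
    question="Some students were laughing during a visit to a martyrs' cemetery...",
    answer="C",
    options=["A", "B", "C", "D"],
)
```

The same record shape can carry both a calculus exam question and a classroom-management dilemma; only the `dimension` and `subject` fields distinguish what is being tested.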
For example, in the "cultivation dimension", the model needs to face a situational question like this: "Some students were laughing and playing around during a visit to a martyrs' cemetery. I'm very angry. How should I handle it?"
What is examined is not only knowledge but also the model's emotional intelligence, values, and educational wisdom.
Leakage-resistant and highly challenging
To ensure the quality and challenge of the benchmark, the construction process of OmniEduBench is extremely strict, going through four stages:
Multi-source collection (927K): It aggregates public data (21K), private data such as internal test papers (106K), and uses LLMs to generate scenario-based question-and-answer pairs (800K) to ensure the diversity and uniqueness of data sources.
Structured cleaning (657K): It unifies the format, extracts metadata such as subject, grade, and question type, and conducts standardized cleaning processes such as deduplication, removing sensitive content, and eliminating dependence on external information.
Dual-model difficulty screening (50K): To prevent models from "memorizing" test questions, two strong models perform adversarial screening. First, QwQ-32B filters out the simple questions it can answer correctly; then the stronger Qwen3-235B performs a second pass, retaining only high-difficulty samples.
Expert finalization (24.6K): Finally, 50 master's students and 5 senior experts conducted manual review and quality verification. Sampled quality inspection shows an overall quality score of 4.8/5, answer accuracy of 4.8/5, and inter-annotator agreement as high as 0.90.
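The dual-model screening stage can be sketched as a simple two-pass filter. This is a minimal illustration with stub "models"; the actual pipeline uses QwQ-32B and Qwen3-235B, and the function names here are hypothetical.

```python
# Sketch of the two-stage "adversarial" difficulty filter described above.
# weak_model_answer / strong_model_answer are hypothetical stand-ins for
# QwQ-32B and Qwen3-235B; only items BOTH models miss are retained.
def difficulty_filter(items, weak_model_answer, strong_model_answer):
    # Stage 1: drop questions the weaker model solves correctly.
    survivors = [q for q in items if weak_model_answer(q["question"]) != q["answer"]]
    # Stage 2: drop questions the stronger model solves correctly.
    return [q for q in survivors if strong_model_answer(q["question"]) != q["answer"]]

# Toy usage with stub "models":
items = [
    {"question": "1+1=?", "answer": "2"},   # easy: the weak model gets it
    {"question": "hard Q", "answer": "X"},  # hard: both models miss it
]
weak = lambda q: "2" if q == "1+1=?" else "?"
strong = lambda q: "2" if q == "1+1=?" else "?"
hard_set = difficulty_filter(items, weak, strong)  # only the hard item survives
```

Filtering with two models of different strength reduces the chance that "hard" simply means "outside one particular model's training distribution".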
Experimental results: Even the strongest closed-source models struggle
The research team conducted a comprehensive test on 11 mainstream closed-source and open-source LLMs (including GPT-4o, Gemini-2.5 Pro, Claude-4 Sonnet, Qwen series, DeepSeek-V3.1, etc.) on OmniEduBench, and the results are thought-provoking:
Finding one: GPT-4o performs poorly in the knowledge dimension
In the knowledge dimension, only Gemini-2.5 Pro achieves an average accuracy above 60% (62.76%). Surprisingly, even a model as capable as GPT-4o performs poorly here, with an accuracy of only 24.17%, far below many top open-source models (e.g., QwQ-32B at 53.87%). This may indicate that the GPT series has notable adaptability issues with diverse, localized Chinese educational exam questions.
Finding two: cultivation ability is a collective shortcoming, far below human level
In the more critical cultivation dimension, every model exposes its weaknesses. Although the task format is relatively simple (mostly multiple-choice), even the best-performing model (QwQ-32B, at 70.27% accuracy) still trails human performance by nearly 30 percentage points. This shows that current LLMs broadly lack advanced educational capabilities such as empathy and heuristic guidance.
Finding three: the high-difficulty subset (OmniEduBench HARD) exposes the limits of top models
The team also constructed a high-difficulty subset, OmniEduBench HARD, on which the performance of every LLM drops precipitously. Even the strongest model, Gemini-2.5 Pro, scores below 50% accuracy, demonstrating the benchmark's difficulty and discriminative power.
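The "average accuracy" figures quoted above are plausibly macro-averages over subjects, so that large disciplines do not drown out small ones. That aggregation choice is an assumption here, not something the article specifies; the sketch below just shows the arithmetic.

```python
# Macro-averaged accuracy across subjects: average of per-subject accuracies,
# weighting every subject equally regardless of its question count.
# (Assumed aggregation; the article does not state the exact formula.)
def macro_accuracy(per_subject_counts):
    """per_subject_counts: {subject: (num_correct, num_total)}"""
    per_subject = [correct / total for correct, total in per_subject_counts.values()]
    return sum(per_subject) / len(per_subject)

scores = {
    "law": (30, 100),                   # 30% accuracy
    "advanced_mathematics": (60, 100),  # 60% accuracy
}
print(round(macro_accuracy(scores), 2))  # -> 0.45
```

A micro-average (total correct over total questions) would give the same 0.45 here only because both subjects have equal size; with unequal subject sizes the two aggregations diverge.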
Why is OmniEduBench important?
Testing real "usability": Educational AI should not just be a "problem-solving tool". OmniEduBench for the first time systematizes and quantifies the interaction ability in educational scenarios, prompting the industry to focus on the value of models in real interaction scenarios such as inspiration and feedback.
Locally adapted: Chinese education has its own linguistic culture and teaching practices. As a natively Chinese benchmark, OmniEduBench is grounded in local realities from data to task definition, and can more accurately evaluate how models perform in Chinese educational contexts.
Conclusion and outlook
The release of OmniEduBench provides a much-needed and more comprehensive perspective for the evaluation of Chinese large models in the field of education.
It clearly reveals the shortcomings of current LLMs: although the models have made great progress in knowledge acquisition, there is still a long way to go toward the core goal of education, "cultivation".
The research team said that future work will explore more complex question types in the cultivation dimension and introduce multimodal educational scenarios to continuously promote the development of the comprehensive capabilities of LLMs and MLLMs in the field of education.
Reference materials:
https://arxiv.org/pdf/2510.26422
This article is from the WeChat official account "New Intelligence Yuan", author: New Intelligence Yuan, published by 36Kr with authorization.