
A team from Nanjing University exposes the high-score myth of large models: humans score 90 points, while the strongest model only scores 49 points.

新智元 · 2026-04-13 16:51
The evaluation scores of existing large models are becoming increasingly saturated, but there is a significant gap between these scores and the real user experience.

[Introduction] The evaluation scores of existing large models are becoming increasingly saturated, yet a significant gap remains between those scores and real-world experience. Against this backdrop, a new video understanding benchmark, Video-MME-v2, has been launched, led by Fu Chaoyou's team at Nanjing University at the invitation of the Google Gemini evaluation team. With an innovative hierarchical ability system, group-level non-linear scoring, and over 3,300 man-hours of high-quality annotation, it reveals a huge gap between models and humans (49 vs. 90), the inflation of the traditional Acc metric, and the finding that 'Thinking' does not always bring improvement.

More than a year ago, the Video-MME team led by Fu Chaoyou released its first-edition benchmark, which has been widely used for video understanding evaluation by models such as Gemini and GPT.

According to Paper Digest statistics, Video-MME ranks first in influence among all accepted papers at CVPR 2025 (with over 1,100 citations).

In recent years, the team has further systematically reviewed the evaluation of multimodal large models and published a survey, MME-Survey, comprehensively analyzing existing benchmarks in terms of ability coverage, evaluation methods, and metric design.

For this reason, the team realized earlier and more clearly that the existing evaluation paradigm is gradually becoming 'distorted': multimodal large models have made rapid progress in video understanding, and scores on various benchmarks are approaching saturation, yet the real-world experience still falls short. Against this background, Video-MME-v2 was officially released.

Paper: https://arxiv.org/pdf/2604.05015

Homepage: https://video-mme-v2.netlify.app/

MME - Survey: https://arxiv.org/pdf/2411.15296

Video-MME-v2 is an evaluation benchmark for next-generation video understanding capabilities. After nearly a year of preparation, it was completed by 12 annotators and 50 independent reviewers, with over 3,300 man-hours of annotation invested.

Unlike traditional benchmarks, it features a carefully designed three-layer hierarchical ability system and a group-level non-linear scoring method.

The evaluation results show that the non-linear score of human experts is 90.7 (traditional Acc: 94.9), while the current strongest commercial model, Gemini-3-Pro, scores only 49.4, and the best open-source result, from Qwen, is 39.1.

What does Video-MME-v2 measure?

The first core design of Video-MME-v2 is to break video understanding down into a three-layer hierarchical ability system.

Layer 1: Information retrieval and aggregation. This is the most basic layer of video understanding, focusing on whether the model can accurately identify and extract key facts from cross-frame and cross-modal information.

Layer 2: Temporal understanding. Building on the first layer, the second layer examines whether the model truly understands the time dimension. It requires the model not only to understand the static content of individual frames but also to grasp the order of actions, how states change, and why events occur.

Layer 3: Complex reasoning. Building on the second layer, the third layer is closer to real-world tasks, requiring the model to reason in more complex and open scenarios. This is also the layer closest to 'human-like understanding': not only to comprehend, but also to infer, explain, and synthesize. Figure 1 illustrates the structure of these three layers.

Figure 1: The ability-level distribution of Video-MME-v2 and the ability ranking of selected models
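
The three-layer hierarchy above can be sketched as a simple nested structure. The layer names follow the article; the sub-ability examples are illustrative assumptions, not the paper's actual taxonomy:

```python
# Sketch of the three-layer ability hierarchy as a nested dict.
# Layer names follow the article; the sub-abilities listed are
# hypothetical examples, not the benchmark's real taxonomy.

ABILITY_HIERARCHY = {
    "Layer 1: Information retrieval and aggregation": [
        "cross-frame fact extraction",   # assumed example
        "cross-modal fact extraction",   # assumed example
    ],
    "Layer 2: Temporal understanding": [
        "action ordering",               # assumed example
        "state-change tracking",         # assumed example
    ],
    "Layer 3: Complex reasoning": [
        "multi-step inference",          # assumed example
        "open-ended explanation",        # assumed example
    ],
}

# Each question would be tagged with one layer, so per-layer scores
# (and the decline from Layer 1 to Layer 3) can be reported.
for layer, abilities in ABILITY_HIERARCHY.items():
    print(layer, "->", ", ".join(abilities))
```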

Video-MME-v2 is not just 'more questions': it uses a new evaluation method

The second key innovation of Video-MME-v2 answers the question of how to measure.

Instead of continuing the traditional practice of scoring each question independently, this work introduces group-level evaluation: rather than only checking whether the model answers a single question correctly, it checks whether the model shows consistency and coherence across a group of related questions.

Ability consistency group: checks whether the model 'really knows'

It focuses on whether the model remains stable across different question forms, granularities, and aspects of the same ability.

For example, if a model truly has spatial understanding ability, it should be able to answer not only 'where the object is' but also 'how its position relative to another object changes'.

Reasoning coherence group: checks whether the model is 'really reasoning'

It focuses on whether the model can follow a sound logical chain to reach a conclusion when a complex problem requires multi-step reasoning.

For example, in a video with a complex plot, the model may need to first spot a key visual clue, then identify an anomalous detail, then infer the characters' intent, and finally draw a conclusion.

If there is a mistake in one of the steps, even if the final answer is 'correct by chance', this kind of correctness cannot be regarded as truly reliable reasoning.

To match the group-level evaluation, the Video-MME team further adopted a non-linear scoring mechanism, one of the signature designs of Video-MME-v2.

For an ability consistency group, the scores of the four related questions are not simply averaged; instead, incentive scoring is used: the more questions a model answers correctly within a group, the greater the reward. This means that answering a few questions sporadically cannot yield a high score; only when the model performs stably across the whole group does the score really climb.

For a reasoning coherence group, a 'first-error truncation' mechanism is further adopted: once a step is wrong, subsequent answers earn no more points even if they are correct.
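
The two scoring rules can be sketched as follows. The incentive weights are illustrative assumptions; the paper's actual formulas may differ, so treat this as a sketch of the idea rather than the benchmark's implementation:

```python
# Sketch of the two group-level scoring rules described above. The
# incentive weights below are illustrative assumptions; the paper's
# actual formula may differ -- see the technical report.

def consistency_score(correct, weights=(0.0, 0.1, 0.3, 0.6, 1.0)):
    """Incentive scoring for a 4-question ability-consistency group:
    the reward grows faster than linearly with the number of correct
    answers, so scattered hits earn little and only stable performance
    across the whole group scores high."""
    return weights[sum(correct)]

def coherence_score(correct):
    """First-error truncation for a reasoning-coherence group: only the
    unbroken prefix of correct steps counts; once a step is wrong, later
    answers earn nothing even if they happen to be right."""
    prefix = 0
    for ok in correct:
        if not ok:
            break
        prefix += 1
    return prefix / len(correct)

# Two scattered hits out of four earn far less than half the credit:
print(consistency_score([True, False, True, False]))  # 0.3
# A right-wrong-right-right chain is truncated at the first error:
print(coherence_score([True, False, True, True]))     # 0.25
```

Note how both rules penalize exactly the failure modes the article names: inconsistency within a group, and a lucky final answer reached through a broken chain.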

Why is it more difficult and more reliable?

A benchmark's persuasiveness lies not only in clever design but also in robust data.

The team strictly controlled every aspect of Video-MME-v2, including data sources, the annotation process, and quality-inspection standards, investing substantial human effort.

The final dataset contains 800 videos and 3,200 questions; a total of 12 annotators and 50 independent reviewers participated. After 5 rounds of cross-review and closed-loop revision, more than 3,300 man-hours were accumulated. For more details, please refer to the homepage and technical report.

What are the evaluation results?

On the main leaderboard, humans reach a group-level non-linear score of 90.7 and an average accuracy of 94.9, while the current best-performing commercial model, Gemini-3-Pro, has a group-level non-linear score of only 49.4.

Among open-source models, Qwen3.5-397B-A17B-Think (512 frames) has a group-level score of 39.1.

This means that even today's strongest video models still lag far behind humans under a stricter evaluation framework that emphasizes consistency and coherence.

The paper also points out that models show a significant performance decline from Layer 1 to Layer 3. This indicates that weakness in high-level complex reasoning is not just a matter of a 'weak reasoning module': problems often already exist in the earlier information aggregation and temporal modeling, accumulating layer by layer and ultimately dragging down complex understanding.

Figure 2: The top 10 models in the current evaluation (for the full list, please refer to the homepage)

The advantage of non-linear scoring: from 'answering one question correctly' to 'stably understanding a group of questions'

In traditional evaluations, the average accuracy (Avg Acc) is the most commonly used indicator, but it is essentially the result of independent statistics for each question and is easily affected by 'random hits'.

In contrast, the group-level non-linear score (Non-Lin Score) proposed by the team models the structural relationship between questions and emphasizes the model's overall performance within the same ability dimension, thus depicting more truthfully whether the model 'stably understands the video'.

Furthermore, non-linear scoring reveals an important phenomenon: there is a significant loss of ability from 'answering a single question correctly' to 'answering a whole group correctly and stably'.

To measure the degree of this loss, the team introduced an interpretable metric: the Non-Lin Score/Avg Acc ratio.
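
The ratio itself is a one-line computation; here is a minimal sketch, checked against the human leaderboard numbers quoted earlier in the article (non-linear score 90.7, average accuracy 94.9):

```python
# Minimal sketch of the Non-Lin Score / Avg Acc stability ratio
# described above, using the human numbers quoted in the article.

def stability_ratio(non_lin_score, avg_acc):
    """Close to 1.0: correct answers cluster into fully consistent
    groups. Low: correct answers are scattered, so group-level scoring
    strips away much of the per-question accuracy."""
    return non_lin_score / avg_acc

print(round(stability_ratio(90.7, 94.9), 2))  # 0.96 -- humans lose little
```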

The experimental results show that the ratio of the current strongest model, Gemini-3-Pro, is about 75%; that of Doubao-Seed-2.0-Pro is about 72%; and for some small and medium-sized models (such as LLaVA-Video-7B), the ratio drops as low as about 40%.

The lower the ratio, the more prone the model is to answering only some questions in a group correctly, and the weaker its stability and robustness. This demonstrates the advantage of non-linear scoring in truthfully depicting ability levels and revealing model robustness.

Figure 3: The Non-Lin Score/Avg Acc ratio for different models

A very noteworthy discovery: Thinking does not always work

In the context of today's large models, 'Thinking' has almost become the default enhancement option. However, a very interesting and important finding of Video-MME-v2 is that the benefits of Thinking are not unconditional: they depend heavily on textual clues.

The paper's experiments show that after Thinking is enabled, models usually gain more in the 'subtitled' setting than in the 'pure visual' setting.

For example, Qwen3.5-122B-A10B-Think (64 frames) gains +3.8/+5.8 in the subtitle-free and subtitled settings respectively. This shows that explicit textual semantics remain an important 'anchor' for many models when performing multi-step reasoning.

On the other hand, Thinking may also cause degradation. Qwen3-VL-8B declines by 0.6 in the subtitle-free setting, and KimiVL-16B declines by 3.3/3.3 overall, with the degradation reaching 4.0/3.9 at Layer 3, which emphasizes more complex reasoning.

This suggests that the 'reasoning enhancement' of some current models is, in essence, still better at exploiting language clues than at stably extracting supporting evidence from vision and audio. Once textual anchors are insufficient, Thinking may not only fail to help but may introduce additional noise.

Figure 4: The impact of enabling Thinking on model performance under subtitle-free and subtitled settings

Summary

In the next stage of video understanding, Video-MME-v2 aims to drive a shift in evaluation philosophy, emphasizing that the real comparison is over which model can, like a human, truly understand what is happening and has happened in continuous, dynamic, multimodal information. For more content and details, please refer to the homepage and technical report.

Author introduction

The project lead of the Video-MME series is Fu Chaoyou of Nanjing University:

Fu Chaoyou is a researcher, assistant professor, and doctoral supervisor at the Pattern Recognition Laboratory of Nanjing University. He was selected for the 'Youth Talent Support Project' of the China Association for Science and Technology.

He received his doctorate from the Pattern Recognition Laboratory of the Institute of Automation, Chinese Academy of Sciences in 2022. His research focus is multimodal content analysis. His work has been cited more than 8,700 times on Google Scholar, with two first-author papers cited over a thousand times each and six first-author papers cited over a hundred times each.

His open-source projects have accumulated more than 20,000 GitHub stars. His representative works include the VITA multimodal large-model series (VITA-1.0/-1.5, Long-VITA, VITA-Audio), the MME multimodal evaluation benchmark series (MME, Video-MME, MME-RealWorld), and the Awesome-MLLM community.

He serves as an editorial board member of Pattern Recognition and IEEE T-BIOM, an area chair for ICLR and ICML, a member of the Youth Work Committee of CSIG, and an executive member of the CCF-AI and CCF-CV technical committees.

He has won