The first multi-round, open-ended video question-answering benchmark, systematically defining nine major hallucination tasks.
Reported by New Intelligence Yuan
[Introduction] The WildVideo benchmark targets the "hallucination" problem of multimodal models in video question answering. For the first time, it systematically defines nine types of hallucination tasks and constructs a large-scale, high-quality video dialogue dataset that covers dual perspectives and supports both Chinese and English. Its multi-round, open-ended question-answer format mirrors real-world interaction and enables a comprehensive evaluation of model capabilities.
In recent years, large models have made significant progress in the field of multimodal understanding and are now able to process image, text, and even video content in the open world.
However, a common and serious problem, "hallucination," has persistently limited their practical application.
Especially in dynamic, continuous visual scenarios, a model may generate answers that contradict the video content, violate common sense, or are inconsistent across the rounds of a conversation.
Current mainstream evaluation benchmarks mostly use single-round, single-perspective, multiple-choice settings, which can hardly reflect models' true capabilities and deficiencies in open, continuous, interactive dialogue. These limitations hinder our understanding and optimization of model performance in practical applications.
To fill this gap, a research team from the National University of Defense Technology and Sun Yat-sen University proposed WildVideo, a systematic multi-round, open-ended question-answering benchmark for real-world video-language interaction.
Paper link: https://ieeexplore.ieee.org/document/11097075
Project homepage: https://chandler172857.github.io/WildVideo-leaderboard/
Github: https://github.com/yangsongyuan18/WildVideo
Dataset: https://huggingface.co/datasets/yangsongyuan18/wildvideo
This work, for the first time, systematically defines nine types of hallucination tasks across three levels (perception, cognition, and context understanding) and constructs a large-scale, high-quality video dialogue dataset covering dual perspectives and supporting both Chinese and English. It aims to stress-test large multimodal models more comprehensively and rigorously, and it has been officially accepted by TPAMI 2025.
Design Concept and Core Contributions of WildVideo
Evaluation Framework Close to Real-World Interaction
The design of WildVideo centers entirely on real-world applications. It abandons the traditional multiple-choice and true/false formats in favor of open-ended questions and answers, simulating real conversations in which no preset options exist.
More importantly, it introduces multi-round dialogue evaluation (up to five rounds), requiring the model to understand context coherently, associate information, and resolve references, capabilities that previous video evaluations have commonly overlooked.
Fine-Grained, Multi-Dimensional Hallucination Classification System
The research team systematically classifies the hallucinations a model may produce in video tasks into three major categories and nine sub-types:
Perceptual Hallucination: covers a static dimension (object and attribute recognition) and dynamic dimensions (action understanding, visual grounding, and cross-frame consistency), testing whether the model's basic understanding of video content is accurate and stable.
Cognitive Hallucination: divided into common-sense cognition (causal relationships, cross-modal reference) and world-knowledge cognition, requiring the model not only to "see" but also to make reasonable inferences from common sense and external knowledge.
Context Understanding Hallucination: designed specifically for multi-round conversations, it includes context omission (understanding information left implicit in the conversation) and cross-round retrieval (associating key information from earlier rounds), directly evaluating the model's core abilities in continuous dialogue.
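The three-level, nine-subtype taxonomy above can be sketched as a simple data structure. This is only an illustration drawn from the summary in this article; the identifier strings below are paraphrased and are not the official task names in the released dataset:

```python
# Sketch of the WildVideo hallucination taxonomy as described above.
# Category and sub-type names are paraphrased assumptions, not the
# exact labels used in the released benchmark files.
TAXONOMY = {
    "perceptual": [
        "object_attribute_recognition",  # static dimension
        "action_understanding",          # dynamic dimension
        "visual_grounding",              # dynamic dimension
        "cross_frame_consistency",       # dynamic dimension
    ],
    "cognitive": [
        "causal_relationship",           # common-sense cognition
        "cross_modal_reference",         # common-sense cognition
        "world_knowledge",
    ],
    "context_understanding": [
        "context_omission",
        "cross_round_retrieval",
    ],
}

def subtype_count(taxonomy: dict) -> int:
    """Total number of hallucination sub-types across all categories."""
    return sum(len(subtypes) for subtypes in taxonomy.values())
```

Organizing the sub-types under their parent categories makes it straightforward to aggregate per-category scores (perception, cognition, context) when reporting results, which matches how the paper groups its findings.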
Rich and High - Quality Dataset
The benchmark contains 1,318 videos: 874 paired first-person and third-person videos from the Charades-Ego dataset, recording daily human activities to simulate different observation perspectives, plus 444 YouTube videos covering global events and cultural phenomena to enrich the world-knowledge background.
The dataset ultimately contains 13,704 single-round question-answer pairs and 1,585 multi-round dialogues. Its construction combines the generative power of strong LLMs with multiple rounds of review and refinement by PhD-level experts from several countries, ensuring challenging questions, accurate answers, and natural, fluent dialogues.
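As a quick sanity check, the dataset figures quoted above are internally consistent. The field names below are illustrative only and are not taken from the released files:

```python
# Dataset statistics as reported in the article; field names are
# hypothetical labels chosen for this illustration.
STATS = {
    "videos_charades_ego": 874,   # paired ego/exo daily-activity videos
    "videos_youtube": 444,        # world-knowledge videos
    "videos_total": 1318,
    "single_round_qa_pairs": 13704,
    "multi_round_dialogues": 1585,
}

def video_totals_consistent(stats: dict) -> bool:
    """Check that the two video sources sum to the reported total."""
    return (stats["videos_charades_ego"] + stats["videos_youtube"]
            == stats["videos_total"])
```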
Main Experimental Findings and In - Depth Insights
The research team comprehensively evaluated 14 mainstream open-source and commercial models (GPT-4o, Claude-3.5-Sonnet, the Gemini series, LLaVA-Video, InternVL, etc.) on WildVideo and revealed several key findings:
Overall Performance Reveals Great Challenges
Even the most advanced current models struggle on WildVideo.
On single-round tasks, the best-performing model, GPT-4o, reaches only 62.1% accuracy. When the tasks extend to multi-round conversations, its accuracy drops further to 52.7%. This clearly shows that multi-round interaction is substantially harder than single-round question answering and that existing models fall well short.
Imbalanced Ability Structure
Perception Level: models perform best on static "object" recognition tasks, while performance drops significantly on "action" recognition and "visual grounding" tasks that require temporal understanding, revealing a deficiency in processing dynamic information.
Perspective Preference and Language Difference
Perspective Preference: almost all models perform systematically better on third-person (exocentric) videos than on first-person (egocentric) videos. The researchers attribute this to the greater motion blur, abrupt viewpoint changes, and occlusions in first-person footage, which place higher demands on dynamic perception.
Chinese-English Bilingual Evaluation: WildVideo provides a complete Chinese-language evaluation set. Experiments show that models generally perform worse on Chinese tasks than on English ones; even the best-performing model, GPT-4o, reaches only 54.0% on Chinese multi-round tasks. This provides a clear diagnostic tool for optimizing Chinese multimodal models.
Trade-off between Lite and High-Performance Models
Comparing GPT-4o with GPT-4o mini, and Gemini 1.5 Pro with Gemini 1.5 Flash, the more powerful versions lead on most tasks.
Interestingly, the lightweight Gemini 1.5 Flash outperforms its high-performance counterpart on multi-round context-understanding tasks, suggesting that efficiency and long-context processing may follow different optimization paths in model design.
Significance and Future Outlook
The release of WildVideo not only gives the community a new and more rigorous evaluation "ruler" but also points to an important direction for the evolution of large multimodal models:
Promote the Upgrade of the Evaluation Paradigm: it pushes video-understanding evaluation from "static snapshot question answering" to "dynamic continuous dialogue," and from "objective selection" to "open-ended generation," bringing it closer to real applications.
Refined Diagnosis of Model Deficiencies: its detailed hallucination taxonomy helps researchers pinpoint exactly where a model fails (seeing inaccurately, reasoning incorrectly, or forgetting), enabling targeted improvements.
Promote the Development of Multi-Round Dialogue Technology: the benchmark clearly exposes the fragility of current models in multi-round interaction, which should spur academia and industry to invest more in key technologies such as dialogue state management, long-term memory mechanisms, and reference resolution.
Support Cross-Language and Cross-Cultural Optimization: the parallel Chinese-English design provides an important evaluation basis for developing more globally applicable multimodal models.
WildVideo works like a comprehensive "physical examination center": it tells us that although current multimodal models appear powerful, they still need breakthroughs in key capabilities such as dynamic perception, deep reasoning, and coherent interaction on the way to truly practical video-dialogue intelligence.
This work has open-sourced the benchmark data, and it is expected to keep driving the video-language interaction field toward more reliable and intelligent systems.
Reference: https://ieeexplore.ieee.org/document/11097075
This article is from the WeChat official account "New Intelligence Yuan". Editor: LRST. Republished by 36Kr with permission.