How to Distinguish Real from Fake AI-Generated Videos? A Review of Dynamic, Traceable, and Interpretable Detection Systems
In the past two years, video generation models have evolved rapidly. From the striking results when Sora was first released at the end of 2024 to the breakthroughs on multiple fronts at the beginning of this year by models such as Google Veo, Sora 2, the Kling series, and Seedance 2.0, the quality of AI-generated videos has made a qualitative leap. These models can now generate videos several minutes long, with multiple characters and complex scenes, at a film-level degree of realism.
In contrast to the fast-changing generation side, the research community's attention to AI video detection has been lukewarm.
Yet it is not hard to see that video, being multimodal, is far more deceptive than still images, and that this is having a large social impact:
On social platforms, deceptive AI-generated videos appear frequently, and their quantity, quality, and breadth of coverage are all increasing rapidly. When users ask foundation models such as Grok and Doubao whether a video is AI-generated, the answers are often bare yes-or-no judgments that lack interpretability and credibility. On platforms like Xiaohongshu, genuinely filmed videos are often labeled as "suspected to be AI-generated".
There is a wide gap between the rapid development of the generation side and the lack of attention on the detection side. We need to ask, and soon: in an era of rapid iteration in AI video generation, how far has research on AI-generated video detection progressed, what paradigm shift is it undergoing, and in which directions does it need to develop?
Against this background, researchers from MBZUAI, Renmin University of China, and Harvard University co-authored a fifty-page review. For the first time, it traces the technical path from low-level visual perception to high-level world-level reasoning along both the visual and the language directions. On that basis, it analyzes the trustworthy detection system that is now urgently needed: one that is dynamic, traceable, and interpretable, and that couples multiple layers of evidence. The review has been accepted by ACL 2026.
Paper link: https://www.researchgate.net/doi/10.13140/RG.2.2.31713.88168
GitHub link: https://github.com/dxhou/AI-Generated-Video-Detection
Homepage link: https://AIgcvdetection.github.io
Redefining the Goals of AI-Generated Video Detection
Figure 1 | The complete process of AI-generated video detection: from the generation side, through dual-perspective detection, to evidence collection
Before the explosion of generative AI, AI-generated videos left relatively obvious visual artifacts. On that premise, frame-level visual inspection was effective enough in early Deepfake scenarios, typified by face-swapping.
Over the past two years, as generative AI has developed rapidly, video quality has gradually outgrown this premise, and it has become increasingly difficult for the human eye to judge the authenticity of a complete video. Detection that outputs only a binary classification is no longer sufficient. The urgent question is: what evidence does the detector rely on to support a credible judgment?
The review first pushes the boundary of the detection problem: it argues that detection output needs to move from "real-or-fake binary classification" to "interpretable and credible structured judgments", so that what is being verified becomes the gap between the "virtual world" in the video and the "real world".
Accordingly, the review redefines the detection goal as "factual fidelity verification": verifying whether propositions in the video content, such as "who, when, where, and what happened", are consistent with the real world both perceptually and cognitively. Beyond verification within the visual modality and across modalities, it is also necessary to judge whether these propositions conflict with external "facts, physical laws, and world knowledge".
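To make the move from a binary label to a structured judgment concrete, here is a minimal Python sketch of what such an output might look like. It is only an illustration: the class names (`Evidence`, `Proposition`, `StructuredJudgment`), the field names, and the rule that a single contradicted proposition flags the video are all assumptions, not a schema taken from the review.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Evidence:
    source: str          # which check produced it, e.g. "physics check" (hypothetical name)
    contradicts: bool    # True if the evidence conflicts with the proposition
    note: str            # short human-readable explanation


@dataclass
class Proposition:
    kind: str            # "who", "when", "where", or "what"
    claim: str           # the claim read off the video content
    evidence: List[Evidence] = field(default_factory=list)

    def contradicted(self) -> bool:
        return any(e.contradicts for e in self.evidence)


@dataclass
class StructuredJudgment:
    propositions: List[Proposition]

    def verdict(self) -> str:
        # Toy rule (an assumption): any contradicted proposition flags the video.
        if any(p.contradicted() for p in self.propositions):
            return "conflicts with the real world"
        return "no conflict found"

    def explanation(self) -> List[str]:
        # One line per contradicting piece of evidence, so the verdict is traceable.
        return [
            f"{p.kind}: {p.claim} <- {e.source}: {e.note}"
            for p in self.propositions
            for e in p.evidence
            if e.contradicts
        ]


judgment = StructuredJudgment([
    Proposition("where", "the scene is a city street at noon"),
    Proposition("what", "a cup falls upward off the table",
                [Evidence("physics check", True, "motion opposes gravity")]),
])
print(judgment.verdict())
for line in judgment.explanation():
    print(line)
```

The point of the design is that the verdict is never reported alone: every negative conclusion can be traced back to a specific proposition and the evidence that contradicts it.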
Three Paradigms of AI-Generated Videos as Detection Objects
Figure 2 | Three types of AI-generated video paradigms defined in this review
Since 2020, AI-generated videos have undergone a paradigm shift: from the early Deepfake period, when videos were locally modified using GANs, to audio-visual recombination such as lip-syncing and voice replacement, and then to fully synthesized videos produced by "world simulators" like Sora, driven by latent diffusion models. The review divides AI-generated videos into the following three paradigms:
Local Manipulation Video (LMV) with Real Carriers Retained
LMV has long been the most typical and mature paradigm in traditional Deepfake detection. In this type of video, local regions of genuinely filmed footage are altered, for example by face-swapping or background replacement, while most of the original video structure, such as the scene, character actions, camera movements, and lighting relationships, usually remains intact. Most early methods therefore focused on local artifacts, frequency-domain features, geometric anomalies, and regional consistency. As generation models grow stronger at local blending, lighting adaptation, and identity transfer, and as platform processing and re-sharing erase many subtle traces, the focus of LMV detection is gradually shifting to the robustness of detection methods across scenarios.
Audio-Visual Editing (AVE) under Cross-Modal Coupling Constraints
The AVE paradigm emerged mainly in 2024. In this type of AI-generated video, established correspondences within the video are altered, such as those between picture and sound, lip movements and speech, speaker identity, speaking rhythm, and subtitle content. Examples include voice-driven face synthesis, re-dubbing of the original video, lip-syncing, and speaker replacement. Detection therefore has to shift from looking for visual artifacts to checking whether the relationships among the modalities within the video actually hold, finding truly discriminative clues by considering sound, lip-sync, identity, and content together.
End-to-End Generative Video Synthesis (GVS)
In the GVS paradigm, which emerged in 2025, the model generates an entire video directly from conditioning inputs such as text, images, and noise, without relying on real footage as a base. This brings new challenges for detection.
Such videos usually look real in a single frame or over a short span, but often break down over long spatio-temporal sequences: for example, a character's actions or position in the scene fail to connect from one moment to the next, the shape and motion of objects change in ways that violate physical laws, or the events shown could not occur in the real world.
Correspondingly, detection for the GVS paradigm cannot stop at local and inter-modal consistency. It needs to move to a higher level: starting from long-term consistency, common sense, physical laws, narrative and causality, and proposition-level authenticity and traceability, it verifies whether the content is credible over the long spatio-temporal sequence and whether it could hold at every level in the real world.
Four-Layer Detection Method Spectrum from Visual-Language Dual Perspectives
Figure 3 | Vision-Language Dual-View four-layer framework: the first two layers lean toward the visual perspective, and the last two layers move toward the language perspective
Current work on AI-generated video detection has diverged along two modal perspectives, which correspond to two core scientific problems. The first starts from the visual modality and focuses on low-level signal forensics and the spatio-temporal consistency of the picture.
The second starts from the language modality. Its core concerns are, first, using the cross-modal language information in the video itself to judge whether "the video's modalities tell a well-aligned story"; and second, using the language modality to bring in reasoning about world knowledge and facts, to judge whether "the content of the video can withstand the test of knowledge, facts, and laws in the external real world".
The review captures this trend and proposes organizing the research methods and evaluation paradigms of AI-generated video detection from the visual-language dual perspectives. On this basis, it further proposes a four-layer picture of methods, running from low-level perception to high-level cognition:
Layer 1, Intrinsic Cues Analysis: The First Sieve
The research question for Layer 1 methods is whether the video obeys, at the level of low-level visual signals, the statistical regularities that a real video must satisfy, and whether it contains low-level traces introduced by AI generation or editing.
At the signal level, a real video exhibits characteristic statistics, and footage that has been genuinely filmed and processed naturally matches the acquisition, encoding, and post-processing pipeline. The AI generation process, however, often leaves traces that deviate from the distribution of real videos, such as a uniform style, model-specific watermarks and artifacts, and detectably rigid physiological signals. Layer 1 methods perform forensics from the visual perspective by modeling, extracting, and amplifying these low-level signals. This includes detecting:
Pixel-level and geometric anomalies, such as anomalous frequency-domain, texture, boundary, and noise patterns;
Facial physiological signals, such as pulse coupling, subtle muscle movements, and blinking rhythm;
Systematic offsets between real and forged videos in feature space.
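As a rough illustration of what "extracting and amplifying" a low-level signal can mean, the sketch below computes a simple high-pass noise residual for one grayscale frame: each interior pixel minus the mean of its four neighbours. This is a toy statistic chosen for clarity, not a method from the review; the function names are hypothetical, and a real detector would typically feed such residuals into a learned classifier rather than read off a single number.

```python
from typing import List

Frame = List[List[float]]  # grayscale frame as rows of pixel intensities


def highpass_residual(frame: Frame) -> Frame:
    """Residual of each interior pixel against the mean of its 4 neighbours."""
    h, w = len(frame), len(frame[0])
    if h < 3 or w < 3:
        raise ValueError("frame must be at least 3x3")
    residual = []
    for y in range(1, h - 1):
        row = []
        for x in range(1, w - 1):
            neighbours = (frame[y - 1][x] + frame[y + 1][x]
                          + frame[y][x - 1] + frame[y][x + 1]) / 4.0
            row.append(frame[y][x] - neighbours)
        residual.append(row)
    return residual


def residual_energy(frame: Frame) -> float:
    """Mean squared residual: 0 for smooth content, large for pixel-level noise."""
    values = [v for row in highpass_residual(frame) for v in row]
    return sum(v * v for v in values) / len(values)


# A smooth horizontal gradient has no high-frequency content ...
gradient = [[float(x) for x in range(5)] for _ in range(5)]
# ... while a checkerboard is the most extreme pixel-level pattern.
checker = [[255.0 * ((x + y) % 2) for x in range(5)] for y in range(5)]

print(residual_energy(gradient))
print(residual_energy(checker))
```

The residual suppresses scene content and keeps only fine-grained structure, which is where acquisition noise and generation traces tend to differ. Any decision threshold on this energy would have to be calibrated on data and is deliberately left out.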
Layer 2, Spatiotemporal Consistency: Checking "Whether a Video Flows Smoothly"
Layer 2 methods treat a video as a spatio-temporal sequence of frames. Their research question is whether the video's image flow has the spatio-temporal properties that object motion in a real video must satisfy. Genuinely filmed footage is constrained by a continuous camera trajectory and a real physical scene, so the subject and background in adjacent frames change in a continuous, predictable pattern consistent with physical feasibility and camera motion. AI-generated videos, by contrast, may show spatio-temporal discontinuities over long sequences, such as distorted objects or backgrounds and sudden blurring of local regions. This includes detecting:
Temporal and motion inconsistencies, such as local object deformation, background drift, sudden blurring, and abnormal motion residuals;
Human behavior and interaction dynamics, such as facial expression changes, identity dynamics, and the rhythm of interaction between the main characters on screen;
Physical and frequency anomalies related to temporal frequency and picture continuity.
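The idea that adjacent frames should change in a continuous, predictable way can be sketched with a very simple check: measure the mean absolute difference between consecutive frames and flag transitions that are far larger than the typical one. This is a minimal sketch under stated assumptions; the function names and the `ratio=3.0` cutoff are illustrative, and a legitimate scene cut in real footage would trigger it too, so it is a cue rather than a verdict.

```python
from statistics import median
from typing import List

Frame = List[List[float]]  # grayscale frame as rows of pixel intensities


def frame_diffs(frames: List[Frame]) -> List[float]:
    """Mean absolute pixel difference between each pair of consecutive frames."""
    diffs = []
    for prev, cur in zip(frames, frames[1:]):
        total = sum(abs(a - b)
                    for row_prev, row_cur in zip(prev, cur)
                    for a, b in zip(row_prev, row_cur))
        diffs.append(total / (len(cur) * len(cur[0])))
    return diffs


def abrupt_transitions(frames: List[Frame], ratio: float = 3.0) -> List[int]:
    """Indices of frames whose change from the previous frame exceeds
    `ratio` times the median change (ratio is an assumed, untuned cutoff)."""
    diffs = frame_diffs(frames)
    baseline = max(median(diffs), 1e-9)  # floor avoids a zero baseline on static video
    return [i + 1 for i, d in enumerate(diffs) if d > ratio * baseline]


def flat(value: float, size: int = 4) -> Frame:
    """Helper for the demo: a uniform frame of the given brightness."""
    return [[value] * size for _ in range(size)]


# Brightness drifts by 1 per frame, except for one sudden jump before frame 5.
frames = [flat(v) for v in [0, 1, 2, 3, 4, 54, 55, 56]]
print(frame_diffs(frames))
print(abrupt_transitions(frames))
```

Using the median as the baseline keeps a single large jump from inflating the reference level, which a mean would do.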
Layer 3, Cross-Modal Consistency: Multi-Modal Verification Inside the Video
Layer 3 is a critical turning point in the framework: detection moves into multi-modal verification inside the video. Its research question is whether the modalities in the video, such as picture, sound, and subtitles, "tell the same story in every respect".
In a real video, audio, text, and picture are usually tightly aligned. AI-generated videos, however, may show systematic mismatches between lip movements and voice, identity and voiceprint, or picture and text. Layer 3 methods analyze inter-modal consistency at fine granularity and from multiple angles. They fall into three types:
Detecting consistency between sound and picture;
Reasoning about text-video semantic consistency after introducing subtitles, titles, transcripts, and descriptive text;
Robust learning for temporally localizing inter-modal inconsistencies.
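A minimal way to picture the sound-picture check is to compare two per-frame series, a mouth-opening measurement and an audio loudness envelope, and find the time shift at which they correlate best. The sketch below does this with plain Pearson correlation over a range of lags. Both input series are assumed to have been extracted already (how is outside this sketch), and the names `pearson`, `best_lag`, and `max_lag` are hypothetical.

```python
from statistics import mean
from typing import List, Tuple


def pearson(a: List[float], b: List[float]) -> float:
    """Pearson correlation of two equal-length series (0.0 if either is constant)."""
    ma, mb = mean(a), mean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    if va == 0 or vb == 0:
        return 0.0
    return cov / (va * vb) ** 0.5


def best_lag(mouth: List[float], audio: List[float],
             max_lag: int = 5) -> Tuple[int, float]:
    """Return (lag, correlation) for the lag, in frames, at which the audio
    envelope best matches mouth opening. Positive lag: audio trails the mouth."""
    n = min(len(mouth), len(audio))
    mouth, audio = mouth[:n], audio[:n]
    best, best_r = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        if lag >= 0:
            a, b = mouth[: n - lag], audio[lag:]
        else:
            a, b = mouth[-lag:], audio[: n + lag]
        if len(a) < 2:
            continue  # too little overlap to correlate
        r = pearson(a, b)
        if r > best_r:
            best, best_r = lag, r
    return best, best_r


# Synthetic demo: the audio envelope is the mouth signal delayed by 3 frames.
mouth = [0.1, 0.9, 0.3, 0.0, 0.7, 0.2, 0.8, 0.1, 0.0, 0.5,
         0.9, 0.2, 0.4, 0.0, 0.6, 0.1, 0.3, 0.8, 0.0, 0.2]
audio = [0.0, 0.0, 0.0] + mouth[:-3]

lag, r = best_lag(mouth, audio)
print(lag, round(r, 3))
```

In this toy setting, a best lag far from zero, or a low peak correlation at every lag, would be the kind of mismatch between lip movement and voice that Layer 3 methods look for. What counts as "far" or "low" is a calibration question that this sketch does not answer.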
Layer 4, Language-Guided World-Level Reasoning: Focusing on the Gap between the Video and the Real World
In Layer 4, the detection perspective rises from "the internal consistency of the video" to "whether the video is consistent with, and does not contradict, the rules and knowledge of the external real world". The research question becomes whether the content of the video could actually exist and make sense in the real world, semantically and factually.
Everything in a real video should be consistent with the facts, physical rules, domain knowledge, and basic common sense of the real world. The content of AI-generated videos, however, often fails to align fully with the real world, and this is the detection space Layer 4 exploits. This includes:
Using prompts, text priors, text prototypes, or lightweight modules to recalibrate the model's representation space, so that the model can more easily match observed anomalies to more explicit semantic categories;
Treating detection as a verification process: building an investigator agent that can search for information, invoke tools, and revise its judgments, and tying each judgment to evidence, tool outputs, and verification steps;
Training the model, through fine-tuning, preference learning, reward modeling, and reinforcement learning, on "how to select evidence, how to organize explanations, and how to give conclusions", with a focus on detection outputs that are clear, structurally stable, and backed by a complete evidence chain.
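The "detection as verification" idea in the list above can be sketched as a small loop: an investigator runs a series of tools against a claim, records what each returned, and revises its judgment after every step. Everything here is hypothetical. The tool functions are stubs standing in for real search or knowledge-base queries, and the revision rule (one conflict is enough to doubt the claim) is an assumption made to keep the example short.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# A tool takes a claim and returns (conflict_found, detail).
Tool = Callable[[str], Tuple[bool, str]]


@dataclass
class Step:
    tool: str
    conflict: bool
    detail: str
    judgment_after: str   # the judgment as revised after this step


@dataclass
class Investigation:
    claim: str
    steps: List[Step] = field(default_factory=list)

    @property
    def judgment(self) -> str:
        return self.steps[-1].judgment_after if self.steps else "undetermined"


def investigate(claim: str, tools: Dict[str, Tool]) -> Investigation:
    """Run each tool in order, logging its output and the revised judgment."""
    inv = Investigation(claim)
    conflict_seen = False
    for name, tool in tools.items():
        conflict, detail = tool(claim)
        conflict_seen = conflict_seen or conflict
        # Toy revision rule (an assumption): one conflict is enough to doubt the claim.
        revised = "not credible" if conflict_seen else "credible so far"
        inv.steps.append(Step(name, conflict, detail, revised))
    return inv


def stub_date_lookup(claim: str) -> Tuple[bool, str]:
    # Stub: pretends no dated record contradicts the claim.
    return (False, "no dated record contradicts the claim")


def stub_landmark_lookup(claim: str) -> Tuple[bool, str]:
    # Stub: hard-coded check standing in for a real gazetteer query.
    return ("Eiffel Tower in London" in claim, "landmark location checked against a gazetteer")


result = investigate(
    "The Eiffel Tower in London was lit up last night",
    {"date_lookup": stub_date_lookup, "landmark_lookup": stub_landmark_lookup},
)
for s in result.steps:
    print(s.tool, "->", s.judgment_after, "|", s.detail)
```

Because each step stores the tool, its output, and the judgment at that point, the final conclusion comes with the sequence of checks that produced it, which is the property the evidence-chain requirement is after.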
Evolution Maps of the Generation Side and the Detection Side
Figure 4 | Evolution map of representative detection methods: the escalation of threats on the generation side and the improvement of the detection side advance in step
The figure above shows, along a timeline, how threats on the generation side keep raising the ceiling of realism that "fake videos" can reach. Meanwhile, as the backbone models that detection relies on have evolved from deep convolutional and recurrent networks to vision transformers, and then to large vision-language models with reasoning capabilities and agent systems, the detection side has evolved from visual forensics to multi-modal verification and high-level reasoning.
The review also tallies, year by year, how detection methods are distributed across the layers. The proportion of higher-layer methods was only 7.7% in 2020, rose to 40.0% in 2023, and exceeded half in 2025.
Overall, the focus of detection methods keeps shifting upward: early work concentrated on the first and second layers, and as generated videos become smoother and more realistic, detection is increasingly moving into the third and fourth layers.