
Why does video generation drift as it runs longer? Counterintuitively, because the previous frames are "too clean." New research finds that a shared noise level is the key to stable long videos.

QbitAI 2026-03-17 15:30
3-hour degradation-free generation, subverting the autoregressive video paradigm

The problem of autoregressive video generation deteriorating as it progresses has a solution!

As video generation evolves toward long sequences, autoregressive (AR) diffusion models have drawn broad attention from academia and industry for their ability to support streaming output. Yet when existing AR generation paradigms push toward "infinite length," they hit a core pain point: error accumulation caused by the mismatch between training and inference. Temporal drift and frame degradation grow more severe the longer the generated video runs.

To alleviate this degradation, existing methods have tried various patches, such as simulating prediction errors, introducing a first-frame sink, or using a self-rollout mechanism. Each has its own limitations; a frame sink, for example, often severely restricts scene changes in the video.

Where exactly does the root cause of drift in long sequences lie?

The original intent of autoregressive generation, and the trap of clean context

Autoregressive video generation was conceived to generate video segment by segment, like a language model, and thereby break through the GPU-memory limit of generating everything in one pass. The mainstream approach is to wait for the previous video block to be fully denoised into "clean" frames, then use it as the condition for the next block.

A research team from the University of Science and Technology of China, the Chinese University of Hong Kong, Tongji University, Tencent Hunyuan, and the Anhui Province Key Laboratory of Digital Security traced the problem back and found that this "overly clean" context is precisely the culprit behind temporal drift. During actual inference, the generation of the previous block inevitably carries small prediction errors. When the model receives a context with no (or very little) noise, it treats these flaws with full confidence as absolutely correct ground-truth conditions. As the number of autoregressive steps grows, the error is transmitted and amplified exponentially, ultimately causing severe drift.
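A toy recurrence (purely illustrative, not the paper's model) shows the mechanism: if each step inherits the previous blocks' error at full confidence and amplifies it, the total error grows exponentially; if the inherited error is damped instead, it stays bounded.

```python
def accumulated_error(step_error: float, gain: float, num_blocks: int) -> float:
    """Total error after num_blocks AR steps.

    Each step inherits the previous error, scales it by `gain`
    (treating flawed context as ground truth means gain > 1),
    and adds its own fresh prediction error.
    """
    error = 0.0
    for _ in range(num_blocks):
        error = gain * error + step_error
    return error

# gain > 1: error explodes exponentially with sequence length
drift_short = accumulated_error(0.01, 1.2, 10)
drift_long = accumulated_error(0.01, 1.2, 100)

# gain < 1 (error damped, e.g. by noisier context): error stays bounded
bounded = accumulated_error(0.01, 0.9, 100)
```

The specific numbers here are invented; only the qualitative behavior (exponential blow-up versus a bounded fixed point) is the point.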

HiAR: Completely denoised context is not necessary

To explain why drift persists and to solve it efficiently, the teams from the University of Science and Technology of China, MMLab, Tongji University, and Hunyuan jointly launched HiAR.

Is it necessary to completely denoise the previous frames?

The team first re-examined bidirectional diffusion models. In bidirectional generation, all video frames share the same noise level and are denoised simultaneously; no frame needs to be denoised ahead of the others, yet global continuity and consistency are still maintained. In essence, this is because diffusion models tend to generate in a coarse-to-fine manner, and a coarse context is sufficient for the coarse denoising stage. The same rule transfers to causal AR diffusion. Based on this, the team re-planned the inter-frame dependencies and proposed a hierarchical denoising framework.

HiAR no longer waits serially for the previous video block to finish. Instead, at each denoising step it performs causal generation over all video blocks, so the context and the block being generated always share the same noise level. This simple restructuring not only sharply reduces error propagation between blocks but also brings an unexpected bonus: it naturally supports pipeline-parallel inference.
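The two scheduling strategies can be sketched as follows. Here `denoise_fn`, the block representation, and the loop structure are illustrative stand-ins, not the paper's actual implementation; the sketch only shows how the loop order changes which noise level the context is at.

```python
def serial_ar_denoise(blocks, denoise_fn, num_steps):
    """Classic AR: fully denoise block i before block i+1 ever sees it.

    The context handed to each new block is completely "clean",
    so its small errors are consumed as absolute ground truth.
    """
    clean = []
    for block in blocks:
        for step in range(num_steps):
            block = denoise_fn(block, context=clean, step=step)
        clean.append(block)
    return clean

def hierarchical_denoise(blocks, denoise_fn, num_steps):
    """HiAR-style: each denoising step sweeps causally over ALL blocks.

    The context blocks[:i] sit at the SAME noise level as blocks[i],
    so no block conditions on an "overly clean" context.
    """
    blocks = list(blocks)
    for step in range(num_steps):          # outer loop: noise level
        for i in range(len(blocks)):       # inner loop: causal sweep
            blocks[i] = denoise_fn(blocks[i], context=blocks[:i], step=step)
    return blocks
```

Note the inverted loop nesting: the serial version iterates blocks outermost, while the hierarchical version iterates denoising steps outermost, which is what allows chunks at different stages to be overlapped in a pipeline.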

How does HiAR avoid "rigid movements"?

When distilling autoregressive video models, researchers often hit a tricky problem: to lower the loss cheaply, the model takes shortcuts and tends to generate nearly static videos with low motion amplitude.

To solve this, the team introduced a forward-KL regularizer during training. An interesting discovery underpins it: the causal model being distilled still retains fairly good bidirectional-attention ability. Exploiting this, the team computes the forward-KL regularization loss in bidirectional-attention mode, which effectively constrains the model to preserve the dynamic diversity and reasonable motion amplitude of the original video without interfering with the original distillation loss.

Experiments show that this design lets HiAR retain the teacher model's high dynamic expressiveness while keeping frames stable.
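A toy categorical example (the distributions are invented for illustration; the actual loss operates on video-model outputs, not hand-written vectors) shows why forward KL is a natural choice against the static-video shortcut: forward KL is mode-covering, so a student that collapses onto one mode is penalized heavily wherever the teacher keeps mass.

```python
import math

def forward_kl(p, q):
    """D_KL(p || q) = sum_i p_i * log(p_i / q_i).

    Forward KL is mode-covering: q pays a heavy price wherever
    p has mass that q nearly ignores.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

teacher = [0.5, 0.5]            # teacher keeps both a static and a dynamic mode
static_student = [0.99, 0.01]   # student collapsed onto the static shortcut
diverse_student = [0.5, 0.5]    # student that still covers both modes

# The collapsed student is punished far more, steering training
# away from the low-motion shortcut.
print(forward_kl(teacher, static_student) > forward_kl(teacher, diverse_student))  # prints True
```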

What is the effect of HiAR? Minute-level generation without degradation

The research team comprehensively evaluated HiAR on the authoritative VBench long-video (20 s) benchmark. The results show clear advantages over current autoregressive models. In long-term stability in particular, HiAR's Drift Score drops to the lowest value (0.257), significantly reducing temporal drift compared with baseline methods while maintaining very high image quality and semantic stability over long sequences. HiAR also achieves the best results on core visual metrics such as Quality.

What's even more exciting is that HiAR truly realizes minute - level video generation without degradation.

In the team's tests, HiAR, trained only on 5-second clips, successfully generated a 3-hour high-quality continuous video.

The team also frankly notes that because the current version introduces no external memory module and distills only the small Wan1.3B model, semantic continuity and instruction following over extremely long sequences are affected to some extent; the image-quality degradation (drift), however, is greatly improved. The code is open source, and everyone is welcome to try it.

Is HiAR fast in inference?

Beyond the leap in generation quality, HiAR also has clear engineering advantages. Because the hierarchical denoising architecture breaks the traditional AR model's block-by-block serial mode, the team unlocked pipeline-parallel inference under a 4-step denoising setting. Experimental data show that without sacrificing any video quality, HiAR achieves roughly 1.8x inference acceleration, with 30 fps throughput and single-chunk latency as low as 0.30 s. This paves the way for real-time streaming generation of high-quality long videos.
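The engineering win can be sketched with idealized pipeline arithmetic. The per-step time below is an assumption chosen only so that one chunk's 4 steps sum to the reported 0.30 s; the functions are a textbook pipeline model, not the team's scheduler.

```python
def serial_latency(num_chunks, steps_per_chunk, step_time):
    """Block-by-block AR: every step of every chunk runs back to back."""
    return num_chunks * steps_per_chunk * step_time

def pipelined_latency(num_chunks, steps_per_chunk, step_time):
    """Ideal pipeline over the denoising steps: once the first chunk has
    filled the pipeline, one finished chunk emerges per step slot."""
    fill = steps_per_chunk * step_time          # first chunk fills the pipe
    return fill + (num_chunks - 1) * step_time  # then one chunk per slot

# Assumed 0.075 s per denoising step -> 4 steps = 0.30 s per chunk.
n, s, t = 100, 4, 0.075
speedup = serial_latency(n, s, t) / pipelined_latency(n, s, t)
```

With 4 steps the ideal steady-state speedup approaches 4x; synchronization and memory overheads in a real system plausibly account for the gap down to the measured ~1.8x.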

What is the correct path for long - video generation?

Current approaches to the train-inference mismatch (simulating prediction errors, a first-frame sink, or a self-rollout mechanism) each have their own problems.

HiAR offers a new idea for autoregressive long-video generation, proving that simply sharing the noise level can effectively break the curse of error accumulation. The method is independent of frame-sink and context-compression approaches and has great development potential.

Paper title: HiAR: Efficient Autoregressive Long Video Generation via Hierarchical Denoising

Paper link: https://arxiv.org/abs/2603.08703

Code: https://github.com/Jacky-hate/HiAR

Webpage: https://jacky-hate.github.io/HiAR/

This article is from the WeChat official account "QbitAI". Author: HiAR team. Republished by 36Kr with permission.