
Generate without forgetting: Peking University's EgoLCD empowers the "ultra-long sequence" world model with long- and short-term memory.

新智元 · 2025-12-24 15:53
Peking University and others released EgoLCD, solving the problem of content drift in long videos.

[Introduction] Do video generation models always have a "poor memory"? Do objects deform and backgrounds show flaws after just a few seconds of generation? Peking University, Sun Yat-sen University, and other institutions have jointly released EgoLCD. Drawing on the human "long- and short-term memory" mechanism, it pioneers a sparse KV cache + LoRA dynamic adaptation architecture, solving the problem of "content drift" in long videos and refreshing the SOTA on the EgoVid-5M benchmark, so that AI can maintain a coherent first-person memory much like a human.

With the explosion of models such as Sora and Genie, video generation is moving from "animating images" towards the grand goal of a "world simulator".

However, on the road to "infinite-duration" video generation, there stands a major obstacle: "content drift".

Have you noticed that existing video generation models often have a "goldfish memory" when generating long videos: One second it's blue tiles, and the next second it turns into a white wall; the cup originally in hand gradually turns into a strange shape.

For first-person (egocentric) perspective scenarios with severe shaking and complex interactions, the models are even more likely to "get lost".

Generating long videos is not difficult. The hard part is to "stay true to the original intention".

Recently, a research team from Peking University, Sun Yat-sen University, Zhejiang University, the Chinese Academy of Sciences, and Tsinghua University proposed a brand-new long-context diffusion model, EgoLCD. It not only introduces a brain-like "long- and short-term memory" design but also proposes a new Structured Narrative Prompting scheme, successfully enabling AI to "remember" scene layouts and object features when generating long videos.

Paper link: https://arxiv.org/abs/2512.04515

Project homepage: https://aigeeksgroup.github.io/EgoLCD

In the EgoVid-5M benchmark test, EgoLCD comprehensively outperforms mainstream models such as OpenSora and SVD in terms of temporal consistency and generation quality, taking a crucial step towards building an embodied-intelligence world model!

Core pain point: Why does AI "lose its memory"?

In long-video generation, traditional autoregressive (AR) models are very prone to generative forgetting.

This is like asking a person to draw with their eyes closed: as they draw, they deviate from the original composition. For first-person videos (such as the Ego4D dataset), severe camera jitter and complex hand-object interactions make this "drift" even more fatal.

Traditional Transformers do have an attention mechanism, but on long sequences its computational cost grows quadratically, so they simply cannot keep that much history; a simple sliding window, on the other hand, discards early key information.

EgoLCD (Egocentric Video Generation with Long Context Diffusion) redefines long-video generation as an "efficient and stable memory management problem".

Long-Short Memory System

EgoLCD designs a dual-memory mechanism similar to the human brain:

  • Long-Term Memory (Long-Term Sparse KV Cache): Instead of simply caching all tokens, it uses a sparse attention mechanism to store and retrieve only the most critical "semantic anchors" (such as the layout of a room and the features of key objects). This not only significantly reduces GPU memory usage but also ensures global consistency.
  • Short-Term Memory (Attention + LoRA): It uses LoRA as an implicit memory unit to enhance the adaptability of short-window attention and quickly capture drastic changes in the current perspective (such as rapid hand movements).

In a nutshell: long-term memory is responsible for "stability", and short-term memory is responsible for "speed"; a minimal code sketch of how such a dual cache might be wired up follows.
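This is not the authors' released code but a hedged sketch assuming a PyTorch-style attention step; class names such as LoRALinear and SparseKVCache, and parameters such as keep_k and window, are illustrative assumptions based only on the description above.

```python
# Not the authors' code: a hedged sketch of "sparse KV cache + LoRA" attention in PyTorch.
# LoRALinear, SparseKVCache, keep_k and window are illustrative names/parameters.
import torch
import torch.nn as nn
import torch.nn.functional as F


class LoRALinear(nn.Module):
    """Frozen base projection plus a trainable low-rank update (the short-term adapter)."""
    def __init__(self, dim, rank=8):
        super().__init__()
        self.base = nn.Linear(dim, dim, bias=False)
        self.base.weight.requires_grad_(False)    # backbone weights stay frozen
        self.down = nn.Linear(dim, rank, bias=False)
        self.up = nn.Linear(rank, dim, bias=False)
        nn.init.zeros_(self.up.weight)             # LoRA branch starts as a no-op

    def forward(self, x):
        return self.base(x) + self.up(self.down(x))


class SparseKVCache:
    """Long-term memory: keep only the keep_k keys/values that received the most attention."""
    def __init__(self, keep_k=32):
        self.keep_k, self.k, self.v = keep_k, None, None

    def prune_and_store(self, k_all, v_all, saliency):
        # saliency[i] = average attention token i received; top tokens become "semantic anchors"
        top = saliency.topk(min(self.keep_k, saliency.numel())).indices
        self.k, self.v = k_all[top].detach(), v_all[top].detach()


dim, window = 64, 16
to_q, to_k, to_v = LoRALinear(dim), LoRALinear(dim), LoRALinear(dim)
cache = SparseKVCache(keep_k=32)


def generate_segment(segment_tokens, cache):
    """Attend over the current short window plus the sparse long-term cache."""
    q, k, v = to_q(segment_tokens), to_k(segment_tokens), to_v(segment_tokens)
    k_all = k if cache.k is None else torch.cat([cache.k, k], dim=0)
    v_all = v if cache.v is None else torch.cat([cache.v, v], dim=0)
    attn = F.softmax(q @ k_all.T / dim ** 0.5, dim=-1)   # (window, cached + window)
    out = attn @ v_all                                    # (window, dim)
    cache.prune_and_store(k_all, v_all, attn.mean(dim=0))
    return out


if __name__ == "__main__":
    torch.manual_seed(0)
    for _ in range(4):                         # four consecutive video segments
        tokens = torch.randn(window, dim)      # stand-in for a segment's latent tokens
        out = generate_segment(tokens, cache)
    print(out.shape, cache.k.shape)            # torch.Size([16, 64]) torch.Size([32, 64])
```

The design choice the sketch mirrors is that the cache never grows with video length: after each segment only the most-attended tokens survive as "semantic anchors", while the LoRA branches adapt the projections to the fast-changing short window.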

Memory Regulation Loss

To prevent the model from "slacking off" during training, the team designed a special loss function. It forces each frame generated by the model to be semantically aligned with the "historical segments" retrieved from the long-term memory bank.

This is like putting a "tightening spell" on the AI. Once the generated image starts to "fabricate" (drift), the loss will penalize it, forcing it to return to the original settings.
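The exact formulation of this loss lives in the paper; purely as intuition, such an alignment term could be sketched as below, assuming cosine similarity between generated-frame features and the features of segments retrieved from the long-term cache. The function name and the weight lambda_mem are made-up placeholders, not the paper's notation.

```python
# Hedged sketch only: the actual EgoLCD regularization term may be defined differently.
import torch.nn.functional as F

def memory_regulation_loss(gen_feats, retrieved_feats, lambda_mem=0.1):
    # gen_feats:       (T, D) features of the frames currently being generated
    # retrieved_feats: (M, D) features of historical segments pulled from long-term memory
    sim = F.cosine_similarity(gen_feats.unsqueeze(1), retrieved_feats.unsqueeze(0), dim=-1)
    # every generated frame should stay close to at least one retrieved memory segment
    misalignment = 1.0 - sim.max(dim=1).values.mean()
    return lambda_mem * misalignment

# schematically: total_loss = diffusion_loss + memory_regulation_loss(gen_feats, retrieved_feats)
```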

Structured Narrative Prompting (SNP)

EgoLCD abandons simple text prompts and adopts a segmented, temporally ordered structured script.

During training: it uses GPT-4o to generate extremely detailed frame-level descriptions to train the model to strictly match visual details with text.

During inference: SNP acts as an "external explicit memory" and guides the generation of the current segment by retrieving the prompts of previous segments, ensuring the coherence of the story line and visual style.
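The paper does not publish the exact SNP schema, so the snippet below is only an assumed illustration of what a segmented, time-ordered script and its retrieval-based prompt assembly might look like; every field name and the build_prompt helper are hypothetical.

```python
# Hypothetical SNP-style script; field names are assumptions, not the paper's schema.
snp_script = [
    {"segment": 1, "time": "0-4s",
     "scene": "kitchen with a blue tiled wall and wooden counter",
     "action": "right hand picks up a white ceramic cup from the counter"},
    {"segment": 2, "time": "4-8s",
     "scene": "same kitchen, same blue tiles and counter",
     "action": "the cup is moved under the running faucet and rinsed"},
    {"segment": 3, "time": "8-12s",
     "scene": "same kitchen, camera pans slightly left toward the drying rack",
     "action": "the rinsed cup is placed on the drying rack"},
]

def build_prompt(script, i, history=2):
    """Condition segment i on the prompts of earlier segments (external explicit memory)."""
    past = script[max(0, i - history):i]
    context = "; ".join(f"{s['scene']}, {s['action']}" for s in past)
    current = f"{script[i]['scene']}, {script[i]['action']}"
    return f"Previously: {context}. Now ({script[i]['time']}): {current}"

print(build_prompt(snp_script, 2))
```

Because each segment's prompt restates the persistent scene elements and pulls in the previous segments, the script itself acts as a second, human-readable memory alongside the KV cache.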

Explosive performance

To fairly evaluate this "non-forgetting" ability, the research team even developed a new set of metrics, NRDP (Normalized Referenced Drifting Penalty), designed specifically to penalize models that "start well but end poorly", i.e., whose quality deteriorates over time.
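The precise definition of NRDP is given in the paper; purely to convey the intuition, a reference-based drift penalty could be sketched as the toy function below, where each generated segment is compared against a reference feature taken from the start of the clip and only the degradation over time is counted. The feature choice and the function itself are assumptions, not the official metric.

```python
# Toy illustration only; this is not the paper's actual NRDP formula.
import torch.nn.functional as F

def drift_penalty(segment_feats, reference_feat):
    # segment_feats:  (S, D) one feature vector per generated segment, in temporal order
    # reference_feat: (D,)   feature of the reference subject/background at the start
    sims = F.cosine_similarity(segment_feats, reference_feat.unsqueeze(0), dim=-1)  # (S,)
    drop = (sims[0] - sims).clamp(min=0)   # how far each later segment has drifted
    return drop.mean().item()              # 0 = no drift; larger = stronger drift over time
```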

The experimental results show:

Overwhelming consistency: EgoLCD has an overwhelming advantage in NRDP-Subject (subject consistency) and NRDP-Background (background consistency), with an extremely low drift rate.

Surpassing the baselines: Compared with top-tier models such as SVD, DynamiCrafter, and OpenSora, EgoLCD has the best performance in CD-FVD (temporal coherence) and action-consistency metrics on the EgoVid-5M benchmark.

Ultra-long generation: It demonstrates the generation of a coherent 60-second video (such as a speaker speaking from dusk to late at night), with the character's clothing and background building details remaining consistent and without deformation!

On the way to the "Matrix" of embodied intelligence

EgoLCD is not just a video generation model; it is more like a "first-person world simulator".

By generating long-term, highly consistent first-person videos, EgoLCD can provide massive amounts of training data for embodied intelligence (robots), simulating complex physical interactions and long-sequence tasks (such as cooking and repairing).

Just as Sora gave people a glimpse of the prototype of a world model, EgoLCD makes the dream of "teaching robots to understand the world through videos" clearer than ever before.

Reference: https://arxiv.org/abs/2512.04515

This article is from the WeChat official account "新智元" (New Zhiyuan). Editor: LRST. Republished by 36Kr with permission.