Overcoming the Memory Challenge in Long-Form Video Generation: The University of Hong Kong and Kuaishou's Kling Team Design MemFlow, a Dynamic Adaptive Long-Term Memory That Says Goodbye to Rapid Forgetting and Plot Confusion
Have you ever been troubled by the incoherence of AI-generated videos?
In interactive creation, simply changing a single prompt can make the story "collapse" instantly. A character may reappear after briefly leaving the screen looking completely different, as if a new actor had taken over. Or, when you introduce a new character, the AI may repeatedly "summon" this newcomer in subsequent scenes, even mixing up the features of multiple characters. This "goldfish memory" problem is a major flaw in long-form video narration.
Now, researchers from the University of Hong Kong and the Kling team at Kuaishou have jointly launched a breakthrough solution: MemFlow.
This innovative streaming adaptive memory mechanism endows AI with robust long-term memory and narrative coherence, promising to put the problems above to rest.
Fluid Narrative vs. Rigid Memory
To generate long videos, mainstream models generally adopt a "block-by-block" strategy: video segments are generated one after another, much like advancing through a slideshow.
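To make the pattern concrete, here is a minimal Python sketch of the block-by-block loop. It is purely illustrative: `generate_chunk` is a hypothetical interface, not an API from MemFlow or any particular model.

```python
import torch

def generate_long_video(model, prompts, chunk_frames=16):
    """Illustrative block-by-block generation loop (hypothetical API).

    Each chunk is conditioned on its own prompt and on whatever
    context is carried over from the chunks generated before it.
    """
    chunks = []
    context = None  # conditioning carried over between segments
    for prompt in prompts:
        chunk = model.generate_chunk(
            prompt=prompt,
            context=context,        # what the model "remembers"
            num_frames=chunk_frames,
        )
        chunks.append(chunk)
        context = chunk             # naive memory: only the latest chunk
    return torch.cat(chunks, dim=0)  # (total_frames, C, H, W)
```

The weak point is the last line of the loop: if `context` carries only the latest chunk (or only the first one), anything that left the frame earlier is simply gone, which is exactly the failure mode the strategies below try, with limited success, to patch.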
However, making each newly generated segment accurately "remember" what came before remains a major technical challenge. Previous solutions fall roughly into several categories, all with obvious limitations:
1. The "Remember Only the Beginning" Strategy: Some models only retain the first video segment as memory, and all subsequent generations refer to it. This method works well in a single scene. However, once the story develops and new characters need to be introduced or a new scene needs to be switched to, the model will get lost because the "memory" does not contain this new information, resulting in incoherence in both visual and semantic aspects between the subsequent generation and the previous content.
2. The "One-Size-Fits-All" Compression Strategy: Other methods attempt to compress all historical frames into a fixed-size "memory package". The problem is that different narrative needs require different key points to be recalled. The "one-size-fits-all" compression often leads to the loss of key details, resulting in the forgetting of subject features and the drift of visual quality.
3. The "Independent Processes" Strategy: Some processes attempt to split the task. First, one model creates a key-frame script, and then another model generates the video based on the script. This method is independent when generating according to each segment of the script, and the spliced complete video lacks global consistency.
These rigid, non-adaptive memory strategies cannot keep up with the fluid, unpredictable narrative demands of interactive creation, which is exactly why interactive long-video generation suffers from poor consistency.
Generating True Long-Term Memory and Narrative Coherence
MemFlow breaks away from the traditional reliance on rigid, fixed memory and establishes a dynamic memory system that uses semantics as the bridge. Its advantages show in two main aspects:
1. Long-Term Memory: Maintaining Visual Consistency in Complex Scenarios
MemFlow maintains long-term memory of each subject's appearance. Even in complex situations such as scene changes, camera transitions, and characters entering or temporarily leaving the frame in a long video, it remembers the core visual features of every subject.
2. Narrative Coherence: Ensuring the Clear Development of Multi-Subject Storylines
Thinking like a director, MemFlow understands the plot from a global perspective. In a narrative involving multiple subjects, it will not mistakenly reintroduce an existing character or commit the "face blindness" error of confusing subjects. When a user introduces a new subject and then elaborates on it, MemFlow accurately understands and continues the narrative, letting the story progress smoothly.
Adaptive and Efficient Dynamic Memory
MemFlow's powerful capabilities stem from two core designs:
Narrative Adaptive Memory (NAM): Before generating a new segment, NAM intelligently retrieves the most relevant visual memory from the memory bank based on the current prompt (see the sketch after these two designs). Whether continuing an old character or depicting a new interaction, the model can find an accurate visual reference and maintain consistency. Within a limited memory budget, it prioritizes the information most relevant to the current narrative, striking a balance between consistency and computational overhead.
Sparse Memory Activation (SMA): To keep computation efficient, SMA acts like a spotlight, activating only the most critical entries in memory for the attention computation. This avoids confusion from information overload and greatly speeds up generation, delivering efficiency without sacrificing narrative quality.
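The paper's exact formulation is not reproduced here, but the two designs map naturally onto familiar operations: semantic retrieval over a memory bank for NAM, and top-k masked attention for SMA. The sketch below is one illustrative interpretation under those assumptions; every function name and tensor shape is hypothetical.

```python
import torch
import torch.nn.functional as F

def retrieve_memory(prompt_emb, memory_keys, memory_values, top_m=4):
    """NAM-style retrieval (illustrative): rank stored memory entries by
    semantic similarity to the current prompt and keep the top-m."""
    # prompt_emb: (d,)  memory_keys: (N, d)  memory_values: (N, T, d)
    sims = F.cosine_similarity(memory_keys, prompt_emb.unsqueeze(0), dim=-1)
    idx = sims.topk(min(top_m, sims.numel())).indices
    return memory_values[idx]  # entries most relevant to the new prompt

def sparse_memory_attention(query, memory_tokens, top_k=64):
    """SMA-style attention (illustrative): each query attends only to its
    k highest-scoring memory tokens; all others are masked out."""
    # query: (L, d)  memory_tokens: (S, d), flattened retrieved memory
    d = memory_tokens.shape[-1]
    scores = query @ memory_tokens.T / d ** 0.5           # (L, S)
    k = min(top_k, memory_tokens.shape[0])
    cutoff = scores.topk(k, dim=-1).values[..., -1:]      # k-th best per query
    scores = scores.masked_fill(scores < cutoff, float("-inf"))
    return F.softmax(scores, dim=-1) @ memory_tokens      # (L, d)
```

Read together, the two steps explain the efficiency claim: retrieval first narrows the memory bank to a few narratively relevant entries, and sparse attention then touches only a fraction of those tokens, keeping the per-segment compute budget nearly flat as the video grows.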
Comprehensive Verification from Quantitative Data to Qualitative Comparison
To evaluate MemFlow's actual effect, the research team conducted a series of detailed qualitative and quantitative experiments. The results clearly demonstrate the model's strengths in long-video generation.
Quantitative Analysis: Significant Improvement in Key Indicators
In the challenging task of 60-second, multi-prompt long-video generation, MemFlow's numbers are particularly striking:
Excellent Performance in Comprehensive Quality and Aesthetics Scores:
Under the VBench-Long evaluation system, MemFlow achieved the highest scores among all comparison models in both total quality (85.02) and the aesthetics sub-score (61.07), indicating strong visual quality and aesthetic presentation in its generated videos.
Verification of Long-Range Semantic Consistency:
Evaluating the CLIP score of video-text alignment segment by segment reveals a key phenomenon: in the second half of the video (e.g., 40-60 seconds), many models' semantic consistency declines sharply due to error accumulation, while MemFlow's score remains consistently high. This reflects how effectively its dynamic memory mechanism maintains long-term narrative consistency and alleviates the problem of videos "getting more chaotic as they progress".
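For a sense of how such a segment-wise check can be run, here is a hedged sketch using the open-source CLIP model from Hugging Face transformers. It assumes each generated segment is available as a list of PIL frames and is not the paper's official evaluation code.

```python
import torch
from transformers import CLIPModel, CLIPProcessor

def per_segment_clip_scores(segments, prompts, device="cpu"):
    """Average CLIP image-text similarity per segment (illustrative).

    segments: list of lists of PIL.Image frames, one list per segment
    prompts:  list of text prompts, aligned 1:1 with the segments
    """
    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").to(device)
    proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
    scores = []
    for frames, prompt in zip(segments, prompts):
        inputs = proc(text=[prompt], images=frames,
                      return_tensors="pt", padding=True).to(device)
        with torch.no_grad():
            out = model(**inputs)
        img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
        txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
        scores.append((img @ txt.T).mean().item())  # mean over the frames
    return scores  # a late-video drop here signals semantic drift
```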
Excellent Consistency Performance:
On the consistency score, which measures the core capability at stake, MemFlow achieved 96.60, leading all comparison models. This directly shows that whether for characters, backgrounds, or objects, MemFlow maintains strong visual unity through complex narrative changes.
In ablation experiments on different memory mechanisms, MemFlow's "Narrative Adaptive Memory + Sparse Memory Activation (NAM + SMA)" strategy improved both subject consistency and background consistency over the "no memory" and "remember only the first segment (Frame Sink)" schemes, while also running more efficiently than a full memory bank.
Qualitative Analysis: Visual Comparison Intuitively Shows the Model's Advantages
Beyond the numbers, visual comparisons demonstrate the model's actual capabilities even more directly:
Avoiding Narrative Confusion: In a multi-shot scenario introducing "a lady wearing a casual sweater", other models generated the character with an inconsistent appearance or reintroduced the subject after the prompt changed. MemFlow, in contrast, kept the same lady's appearance consistent across multiple shots with no obvious drift.
Accurate Character Tracking and Reproduction: The comparison figures show MemFlow's stability in handling character interactions. Whether it is children playing with a puppy on the beach or a family decorating a Christmas tree, MemFlow keeps the story's core characters consistent across video segments. The baseline LongLive, by contrast, introduced redundant or inconsistent new characters after the prompt changed, breaking narrative coherence, while other models suffered more severe quality drift and subject forgetting.
Demonstrating the Necessity of Dynamic Memory: In the visual comparison of memory mechanisms, the "no memory" variant showed obvious scene inconsistency when the prompt changed, and the "remember only the first segment" scheme could not preserve the features of newly introduced characters. Only MemFlow continued the plot smoothly while keeping subjects consistent, intuitively demonstrating the effectiveness and necessity of its dynamic memory mechanism.
Efficiency Evaluation
The experiments show that on the same multi-prompt long-video generation task, traditional models are prone to subject drift and character confusion, while MemFlow maintains better narrative coherence and visual consistency.
More importantly, MemFlow achieved a real-time inference speed of 18.7 FPS on a single NVIDIA H100, with minimal overhead compared with the memory-free baseline, and it reached SOTA level on multiple key indicators including consistency, aesthetics score, and text alignment.
Opening a New Era of Long Video Narrative
MemFlow, jointly developed by the University of Hong Kong and the Kling team at Kuaishou, has pushed AI video generation technology from "fragment splicing" to a new height of "storytelling" through its unique dynamic memory mechanism.
It marks AI's evolution from an artist that can only create "concept videos" into a "narrative director" capable of handling complex plots and maintaining character consistency.
An era of AI video creation that can truly understand, remember, and tell stories coherently is coming.
arXiv: https://arxiv.org/pdf/2512.14699
Project Page: https://sihuiji.github.io/MemFlow.github.io/
Github: https://github.com/KlingTeam/MemFlow
This article is from the WeChat official account "QbitAI". Author: MemFlow team. Republished by 36Kr with permission.