
Long video generation can now look back: Oxford proposes a memory that enhances consistency and stability while speeding up generation by about 12×.

新智元 · 2025-09-05 16:39
Oxford's VMem improves video consistency with 3D geometric indexing and accelerates generation by about 12×.

[Introduction] VMem replaces the short-window context of "only looking at the last few frames" with a memory index based on 3D geometry: the retrieved reference views are exactly those that have already seen the surfaces about to be rendered. This lets the model maintain long-term consistency even with a small context; measured speed is 4.2 s/frame, roughly 12 times faster than a pipeline with a conventional 21-frame context.

Exploring a house from a single image, wandering back and forth between rooms, returning to where you started, and still expecting the kitchen to look the same as before: this is not easy for video generation models.

A team from the University of Oxford proposes VMem (Surfel-Indexed View Memory): record "what has been seen" in geometric patches called surfels, and at generation time pull in only the truly relevant past views as context. The result is stronger consistency at lower cost and higher speed.

Paper link: https://arxiv.org/abs/2506.18903

· Geometry as a "memory catalog"

Index the previously generated views by 3D surface elements (surfels); each surfel records "in which frames I have been seen".

When a new view is requested, render the surfels from that view, count which past frames appear most often, and take those frames directly as references. Explicit occlusion modeling makes the retrieval more reliable.

· Small context, large consistency

On benchmarks such as RealEstate10K and Tanks and Temples, especially in the cycle-trajectory evaluation proposed by the team, VMem is significantly more stable when revisiting the same position in a long sequence.

· Plug-and-play

The memory module can be attached to image-set generation backbones such as SEVA; reducing the context from K = 17 to K = 4 still maintains the metrics while cutting latency to 4.2 s/frame (RTX 4090).

Why is it so difficult to "look back"?

Two mainstream approaches have their own pain points:

  • Reconstruction + out-painting: estimate the geometry first, then fill in the image. Errors accumulate, and the result drifts further and further from the true scene;
  • Multi-view/video conditional generation: it avoids explicit geometry, but requires many reference frames, which drives up compute and forces a short context window, so the model forgets quickly over long trajectories.

VMem rethinks the second approach: instead of looking at the most recent frames, look at the most relevant ones, where relevance is measured by geometric visibility.

Write: each newly generated frame is fed to a point-cloud predictor such as CUT3R to obtain a sparse point cloud → convert it into surfels (position, normal, radius) → record on each surfel the set of frame indices that have seen it; merge similar surfels; store everything in an octree for fast retrieval.
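To make the write step concrete, here is a minimal Python sketch (not the authors' code): the `Surfel` and `SurfelMemory` classes and the merge threshold are illustrative assumptions, and a linear scan stands in for the octree the paper uses.

```python
from dataclasses import dataclass, field
import numpy as np

@dataclass
class Surfel:
    position: np.ndarray                         # 3D center of the surface patch
    normal: np.ndarray                           # unit surface normal
    radius: float                                # patch extent
    seen_in: set = field(default_factory=set)    # indices of frames that observed this patch

class SurfelMemory:
    def __init__(self, merge_threshold: float = 0.05):
        self.surfels: list[Surfel] = []
        self.merge_threshold = merge_threshold   # distance below which surfels are merged

    def write(self, points, normals, radii, frame_id: int):
        """Insert surfels from the sparse point cloud of one newly generated frame."""
        for p, n, r in zip(points, normals, radii):
            nearby = self._find_nearby(p)
            if nearby is not None:               # merge: the existing patch gains one more observer
                nearby.seen_in.add(frame_id)
            else:                                # otherwise create a new patch
                self.surfels.append(Surfel(np.asarray(p), np.asarray(n), float(r), {frame_id}))

    def _find_nearby(self, p):
        # Linear scan for brevity; the paper keeps surfels in an octree for fast lookup.
        for s in self.surfels:
            if np.linalg.norm(s.position - np.asarray(p)) < self.merge_threshold:
                return s
        return None
```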

Read: given the set of camera poses to be generated, first compute an average camera, render the surfel attribute map from that viewpoint, tally the frame indices voted by each pixel, and pick the Top-K most frequent frames as the reference set; apply NMS over references with similar poses to remove redundancy.
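A correspondingly simplified sketch of the read step, under the same assumptions (the pinhole projection, the `frame_camera_centers` lookup, and the pose-NMS threshold are placeholders; a real renderer would splat surfel radii and resolve occlusion with a depth buffer):

```python
from collections import Counter
import numpy as np

def retrieve_references(memory, K, pose, image_size, frame_camera_centers,
                        top_k=4, min_pose_dist=0.2):
    """Vote for reference frames by projecting surfels into the (average) target camera.

    memory               : SurfelMemory filled by the write step
    K                    : 3x3 camera intrinsics
    pose                 : 4x4 world-to-camera extrinsics
    frame_camera_centers : dict mapping frame id -> 3D camera center (assumed available)
    """
    H, W = image_size
    votes = Counter()
    for s in memory.surfels:
        p_cam = pose[:3, :3] @ s.position + pose[:3, 3]
        if p_cam[2] <= 0:                        # behind the camera
            continue
        u, v, _ = K @ (p_cam / p_cam[2])
        if 0 <= u < W and 0 <= v < H:            # surfel lands inside the image
            votes.update(s.seen_in)              # every frame that saw it gets a vote

    # Top-K by vote count, with a simple pose NMS to drop near-duplicate viewpoints.
    selected, centers = [], []
    for fid, _ in votes.most_common():
        c = frame_camera_centers[fid]
        if all(np.linalg.norm(c - c0) > min_pose_dist for c0 in centers):
            selected.append(fid)
            centers.append(c)
        if len(selected) == top_k:
            break
    return selected
```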

Generate: feed the Top-K reference images plus the Plücker representations of the reference and target cameras to the image-set generator (SEVA by default in the paper), and generate M frames at a time in an autoregressive manner.
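The outer autoregressive loop then looks roughly like this; it is a sketch only, where `generator`, `plucker_embedding`, `predict_point_cloud`, `average_pose`, and `camera_center` are assumed helpers rather than SEVA's real API, and the Plücker rays of the reference cameras are omitted for brevity.

```python
def generate_trajectory(first_image, first_camera, cameras, memory, generator,
                        plucker_embedding, predict_point_cloud, average_pose,
                        camera_center, image_size, M=4, top_k=4):
    """cameras: list of (K, pose) tuples for the target trajectory."""
    frames = {0: first_image}
    centers = {0: camera_center(first_camera)}
    pts, nrm, rad = predict_point_cloud(first_image, first_camera)    # seed the memory
    memory.write(pts, nrm, rad, frame_id=0)

    for start in range(1, len(cameras), M):
        chunk = cameras[start:start + M]
        K_avg, pose_avg = average_pose(chunk)                         # one query camera per chunk
        ref_ids = retrieve_references(memory, K_avg, pose_avg, image_size, centers, top_k)
        ref_imgs = [frames[i] for i in ref_ids]
        rays = [plucker_embedding(cam) for cam in chunk]              # Plücker rays of the targets
        new_imgs = generator(ref_imgs, rays)                          # image-set generation of M frames
        for offset, img in enumerate(new_imgs):
            fid = start + offset
            frames[fid] = img
            centers[fid] = camera_center(chunk[offset])
            pts, nrm, rad = predict_point_cloud(img, chunk[offset])   # write back into the memory
            memory.write(pts, nrm, rad, frame_id=fid)
    return frames
```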

Intuitively, a surfel is a "sticker" on a seen surface with "who has seen me" written on it; when a new camera arrives, project the stickers into the new view and call in the frames whose names appear most often to help.

A pluggable memory layer for world models

Why do world models need such memory?

World models usually rely on implicit latent states (latent state / RNN / Transformer cache) to retain information across time. However, in long-horizon, partially observable (POMDP) scenarios, implicit states are prone to "forget" early details and are not interpretable.

VMem provides explicit, queryable, and geometrically aligned external memory: using surfels as "memory indices" to store visibility clues such as "who has seen me" in a structured way. This brings three direct benefits:

  • Long-term consistency: The memory capacity is decoupled from the number of steps; it can stably revisit the same location and appearance across hundreds of steps.
  • Interpretability and prunability: Retrieval is based on visibility voting, resulting in fewer occlusions/mismatches; the memory can be pruned by region/density/heat.
  • Efficient evidence collection: Changing from "looking at many irrelevant historical frames" to "only looking at a small number of key frames relevant to the current surface", significantly reducing the context and computing power.

How to integrate it into existing world models? (Three common usages)

External Memory: Use VMem as a Key-Value storage, where Key = surfel (position/normal/radius, etc.), and Value = frames and features where the surfel has appeared. Before each step of prediction, the model renders the surfel visibility map through the camera pose, retrieves the Top-K reference views and features, and fuses them into the current state update.
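A minimal sketch of that usage, assuming a latent world model with an `encoder` and a `dynamics` network; all class and method names here are hypothetical, and `retrieve_references` is the voting routine sketched above.

```python
class VMemConditionedWorldModel:
    """Latent world model whose state update is conditioned on retrieved views."""

    def __init__(self, memory, frames, frame_camera_centers, encoder, dynamics):
        self.memory = memory                    # surfel index: keys = surfels, values = observing frames
        self.frames = frames                    # frame id -> stored image (the "values" being retrieved)
        self.frame_camera_centers = frame_camera_centers
        self.encoder = encoder                  # encodes retrieved reference views into features
        self.dynamics = dynamics                # latent transition model

    def step(self, state, action, K, pose, image_size, top_k=4):
        # Query the geometric memory from the next camera pose instead of relying
        # on the hidden state alone to remember distant observations.
        ref_ids = retrieve_references(self.memory, K, pose, image_size,
                                      self.frame_camera_centers, top_k)
        ref_feats = self.encoder([self.frames[i] for i in ref_ids])
        # Fuse the retrieved evidence into the latent update.
        return self.dynamics(state, action, ref_feats)
```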

Retrieval Front-End: Before the video/multi-view generation backbone (such as image set diffusion or spatio-temporal Transformer), use VMem to select reference views first, and then pass through the main network; this is equivalent to outsourcing the "context selection" to the geometric index.

RL/Embodied: Use VMem as shared memory for the "world model + policy" to read and write together: the world model uses it for long-term consistent simulation, and the policy uses it for positioning/navigation/memory evidence, reducing the difficulty of long-term credit assignment.

Experiments and results

Evaluation settings: start from a single image and generate autoregressively along the ground-truth camera trajectory; for the long-term evaluation, examine frames at positions ≥ 200; the team additionally proposes a cycle trajectory specifically to test the consistency of "going around and coming back".

Standard long-term settings

VMem outperforms the public baselines on most metrics; when the trajectory rarely revisits earlier locations, the advantage is not fully reflected in LPIPS/PSNR, but the visual consistency is better.

Cycle trajectory

VMem generally leads baselines such as LookOut, GenWarp, MotionCtrl, and ViewCrafter on metrics such as PSNR and LPIPS, and appearance and layout remain more consistent when returning to the starting point.

Efficiency: the K = 4 / M = 4 version with LoRA fine-tuning + VMem achieves a ~12× inference speedup (4.2 s/frame vs 50 s/frame), while image quality and camera-alignment metrics stay close to or better than the large-context K = 17 setting.

Ablation: replacing the retrieval strategy with "recent frames / camera distance / FOV overlap" significantly degrades consistency, showing that visibility voting over surfels is the key. The smaller K is, the more pronounced the gap becomes.

What makes it different?

Compared with the reconstruction + out-painting approach: VMem doesn't use geometry as the final representation, but only uses it for retrieval, so it is relatively more robust to geometric errors;

Compared with FOV/distance/temporal retrieval: VMem's surfels explicitly consider occlusion and the real overlap of visible areas, so the relevance is more accurate;

Compared with hidden state memory (such as the latent representation of world models): VMem's "memory" is an interpretable spatial index, which is convenient for pruning and acceleration.

Limitations and prospects

Non-real-time: diffusion sampling still requires multiple steps; the authors expect further acceleration from single-step image-set models and stronger hardware;

Data domain: fine-tuning is mainly done on RealEstate10K (indoor scenes); generalization to natural landscapes and dynamic objects still needs further work;

Evaluation criteria: existing metrics capture "true multi-view consistency" only partially. The cycle protocol is a start, and more systematic evaluations are needed.

Reference materials:

https://arxiv.org/abs/2506.18903 

This article is from the WeChat official account "New Intelligence Yuan", edited by LRST. It is published by 36Kr with authorization.