Video models natively support consistent actions. You just haven't known how to use them. Time to unveil the secret of the "first frame".
The latest method, FFGo, changes our understanding of the first frame in video generation models. The first frame is not just a starting point; it is the model's "conceptual memory buffer," storing the visual elements that appear in subsequent frames. With only a handful of samples and a special training recipe, FFGo activates this latent ability, achieving high-quality video customization without modifying the model architecture or using large amounts of data, and opening up a new direction for video generation.
In today's era of rapid progress in Text-to-Video / Image-to-Video technology, we have come to accept the following piece of conventional wisdom:
The first frame of a generated video is just the starting point on the timeline, the initial frame of the subsequent animation.
However, the latest research from the University of Maryland, the University of Southern California, and the Massachusetts Institute of Technology finds that the first frame's real role is not a "starting point" at all. It is, in fact, the video model's "conceptual memory buffer": every visual entity referenced in subsequent frames is silently stored in this frame.
- Paper link: https://arxiv.org/abs/2511.15700
- Project homepage: http://firstframego.github.io
The starting point of this research is an in-depth look at a widespread but never systematically studied phenomenon in video generation models.
The paper's core insight is bold: a video generation model automatically "remembers" every visual entity in the first frame (characters, objects, textures, layouts) and reuses them continuously in subsequent frames.
In other words, no matter how many reference objects you provide, the model quietly packages them into a "conceptual blueprint" in the first frame.
The researchers tested video models such as Veo3, Sora2, and Wan2.2 and found that:
If multiple objects appear in the first frame, then in rare cases, with a special transition prompt <transition>, the model naturally integrates them in subsequent frames, even supporting cross-scene transitions while keeping character attributes consistent;
However, this magical transition prompt <transition> differs for each model and even for each video to be generated. Moreover, once the model integrates multiple objects through the transition, it often suffers from loss of object and scene consistency, or drops objects entirely.
This shows that:
✔ The first frame is where the model "memorizes" external references.
❌ But by default, this ability is "unstable and uncontrollable."
The FFGo Method
Without modifying the architecture or doing large-scale fine-tuning, and using only 20-50 examples, any pre-trained video model can be turned into a powerful "reference-image-driven video customization system."
Based on this insight, the researchers proposed an extremely lightweight approach: FFGo.
The key advantages are striking:
✔ No modification of any model structure
✔ No need for millions of training samples
✔ Only 20-50 carefully curated video examples
✔ A few hours of LoRA training
✔ SOTA-level video content customization
This is almost unimaginable with existing methods.
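To make the "few hours of LoRA training" point concrete, here is a minimal numpy sketch of the low-rank adaptation idea behind LoRA. This is an illustrative toy, not FFGo's actual training code; the class and parameter names are my own.

```python
import numpy as np

class LoRALinear:
    """Toy LoRA-style linear layer: y = x @ (W + (alpha/r) * A @ B).

    The pre-trained weight W stays frozen; only the small low-rank factors
    A and B are trained. B is zero-initialized, so before any training the
    layer behaves exactly like the frozen base model. LoRA only adds a
    small trainable delta, which is why it can activate a latent ability
    without overwriting pre-trained knowledge.
    """

    def __init__(self, w_frozen: np.ndarray, rank: int = 8, alpha: float = 16.0):
        d_in, d_out = w_frozen.shape
        self.w = w_frozen                                # frozen pre-trained weight
        self.a = np.random.randn(d_in, rank) * 0.01      # trainable down-projection
        self.b = np.zeros((rank, d_out))                 # trainable up-projection (zero init)
        self.scale = alpha / rank

    def __call__(self, x: np.ndarray) -> np.ndarray:
        return x @ self.w + self.scale * (x @ self.a @ self.b)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.standard_normal((64, 32))
    layer = LoRALinear(w, rank=4)
    x = rng.standard_normal((2, 64))
    # With B zero-initialized, the adapted layer matches the base layer exactly.
    assert np.allclose(layer(x), x @ w)
```

Because the trainable parameters are only the two small factors, fine-tuning on 20-50 examples stays cheap, and the zero-initialized delta explains why the base model's knowledge is preserved.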
The researchers listed six major application scenarios:
- Robot Manipulation
- Driving Simulation
- Aerial / Underwater / Drone Simulation
- Multi-product Display
- Film and Television Production
- Arbitrary Multi-character Combination Video Generation
Users only need to provide the model with a first frame containing multiple objects/characters plus a text prompt. FFGo makes the model automatically "remember" all the elements and generate interactive videos with strong frame consistency, identity preservation, and action coherence. It even supports integrating up to 5 reference entities simultaneously, while VACE and SkyReels-A2 are limited to fewer than 3 and tend to drop objects outright.
Technical Highlights
Automatically build a high-quality training set of 20-50 samples using a VLM
Gemini 2.5 Pro automatically identifies foreground objects, SAM2 extracts RGBA cutouts, and video text descriptions are generated automatically, producing training samples in the format the video model expects. This greatly reduces manual workload.
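The curation pipeline above can be sketched as a simple composition of steps. The helper functions below are stand-in stubs returning dummy values; in the actual pipeline they would call Gemini 2.5 Pro and SAM2, whose real APIs are not reproduced here.

```python
from dataclasses import dataclass

@dataclass
class TrainingSample:
    first_frame: str   # composite first frame holding all reference entities
    caption: str       # auto-generated text description of the clip
    video_path: str    # the source clip itself

# --- stand-in stubs: the paper uses Gemini 2.5 Pro and SAM2 for these steps ---
def detect_foreground_objects(video_path: str) -> list:
    """Stub for VLM-based foreground object identification."""
    return ["object_a", "object_b"]

def extract_rgba_cutout(video_path: str, obj: str) -> str:
    """Stub for SAM2-based segmentation; would return an RGBA cutout image."""
    return f"{obj}.png"

def caption_video(video_path: str) -> str:
    """Stub for automatic video caption generation."""
    return f"a clip from {video_path}"

def compose_first_frame(cutouts: list) -> str:
    """Stub: lay the RGBA cutouts out on a single composite first frame."""
    return "+".join(cutouts)

def build_sample(video_path: str) -> TrainingSample:
    """One source clip in, one (first frame, caption, video) triple out."""
    objects = detect_foreground_objects(video_path)
    cutouts = [extract_rgba_cutout(video_path, o) for o in objects]
    return TrainingSample(
        first_frame=compose_first_frame(cutouts),
        caption=caption_video(video_path),
        video_path=video_path,
    )

if __name__ == "__main__":
    sample = build_sample("clip_001.mp4")
    print(sample.first_frame)  # object_a.png+object_b.png
```

Because every step is automated, assembling the full 20-50-sample training set requires essentially no manual annotation.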
Use Few-shot LoRA to activate the model's "memory mechanism"
The research found that:
- The model naturally has the ability to integrate multiple reference objects, but it is difficult to "trigger" by default.
- A special prompt (such as "ad23r2 the camera view suddenly changes") can act as a "transition signal."
- What LoRA learns is not a new ability, but "how to trigger the existing ability."
At inference time, only the first 4 frames (Wan2.2's compressed frames) need to be discarded: the real composite content of the video starts from the 5th frame, so the first 4 frames can simply be dropped.
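The inference recipe reduces to two small operations, sketched below. The trigger phrase is the example quoted from the paper; exactly where it is placed inside the prompt, and the constants here, are illustrative assumptions on my part.

```python
import numpy as np

# Example trigger phrase quoted in the paper; the real phrase is model-specific.
TRANSITION_TRIGGER = "ad23r2 the camera view suddenly changes"
# Number of leading frames to drop (Wan2.2's compressed frames, per the paper).
NUM_DISCARD = 4

def customize_prompt(user_prompt: str) -> str:
    """Insert the learned transition trigger so the LoRA-activated
    multi-object integration fires reliably (placement is an assumption)."""
    return f"{TRANSITION_TRIGGER}. {user_prompt}"

def postprocess_video(frames: np.ndarray) -> np.ndarray:
    """Drop the first NUM_DISCARD frames: they carry the composite
    reference layout, not the actual customized video content."""
    return frames[NUM_DISCARD:]

if __name__ == "__main__":
    prompt = customize_prompt("the wingsuit flyer glides alongside the truck")
    fake_video = np.zeros((81, 480, 832, 3))   # dummy (T, H, W, C) generation
    clean = postprocess_video(fake_video)
    print(prompt)
    print(clean.shape[0])  # 77
```

The key point is that no decoding logic changes: the customization is carried entirely by the prompt prefix and a fixed-length slice of the output.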
Why is FFGo so powerful?
The researchers conducted a large number of comparative experiments:
✔ FFGo can maintain object identity consistency (Identity Preservation)
✔ Can handle more reference objects (5 vs 3)
✔ Avoids the "catastrophic forgetting" caused by large-scale fine-tuning
✔ The output frames are more natural and coherent
Especially in multi-object and general multi-object-interaction scenarios, FFGo's generation quality is significantly better than that of VACE and SkyReels-A2.
What does the occasional "success" of the base model represent?
One experimental illustration from the FFGo research is particularly worth discussing on its own: in rare, highly sporadic cases, the original Wan2.2 I2V model can also complete the task "perfectly":
- Multiple reference objects do not disappear.
- Scene transitions remain stable.
- Actions are coherent and identities are consistent.
- The output closely matches the text prompt (e.g., the wingsuit flyer moves in sync with the Cybertruck).
If you only look at this set of results, you might even think that the original model itself has a stable multi - object integration ability.
But the fact is just the opposite. The point of such a success is not that "the base model performs well," but that the base model already possesses this ability; it just cannot be stably activated most of the time.
The insight of the research team is confirmed here:
✔ The video generation model does store multiple reference entities in the internal memory structure of the first frame.
✔ The video model itself can perform the generation of "multiple objects + consistent actions."
✔ But this behavior is almost uncontrollable, unstable, and difficult to reproduce by default.
This is like the model has a "hidden GPU" inside, which lights up occasionally, but you can't expect it to work all the time.
FFGo does not teach the model new abilities, but enables it to "perform stably."
In the comparison above, FFGo's results are almost identical to the original model's "occasional successes." This shows that FFGo's LoRA is not rewriting the model but activating its existing latent abilities.
In other words: the original model has the potential but cannot perform consistently, while FFGo turns that potential into a stable ability (without destroying pre-trained knowledge).
The paper notes that FFGo retains the original model's generation quality instead of sacrificing generalization the way traditional large-scale fine-tuning does; no fine-tuning can match the data quality and learning effect of pre-training.
This experiment also proves something extremely revolutionary: The first frame itself has the role of a "conceptual memory buffer." Video models are naturally capable of multi - object integration. The key is just the lack of a "trigger mechanism."
What FFGo does is use dozens of samples, a carefully designed transition phrase, and few-shot LoRA to switch this ability back on and make it controllable, stable, and reliable.
This is also why FFGo can outperform SOTA models with 20 - 50 examples.
What this experiment conveys is essentially one sentence: Video models are already strong enough; we just haven't found the correct way to use them in the past.
And FFGo is exactly teaching us one thing: How to "correctly use" video generation models.
Summary
To summarize the research significance of this paper in one sentence: It does not teach the model new abilities, but teaches us how to use the abilities that the model already has but has never been correctly utilized.
The researchers proposed an extremely inspiring future direction:
🔮 Use the model more intelligently, rather than training it more forcefully.
🔮 Obtain stronger customization ability with less data and lighter fine - tuning.
🔮 Make "using the first frame as a conceptual memory buffer" a new paradigm for video generation.
In short, in video models:
- The first frame is not a starting point but the model's "memory bank." Video models are naturally capable of multi-object integration.
- FFGo "awakens" this ability at an extremely low cost: without modifying the architecture or using massive data, it achieves SOTA video customization with only 20-50 examples.
- The experiments cover multiple scenarios such as robotics, driving, and film and television. In the user study, it led by a large margin with 81.2% of the votes.
This paper is not just a technological breakthrough; it is more like opening the "hidden skill tree" of video generation models.
Reference Materials
https://arxiv.org/abs/2511.15700