Fei-Fei Li released a new world model that can run on a single GPU.
The latest from Fei-Fei Li's world model startup has arrived!
The professor herself has just announced the official launch of a new model, RTFM (Real-Time Frame Model). It offers real-time operation, persistence, and 3D consistency, and, more importantly:
It runs on a single H100 GPU.
In addition, the design of RTFM follows three core principles:
Efficiency: With just a single H100 GPU, RTFM performs inference in real time at interactive frame rates.
Scalability: The architecture keeps scaling as data and compute grow. It learns end to end from massive amounts of video through a general-purpose architecture and builds a model of the 3D world without relying on explicit 3D representations.
Persistence: Users can interact with RTFM indefinitely, and every scene is retained permanently. The persistent 3D world the system builds does not vanish when the viewpoint changes.
Let's take a detailed look below.
World models demand enormous amounts of compute
A powerful world model can reconstruct, generate, and simulate a persistent, interactive, physically accurate world in real time. Such models will transform industries from media to robotics.
Over the past year, advances in generative video modeling have been successfully applied to generative world modeling.
As the technology develops, one fact has become increasingly clear: the compute requirements of generative world models will far exceed those of today's large language models.
With existing video architectures applied directly, generating a 4K interactive video stream at 60 frames per second requires producing more than 100,000 tokens per second (roughly the length of Frankenstein or the first Harry Potter book).
Sustaining continuous interaction for more than an hour pushes the number of context tokens past 100 million, which is neither feasible nor economical on today's computing infrastructure.
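To make the scale concrete, here is a back-of-the-envelope version of those figures. The tokens-per-frame value is an assumption chosen for illustration (the post only states the resulting rates), so treat this as a sketch rather than RTFM's actual tokenization.

```python
# Back-of-the-envelope check of the figures above. The tokens-per-frame value
# is an assumed, illustrative number, not RTFM's actual tokenization.

TOKENS_PER_FRAME = 2_000        # assumption: a video tokenizer emitting ~2k tokens per 4K frame
FPS = 60                        # target interactive frame rate
SESSION_SECONDS = 60 * 60       # one hour of continuous interaction

tokens_per_second = TOKENS_PER_FRAME * FPS
context_tokens_per_hour = tokens_per_second * SESSION_SECONDS

print(f"tokens per second:        {tokens_per_second:,}")           # 120,000 -> over 100,000
print(f"context tokens after 1 h: {context_tokens_per_hour:,}")     # 432,000,000 -> over 100 million
```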
Fei-Fei Li's team firmly believes in the principle of "The Bitter Lesson":
Simple methods that scale gracefully with compute will ultimately dominate AI, because they benefit from the exponential decline in compute costs that has driven technological progress for decades. Generative world models are well positioned to benefit from these continually falling costs.
This raises a key question: are generative world models held back by today's hardware, or can we already preview a prototype of the technology?
Fei-Fei Li's team therefore set a clear goal: design a generative world model that is efficient enough to deploy immediately and that keeps scaling as compute improves.
The aim is a model driven by a single H100 GPU that keeps the virtual world from ever disappearing while maintaining an interactive frame rate. Hitting these targets lets them glimpse the future in advance, experiencing on today's hardware what tomorrow's models could reach.
This goal shaped the entire system design, from task formulation to model architecture. By carefully optimizing every layer of the inference stack and integrating the latest advances in architecture design, model distillation, and inference optimization, they aim to present the highest-fidelity preview of future models possible on today's hardware.
World models as learned renderers
Traditional 3D graphics pipelines model the world with explicit 3D representations (such as triangle meshes and Gaussian splats) and then render 2D images from them. They rely on hand-designed data structures and algorithms to simulate 3D geometry, materials, lighting, shadows, and reflections.
Such methods have been the mainstay of computer graphics for decades, but they struggle to scale with growing amounts of data and compute.
RTFM takes a different approach. Building on recent breakthroughs in generative video modeling, the team trains a single neural network that, given one or more 2D images of a scene, generates 2D images of that scene from new viewpoints without constructing any explicit 3D representation.
RTFM uses an autoregressive diffusion transformer over frame sequences, trained end to end on massive amounts of video to predict subsequent frames from previous ones.
RTFM can be viewed as a learned renderer. It first converts the input frames into neural-network activations (a KV cache) that implicitly represent the entire world. When generating a new frame, the network reads from this representation through attention to produce a new view of the world that is consistent with the input views.
The mechanism for converting input views into a world representation and then rendering new frames from it is not hand-designed; it is learned end to end from data.
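To make the idea concrete, here is a minimal sketch of a learned-renderer loop of this kind: observed frames are written into a cache of activations, and a new frame conditioned on a target pose is produced by iteratively denoising while attending into that cache. The module names, shapes, and fixed number of denoising steps are illustrative assumptions, not RTFM's actual architecture.

```python
# A minimal learned-renderer sketch in PyTorch. FrameDiT and all shapes are
# hypothetical; this is not the RTFM implementation.

import torch
import torch.nn as nn

class FrameDiT(nn.Module):
    """Toy stand-in for an autoregressive diffusion transformer over frames."""
    def __init__(self, dim=256, latent_tokens=64):
        super().__init__()
        self.dim = dim
        self.latent_tokens = latent_tokens
        self.encode = nn.Linear(dim, dim)                  # writes observed frames into the cache
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.denoise = nn.Linear(dim, dim)                 # predicts the denoised frame latent

    def write_to_cache(self, frame_latents):
        """Turn observed frame latents into activations that implicitly represent the world."""
        return self.encode(frame_latents)                  # (batch, n_frames * latent_tokens, dim)

    def denoise_step(self, noisy_latent, pose_embedding, kv_cache):
        """One denoising step: the new frame attends into the cached world representation."""
        query = noisy_latent + pose_embedding              # condition on the target camera pose
        context, _ = self.attn(query, kv_cache, kv_cache)  # read the world via attention
        return self.denoise(query + context)

def render_new_view(model, kv_cache, pose_embedding, steps=4):
    """Generate a frame for a requested pose by iteratively denoising against the cache."""
    latent = torch.randn(1, model.latent_tokens, model.dim)   # start from pure noise
    for _ in range(steps):
        latent = model.denoise_step(latent, pose_embedding, kv_cache)
    return latent                                             # decoded to pixels by a separate decoder

model = FrameDiT()
observed = torch.randn(1, 3 * 64, 256)                        # latents of three observed frames
cache = model.write_to_cache(observed)                        # the implicit world representation
new_frame = render_new_view(model, cache, torch.randn(1, 64, 256))
print(new_frame.shape)                                        # torch.Size([1, 64, 256])
```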
RTFM learns to model complex effects such as reflections and shadows simply by observing them during training.
In doing so, RTFM blurs the line between "reconstruction" (interpolating between existing views) and "generation" (creating new content not visible in the input views), which have historically been treated as two separate problems in computer vision.
Given many input views, the task is heavily constrained and RTFM leans toward reconstruction; given few, it is forced to extrapolate and generate beyond what the inputs show.
Using posed frames as spatial memory
A key property of the real world is persistence: when you look away, the world does not disappear or change completely, and no matter how long you are gone, you can return to the places you've been before.
This has long been a challenge for autoregressive frame models, which represent the world only implicitly through 2D image frames. Persistence then requires the model to reason over an ever-growing set of frames as the user explores the world, so each frame costs more to generate than the last, and the model's memory of the world is effectively capped by its compute budget.
RTFM sidesteps this problem by modeling each frame as having a pose (a position and orientation) in 3D space. New frames are generated by giving the model the pose of the frame to be generated.
The model's memory of the world, held in its frames, therefore has a spatial structure: posed frames serve as spatial memory. This gives the model a weak prior that the world it models is a 3D Euclidean space, without forcing it to explicitly predict the 3D geometry of objects in that world.
RTFM's spatial memory makes persistence unbounded. When generating a new frame, nearby frames are retrieved from the spatial memory of posed frames to build a customized context for the model.
The team calls this technique context juggling: the model uses different context frames when generating content in different regions of space. This lets RTFM maintain a persistent memory of a large world over long interactions without reasoning over an ever-growing set of frames.
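Below is a minimal sketch of what pose-based context retrieval could look like. The Euclidean distance between camera positions, the context size k, and the class name are assumptions for illustration; the post does not spell out RTFM's actual selection rule.

```python
# A minimal sketch of pose-based context retrieval ("context juggling" above).
# The distance metric, context size, and class name are illustrative assumptions.

import numpy as np

class SpatialFrameMemory:
    """Stores frames keyed by the 3D position of the camera that produced them."""
    def __init__(self):
        self.positions = []   # list of (3,) camera positions
        self.frames = []      # per-frame data the model caches (latents, KV entries, ...)

    def add(self, position, frame):
        self.positions.append(np.asarray(position, dtype=np.float32))
        self.frames.append(frame)

    def nearby(self, target_position, k=8):
        """Return the k frames whose cameras are closest to the requested pose."""
        if not self.frames:
            return []
        target = np.asarray(target_position, dtype=np.float32)
        distances = [float(np.linalg.norm(p - target)) for p in self.positions]
        order = np.argsort(distances)[:k]
        return [self.frames[i] for i in order]

# The memory grows without bound, but each new frame only attends to a bounded,
# spatially relevant subset, so per-frame generation cost stays roughly constant.
memory = SpatialFrameMemory()
memory.add([0.0, 0.0, 0.0], "frame_A")
memory.add([5.0, 0.0, 0.0], "frame_B")
memory.add([0.2, 0.1, 0.0], "frame_C")
print(memory.nearby([0.0, 0.0, 0.0], k=2))   # ['frame_A', 'frame_C']
```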
Finally, the model is now available in preview, so you can try it out right now...
Once you've given it a spin, come back and leave a comment with your impressions. Love you~
Reference links:
[1] https://x.com/drfeifei/status/1978840835341914164
[2] https://x.com/theworldlabs/status/1978839175320186988
[3] https://www.worldlabs.ai/blog/rtfm
This article is from the WeChat official account "QbitAI". Author: Shiling. Republished by 36Kr with permission.