Fei-Fei Li's new "world model" is out, capable of generating a persistent 3D world in real time on a single H100.
Fei-Fei Li's World Labs has just released a brand-new real-time generative world model: RTFM (Real-Time Frame Model)!
It is a highly efficient autoregressive diffusion Transformer trained end-to-end on large-scale video data.
With just one H100 GPU, RTFM can render a persistent, 3D-consistent world in real time as you interact with it, whether the scene is real-world or imaginary.
What makes it unique is that it doesn't construct an explicit 3D representation of the world. Instead, it takes one or more 2D images as input and directly generates new 2D images of the same scene from different viewpoints.
Put simply, you can regard it as an "AI that has learned to render".
Just by observing the videos in the training set, RTFM has learned to model complex physical phenomena such as 3D geometry, reflections, and shadows. Moreover, it can reconstruct specific locations in the real world using a small number of sparsely captured photos.
RTFM is designed around three core principles:
Efficiency: With just a single H100 GPU, RTFM can perform real-time inference at an interactive frame rate.
Scalability: RTFM is designed to scale with the increase in data and computing power. It doesn't rely on explicit 3D representations when modeling the 3D world and adopts a general end-to-end architecture to learn from large-scale video data.
Persistence: You can interact with RTFM endlessly, and the world it creates will never disappear. It simulates a persistent 3D world that won't vanish when you look away.
RTFM can render 3D scenes generated from a single image. The same model can handle diverse scene types, visual styles, and effects, including reflections, smooth surfaces, shadows, and lens flares.
Some netizens joked, "Our world might be running on a single H100."
A former senior Google engineer commented that RTFM's results genuinely tackle the scalability problem that has long plagued world models.
Now, RTFM is officially open, and anyone can try it out.
Try it here: https://rtfm.worldlabs.ai/
World Models: Gluttons for Computing Power
We envision a future where powerful world models can reconstruct, generate, and simulate a persistent, interactive, and physically consistent world in real time. Such models will revolutionize numerous industries, from media to robotics.
Over the past year, advances in generative video modeling have been applied to generative world modeling, and the progress of this emerging technology has been genuinely exciting.
As the technology advances, it has become increasingly clear that the computing power required by generative world models will be enormous, far exceeding that of today's large language models.
If we simply apply existing video architectures, generating an interactive 4K video stream at 60 fps requires producing over 100,000 tokens per second (roughly the length of "Frankenstein" or the first "Harry Potter" book).
To maintain the persistence of this content during an interaction lasting an hour or longer, we need to handle a context window of over 100 million tokens.
With today's computing infrastructure, this is neither feasible nor cost-effective.
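To make these numbers concrete, here is a rough back-of-envelope calculation in Python. The 64×64-pixel effective patch per token is an assumption chosen for illustration, not a disclosed detail of any particular model; real tokenizers differ, but the order of magnitude is the point.

```python
# Back-of-envelope estimate of the token throughput and context length
# implied above, under an assumed tokenization of ~64x64 pixels per token.

width, height = 3840, 2160      # 4K frame
fps = 60                        # interactive frame rate
patch = 64                      # assumed pixels per token (per side)
session_hours = 1               # persistence target from the article

tokens_per_frame = (width // patch) * (height // patch)    # 60 * 33 = 1,980
tokens_per_second = tokens_per_frame * fps                 # ~118,800
context_tokens = tokens_per_second * 3600 * session_hours  # ~4.3e8

print(f"tokens/frame:   {tokens_per_frame:,}")
print(f"tokens/second:  {tokens_per_second:,}")   # over 100,000, as stated
print(f"1-hour context: {context_tokens:,}")      # well over 100 million
```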
The team firmly believes in "The Bitter Lesson": in AI, simple methods that scale smoothly with computing power tend to win out, because they benefit from the exponential decline in compute costs that has driven technological progress for decades.
Generative world models can gain significant advantages from the continuous decline in future computing power costs.
This naturally raises a question: Will generative world models be limited by today's hardware bottlenecks? Or, is there a way for us to glimpse the future of this technology today?
Efficiency: Bringing the Future to the Present
In response, Fei-Fei Li's team set a simple goal: to design a generative world model that is efficient enough to be deployed today and can continue to scale with the growth of computing power.
The more ambitious goal is to build a model that can be deployed on a single H100 GPU, maintaining an interactive frame rate while ensuring the world persists no matter how long the interaction lasts.
Achieving these goals brings that future vision into the present, letting us glimpse the enormous potential of such models through an experience available today.
This goal has also influenced the entire system design, from task setting to model architecture.
To this end, the team has meticulously optimized every aspect of the inference stack, applying the latest advancements in architecture design, model distillation, and inference optimization to preview the future model with the highest fidelity on today's hardware.
Scalability: Treating World Models as "Learned Renderers"
Traditional 3D graphics pipelines model the world with explicit 3D representations (such as triangle meshes and Gaussian splats) and then generate 2D images through rendering. They rely on manually designed algorithms and data structures to simulate 3D geometry, materials, lighting, shadows, reflections, and other effects.
Although these methods have been reliable pillars of computer graphics for decades, they do not scale easily with growing data and computing power.
In contrast, RTFM takes a different approach.
Building on the latest advances in generative video modeling, it trains a single neural network that takes one or more 2D images of a scene as input and generates 2D images of that scene from new viewpoints, without constructing any explicit 3D representation of the world.
RTFM is implemented as an autoregressive diffusion Transformer running on a sequence of frames. Through end-to-end training on large-scale video data, it learns to predict the next frame given the previous frames.
RTFM can be regarded as a "learned renderer":
The input frames are converted into the activations of the neural network (i.e., the KV cache), which implicitly represent the entire world;
When generating a new frame, the network reads from this representation through the attention mechanism to create a new view of the world consistent with the input views.
The entire mechanism, from converting input views into the world representation to rendering new frames from that representation, is learned end-to-end from data rather than designed by hand.
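To make the "learned renderer" idea concrete, here is a minimal, hypothetical sketch in Python (PyTorch). It is not World Labs' code: the module names, token counts, and pose conditioning are invented for illustration, and the diffusion denoising loop of the real model is omitted so the sketch can focus on the data flow described above, in which cached frame tokens serve as the implicit world representation that new-frame queries attend to.

```python
# Toy sketch of the mechanism described above (not World Labs' code):
# frames are encoded into tokens, the accumulated tokens act as an implicit
# world representation (a KV cache), and a new frame is produced by attending
# to that cache, conditioned on the requested camera pose.

import torch
import torch.nn as nn

DIM = 256               # hypothetical token dimension
TOKENS_PER_FRAME = 64   # hypothetical number of latent tokens per frame


class FrameEncoder(nn.Module):
    """Stand-in for the encoder that turns a 2D frame into latent tokens."""

    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(3 * 64 * 64, TOKENS_PER_FRAME * DIM)

    def forward(self, frame: torch.Tensor) -> torch.Tensor:  # frame: (3, 64, 64)
        return self.proj(frame.flatten()).view(TOKENS_PER_FRAME, DIM)


class LearnedRenderer(nn.Module):
    """New-frame queries read the world representation via attention."""

    def __init__(self):
        super().__init__()
        self.attn = nn.MultiheadAttention(DIM, num_heads=4, batch_first=True)
        self.pose_embed = nn.Linear(16, DIM)  # flattened 4x4 camera pose

    def render(self, world_tokens: list[torch.Tensor], pose: torch.Tensor) -> torch.Tensor:
        # Seed the new frame's tokens from the requested camera pose.
        query = self.pose_embed(pose.flatten()).expand(TOKENS_PER_FRAME, DIM)
        context = torch.cat(world_tokens, dim=0)  # all cached frame tokens
        out, _ = self.attn(query[None], context[None], context[None])
        return out[0]  # tokens of the newly rendered frame


encoder, renderer = FrameEncoder(), LearnedRenderer()

# One input photo of the scene seeds the implicit world representation.
world_tokens = [encoder(torch.rand(3, 64, 64))]

# Interactive loop: each requested viewpoint yields a new frame whose tokens
# are appended to the cache, so the world persists across the session.
for _ in range(3):
    pose = torch.eye(4)  # hypothetical camera pose for the next viewpoint
    new_frame = renderer.render(world_tokens, pose)
    world_tokens.append(new_frame.detach())
```

Nothing in the sketch stores explicit 3D geometry; the "world" lives entirely in the cached activations, which is what lets this kind of design scale with data and compute rather than with hand-designed structure.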
RTFM has learned to simulate complex effects such as reflections and shadows just by observing during training.
By combining RTFM with Marble, a 3D world can be created from a single image. RTFM can render complex effects such as lighting and reflections, all of which are learned end - to - end from data.
RTFM blurs the line between reconstruction (interpolation between existing views) and generation (creating new content not seen in the input views), which have always been regarded as independent problems in the field of computer vision.
When given many input views, the task is more tightly constrained and RTFM tends toward reconstruction; when given fewer, it has to extrapolate and imagine.