Exclusive interview transcript with DeepMind: decoding Genie 3, the world model poised to reshape the future of the gaming and robotics industries.
On August 5th local time, Google DeepMind unveiled its latest AI technology, "Genie 3", hailed as a revolutionary breakthrough expected to transform virtual world generation, robot training, and the entertainment industry. From a simple text prompt, it can generate an interactive, realistic 3D virtual world in about 3 seconds at 720p resolution, with features such as real-time interaction and environmental consistency.
Popular YouTuber Tim Scarfe walked through Genie 3's innovative features, potential applications, and future prospects in an exclusive interview with the DeepMind research team. The following is a summary of the full conversation:
Host: Hello, everyone. Today we bring you a globally exclusive report. I think this is the most amazing technology I've ever seen, and it's truly exciting! Last week, I witnessed a demonstration of this technology at Google DeepMind's office in London. It could become the next trillion-dollar industry, or the killer application for virtual reality. Google DeepMind has been on such a winning streak recently that even Gemini Deep Think would struggle to keep count of its successes.
Today, we'll discuss a new type of AI model: generative interactive environments. They differ from traditional game engines, simulators, and generative video models, yet combine characteristics of all three. Essentially, they are an interactive world model and a video generator, and you can plug in a game controller or any other input device. DeepMind defines a "world model" as a system that can simulate environmental dynamics, and its consistency emerges naturally without any explicit programming.
This sounds incredible: how can a randomly sampled neural network generate a consistent, real-world-like map? Remember the Quake engine in 1996? It required explicit programming of physical rules and interaction logic. In contrast, this generation of AI systems learns the dynamics of the real world directly from video data.
You can control agents in the world in real time. Generative world models emerged from the limitations of handwritten simulators. Even DeepMind's most advanced XLand platform, designed for general agent training, still looks cartoonish, is limited to domain-specific rules, and is rigid. Imagine if you could generate any interactive world for agent training from a simple text prompt.
01. Evolution from Genie 1 to Genie 2
Host: Last year, at the International Conference on Machine Learning (ICML), I interviewed Ashley Edwards, a member of the DeepMind team who introduced Genie 1, trained on 30,000 hours of 2D platformer game recordings. When generating the next frame, objects in the distance move more slowly than nearby ones, simulating a sense of depth. This ability was unexpected; nobody anticipated the model would pick up the physical world so quickly.
The core innovations of Genie 1 include a spatio-temporal video tokenizer that converts raw video into processable tokens, a latent action model that discovers meaningful control actions without labeled data, and an autoregressive dynamics model that predicts future states. The latent action model is a form of unsupervised action learning: Genie 1 discovered eight discrete actions that remained consistent across different environments, simply by analyzing frame-to-frame changes.
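To make those three components concrete, here is a minimal, hypothetical sketch of how a video tokenizer, a latent action model, and an autoregressive dynamics model could fit together. The class names, dimensions, and interfaces are illustrative assumptions, not DeepMind's published implementation.

```python
# Hypothetical sketch of the Genie 1 pipeline described above.
# All names, sizes, and interfaces are illustrative assumptions.
import torch
import torch.nn as nn

NUM_LATENT_ACTIONS = 8   # Genie 1 reportedly discovered 8 discrete actions
TOKEN_DIM = 256

class VideoTokenizer(nn.Module):
    """Maps raw frames to spatio-temporal tokens (stand-in: a conv encoder)."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Conv2d(3, TOKEN_DIM, kernel_size=16, stride=16)

    def forward(self, frames):                  # frames: (B, T, 3, H, W)
        b, t, c, h, w = frames.shape
        z = self.encoder(frames.flatten(0, 1))  # (B*T, D, H/16, W/16)
        return z.flatten(2).transpose(1, 2).unflatten(0, (b, t))  # (B, T, N, D)

class LatentActionModel(nn.Module):
    """Infers a discrete action from the change between consecutive frames."""
    def __init__(self):
        super().__init__()
        self.head = nn.Linear(2 * TOKEN_DIM, NUM_LATENT_ACTIONS)

    def forward(self, tokens_t, tokens_t1):     # (B, N, D) each
        pooled = torch.cat([tokens_t.mean(1), tokens_t1.mean(1)], dim=-1)
        return self.head(pooled).argmax(-1)     # (B,) discrete latent action ids

class DynamicsModel(nn.Module):
    """Predicts the next frame's tokens from past tokens plus a latent action."""
    def __init__(self):
        super().__init__()
        self.action_emb = nn.Embedding(NUM_LATENT_ACTIONS, TOKEN_DIM)
        self.core = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(TOKEN_DIM, nhead=4, batch_first=True),
            num_layers=2)

    def forward(self, past_tokens, action):     # past_tokens: (B, N, D)
        x = past_tokens + self.action_emb(action).unsqueeze(1)
        return self.core(x)                     # predicted tokens for the next frame

# Toy usage: tokenize two frames, infer the latent action, predict the next state.
frames = torch.randn(1, 2, 3, 64, 64)
tokens = VideoTokenizer()(frames)
action = LatentActionModel()(tokens[:, 0], tokens[:, 1])
next_tokens = DynamicsModel()(tokens[:, 1], action)
```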
This shocked me! How could training on offline game clips alone achieve this? Even more surprisingly, it also developed an emergent 2.5D parallax effect.
Just 10 months later, Genie 2 was released. It had 3D capabilities, near-real-time performance, and significantly improved visual fidelity, simulating realistic lighting along with effects such as smoke, fire, flowing water, and gravity, covering almost every element of real games. It even had reliable memory: if you looked away and then looked back, objects were still in the same place. This is Jack Parker-Holder, a research scientist on Google DeepMind's Open-Endedness team.
Holder: This is a photo taken by our team somewhere in California. We input this photo into Genie to generate an interactive game world. All subsequent pixels are generated by the generative AI model. Someone is actually operating it, pressing the W key to move forward, and from that moment on, each frame is generated by the AI.
Host: Last year, the DeepMind Israel team led by Shlomi Fruchter demonstrated GameNGen, a Doom simulation driven by a diffusion model and billed as a neural "game engine". Doom has almost become a meme: it runs on calculators and toasters. But now a neural network can generate the Doom game frame by frame in real time, displaying health points and letting you shoot enemies, open doors, and navigate the map. There are occasional glitches, but it's truly incredible! It runs at 25 frames per second on a single TPU, and its only limitation is that it can only simulate Doom.
02. Genie 3: Generating Realistic Interactive Worlds from Text
Host: Last week, we went to London and the DeepMind team showed us a demonstration of Genie 3. I couldn't believe my eyes! The resolution reaches 720p, which is immersive enough. It's real-time, simulates a realistic, real-world experience, and maintains context for several minutes without losing it. One of the team members was involved in developing Veo 3, and they seem to have combined the Genie architecture with Veo to create a kind of "super-charged Veo".
Unlike Genie 1 and 2, Genie 3 takes text prompts as input instead of images, which increases flexibility but means it can no longer generate a world directly from a real photo. Its main features are environmental diversity, long-horizon prediction, and promptable world events. For example, in a ski-slope scenario, you can type "a skier wearing a Genie 3 T-shirt appears" or "a deer runs down the hill", and those events will occur.
They said this is very useful for simulating rare events in autonomous driving. But I wonder whether this is a "turtles all the way down" problem: how do you write a program to prompt an infinite number of rare events? They showed an example of a drone flying over a lake, which was amazing, but I noticed the lack of birds. Can you add birds through a prompt?
The DeepMind team believes that the "Move 37" moment for embodied agents, when an agent discovers a brand-new real-world strategy, hasn't arrived yet. They see Genie 3 as key to getting there. However, the real world is full of creativity, and events keep branching out. In the future, there might be an outer loop to make the system more open-ended, but currently Genie 3 generates strictly according to the prompt and has no creativity of its own.
Genie 3 currently only supports a single-agent experience, but a multi-agent version is under development. What I'm most looking forward to is a brand-new form of interactive entertainment, like "YouTube 2.0". DeepMind believes robot simulation training is the real breakthrough. A marvel of human cognition is that we avoid expensive physical experiments by simulating the world in our heads, which is close in spirit to Genie.
Why train in the real world at all? Just simulate any scenario, like something out of "Black Mirror". Genie 2's generations last about 20 seconds, while Genie 3 can sustain several minutes, and its errors are becoming increasingly hard to spot. Genie 2 wasn't real-time: you had to wait a few seconds, the resolution was low, and the memory was limited. Genie 3 changes all of that.
Holder: Genie 3 can maintain a coherent interactive environment for several minutes.
Host: They're being cautious about the architectural details, probably because this could be a trillion-dollar business opportunity, and Meta CEO Mark Zuckerberg might be eyeing it. I'm worried he'll come with a checkbook and say, "Come on, $100 million, join me!" Zuck, please, don't do that! They're doing great work, so give them some space. I joked that if you're learning Unreal Engine, you might have to consider a career change. But the Google team is pragmatic and thinks this is a different technology, each with its own strengths and weaknesses. It's still a neural network with many limitations, but it can easily generate interactive motion graphics, much like the direction of Unreal Engine 5.6. Do I need to fire my motion graphics designer? Victoria, can users use Genie 3?
Victoria (a member of the DeepMind team): Not yet. It's still a research prototype and will be gradually opened through a testing program for safety reasons. At the press conference, someone asked if it could generate an ancient battle scene, and Fruchter said the model hasn't been trained on relevant data and can't do it for now.
Host: DeepMind says model improvements will reduce errors and increase accuracy. The training data might include all of YouTube's videos and more; they're cautious about discussing it. Compute relies on a network of TPUs and is presumably substantial, yet the demonstration was smooth: you enter the world about 3 seconds after entering a prompt. DeepMind also mentioned that Genie can train agents and, in turn, agents can improve Genie 3, creating a positive feedback loop. For example, when simulating crossing a road, a person observes the driver's signals before deciding to act, and agents need to rehearse similar situations.
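A hedged sketch of that feedback loop, with purely hypothetical placeholder functions (`generate_world`, `step_world`, `agent_policy`, `improve_world_model`) standing in for the world model and the agent; nothing here reflects an actual DeepMind API.

```python
# Minimal sketch of the agent <-> world-model feedback loop described above.
# All functions are hypothetical placeholders.
import random

def generate_world(prompt: str) -> dict:
    """Stand-in for the world model: returns an initial environment state."""
    return {"prompt": prompt, "step": 0}

def step_world(state: dict, action: str) -> tuple[dict, float]:
    """Stand-in for one generated frame: advances the state and returns a reward."""
    state = {**state, "step": state["step"] + 1}
    reward = 1.0 if action == "yield_to_driver" else 0.0
    return state, reward

def agent_policy(state: dict) -> str:
    """Toy agent: would learn to wait for the driver's signal before crossing."""
    return random.choice(["cross_now", "yield_to_driver"])

def improve_world_model(trajectories: list) -> None:
    """Placeholder for the outer loop: rollouts inform the next world-model update."""
    print(f"collected {len(trajectories)} rollouts to refine the world model")

# Outer loop: the world model trains the agent, and the rollouts feed back.
trajectories = []
for episode in range(3):
    state = generate_world("busy crosswalk, driver waving a pedestrian through")
    episode_reward = 0.0
    for _ in range(10):
        action = agent_policy(state)
        state, reward = step_world(state, action)
        episode_reward += reward
    trajectories.append(episode_reward)
improve_world_model(trajectories)
```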
Can you introduce Genie 3?
Fruchter: I'm a research director at Google DeepMind and was involved in the Veo project. I've worked at Google for the past 11 years and have recently focused on multimodal diffusion models. Genie 3 is our most advanced world model, capable of predicting how an environment evolves and how agent actions affect it. It achieves high resolution, long-horizon prediction, and better consistency, all in real time, allowing agents or users to freely navigate and interact.
Holder: I'm a research scientist on the DeepMind Open-Endedness team. I initially studied open-ended learning and have recently focused on world models. In London, we showed you Genie 3. I think this is the most amazing technology I've ever seen, and it might be a paradigm-shift moment.
03. Core Concept: What is a "World Model"?
Host: Genie 3 is incredible! But first, let's look back at Genie 2.
Holder: Genie 2 is the result of two years of our research, positioned as a foundation world model. Previous world models only simulated a single environment. Genie 1 was the first to create new worlds from prompts, but the resolution was low, interactions lasted only a few seconds, and it required image prompts. Genie 2 was trained on a much wider range of 3D environments, and the resolution increased from 90p to 360p, approaching modern standards but still not fully mature. We wanted to verify that the approach scales, and Genie 3 takes it to a new level: 720p, real-time interaction, and it's amazing.
Host: As the late Apple co-founder Steve Jobs said, the touchscreen has a certain magic, and interactivity is what brings that magic. Your demonstration was so impressive! The realistic visuals, like an integration of Veo that understands the real world, add up to an interactive foundation model. Can you share some examples?
Fruchter: Video models are, to some extent, world models, but they can't be interacted with. Genie 3 solves this limitation by generating the experience frame by frame, allowing users or agents to control the direction and explore non-predefined trajectories. For example, an agent can return to a previously visited location, and the environment remains consistent, which is a very impressive ability.
04. Challenges in Generating World Consistency
Host: Genie 2 already had some object persistence and consistency, but Genie 3 takes it a step further. Genie 2 used a spatio-temporal transformer, similar to ViT, plus a latent action model that inferred an action space from non-interactive data and fed it into the dynamics model. What can you reveal about Genie 3's architecture?
Fruchter: Because of interactivity, the model has to be autoregressive: it generates frame by frame and refers to all previous frames. For example, when revisiting a spot in an auditorium, the model needs to keep it consistent. This consistency emerges naturally without an explicit 3D representation, which distinguishes it from neural radiance fields or Gaussian splatting. That emergent ability is surprising.
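Schematically, that interactive loop might look like the following sketch, where `predict_next_frame` and `read_user_action` are hypothetical stand-ins rather than the real Genie 3 interface; the point is that every new frame is conditioned on the full history of frames and actions, which is where the long-horizon consistency has to come from.

```python
# Hedged sketch of frame-by-frame autoregressive generation with full-history
# conditioning. Both helper functions are hypothetical placeholders.
from typing import List

def predict_next_frame(prompt: str, frames: List[str], actions: List[str]) -> str:
    """Stand-in for the model: in reality this would return an image."""
    return f"frame_{len(frames)} after action '{actions[-1]}'"

def read_user_action(step: int) -> str:
    """Stand-in for a game controller or keyboard."""
    return "W" if step < 5 else "turn_left"

prompt = "a skier on a sunlit slope"
frames: List[str] = ["initial frame generated from the text prompt"]
actions: List[str] = []

for step in range(8):
    actions.append(read_user_action(step))
    # Every frame is conditioned on *all* previous frames and actions, so a
    # location revisited later must be rendered the same way it was before.
    frames.append(predict_next_frame(prompt, frames, actions))

print("\n".join(frames))
```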
Host: Genie 2 could already simulate parallax and lighting, but the Doom simulation you were involved in shocked me even more. Doom, in 1993, was a John Carmack masterpiece. Now a neural network can generate the game without an explicit game engine, purely in pixel space. This is truly incredible!
Fruchter: I played Doom as a kid and have also worked on game-engine development, so this project felt especially fulfilling. We use GPUs or TPUs to generate a consistent 3D environment, running on the same class of hardware as traditional game engines. We tried using a diffusion model to simulate a game environment in real time, generating every pixel and accepting only user input. At first we weren't sure it was possible, and we were thrilled when it worked. Real-time interaction sparks people's imagination and makes them feel they could really step into the generated world. We hope to push this toward higher-quality, more general simulation.
Host: The key question is this: Genie 2's dynamics model uses MaskGIT and runs iteratively. How do you explain a stochastic neural network generating a consistent world? When I look away and then look back, the objects are still in the same place. Isn't that strange for a sub-symbolic, stochastic model?
Holder: Similar to language models, world models need to maintain certain basic consistencies. Language models maintain consistency in factual content, and new content has randomness. In the world generated by Genie, new objects might have randomness, but once they're generated, they remain consistent, which is an emergent property of large-scale training.
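For readers unfamiliar with the MaskGIT-style decoding the host refers to, here is a minimal, illustrative sketch of the core idea (the model call is a random stand-in): all next-frame tokens start out masked, and over a few iterations the model commits to the tokens it is most confident about while re-masking the rest.

```python
# Minimal sketch of MaskGIT-style iterative decoding for a next-frame token grid.
# The "model" is a random stand-in; only the unmask-most-confident loop
# illustrates the technique itself.
import torch

NUM_TOKENS, VOCAB, STEPS = 16, 512, 4
tokens = torch.full((NUM_TOKENS,), -1)          # -1 means "still masked"

for step in range(STEPS):
    logits = torch.randn(NUM_TOKENS, VOCAB)     # stand-in for the dynamics model
    probs, preds = logits.softmax(-1).max(-1)   # confidence and prediction per token
    masked = tokens == -1
    # Unmask a growing share of the remaining tokens, most confident first.
    k = max(1, int(masked.sum() * (step + 1) / STEPS))
    conf = torch.where(masked, probs, torch.full_like(probs, -1.0))
    keep = conf.topk(k).indices
    tokens[keep] = preds[keep]

print(tokens)  # fully decoded next-frame tokens after STEPS iterations
```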
05. How to Measure the Quality of a World Model?
Host: In 2018, David Ha and Jürgen Schmidhuber's "World Models" paper defined a world model as a system that simulates environment dynamics. How do you measure its quality?
Fruchter: Measuring the quality of a world model is difficult, especially for visual generation, as quality is subjective. Language models can be evaluated using perplexity or performance on downstream tasks, but world models mainly focus on visual interaction. The quality depends on the use case. Our goal is to enable AI agents to interact in a simulated environment. Simulation is crucial for AI because real-world experiments are time-consuming and costly, such as in drug development or robot assembly. Genie 3 pushes the boundaries of simulation.
Host: I recently interviewed a startup that envisioned a marketplace for robot policies: because real-world data is scarce, people could share policies. And you propose generating robot policies for specific scenarios with a world foundation model. Is that correct?
Holder: Yes, robots are usually deployed in restricted environments, like a carefully arranged apartment, lacking the randomness of the real world. Existing simulators can simulate physics but can't simulate the weather, other agents, or animals. Genie 3's world knowledge goes beyond physics and includes the behavior of other agents, like a herd of deer running down a hill. This is crucial for large-scale robot deployment, as it can safely simulate real-world scenarios and avoid risks in the real world.
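A hedged sketch of how promptable world events could be composed into a grid of training scenarios along the lines Holder describes; the prompt fragments and the `run_episode` placeholder are illustrative assumptions, not a real interface.

```python
# Hedged sketch: composing prompted world events to randomize training scenarios.
# `run_episode` is a hypothetical placeholder for rolling out a policy inside a
# generated world; the prompt fragments are illustrative only.
import itertools

base_scene = "a delivery robot on a suburban sidewalk"
weather = ["light rain", "low evening sun", "fog"]
events = ["a herd of deer crosses the street", "a cyclist passes close by",
          "a parked car pulls out suddenly"]

def run_episode(prompt: str) -> float:
    """Placeholder: would generate the world, roll out the policy, return success."""
    return 0.0

results = {}
for w, e in itertools.product(weather, events):
    prompt = f"{base_scene}, {w}; world event: {e}"
    results[prompt] = run_episode(prompt)

# Rare or failure-prone combinations can then be oversampled in the next round.
print(len(results), "scenarios generated")
```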
Host: Training robot policies might require curriculum learning and diversity, gradually increasing complexity, like adding a person wearing a Gemini T-shirt or a car. That needs a meta-process to control the complexity gradient, similar to the POET paper by Ken Stanley, formerly a computer science professor at the University of Central Florida. Is that intuition reasonable?
Fruchter: It's still hard to pin down the specific applications of Genie 3 in AI research. We've seen other generative models make unexpected discoveries, such as Veo reading text in a photo and following spatial instructions. We hope to explore Genie 3's potential through community feedback.
06. Openness: Human Skills and Prompt Creativity
Host: I really like open-ended research. Currently, general prompts yield simple results, while computer graphics experts use highly specific prompts to generate novel content. The real world always produces novel events, and Genie 3 currently generates specific scenarios and lacks random events, like an airplane flying by. Is that correct?
Holder: Yes, Genie 3 highly depends on text prompts, but this isn't a limitation; it's an advantage. Humans can create cool worlds through high-quality prompts, amplifying their creativity. Research like POET is limited to simple environments, while Genie 3 uses language to guide generation and combines human knowledge to define "interesting" content.
Fruchter: Human notions of what is interesting drive innovation. For example, generating an ASMR video of cutting glass fruit: that novelty comes from the prompt. Genie 3's richness emerges from short prompts, and in the future more complex experiences might be built through a multi-step creative process in which humans and AI collaborate.
07. Future: The Next Generation of YouTube or VR?
Host: On social media, prompt sharing has fueled creative exploration. This could be the next form of YouTube or virtual reality, similar to the "experience machine" in philosophy, so immersive that people might not want to leave.