
Can AI create a world? Google DeepMind's Genie can generate "Death Stranding" in seconds.

GeekPark · 2025-08-06 19:25
It is no longer "one flower, one world", but "one conversation, one world".

If the generative AI breakthroughs of the past few years taught us to converse with algorithms, having them write articles, draw illustrations, and even edit videos, then Genie 3, introduced by DeepMind today, takes generative AI into another dimension.

On August 5th, DeepMind announced Genie 3, a new general-purpose world model, on its official website.

Open Genie 3 and input a prompt like "Stroll through a medieval village during a storm." In just a few seconds, Genie 3 generates a 3D scene that you can explore and interact with in real time. In the rain-soaked village, lightning glints off the stone slabs. You control the camera and wander freely; approach a small cottage, push open the door, and you can watch the light and shadow shift as the fire flickers in the wind.

What's even more amazing is that when you leave the cottage and come back, the fire is still burning and the graffiti on the wall is unchanged. Now type "After the rain, the sky clears, and a knight rides a horse toward the cottage" into the command box, and a few seconds later you can push the door open again to greet the knight.

At this moment, you are like the creator of a small world. This is the generative power of the general-purpose world model that Genie 3 presents, and it is this power that gives Google an edge in the fierce AI competition.
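To make the interaction described above concrete, here is a minimal, purely hypothetical sketch in Python. Genie 3 has no public API, so WorldStub, step, and every other name below are invented stand-ins for whatever interface such a system might eventually expose.

```python
# Hypothetical sketch of the prompt -> explore loop described above.
# Genie 3 exposes no public API; all names here are invented stand-ins.
class WorldStub:
    """Pretend world handle: created from a text prompt, explored with actions."""

    def __init__(self, prompt: str):
        self.prompt = prompt

    def step(self, action: str) -> str:
        # A real world model would return the next rendered frame;
        # this stub returns a text description instead.
        return f"[{self.prompt}] view after: {action}"


world = WorldStub("Stroll through a medieval village during a storm")
for action in ["walk to the cottage", "open the door", "step inside"]:
    print(world.step(action))
```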

01

Create the World at Your Fingertips

The predecessor of Genie 3 was Genie 2, released at the end of 2024. That model could generate simple 3D environments, but the scenes lasted only 10 to 20 seconds, with rough details that couldn't stand scrutiny: turn the camera slightly and trees might float, characters might vanish into thin air, and objects would shift position at random.

In just seven months, Genie 3 has achieved an astonishing leap.

It has jumped from 360p to 720p resolution, rendered at 24 frames per second, and it can sustain a continuous simulation for several minutes rather than a ten- or twenty-second animation clip.

More importantly, Genie 3 doesn't rely on hard-coded physics like a game engine. Instead, it maintains the scene's logical and physical consistency through model prediction. Simply put, leaves in the scene sway naturally instead of flying around at random, characters' shadows move with their positions, and objects respond to collisions in ways that obey physical laws.

In the past, neither text-to-video models like Sora nor the early Genie series could easily solve the problem of "world consistency."

Genie 3, however, introduces a new visual memory mechanism that lets each frame refer to the state of previous frames and continuously maintain the layout of the entire environment. This means the paths you've walked won't vanish into thin air when you turn back; trees, rocks, and buildings stay put, as if they truly exist in a continuous space.

In simple terms, the model has learned to "remember" what it has just drawn. As a result, you no longer see those abrupt jumps but a continuous world that can last for several minutes.

Genie 3 can remember the generated objects | Image source: Genie 3
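One common way to get this kind of behavior, and a reasonable reading of what "remembering" means here, is autoregressive generation over a window of past frames. The sketch below illustrates that general technique only; the history length, the conditioning, and the predict_next_frame stand-in are all assumptions, not DeepMind's actual design.

```python
# Illustration of autoregressive frame generation with a visual-memory
# window. This shows the general technique only; it is not Genie 3's code.
from collections import deque

import numpy as np

HISTORY = 64  # how many past frames the model may attend to (assumed value)


def predict_next_frame(history: list, action: str) -> np.ndarray:
    """Stand-in for the learned model; returns a dummy 720p RGB frame."""
    # A real model would condition on both `history` and `action`;
    # feeding past frames back in is what gives the world persistence.
    return np.zeros((720, 1280, 3), dtype=np.uint8)


frames = deque(maxlen=HISTORY)
frames.append(np.zeros((720, 1280, 3), dtype=np.uint8))  # initial frame

for action in ["walk forward", "turn left", "turn back"]:
    nxt = predict_next_frame(list(frames), action)
    frames.append(nxt)  # the new frame joins the context for the next step
```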

DeepMind stated in its blog that this type of world model is a cornerstone of general intelligence, because true intelligence requires not only understanding the world but also making decisions and taking actions in it, and all of that can only happen in a stable, logically consistent environment.

This is why DeepMind calls it a "world model" rather than simply a "video generator."

The generated scene conforms to physical laws | Image source: Genie 3

Traditional video-generation models such as Sora can convert a text description into a 30-second video, but in essence the result is still a "closed segment": you can't change the world within the clip, let alone interact with it.

Genie 3, on the other hand, takes a big step forward in interactivity. It not only generates a continuous world but also dynamically adjusts the scene as you explore, while keeping the logic intact. This is Genie 3's Promptable World Events, which can be summed up as "text is the command, and the world responds in real time."

For example, when you input "A speedboat appears on the water surface," Genie 3 won't regenerate the scene from scratch. Instead, a speedboat materializes in the existing world and glides across the river, throwing up realistic spray on both sides and behind it.

This instant malleability means that users are not just spectators but also directors.

Infinite possibilities in the same scene | Image source: Genie 3
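A plausible way to picture the speedboat example above, purely as an assumption about the mechanism, is that the event text is folded into the conditioning of subsequent frames rather than triggering a full re-render. The PromptableWorld class and its methods below are invented for illustration.

```python
# Hypothetical sketch of a "promptable world event": the event text joins
# the conditioning for upcoming frames instead of resetting the scene.
# All names here are invented; this is not a real Genie 3 interface.
class PromptableWorld:
    def __init__(self, scene_prompt: str):
        self.scene_prompt = scene_prompt
        self.pending_events: list = []

    def add_event(self, event: str) -> None:
        """Queue a text event to be woven into upcoming frames."""
        self.pending_events.append(event)

    def render_step(self) -> str:
        # A real model would condition next-frame prediction on both the
        # original scene and queued events; here we just describe that.
        conditioning = "; ".join([self.scene_prompt, *self.pending_events])
        return f"frame conditioned on: {conditioning}"


world = PromptableWorld("a calm river at dusk")
world.add_event("A speedboat appears on the water surface")
print(world.render_step())
```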

According to DeepMind, Genie 3 was trained on a large dataset generated by game engines, along with video-prediction tasks, to endow the model with a sense of "causality" and "persistence." Put more simply, Genie 3 has learned two things: the world is continuous, and actions have consequences.
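The video-prediction objective mentioned here has a standard toy form: given an encoded frame and the action taken, predict the next frame and penalize the error. The tiny model and loss below are illustrative stand-ins under that assumption, not Genie 3's architecture.

```python
# Toy action-conditioned video-prediction objective: predict the encoding
# of frame t+1 from frame t and the action at t. Illustrative only; the
# architecture, dimensions, and loss are assumptions, not Genie 3's.
import torch
import torch.nn as nn


class TinyPredictor(nn.Module):
    def __init__(self, frame_dim: int = 256, action_dim: int = 8):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(frame_dim + action_dim, 512),
            nn.ReLU(),
            nn.Linear(512, frame_dim),
        )

    def forward(self, frame: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([frame, action], dim=-1))


model = TinyPredictor()
frame_t = torch.randn(32, 256)   # encoded frames at time t (batch of 32)
action_t = torch.randn(32, 8)    # encoded actions taken at time t
frame_t1 = torch.randn(32, 256)  # ground-truth encodings of frame t+1

pred = model(frame_t, action_t)
loss = nn.functional.mse_loss(pred, frame_t1)  # "actions have consequences"
loss.backward()
```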

Another detail is that Genie 3 supports free movement of the camera and can dynamically redraw content from different viewpoints. This may sound easy, but it is extremely difficult for a generative model, requiring strong 3D reasoning ability. This is why DeepMind emphasized in its blog that the goal of Genie 3 is not just video but "world-based interactive generation."

So Genie 3 doesn't just generate images or videos. It creates an explorable, editable virtual reality, and that opens up countless application scenarios.

02

Disrupt the Creative Industry

Just by looking at the official demo, one can imagine many scenarios where Genie 3 can be applied, especially in the creative industry.

From the earliest text-based interfaces to 2D, and now to 3D and VR, video games have always been at the forefront of humanity's exploration of virtual spaces. Genie 3's demonstration pushes this trend to a whole new level: with just one sentence, you can instantly generate an explorable, interactive 3D scene. What does this mean for the game development industry?

In the traditional development process, building 3D scenes is one of the most expensive and time-consuming parts of game production. For independent developers especially, this often forces compromises: many choose 2D pixel art, hand-drawn, or low-polygon styles to cut development costs.

But Genie 3 breaks this limitation entirely. What used to take weeks or even months of modeling, texturing, and lighting can now be achieved by writing a few sentences to build a dynamic, interactive scene.

Doesn't it look like the style of a "Bakery Simulator" game? | Image source: Genie 3

Large studios may still use Unreal Engine or in-house engines to build AAA worlds with top-tier graphics. But for developers with limited resources, Genie 3 fills the "cost gap." It doesn't replace professional engines; it dramatically lowers the threshold for scene design. A small team that is creative but technically limited can piece together an entire open-world map with text, like building with Lego.

Film and television are no different. Directors and production designers can preview a scene's look in real time before shooting, adjust the lighting, add characters, and even block out the actors' movements in the virtual space, achieving an "immersive shot list."

The education sector has even more room for imagination. The historical monuments and geographical phenomena described in textbooks can be turned into interactive, explorable scenes through Genie 3.

Art also has a new form of expression. Imagine being able to "visit" the Doors of Durin in "The Lord of the Rings" or "step into" Raphael's "The School of Athens."

Perhaps when everyone has the ability to "build virtual spaces," the metaverse that Zuckerberg has been longing for can finally be realized.

DeepMind's greater ambition lies in the training of physical agents.

03

The "Cognitive Training Ground" for AI

DeepMind stated in its blog that the significance of the world model is that it provides a "cognitive training ground" for intelligent agents, letting them learn causal relationships, spatial perception, and action planning in a virtual world rather than making mistakes directly in the real one.

For example, to train a warehouse robot, the past approach was to build an expensive physical environment or rely on a traditional game engine for simulation. Both methods have limitations: the former is costly, and the latter lacks diversity.

In Genie 3, by contrast, you have a world that can be generated without limit, modified instantly, and kept logically coherent. A robot can practice obstacle avoidance, handling, and collaboration in it, and even rehearse extreme situations, such as training an autonomous vehicle to handle a pedestrian suddenly stepping into the road. Such scenarios are extremely difficult to reproduce in reality, but in Genie 3 they can be created with text alone.

This is also what DeepMind means when it says that Genie 3 has the potential to push AI agents to their limits. It will force agents to learn from their own experiences, similar to how humans learn in the real world.
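As a sketch of what such agent training could look like, here is a toy gym-style loop in which the environment stands in for a world model seeded by a text scenario. The TextSeededEnv class, its reward, and the random placeholder policy are all invented for illustration.

```python
# Toy sketch of an agent training loop where a text-seeded world model
# plays the role of the environment. Everything here is invented for
# illustration: the env, the reward, and the placeholder random policy.
import random


class TextSeededEnv:
    """Stand-in for a world-model environment created from a prompt."""

    def __init__(self, scenario: str):
        self.scenario = scenario
        self.t = 0

    def reset(self) -> str:
        self.t = 0
        return f"start of: {self.scenario}"

    def step(self, action: str):
        self.t += 1
        reward = 1.0 if action == "brake" else 0.0  # toy reward signal
        done = self.t >= 10
        return f"t={self.t}", reward, done


env = TextSeededEnv("a pedestrian suddenly steps into the road")
obs, done, total = env.reset(), False, 0.0
while not done:
    action = random.choice(["brake", "steer", "accelerate"])  # placeholder policy
    obs, reward, done = env.step(action)
    total += reward
print(f"episode return: {total}")
```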

However, Genie 3 is not omnipotent and still has obvious technical limitations.

For example, the current scene resolution is only 720p, at 24fps. That is already remarkable for AI-generated content, but it still falls short of the 4K, high-frame-rate standard of modern game graphics. The persistence of Genie 3's scenes is also limited: although DeepMind says they can last several minutes, the released demos all stay within one minute.

Text rendering in the scenes is still poor; it's hard to make out legible lettering on the road signs it generates. Physical consistency isn't perfect either: in demanding tests such as simulating large herds of creatures or avalanches, it still shows signs of "AI anomalies."

A strange herd of deer | Image source: Genie 3

Availability is also unclear. DeepMind said Genie 3 is currently used only in research and collaborative projects; it has not opened an API to the public, nor does it have an online demo portal like Imagen or Gemini.

Looking at the bigger picture, however, Genie 3 is not an isolated innovation but a landmark in the direction AI technology is moving.

From Fei-Fei Li's World Labs, to NVIDIA's Cosmos world foundation model, and now to DeepMind's Genie 3, a clear development path for AI spatial intelligence emerges: from 2D to 3D, then to explorable spaces, and finally to scenes with physical consistency, spatiotemporal coherence, and interactive cause and effect.

ChatGPT made us realize that language can be an operating system, Sora showed us that video can be a creative interface, and Genie 3 takes it a step further by turning text into an "operable" space.

Ultimately, whether in games, film and television, education, or scientific research, building virtual worlds will become an instant form of expression:

One line of text, one description, one sentence, one world.

This article is from the WeChat official account "GeekPark" (ID: geekpark). Author: Moonshot, Editor: Jingyu. Republished by 36Kr with permission.