
After Fei-Fei Li's 3D world model went viral, the first free version in China is here: I became a "sovereign" creator.

爱范儿 (ifanr) · 2025-12-22 17:18
The world is so simple.

Do you still remember Fei-Fei Li's "3D world generation model" that went viral in the AI circle some time ago? Now, the domestic version has finally arrived.

Just last week, as the news of Yao Shunyu joining Tencent made the rounds, Tencent's Hunyuan team quietly launched World Model 1.5 (TencentHY WorldPlay), the first real-time world model in China open for public hands-on use.

What is a world model? Simply put: you input a few sentences or a picture, and the AI can generate a virtual world that you can "walk into and play in". It's not just a video that you can only watch, but a 3D space that you can control in real-time using a keyboard, mouse, or even a gamepad.

A game scene generated based on the first frame of the picture

The highlights this time:

  • Real-time generation: through an original Context Forcing distillation scheme and streaming inference optimizations, the model generates 720P high-definition video at 24 FPS.
  • Long-horizon consistency: a Reconstituted Memory mechanism keeps minute-long generations geometrically consistent, so the model can serve as a high-quality 3D space simulator.
  • Broad applicability: the Hunyuan world model works across game and real-world scenes of different styles, in both first- and third-person perspectives, and also supports real-time text-triggered events and video continuation.

Confused by the jargon? APPSO will now take you hands-on to create some mind-blowing "worlds".

Online experience website: https://3d.hunyuan.tencent.com/sceneTo3D?tab=worldplay

Text → World: Experience the Pleasure of Being the "Creator"

When I first opened the page, I noticed that the interface was designed to look like a retro TV. Recall that when we were kids watching TV, we could only watch what CCTV or Hunan TV was broadcasting. No matter how we switched channels with the remote control, we couldn't escape the pre-arranged programs.

But now, you don't have to wait for the prime time at 8 p.m., or wait for the director to finish filming. You are the chief director of this world. Want to experience a roller coaster ride? Type a few words and it will be generated. Want to go back to the New Year's Eve in the millennium? Describe it and it will appear.

A roller coaster speeding at full throttle. You're holding the cold metal handrail, the whistling wind rushes into your throat, a sudden sense of weightlessness hits you, there are blurry tree shadows flashing by rapidly, and the glaring sun above your head. The first-person experience is extremely exciting, with a realistic style.


After clicking "Generate", I waited about 5–8 seconds before the picture appeared. At first glance, it really felt like sitting in the front row of a roller coaster. Looking closely at the hands at the bottom of the frame, the skin texture, joints, and even pores are clearly visible. The paint of the red seat and the scratches on the metal handrail are also rendered very realistically.

Press the ↑ key to move forward, and the picture starts to reverse, as if you're sitting on the roller coaster backwards. It's even more exciting.

However, in the later part of the video, the tree shadows on both sides didn't hold up well and were severely distorted. But considering the difficulty of real-time generation, it's understandable.

A snowmobile racing forward at full speed. You're clutching the freezing metal handlebar in your palm, sharp snow particles are hitting your cheeks, a sudden sense of weightlessness surges up, there are blurry forest shadows flashing by rapidly, and the cold, broken sunlight and snowflakes above your head. The first-person experience is...

After the picture was generated, I found myself in a "frozen moment". The surrounding snow, forest shadows, and sunlight were all static, as if the pause button had been pressed. I could freely turn the perspective and carefully observe the snowflakes flying up, the sky at that moment, and the frost marks on the metal handlebar.

At first, I thought it was a bug, but then I thought about it. It's a bit like when you just travel to a new world and time freezes at that moment, allowing you to calmly examine all the details around you.

From a technical point of view, it might be because the model has difficulty handling the "first-person perspective + high-speed movement" scenario. Although it's not the dynamic experience of "riding a snowmobile at high speed" that I expected, this exploration of the frozen moment allows people to more clearly feel the texture of the 3D space generated by AI.

A bustling New Year's Eve scene in the millennium. There are cassette tape stalls on the street, people are holding colorful balloons and gathering in the square for the countdown, CRT TVs in the roadside shops are playing the New Year's Eve party, and suddenly bursting fireworks light up the night sky. It's in a retro style.

If you were born in the 1980s or 1990s, this scene is definitely worth a try. After all, around the year 2000, when smartphones weren't yet common, few people could capture the moment on video.

When the camera turned from the cassette tape stall to the building on the right, the objects in the scene maintained a good relative position. The street lights, people, and TV sets didn't show any obvious drifting or misalignment, which proves that the model has a good understanding of 3D space.

However, when the model renders the cassette-tape rack, a telltale AI blur appears. From a distance it looks colorful, but up close it lacks sharp edges. Looking up at the building on the right, the details appear very "soft", more like a smeared oil painting than a solid object with a hard physical structure.

After testing the nostalgic style, I wanted to try the seaside mansion that I've always dreamed of living in.

The main colors of the room are light blue and sandy white. The floor is paved with matte tiles imitating the texture of seashells. There is a light gray linen sofa by the window. The floor-to-ceiling glass window has no obstructions, framing the seaside view outside into a flowing picture.

This is an almost 180-degree panoramic sweep. When the model handles a large-span perspective switch, the straight lines of the window frames, columns, and ceiling don't distort, showing excellent 3D space consistency.

Although we can't afford a seaside mansion, at least we can lie back and relax in the AI-generated world for a while (laughs). If we ever achieve that goal, we can also use it to preview the decoration effect.

Bring the "A Thousand Li of Rivers and Mountains" to Life

In addition to text generation, the Hunyuan world model also supports the "single-image scene generation" function. But before uploading a picture, there are a few things to note:

Check the resolution: it should be between 1280×704 and 4K×4K. If it's a picture of dozens of megabytes from a professional camera, reduce the quality or resize it so the file is below 10MB.

Avoid vertical pictures: Vertical photos taken by mobile phones don't meet the requirements. It's recommended to crop them into horizontal ones.
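The checks above can be scripted before uploading. The sketch below is our own helper, not part of any official tool; the limits (1280×704 to 4K×4K, under 10MB, landscape only) are the ones the article lists, and 4K is assumed to mean 4096 pixels.

```python
# Pre-upload sanity check for the single-image scene generation feature.
# Limits come from the article; the function name and wording are our own.
MIN_W, MIN_H = 1280, 704
MAX_W, MAX_H = 4096, 4096          # assuming "4K" means 4096 px
MAX_BYTES = 10 * 1024 * 1024       # 10 MB

def check_upload(width: int, height: int, size_bytes: int) -> list[str]:
    """Return a list of problems; an empty list means the image should pass."""
    problems = []
    if size_bytes > MAX_BYTES:
        problems.append("file larger than 10 MB; re-export at lower quality")
    if width < height:
        problems.append("portrait orientation; crop to landscape")
    if not (MIN_W <= width <= MAX_W and MIN_H <= height <= MAX_H):
        problems.append(f"resolution {width}x{height} is outside 1280x704 to 4096x4096")
    return problems
```

For example, a 1920×1080 photo under 10MB passes, while a vertical phone shot is flagged for cropping.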

Video continuation: Generated based on the first frame of the picture

After taking care of these things, I made a bold attempt: I uploaded a partial picture of "A Thousand Li of Rivers and Mountains".

Yes, it's that famous blue-and-green landscape painting of overlapping mountain ranges, painted by the Northern Song prodigy Wang Ximeng at the age of 18. I wanted to see whether silicon-based intelligence could understand carbon-based aesthetics from a thousand years ago.

The picture was generated. It completely exceeded my expectations:

The AI well preserved the style characteristics of the original painting. The 3D processing didn't destroy the artistic conception of traditional Chinese paintings. It felt like I really traveled back to the Northern Song Dynasty and stood in the mountains and rivers where Wang Ximeng once painted.

This shows that the world model may make art not just something to be "appreciated", but something that can be "freely explored".

Real-time Triggered Events

The most appealing thing about the world model is that you just need to say a sentence, wait for 5 seconds, and the world will change according to your wishes.

Stop saying "I'm out of ideas". Come here and experience the thrill of being a domineering CEO.

It's not a sudden "scene switch", but a smooth transition. The gradual change of the sky from bright to dark and the delicate changes in light and shadow make people feel that this world has "come to life".

The high-brightness orange fire from the explosion is realistically reflected on the water surface, and the effect is very natural. However, if you look closely, there are still some minor flaws.

For example, after such a huge explosion, the nearby water surface doesn't show any ripples. In the real physical world, the violent air expansion would change the state of the water surface.

After reading the actual test, I'm sure you're as curious as I am: How is it done technically?

The technical report of Tencent's Hunyuan team notes that when traditional diffusion models generate video, they must fully denoise the entire clip before outputting anything. This causes two problems: high latency, and no way to respond to user input in real time.

This time, a streaming DiT (Diffusion Transformer) architecture is adopted. Like streaming media, it receives the user's real-time gamepad control signals while denoising and decoding frames on the fly. This design keeps latency extremely low, so controlling the camera feels lag-free.
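The contrast between the two approaches can be sketched in a few lines. This is an illustrative toy, not Tencent's code: `denoise` and `read_input` are stand-ins for the real denoiser and controller polling.

```python
def generate_batch(prompt, num_frames, denoise):
    """Traditional diffusion: denoise the whole clip, then show it.
    The user sees nothing until every frame is finished."""
    latents = [f"noisy[{i}]" for i in range(num_frames)]
    return [denoise(latent, prompt) for latent in latents]  # high latency

def generate_streaming(prompt, num_frames, denoise, read_input):
    """Streaming DiT-style loop: each frame is denoised and shown as soon
    as it is ready, conditioned on the latest control signal."""
    shown = []
    for i in range(num_frames):
        action = read_input()                      # gamepad/keyboard state now
        frame = denoise(f"noisy[{i}]", prompt, action)
        shown.append(frame)                        # displayed immediately
    return shown
```

The key difference is where the user's input enters: in the batch loop it can't enter at all, while the streaming loop reads it fresh before every frame.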

The biggest problem with the world model is that it's "forgetful". You ask it to generate a living room, and it does. But if you turn away and come back, it generates a completely new living room that has nothing to do with the previous one.

The role of the Context Forcing mechanism is to force the model to "remember" the details of the previously generated scene. Simply put, it adds a "short-term memory" to the model, allowing it to refer to the previous geometric structure, lighting relationship, and object positions when generating new pictures, thus ensuring long-term 3D consistency.
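The "short-term memory" idea can be illustrated with a rolling buffer: keep the most recent frames and feed them back into the generator at every step. The class below is our own sketch of that concept, not the actual Context Forcing design.

```python
from collections import deque

class ContextBuffer:
    """Rolling window of recent frames the generator may 'remember'."""
    def __init__(self, capacity: int = 8):
        self.frames = deque(maxlen=capacity)  # oldest frames fall off

    def push(self, frame):
        self.frames.append(frame)

    def conditioning(self):
        """Context handed to the model when generating the next frame."""
        return list(self.frames)

def generate_with_memory(model, steps: int, buffer: ContextBuffer):
    out = []
    for t in range(steps):
        frame = model(t, buffer.conditioning())  # new frame refers to the past
        buffer.push(frame)
        out.append(frame)
    return out
```

Because each new frame is conditioned on the buffer, geometry and lighting from earlier frames can persist: turn away from the living room and back, and the model still "sees" the living room it generated before.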

After testing the Hunyuan world model, the words of Fei-Fei Li kept echoing in my mind: "The core of human intelligence is not language, but the ability to understand and manipulate the three-dimensional space."

In the past two years, large language models (LLMs) have become extremely popular around the world. ChatGPT, Claude, and Gemini have amazed us with the language abilities of AI. But if we think calmly: Does an AI that can chat really understand the world?

It doesn't know how tall a table is, doesn't know how many turns it takes to walk from the living room to the kitchen, and doesn't know how a cup will break when it falls on the ground...

Language intelligence enables AI to "talk"; spatial intelligence enables AI to "do".

This is why Google, Meta, OpenAI, and Tencent are all betting on world models. It's not just a cooler video generation tool, but a crucial step towards artificial general intelligence (AGI).

When "Minecraft" was first released, many people thought, "What's so fun about this?"

More than a decade later, Minecraft has become one of the most successful games in the world. Not because of top-notch graphics or special effects, but because it gives players the freedom to create their own worlds.