Google goes left, Fei-Fei Li goes right, and Alibaba's world model HappyOyster carves out a third path.
Recently, a previously unknown model called "Happy Horse" topped the Artificial Analysis rankings.
The AI community immediately buzzed with speculation, until Alibaba stepped forward to claim it.
Unexpectedly, just a few days later, Alibaba's "Happy" family welcomed a new member: HappyOyster.
The two products share the same lineage: both come from the Alibaba Token Hub (ATH) Innovation Business Group, which Alibaba established in March of this year.
However, unlike Happy Horse's one-shot workflow of writing a prompt, waiting for the render, and receiving a finished clip, HappyOyster is an open-world-model product that can be built and interacted with in real time.
It is built on a native multi-modal architecture, and behind it sits a streaming generative world model that supports multi-modal input and joint audio-video generation. Throughout generation, it can continuously receive user instructions, and the scene responds and evolves in real time.
HappyOyster features two core functions: Wander and Direct.
Among them, the Wander function is the first general-purpose world model to support arbitrary styles and unlimited interaction.
Given only text or an image, it generates a world scene for open-ended exploration, with real-time movement control and camera control sustained for more than a minute.
The Direct function is a real-time AI video-directing engine built on the world model; it can continuously generate 720p real-time video up to 3 minutes long. Through text instructions, we can control the camera, direct characters, and change the plot direction in real time.
The name itself is a deliberate choice: it borrows the classic Shakespearean line, "The world is your oyster."
HappyOyster is now live; we secured an invitation code as soon as it opened and ran a hands-on test.
Experience link: https://www.happyoyster.cn/
Hands-on test:
Alibaba's world model is quite interesting
Let's first try the Wander function.
This function supports generating a world from either text or images.
We can either input prompts directly or set "Character" and "Scene" separately for fine-grained control, and we can switch between first-person and third-person perspectives.
For example, using "Custom Mode," we set the character to "A stylish blonde female model" and the scene to "On the streets of Paris in the 1980s."
HappyOyster didn't output a fixed video. Instead, in a dozen or so seconds, it built a complete Parisian street at night after the rain: puddles reflecting the dim yellow streetlights, cars speeding past, shops lining both sides. All the details obeyed the laws of physics.
Next, we can use the WASD keys to control the character's movement and the arrow keys to move the camera. The character roams freely through this space, and a video is generated at the end.
The entire scene responds in real time, and the process is smooth, without any lag.
The system also automatically matches background music to the scene's atmosphere, and the audio and video stay naturally synchronized.
We also uploaded an anime-style, first-person cycling image. From this single static image, HappyOyster generated a complete scene with spatial structure and movement logic.
As the perspective moves forward, the extension of the road, the distribution of the flower fields, and the layered changes in the distant scenery stay coherent, with no obvious seams or jumps.
The Ghibli-style visual language and the atmosphere of falling cherry blossoms remain consistent throughout the movement.
The Wander function adapts to a wide range of styles; we even walked straight into one of Van Gogh's paintings.
Next, the Direct function. Its biggest highlight is that the content can be changed in real time at any point in the video.
We gave it a Ghibli-style image, and HappyOyster immediately created a Miyazaki-style anime world: a little girl holding a red umbrella, walking along a bumpy country road after the rain.
At this point, we entered the prompt "A cute Ghibli-style kitten suddenly runs to the girl's side." The model didn't re-render; it simply generated a kitten running into the current frame and trotting alongside the girl.
We then added another instruction: "The girl squats down to stroke the kitten." The scene responded immediately once more: the girl crouched and reached out her hand, her movements natural and fluid.
In short, the model accurately adjusts the scene and the characters' actions according to our prompts. The footage stays smooth and natural, and every change connects seamlessly with the plot.
Technical interpretation:
What's the difference between a world model and text-to-video models?
After the test, you probably have an intuitive sense that this is something different from text-to-video models like Sora and Keling. Indeed it is: its underlying logic takes a different path.
Whether it's Sora or Keling, a text-to-video model is essentially a one-shot system. Given a text or image condition, the model organizes content, motion, and rhythm within a predefined time window and then delivers the result. The user gives one input and gets one output, and the process ends. The process is closed and one-off, with no room for intervention along the way.
This paradigm is perfectly adequate for generating a beautiful short film. But if you want to step in mid-stream and change anything already in motion, it is powerless.
A world model works on a completely different principle. It learns how the world evolves: what the current state is, what happens after a given action, and what comes next. It has no preset ending. When there is no new input, the model keeps evolving the world from its current state; when we inject a new instruction mid-stream, it re-infers the subsequent trajectory from the current state. It can be interrupted, steered, and rewritten at any time.
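To make the contrast concrete, here is a minimal sketch of the two interaction patterns in Python. Every name here (`render`, `init_state`, `inject`, `step`) is hypothetical, chosen only to illustrate the control flow; it is not HappyOyster's actual API.

```python
# A minimal sketch, assuming a hypothetical `model` object. The method names
# (render, init_state, inject, step) are illustrative, not HappyOyster's API.

def one_shot_generation(model, prompt, num_frames):
    """Text-to-video: one input, one closed render, no mid-stream intervention."""
    return model.render(prompt, num_frames)      # hands back a finished clip

def interactive_world_loop(model, prompt, get_user_action):
    """World model: an open-ended loop whose state can be steered at any step."""
    state = model.init_state(prompt)             # build the initial world state
    while True:
        action = get_user_action()               # None while the user is idle
        if action is not None:
            state = model.inject(state, action)  # re-condition on the instruction
        state, frame, audio = model.step(state)  # advance the world one step
        yield frame, audio                       # stream the result immediately
```

The key structural difference is that the second function never terminates on its own: the loop keeps producing frames from the current state, and the user's action is just another input folded in at whatever step it arrives.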
For this reason, world models are much harder to train than text-to-video models.
The most direct challenge is speed. A world model must respond the instant the user issues an instruction; any noticeable delay breaks the sense of immersion. To address this, HappyOyster adopts a streaming generation framework that compresses high-dimensional video and multi-modal information into a compact dynamic latent state, significantly reducing the computational overhead of each generation step and enabling low-latency continuous generation. Control signals such as text, images, and roaming instructions are designed as condition variables that can be injected online, so the model can respond to external interaction at any point without resetting the generation process.
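A toy PyTorch sketch of that streaming idea, under our own assumptions (the dimensions, the GRU transition, and the stand-in decoder are all ours, not HappyOyster's architecture): the generator carries a small latent state forward and folds a new control embedding into it at any step, so per-step cost does not grow with history length.

```python
import torch
import torch.nn as nn

class StreamingGenerator(nn.Module):
    """Toy illustration only: a compact latent state is updated step by step,
    and control embeddings can be folded in at any step without resetting."""

    def __init__(self, latent_dim=1024, cond_dim=512):
        super().__init__()
        self.transition = nn.GRUCell(cond_dim, latent_dim)    # cheap one-step update
        self.null_cond = nn.Parameter(torch.zeros(cond_dim))  # "no new instruction"
        self.to_frame = nn.Linear(latent_dim, 256)  # stand-in; a real system would
                                                    # run a full video decoder here

    def step(self, state, cond=None):
        # Cost per step depends on the latent size, not on how long we've run.
        c = self.null_cond.expand(state.size(0), -1) if cond is None else cond
        state = self.transition(c, state)
        return state, self.to_frame(state)

gen = StreamingGenerator()
state = torch.zeros(1, 1024)
state, frame = gen.step(state)                       # world evolves on its own
state, frame = gen.step(state, torch.randn(1, 512))  # instruction injected mid-stream
```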
An even harder problem is maintaining the consistency of the world over long-term evolution. The longer the generation runs, the more likely the scene is to suffer content drift and structural degradation: the laws of physics and the spatial structure gradually lose their hold, and the world slowly becomes unrecognizable. To combat this "amnesia," HappyOyster introduces a continuous state-reuse mechanism. By continuously carrying historical attention states forward, the model can efficiently inherit what it has already generated and update it incrementally, maintaining a stable scene structure and dynamic coherence over a longer time span.
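As we read it, this resembles carrying a rolling cache of attention keys and values across generation steps; here is a hedged sketch of that bookkeeping, where the window size and tensor layout are our assumptions rather than published details.

```python
from collections import deque
import torch
import torch.nn.functional as F

class RollingAttentionCache:
    """Sketch of 'continuous state reuse' as we interpret it: keep a bounded
    window of past attention keys/values so each new step inherits recent
    history instead of recomputing it. Window size and layout are assumptions."""

    def __init__(self, max_steps=256):
        self.keys = deque(maxlen=max_steps)    # oldest entries fall off the end
        self.values = deque(maxlen=max_steps)

    def step(self, q, k, v):
        # q, k, v: (batch, 1, dim) features for the current generation step.
        self.keys.append(k)
        self.values.append(v)
        K = torch.cat(list(self.keys), dim=1)   # (batch, window, dim)
        V = torch.cat(list(self.values), dim=1)
        return F.scaled_dot_product_attention(q, K, V)

cache = RollingAttentionCache()
q = k = v = torch.randn(1, 1, 64)
out = cache.step(q, k, v)  # attends over the cached window plus the new step
```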
For audio-video coordination, rather than modeling audio separately as a post-production layer on top of the video, HappyOyster adopts a unified audio-video generation framework that produces visual and auditory signals synchronously from the same world state. Audio participates in joint generation as part of the world's dynamics, so cross-modal temporal alignment arises naturally.
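The structural payoff of that design can be shown in a few lines; in this sketch the two linear heads are hypothetical stand-ins for real video and audio decoders, and the point is only that both modalities read the same state on the same clock.

```python
import torch
import torch.nn as nn

class JointAVHead(nn.Module):
    """Sketch of joint audio-video generation: both modalities are decoded from
    the same world state at every step, so temporal alignment holds by
    construction rather than being stitched together afterward."""

    def __init__(self, latent_dim=1024):
        super().__init__()
        self.video_head = nn.Linear(latent_dim, 256)  # stand-in video decoder
        self.audio_head = nn.Linear(latent_dim, 128)  # stand-in audio decoder

    def forward(self, state):
        # One state, one clock: the same step yields both modalities.
        return self.video_head(state), self.audio_head(state)

head = JointAVHead()
state = torch.randn(1, 1024)
frame_latent, audio_latent = head(state)  # inherently time-aligned pair
```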
Several representative directions currently exist in the world-model field. Google's Genie focuses on real-time interactive world modeling but is limited in its unified handling of multi-modal input and joint audio-video generation. World Labs, the team led by Fei-Fei Li, takes the route of structured 3D spatial reconstruction, emphasizing geometric consistency rather than long-horizon dynamic generation in pixel space.
HappyOyster chooses to run long-horizon, real-time interactive dynamic world simulation in pixel space, and adds joint audio-video generation on top. Few have walked this path successfully before, and there are few ready-made answers to draw on.
Conclusion
As AIGC has matured, content-generation tools have become quite capable: there are good solutions for writing articles, generating images, and making videos. Yet the field is quietly approaching a new turning point, from "generating content" to "constructing worlds."
The emergence of HappyOyster gives this direction a visible outline. It offers everyone a "customized digital world" that can be entered at any time, modified at any time, and that responds in real time. We can wander through it, direct it, and share it with others so they can continue the story in the world we built.
As for application scenarios, its boundaries extend far beyond on-screen entertainment. Cultural and tourism exhibitions, interactive short dramas, film and television concept validation, brand marketing, live co-creation... it is a natural fit for any scenario that needs a closed loop of real-time perception, real-time generation, and real-time feedback.
In the long run, once combined with hardware such as cameras, sensors, and spatial devices, HappyOyster could anchor a generative environment system continuously driven by real-world signals.
But frankly, world models as a whole are still at an early stage. Physical consistency over long time horizons, causal reasoning in complex scenes, and a deep understanding of the laws of the real world all remain unsolved, hard-core challenges. HappyOyster is among the explorations closest to a "usable product" in this direction, but exploration means the boundaries are not yet fixed.
This is both a limitation and the reason for imagination.
This article is from the WeChat official account “MachineHeart” (ID: almosthuman2014), author: Yang Wen, published by 36Kr with authorization.