
LeCun's prediction comes true: 790 years of video train the most powerful open-source "world model"

新智元 · 2025-11-03 16:42
Following the pre-training and post-training of language models, the AI field is now entering the "third scaling paradigm" for multimodality.

[Introduction] The third Scaling paradigm of AI is here! The natively multimodal world model Emu3.5 has arrived: 34 billion parameters, trained on 790 years of long-video data, with per-image inference speed boosted by about 20x.

In 2025, the "world model" has become a battleground for AI giants.

Google's Genie 3 can generate a new 720p, real-time simulated world from a single sentence; some netizens have even hailed it as the start of the "Game Engine 2.0" era.

The World Labs team led by Fei-Fei Li has also launched RTFM, a real-time world generation model that can render a 3D world on a single H100.

There are also the "Code World Model" (CWM) from Meta FAIR, Runway's "General World Model" (GWM), and Tesla's neural-network simulator. Players across the AI field are moving in force.

In particular, multimodal "world models" have become the core focus of this investment.

Proponents of the "world model", such as Fei-Fei Li and LeCun, have long argued that AI cannot replicate human intelligence through language alone; it also needs to understand and simulate the physical world.

The world model is the ultimate answer. It can predict the world by imitating the "mental model" that humans form of the surrounding environment.

Last week, the artificial intelligence field was hit by another bombshell.

The Beijing Academy of Artificial Intelligence (BAAI) officially released the latest achievement of its Wujie·Emu series: Emu3.5.

At the technical exchange meeting, Dr. Wang Zhongyuan, director of BAAI, positioned it as a milestone that "opens a new era of large multimodal world models".

"Not all large-model technical routes have to completely follow the paths that others have taken. We are also creating some new technical paths ourselves," Wang Zhongyuan said. "The Emu series is a technical route we have developed on our own, and we are leading the way."

Unlike today's mainstream "modular splicing" multimodal models (such as LLM + CLIP or DiT architectures), which separate understanding from generation, Emu3.5 returns to first principles: it learns from continuous, long-horizon visual experience the way humans do, and uses a unified autoregressive architecture to achieve native understanding and generation of the multimodal world.

"Through Wujie·Emu3, we verified the feasibility of using an autoregressive architecture to achieve the unity of multimodal understanding and generation," Wang Zhongyuan said. "From Emu3 to Emu3.5, we proved that there is also a Scaling paradigm in the multimodal field."

This 34-billion-parameter model performs so well across dimensions such as long-text rendering, complex image editing, and visual story generation that the industry can't help but exclaim "Wow". More importantly, its deep grasp of the dynamics, causality, spatiotemporal structure, and logic of the physical world signals that AI is accelerating its move from the digital world into the physical one.

BAAI also published a detailed 45-page technical report covering data processing, model architecture, training methods, inference acceleration, and other technical details.

Project homepage: https://zh.emu.world

Technical report: https://arxiv.org/pdf/2510.26583

Behind this is BAAI's persistence in "leading original innovation in artificial intelligence" and its confidence in future technical routes.

For several fundamental problems in the current global large-model race, Wujie·Emu3.5 offers an original solution from China, one that is logically coherent and holds great potential:

  • How should multimodality be unified? Through a native, end-to-end autoregressive "Next-State Prediction" paradigm.

  • What should a world model learn? From long-video data that carries world knowledge with long-horizon, highly consistent context.

  • How can it scale? Through the third Scaling paradigm of "pretraining + multimodal RL", reusing the existing LLM infrastructure.

  • How can it be deployed? By solving the efficiency bottleneck with inference-acceleration techniques such as DiDA.

First principles: learning like a human, from Next-Token to Next-State

"Human learning does not start with text learning," Wang Zhongyuan repeatedly emphasized this point at the press conference.

When a baby opens its eyes, it first perceives the visual world. Through observation and interaction, it gradually understands physical laws and causal relationships. Language is a tool developed on this basis for communication and generalization.

With Internet text data nearly exhausted, the growth of today's large language models (LLMs) is showing signs of fatigue. In the multimodal field, technical routes have yet to converge. Most mainstream video and image generation models, such as Sora and Nano Banana, use hybrid architectures like the Diffusion Transformer (DiT). In essence, they are still "assembled": the understanding and generation modules are separate, which makes genuinely unified intelligence hard to achieve.

Since its inception, the Emu series has chosen a more difficult but more fundamental path: native multimodality.

Emu3.5 inherits and greatly develops this concept. It adopts an extremely simple but powerful unified paradigm: Next-State Prediction.

Similar to how an LLM predicts the next text token, Emu3.5 "tokenizes" images, text, and even action instructions, places them in a unified sequence, and then uses a single, end-to-end autoregressive Transformer to predict the next token in that sequence.

This "token" can be a piece of text, a "visual word" that makes up part of an image, or even an instruction guiding a robot arm's movement.
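
To make the paradigm concrete, here is a minimal, self-contained sketch of what "one sequence, one predictor" looks like in code. The special tokens, vocabulary, and tiny Transformer below are illustrative stand-ins, not Emu3.5's actual tokenizer or architecture; the point is simply that text, image, and action tokens share a single autoregressive stream.

```python
import torch
import torch.nn as nn

BOS, BOI, EOI, BOA, EOA = 0, 1, 2, 3, 4  # hypothetical special tokens (begin/end of image/action)

def build_sequence(text_ids, image_ids, action_ids):
    """Interleave modalities into one token sequence."""
    return torch.tensor([BOS, *text_ids, BOI, *image_ids, EOI, BOA, *action_ids, EOA])

class TinyAutoregressiveModel(nn.Module):
    """A toy decoder-only Transformer: one model predicts the next token,
    regardless of whether that token is text, image, or action."""
    def __init__(self, vocab_size=65536, d_model=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, tokens):
        x = self.embed(tokens)
        causal = nn.Transformer.generate_square_subsequent_mask(tokens.size(1))
        return self.head(self.blocks(x, mask=causal))  # logits over the next token

seq = build_sequence(text_ids=[101, 102], image_ids=[5001, 5002, 5003], action_ids=[9001])
logits = TinyAutoregressiveModel()(seq.unsqueeze(0))
next_token = logits[0, -1].argmax()  # the predicted "next state", whatever modality it belongs to
```

In such a setup, training reduces to ordinary next-token cross-entropy over the mixed sequence, which is precisely why the mature LLM toolchain can be reused.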

The advantages of this architecture are obvious:

Unity: it completely removes the barrier between understanding and generation. When generating an image, the model grounds its output in a deep understanding of the context, including the preceding images and text.

Scalability: it can directly reuse the highly mature training, inference, and reinforcement-learning infrastructure built for LLMs. All the Scaling Laws and optimization techniques validated on LLMs can, in principle, be carried over to Emu3.5.

"We can finally scale up multimodal large models," Wang Zhongyuan said with confidence.

The third Scaling paradigm: 790 years of long-video data plus large-scale multimodal RL

If the unified architecture is the skeleton, then a large amount of high - quality data is the flesh and blood.

The training data volume of Emu3.5 is astonishing: more than 13 trillion multimodal tokens.

Its core is no longer short video clips or static image-text pairs, but long Internet videos with a cumulative duration of 790 years, spanning documentaries, instructional videos, vlogs, game footage, and more.
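
For a sense of scale, here is a quick back-of-the-envelope check on the reported figures. The implied token rate is only an upper bound, since the 13-trillion-token corpus also includes text and other modalities.

```python
# Rough scale check for the reported figures: 790 years of video, >13 trillion tokens.
SECONDS_PER_YEAR = 365 * 24 * 3600            # ~3.15e7 seconds
video_seconds = 790 * SECONDS_PER_YEAR        # ~2.49e10 seconds
video_hours = video_seconds / 3600            # ~6.9 million hours of footage

total_tokens = 13e12
tokens_per_video_second = total_tokens / video_seconds   # ~520, an upper bound only

print(f"{video_hours:,.0f} hours of video; at most ~{tokens_per_video_second:.0f} tokens per video-second")
```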

"Long videos contain speech and accompanying text, and they offer long context and strong consistency," explained Wang Xinlong, R&D lead of the Emu series. Compared with isolated data points, long videos naturally carry rich spatiotemporal continuity, causal logic, and contextual consistency, making them excellent nourishment for learning world models.

To process this large amount of data, the BAAI team built a complex automated data-processing pipeline, including scene segmentation, automatic speech recognition (ASR), key-frame extraction, quality assessment, redundancy removal, and multimodal summary generation.
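
As a rough illustration of how such a pipeline fits together, here is a structural sketch. Every stage below is a trivial placeholder (the report's actual models, tools, and thresholds are not reproduced); only the ordering and filtering logic of the stages is meant to be informative.

```python
from dataclasses import dataclass, field

@dataclass
class Clip:
    video_path: str
    start: float
    end: float
    transcript: str = ""
    keyframes: list = field(default_factory=list)
    quality: float = 0.0
    summary: str = ""

def segment_scenes(path):                 # scene/shot boundary detection
    return [Clip(path, 0.0, 30.0), Clip(path, 30.0, 65.0)]

def run_asr(clip):                        # automatic speech recognition
    return "placeholder transcript"

def extract_keyframes(clip):              # representative frame selection
    return [clip.start, (clip.start + clip.end) / 2, clip.end]

def score_quality(clip):                  # aesthetic / resolution / motion checks
    return 0.8

def is_redundant(clip, seen):             # near-duplicate removal via a cheap key
    key = (round(clip.start), round(clip.end))
    if key in seen:
        return True
    seen.add(key)
    return False

def summarize(clip):                      # multimodal summary generation
    return f"clip {clip.start}-{clip.end}s: {clip.transcript[:40]}"

def process_video(path, quality_threshold=0.5):
    """Run one long video through the stages, yielding clips that pass the filters."""
    seen = set()
    for clip in segment_scenes(path):
        clip.transcript = run_asr(clip)
        clip.keyframes = extract_keyframes(clip)
        clip.quality = score_quality(clip)
        if clip.quality < quality_threshold or is_redundant(clip, seen):
            continue
        clip.summary = summarize(clip)
        yield clip

clips = list(process_video("example.mp4"))
```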

In terms of training, Emu3.5's path is clear and firm:

  • Large-scale pretraining

In the first pretraining stage, on more than 10 trillion tokens, the model learns basic multimodal alignment and generation capabilities. The whole training run was "very stable": on multiple unseen downstream validation sets, the loss decreases steadily as compute increases, strong evidence that the Scaling paradigm holds here as well.

  • Large-scale multimodal reinforcement learning (RL)

This is another major innovation of Emu3.5. Reinforcement learning is known to be the key to eliciting the reasoning and instruction-following abilities of LLMs (such as GPT-4o and DeepSeek-R1), but applying it to the more complex, longer-sequence multimodal setting is fraught with difficulty.

Thanks to the unified autoregressive architecture, Emu3.5 achieves unified multi-task, multimodal reinforcement learning for the first time. The team built a comprehensive reward system that combines general rewards (such as aesthetics and image-text consistency) with task-specific rewards (such as OCR accuracy and face-identity preservation), and optimized them in a unified reward space with the GRPO algorithm.
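
As a rough sketch of what "optimizing heterogeneous rewards in one space" can look like, the snippet below combines hypothetical general and task-specific reward scores into one scalar and turns them into GRPO-style group-relative advantages (reward minus group mean, divided by group standard deviation). The individual reward functions and weights are invented for illustration and are not taken from the Emu3.5 report.

```python
import numpy as np

def general_rewards(sample):
    # e.g., aesthetics and image-text consistency scores in [0, 1]
    return {"aesthetics": sample["aesthetics"], "consistency": sample["consistency"]}

def task_rewards(sample):
    # e.g., OCR accuracy for text rendering, face-identity similarity for editing
    return {"ocr": sample.get("ocr", 0.0), "face_id": sample.get("face_id", 0.0)}

def combined_reward(sample, weights):
    scores = {**general_rewards(sample), **task_rewards(sample)}
    return sum(weights.get(name, 0.0) * value for name, value in scores.items())

def grpo_advantages(group, weights, eps=1e-6):
    """Group-relative advantages for a set of rollouts sharing the same prompt."""
    rewards = np.array([combined_reward(s, weights) for s in group])
    return (rewards - rewards.mean()) / (rewards.std() + eps)

weights = {"aesthetics": 0.3, "consistency": 0.3, "ocr": 0.2, "face_id": 0.2}  # illustrative only
group = [
    {"aesthetics": 0.8, "consistency": 0.9, "ocr": 0.7},
    {"aesthetics": 0.6, "consistency": 0.7, "ocr": 0.9},
    {"aesthetics": 0.9, "consistency": 0.5, "face_id": 0.8},
]
print(grpo_advantages(group, weights))  # better-than-average rollouts get positive advantage
```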

Wang Zhongyuan calls this combination of "large-scale long-video pretraining + large-scale multimodal RL" the "third Scaling paradigm", following language-model pretraining and post-training. It points to a path: by continuing to scale video data, model parameters, and compute, the capabilities of multimodal world models will keep improving predictably.

DiDA, the "black tech" that boosts autoregressive inference speed by 20x

Autoregressive models generate images one token at a time, which makes high-resolution generation slow (a single image typically requires thousands of tokens). This is why diffusion models have long dominated the generation field.

To overcome this, the Emu3.5 team developed a technique called Discrete Diffusion Adaptation (DiDA).

The core idea of DiDA is that, after the model completes large-scale autoregressive pretraining and post-training, a lightweight "adaptation" stage converts it from one-token-at-a-time prediction to parallel generation.

Specifically, it borrows the idea of discrete diffusion and recasts image generation as a "denoising" process: instead of generating from left to right, the model produces all of the "noisy" visual tokens at once and then refines them in parallel, bidirectionally, over a few steps until a clean image emerges.
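
The report's exact DiDA procedure is not reproduced here, but the flavor of "predict everything in parallel, then iteratively refine" can be shown with a generic MaskGIT-style confidence schedule. Treat the snippet below as a conceptual sketch only: the masking schedule, step count, and stand-in model are assumptions, not Emu3.5's implementation.

```python
import torch

MASK_ID = 0  # hypothetical "noisy/unknown" token id

@torch.no_grad()
def parallel_decode(model, num_tokens, steps=8):
    """Start from fully masked visual tokens and refine them bidirectionally."""
    tokens = torch.full((1, num_tokens), MASK_ID)        # every position starts "noisy"
    for step in range(steps):
        logits = model(tokens)                           # bidirectional pass sees all positions
        confidence, prediction = logits.softmax(dim=-1).max(dim=-1)
        keep = max(1, int((step + 1) / steps * num_tokens))  # unmask more tokens each step
        top_positions = confidence[0].topk(keep).indices
        tokens = torch.full_like(tokens, MASK_ID)        # re-mask everything...
        tokens[0, top_positions] = prediction[0, top_positions]  # ...except the most confident guesses
    return tokens

# Stand-in "model": random logits of the right shape, just to exercise the loop.
fake_model = lambda t: torch.randn(t.shape[0], t.shape[1], 4096)
image_tokens = parallel_decode(fake_model, num_tokens=1024, steps=8)
# A handful of parallel passes replaces ~1,000 sequential decoding steps,
# which is the kind of gap behind the reported ~20x per-image speedup.
```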

The result? Per-image inference speed increases by about 20 times, with almost no loss in performance.

This means that, for the first time, the inference efficiency of Emu3.5's autoregressive model can compete with top closed-source diffusion models (such as Midjourney). It is not only an engineering victory; it also removes a fundamental bottleneck to commercializing the native multimodal architecture.

From image editing to embodied operation: the best in the open-source field

Theoretical superiority ultimately depends on practical results. The results of Emu3.5 are exciting for any practitioner.

Top-notch Any-to-Image generation and editing:

Emu3.5 can not only generate high-quality images containing complex formulas and Chinese-English couplets, but also reaches a new level in image editing. On authoritative benchmarks such as ImgEdit and GEdit-Bench, Emu3.5's scores comprehensively surpass those of all publicly available models, including Gemini 1.5 Flash and Qwen-VL-Max.

  • High-level semantic understanding:

By combining specified characters, specific scenes, and arbitrary objects, Emu3.5 can construct a logically coherent new world, demonstrating strong imagination and world-building ability.

  • Numerical and spatial understanding:

Given the instruction "Replace the object labeled 4 in the picture with a movie poster", the model can accurately locate and replace it.

  • Perspective transformation:

Given a front view of a building and the instruction "switch to a top-down view", the model can generate the new perspective plausibly, as if it had 3D modeling capability.

  • Long-term, high-consistency "world learning" ability:

This ability is the core manifestation of Emu3