
After DeepSeek, a large model from the Beijing Academy of Artificial Intelligence (BAAI) has been published in the journal Nature: the paper concerns the dominant route toward the "world model".

Zhidongxi | 2026-02-02 08:21
A high-stakes technological bet on the future of AI.

According to a Zhidongxi report on February 1, on January 29 Beijing time the multi-modal large model "Wujie·Emu", developed by the Beijing Academy of Artificial Intelligence (BAAI), was published in the main issue of Nature. BAAI thus became the second Chinese large-model research team after DeepSeek to achieve this, and the paper is China's first Nature publication focused on the multi-modal large model route.

Screenshot of the official Nature website

The editors of Nature commented: "Emu3 has achieved unified learning of large-scale text, images, and videos based solely on 'predicting the next token'. Its performance in generation and perception tasks is comparable to that of specialized approaches. This achievement is of great significance for building scalable and unified multi-modal intelligent systems and is expected to drive the development of native multi-modal assistants, world models, and embodied intelligence."

Emu3 is expected to drive the development of embodied intelligence and other fields

The "Wujie·Emu3" model was launched by the BAAI in October 2024. In terms of both perception and generation, Emu3 achieves performance comparable to that of models specialized for specific tasks. The model can perform various tasks such as text-to-image generation, text-to-video generation, future prediction, visual-language understanding, interleaved image-text generation, and embodied operations. This achievement is of great significance for establishing autoregression as a unified approach for generative AI.

As shown in the figure below, in image generation Emu3 outperforms diffusion models such as SDXL on benchmarks like MSCOCO-30K; in video generation it scores 81 on VBench, surpassing Open-Sora 1.2; in visual-language understanding it scores 62.1, slightly higher than LLaVA-1.6. Although such results are quite common nowadays, they were remarkable two years ago.

Main evaluation results of Emu3 in image generation, visual-language understanding, and video generation

Jack Clark, the former policy director of OpenAI and now a co-founder of Anthropic, commented on Emu3 at the time: "Without relying on fancy architectural tricks, and using only the most basic logic of predicting the next token, this 'simplicity' is considered to have strong potential to scale."

It is precisely this "simple" architectural approach that matters for lowering the barrier and cost of large model R&D. "The more minimalist the architecture, the more likely it is to have strong productivity and greater value for the industry," Wang Zhongyuan, director of the BAAI, told Zhidongxi. "Because it simplifies the multi-modal AI architecture, it reduces the complexity and potential errors in the R&D process, making the model more efficient to build and maintain."

Wang Zhongyuan, the director of the BAAI

By October 2025, the "Wujie·Emu" series had iterated into a multi-modal world model. Emu3.5 can understand temporally long-horizon, spatially consistent sequences and simulate exploration and operation in virtual worlds. It not only surpassed models such as Google's Nano Banana to reach state-of-the-art (SOTA) results on multi-modal tasks, but was also the first to put forward a "multi-modal scaling paradigm", enabling the model to spontaneously learn the inherent laws of how the world evolves and providing an important new path for physical AI fields such as embodied intelligence.

Emu3.5 continues the core idea of unified modeling of multi-modal data

Why was Emu3 able to be published in the main issue of Nature and highly recognized by the international academic community? What kind of original AI technologies were developed behind it, and what challenges were encountered? What practical impacts will this have on the development of the academic and industrial communities? This article attempts to explore these questions in depth.

Paper title: "Multimodal learning with next-token prediction for large multimodal models"

Paper link: https://www.nature.com/articles/s41586-025-10041-x

GitHub link: https://github.com/baaivision/Emu3

Partial screenshot of the Emu3 paper

01. A 50-person team's pursuit of "unification": A high-stakes technological bet on the future of AI

The Emu3 project was first initiated in February 2024. At that time, the team was re-evaluating the development path of large models. With GPT-4 and Sora wildly popular, the autoregressive approach of "predicting the next token" had completely changed the field of language models and sparked discussion about early signs of AGI, while in multi-modal generation the DiT (Diffusion Transformer) architecture had become mainstream and was beginning to show striking generation results.

Whether the autoregressive approach could serve as a general way to unify multi-modal data, however, remained an open question.

What makes Emu3 groundbreaking is that it achieves unified multi-modal learning and trains a high-performance, native multi-modal large model using only the autoregressive approach of next-token prediction (NTP).

Before the project was initiated, the BAAI team conducted extensive analysis and debate and reached a consensus: multi-modal data is the key path to achieving AGI. Yet multi-modal generation had long been dominated by diffusion models, and visual-language perception was mainly handled by compositional methods that stitch separate components together. These approaches are not unified and face technological ceilings.

Although some in the industry had attempted to unify generation and perception (for example, Emu and Chameleon), those efforts either simply combined large language models with diffusion models or fell short of specialized methods designed for generation or perception tasks.

Whether to believe that the autoregressive architecture could serve as a native way to unify multi-modal data was a major technological decision. At the end of February 2024, BAAI decided to form a 50-person technical research team, centred its R&D on the autoregressive architecture, and used discrete tokens to simplify the architecture and reuse the infrastructure of large language models at scale, thus beginning development of the new multi-modal model Emu3.

The model innovatively discretizes images, text, and videos into the same representation space and jointly trains a single Transformer from scratch on mixed multi-modal sequence data.
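To make this idea concrete, here is a minimal, hypothetical sketch in PyTorch (not the actual Emu3 code) of what "one shared token space plus next-token prediction" means: text tokens and visual-tokenizer IDs share a single vocabulary, and one causal Transformer predicts every next token of a mixed sequence. All sizes and names below are illustrative assumptions.

```python
# Toy illustration only; Emu3 itself is far larger and uses a dedicated visual tokenizer.
import torch
import torch.nn as nn

VOCAB_SIZE = 32000   # assumed shared vocabulary: text BPE IDs plus visual codebook IDs
D_MODEL, N_HEAD, N_LAYER, MAX_LEN = 256, 4, 2, 128

class TinyUnifiedLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.tok_embed = nn.Embedding(VOCAB_SIZE, D_MODEL)
        self.pos_embed = nn.Embedding(MAX_LEN, D_MODEL)
        layer = nn.TransformerEncoderLayer(D_MODEL, N_HEAD, 4 * D_MODEL, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, N_LAYER)
        self.lm_head = nn.Linear(D_MODEL, VOCAB_SIZE)

    def forward(self, tokens):  # tokens: (batch, seq_len) of IDs from the shared vocabulary
        seq_len = tokens.size(1)
        # Causal mask: each position may only attend to itself and earlier positions.
        causal = torch.triu(torch.full((seq_len, seq_len), float("-inf")), diagonal=1)
        x = self.tok_embed(tokens) + self.pos_embed(torch.arange(seq_len))
        x = self.blocks(x, mask=causal)
        return self.lm_head(x)  # logits over the shared vocabulary

# A mixed-modality sequence: e.g. [text tokens ...] followed by [image tokens ...].
batch = torch.randint(0, VOCAB_SIZE, (2, 64))
model = TinyUnifiedLM()
logits = model(batch[:, :-1])
# Next-token prediction: every position predicts the following token, whatever its modality.
loss = nn.functional.cross_entropy(logits.reshape(-1, VOCAB_SIZE), batch[:, 1:].reshape(-1))
loss.backward()
print(float(loss))
```

In a real system the image and video tokens would come from a learned visual tokenizer, and the sequences would typically also carry special tokens marking modality boundaries; the point of the sketch is only that a single cross-entropy objective covers all modalities at once.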

Emu3 can perform different multi-modal tasks

This was a "risky" path that challenged the tradition. Before achieving success, the BAAI team faced numerous challenges.

Firstly, there were obvious technical challenges. Choosing the "discrete token" route was itself a risk, because it amounted to inventing, for vision and other modalities, a language system analogous to human written language. Images carry more information, and more redundancy, than text, so it was hard to train an effective model while compressing images into tokens, and setbacks along the way were inevitable (a toy sketch of this image-to-token discretization follows the three challenges listed here).

Secondly, there were deeper doubts about the path itself. In 2024, domestic large model teams were busy replicating GPT-4. Many leading players also made plans for multi-modal models, but wavered along the way and eventually disbanded those teams, citing high resource consumption and a focus on language models. That BAAI persevered in this industrial climate required strong conviction from its leadership and great perseverance from the team.

Thirdly, the question of whether multi-modal data can improve a model's intelligence was not fully settled at the time. The BAAI team nevertheless firmly believed that if next-generation models were to enter the physical world, text alone would not be enough: a model that "has seen the world" was needed. However hard the intelligence upgrade of multi-modal models, and even world models, might be to achieve, they saw it as an unavoidable path to AGI.
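As for the first of these challenges, "compressing images into tokens" roughly means the following (a toy sketch with assumed sizes, not BAAI's actual tokenizer): vector quantization snaps each continuous patch feature to its nearest entry in a learned codebook, and the entry's index becomes a discrete visual "word".

```python
# Toy vector-quantization step: continuous patch features -> discrete token IDs.
import torch

codebook = torch.randn(512, 64)        # assumed codebook: 512 visual "words", 64-dim each
patch_features = torch.randn(256, 64)  # e.g. encoder features for 256 patches of one image

# Distance from every patch feature to every codebook entry, then pick the nearest.
distances = torch.cdist(patch_features, codebook)   # shape (256, 512)
token_ids = distances.argmin(dim=1)                 # one discrete ID per patch

print(token_ids[:8])  # these IDs can now sit in the same sequence as text token IDs
```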

02. Performance comparable to specialized models: In two years, Emu3 has profoundly influenced industrial development

Many industry professionals told Zhidongxi that since the release of the Emu3 model more than two years ago, it has had a significant impact on the multi-modal field and promoted the development of the entire industry. There is evidence that it has been widely used and highly recognized in the industry.

The prerequisite for industrial application is that Emu3 first won the "performance" battle. In multi-modal generation and perception tasks, the overall performance of Emu3 is comparable to that of various mature task-specific models.

Firstly, consider text-to-image generation. On multiple benchmarks such as MSCOCO-30K, GenEval, and T2I-CompBench, Emu3 is comparable to the most advanced diffusion models of the time: it surpasses models such as SD1.5 and SDXL and comes close to DALL-E 3 and FLUX.1 (Dev).

The performance of Emu3 is comparable to that of the most advanced diffusion models

As shown in the figure below, in the text-to-image generation task its results reach the level of diffusion models; in visual-language understanding, it can compete with mainstream solutions that combine CLIP with large language models.

Emu3 is comparable to mainstream solutions in text-to-image generation and visual-language understanding

In visual-language understanding, as shown in the figure below, Emu3, a purely encoder-free method, achieves performance comparable to its peers across multiple benchmarks, and it does so without relying on a specialized pre-trained large language model or CLIP.

Evaluation results of Emu3 in visual-language understanding ability

In the zero-shot image inpainting examples, given an input image (on the left of each row) and the corresponding prompt, Emu3 accurately fills the masked area within the bounding box and generates semantically aligned content without task-specific fine-tuning.

Zero-shot image inpainting by Emu3

Emu3 also has video generation capability: it natively supports generating 5-second videos at 24 frames per second, and these clips can be extended autoregressively. As shown in Extended Data Table 3, Emu3's results are highly competitive with other video diffusion models, surpassing well-known specialized models of the time such as Open-Sora V1.2, Kling (2024), and Gen-3.
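To illustrate what "extended autoregressively" means in general terms (a conceptual sketch with made-up sizes and a random stand-in model, not Emu3's interface): the tokens of an existing clip act as the prompt, and the tokens of additional frames are sampled one by one, each conditioned on everything generated so far.

```python
# Conceptual sketch of autoregressive video extension; the model here is a random stand-in.
import torch

VOCAB_SIZE = 32000
TOKENS_PER_FRAME = 4   # toy value; real visual tokenizers use far more tokens per frame

def stand_in_model(tokens):
    # Placeholder for a trained next-token predictor: (batch, seq_len) -> (batch, seq_len, vocab).
    return torch.randn(tokens.size(0), tokens.size(1), VOCAB_SIZE)

def extend_video(model, clip_tokens, extra_frames):
    """Append `extra_frames` frames by sampling their tokens one at a time,
    each conditioned on the full sequence generated so far."""
    tokens = clip_tokens.clone()
    for _ in range(extra_frames * TOKENS_PER_FRAME):
        logits = model(tokens.unsqueeze(0))[0, -1]              # logits for the next position
        next_id = torch.multinomial(torch.softmax(logits, dim=-1), 1)
        tokens = torch.cat([tokens, next_id])
    return tokens

clip = torch.randint(0, VOCAB_SIZE, (24,))        # token IDs of an existing short clip
longer = extend_video(stand_in_model, clip, extra_frames=2)
print(longer.shape)                               # 24 + 2 * TOKENS_PER_FRAME tokens
```

With a real trained model in place of the stand-in, the same loop is what lets a fixed-length clip be continued frame by frame.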

Comparison of Em