After DeepSeek, a large model from the Beijing Academy of Artificial Intelligence (BAAI) has been published in Nature, and it bears on the dominant route toward the "world model".
Zhidx reported on February 1st that, on January 29th Beijing time, the multimodal large model "Wujie·Emu3" developed by the Beijing Academy of Artificial Intelligence (BAAI) was published in the main issue of Nature. This makes BAAI the second Chinese large model research team, after DeepSeek, to achieve this feat, and the paper is China's first Nature paper on the multimodal large model route.
Screenshot of the official Nature website
The editors of Nature commented, "Emu3 achieves unified learning of large-scale text, images, and videos based solely on 'predicting the next token.' Its performance in generation and perception tasks is comparable to that of specialized approaches. This achievement is of great significance for building scalable and unified multimodal intelligent systems and is expected to drive the development of native multimodal assistants, world models, and embodied intelligence."
Emu3 is expected to drive the development of embodied intelligence and other fields
The "Wujie·Emu3" model was launched by BAAI in October 2024. In both perception and generation, Emu3 achieves performance comparable to that of models specialized for specific tasks. This model can perform various tasks such as text-to-image generation, text-to-video generation, future prediction, visual-language understanding, interleaved image-text generation, and embodied operations. This achievement is of great significance for establishing autoregression as a unified route for generative AI.
As shown in the figure below, in image generation, Emu3 outperforms diffusion models such as SDXL on benchmarks like MSCOCO-30K; in video generation, it scores 81 on VBench, surpassing Open-Sora 1.2; in visual-language understanding, it scores 62.1, slightly higher than LLaVA-1.6. Such results are commonplace today, but they were remarkable at the time of Emu3's release.
Main evaluation results of Emu3 in image generation, visual-language understanding, and video generation
Jack Clark, former head of policy at OpenAI and now a co-founder of Anthropic, commented on Emu3 at the time: without relying on fancy architectural tricks, and using only the most basic logic of predicting the next token, this "simplicity" is considered to have strong potential to scale.
It is precisely this "simple" architectural route that matters for lowering the R&D threshold and cost of large models. "The simpler the architecture, the greater the potential productivity and the greater the value to the industry," Wang Zhongyuan, president of BAAI, told Zhidx, "because it simplifies the multimodal AI architecture, reducing complexity and potential errors in R&D and making model construction and maintenance more efficient."
Wang Zhongyuan, the president of BAAI
By October 2025, the "Wujie·Emu" series had iterated into a multimodal world model. Emu3.5 can understand long-horizon, spatially consistent sequences and simulate exploration and operation in virtual worlds. It not only surpasses models such as Google's Nano Banana to reach the multimodal state of the art (SOTA), but also was the first to articulate a "multimodal scaling paradigm", enabling the model to spontaneously learn the underlying laws of how the world evolves and offering an important new path for physical AI fields such as embodied intelligence.
Emu3.5 continues the core idea of unified modeling of multimodal data
Why was Emu3 published in the main issue of Nature and so highly recognized by the international academic community? What original AI technologies were created along the way, and what challenges did the team face? And what practical impact will this have on academia and industry? This article attempts to explore these questions in depth.
Paper title: "Multimodal learning with next-token prediction for large multimodal models"
Paper link: https://www.nature.com/articles/s41586-025-10041-x
GitHub link: https://github.com/baaivision/Emu3
Partial screenshot of the Emu3 paper
01. A 50-person team's pursuit of "unification": A high-stakes technological bet on the future of AI
Development of the Emu3 model was first initiated in February 2024, when the team was re-evaluating the development path of large models. With the rise of GPT-4 and Sora, the autoregressive route of "predicting the next token" had completely changed the field of language models and sparked discussion of early signs of AGI. In multimodal generation, the DiT (Diffusion Transformer) architecture had become mainstream and was beginning to show striking generation results.
Could the autoregressive route serve as a general route that unifies the modalities? This had long been an open question.
The pioneering aspect of Emu3 lies in achieving unified multimodal learning and training a high-performance native multimodal large model using only the autoregressive route of next-token prediction (NTP).
Before the project was launched, the BAAI team went through extensive analysis and debate and reached a consensus: multimodality is the key path to achieving AGI. However, multimodal generation had long been dominated by diffusion models, while visual-language perception was led mainly by compositional methods that stitch components together. These routes had not converged into a unified approach and faced a technological ceiling.
Although some in the industry had tried to unify generation and perception (for example, Emu and Chameleon), these works either simply splice large language models together with diffusion models or perform worse than methods specialized for generation or perception tasks.
Whether the autoregressive architecture could serve as a native, unified multimodal technology route was a major technological decision. At the end of February 2024, BAAI decided to form a 50-person technical research team. They centered R&D on the autoregressive architecture and adopted a discrete-token approach to streamline the architecture and reuse large language model infrastructure at scale, kicking off development of the new multimodal model Emu3.
This model pioneered the unified discretization of images, text, and videos into the same representation space and jointly trained a single Transformer on mixed multimodal sequence data from scratch.
Emu3 can perform different multimodal tasks
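To make the training recipe described above more concrete, below is a minimal sketch of unified next-token prediction over a mixed text-and-image token sequence. The vocabulary split, model sizes, and sequence layout are illustrative assumptions for this sketch, not Emu3's actual configuration.

```python
# Minimal sketch of unified next-token prediction over a mixed multimodal sequence.
# All sizes and the sequence layout are illustrative assumptions, not Emu3's design.
import torch
import torch.nn as nn

TEXT_VOCAB = 32_000                 # hypothetical text vocabulary size
VISION_VOCAB = 32_768               # hypothetical visual codebook size
VOCAB = TEXT_VOCAB + VISION_VOCAB   # one shared vocabulary covering both modalities

class TinyDecoder(nn.Module):
    """A small decoder-only Transformer standing in for the single unified model."""
    def __init__(self, d_model=256, n_heads=4, n_layers=2):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, VOCAB)

    def forward(self, tokens):
        seq_len = tokens.size(1)
        # Causal mask: each position may only attend to earlier positions.
        causal = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
        x = self.embed(tokens)
        return self.head(self.blocks(x, mask=causal))

# One mixed sequence: text prompt tokens followed by discrete image tokens.
text_tokens = torch.randint(0, TEXT_VOCAB, (1, 16))
image_tokens = torch.randint(TEXT_VOCAB, VOCAB, (1, 64))   # ids offset into the visual range
sequence = torch.cat([text_tokens, image_tokens], dim=1)

model = TinyDecoder()
logits = model(sequence[:, :-1])
# The single training signal: predict the next token, whichever modality it belongs to.
loss = nn.functional.cross_entropy(logits.reshape(-1, VOCAB), sequence[:, 1:].reshape(-1))
print(float(loss))
```

The point of the sketch is that the loss does not distinguish between modalities: text, image, and video tokens are simply successive positions in one sequence to be predicted.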
This was a "risky" path that challenged convention, and before it paid off, the BAAI team faced numerous challenges.
First came the technological challenges. Choosing the "discrete token" approach was itself a risk, because it amounted to inventing, for vision and other modalities, a language system analogous to human written language. Images carry more information, but also more redundancy, than text, so compressing them into tokens while still training an effective model was difficult, and setbacks along the way were inevitable (a minimal sketch of this discretization step appears after the three challenges).
Second, there were deeper doubts about the path itself. In 2024, domestic large model teams were racing to replicate GPT-4. Many leading players also invested in multimodal models but wavered along the way; some teams were eventually disbanded because of high resource consumption and a renewed focus on language models. Persevering against this industry backdrop required strong conviction from BAAI's leadership and great determination from the team.
Third, the question of whether multimodality can improve a model's intelligence was not yet settled at the time. The BAAI team firmly believed, however, that if next-generation models were to enter the physical world, text alone would not be enough; they needed a model that "has seen the world". However difficult the intelligence leap of multimodal models and world models might be, they saw it as an unavoidable step on the path to AGI.
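Returning to the first challenge above, the sketch below illustrates what "compressing an image into discrete tokens" means at its simplest: image patches are quantized to the nearest entries of a codebook. In a real tokenizer the codebook and an encoder-decoder are learned jointly and operate on learned features rather than raw pixels; the patch size, codebook size, and helper function here are assumptions for illustration only, not the actual Emu3 vision tokenizer.

```python
# A minimal, illustrative take on discrete visual tokenization via vector quantization.
# Patch size, codebook size, and raw-pixel features are assumptions, not Emu3's tokenizer.
import torch

def tokenize_image(image, codebook, patch=16):
    """Split an image into non-overlapping patches, then map each patch to its
    nearest codebook entry, yielding one integer token id per patch."""
    c, h, w = image.shape
    patches = (image
               .unfold(1, patch, patch)          # carve out rows of patches
               .unfold(2, patch, patch)          # carve out columns of patches
               .permute(1, 2, 0, 3, 4)
               .reshape(-1, c * patch * patch))  # (num_patches, patch*patch*c)
    distances = torch.cdist(patches, codebook)   # distance to every codebook vector
    return distances.argmin(dim=1)               # nearest-neighbour token ids

codebook = torch.randn(32_768, 3 * 16 * 16)      # hypothetical visual codebook
image = torch.rand(3, 256, 256)
tokens = tokenize_image(image, codebook)
print(tokens.shape)  # 256 tokens for a 256x256 image with 16x16 patches
```

Once an image is reduced to such token ids, it can be mixed freely with text tokens in a single sequence and trained with the same next-token objective.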
02. Performance on par with specialized models: Emu3 has deeply influenced the industry's development landscape
Many industry professionals told Zhidx that since its release, the Emu3 model has had a significant impact on the multimodal field and propelled the development of the entire industry, and there is ample evidence that it has been widely used and highly recognized in the industry.
The prerequisite for industrial adoption was that Emu3 first won the battle on "performance": in multimodal generation and perception tasks, its overall performance is comparable to that of a range of mature task-specific models.
First, on text-to-image generation, Emu3's performance on multiple benchmarks such as MSCOCO-30K, GenEval, and T2I-CompBench is comparable to that of the most advanced diffusion models at the time: it surpasses models such as SD1.5 and SDXL and is close to models such as DALL-E 3 and FLUX.1 (Dev).
Emu3's performance is comparable to that of the most advanced diffusion models
As shown in the figure below, in the text-to-image generation task, its results reach the level of diffusion models; in visual-language understanding, it can compete with the mainstream solutions that combine CLIP with large language models.
Emu3 is on a par with mainstream solutions in text-to-image generation and visual-language understanding
In terms of visual-language understanding, as shown in the figure below, Emu3, as a purely encoder-free method, achieves performance comparable to that of its counterparts across multiple benchmarks. It reaches this level of visual-language understanding without relying on a specialized pre-trained large language model or on CLIP.
Evaluation results of Emu3's visual-language understanding ability
In the zero-shot image inpainting case, given the input image (on the left of each row) and the corresponding prompt, Emu3 can accurately fill the masked area within the bounding box and generate semantically aligned content without task-specific fine-tuning.
Emu3 zero-shot image inpainting
Emu3 also has video generation capability: it natively supports generating 5-second videos at 24 frames per second, and these can be extended further through autoregressive decoding. As shown in Extended Data Table 3 of the paper, Emu3's results are highly competitive with those of other video diffusion models, surpassing well-known specialized models of the time such as Open-Sora 1.2, Kling (2024), and Gen-3.
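As a rough illustration of what "extended through autoregressive decoding" can look like, the sketch below keeps sampling the next frame's tokens conditioned on everything generated so far. The tokens-per-frame count, the sampling loop, and the `extend_video` helper are assumptions for illustration, not Emu3's decoding code; `model` can be any decoder that maps a token sequence to per-position logits, such as the TinyDecoder sketch earlier.

```python
# Illustrative frame-by-frame autoregressive extension of a tokenized video clip.
# TOKENS_PER_FRAME and the sampling scheme are assumptions, not Emu3's actual setup.
import torch

TOKENS_PER_FRAME = 1024   # hypothetical number of discrete tokens per video frame

@torch.no_grad()
def extend_video(model, context_tokens, extra_frames, temperature=1.0):
    """Append `extra_frames` frames' worth of tokens to an existing sequence,
    sampling one token at a time from the model's next-token distribution."""
    sequence = context_tokens
    for _ in range(extra_frames * TOKENS_PER_FRAME):
        logits = model(sequence)[:, -1, :]                   # logits for the next position
        probs = torch.softmax(logits / temperature, dim=-1)
        next_token = torch.multinomial(probs, num_samples=1)
        sequence = torch.cat([sequence, next_token], dim=1)  # grow the sequence
    return sequence

# Example usage (with a unified next-token predictor such as the earlier sketch):
# clip_tokens = torch.randint(0, VOCAB, (1, 5 * 24 * TOKENS_PER_FRAME))  # a 5 s, 24 fps clip
# longer_clip = extend_video(model, clip_tokens, extra_frames=24)        # add one more second
```

Because the model only ever predicts the next token, longer videos come from simply continuing the same loop rather than from any separate extension mechanism.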