
Peking University has found a solution to the multimodal AI "infighting" problem identified by Zhang Xiangyu.

Friends of 36Kr · 2025-09-19 18:47
Inside multimodal AI, there has always been a "civil war."

In June this year, Zhang Xiangyu, chief scientist of StepFun, mentioned in an interview the biggest dilemma he had encountered in model training over the past two years: there has always been a "civil war" inside multimodal AI.

Specifically, when training a unified multimodal model, the visual "understanding" and "generation" abilities can coexist, but they rarely collaborate and often work against each other. During joint training, improving one ability can even degrade the other's performance.

This runs completely contrary to our intuition. For a human, the deeper their understanding of a picture, the more exquisitely they may be able to paint it. In multimodal models, however, there is no effective "information gain" or "mutual promotion" between understanding and generation.

Zhang Xiangyu's explanation is that image generation is simply too complex, demanding intricate spatial planning, physical common sense, and semantic reasoning. Although the Transformer is powerful, the number of logical reasoning steps it can perform in a single forward pass is limited. It is too difficult for it to generate, in one shot, an image that meets all the physical, geometric, and semantic constraints of an instruction like "Draw an astronaut riding a bicycle with square wheels on the moon."

During training, this single-step reasoning makes the gradient signal too coarse. The trained understanding module cannot effectively guide the generation module, and vice versa: failures in the generation module cannot effectively help the understanding module improve.

Therefore, Zhang Xiangyu's proposed solution is to introduce language-style "Chain-of-Thought" reasoning into the multimodal model: let it think and create step by step, avoiding the coarse gradient signals caused by single-step reasoning.

Recently, a new study from Peking University, "Can Understanding and Generation Truly Benefit Each Other, or Just Coexist?", proposed a brand-new framework called UAE, offering another solution to this problem.

Paper: https://arxiv.org/abs/2509.09666

Zhang Xiangyu's Chain-of-Thought solution makes sense, but it mainly addresses the complexity of single-step reasoning. The Peking University team identified a more fundamental problem: the training objectives of understanding and generation are inherently separate. Even with Chain-of-Thought, the two modules are still chasing different KPIs.

Therefore, the UAE team chose a more radical approach: rather than having the model reason step by step over the same complex task, redefine the task itself so that understanding and generation become two links in one and the same process.

01 The Road to Unity: From Separate Governance to Pipeline Collaboration

To understand the essence of this paper, we must first figure out the fundamental problems of the old methods for unified multimodal models.

The old methods resembled endless infighting under a "diarchy".

Imagine there are two master craftsmen in a workshop. We call them the "Understanding Craftsman" and the "Generation Craftsman".

The KPI of the "Understanding Craftsman" is the accuracy of semantic abstraction. His task is to understand a painting and summarize its core content in the most concise and accurate language. He needs to ignore minor changes and grasp the essence and relationships of things. Therefore, to do this job well, his logic is a cognitive process from the specific to the abstract.

The KPI of the "Generation Craftsman" is the fidelity of pixel restoration. His task is to draw a painting according to the instruction. His work will be examined under a microscope. Therefore, he must pay extreme attention to details, materials, and the statistical laws of the physical world. To achieve sufficient restoration, his logic is a construction process from the abstract to the specific.

In many past attempts at "unified models", researchers tried to let the same model (with the same set of core parameters) play these two roles simultaneously and evaluate it with these two completely different sets of KPIs.

The two optimization goals conflict with each other at the underlying logic. Their gradient updates pull against each other in the model's parameter space, making the training process extremely unstable. Eventually, it often fails to do well in both aspects or sacrifices one for the other.
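This tug-of-war can be illustrated with a deliberately tiny toy example (hypothetical, not the paper's actual losses): two objectives share one parameter but pull it toward opposite targets.

```python
# Toy illustration of conflicting objectives on one shared parameter w.
w = 0.0  # the shared parameter, at its initial value

def grad_understanding(w):
    # Gradient of (w - 1)^2: gradient descent on this pulls w toward +1.
    return 2 * (w - 1)

def grad_generation(w):
    # Gradient of (w + 1)^2: gradient descent on this pulls w toward -1.
    return 2 * (w + 1)

g_u, g_g = grad_understanding(w), grad_generation(w)
joint_grad = g_u + g_g
# The two gradients point in opposite directions and cancel each other out,
# so the shared parameter barely moves and neither objective improves.
```

At w = 0 the two gradients are exactly opposite and their sum is zero: a stylized version of the stalled, unstable training that the "diarchy" produces.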

To avoid this head-on conflict, some works chose a "decoupling" strategy. Researchers first trained the "Understanding Craftsman" and the "Generation Craftsman" independently to the top level, then built a liaison office (an adapter module) through which they communicate in a limited way. This avoids the infighting and does give the model both abilities.

However, this is a "unification in name only". They only "coexist" under the same roof without forming a real synergy and mutual gain.

The new method, by contrast, sets a common goal under "pipeline operation".

Facing the dilemma of "diarchy", the proposers of the UAE framework made a fundamental change: abolish the two independent sets of KPIs, establish a unified pipeline, and set up a single, ultimate quality-inspection standard.

The core of this idea comes from the classic "Auto-Encoder" model.

The logic of the auto - encoder is very simple: It consists of an encoder and a decoder. The encoder is responsible for compressing the input data (such as a picture) into a compact representation (usually a vector) containing the core information. The decoder is responsible for reading this compressed representation and trying its best to restore it to the original input data.

The training goal of the whole system is only one: to make the restored output as similar as possible to the original input.
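The structure can be sketched in a few lines. Below is a minimal linear auto-encoder with made-up dimensions, purely to make the encoder/decoder/objective shape concrete; it is not the model used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Minimal linear auto-encoder: 8-dim input -> 2-dim code -> 8-dim output.
W_enc = rng.normal(size=(8, 2)) * 0.1  # encoder weights (compression)
W_dec = rng.normal(size=(2, 8)) * 0.1  # decoder weights (restoration)

def encode(x):
    # Compress the input into a compact core representation.
    return x @ W_enc

def decode(z):
    # Try to restore the original input from the compact representation.
    return z @ W_dec

def reconstruction_loss(x):
    # The single training objective of the whole system:
    # make decode(encode(x)) as close to x as possible.
    x_hat = decode(encode(x))
    return float(np.mean((x - x_hat) ** 2))

x = rng.normal(size=(4, 8))  # a batch of 4 toy "inputs"
loss = reconstruction_loss(x)
```

Everything the encoder fails to put into the 2-dim code is information the decoder can never recover, which is exactly why reconstruction quality measures how good the compression is.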

The UAE framework cleverly maps this structure to the tasks of understanding and generation.

Understanding is encoding (the compression process): on the pipeline, the "understanding model", built on Qwen2.5-VL 3B, plays the encoder. It is the first station on the line: it receives an original image and "compresses" all of its key, describable semantic information, as losslessly as possible, into a detailed and structured text description. This text is the image's core information representation.

Generation is decoding (the restoration process): the "generation model", built on SD3.5-large, plays the decoder, the second station on the line. It receives the text description produced upstream, and its only task is to "decompress" this information and reconstruct the original image from it.

On this pipeline, the old contradictions dissolve. The two craftsmen now share one KPI: ensure that the "reconstructed image" coming off the end of the line restores, as perfectly as possible, the original image fed in at the start.

Why is the reconstruction similarity a good indicator for measuring unity?

Because if the understanding module really "understands" the original image, its description should contain all the key information. And if the generation module really "understands" the description, it should be able to reproduce all the elements of the original image.

So if the reconstructed image is highly similar to the original image, it means that the information has achieved almost lossless transmission on the link of understanding → text → generation.
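That quality-inspection standard can be sketched as a similarity-based reward. The paper scores semantic similarity with learned models; the cosine similarity and toy embedding vectors below are stand-ins for illustration only.

```python
import numpy as np

def cosine_similarity(a, b):
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def reconstruction_reward(original_emb, reconstructed_emb):
    # Map cosine similarity from [-1, 1] into a [0, 1] reward:
    # 1.0 means the description carried essentially all the information
    # needed to rebuild the original image.
    return (cosine_similarity(original_emb, reconstructed_emb) + 1) / 2

# Toy semantic embeddings of an original image and two reconstructions.
orig = np.array([0.9, 0.1, 0.4])
faithful = np.array([0.85, 0.15, 0.38])   # close to the original
off_target = np.array([-0.2, 0.9, -0.5])  # missed the key content
```

A faithful reconstruction lands near the original in embedding space and earns a high reward; an off-target one does not, penalizing information lost anywhere along the understanding → text → generation link.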

02 Left-right Cycle and Two-way Strengthening in Training

Designing the new organizational structure of the "pipeline" is only the first step. The more crucial question is: How to train the two craftsmen on this pipeline, let them grow from novices to masters, and finally achieve perfect cooperation?

UAE proposes a three-stage training strategy called Unified-GRPO, achieving the "left-right cycle, two-way strengthening" of understanding and generation.

Stage 1: Cold-start reconstruction (onboarding and initial alignment)

Just as two strangers need to establish basic communication and cooperation, the understanding and generation modules first need to establish a preliminary collaborative relationship in a relaxed environment.

In this stage, the system will receive an original image. The "understanding module" will generate a description, and then the "generation module" will reconstruct the image. Then, a basic loss will be calculated directly based on the semantic similarity between the reconstructed image and the original image, and this loss will be used to update the parameters of both modules simultaneously.

The goal of this stage is very simple: to ensure that the generation module can reconstruct a semantically similar image from the output of the understanding module and establish a basic information transmission channel.
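In spirit, the cold start is one reconstruction objective whose gradient updates both modules at once. A deliberately tiny numerical stand-in (scalar "modules", not the real networks):

```python
# Tiny stand-in for the cold start: a scalar "understanding" weight a and
# a scalar "generation" weight b, updated jointly on one reconstruction loss.
a, b, lr = 0.5, 0.5, 0.1
x = 1.0  # the "original image"

def loss(a, b):
    # Reconstruction error of the two-stage pipeline b * (a * x).
    return (x - b * a * x) ** 2

initial_loss = loss(a, b)
for _ in range(50):
    err = x - b * a * x
    # The single shared objective sends gradient into BOTH modules.
    a += lr * 2 * err * b * x
    b += lr * 2 * err * a * x
```

After a few dozen joint updates the pipeline's reconstruction error shrinks toward zero, which is all Stage 1 asks for: a working information channel between the two modules.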

Stage 2: Generation serves understanding, focusing on training the "Understanding Craftsman"

After the onboarding, the real specialized training begins. This is the first step of the "left-right cycle". The coach's goal is to train the "Understanding Craftsman" into a top-level communicator.

The training process is as follows:

1. Freeze the "Generation Craftsman": in this stage, the ability of the "Generation Craftsman" (the generation model) is temporarily fixed. He learns no new skills and instead plays a "quality inspector" or "sparring partner" with stable performance.

2. The "Understanding Craftsman" tries and errs repeatedly: now the "Understanding Craftsman" (the understanding model) is the only trainee. He receives an original image and tries to generate a description.

3. The sparring partner executes: the frozen "Generation Craftsman" takes this description and does his best to reconstruct the image.

4. The coach scores: the coach (the reinforcement learning algorithm) compares the reconstructed image with the original and hands out rewards or penalties.

Through thousands of such cycles, the "Understanding Craftsman" is forced to learn how to produce the descriptions most useful to the "Generation Craftsman". This is the first direction of the "two-way strengthening": the results of generation feed back to strengthen the depth and accuracy of understanding.
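The scoring step can be sketched GRPO-style: sample a group of candidate descriptions, have the frozen generator "execute" each one, score the reconstructions, and compute group-relative advantages. Everything below is a hypothetical stand-in; in the real system the reward comes from comparing images, not scalar "qualities".

```python
import numpy as np

rng = np.random.default_rng(0)

def frozen_generator(description_quality):
    # Stand-in for the frozen "Generation Craftsman": better descriptions
    # yield better reconstructions (identity map, for simplicity).
    return description_quality

def reward(reconstruction_quality):
    # Stand-in reward: similarity of the reconstruction to the original.
    return reconstruction_quality

# A group of 4 candidate descriptions sampled by the understanding module,
# each represented by a made-up "quality" score.
group = rng.uniform(0.2, 0.9, size=4)
rewards = np.array([reward(frozen_generator(q)) for q in group])

# Group-relative advantage, as in GRPO: rewards standardized within the group.
advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
```

Standardizing within the group means the understanding module is rewarded for beating its own other attempts, not for hitting an absolute score, which is the core idea behind GRPO's critic-free training.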

Stage 3: Understanding serves generation, focusing on training the "Generation Craftsman"

After the "Understanding Craftsman" has gone through this specialized training and can reliably produce information-rich descriptions, the cycle moves to its second step. Now the coach's goal is to train the "Generation Craftsman" into a top-level executor.

The training process is the mirror image of Stage 2: freeze the "Understanding Craftsman" and let the "Generation Craftsman" repeatedly reconstruct images from the descriptions to hone his skills.

In this stage, the "Generation Craftsman" is forced to learn how to parse and execute long, constraint-filled instructions. This is the second direction of the "two-way strengthening": deep understanding in turn strengthens generation's ability to follow complex instructions.

Stages 2 and 3 are then trained alternately, forming a positive feedback loop: the more accurate the understanding, the more accurate the generation; the higher the bar for generation, the deeper the understanding must go. In both of these stages, UAE uses the GRPO algorithm.
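The alternation itself is just a schedule: each round freezes one module and trains the other. A minimal sketch with a hypothetical helper (not the paper's code):

```python
def alternate_training(rounds):
    # Build the freeze/train schedule for the alternating Stage 2 / Stage 3
    # loop; each round contains both directions of the cycle.
    schedule = []
    for _ in range(rounds):
        schedule.append(("freeze_generator", "train_understander"))   # Stage 2
        schedule.append(("freeze_understander", "train_generator"))   # Stage 3
    return schedule

schedule = alternate_training(rounds=3)
```

Because only one module moves at a time, each phase trains against a stable partner, sidestepping the unstable joint gradients of the "diarchy" setup.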

The Aha Moment at the intersection of generation and understanding

With UAE's new method, once this "left-right cycle, two-way strengthening" training system starts running, the model spontaneously develops behaviors that favor collaboration. At these turning points, the understanding module even shows a human-like "aha moment".

For example, without any external instruction, the text descriptions produced by the understanding module become longer and more detailed on their own. Traditional image captions are usually just a few words, whereas UAE's detailed descriptions average more than 250 English words.

Early in training, the descriptions are short, mostly naming basic objects and colors. By mid-training they begin to include counting and spatial relationships. By late training, the model's descriptions systematically cover materials ("knitted sweater"), occlusion relationships ("ears are not visible"), background details ("blurry park background"), lighting conditions, and more.

There is a delicate game mechanism behind this.

The understanding module finds that the more detailed the description, the higher the generation quality and the greater the reward. But padding with arbitrary words doesn't help: the added details must actually aid reconstruction.

So, it starts to automatically learn what details are most crucial for generation.

And the generation module is forced to improve its long - text processing ability to utilize this rich information.

Researchers compared the descriptions generated by the UAE understanding model with those of other well-known models (such as Bagel and OmniGen2), inviting several top-tier large language models (such as GPT-4o and Claude-4.1) as judges. UAE's descriptions came out ahead on completeness, attribute binding, relationships, spatial fidelity, and more.