No more mistakes in drawing Sudoku or lighting candles? Zhejiang University and Alibaba enable AI to think twice before acting.
Currently, visual generation is caught in a dilemma of misaligned capabilities.
The pixel quality of diffusion models has nearly reached perfection, but they often fail when faced with generation tasks that require logical reasoning.
When asked to draw "what a solved Sudoku looks like" or "the state of a candle after burning for 6 hours", open-source models either hallucinate the logic or fail to convert the text instruction into precise visual operations, leaving a wide execution gap.
In contrast, closed-source models such as Nano Banana and GPT-Image already possess mature reasoning-driven generation capabilities.
Is the gap between open-source and closed-source models really due to a weak generator?
The research team from Zhejiang University in collaboration with Alibaba provides an answer: The problem lies not in the "hands" but in the lack of an independent "brain".
They proposed Unified Thinker, a general reasoning core that completely decouples thinking and execution, upgrading image generation from "end-to-end black-box mapping" to "modular thinking chain planning".
This work has been officially accepted as an Oral at ACL 2026.
Reasoning should not be just a "self-indulgence" in the text space
The problem with today's multimodal generation models often lies not in their ability to think, but in whether those thoughts can be translated into images.
To produce an image, the model has to settle who is in it, how the objects are positioned, how the actions occur, how states change before and after, and which information needs to be expressed visually.
If reasoning remains a closed loop within the text space, an awkward situation easily arises: the language sounds reasonable, but the generated result is completely different.
Existing multimodal generation approaches generally fall into two technical routes.
One is the unified model that attempts to balance understanding and generation in a single network. This tight coupling often leads to unstable training, and it is difficult to achieve both generation quality and logical reasoning.
The other is the external mode that uses a general LLM as a Planner. However, this approach faces a serious problem of semantic-visual misalignment.
Descriptions that the LLM considers reasonable may not be executable by the diffusion model, because the diffusion model lacks the corresponding visual priors.
The core insight of Unified Thinker is that reasoning should not be just a logical deduction in the text space, but must be an "executable plan".
The researchers designed an independent Thinker module. It does not directly generate pixels but acts as the brain, responsible for decomposing the vague user intent into a hierarchical, structured, and generator-friendly intermediate representation.
The Generator acts as the hands, focusing on high-precision pixel synthesis.
This decoupled design not only allows developers to upgrade the logical ability of the brain independently but also lets that ability generalize and transfer across different generation backbones (such as Qwen-Image and BAGEL).
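The split between brain and hands can be pictured with a minimal Python sketch. Everything in it is an assumption made for illustration: the `Plan` fields, the `Thinker` and `Generator` interfaces, and the toy stand-in classes are hypothetical and are not taken from the Unified Thinker code base.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass
class Plan:
    """Hypothetical intermediate representation passed from Thinker to Generator."""
    reasoning: str         # reasoning trace, kept so a human can inspect it
    generator_prompt: str  # the only field the generator consumes


class Thinker(Protocol):
    def plan(self, user_request: str) -> Plan: ...


class Generator(Protocol):
    def generate(self, prompt: str) -> str: ...


class EchoThinker:
    """Toy stand-in: a real Thinker would be a trained reasoning model."""

    def plan(self, user_request: str) -> Plan:
        return Plan(
            reasoning=f"User asked: {user_request}",
            generator_prompt=f"A detailed image of: {user_request}",
        )


class FakeGenerator:
    """Toy stand-in for a generation backbone; returns a label, not pixels."""

    def __init__(self, name: str) -> None:
        self.name = name

    def generate(self, prompt: str) -> str:
        return f"[{self.name}] {prompt}"


def think_then_generate(thinker: Thinker, generator: Generator, request: str) -> str:
    # The Thinker never touches pixels; the Generator never sees the raw request.
    plan = thinker.plan(request)
    return generator.generate(plan.generator_prompt)


# The same Thinker instance drives two different (fake) backbones.
thinker = EchoThinker()
out_a = think_then_generate(thinker, FakeGenerator("backbone-A"), "a solved Sudoku grid")
out_b = think_then_generate(thinker, FakeGenerator("backbone-B"), "a solved Sudoku grid")
```

Because the two sides only meet at the plan, swapping the generator requires no change to the Thinker, which is the property the article describes as transfer across generation backbones.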
From data to algorithm: Building an executable thinking chain
To translate "thinking" into images, the research team did not stop at model architecture; it started by reworking the underlying data engineering.
They built a dataset containing 40,000 samples: HieraReason-40K.
Its biggest difference from traditional image-text pairs is the introduction of structured reasoning traces.
That is, before generating or editing an image, the model must go through a fixed thinking chain: intent decomposition → logical concretization → visual translation.
First, determine what the user really wants to change, then break down the abstract requirements into specific visual elements, and finally convert them into instructions that the downstream generator can execute.
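As a rough sketch of what one such trace might look like in code, the three stages can be stored as three fields of a record. The field names and the hand-written candle example are assumptions for illustration only; the actual HieraReason-40K schema is not described in this article.

```python
from dataclasses import dataclass, field


@dataclass
class ReasoningTrace:
    """Hypothetical record with one field per stage of the thinking chain."""
    intent: str                                                # stage 1: intent decomposition
    visual_elements: list[str] = field(default_factory=list)   # stage 2: logical concretization
    generator_prompt: str = ""                                 # stage 3: visual translation


def build_trace(request: str, elements: list[str]) -> ReasoningTrace:
    # Stage 1: record what the user actually wants.
    trace = ReasoningTrace(intent=request.strip())
    # Stage 2: keep only non-empty, concrete visual elements.
    trace.visual_elements = [e.strip() for e in elements if e.strip()]
    # Stage 3: flatten the elements into a prompt a generator could consume.
    trace.generator_prompt = ", ".join(trace.visual_elements)
    return trace


# In a real system the Thinker would produce the elements; here they are hand-written.
trace = build_trace(
    "the state of a candle after burning for 6 hours",
    ["a short candle stub", "pooled melted wax at the base", "a small flame"],
)
```

Keeping the stages as separate fields means each one can be inspected or corrected before anything is sent to the generator.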
Moreover, in the image editing scenario, the researchers also proposed a "golden rule":
It is strictly prohibited to describe unchanged areas in the prompt. This strategy greatly reduces the semantic drift of the diffusion model during the editing process and ensures that the generation process is precisely focused.
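One way to picture the rule is as a simple check that flags edit prompts mentioning regions meant to stay untouched. This is only an illustrative sketch: the function name and the idea of passing an explicit list of unchanged regions are assumptions, not something the article attributes to the authors.

```python
import re


def golden_rule_violations(edit_prompt: str, unchanged_regions: list[str]) -> list[str]:
    """Return the unchanged regions that the edit prompt mentions (hypothetical check)."""
    lowered = edit_prompt.lower()
    hits = []
    for region in unchanged_regions:
        # Whole-word match so that "sky" does not fire on "skyscraper".
        if re.search(r"\b" + re.escape(region.lower()) + r"\b", lowered):
            hits.append(region)
    return hits


unchanged = ["sky", "trees"]
bad = golden_rule_violations(
    "Replace the red car with a blue bicycle; keep the sky and trees the same", unchanged
)
good = golden_rule_violations("Replace the red car with a blue bicycle", unchanged)
```

A prompt that passes the check describes only what should change, which is the behavior the rule asks for.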
During the optimization phase, supervised fine-tuning (SFT) alone cannot guarantee that the reasoning results actually improve generation.
Therefore, Unified Thinker introduces a two-stage reinforcement learning scheme based on the GRPO algorithm.
In the reasoning-oriented RL stage, the multiple reasoning paths generated by the Thinker are rewarded directly with the visual quality scores of the images generated from them. This forces the model to abandon empty words and instead learn to generate "visually executable" instructions.
In the generation-oriented RL stage, the generator's fidelity to complex instructions is optimized through random sampling. This two-way feedback mechanism realizes deep collaboration between the brain and the hands.
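The core idea of GRPO, scoring each sample relative to the other samples drawn for the same prompt, can be sketched in a few lines of Python. The sketch uses the commonly cited normalization (reward minus group mean, divided by group standard deviation); whether Unified Thinker uses exactly this form is an assumption, and the reward values below are made-up stand-ins for visual quality scores.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Normalize rewards within one group of samples drawn for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    # eps avoids division by zero when every sample gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical visual quality scores for four reasoning paths on one prompt.
scores = [0.2, 0.5, 0.8, 0.5]
advantages = group_relative_advantages(scores)
```

Paths whose images score above the group average receive a positive advantage and are reinforced, while below-average paths are discouraged; no separate value network is needed to provide the baseline.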
Towards the evolution of "plan first, then generate"
The experimental results also verify the value of this decoupling architecture.
On benchmarks that stress reasoning ability, Unified Thinker performs particularly well.
For example, it achieves significant improvements on RISEBench, which focuses on reasoning-based image editing, and on WiseBench, a knowledge-intensive text-to-image task.
In addition, on tasks involving evolution over time (such as predicting how objects age) and complex spatial positioning, it performs significantly better than existing open-source baseline models and shows instruction-following ability comparable to that of closed-source models.
More practically, this architecture has strong generalization ability.
As a plug-and-play reasoning core, the logical planning ability of the Thinker can be transferred across models.
Experiments show that even when it is attached to a generation backbone that did not take part in training, it effectively improves that backbone's logical execution accuracy.
From a longer-term perspective, Unified Thinker can be regarded as an attempt to move visual generation from "probability fitting" to "logic-oriented" generation.
In the past, models relied more on feature matching and random sampling to generate images. Now, by introducing structured reasoning traces that can be inspected and intervened in, the generation process gains an additional layer of pre-planning and therefore higher certainty.
This also provides a feasible architectural idea for building generative agents with autonomous decision-making capabilities in the future.
It is foreseeable that as the reasoning cost is further optimized, "plan first, then generate" will become an important path to improve the quality of visual generation.
Reference links:
[1] Paper link: https://arxiv.org/pdf/2601.03127
[2] Code repository: https://github.com/LivingFutureLab/UnifiedThinker
[3] Data link: https://huggingface.co/datasets/demo911/HieraReason_40K
This article is from the WeChat official account “QbitAI”. Author: Zhejiang University & Alibaba Team. Republished by 36Kr with permission.