Alibaba's Overnight Bombshell: Qwen-Image-2.0 Tested, Excels at Outfit Changes, Group Photos, and Clear Writing

In Chinese AI image generation, Alibaba has stepped up.

Xiaolei had assumed that the AI world would quiet down a little as Chinese New Year approached.

Last year in particular, the industry never paused for a moment. The major vendors launched new products one after another, as if by arrangement. This was especially true of image generation. On the closed-source side, Banana Pro, with its impressive lighting and texture, became almost a fixture on designers' computers. On the open-source side, models led by Z-image were everywhere, and as long as your graphics card could handle it, the quality of local image generation had improved significantly.

At the time, Xiaolei was telling colleagues in the editorial department that these two models would stay ahead of the pack for at least half a year.

Unexpectedly, the slap in the face came faster than turning a page.

Just yesterday, Alibaba's Tongyi Qianwen team quietly made a big move: Qwen-Image-2.0, its new-generation foundation model for image generation, officially launched.

(Image source: Alibaba)

The name sounds quite plain, without any fancy suffixes. But what really caused a stir in the industry was its core selling point: it can not only draw pictures but also understand human language and even write Chinese characters.

According to the official introduction, the model supports native 2K resolution (2048x2048 pixels) and can handle complex instructions of up to 1,000 tokens. It also uses a lighter architecture: the model is much smaller than the 20B-parameter Qwen-Image 1.0, which gives it faster inference.
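For a rough sense of what "native 2K" means in pixel terms, here is a minimal Python sketch. The 1024x1024 baseline is my own assumption, chosen only for comparison; it is not a figure from Alibaba's announcement.

```python
# Pixel arithmetic for the 2K resolution quoted in the official introduction.
def pixel_count(width: int, height: int) -> int:
    """Return the total number of pixels in a width x height image."""
    return width * height

native_2k = pixel_count(2048, 2048)    # resolution stated in the article
baseline_1k = pixel_count(1024, 1024)  # assumed baseline, for comparison only
ratio = native_2k / baseline_1k

print(f"{native_2k:,} pixels (~{native_2k / 1e6:.1f} MP)")  # 4,194,304 pixels (~4.2 MP)
print(f"{ratio:.0f}x the assumed baseline")                 # 4x the assumed baseline
```

In other words, a 2048x2048 image carries four times as many pixels as a 1024x1024 one, which is why small text and fine detail have more room to stay legible.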

What? You say these specs are too confusing to mean anything to you?

No problem. I also have Google's Nano Banana Pro on hand, so we can run a side-by-side comparison right away. Without further ado, let's get started!

Good at Chinese output, but aesthetics need improvement

Before we start generating images, let's talk about the core idea behind Qwen-Image-2.0.

In the past, generating images with AI was like drawing cards in a gacha game. Because the input token length was limited, it was hard to describe the picture we wanted in detail. We could only boil our requirements down to a handful of keywords and let the AI generate a few pictures. Whether they looked good was purely a matter of luck.

In my experience, when a prompt gets too long, the model often fails to cover everything in it. It may drop the background or get the number of objects wrong.

But Qwen-Image-2.0 is different. Its core selling points are following long instructions and strong rendering ability.

To verify this, Xiaolei prepared three demanding tests: very long logical instructions, mixed text-and-image layout, and accurate rendering of Chinese semantics.

Remember, Qwen-Image-2.0 accepts prompts of up to 1K tokens. You can write the prompt in great detail, and you can also choose whether to have the prompt optimized.

That is very appealing to newcomers to AI image generation.

For the long-instruction test, drawing on my own recent experience, I fed both models a prompt of complex instructions as long as 700 words:

(Image source: Lei Technology)

To be honest, after typing this passage, Xiaolei himself thought it was a bit excessive.

For most image-generation models on the market, it is almost impossible to produce a picture with a four-panel structure, clear logic, consistent character relationships, and a unified painting style.

After a wait of a dozen or so seconds, both pictures were ready.

To be fair, the picture generated by Banana Pro really captured the artistic conception of an ink-wash comic. The strong black-and-white contrast made it look very artistic.

But on closer inspection, I burst out laughing: it drew Lin Chong, whose nickname is "Leopard Head", as a monster with an actual leopard's head! The model took "Leopard Head" literally and completely failed to understand that it is a nickname.

(Image made by Lei Technology with Nano Banana Pro)

As for Qwen-Image-2.0's result, I personally think its painting style is more realistic. Lin Chong appears as a tough-looking man with a weather-beaten face and no animal head. The model clearly understood that "Leopard Head" describes a person's features rather than a species. From kneeling on the ground, to breaking through the window, to striking down the enemy with his spear, the panels tell the story very clearly.

This is the advantage of a Chinese-developed model in a Chinese context: it understands the allusions, while its competitors can only read them literally.

(Image made by Lei Technology with Qwen-Image-2.0)

What? You say one picture doesn't prove much?

Then let's test how faithfully Chinese semantics are rendered. I prepared a detailed prompt of nearly 800 words to see whether Qwen-Image-2.0 could produce a result that met expectations:

(Image source: Lei Technology)

The picture generated by Qwen-Image-2.0 is shown below. The model followed our requirements for layout and font color, and the content is presented accurately with few omissions.

(Image made by Lei Technology with Qwen-Image-2.0)

There are some shortcomings, however. Stray semicolons appeared in several boxes, and some of the very small labels were simply unreadable.

Nano Banana Pro's result clearly contained more images and icons. The design style matched what we asked for, and most of the text was rendered successfully.

The only drawback was that some of the text was blurred and difficult to distinguish.

(Image made by Lei Technology with Nano Banana Pro)

Overall, both models did a good job. Qwen-Image-2.0's output was plainer, while Nano Banana Pro's had a much stronger sense of design.

Finally, let's test how well text and images are combined. Here we'll use Cao Cao's poem "Duangexing" as the target:

(Image source: Lei Technology)

When the prompt did not include the full text of "Duangexing", neither model could reproduce the whole poem. Qwen-Image-2.0 stopped partway through, while Nano Banana Pro produced some strange repetitions.

(Images made by Lei Technology; top: Nano Banana Pro, bottom: Qwen-Image-2.0)

That aside, the results from both models were actually quite good.

Would the results differ if the full text were provided? To find out, I tried again.

(Images made by Lei Technology; top: Nano Banana Pro, bottom: Qwen-Image-2.0)

At first glance, the overall completion rate was quite high. The picture elements I asked for, the full embedding of the long text, and the calligraphy font were all delivered.

But on closer inspection, it is not hard to see that Qwen-Image-2.0 still has room for improvement in the layout, generation, and artistic design of long texts.

Strong stability and excellent image-editing ability

If the text-to-image generation above was just routine, the image editing that follows is where Qwen-Image-2.0 really surprised Xiaolei.

Specifically, we can upload one or more pictures and have the AI rework, modify, or otherwise edit them through prompt instructions.

Enough talk. Let's first try the once-popular "three-view" trick:

(Image source: Lei Technology)

The original picture is of a Japanese internet celebrity on TikTok:

(Image source: Bilibili)

Working from this picture, Qwen-Image-2.0 generated a three-view image that looks entirely normal and can be considered a finished product consistent with the character.

(Image made by Lei Technology with Qwen-Image-2.0)

Nano Banana Pro's result, by contrast, is downright bizarre.
