A lifesaver for everyone who's bad at photo editing! Alibaba has launched the multimodal model Qwen-VLo, and everyone can use it for free.
It immediately had netizens exclaiming that its raw image generation ability is even stronger than GPT-4o's!
Just last night, Alibaba put on a stunning show with its brand-new multimodal model Qwen-VLo.
It is reported that Qwen-VLo has been comprehensively upgraded based on Alibaba's original multimodal understanding and generation capabilities, featuring three major highlights:
It has enhanced detail-capturing ability and can maintain high semantic consistency throughout the generation process;
With a single instruction, it can achieve image editing, including style replacement, material addition and deletion, text addition, etc.;
It supports multiple languages such as Chinese and English, making it more convenient for global users.
Moreover, both at the input and output ends, Qwen-VLo supports arbitrary resolutions and aspect ratios, without being restricted by fixed formats.
Meanwhile, beyond the tricks GPT-4o already supports (continuous generation, Ghibli style, text addition), the official demo also shows some new creative uses.
There's no need to elaborate on the former: it can now generate a series of pictures that precisely follow the instructions, like episodes of a serial drama:
As for the latter, we can, for example, ask Qwen-VLo to generate a picture of "all the bath products in the shopping basket", as if shopping for daily necessities in a supermarket.
Surprisingly, it immediately completed the task (⊙ˍ⊙):
There are some minor flaws, but to be fair, its "understanding" is indeed stronger than before.
According to the official introduction, this understanding ability is not only reflected in image generation but also in image recognition and interpretation.
For example, after completing the image generation task, we can ask it to introduce the breeds of the kittens and puppies in the picture (correctly identified as tabby cats and beagles):
Moreover, unlike previous models, Qwen-VLo can also annotate existing information in images (detection, segmentation, etc.).
In the picture below, it successfully segmented the banana, overlaying it with a red mask.
...
Currently, everyone can try the model for free (it is still a preview version). Specifically, look for Qwen3-235B-A22B and just enter your requirements in the input box on the home page.
Without further ado, let's start a hands-on test right away.
Qwen-VLo, just how capable are you at editing?
Based on the highlights Qwen itself calls out, namely "strong detail capturing" and "editing images with a single sentence", our test focuses on Qwen-VLo's various editing abilities.
After all, this is really appealing!
On the one hand, image generation with almost any model involves a bit of luck, and the previous result may not be entirely satisfactory, so the ability to edit an image a second time, or many times, really matters.
On the other hand, strong editing ability saves a lot of trouble for those who aren't good at photo editing...
Let's start with a warm-up!
In the first test, let it generate a photo of a polar bear drinking Coke.
This round uses a non-photorealistic style.
On this basis, continue the conversation to replace the Coke with milk.
It succeeded in one attempt: Qwen-VLo completed the replacement.
Moreover, the background and the polar bear itself were left virtually untouched.
If we really want to be picky, though, there are slight differences in the polar bear's brows and fur texture between the two pictures.
In the second test, first let it generate a photo of a bird.
This round features a realistic photography style.
Then, no need to go to Hogwarts: just say "replace the bird in the picture with a pigeon", and the magic happens:
But when we tried to play with the "garlic bird" meme, Qwen-VLo didn't get it.
(Note: "garlic bird" is a recent popular meme. In a short video's voice-over, the Wuhan-dialect phrase "forget it, forget it, everyone has a hard time" sounded to netizens like "suan niao", i.e. "garlic bird", and the nickname stuck.)
However, although it didn't get the meme, Qwen-VLo still tried hard to complete the editing task.
Looking at the result below: without changing the other elements, Qwen-VLo replaced the pigeon in the picture with another bird.
Is this also a kind of bird replacement?
In the third test, let's run a multi-step task to comprehensively test Qwen-VLo's ability to "depict" the world, focusing on its text-editing ability on images.
The process: have Qwen-VLo generate a sketch, then color it, add text, and finally edit the Chinese characters.
In case the GIF moves too fast, here are the four frames captured along the way, so you can see the change each step brings:
Although the facial features of the guy in the picture keep shifting, the main subject is stable and the background stays the same. Overall, the Chinese-character editing task is done quite well.
Finally, a bonus question: editing English text.
The text is edited correctly, the positions of the multiple subjects remain unchanged, and the background is still the same. Overall, it's correct.
But as you can see, the guy now looks more like a character from an American comic (lol).
It also renders step by step, but Qwen-VLo has real substance behind it
One more thing worth adding here; you'll probably notice it yourself while playing with the model.
That is, Qwen-VLo's image generation process looks like this:
Does it smell a bit familiar?
Yes, GPT-4o also generates images block by block from top to bottom: first showing a fuzzy outline and then gradually filling in the details.
However, a reverse-engineering study by the Chinese University of Hong Kong found that the line-by-line rendering users see is just a front-end trick by OpenAI; the model does not actually generate pixel by pixel from top to bottom.
The point is to meet users' psychological expectation of "real-time generation" while avoiding the technical burden of genuine line-by-line rendering.
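The trick described above is essentially a presentation effect: the image is already complete, and the front end merely uncovers it gradually. A toy sketch of the idea (purely our illustration, not OpenAI's actual code; the function name is hypothetical):

```python
# Toy illustration of a "fake" progressive reveal: the image is already
# fully generated, and we only uncover it row by row so it looks like
# live top-to-bottom rendering.
import numpy as np

def fake_progressive_reveal(image: np.ndarray, n_steps: int = 4):
    """Yield frames that uncover `image` from top to bottom.

    Rows not yet 'generated' are shown as a flat gray placeholder,
    mimicking the fuzzy-outline-then-details effect users see.
    """
    h = image.shape[0]
    placeholder = np.full_like(image, 128)   # gray "unrendered" area
    for step in range(1, n_steps + 1):
        visible = h * step // n_steps        # rows revealed so far
        frame = placeholder.copy()
        frame[:visible] = image[:visible]
        yield frame

# Usage: the last frame is always the untouched finished image.
img = np.random.randint(0, 256, (8, 8, 3), dtype=np.uint8)
frames = list(fake_progressive_reveal(img, n_steps=4))
assert np.array_equal(frames[-1], img)
```

Note that no pixel is ever recomputed here; the content is fixed before the first frame is shown, which is exactly what the reverse-engineering study claims.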
But Qwen isn't playing the same game as OpenAI.
Take note:
First of all, Qwen officially stated that Qwen-VLo's progressive generation method not only goes from top to bottom but also gradually clarifies the entire picture from left to right.
After multiple tests, we haven't yet visually observed the "left to right" effect on the front end.
But the front-end effect of the picture gradually forming from top to bottom is guaranteed:
Secondly, according to Qwen, this approach is genuinely useful:
During the generation process, the model continuously adjusts and optimizes the predicted content to ensure that the final result is more harmonious and consistent.
This generation mechanism not only improves the visual effect and generation efficiency but is also particularly suitable for long-paragraph text generation tasks that require fine
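The mechanism Qwen describes, revisiting and adjusting the whole picture rather than committing to finished rows, can be sketched with a toy coarse-to-fine loop (our illustrative assumption, not Qwen-VLo's implementation):

```python
# Toy sketch of progressive refinement: start from a coarse, fuzzy guess
# and repeatedly refine the WHOLE canvas, so regions drawn early can
# still be corrected in later passes (unlike a committed row-by-row reveal).
import numpy as np

def coarse_to_fine(target: np.ndarray, n_passes: int = 5, alpha: float = 0.5):
    """Refine a blurry canvas toward `target` over several full passes."""
    canvas = np.full_like(target, target.mean())  # fuzzy initial outline
    frames = [canvas.copy()]
    for _ in range(n_passes):
        # every pass revisits the entire canvas, which is what keeps
        # the final picture globally harmonious and consistent
        canvas = (1 - alpha) * canvas + alpha * target
        frames.append(canvas.copy())
    return frames

target = np.random.rand(8, 8)
frames = coarse_to_fine(target)
# the gap to the final image shrinks with every pass
errors = [np.abs(f - target).mean() for f in frames]
assert all(e1 >= e2 for e1, e2 in zip(errors, errors[1:]))
```

The contrast with the reveal trick is the design point: here earlier content really does change between frames, trading extra computation for global consistency.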