
Has Google Lost Its Throne? The research team led by JIA Jiaya from the Hong Kong University of Science and Technology has open-sourced DreamOmni2, and its powerful photo editing capabilities have outperformed Nano Banana.

Xinzhiyuan (新智元), 2025-10-24 09:48
DreamOmni2 breaks through the bottleneck of AI multimodal editing and realizes the generation of abstract concepts.

You can never precisely describe the brushstrokes of Vincent van Gogh or the lighting and shadows of Wong Kar-wai. The future of AI creation is to enable AI to directly "understand" your inspiration rather than trying to interpret your instructions.

The AI image models are going crazy!

At the beginning of the year, GPT-4o triggered a "Studio Ghibli" craze.

Recently, the whole internet has been going wild for the 3D figurines generated by Nano Banana.

However, have you noticed a key limitation?

Most of these unified generation-and-editing efforts focus on instruction-based editing and the generation of concrete entities. As intelligent creative tools, they still have a long way to go.

  • When language becomes inadequate.

Imagine you want to replace the backpack of a person in a photo with the pattern of a skirt in another photo. How can you precisely describe that complex and irregular Bohemian-style pattern to an AI in words?

The answer is: almost impossible.

  • When inspiration is not a physical object.

Furthermore, what you want to draw inspiration from may not be an object at all, but an abstract "feeling":

For example, the retro, film-like lighting of an old photo, or the brushstroke style of a particular painter. Models that are only good at extracting and replicating a specific "object" are at a loss here.

How great would it be if an AI could understand human language and accurately grasp these abstract styles!

Recently, this bottleneck was broken through by an AI research team led by Jia Jiaya from the Hong Kong University of Science and Technology. Their GitHub project gained 1,600 stars in two weeks and was shared by many overseas creators on YouTube and forums, sparking plenty of discussion.

In the paper "DreamOmni2: Multimodal Instruction-based Editing and Generation", the team shows how the model masters multimodal editing and generation for "abstract concepts".

  • Paper: https://arxiv.org/html/2510.06679v1

  • Project homepage: https://pbihao.github.io/projects/DreamOmni2/index.html

  • Code repository: https://github.com/dvlab-research/DreamOmni2

Based on the powerful FLUX Kontext model, DreamOmni2 is endowed with the new ability to handle multiple reference images while retaining its top-notch text-to-image and instruction editing capabilities, making it a more intelligent creative tool.

It not only significantly outperforms existing open-source models in traditional tasks but also shows greater strength than Google's powerful Nano Banana in the new task of handling abstract concepts.

An open-source version of Nano Banana, but even stronger

Actions speak louder than words. Let's directly conduct some actual tests.

First, let's try a classic one: input a product and then let a character "endorse" it.

Prompt:

The character from the first image is holding the item from the second picture.

The details of the expression, hair, fingers, and the texture of the clothes are just perfect, aren't they?

Moreover, the product itself is well integrated.

Next, let's try the effect in the real world: let the model replace the man in Image 1 with the woman in Image 2.

The result is out!

As we can see, in the generated picture, the mountains in the background and the cyber-style lighting effects are almost perfectly inherited, and the text in front of the character is not affected at all.

In terms of the character, the clothes and hairstyle are basically the same as those in Image 2, and the facial lighting imitates the effect in Image 1.

It's really amazing.

Speaking of lighting rendering, let's raise the difficulty and have the model transfer the red-blue style of Image 2 to Image 1.

Prompt:

Make the first image have the same lighting condition as the second image.

Unexpectedly, DreamOmni2 not only retains the original grille-like lighting of Image 1 but also shows a very strong red-blue contrast after the fusion.

In contrast, GPT-4o (left in the following picture) only transfers the color tone, and the lighting and shadow effects are not retained. Nano Banana (right in the following picture) only changes the color slightly.

Style transfer is also a breeze for it.

Prompt:

Make the first image have the same image style as the second image.

The pixel-style chicken: done.

The anime-style girl: done. (So beautiful!)

Patterns and text are also no problem.

Prompt:

On the cup, "Story" is displayed in the same font style as the reference image.
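As a side note, the prompts in these demos all follow the same simple pattern: refer to inputs by ordinal position ("the first image", "the second image") and name the attribute to transfer. A tiny helper to build such prompts might look like this (this is purely illustrative, not part of the DreamOmni2 codebase; the function name and template wording are assumptions):

```python
# Hypothetical prompt builder for multimodal editing instructions of the kind
# shown in the demos: "Make the first image have the same X as the second image."
ORDINALS = ["first", "second", "third", "fourth"]

def transfer_prompt(attribute: str, target_idx: int = 0, ref_idx: int = 1) -> str:
    """Build an instruction asking the model to copy `attribute`
    from the reference image onto the target image."""
    return (
        f"Make the {ORDINALS[target_idx]} image have the same "
        f"{attribute} as the {ORDINALS[ref_idx]} image."
    )

print(transfer_prompt("lighting condition"))
# Make the first image have the same lighting condition as the second image.
```

Swapping in "pose", "expression", or "image style" reproduces the other instructions used below.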

Moreover, DreamOmni2 is also very good at imitating actions.

Prompt:

Make the person from the first image have the same pose as the person from the second image.

In the result generated by DreamOmni2, the actions of the arms and legs are almost perfectly replicated from Image 2.

Unfortunately, the character's orientation and the details of the hands differ slightly.

Still, compared with the open-source FLUX Kontext, which struggles badly with semantic understanding here, DreamOmni2 is much stronger.

As shown in the following picture, Kontext clearly fails to understand what "the first image" and "the second image" refer to, or that the pose needs to be adjusted, so it simply copies Image 2.

Among the closed-source models, GPT-4o (left in the following picture) imitates the pose quite well, but the facial consistency suffers.

And Nano Banana (right in the following picture) is a bit abstract, creating a "Three-Body person" :)

Beyond body movements, DreamOmni2 is also very accurate and stable at editing facial micro-expressions and hairstyles.

Prompt:

Make the person in the first image have the same expression as the person in the second image.

The width of the open mouth and the narrowing of the squinted eyes match almost exactly. Impressive, to say the least.

It would be very difficult to describe such an effect in words.