The dark-horse image model praised by the technical lead of Nano Banana: behind it is a Chinese team of fewer than 15 people, led by the father of DDIM and a CVPR Best Paper author.
A dark horse has emerged in the image field!
Luma AI has just introduced a brand-new model, Uni-1, competing directly with Google's Nano Banana Pro and GPT Image 1.5.
Uni-1 is a unified image understanding and generation model.
In the official demonstration, Uni-1 shows a wide range of capabilities: character pose transfer, storyboard generation, reference-based generation combining drafts and materials, draft-to-comic conversion, multi-reference image scene synthesis, draft-guided photo editing, UV map generation, and greeting cards and posters with rendered text.
In many authoritative task evaluations, Uni-1 not only competes with Nano Banana Pro and GPT Image 1.5 but also achieves world-leading performance on some tasks.
For example, in the case below, Uni-1 captures the details precisely and leads in style consistency, element integration, and detail restoration.
What is even more surprising: a model with such impressive results is backed not by a heavily funded large company but by a Chinese research team of fewer than 15 people.
After Uni-1's release, the comments were overwhelmingly positive; even Oliver Wang, chief scientist at Google DeepMind and technical lead of the Nano Banana project, offered his praise:
Jim Fan, head of robotics at NVIDIA, also sent his congratulations:
How impressive is Uni-1? Without further ado, let's look at more examples.
Unlock diverse creative scenarios
New Year greeting card for the Year of the Horse
Let's start with a simple test:
Generate a New Year greeting card for the Year of the Horse that includes Chinese phrases such as 'Happy Chinese New Year', 'Good luck in the Year of the Horse and may all wishes come true', and 'Year of the Horse, 2026'.
The greeting card generated by Uni-1 has complete text and a sensible layout, and the horse is rendered in a style highly consistent with traditional Chinese paper-cutting. By contrast, GPT Image 1.5 produces garbled text, and Nano Banana Pro also shows obvious defects in text rendering.
Chinese text rendering has long been a litmus test for image-generation models, and Uni-1 delivers a quite impressive answer.
Multi-reference image scene synthesis
Give the model five reference images (two cats, two men, and the Luma AI logo) and ask it to synthesize a meeting scene:
One cat presents a Luma AI slide while the other listens, with the real-person photos and the logo integrated into the scene.
Uni-1 accurately preserves the identity features of each reference image (the cats' fur colors and patterns, the men's facial features and hairstyles, and the details of the logo) and arranges them plausibly within a single scene.
GPT Image 1.5 simply pastes the reference images onto the slide, while Nano Banana Pro fails to achieve even basic integration of the references.
Infographic extraction
Give the model a photo of a 'THE BEES NEED YOU' public-service poster shot in a subway station and ask it to extract a production-ready infographic: generate a complete image with no placeholder frames and accurately restore all text visible in the infographic.
This task tests both 'seeing' and 'drawing': the model must first understand every layer of information in the photographed poster, then regenerate a well-laid-out infographic.
Uni-1 accurately reproduces the full poster layout, all of the text, the correct color scheme, the black grass silhouette, and the right aspect ratio. GPT Image 1.5 gets some text colors wrong, drops all of the bottom text, and has problems with the wildflower-seed and bee logos. Nano Banana Pro manages a decent overall layout, but its bottom text is also missing.
Draft-to-comic conversion
Now for generation: convert a rough draft (a cat standing on a bookshelf, with someone saying 'Hey! Get down from there!') into a professional-quality comic.
Uni-1 turns the draft's intent into a polished comic. The panel composition and the position and direction of the speech bubbles are faithfully reproduced, and every detail is retained: the cat's ears, the raised tail, the cigarette ashtray, the books on the shelf, even the phone screen showing 911.
A lifetime in front of the piano: 6-frame storyboard
The following may be the demo that best shows Uni-1's strength.
Requirement: generate a 6-frame storyboard showing the life of the same character at the piano from childhood to old age. The character progresses from boy to teenager, young man, middle-aged man, and old man, ending with a family group photo on stage.
Across all six frames, the character's identity remains consistent: the same face, piano, perspective, and art style, with only the character's appearance and the background changing over time. This cross-frame character consistency and temporal narrative ability are among the core challenges for today's image models.
UV mapping generation
Give the model three photos of a person taken from different angles (front, left, and right) and ask it to generate an unwrapped UV map with standard facial topology.
UV mapping is a core step in 3D modeling, with extremely demanding requirements for facial alignment, left-right symmetry, and skin-tone consistency.
On these three dimensions, the UV map generated by Uni-1 is significantly better than those of GPT Image 1.5 and Nano Banana Pro:
GPT Image 1.5's front and side face maps are inconsistent, and Nano Banana Pro fails to produce a result that meets standard UV layout specifications.
Handling such professional 3D tasks shows that Uni-1 is not just 'good at drawing pictures'; it genuinely understands three-dimensional spatial structure.
How can a team of fewer than 15 people achieve this?
Having seen the results, you may wonder: how did a team of fewer than 15 people achieve what is usually seen only from large companies?
The answer may lie in the two research leaders of this team.
Jiaming Song earned his undergraduate degree from Tsinghua University and his doctorate from Stanford University.
His best-known contribution is the invention of DDIM (Denoising Diffusion Implicit Models). Nearly every diffusion-based image-generation tool, from Stable Diffusion to DALL·E, relies on the sampling acceleration DDIM introduced.
This paper has been cited more than 10,000 times and won the ICLR 2022 Outstanding Paper Award.
Botao Shen earned both his undergraduate and doctoral degrees from Stanford University.
His representative work won the CVPR 2018 Best Paper Award; CVPR is a top-tier computer-vision conference, and only a handful of papers receive this honor each year. He was also a finalist for the RSS 2022 Best Student Paper.
One pioneered diffusion-model acceleration; the other is a top-tier computer-vision researcher. These two Chinese scholars joined forces to lead an elite team down a path different from that of the large companies:
Instead of separating understanding from generation, they use a single unified model to handle both.
Unified model: giving the logical brain a 'mind's eye'
The core concept of Uni-1, in Luma's own words, is to 'give the logical brain a mind's eye'.
In the traditional approach, image understanding (captioning, object detection) and image generation (text-to-image, image editing) are two independent systems. Uni-1 instead uses a decoder-only autoregressive Transformer that represents text and images in a single interleaved sequence, where images serve as both input and output.
This means Uni-1 does not train an 'understanding module' and a 'generation module' separately; it models time, space, and logic simultaneously within one unified framework.
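Conceptually, the interleaved sequence described above might look like the toy sketch below. The special tokens, patch tokens, and helper function are invented for illustration; this is not Luma's actual tokenization, only a sketch of the general idea that one decoder can treat both directions as next-token prediction.

```python
# Toy sketch of an interleaved text-image sequence for a unified
# decoder-only model. Token names and delimiters are hypothetical.
BOI, EOI = "<img>", "</img>"  # markers delimiting an image span

def interleave(segments):
    """Flatten (modality, tokens) pairs into one flat sequence.
    Wrapping image spans in <img>...</img> lets the same decoder
    handle understanding (image -> text) and generation
    (text -> image) as ordinary next-token prediction."""
    seq = []
    for modality, tokens in segments:
        if modality == "image":
            seq.append(BOI)
            seq.extend(tokens)
            seq.append(EOI)
        else:
            seq.extend(tokens)
    return seq

# Understanding: image tokens come first, the caption follows.
understand = interleave([
    ("image", ["p00", "p01", "p10", "p11"]),  # four toy patch tokens
    ("text",  ["a", "cat", "on", "a", "shelf"]),
])

# Generation: the prompt comes first, image tokens follow.
generate = interleave([
    ("text",  ["draw", "a", "cat"]),
    ("image", ["p00", "p01", "p10", "p11"]),
])

print(understand)
print(generate)
```

Because both tasks are just different orderings of the same sequence format, a single autoregressive objective covers them both, which is the sense in which generative training and understanding share one model.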
More interestingly, Luma found that generative training significantly improves understanding. In other words, as the model learns to 'draw', its ability to 'read' images also grows stronger, which closely mirrors how human cognition works.
For reasoning-heavy generation tasks, Uni-1 performs structured internal reasoning before synthesizing the image: it first decomposes the instruction, then plans the composition, and only then renders the output.
This ability to 'think before drawing' earns it the world's best results on RISEBench, a benchmark that evaluates four dimensions: temporal, causal, spatial, and logical reasoning.
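The staged process described above can be sketched as a generic pipeline. The stage names and the `model(stage, payload)` interface below are hypothetical, invented for illustration; Uni-1's internal reasoning is not exposed as an API like this.

```python
# Hypothetical sketch of "think before drawing": structured reasoning
# stages run before any image synthesis. Stage names are invented.
def reason_then_render(instruction, model):
    # Stage 1: decompose the instruction into explicit requirements.
    plan = model("decompose", instruction)
    # Stage 2: plan the composition (layout, subjects, text placement).
    layout = model("compose", plan)
    # Stage 3: only now render the final image from the plan.
    return model("render", layout)

# Stub model that records which stages ran and in what order.
calls = []
def stub_model(stage, payload):
    calls.append(stage)
    return f"{stage}({payload})"

result = reason_then_render("6-frame piano storyboard", stub_model)
print(calls)   # ['decompose', 'compose', 'render']
```

The point of the staging is that composition errors (a missing frame, misplaced text) are caught in the cheap planning stages rather than discovered only in the rendered pixels.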