The core team behind Nano Banana reveals for the first time how the world's hottest AI image generation tool was created.
New ways to play with Nano Banana keep appearing: desktop figurines, multi-element composites, continuous visual stories, and more.
This powerful model, which delivered another "Studio Ghibli moment," has set off a fresh creative boom across the internet, with figurine versions of real-life photos filling everyone's WeChat Moments.
But while marveling at the results, remember to clearly label images as AI-generated, because the "Measures for the Identification of Content Generated by Artificial Intelligence" have come into effect today.
For image generation, Google already has a dedicated text-to-image model, Imagen 4. So how did nano banana end up being the model Google shipped?
When it first appeared in the model arena under the mysterious code name nano banana, some people guessed it was a Google model.
And its quality is no accident: nano banana is the result of work by multiple Google teams. First, it draws on Gemini's powerful world knowledge and instruction-following ability. Second, it inherits the pursuit of image aesthetics and naturalness behind Google's top in-house text-to-image model, Imagen.
We have compiled the core team's podcast interview. Let's take a look at the past, present, and future of this banana.
TL;DR:
1. Nano banana delivers a huge quality leap in both image generation and editing. It is fast, understands vague, colloquial instructions as well as instructions that require world knowledge, and keeps characters and scenes consistent across multiple rounds of editing. The results look natural, shedding the old "pasted-on" Photoshop feel.
2. Evaluating images and videos has always been difficult, so finding a suitable metric matters a great deal. The nano banana team found that improving text rendering improves generated images overall: once a model can generate structured text, it also learns the structure of images better.
3. The key to nano banana's improvement is its "native multimodal" ability, especially interleaved generation. This lets the model work through complex instructions step by step, the way a human would, and create in context rather than generating everything at once.
4. If you only need high-quality text-to-image generation, Imagen is still the first choice; for more complex multimodal workflows such as multi-round editing and creative exploration, nano banana is the better creative partner (see the API sketch after this list).
5. Going forward, the goal of nano banana is not only better visual quality but also "intelligence" and "factual accuracy." The team wants a smart model that understands the user's deeper intent, can return results that are better and more creative than the prompt literally asked for, and can accurately generate work content such as charts.
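For readers who want to try the model from code, below is a minimal sketch using the google-genai Python SDK. The response-handling pattern follows the SDK's documented usage for image output, but the model name used here (gemini-2.5-flash-image-preview) is an assumption based on the public preview and may have changed by the time you read this.

```python
# Minimal text-to-image sketch with the google-genai SDK (pip install google-genai pillow).
# The model name below is an assumption and may need updating.
from io import BytesIO

from google import genai
from PIL import Image

client = genai.Client()  # reads the API key from the environment

response = client.models.generate_content(
    model="gemini-2.5-flash-image-preview",
    contents="A poster that says 'Gemini Nano' in bold retro lettering",
)

# The response can interleave text and image parts; print text, save images.
for i, part in enumerate(response.candidates[0].content.parts):
    if part.text:
        print(part.text)
    elif part.inline_data:
        Image.open(BytesIO(part.inline_data.data)).save(f"output_{i}.png")
```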
Below is the transcript of the podcast, lightly edited and compiled.
Hello everyone, welcome back to "Release Notes". I'm Logan Kilpatrick from the Google DeepMind team. With me today are Kaushik, Robert, Nicole, and Mostafa, the colleagues leading research and product for our Gemini native image generation model. I'm super excited about today's conversation. So, Nicole, would you like to kick things off? What's the good news in this release?
From left to right: Kaushik Shivakumar, Robert Riachi, Nicole Brichtova, Mostafa Dehghani, and Logan Kilpatrick
Nicole: Yes, we are releasing an update to the image generation and editing capabilities of Gemini 2.5 Flash. It's a huge quality leap, and the model is now at an industry-leading level. We're very excited about both the generation and the editing capabilities. Why don't I just show you the model directly, since that's the most intuitive way.
Logan: I'm so looking forward to it! I played with it once before, but not as much as you guys, so I'm really eager to see more examples.
Nicole: Okay, let me take a picture of you. Let's start with a simple example, for instance: "Zoom out the shot and put him in a huge banana suit, but keep the face clear because it still has to look like you." It takes a few seconds to generate, but it's still fast; you'll remember the last model we released was already quite fast.
Logan: This is one of my favorite functions. I think the speed of this kind of editing makes the model very interesting. Can you zoom in on the picture? View it full screen?
Nicole: Just click on it. This is Logan, still your face. And amazingly, the model keeps it looking like you while also putting you in a huge suit and generating a background of you walking through the city.
Logan: That's so interesting! The background is Chicago, and it really looks like that street.
Nicole: Yes, this is the world knowledge of the model at work. Let's continue then. Try "make it nano".
Logan: What does "make it nano" mean?
Nicole: We initially gave it the code name Nano Banana during testing, and later people guessed it was our new model update. And look, now it will turn you into a cute mini-version character in a banana suit.
Logan: Haha, I love it.
Nicole: This is the coolest part. Your prompt was actually very vague, but the model is creative enough to interpret it and generate a scene that both meets the prompt and is reasonable in context.
This is very exciting because this is the first time we've seen a model maintain scene consistency in multiple edits, and at the same time, users can interact with the model in very natural language without writing a long and complicated prompt. It makes you feel like you're having a conversation with the model, which is super fun.
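The conversational editing Nicole is demonstrating maps naturally onto a chat session in the API. Here is a minimal sketch using the google-genai SDK's chat interface; the model name, the input file name, and passing a PIL image directly in a message are assumptions based on the SDK's documented usage, not a transcript of what the team ran.

```python
# Multi-turn image editing sketch (assumed model name and input file).
from io import BytesIO

from google import genai
from PIL import Image

client = genai.Client()
chat = client.chats.create(model="gemini-2.5-flash-image-preview")

def save_images(response, prefix):
    # Save any image parts returned in this chat turn.
    for i, part in enumerate(response.candidates[0].content.parts):
        if part.inline_data:
            Image.open(BytesIO(part.inline_data.data)).save(f"{prefix}_{i}.png")

photo = Image.open("logan.jpg")  # hypothetical input photo

# Turn 1: edit the photo while keeping the subject recognizable.
first = chat.send_message(
    [photo, "Zoom out the shot and put him in a huge banana suit, but keep the face clearly recognizable."]
)
save_images(first, "banana_suit")

# Turn 2: a vague follow-up; the chat history keeps character and scene consistent.
second = chat.send_message("make it nano")
save_images(second, "nano")
```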
Logan: I love it so much. So, how does it perform in text rendering? This is one of the use cases I'm most concerned about.
Nicole: Do you want me to demonstrate? Give me a prompt.
Logan: Then "Gemini Nano", that's the only nano-related word I can think of. My most common use case is making posters or announcements with text.
Nicole: This is a very simple text, with few words and simple terms, so the effect is good. We do still have some deficiencies in text rendering, which was also mentioned in the release notes. Our team is working hard to improve it, and the next model will do better.
Text rendering is an effective signal reflecting the model's performance
Logan: I really like it. Are there any other examples, or stories about metrics, related to this release? I know evaluation is difficult, since much of it relies on human preference. How do you think about this?
Robert: Indeed, evaluation is very difficult for multimodal models, especially for images and videos. In the past, we mainly relied on human preference scores. But images are very subjective, so we need to collect signals from a large number of people, and the process is slow. We are also trying to find new metrics. Text rendering is an interesting example.
Kaushik has long emphasized its importance. Although we thought he was a bit obsessive before, we later found that it is actually very valuable. When the model learns to generate structured text, it can also better learn the structure in images, such as frequency, texture, etc. This provides us with a good signal.
Google Labs has a dedicated font-rendering project, GenType
Kaushik: Yes, I think it started with identifying the deficiencies of these models. To improve a model, we first need to clearly identify where it performs poorly, that is, find a "signal" to point out the problem. Then we will try various methods, whether it's the model architecture, data, or other aspects of improvement. Once we have this clear signal, we can indeed make good progress on the corresponding problem.
Looking back a few years, almost no model could render even a short prompt like "Gemini Nano" decently. We spent a lot of time digging into this metric and have tracked it ever since.
Now, no matter what experiments we run, as long as we keep tracking this metric we can make sure there is no regression on it. And precisely because we use it as a reference signal, we sometimes find that changes we didn't expect to matter actually have a positive effect. That lets us keep optimizing the metric and improving the model.
Yes, as Robert said, text rendering is a good way to measure overall image quality, since most other image quality metrics saturate quickly.
I was actually a bit skeptical about using human evaluation for image generation results at first, but as time passed, I gradually realized that as long as enough people evaluate enough prompts and cover different categories, we can indeed get very valuable signals.
Text rendering reflects the quality of image generation. The prompt: generate a poster of the line "The monkeys' cries on both banks ceaselessly ring; my skiff has left ten thousand crags behind."
But obviously this approach is expensive, and we can't always have lots of people scoring. So during model training, metrics like text rendering are especially valuable: they reflect whether the model is performing as expected and are a very effective signal.
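The team doesn't spell out how they score text rendering, so the sketch below is purely hypothetical: one simple way to turn text rendering into a trackable number is to OCR the generated image and compare the result against the requested string. The function name and similarity measure here are illustrative choices, not Google's method.

```python
# Hypothetical text-rendering score: OCR the image, compare to the expected string.
# Requires pillow and pytesseract (plus the Tesseract binary installed locally).
from difflib import SequenceMatcher

import pytesseract
from PIL import Image

def text_render_score(image_path: str, expected_text: str) -> float:
    # Return a 0..1 similarity between the OCR'd text and the requested text.
    ocr_text = pytesseract.image_to_string(Image.open(image_path))
    normalize = lambda s: " ".join(s.lower().split())
    return SequenceMatcher(None, normalize(ocr_text), normalize(expected_text)).ratio()

# Tracked across training checkpoints, a score like this can flag regressions early.
print(f"text rendering score: {text_render_score('output_0.png', 'Gemini Nano'):.2f}")
```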
Image understanding and image generation are as closely related as "sisters"
Logan: This is really interesting. I'm curious how the model's image generation ability and its image understanding ability interact. We did an episode with Ani before, and his team has obviously invested a lot in that area; Gemini, for example, has reached the industry's most advanced level in image understanding.
So, can we understand it like this: when the model becomes stronger in image understanding, part of this ability can actually be transferred to image generation? Conversely, the progress of image generation may also improve image understanding ability. Is this way of thinking reasonable?
Mostafa: Yes. Basically, our goal is to ultimately achieve native multimodal understanding and generation, that is, within the same training process, the model learns to handle tasks across different modalities at the same time, so that "positive transfer" emerges between these different abilities.
And this is not just about image understanding and image generation promoting each other, nor is it limited to the generation ability of a single modality. Beyond that, we hope the knowledge the model learns from images, video, and audio can in turn help with text understanding and text generation.
So, it can be said that image understanding and image generation are as closely related as "sisters". In some of the applications we see now, such as interleaved generation, the two are indeed complementary and develop synchronously.
But our ultimate goal goes far beyond this. In language, for example, there is a phenomenon called "reporting bias".
What does that mean? If you visit a friend's house and later chat about it, you usually won't bother to mention their ordinary sofa. But if you show someone a photo of that room, the sofa is right there; it exists in the image even if you never mention it.
So if we want to understand the world comprehensively, images and videos in fact contain a large amount of information that we can obtain without explicitly asking for it.
In other words, we can of course learn a lot from text, but it may take an enormous amount of language data (tokens) to learn it. Visual signals are a "shortcut" to understanding the world and convey certain types of information much more efficiently.
Back to the topic of image understanding and generation, as I mentioned before, the two are closely related and complementary. In some of the applications we see now, such as interleaved generation, the two do complement and develop synchronously.
Nicole: Let me transform this theme into a scene of a charming 1980s American-style shopping mall, presented in five different ways. Okay, hope everything goes well.
It looks like it turned out well. It does take some time, since we're not only generating multiple images but also generating text that describes each of them.
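Nicole's demo is a small example of interleaved generation: a single request comes back as alternating text and image parts. A minimal sketch of how that might look through the google-genai SDK is below, again assuming the preview model name; the prompt simply mirrors her example.

```python
# Interleaved generation sketch: one prompt, alternating text and image parts back.
from io import BytesIO

from google import genai
from PIL import Image

client = genai.Client()

response = client.models.generate_content(
    model="gemini-2.5-flash-image-preview",  # assumed model name
    contents=(
        "Transform this theme into a charming 1980s American-style shopping mall, "
        "shown in five different ways. Before each image, briefly describe the variation."
    ),
)

# Walk the parts in order: print each description, save each image.
image_count = 0
for part in response.candidates[0].content.parts:
    if part.text:
        print(part.text)
    elif part.inline_data:
        image_count += 1
        Image.open(BytesIO(part.inline_data.data)).save(f"mall_{image_count}.png")
```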