The Nano Banana team talks about AI products and image models: ultimately, we hope the different modalities can merge into one.
In the latest episode of the podcast "Unsupervised Learning" from US venture firm Redpoint Ventures, partner Jacob Effron spoke with Nicole Brichtova and Oliver Wang, the two Google researchers behind Nano Banana. According to the discussion, Nano Banana's popularity stems from the unprecedented "character consistency" the model achieves.
Nano Banana was "anonymously" released on August 26th, and it was later proven that this model is Google's Gemini 2.5 Flash Image model. The success of Nano Banana has also led to a soaring download volume of Google's Gemini APP.
According to the latest data from app analytics firm Appfigures, the app has climbed to the top of global app store rankings, with September downloads up 45% month-on-month. Although only half of September has passed, the Gemini app has already been downloaded 12.6 million times this month, far more than the 8.7 million downloads in August. Before this, Gemini had only reached third place in the US App Store, on January 28, 2025. The stock price of Google's parent company, Alphabet (GOOG.US), rose 19.56% from August 26th through the close on September 17th.
Gemini download volume data (Source: Appfigures, TechCrunch)
Beyond the product itself, the podcast covered topics such as how the model fits into creative workflows, why "we are still in the early stages of AI image development" despite today's powerful image capabilities, and how image and video generation are converging.
During the interview, Nicole Brichtova and Oliver Wang also discussed the current model's limitations, its safety strategy, and why the expectation of going directly "from a prompt to production-ready content" is greatly overestimated.
The following is an abridged transcript of the interview, translated by "Mingliang Company":
Nicole (left), Oliver (middle), Host Jacob (right)
The success of Nano Banana is attributed to character consistency
Jacob: Nicole and Oliver, thank you so much for coming on the show. I've been really looking forward to this conversation. It feels like Nano Banana has taken over my entire Twitter feed and all my free time.
Today, we'll delve into many topics. Maybe we can start with this question - you had access to the product and model before its release. I remember it was initially released anonymously, but you were among the first to play around with it. I'm curious, which use cases did you initially think would be the most popular or exciting? And how has that compared with what's actually happened since the release?
Nicole: Oliver has seen many iterations of pictures of my face. For me, the most exciting part is character consistency and being able to see myself in new scenarios - I actually have a whole deck of slides with my face on wanted posters, as an archaeologist, and in the careers I dreamed of as a child.
Basically, we've now created an evaluation dataset that includes my face and those of other team members, which we use to test every time we develop a new model.
Jacob: In the AI field, this is the highest honor.
Nicole: I'm really excited. So I highly value character consistency because it gives people a whole new way to imagine themselves, which was difficult before. This is also one of the reasons everyone is ultimately so excited. We've seen many people turn themselves into figurines, which is one of the most popular use cases. Another use case that surprised me but actually makes sense is people colorizing old photos, which is emotionally very valuable. For example, I can now see what I really looked like as a child, or what my parents really looked like, restored from black-and-white photos.
Jacob: This is really interesting. I'm sure seeing all the different use cases is one of the joys of having a popular product. I've also seen on Twitter that you must have received countless feature requests. Everyone wants the model to do this or that. What are the most common requests? How do you view the next milestones or development directions for these products and models?
Nicole: The most common request on Twitter is for higher resolution. Currently, many professional users are requesting images with a resolution above 1K. There are also many requests for support for transparent backgrounds, which is a common requirement for professional users. These two are the most common ones I've seen, along with better text rendering.
Jacob: Character consistency used to be a big problem that was difficult to solve, and you've done an excellent job in this regard. What do you think is the next frontier for the improvement of image models?
Oliver: For me, the most exciting thing about this model is that it can start to handle more difficult questions. Previously, you had to define every detail of the image you wanted, but now you can ask for help just like you would with a language model. For example, someone used it to redecorate a room but didn't have any ideas themselves, so they asked the model for suggestions. The model can give reasonable suggestions based on the color scheme, etc.
I think the most interesting thing is to combine the world knowledge of language models to let the image model really help users and even show them things they didn't expect. For example, an information retrieval request - I want to know how something works, and the model can generate explanatory pictures. I think this is a very important use case in the future.
Jacob: How's the progress in this area?
Oliver: Aesthetics are always tricky, because giving genuinely useful suggestions requires deep personalization. I think personalization is an area that is still being improved on the technical side. It will take us some time to really understand users' needs, but if we can have a dialogue with the model to continuously clarify and refine, I think it's very promising. For example, we can go back and forth in a conversation thread until the picture we want is generated.
Jacob: Do you think personalization will only happen at the prompt level? That is, by providing enough descriptions and context to the model to achieve personalization? Or will there be different aesthetic models?
Oliver: I think it will happen more at the prompt level. For example, the information users provide can help us make more informed decisions. I hope so. After all, it sounds very complicated for everyone to have their own model and be served separately, but maybe that's how it will be in the future.
Nicole: But I do think there will be significant differences in aesthetics. I think to some extent, personalization must be achieved at that level. You can see this on the Google Shopping tab. For example, when you're looking for a sweater, the system will recommend a bunch of them, but you actually want to focus on your own aesthetics and even select matching items from your wardrobe. I hope all of this can be achieved within the model's context window. We should be able to feed the model pictures from your wardrobe and then help you find suitable matches. I'm really looking forward to this and hope it can be done. Maybe more advanced aesthetic control will be needed, but I think that will probably happen more at the professional user level.
Jacob: In the field of language models, and even in the image field, many decisions actually depend on the data used during pre-training, which directly affects the model's final capabilities and aesthetic style. So I'm also curious: will there be a universal model in the future that covers all image use cases through prompts, or will there be models of various styles?
Nicole: We've always been surprised by the range of use cases that off-the-shelf models can support. You're absolutely right. Many consumer-oriented use cases, such as just wanting to draw a room rendering, can be done. But once it comes to more advanced functions, other tools need to be integrated to turn it into a final product and play a role in workflows such as marketing or design.
Jacob: Everyone must be curious: why have these models become so good?
Nicole: There are many special reasons.
Oliver: Actually, there isn't a single factor. It's about getting all the details right, really fine-tuning the recipe, and having a team that has been focused on this problem for a long time. We were actually quite surprised by how successful the model was. We knew the model was cool and were looking forward to its release. But when we launched it on LM Arena, it wasn't just that it had a high Elo score - which is of course great, and a good sign that the model is useful. For me, the real indicator was that a huge number of users flocked to LM Arena to use the model. We had to keep increasing the queries-per-second capacity, which we completely didn't expect. That was the first time we realized this was something genuinely useful - many people need a model like this.
After the launch, Nano Banana's Elo score was significantly ahead (Source: LM Arena website)
Jacob: I think this is the most interesting part of this ecosystem. You had some expectations when building the model yourself, but it's only when it's really in the hands of users that you can discover its power and influence. This time, it obviously caused a huge reaction.
Obviously, the model's reasoning ability largely benefits from progress in the language model itself. Can you talk about how much the image model has benefited from language model progress? Do you think this trend will continue as LLMs develop?
Oliver: It definitely does. It relies almost 100% on the world knowledge of the language model - for example, Gemini 2.5 Flash Image (that's the model's official name).
Jacob: It would be better if the name was a bit more interesting.
Nicole: (Nano Banana) is definitely easier to pronounce.
Oliver: I'm a bit curious if our success is because people like saying the name Nano Banana. But it is indeed part of the Gemini model. You can communicate with it just like you would with Gemini, and it knows everything that Gemini knows. This is a key step for these models towards practicality, which is to integrate with the language model.
Nicole: You may remember that two or three years ago, you had to be very specific when describing what you wanted - for example, "a cat on the table, with this background, in these colors." Now you don't need to be that detailed. A big reason is that the language model has become much stronger.
Jacob: So it's no longer magic prompt rewriting in the background - previously, when you entered one sentence, the system would automatically expand it into a detailed ten-sentence prompt. Now the model itself is smart enough to understand your intent. That's really exciting.
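For readers who want to try this conversational, Gemini-integrated image workflow, here is a minimal sketch using the Google GenAI Python SDK. The model identifier, file names, and placeholder API key are illustrative assumptions rather than details confirmed in the interview; check Google's current documentation for the exact names.

```python
# Minimal sketch: conversational image editing with the Gemini image model.
# Assumes the google-genai Python SDK (pip install google-genai pillow) and a
# valid API key; the model ID below is an assumption based on the name
# mentioned in the interview and may differ from the current public identifier.
from io import BytesIO

from google import genai
from PIL import Image

client = genai.Client(api_key="YOUR_API_KEY")  # placeholder key

# A reference photo plus a plain-language instruction, much like chatting with Gemini.
room_photo = Image.open("my_room.jpg")  # hypothetical local file
prompt = "Suggest a warmer color scheme for this room and show me the result."

response = client.models.generate_content(
    model="gemini-2.5-flash-image-preview",  # assumed model ID
    contents=[prompt, room_photo],
)

# The response can mix text (the model's suggestions) and generated image parts.
for part in response.candidates[0].content.parts:
    if part.text is not None:
        print(part.text)
    elif part.inline_data is not None:
        Image.open(BytesIO(part.inline_data.data)).save("redecorated_room.png")
```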
Polishing the product, and the potential of multimodal and voice AI
Jacob: From a product perspective, you have different types of users. Some are experts who went to LM Arena to play with the model as soon as it was launched. They know very well how to use it. There are also many ordinary Gemini users who have absolutely no idea what to do in front of a "blank canvas." How did you consider building the product for these two types of users?
Nicole: There's still a lot we can do. You're right. The users and developers on LM Arena are very sophisticated and can use these tools to create new use cases that we didn't expect. For example, someone turned an object in a photo into a holographic image. We never trained the model for that scenario, but it performed very well. For ordinary consumers, ease of use is extremely important. Now when you enter the Gemini app, you'll see banana emojis everywhere. We did this because after people heard about Nano Banana, they went looking for it, but there was no obvious entry point in the app.
We've done a lot of work, such as collaborating with creators to preset some use cases and releasing examples with direct links to the Gemini app, where the prompts will be automatically filled. I think there's still a lot of room for improvement in the "zero state" problem, such as using visual guidance for users. In the future, gestures could also become a way to edit pictures, not just relying on text prompts.
Sometimes, if you want a very specific effect, you still need a long prompt, but this isn't natural for most users. So I use the "parent test method" - if my parents can use it, then it passes. We haven't reached that level yet, so there's still a long way to go.
Many problems actually boil down to "showing rather than telling." Provide users with examples that are easy to replicate and make sharing simple. There isn't a magic solution. It requires efforts from multiple aspects.
Oliver: We also found that social sharing is very important in solving the "blank canvas" problem. When users see what others have done, since the model can be personalized by default, they can try it with their own photos, friends, and pets. It's very easy to imitate, and this is also an important way for the model to spread.
Jacob: Currently, everyone interacts with the model using text. Are you excited about any new types of design interfaces in the future?
Nicole: I think we've just started exploring the possibilities. Ultimately, I hope all modalities can be integrated, and the interface can automatically switch to the most suitable way based on the task. Now large models can not only output text but also pictures and visual explanations to meet users' needs.
I think voice has great potential and is a very natural way of interacting, but no one has really created a great voice interface yet; currently, we're still typing text. In the future it could be combined with pauses, gestures, and so on. For example, if you want to erase an object in a picture, you should be able to do it as naturally as you would on a piece of scratch paper. How to switch seamlessly between modalities is a direction I'm very much looking forward to, and there's still a lot of room to explore what form it will actually take.
Jacob: What do you think are the limitations of voice?
Nicole: Some issues are about priorities. We're still working on improving the model's capabilities. Voice has also made great progress in the past two years. I think someone will try it soon, and maybe we'll also do some related work. The problem lies in how to detect users' intentions and then switch between different modes based on those intentions, because it's not obvious. You may end up facing the "blank canvas" problem again. How do you show users the functions? We've found that when users come in, they have high expectations for the chatbot, thinking it can do everything. In fact, it's very difficult to explain the limitations and show all the functions, especially when the tool's capabilities are getting stronger. So we need to find a way to define the scope and show the possibilities in the UI to help users complete their tasks.
Jacob: And you teach users what the chatbot can do at a given moment, but three months later you have to teach them again because the capabilities have changed. That's also a very interesting product challenge.
Many products have evaluation mechanisms, and you have your own evaluation dataset, such as Nicole's photos. What does evaluation of image models usually look like? Besides letting users try it on LM Arena, how do you track the model's progress?
Oliver: One of the benefits of the progress of language models and vision