
Google's Nano Banana has gone viral across the entire internet. Let's uncover the team behind it.

机器之心 · 2025-08-29 15:04
The "interleaved generation" feature is introduced to enhance the model's capabilities in world knowledge and creative interpretation.

Can a banana be turned into a dress? Google has truly made it happen!

In the latest episode of the Google Developers show, the Google DeepMind team gave the first comprehensive showcase of Gemini 2.5 Flash Image, a cutting-edge model with native image generation and editing capabilities.

It not only generates high-quality images quickly but also maintains scene consistency across multi-turn conversations, delivering a new level of interactivity and setting a new state of the art for image generation.

The research and product teams behind it also made their first public appearance.

Unveiling the Team Behind It

Logan Kilpatrick

Logan Kilpatrick is a senior product manager at Google DeepMind, leading product development for Google AI Studio and the Gemini API.

He is highly regarded in the AI developer community. He previously led developer relations at OpenAI and is well known by the nickname "LoganGPT". Before joining Google, he worked as a machine learning engineer at Apple and an open-source policy advisor at NASA.

At Google, Kilpatrick led the launch of native image generation in Gemini 2.0 Flash, enabling developers to generate and edit images through natural-language prompts. Highlights of the feature include multi-turn conversational image editing, interleaved generation of images and text, and image generation grounded in world knowledge.

Kilpatrick also regularly shares product updates and developer resources on X, making him an informal spokesperson for Google AI.

He studied at Harvard University and the University of Oxford. Earlier in his career, he developed software for lunar rovers at NASA and trained machine-learning models at Apple. He is a proponent of the Julia programming language, and in 2024 he said it was "increasingly likely" that the field would move directly toward artificial superintelligence (ASI) without focusing on the intermediate stages.

Kaushik Shivakumar

Kaushik Shivakumar is a research engineer at Google DeepMind, focusing on the research and application of robotics, artificial intelligence, and multimodal learning.

He earned a bachelor's degree in computer science from the University of California, Berkeley, and is pursuing a master's degree at the university's AUTOLab under Professor Ken Goldberg. His graduate research focuses on robotics topics including deformable-object manipulation, language models, and reinforcement learning.

Before joining DeepMind, Kaushik was a software engineering intern on the Google Brain team, researching uncertainty-estimation methods for deep neural networks. He has also held research and internship positions at institutions such as UC Berkeley's RISE Lab and Snorkel AI, working on projects in robotics, machine learning, and weakly supervised learning.

At DeepMind, Kaushik has contributed to several major projects, including the development of the Gemini 2.5 model, which made significant advances in reasoning, multimodal understanding, and long-context processing. He has also published research papers on robot manipulation, object tracking, and semantic search.

Robert Riachi

Robert Riachi is a research engineer at Google DeepMind focusing on the development and application of multimodal AI models, with notable contributions in image generation and editing.

He studied computer science and statistics at the University of Waterloo in Canada.

At DeepMind, Riachi has worked on several major projects, including R&D for the Gemini 2.0 and Gemini 2.5 model families, which aim to combine image generation with conversational AI so that users can perform fine-grained image edits through natural-language prompts.

Before joining DeepMind, Riachi worked as a software engineer and machine learning engineer at companies such as Splunk, Bloomberg, SAP, and Deloitte.

Nicole Brichtova

Nicole Brichtova completed her undergraduate studies at Georgetown University and her graduate studies at Duke University's Fuqua School of Business. She now leads visual generation products at Google DeepMind, building generative models that power products such as the Gemini app, Google Ads, and Google Cloud.

Before joining DeepMind, Nicole worked on product and market strategy on Google's consumer products team, helping plan and launch multiple projects. She also worked as a consultant at Deloitte, advising Fortune 500 technology companies on innovation and growth.

Nicole is particularly interested in how generative AI can support creativity, design, and new ways of interacting with technology. She has presented DeepMind's latest progress in visual generation at multiple public events, emphasizing the models' ability to understand complex instructions and produce high-quality images.

Mostafa Dehghani

Mostafa Dehghani is a research scientist at Google DeepMind working on machine learning, with a focus on deep learning. His research interests include self-supervised learning, generative models, large-model training, and sequence modeling.

Before joining Google, he completed a PhD at the University of Amsterdam, where his doctoral research focused on improving learning under incomplete supervision. He explored introducing inductive biases into algorithms, incorporating prior knowledge, and meta-learning from the data itself, with the aim of helping learning algorithms cope with noisy or limited data.

He joined Google DeepMind in 2020 and has contributed to several major projects, including the multimodal vision-language model PaLI-X, the 22-billion-parameter Vision Transformer (ViT-22B), and DSI++, a method for incrementally updating Differentiable Search Indices with new documents.

What are the technological highlights of Nano Banana?

At the start of the episode, the researchers demonstrated several highlights of the image-editing tool.

Image editing and scene consistency:

Asked to "dress Logan in a huge banana suit", the model generated the image in just over ten seconds, and the result not only preserved Logan's facial features but also added a Chicago street background.

Creative interpretation and handling of vague instructions:

When prompted to "make it nano", the model generated a chibi-style miniature version of Logan, still keeping the banana-suit setup.

The model supports multi-turn interaction through natural-language instructions and maintains scene consistency across repeated edits, without requiring long prompts.

In the past, the biggest complaint about image-generation AI was that rendered text "looks like an alien language". Gemini 2.5 Flash Image can now correctly render short text in images, such as "Gemini Nano".

The team even uses text-rendering ability as a new evaluation metric: it reflects the model's capacity to generate the "structure" of an image and serves as a signal of overall image quality, which helps guide model improvements.
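The idea of tracking a text-rendering score to catch regressions can be illustrated with a toy sketch. Everything here is hypothetical, not DeepMind's actual evaluation pipeline: the score simply compares the text the prompt asked for against text recovered from the image (e.g. via OCR, which is assumed to happen elsewhere), and a regression is flagged when a new checkpoint drops below the running average.

```python
from difflib import SequenceMatcher

def text_render_score(expected: str, recognized: str) -> float:
    """Similarity between the text the prompt requested and the text
    recovered from the generated image (OCR step assumed elsewhere).
    Illustrative metric only."""
    return SequenceMatcher(None, expected.lower(), recognized.lower()).ratio()

def detect_regression(history: list[float], new_score: float,
                      tolerance: float = 0.05) -> bool:
    """Flag a regression if the new checkpoint's score falls noticeably
    below the average of previous checkpoints."""
    if not history:
        return False
    baseline = sum(history) / len(history)
    return new_score < baseline - tolerance

scores = [0.92, 0.94, 0.93]
print(detect_regression(scores, 0.95))  # False: no regression
print(detect_regression(scores, 0.80))  # True: quality dropped
```

Tracking a scalar like this per model checkpoint is one simple way a team can make "don't get worse at text rendering" an automated check rather than a manual eyeball test.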

They track this metric to guard against regressions. Text rendering still has shortcomings, but the team is actively working to improve it.

Moreover, Gemini 2.5 Flash Image is not just a "drawing machine"; its core appeal lies in understanding images.

The team explained that the model tightly couples native image generation with multimodal understanding: image understanding informs generation, and generation in turn strengthens understanding; the two reinforce each other.

From images, videos, and even audio, Gemini can learn additional knowledge about the world, improving its text understanding and generation; visual signals become a shortcut to understanding the world.

In terms of the user experience, the model introduces an "interleaved generation" mechanism.

For complex tasks that require multiple modifications, it breaks a single instruction into multi-turn operations, generating and editing the image step by step to achieve pixel-precise edits. Users only need to issue natural-language instructions; even vague prompts are interpreted creatively while scene consistency is maintained.
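The interleaved-generation idea can be sketched in miniature: a compound request is decomposed into single-edit steps, and each step is applied to the result of the previous one, so later turns always see the prior state. Here `decompose` and `apply_edit` are illustrative stand-ins (a naive split on "and", and a log of applied edits), not the model's actual planner or renderer.

```python
# Hypothetical sketch of interleaved generation: one compound instruction
# becomes a sequence of edits applied to a running image state.

def decompose(instruction: str) -> list[str]:
    """Naively split a compound instruction on 'and' into edit steps."""
    return [step.strip() for step in instruction.split(" and ") if step.strip()]

def apply_edit(image_state: list[str], step: str) -> list[str]:
    """Record an edit against the running state (a stand-in for actually
    re-rendering the image while preserving the rest of the scene)."""
    return image_state + [step]

def interleaved_edit(instruction: str) -> list[str]:
    state: list[str] = ["base image"]
    for step in decompose(instruction):
        state = apply_edit(state, step)  # each turn builds on the prior result
    return state

print(interleaved_edit(
    "put him in a banana suit and add a Chicago street background"))
```

The design point is that consistency falls out of the structure: because every edit operates on the previous output rather than regenerating from scratch, earlier decisions (the face, the suit) carry forward automatically.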

Whether it is a character's pose, clothing, or the background environment, edits and generations remain coherent across multiple rounds.

For example, when asked to generate multiple images in the style of a charming 1980s American shopping mall, each image keeps a consistent style and context: the model uses the multimodal context and references previous images when generating modifications.

Beyond entertainment, Gemini 2.5 Flash Image is also very useful in practical scenarios. In home design, users can quickly compare multiple options; for example, when visualizing different curtain styles in a room, the model can make precise modifications without disturbing the rest of the scene.

For personal outfit-of-the-day (OOTD) shots, whether changing clothes, changing camera angles, or generating a retro 1980s-style image, the person's face and identity remain consistently preserved. Each image takes only about ten seconds to generate, and failed attempts can be retried quickly, greatly improving creative efficiency.

So, in practical applications, how should developers choose between Imagen and Gemini?

Nicole Brichtova said that Gemini's ultimate goal is to integrate all modalities and move toward AGI (artificial general intelligence). This means Gemini is not just an image-generation tool but a system that can transfer knowledge across modalities to handle complex cross-modal tasks.

In contrast, Imagen focuses on text-to-image tasks and offers multiple variants on the Vertex AI platform, each optimized for specific needs such as high-quality single-image generation, fast output, or cost-effectiveness.

In short, if the task goal is clear and speed and cost-effectiveness are the priorities, Imagen remains the ideal choice.

In complex multimodal workflows, Gemini's advantages stand out: it supports generation plus editing, multi-turn creative iteration, and understanding of vague instructions.
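The guidance above can be condensed into a toy decision helper. The criteria names are illustrative only, not an official API; the logic simply mirrors the article's advice that any need for editing, multi-turn iteration, or vague-prompt interpretation points to Gemini, and a straightforward text-to-image job points to Imagen.

```python
# Toy chooser reflecting the Imagen-vs-Gemini guidance; criteria
# names are hypothetical, for illustration only.

def choose_model(needs_editing: bool, multi_turn: bool,
                 vague_prompts: bool) -> str:
    """Pick a model family per the article's rule of thumb."""
    if needs_editing or multi_turn or vague_prompts:
        return "Gemini 2.5 Flash Image"   # complex multimodal workflows
    return "Imagen"                        # clear goal, speed, cost

print(choose_model(False, False, False))  # Imagen
print(choose_model(True, False, False))   # Gemini 2.5 Flash Image
```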

Gemini can use world knowledge to interpret vague prompts and is well suited to creative scenarios. Nicole also added that Gemini can directly use reference