Nano Banana looks like the DeepSeek of AI images.
I've long been accustomed to using AI to generate various images, but I've never seen an AI that can edit and adjust images so precisely.
This is the 66th issue of Narrowcast Weekly. The business trend we're focusing on in this issue is: Google's latest AI image generation model, Nano Banana, may trigger an explosion of applications in the AI image field.
At this time last year, when I wanted to replace the toy train in a person's hand in a photo with a toy plane, I had to mark the toy train precisely, find a clean picture of a toy plane, and tell the AI to replace the train with the plane from that other picture. After an hour and multiple attempts, the plane came out deformed, the person's hand disappeared, and the final result was only barely acceptable.
This week, I gave the same task to Nano Banana, simply telling it to "replace the toy train in the person's hand with a toy plane," without even preparing another photo of a toy plane. In just over 20 seconds, I got a new photo. Only the toy train was replaced: there were no unnecessary modifications to other parts of the photo, and none of the person's fingers disappeared or multiplied.
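An edit like this is a single call that interleaves the instruction text with the source image. As a hypothetical sketch only: the function below builds a request body in the shape of the Gemini REST API's generateContent endpoint (text part plus inline image part); the exact payload shape and model name should be verified against Google's current API documentation before use.

```python
import base64
import json

def build_edit_request(instruction: str, image_bytes: bytes,
                       mime_type: str = "image/png") -> str:
    """Return a JSON body asking an image model to edit the supplied photo."""
    body = {
        "contents": [{
            "parts": [
                # The plain-language edit instruction, e.g.
                # "replace the toy train in the person's hand with a toy plane"
                {"text": instruction},
                # The source photo, base64-encoded inline
                {"inline_data": {
                    "mime_type": mime_type,
                    "data": base64.b64encode(image_bytes).decode("ascii"),
                }},
            ]
        }]
    }
    return json.dumps(body)
```

In this sketch, the body would be POSTed to the model's `:generateContent` endpoint with an API key, and the edited image would come back as an inline-data part of the response.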
This change in experience reminds me of my first time using DeepSeek – I already knew I could chat casually with AI, but I didn't expect the AI I was chatting with to be so smart. Similarly, I've long been used to using AI to generate all kinds of images, but I've never seen an AI that can edit and adjust images so precisely.
Just as DeepSeek demonstrated the application prospects of AI through its deep-thinking ability, the more reliable image-creation ability shown by Nano Banana will likewise encourage the public to use AI more widely for image-related tasks.
Currently, social media is already full of figurine models, OOTD shots, and outfit-change pictures made with Nano Banana. Some users are even combining Nano Banana with large video-generation models to produce video content. This is not just another viral "Studio Ghibli-style" filter, but the popularization of a more efficient, more general image-creation ability.
This capability can support more product innovation. Beyond Gemini, Nano Banana or similar models may appear in many products in the future.
However, this also requires model makers, like the Nano Banana team, to think more comprehensively from a multimodal perspective about how to enhance a model's image-creation ability.
The Meitu Xiuxiu of the AI era is a capability
Using Nano Banana to adjust images feels much like the early experience of switching from Photoshop to Meitu Xiuxiu. To beautify photos in Photoshop, users need to remember different sequences of operations; users without the basics have to search Baidu for a tutorial every time they edit a photo. In Meitu Xiuxiu, the same beautification may take only a few clicks or drags, and even users with no background can quickly get the hang of it.
Now, Nano Banana lets ordinary users precisely adjust and modify images with a single sentence. This is another revolution in image creation. The difference is that the shift from Photoshop to Meitu Xiuxiu was mostly a change in product thinking, turning a fixed series of operations into a single click or drag; Nano Banana, by contrast, brings a capability: enabling AI to understand and edit images.
According to the Nano Banana team's introduction, two key points make this ability possible:
Native multimodal architecture. This architecture lets Nano Banana simultaneously understand and process context containing both text and images, and draw pixel-level information from that context to achieve pixel-perfect editing. It is what allows Nano Banana to precisely adjust a specific element of an image.
Interleaved Generation. Building on pixel-perfect editing, Nano Banana can break a complex prompt into multiple steps and complete the modifications step by step. The team calls this a paradigm shift: the model constructs a complex image incrementally, rather than being pushed to its limit by having to produce the final answer in one shot, as in traditional methods.
To some extent, this is a more agent-like ability achieved at the model level. DeepSeek drove the wide adoption of AI essentially by using its deep-thinking ability to break down prompts and execute them step by step, producing more satisfying results. Nano Banana follows the same logic: through more precise understanding and finer-grained task decomposition, it achieves highly consistent image editing.
On top of this, Nano Banana is also cheap and fast. According to Google, Nano Banana is priced at $30 per million tokens; each image takes about 1,290 tokens, for a cost of roughly $0.039 per image.
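The per-image figure follows directly from the token pricing; a quick back-of-the-envelope check, using the $30-per-million rate and the ~1,290-token count cited above:

```python
# Figures cited in the article: $30 per 1M tokens, ~1,290 tokens per image.
PRICE_PER_MILLION_TOKENS = 30.0
TOKENS_PER_IMAGE = 1290

# Cost per image = (tokens used / 1M) * price per 1M tokens
cost_per_image = TOKENS_PER_IMAGE / 1_000_000 * PRICE_PER_MILLION_TOKENS
print(f"${cost_per_image:.4f}")  # prints $0.0387, i.e. roughly $0.039
```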
Moreover, Nano Banana can generate an image in anywhere from about ten seconds to a few dozen. Fast generation combined with precise adjustment supports iterative creation, letting users keep trying and adjusting to get closer to their ideal result.
Outstanding capability will lead to broader applications
In my own experience, Nano Banana easily helped me dress Nezha in bean shoes and generate a fight scene between Luffy and Ace based on fight stills I found. Not every result satisfied me: in one generated picture, Luffy was much smaller than Ace, and when I asked it to adjust a previously generated result, the output image didn't change at all.
Still, this doesn't stop me from believing that Nano Banana can become a foundational capability for building AI image applications, or further upgrade some existing experiences and broaden their use.
The first type of application is virtual try-on. The capability Nano Banana provides lets users see a more realistic on-body rendering of the outfits they want to try, which will draw more people to the feature.
A Forbes report argues that Nano Banana's ability to maintain character consistency can make creators and studios more efficient at producing storyboards, children's books, and comics; cut the production cost of product promotion materials, since a single product photo can generate promotional posters for different scenarios; and let interior designers adjust decoration renderings on the fly from room photos, giving users more timely, lower-cost services.
The release of Nano Banana is also raising the ceiling of image-to-video generation. Creators can use Nano Banana to fine-tune first and last frames so that the generated video comes closer to what they expect, then edit and splice the segments into a complete video. For now, what the public sees most quickly is that many video-creation products use Nano Banana to let users adjust photos and generate better-looking face-swap videos.
These application explorations basically combine Nano Banana's base capability with the tacit knowledge of particular fields, lowering the barrier for users with specific image needs. Although Google will integrate Nano Banana into Gemini as a base capability of its general assistant, a general assistant is not omnipotent and sometimes cannot supply the tacit knowledge of a specific industry.
More applications are needed to help Nano Banana better absorb industry-specific tacit knowledge. When the well-known investment bank Morgan Stanley analyzed whether Meitu would be affected by Nano Banana, it concluded that Meitu's real value lies in providing the "last-mile" solutions that base AI models cannot.
Of course, such solutions will become increasingly segmented as model capability improves, tending toward highly specialized services for specific tasks. This may spark a wave of innovation, making image-related AI applications both more professional and more widespread.
Meitu itself, for example, may in the future become a collection of different image tools, selling tacit knowledge to users rather than serving as a basic photo-editing tool that attracts high-frequency use with free features.
Making Nano Banana succeed is a more comprehensive competition
In essence, the Nano Banana team is not just building an image-generation model, but applying multimodal capabilities to the field of image creation.
The Nano Banana team believes that the difference between Gemini and Google's image-generation model Imagen is that Gemini aims to integrate multiple modalities to ultimately achieve AGI, while Imagen focuses solely on image generation.
If users only want to efficiently generate high-quality, beautiful images, Imagen is the best choice. But if users also want to edit on top of generation, develop ideas, or obtain more creative results, Gemini is the better option.
Looking ahead, the Nano Banana team hopes the model will gain more Smartness and Factuality.
Smartness means that when a user's instructions are unclear or their understanding of reality is inaccurate, Nano Banana can keep the result consistent with the real world. The result may deviate from the user's instructions, but it achieves a more correct or better effect, leaving users with the impression that Nano Banana is very smart.
Factuality means that Nano Banana can not only create beautiful images but also generate accurate icons, infographics, and schematics, even producing PPT pages for users directly. This requires Nano Banana to be precise not only with image elements but also with text and data.
Achieving these two goals relies on Gemini's world knowledge to understand multimodal context. For example, Nano Banana can understand what the toy plane I want to add is, or what characterizes a trendy young man's style of dress.
For Google, the success of Nano Banana lies in establishing synergy between understanding and generation. Gemini's image-understanding ability helps the large model learn world knowledge from images and videos, beyond text alone; that knowledge in turn helps it understand and execute image-generation instructions more accurately.
This also means that integrating different model capabilities under the right mechanism is more likely to produce a leap in large-model capability. To some extent, this is a victory not only for the model cluster but also for the enterprise's organization and innovation mechanism.
This article is from the WeChat official account "Narrowcast", author: Li Wei. Republished by 36Kr with permission.