It's precise enough to write on a grain of rice. OpenAI has driven people's trust in screenshots to zero.
Amid the fierce competition among AI giants, few expected a text-to-image model to open up such a clear lead on the leaderboards.
Yes, I'm talking about GPT Image 2.
By now, most of you have seen enough test samples to know how it outperforms its predecessor.
For example:
- Highly accurate, dense text rendering, especially for Chinese; it can even generate runnable code.
- Incredibly realistic, somewhat eerie UI screenshot simulations. Many people were fooled by a fake Claude Code tweet just yesterday.
- Much-improved image refinement and aesthetics. The tell-tale AI lighting effects show up far less often, which might give Midjourney a real scare.
- Strong reasoning: it attends to many details you never specified, and the interaction experience is getting closer to that of a large language model.
As for drawbacks, OpenAI itself admits that the model's spatial understanding is still weak.
Of course, after testing, I can also confirm that many long-standing arguments still hold. The design world won't collapse over this; aesthetics and creativity still belong to humans. Advertising professionals will benefit the most. The industry's market value does need re-evaluating, but it won't drop to zero. Every time a groundbreaking AI model ships, it's the novices, bosses, and investors who get most excited, and we all know what bosses and investors are thinking. By "novices" I don't just mean complete outsiders, but also people in other fields who need to supplement their work with visual creation. Advertising directors, for example, can save a great deal on shooting and post-production costs. At least for now, AI complements human weaknesses rather than replacing humans.
However, rather than worrying about AI replacing humans, we should be concerned about a different crisis: people's trust in images may collapse, and we will need to be more cautious about every screenshot we see.
01
In today's evaluation of GPT Image 2, we will stress-test the advantages mentioned above, such as text rendering, UI simulation, fine-grained control, and strong reasoning, to find the model's boundaries and its potential security risks.
First, regarding text rendering, I noticed a picture released by OpenAI, which seemingly shows an ordinary pile of white rice on a burlap cloth.
But when zoomed in, there's an Easter egg. You can see the words "GPT Image 2" written on a single grain of rice in the center.
This picture is the most shocking official example for me.
I immediately set out to reproduce this example. But after multiple attempts on ChatGPT and Lovart, the results were mediocre. In most cases, the model either enlarged all the rice grains until the text fit easily,
or it "cheated", for instance by making only the grains bearing text absurdly large.
Later, I tried a multi-step iterative approach, asking the model to shrink the grains bearing text. After several rounds it finally looked roughly right, but the text was barely legible.
Then I discovered that OpenAI's example is 4K, while the free tiers of ChatGPT and Lovart only generate 1K images. So I bought a membership to test the highest-quality, highest-resolution version of GPT Image 2 (through Higgsfield AI). All the pictures below use this specification.
Does the highest specification guarantee a successful reproduction? No. The same problems persisted: either all the rice grains were too large, or the grains with text were, no matter how many times I reminded the model that "the text occupies only about 75x30 pixels" and "the grains with text should be the same size as the others".
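Part of the failure is simple arithmetic: at 1K resolution there may not be enough pixels on a single grain for legible text. A rough back-of-envelope sketch, where the 2% grain-to-frame ratio is my own assumption for illustration, not a measured value:

```python
# Approximate pixel budget for text on one rice grain at different output
# resolutions. Assumption: in a realistic pile shot, a single grain spans
# roughly 2% of the image width.

def grain_pixel_budget(image_width_px: int, grain_fraction: float = 0.02) -> int:
    """Approximate width in pixels available across one rice grain."""
    return round(image_width_px * grain_fraction)

for label, width in [("1K", 1024), ("4K", 4096)]:
    print(f"{label}: ~{grain_pixel_budget(width)}px across one grain")
```

Under this assumption, a grain spans only about 20 pixels at 1K but about 80 pixels at 4K. Multi-stroke characters simply cannot render legibly in 20 pixels, which would be consistent with the official example being a 4K image.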
Here are two examples that I find quite impressive. The first one stands out for the physical authenticity of text rendering, and the second one for the small yet clear text.
Next, I challenged it further by asking it to copy the word "Zhiwei" from the above picture to another grain of rice. This time, it went smoothly, but it was obvious that the model created a new grain of rice to write the text.
What if we write a large amount of text, such as a whole poem, on a single grain? The "cheating" reappeared. Even when I insisted that the font be one-tenth the original size and that the grain carrying the poem match the others, the result still looked out of place.
When I changed from a Tang poem to a Song poem, the model simply scattered a pile of rice of unknown variety to write on.
At this point, I had to stop. For now, it seems that either my prompts were poorly chosen, or OpenAI presented a lucky one-off, or OpenAI generated the example with a higher tier of compute. After all, the text is only visible after zooming in, which implies an extra layer of scene complexity and reasoning difficulty; the model may be dialing back its effort to save compute. We've all seen models that looked amazing during the promotional and internal-testing phases turn out far weaker after the official release.
Of course, this also hints that the AI capabilities of the big model companies may be far stronger than what we can see, but gated by compute and not widely deployable. Either way, this example symbolically shows that text-to-image models have once again broken through their limits.
By the way, guess how GPT Image 2's old rival, Nano Banana Pro, would handle this task?
Don't laugh. GPT Image 2 might do the same.
Next, let's look at a more practical test dimension: text rendering density. This may be the most practical capability of this release, useful for posters, product images, and scientific illustrations.
The test is simple: see how many words GPT Image 2 can fit into a single image.
We take the original text of Journey to the West as our source, feeding the model progressively longer excerpts from the first chapter and checking the results.
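The incremental procedure can be sketched as a small harness that truncates the source text at each checkpoint and wraps it in a rendering prompt. The checkpoint lengths mirror the tests below; the prompt wording is a hypothetical stand-in, not the one actually used:

```python
# Sketch of the incremental density test: slice the source text at growing
# lengths and hand each slice to the image model as a rendering prompt.
CHECKPOINTS = [1300, 2800, 5600, 10000]  # approximate excerpt lengths

def build_prompts(source_text: str, checkpoints=CHECKPOINTS) -> list[str]:
    """Return one rendering prompt per checkpoint, truncated to length."""
    prompts = []
    for n in checkpoints:
        excerpt = source_text[:n]
        prompts.append(
            "Render the following text verbatim in a single image, "
            "in a clean typeset layout:\n" + excerpt
        )
    return prompts

# Each prompt would then be sent to the model, and the output inspected
# for missing, crowded, or distorted characters.
```

The point of fixed checkpoints is that failures become comparable across runs: you can locate the length at which distortion first appears rather than eyeballing one-off generations.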
First, from the beginning to the moment when Sun Wukong was just born, there are about 1300 words.
The generated result is as follows. There are hardly any misspelled or distorted words, and even the pinyin annotations in the prompt are included.
Next, we extend the text to the moment Sun Wukong becomes the Monkey King, about 2800 words. This time the model struggles: text is missing at the end, and the final portion looks messy and crowded.
We extend it further, to the moment Sun Wukong sets out alone in search of immortality and meets an old woodcutter, about 5600 words. This time the model takes a shortcut and renders only about 1500 words.
Finally, we push the text to the scale of ten thousand words. The model is completely lost: it outputs a scientific illustration about changing a tire and a slide deck about cutting-edge information technology. The content I actually fed it was the full text of one of my earlier interview articles; I have no idea how the model got from that to changing a tire.
In the end, we compromise and trim the text back to the point where distortion only just begins: from the beginning of Journey to the West to the discovery of the Water Curtain Cave, about 2500 words. This time, the model completes the task decently.
What can Nano Banana Pro achieve?