
The raw-photo image model even hotter than Nano Banana has leaked. Screenshots are no longer evidence! | Prompts included

ifanr · 2026-04-19 09:04
Scarily real

Is your impression of text-to-image generation still stuck on Nano Banana?

But, kid, the times have changed again.

@johnAGI168 https://x.com/johnAGI168/status/2044781168151724067

@0115hippo https://x.com/0115hippo/status/2044722124611539160

At the beginning of April, three anonymous image models appeared on the LM Arena evaluation platform, codenamed maskingtape-alpha, packingtape-alpha, and gaffertape-alpha. They disappeared a few hours later.

The three models were quickly attributed to OpenAI. The company has not officially announced them, but based on metadata returned by the API and user-side test records, the family already has a widely accepted name: GPT Image 2.

Screenshots can no longer be used as evidence.

In the past few years, one of the most obvious shortcomings of AI image generation models has been the text inside the images. In the DALL-E 3 era, if you asked it to write "Hello" in an image, what came out might be "Hellp" or even "Hl10", the letters skewed as if drunk. GPT Image 1 was much better and could handle simple English labels. By GPT Image 1.5, English text rendering accuracy approached 95%, but obvious defects remained in non-Latin scripts such as Chinese, Japanese, and Korean.

The leaked sample images of GPT Image 2 have changed this impression.

@MrLarus https://x.com/MrLarus/status/2044824800909054181

@akokoi1 https://x.com/akokoi1/status/2044789531615056175

The text in the images is exactly as it should be. Chinese characters are clear, with accurate shapes and complete strokes. Someone tested generating an image in the style of an ID card, and the name, address, and ID number were all correctly rendered, with regular typesetting. At first glance, it looked like a photo of a real document.

This is good news. The progress in text rendering means that generating infographics, posters, product packaging, and charts with complex layouts has become far more reliable.

But every coin has two sides. A model that can generate realistic ID-style images and accurately render UI screenshots naturally makes the idea that "screenshots can be used as evidence" increasingly questionable.

By comparison, this is also the core differentiator of the GPT Image series. Midjourney has yet to crack text rendering, and the Stable Diffusion line suffers from the same old problem. According to the leaked Arena results, GPT Image 2 surpasses Midjourney on four dimensions: text rendering, instruction following, photorealism, and world knowledge. Midjourney's remaining edge lies mainly in artistic style and aesthetic control.

Does it really know what the world looks like?

Some testers asked the model to generate an imagined pricing page for a "GPT-8" product. The resulting image was laid out in the style of the OpenAI website: the button positions and font choices looked lifted from a real interface, and the hierarchy of the price table was correct.

GPT Image 2 can generate images strikingly similar to real software interfaces, including browser windows, mobile app screens, and data visualization charts, with fidelity far beyond the previous generation.

@levelsio https://x.com/levelsio/status/2040333489476681758

This unlocks some genuinely useful workflows. Designers prototyping a product no longer need to open Figma and draw a pile of frames first; they can describe the desired interface in text and use the generated image as a reference for team discussion. An investor deck can show a "product screenshot" without waiting for engineers to write code. Documentation illustrations can be generated directly, instead of staring at a blank page wondering where to find screenshots.

@marmaduke091 https://x.com/marmaduke091/status/2040338311873515597

Text-to-image generation is no longer just about "generating images".

OpenAI has announced that DALL-E 2 and DALL-E 3 will officially stop service on May 12, 2026. Azure OpenAI's DALL-E 3 was retired early in February.

DALL-E was where many people first came into contact with AI text-to-image generation. It has only been a few years from those blurry early works to today.

Meanwhile, Google, which only cemented its industry standing with Nano Banana Pro at the beginning of 2026, may feel the pressure. Early test reports show GPT Image 2 surpassing Nano Banana Pro on three dimensions at once: realism, text rendering, and world knowledge. Such a three-way win is not common.

For creators, feelings are mixed. Illustrators, graphic designers, and photographers have confronted this topic more than once. Since the release of GPT Image 1, freelance graphic design listings have dropped by about 18%. AI has indeed replaced the "I need to hire someone for this" decision in some scenarios, but it is also creating new ways of working and expanding what one person can do.

The evolution speed of text-to-image generation models no longer gives people much time to adapt. It took only a few months for GPT Image 1 to evolve to 1.5, and about half a year from 1.5 to 2. Each generation solves the core shortcomings of the previous generation while opening up new possibilities.

GPT Image 2 is currently in A/B testing, and some ChatGPT users have been randomly granted access. The consensus prediction puts the official release around DALL-E's May retirement. If you want to try it early, you can take your chances on the LM Arena evaluation platform.

Test Address: https://arena.ai

Based on community feedback and the known advantages of this model, the following prompt templates can maximize your chances of success:

UI/Screenshot Prompt: A photo-realistic screenshot of a mobile banking app, clearly showing transaction records, with the dates, amounts, and merchant names clearly distinguishable. iPhone 16 screen, with the phone held naturally, and a coffee shop background.

Product Label Prompt: A photo-realistic product photo of a craft beer bottle, with clear label details, showing the brewery name "Oakridge Brewing Co.", an alcohol content of 6.8%, a mountain logo, and an ingredient list. Indoor lighting, white background.

Signage Prompt: A street view photo of a Tokyo alley at night, showing multiple bilingual neon signs in Japanese and English, including a ramen shop sign reading "Ichiban Ramen — Est. 1987", a karaoke bar sign, and various illuminated billboards. The wet sidewalk after rain reflects the lights.

Interface/World Knowledge Prompt: A photo-realistic YouTube video screenshot showing a video titled "How to Assemble a Computer in 2026", which has 2.3 million views, with a realistic comment section, sidebar recommended videos, and channel information. Desktop browser view.

Wide-Screen Trigger Prompt: This is a movie-like wide-screen photo of the exterior of an IKEA store at dusk, showing the illuminated IKEA sign, realistic cars in the parking lot, and shoppers going in and out. Golden hour lighting, format 16:9.
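The five templates above share a structure: scene description first, then the exact text the model must render (in quotes), then supporting details, lighting, and an explicit aspect-ratio cue. If you generate many variants, a small helper can keep that structure consistent. This is a minimal sketch of my own, not part of any official API; the function and parameter names are illustrative:

```python
def build_image_prompt(scene, rendered_text=None, details=(), lighting=None, aspect=None):
    """Compose a prompt in the order the templates above use:
    scene -> quoted text to render -> details -> lighting -> aspect ratio."""
    parts = [scene]
    if rendered_text:
        # Quoting the exact string helps text-rendering models reproduce it verbatim
        parts.append(f'with clearly legible text reading "{rendered_text}"')
    parts.extend(details)
    if lighting:
        parts.append(lighting)
    if aspect:
        parts.append(f"format {aspect}")
    return ", ".join(parts)

# Rebuild something close to the signage template above
prompt = build_image_prompt(
    "A street view photo of a Tokyo alley at night",
    rendered_text="Ichiban Ramen — Est. 1987",
    details=(
        "multiple bilingual neon signs in Japanese and English",
        "a karaoke bar sign and various illuminated billboards",
    ),
    lighting="the wet sidewalk after rain reflects the lights",
)
print(prompt)
```

Swapping only the `scene` and `rendered_text` arguments then lets you reuse the same skeleton for the ID-style, label, and UI prompts.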

Image sources and references not individually credited: https://miraflow.ai/blog/how-to-use-duct-tape-ai-model-arena-gpt-image-2-guide

This article is from the WeChat official account