
Another domestic large image model has been open-sourced. In our hands-on tests, its continuous image editing impresses, but Chinese text rendering remains its weak spot.

Zhidongxi | 2025-12-08 18:45
A 6B-parameter image generation model reaches open-source state-of-the-art (SOTA).

Zhidongxi reported on December 8 that Meituan today officially released and open-sourced the image generation model LongCat-Image. The 6B-parameter model reaches open-source SOTA in image editing and focuses on two core scenarios: text-to-image and single-image editing.

▲ Image source: Hugging Face

Judging from the official benchmark results, LongCat-Image is primarily benchmarked against mainstream open-source and closed-source image generation models such as Seedream 4.0, Qwen-Image, HunyuanImage-3.0, Nano Banana, and FLUX.1-dev. Its core optimizations concentrate on two capabilities: "editing controllability" and "Chinese character rendering".

In actual use, it performs well at continuous image modification, style changes, and material detail. In complex layout scenarios, however, its Chinese character rendering is still unstable, and on tasks such as complex UI design and game interface generation the model also falls short aesthetically, which may be related to its lack of internet search capability.

For access, Meituan provides several entry points at once. On mobile, the LongCat app already supports text-to-image and image-to-image; on the web, users can reach the image generation feature at https://longcat.ai/.

For developers, the model weights and code of LongCat-Image have been open-sourced at the same time (a minimal loading sketch follows the links):

Hugging Face: https://huggingface.co/meituan-longcat/LongCat-Image

GitHub: https://github.com/meituan-longcat/LongCat-Image
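For a quick local smoke test, something along these lines should be close, assuming the Hugging Face repo ships a diffusers-compatible pipeline; the dtype and sampler settings here are assumptions, so check the official model card for the supported interface:

```python
# Minimal text-to-image smoke test. Assumes the Hugging Face repo exposes a
# diffusers-compatible pipeline; dtype and sampler settings below are guesses,
# so consult the official model card for the supported interface.
import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    "meituan-longcat/LongCat-Image",  # repo id from the links above
    torch_dtype=torch.bfloat16,       # 6B parameters fit comfortably in bf16
).to("cuda")

image = pipe(
    prompt="A fox police officer in pixel-art style, studio lighting",
    num_inference_steps=30,
    guidance_scale=4.5,
).images[0]
image.save("longcat_sample.png")
```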

Below, we walk through LongCat-Image's model structure, evaluation results, and hands-on test performance.

01. From the model structure to the evaluation results, LongCat - Image focuses on "editing controllability" and "Chinese rendering"

In terms of model design, LongCat-Image adopts a unified architecture for text-to-image and image editing, and through a progressive learning strategy balances the joint improvement of three capabilities: instruction-following accuracy, image generation quality, and text rendering, all at only 6B parameters.

▲ Model architecture

This training route does not pile up parameters from scratch. Instead, the model is initialized from a mid-stage text-to-image checkpoint, and subsequent stages use multi-task joint learning over text-to-image and instruction editing, so that editing ability is not squeezed out during post-training. A rough sketch of such a loop is below.
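The report does not publish training code, but the joint learning it describes can be pictured as a loop that mixes the two tasks at some ratio; everything below, the loss hooks and the ratio included, is an illustrative assumption rather than Meituan's implementation:

```python
import random

# Illustrative multi-task loop: each step samples either a text-to-image batch
# or an instruction-editing batch, so neither capability is "compressed" by
# training only on the other. Ratio and loss hooks are assumed, not official.
T2I_RATIO = 0.5  # hypothetical mixing ratio between the two tasks

def train_step(model, t2i_loader, edit_loader, optimizer):
    if random.random() < T2I_RATIO:
        batch = next(t2i_loader)      # (caption, target image)
        loss = model.t2i_loss(batch)  # hypothetical text-to-image loss hook
    else:
        batch = next(edit_loader)     # (source image, instruction, edited image)
        loss = model.edit_loss(batch) # hypothetical instruction-editing loss hook
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```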

In terms of image editing capabilities, LongCat-Image has achieved open-source SOTA results on multiple editing-related benchmarks such as GEdit-Bench and ImgEdit-Bench.

▲ Objective benchmark test performance comparison

Through multi-source data pre-training, instruction rewriting strategies, and manually refined SFT data, LongCat-Image is less prone to style drift and structural distortion under complex editing requirements.

To address the long-standing pain point of Chinese character rendering, LongCat-Image pre-trains on synthetic glyph data covering 8,105 standardized Chinese characters. The SFT stage introduces real-world text images to improve typesetting and font generalization, and the RL stage applies a dual reward, an OCR model plus an aesthetic model, as a joint constraint. The result is a score of 90.7 on the ChineseWord benchmark, leading existing open-source models.

On realism, LongCat-Image deliberately avoids the "plastic" texture typical of AIGC through adversarial training and a strict data screening mechanism. In the RL stage it introduces an AIGC detector as a reward signal, in effect training the model in reverse to reproduce real-world physical textures and lighting.
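Neither the OCR/aesthetic dual reward nor the AIGC-detector reward is spelled out in code, but the joint constraint the last two paragraphs describe could be shaped roughly as follows; the three scorers and the weights are placeholders, not the models Meituan actually uses:

```python
# Hypothetical RL reward shaping: an OCR model checks that requested text is
# rendered correctly, an aesthetic model scores visual quality, and an AIGC
# detector's "looks synthetic" probability becomes a penalty, nudging the
# policy toward real-world textures rather than a plastic sheen.
W_OCR, W_AESTHETIC, W_REALISM = 0.5, 0.3, 0.2  # assumed weights

def reward(image, expected_text, ocr_model, aesthetic_model, aigc_detector):
    ocr_score = ocr_model.text_accuracy(image, expected_text)  # in [0, 1]
    aesthetic_score = aesthetic_model.score(image)             # in [0, 1]
    p_synthetic = aigc_detector.prob_fake(image)               # in [0, 1]
    return (W_OCR * ocr_score
            + W_AESTHETIC * aesthetic_score
            + W_REALISM * (1.0 - p_synthetic))  # reward realism
```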

The comprehensive evaluation shows that on human subjective scoring (MOS), LongCat-Image approaches commercial models such as Seedream 4.0 across sub-items including text alignment, visual realism, and aesthetic quality.

▲ Human subjective scoring (MOS) comparison

▲ Side-by-side (SBS) comparison win rate

In side-by-side (SBS) evaluations of image editing tasks, LongCat-Image-Edit achieved high win rates against models such as Nano Banana and Qwen-Image-Edit on two key indicators: overall quality and consistency.

Overall, LongCat-Image approaches some closed-source models on image editing tasks and remains among the leading open-source models in basic text-to-image capability.

02. From comic redrawing to doll product rendering, continuous editing is stable, but Chinese rendering remains a shortcoming

In hands-on testing, LongCat-Image is fairly stable at continuous instruction-based editing. We tested directly with images from the recently popular "Zootopia 2", making multiple consecutive rounds of modifications to the same character.

▲ Reference image

Instruction: Modify it into a pixel - style work.

Instruction: Redraw it in color while retaining the pixel texture.

Instruction: Redraw the characters in the picture as animals imitating the Lego theme.

In the comic image test, across consecutive redraw instructions, pixel style, colored pixel redraw, and a Lego-style animal theme, the model kept the character structure stable while completing multiple rounds of style and material transfer, with essentially no visible errors in character outline or composition. A sketch of this edit-chaining pattern follows.
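Mechanically, such a multi-round test just feeds each round's output back in with a new instruction. A minimal sketch of the chain, again assuming a diffusers-compatible editing pipeline (the repo id and call signature are assumptions; check the model card):

```python
# Chained instruction editing: each round's output becomes the next round's
# input. Pipeline class and call signature are assumptions based on common
# diffusers image-editing pipelines.
import torch
from diffusers import DiffusionPipeline
from diffusers.utils import load_image

edit_pipe = DiffusionPipeline.from_pretrained(
    "meituan-longcat/LongCat-Image",  # assumed to serve the editing variant too
    torch_dtype=torch.bfloat16,
).to("cuda")

image = load_image("zootopia_reference.png")  # local reference frame
instructions = [
    "Modify it into a pixel-style work.",
    "Redraw it in color while retaining the pixel texture.",
    "Redraw the characters in the picture as animals imitating the Lego theme.",
]
for i, instruction in enumerate(instructions):
    image = edit_pipe(prompt=instruction, image=image).images[0]
    image.save(f"edit_round_{i}.png")  # inspect each round for structural drift
```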

Building on this, we also tried a movie poster scenario, using the same character image for main-visual poster generation and a multi-language title rendering test.

Instruction: A promotional poster for the movie "Zootopia 2". The poster's main visual is an exciting scene featuring the film's protagonist. The main title "Zootopia 2" uses an artistic handwritten font, with the English name "Zootopia" below it. Also include the other fine print a movie poster requires; all text should be clear and legible.

In the movie poster scenario, the model inherits the reference image fairly faithfully: both the character's look and dynamic pose stay highly consistent with the original. The Chinese and English main titles are also reasonably clear. In the fine-print area, however, the dense small text still shows garbled characters and stray English, indicating that Chinese character rendering remains unstable in complex layouts.

Pushing further with a Chinese character-profile poster, the model renders some core fields correctly, but Chinese-English misalignment and local garbling still creep in.

Instruction: Generate a promotional poster in the form of a character profile for an animated movie character, with text reflecting the following information: Nick Wilde, a fox from the Disney animated film "Zootopia". Chinese name: Nick the Fox. English name: Nick Wilde. Species: red fox. Occupation: con artist turned police officer. Partner: Officer Judy Hopps. Classic line: "Did I break your little heart?"

In the product-level rendering test, the texture of an Officer Judy plush doll held up across multiple real-world settings: studio lighting, warm desk-lamp light, natural living-room light, and a bedding scene. The fine fuzz, the eye highlights, and the material contrast between the sofa fabric and the doll's fur were all rendered accurately, bringing the result close to a commercial product render.

By contrast, in game interface generation, a scenario where mainstream models are stronger, LongCat-Image's shortcomings are obvious. Whether for a card game, a shooter, or a MOBA's first-person interface, the overall style leans toward UI design from more than a decade ago, a visible generation gap from current mainstream games.

Instruction: Generate a card game interface.

Instruction: Generate a shooting game interface.

Instruction: Generate a game interface for "League of Legends".

Instruction: Generate a first - person game interface for "Honor of Kings".

From the results of this test, LongCat-Image delivers stable continuous editing and convincing material rendering, while Chinese text in complex layouts and game UI aesthetics remain its clearest shortcomings.