This domestically developed all-in-one image generation model has racked up 2,000 stars on GitHub in a week. It has now been upgraded, with both its understanding and image quality improved, and it has even learned to “reflect”.
A domestically developed open-source unified image generation model has undergone a major technological upgrade!
The new progress comes from the Beijing Academy of Artificial Intelligence (BAAI):
OmniGen, a model that supports text-to-image generation, image editing, and subject-driven image generation, has officially released its 2.0 version.
Specifically, while retaining a concise architecture, OmniGen2 significantly enhances context understanding, instruction following, and image generation quality.
Meanwhile, OmniGen2 fully inherits its base multi-modal large model's capabilities in context understanding and generation. It supports image and text generation simultaneously, further integrating the multi-modal technology ecosystem.
As soon as the model launched, it sparked extensive discussion in the open-source community. Within a week of its release, it received over 2,000 stars on GitHub, and related posts on X drew hundreds of thousands of views.
A research experience version is now open, so you can try featured capabilities such as image editing and in-context image generation in advance (links at the end of the article).
The team has also pledged that OmniGen2's model weights, training code, and training data will be fully open-sourced, giving community developers a foundation to optimize and extend.
Multiple capabilities unlocked with prompts alone
OmniGen2 is easy to use: just enter a prompt to unlock a wealth of image editing and generation capabilities (a minimal usage sketch follows the three examples below).
1. Image editing based on natural language instructions
OmniGen2 supports image editing driven by natural language instructions, enabling localized modifications such as adding or removing objects, adjusting colors, changing facial expressions, and replacing backgrounds.
2. Image generation with multi-modal context reference
OmniGen2 can extract specified elements from input images and generate new images based on them, for example placing an item or person into a new scene. Currently, OmniGen2 is better at preserving object similarity than facial similarity.
3. Text-to-image generation
OmniGen2 can generate images in any aspect ratio.
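To make the prompt-driven workflow concrete, here is a minimal usage sketch assuming a diffusers-style Python pipeline. The class name OmniGen2Pipeline, the input_images parameter, and the import path are illustrative assumptions rather than the project's confirmed API; the official GitHub README is the authoritative reference.

```python
# Minimal sketch of prompt-driven usage, assuming a diffusers-style pipeline.
# `OmniGen2Pipeline`, its import path, and `input_images` are assumptions for
# illustration only; consult the official GitHub README for the actual API.
import torch
from PIL import Image
from omnigen2 import OmniGen2Pipeline  # hypothetical import

pipe = OmniGen2Pipeline.from_pretrained("BAAI/OmniGen2", torch_dtype=torch.bfloat16)
pipe = pipe.to("cuda")

# 1. Text-to-image: arbitrary aspect ratios can be requested via height/width.
img = pipe(prompt="A red fox on a snowy hill at dawn", height=768, width=1344).images[0]

# 2. Instruction-based editing: pass the source image plus a natural-language edit.
source = Image.open("photo.jpg")
edited = pipe(prompt="Replace the background with a beach at sunset",
              input_images=[source]).images[0]  # `input_images` is an assumed parameter

# 3. In-context (subject-driven) generation: reference images supply the subject.
ref = Image.open("my_mug.jpg")
scene = pipe(prompt="Place this mug on a wooden desk next to a laptop",
             input_images=[ref]).images[0]
```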
From the innovative architecture to the image generation reflection mechanism
Let's take a look at the specific technical details.
Separate architecture + dual-encoder strategy
OmniGen2 adopts a decoupled architecture that separates the text and image pathways, paired with a dual-encoder strategy using a ViT and a VAE.
Unlike other works, the ViT and VAE serve the MLLM and the diffusion transformer independently, improving image consistency while preserving the original text generation ability.
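To illustrate the decoupling, here is a schematic PyTorch-style sketch of how the two encoders might be routed: ViT features feed the MLLM for understanding, while VAE latents are consumed only by the diffusion transformer. All module names and signatures are placeholders for illustration, not the official implementation.

```python
# Illustrative sketch (not the official implementation) of the decoupled
# dual-encoder routing described above: ViT features condition the MLLM,
# while VAE latents are handled only by the diffusion transformer.
import torch
import torch.nn as nn

class DualEncoderUnifiedModel(nn.Module):
    def __init__(self, vit, vae, mllm, diffusion_transformer):
        super().__init__()
        self.vit = vit                      # semantic image encoder for understanding
        self.vae = vae                      # pixel-level encoder/decoder for generation
        self.mllm = mllm                    # multi-modal LLM, retains text ability
        self.dit = diffusion_transformer    # separate image-generation branch

    def forward(self, text_tokens, images, noisy_latents, timestep):
        # Understanding path: ViT tokens are interleaved with text in the MLLM.
        vit_tokens = self.vit(images)
        hidden = self.mllm(text_tokens, image_tokens=vit_tokens)

        # Generation path: the diffusion transformer denoises VAE latents,
        # conditioned on the MLLM hidden states; VAE features never enter the MLLM.
        vae_latents = self.vae.encode(images)
        return self.dit(noisy_latents, timestep,
                        condition=hidden, reference_latents=vae_latents)
```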
Reconstruction of the data generation process
The OmniGen2 team is also exploring solutions to the foundational data and evaluation problems that hold the field back.
Most relevant open-source datasets have inherent quality defects; in image editing tasks especially, image quality and editing accuracy are low, and for in-context image generation the community lacks large-scale, diverse training data. These defects contribute greatly to the significant performance gap between open-source and commercial models.
To address this, the team developed a pipeline for constructing image editing and in-context reference data from video and image sources, as sketched below.
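As an illustration of how such data can be mined, below is a heavily simplified Python sketch in which two video frames containing the same subject in different poses or scenes form a (reference, target) pair. This is a plausible outline under stated assumptions, not the team's actual construction pipeline, which involves additional filtering and captioning.

```python
# Heavily simplified sketch of mining editing/in-context pairs from video:
# two frames spaced apart in time that contain the same subject can serve as
# a (reference, target) pair. Not the actual OmniGen2 pipeline.
import cv2  # OpenCV for frame extraction

def sample_frame_pairs(video_path: str, stride: int = 60):
    """Yield (earlier_frame, later_frame) pairs spaced `stride` frames apart."""
    cap = cv2.VideoCapture(video_path)
    frames = []
    ok, frame = cap.read()
    while ok:
        frames.append(frame)
        ok, frame = cap.read()
    cap.release()
    for i in range(0, len(frames) - stride, stride):
        yield frames[i], frames[i + stride]

# Downstream (not shown): detect/track the subject in both frames, crop it from
# the earlier frame as the reference, and caption the later frame as the target,
# producing "place <subject> in <scene>"-style training examples.
```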
Image generation reflection mechanism
Inspired by the self-reflection ability of large language models, OmniGen2 has also explored strategies for integrating reflection ability into multi-modal generation models.
Reflection data for image generation is constructed based on the base model of OmniGen2.
The reflection data consists of an interleaved sequence of text and images: first a user instruction, then an image generated by the multi-modal model, and then a step-by-step reflection on the previously generated output.
Each reflection involves two key aspects:
Analysis of defects or unmet requirements related to the original instruction;
Solutions proposed to address the limitations of the previous image.
The trained model has a preliminary reflection ability, and the future goal is to further train it with reinforcement learning.
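For concreteness, one way such a reflection sample could be laid out is sketched below as plain Python dataclasses mirroring the interleaved instruction → image → reflection structure. The field names are illustrative assumptions, not the team's actual data schema.

```python
# Minimal sketch of a possible reflection-sample layout; field names are
# assumptions for illustration, not the team's actual schema.
from dataclasses import dataclass, field
from typing import List

@dataclass
class ReflectionStep:
    analysis: str          # defects or unmet requirements w.r.t. the instruction
    solution: str          # proposed fix for the previous image
    revised_image: str     # path/ID of the image generated after applying the fix

@dataclass
class ReflectionSample:
    instruction: str                   # the user's original request
    first_image: str                   # path/ID of the initial generation
    steps: List[ReflectionStep] = field(default_factory=list)

sample = ReflectionSample(
    instruction="A cat wearing a tiny wizard hat, watercolor style",
    first_image="gen_000.png",
    steps=[ReflectionStep(
        analysis="The hat is missing; the style looks photographic, not watercolor.",
        solution="Add a small wizard hat and re-render with watercolor brush texture.",
        revised_image="gen_001.png",
    )],
)
```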
New benchmark
OmniGen2 has achieved competitive results on existing benchmarks, including text-to-image generation and image editing.
However, for in-context image generation (image generation with context references), there is currently no comprehensive public benchmark for systematically evaluating and comparing the key capabilities of different models.
Existing benchmarks for in-context image generation fall short of capturing real-world application scenarios: they do not cover multiple input images and are limited in context type and task type. Moreover, previous benchmarks rely on CLIP-I and DINO metrics to score context-generated images. These metrics measure image-level similarity between input and output, making them unsuitable for multi-subject scenarios and hard to interpret.
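For reference, CLIP-I is simply the cosine similarity between CLIP image embeddings of the reference and generated images (DINO similarity works the same way with a DINO backbone). The sketch below, using Hugging Face transformers, makes clear why a single whole-image similarity score cannot say which subject matched.

```python
# Sketch of the CLIP-I metric: cosine similarity between CLIP image embeddings
# of the reference and the generated image. Because the result is one
# whole-image score, it cannot attribute credit to individual subjects,
# which is why it breaks down for multi-subject in-context generation.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_i(reference: Image.Image, generated: Image.Image) -> float:
    inputs = processor(images=[reference, generated], return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(**inputs)
    feats = feats / feats.norm(dim=-1, keepdim=True)
    return float(feats[0] @ feats[1])  # cosine similarity in [-1, 1]

# DINO similarity is computed the same way with a DINO ViT backbone instead of CLIP.
```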
To address this limitation, the team has introduced the OmniContext benchmark, which includes 8 task categories specifically designed to evaluate the consistency of individuals, objects, and scenes.
The data is constructed using a hybrid method that combines initial screening by a multi-modal large language model and manual annotation by human experts.
As the first model evaluated on this benchmark, OmniGen2 achieved an overall score of 7.18, surpassing other leading open-source models such as BAGEL, showing that it balances prompt following and subject consistency well and performs stably across task scenarios.
In addition, OmniGen2 relies on FlagScale, the large-model training and inference parallel framework independently developed by BAAI, to optimize inference deployment. By deeply restructuring the model's inference pipeline and integrating the TeaCache caching acceleration strategy, the team achieved a 32% improvement in inference efficiency, significantly reducing response time and improving serving performance.
Meanwhile, the framework supports one-click elastic deployment of multiple instances across machines, improving overall cluster resource utilization. The team will continue to pursue hardware-software co-optimization and build out an efficient inference deployment stack.
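As a rough illustration of the caching principle behind TeaCache-style acceleration (reusing a transformer block's output when its input changes little between adjacent sampling steps), here is a simplified sketch. It is a conceptual illustration under assumptions, not FlagScale's actual integration, and the threshold value is arbitrary.

```python
# Simplified sketch of a TeaCache-style idea: during diffusion sampling, if the
# input to a transformer block changes little between adjacent timesteps, reuse
# the cached block output instead of recomputing it. Conceptual only; not
# FlagScale's actual integration.
import torch

class StepCache:
    def __init__(self, threshold: float = 0.05):
        self.threshold = threshold
        self.prev_input = None
        self.cached_output = None

    def maybe_reuse(self, block, x: torch.Tensor) -> torch.Tensor:
        if self.prev_input is not None and self.cached_output is not None:
            # Relative change of the block input since the last computed step.
            rel_change = (x - self.prev_input).abs().mean() / (self.prev_input.abs().mean() + 1e-8)
            if rel_change < self.threshold:
                return self.cached_output          # skip the expensive forward pass
        out = block(x)                              # recompute and refresh the cache
        self.prev_input, self.cached_output = x.detach(), out.detach()
        return out
```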
The model weights, training code, and training data of OmniGen2 will be fully open-sourced, providing a new foundation for developers to optimize and expand, and accelerating the transformation of the unified image generation model from concept to reality.
Related links to OmniGen2
GitHub: https://github.com/VectorSpaceLab/OmniGen2/
Paper: https://arxiv.org/abs/2506.18871
Model: https://huggingface.co/BAAI/OmniGen2
Research experience version: https://genai.baai.ac.cn
*This article is republished by 36Kr with permission from the WeChat official account "QbitAI" (author: Yun Zhong). The views are solely those of the original author.