Claiming Global Top Spot Twice in Half a Month: Chinese Startups Are Shaking Up the AI Image Generation Arena

机器之心2026-06-11 11:13
HiDream-O1-Image-1.5 from HiDream.ai (Zhixiang Future) once again achieves SOTA.

Every major shift in technological paradigm opens a window in which the old order loosens and new species emerge.

As competition in large models entered 2026, with the industry treating AI as an endless game of scaling parameters and stacking computing power, HiDream.ai, a Chinese startup founded just over three years ago, used innovation in the underlying architecture to open a crack in an image-model field surrounded by giants.

Number One in China, Number Two Globally

HiDream.ai Breaks the Record of Domestic Image Generation Models

Recently, HiDream-O1-Image-1.5, the newly launched commercial image generation model from HiDream.ai, achieved SOTA again. On the Text to Image Leaderboard of Artificial Analysis, a globally renowned independent AI model evaluation and analysis platform, it ranked first among Chinese image generation models, with a score second only to OpenAI's. It surpassed mainstream image generation models from large companies at home and abroad, including Google Nano Banana 2 (Gemini 3.1 Flash Image Preview), NVIDIA Cosmos3-Super-Text2Image, and ByteDance's Seedream 4.0.

This is not a one-off technical burst. Only a few weeks ago, HiDream-O1-Image-Dev-2604, the open-source model in HiDream.ai's native full-modal HiDream-O1 series, topped the global list of open-source models on the text-to-image leaderboard.

Taking a top spot on a global leaderboard twice within half a month naturally arouses curiosity: How can a startup founded just three years ago surpass Google and ByteDance on an authoritative image generation leaderboard? Is this a lucky optimization for the evaluation or a contest of real strength? What broader trends does this result reflect?

Behind the Leaderboard: The Victory of the UiT Architecture Route

The Text to Image Leaderboard of Artificial Analysis combines anonymous side-by-side comparison, user voting, and dynamic Elo ranking to minimize the influence of brand awareness on the results, which brings it closer to the preferences of real users in open-ended generation scenarios. Under this evaluation system, HiDream-O1-Image-1.5 reached an Elo score of 1265 across more than 4000 sample comparisons. This result reflects not only the model's competitiveness in image quality but also improvements in its overall capabilities in semantic compliance, complex picture generation, text rendering, and multi-subject control.
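To make the ranking mechanism concrete, here is a minimal sketch of how a textbook Elo system turns anonymous pairwise votes into ratings. The article does not describe the exact formula or settings Artificial Analysis uses, so the K-factor, the starting rating, and the model names below are illustrative assumptions only.

```python
from typing import Dict, Iterable, Tuple


def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A is preferred over B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(
    ratings: Dict[str, float],
    votes: Iterable[Tuple[str, str]],
    k: float = 32.0,          # assumed K-factor, not taken from the article
    initial: float = 1000.0,  # assumed starting rating for unseen models
) -> Dict[str, float]:
    """Apply (winner, loser) votes one at a time and return new ratings."""
    ratings = dict(ratings)  # work on a copy, leave the input untouched
    for winner, loser in votes:
        r_w = ratings.get(winner, initial)
        r_l = ratings.get(loser, initial)
        # An unexpected win moves ratings more than an expected one.
        delta = k * (1.0 - expected_score(r_w, r_l))
        ratings[winner] = r_w + delta
        ratings[loser] = r_l - delta
    return ratings


# Hypothetical votes between two placeholder models.
votes = [("model_a", "model_b"), ("model_a", "model_b"), ("model_b", "model_a")]
print(update_elo({}, votes))
```

Under this model, a rating gap translates directly into an expected preference rate: `expected_score(1265, 1200)` is above 0.5, meaning the higher-rated model is expected to win more than half of the blind comparisons against the lower-rated one.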

Looking across the whole field, HiDream.ai's competitors include giants with market values in the trillions: Google has TPU clusters and a deep pool of top global talent, and ByteDance has a large traffic entrance and application foundation. With no advantage in computing power, data, or ecosystem, this startup broke through mainly because it chose a completely different technological path.

Currently, mainstream text-to-image models worldwide generally follow the modular architecture of text encoder + VAE (Variational Autoencoder) + DiT (Diffusion Transformer), and the industry has long treated larger parameter counts and more computing power as the main directions of iteration. HiDream.ai gave up this mature route and chose a harder but more imaginative path: UiT, a pixel-level native full-modal architecture.

Models built on the modular path resemble a tree that keeps branching and growing: text has its own tokenizer, images and videos have their own encoders and decoders, and audio, actions, and spatial relationships are often processed along separate paths as well. Information has to be converted multiple times between modules. In complex tasks such as long text typesetting, UI design, multi-subject pictures, multi-reference image linkage, and continuous storyboards, these repeated conversions can easily lead to detail loss, semantic deviation, and unstable picture structure, which are common pain points of most current commercial image models.

The native full-modal architecture adopted by HiDream.ai's HiDream-O1 series rebuilds this information processing logic. It removes the independent VAE and the dedicated text encoder of the traditional scheme and maps raw signals such as image pixels, text tokens, video voxels, audio, actions, and spatial relationships into one shared representation space. A single UiT (Pixel-level Unified Transformer) then handles the understanding, computation, and generation of full-modal information. Unlike the common industry scheme of splicing modalities together after the fact, this architecture fuses the signals and lets them interact at the model's lowest level, fundamentally reducing the loss caused by modal conversion.
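The idea of a shared representation space can be illustrated with a toy sketch. This is not HiDream.ai's implementation, which the article does not disclose. It only shows the general pattern of projecting raw pixel patches and text tokens into vectors of the same width and concatenating them into one sequence that a single transformer could process. All function names, dimensions, and the random weights standing in for learned parameters are hypothetical.

```python
import random
from typing import List

Vector = List[float]


def patchify(image: List[List[float]], patch: int) -> List[Vector]:
    """Split an H x W grayscale image into flattened patch x patch pixel blocks."""
    h, w = len(image), len(image[0])
    assert h % patch == 0 and w % patch == 0, "image must tile evenly"
    return [
        [image[top + i][left + j] for i in range(patch) for j in range(patch)]
        for top in range(0, h, patch)
        for left in range(0, w, patch)
    ]


def random_matrix(rows: int, cols: int, rng: random.Random) -> List[Vector]:
    """Random weights standing in for parameters a real model would learn."""
    return [[rng.gauss(0.0, 0.02) for _ in range(cols)] for _ in range(rows)]


def linear(x: Vector, weight: List[Vector]) -> Vector:
    """Project a length-n vector with an n x d matrix into d dimensions."""
    return [sum(x[i] * weight[i][j] for i in range(len(x)))
            for j in range(len(weight[0]))]


def build_unified_sequence(token_ids: List[int], image: List[List[float]],
                           patch: int, dim: int, vocab: int,
                           seed: int = 0) -> List[Vector]:
    """Map text tokens and raw pixel patches into one shared-width sequence."""
    rng = random.Random(seed)
    text_table = random_matrix(vocab, dim, rng)          # token embedding lookup
    pixel_proj = random_matrix(patch * patch, dim, rng)  # direct pixel projection
    text_part = [text_table[t] for t in token_ids]
    pixel_part = [linear(p, pixel_proj) for p in patchify(image, patch)]
    # One sequence, one width: no separate VAE latent or text-encoder output.
    return text_part + pixel_part


image = [[float(4 * r + c) for c in range(4)] for r in range(4)]
sequence = build_unified_sequence([1, 5, 2], image, patch=2, dim=8, vocab=16)
print(len(sequence), len(sequence[0]))
```

In the modular design described earlier, the text part would come from a separate text encoder and the image part from a VAE latent. Here both are plain projections into the same space, which is the property the article attributes to the unified route.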

A company's choice of technological route is often closely tied to its team's way of thinking and practical experience. To understand HiDream.ai's route, we need to look back at the team's history.

The core technical team of HiDream.ai has focused on the AIGC field for more than 10 years and has taken part in the evolution of three generations of AI models. It is one of the few domestic multi-modal AI teams led by academicians that has both a complete technological path and industrial experience. As early as 2017, the team proposed TGANs-C, one of the world's earliest papers on video generation models. Team members have also helped build large-scale systems such as the world's second-largest video search engine and the image search engine of China's largest self-operated e-commerce platform, and have applied multi-modal technology in highly complex industrial scenarios such as embodied intelligence for logistics and near-real-time intelligent video inference at the thousand-card scale.

This means that HiDream.ai has not only model research and development experience but has also gone through the complete closed loop of cutting-edge algorithms, engineering systems, and real business scenarios. The ability to keep deepening underlying innovation determines how high a company can reach; experience in deploying to complex industrial scenarios determines how far it can go.

HiDream.ai has never lacked the courage to innovate.

In HiDream.ai's technological system, images are defined as the spatial basis for modeling the real world. A single image carries the complete scene, light and shadow, structure, and subject information of a given moment. Image generation is therefore not an isolated capability but a key entry point to video generation and, beyond that, to a native full-modal world model. Based on this forward-looking judgment, the company set its development route as "taking images as the foundation and extending to video and full-modal".

Looking at the industry landscape, leading large companies have long built their multi-modal systems around large language models. With text as the mainstream cognitive intermediary, the technical stacks, product ecosystems, and commercial barriers built around it have become deeply rooted, making it difficult for large companies to overturn their existing architecture and start again. HiDream.ai, founded more recently, carries no such technological legacy. The team put forward a new concept: in the new stage of multi-modal development, the signal itself can serve as the cognitive carrier, and text is no longer a necessary intermediate medium.

Currently, the global multi-modal technological route has not fully converged, and the industry is still in a window of competition between routes. While giants are constrained by mature technological systems and find it hard to innovate across the board, startups, with their lightweight organizations and room for flexible trial and error, may be able to achieve a generational leap through innovation in the underlying architecture.

The breakthrough of HiDream.ai can be broken down into three levels:

First, seek a generational advantage at the architecture level and concentrate limited resources on the core business.

HiDream.ai did not join the computing power and parameter race on the mainstream DiT track but focused on refining its self-developed UiT native full-modal architecture. This route requires large R&D investment and carries high trial-and-error costs in the early stage, but once it succeeds, it is expected to create a structural, generational advantage. According to the team's disclosure, with similar training data and computing resources, its 8B-parameter model has achieved overall performance comparable to or even surpassing that of traditional industry models with tens of billions of parameters, showing higher parameter efficiency.

This intense focus on the underlying architecture has not led HiDream.ai into "innovation for the sake of innovation". On the contrary, the company remains highly pragmatic in engineering implementation. Taking video generation as an example, the team adopted the idea of "first images, then videos": first use the image model to complete technical verification and rapid trial and error, then transfer the mature capabilities to the video field. According to the article, this strategy reduces the training cost to between one-fifth and one-tenth of the industry average. It is this survival wisdom of concentrating limited resources on the core business that allows a startup to find its own rhythm in an environment full of giants.

Second, deeply integrate the model with vertical scenarios to build a moat that is difficult for others to replicate.

HiDream.ai is not just a model company. Commercialization has been a major concern since the company's founding. After years of exploration, it has formed a "1+1+3" layout: one HiDream model base, one platform for providing capabilities externally, and three intelligent agent applications. These are "Frame Praise", a film and television creative collaboration agent for professional production teams; HiBurst, for batch production of marketing content for e-commerce, especially cross-border merchants; and vivago, for professional social media creators. Together they tightly couple the model with the products.

The commercial marketing agent HiBurst has entered the top 5 of TikTok's official service providers, producing over one million e-commerce marketing videos annually and covering a GMV of over 100 million yuan. The AI film and television creative collaboration agent "Frame Praise" connects the entire process of "creativity - storyboard - finished film", has produced over 5000 minutes of short comic dramas in total, and works with film and television institutions such as Yangtze River Film Group and Ciwen Media. The social media creative agent vivago recently topped the daily list of Product Hunt, covers more than 100 countries and regions, and serves over 40 million users.

HiDream.ai's professional film and television video generation business can currently produce 1- to 3-minute videos stably in a single pass, with a success rate of over 70%. At a time when generation still largely amounts to repeated random draws, regenerating until a usable result appears, this figure is impressive.

Third, maintain strong strategic resolve together with continual cognitive upgrading.

While most players in the industry were still focused on the traditional architecture, HiDream.ai dared to "start over" and bet on the native full-modal route. This courage to "reset one's identity" rests on two traits of the founding team: strategic resolve on the one hand and cognitive upgrading on the other. They were not distracted by the race for computing power and parameters and have always believed that "full-modal fusion is the only way to the world model"; at the same time, they re-examined the path and refreshed their understanding with each technological iteration. This ability to stay steady while keeping pace with change gives the company a lasting drive for continuous innovation.

Can Write, Understand Typesetting, and Create Storyboards

The Native Full-Modal Enters the Production Verification Stage

This capacity for continuous innovation is gradually turning into visible strategic results. The strong showing of HiDream-O1-Image-1.5 on the authoritative global leaderboard is a vivid example.

HiDream-O1-Image-1.5 shows all-round image generation capabilities far beyond the scope of "good-looking pictures". It is no longer satisfied with outputting a beautiful static picture but can understand complex typesetting, render multi-language text, and control the logic of continuous storyboards.

At the same time, positioning HiDream-O1-Image-1.5 as a commercial model signals that the native full-modal approach has entered the production verification stage and can address real difficulties in production. Many earlier AI image models fell short in commercial use, especially in complex typesetting, multi-subject control, and long text rendering. HiDream-O1-Image-1.5 has made a major breakthrough in this regard.

HiDream-O1-Image-1.5 targets demanding commercial scenarios such as advertising and marketing, brand design, e-commerce visuals, game content, film and television storyboards, and IP creation, with enhanced image quality, text rendering, complex typesetting, multi-subject consistency, and visual storytelling.

Portrait Photography Scenario

The model can output photography-level image quality, suitable for various styles such as magical light and shadow, close-ups of people, and two-person interactions. It performs naturally in details such as skin texture, clothing texture, limb interaction, and environmental blurring. Facing complex compositions such as wide-angle, low-angle, and indoor warm light, it can also ensure the coordination of the character proportion, spatial perspective, and picture narrative, meeting the professional needs of commercial portraits, brand vision, and film and television storyboards.

Natural Scenery Scenario

For large scenes and complex landforms such as snow-capped mountains, lakes, deserts, and caves, the model can accurately control the spatial hierarchy, light and shadow changes, and environmental atmosphere. The picture has a movie-like texture and rich details, suitable for scenarios such as tourism promotion, film and television concept maps, game scene design, and brand visual communication.

E-commerce Poster Scenario

It can quickly match the visual styles of different product categories and naturally integrate products, scenes, decorative elements, and marketing copy. Faced with mixed Chinese and English typesetting, multi-level selling points, and complex layouts, it can still ensure text readability and picture integrity, effectively improving production efficiency for e-commerce product launches, advertising materials, and product-recommendation ("grass-planting") content on social media.

Multi-Grid and Storyboard Design

The model has the ability to understand continuous narratives. In the creation of multi-picture content such as picture books, story scripts, advertising storyboards, and short video scripts, it can generate logically coherent content while maintaining the unity of characters, scenes, and visual styles. It can also reasonably arrange elements such as grid layouts, titles, and numbers, supporting the visual creation of comic, film and television, and educational content.

The excellent performance of HiDream-O1-Image-1.5 shows that