After Google I/O: AI's frame of reference has changed

Where is AI's next alpha?

The industry consensus around the coding era is now firmly established.

"Although we have increased the token price, the acceptance among customers is still very high, the demand remains strong, and the current supply cannot fully meet the demand yet. There are still a large number of customers waiting for the service."

On the earnings call for the fourth quarter of fiscal year 2026, Alibaba CEO Wu Yongming pointed to the immense opportunity in the coding field.

AI has finally moved from slide decks into companies' production budgets. Alibaba has answered the first question: Is there real demand for AI?

The second question comes from Google: What will AI look like in the future?

At midnight on May 20th (Beijing time), Google I/O 2026 started as planned.

The highlight of this conference was undoubtedly the demonstration of agents and multimodal capabilities. When releasing Gemini Omni Flash, Google gave a precise definition: it accepts input in any modality and generates output in any modality.

The video output shown at the conference is just the beginning. According to Google's plans, Omni will deliver full multimodal output across text, image, sound, and video, and will generate more accurate physical effects such as gravity and dynamics by drawing on Gemini's world-model capabilities.

For Google, Omni is no longer just a video model but a real super-input for content creation. It can be integrated into the workflow of every content creator, opening a multimodal application market with even greater opportunities than the coding field.

Compared to programming, this is the real treasure of AI. Under industry-standard pricing, the price per million tokens for video models is much higher than for image and text models. This means that once token calls increase, video can generate far more API value than text.

Importantly, multimodality has reached a historic technological turning point.

Early multimodal systems simply combined separate text, image, and video models. The emergence of a unified base model for all modalities, such as Google Gemini Omni 2026, marks the beginning of a new era in the industry.

Multimodality, the next token turning point

OpenAI CEO Sam Altman probably did not expect it: ChatGPT originally took 5 days to reach 1 million users, while GPT-4o image generation needed only 1 hour.

Thanks to its high-quality imitation of the Studio Ghibli style, the image generation function of GPT-4o became a huge hit right after its release. OpenAI had to restrict free use and asked users not to generate so many images so that the team could get some sleep.

The image generation model Image 2 released that year reached over 1.8 million new users worldwide within an hour, breaking the record of GPT-4o. Within a week, it had over 120 million active users worldwide and drove up the number of ChatGPT Plus subscriptions by 23% compared to the previous quarter.

The release of Google Nano Banana 2 at the beginning of the year led to a breakthrough in global tests. The product reduced the generation time for a detailed 4K image from minutes to seconds.

To date, the Nano Banana series has generated more than 50 billion images. The media says that Google has ended the era of Photoshop.

There is no denying that revolutionary multimodal models have a decisive market influence.

At last year's Google I/O conference, VEO 3 was a sensation. The videos on the theme of "fruit cutting" spread like wildfire on TikTok. Within just six months, the total number of generated videos exceeded 230 million. Some media reported that VEO 3 had saved Google's balance sheet.

But there's more.

A few days ago, a Reddit user accidentally found a demo of Gemini Omni and shared it. Immediately, it caused a global stir in the AI community:

A teacher gave a lecture and wrote formulas on the blackboard at the same time. The voice, the image, and the handwriting were precise and fluent, incredibly smooth.

An X user commented that the "Nano Banana moment" for video models would soon arrive.

What's even more impressive about Gemini Omni is that the model supports one-click watermark removal, object replacement, and lighting adjustment. Its text consistency and character coherence surpass all previous video models.

Anyone who has received garbled images from an AI model knows how difficult it is to get clear, accurate text out of one, let alone mathematical formulas written on a blackboard during a lecture.

Compared to VEO, Google Omni is a real model for all modalities, on both the input and the output side. It supports mixed input of content in any modality, generation of high-quality videos, and dialogue processing.

This means that Google Omni is able to complete the analysis and generation of all modalities within a unified model, rather than integrating multiple systems later.

According to Google's definition, Omni is the further development of the main architecture of Gemini. It extends the native multimodal capabilities of Gemini, which have existed from the start, from the input side to the output side.

In comparison, VEO and Nano Banana are not independent products but capability components of Omni.

During the live demonstration, a Google manager showed specific editing scenarios. When the user inputs "Change the background to a snow landscape", the model changes the video environment; when the user inputs "Change the shooting direction to a side view", the camera perspective changes; when the user inputs "Add a comment", the model adds a voice description and background music to the video.

From start to finish, the user simply instructs the model through dialogue to edit the video and change every detail precisely, just like instructing an employee, without switching threads or re-uploading the video. This completely changes the approach of previous video models like VEO, where generation meant writing a prompt and picking from random results.

DeepMind CEO Demis Hassabis said that in the future, Omni will support all input and output functions in every modality. Access will be through the Gemini application, Google Flow, and YouTube Shorts. Stronger versions of Omni will be released later.

Google's ambition is obvious. It wants to create a real world model without media restrictions and modality barriers. AI should be able to interact with the world in all ways that humans can understand, and one model should define the future form of AI.

The basis for this ambition is full-modality capability.

Many people don't notice that a unified base model for all modalities also has advantages in R&D efficiency.

In cross-modal tasks, better text understanding improves the quality of images and videos and makes the generated content more logically coherent; the training data of images and videos can in turn help the model better understand the physical world and improve its text reasoning and common-sense capabilities.

This is a positive cycle where 1 + 1 > 2. It also explains why experts like Yann LeCun and Fei-Fei Li believe that the multimodal world model is the future path of AI.

In the past, the market has focused on coding and underestimated multimodality. This mindset is now being overturned.

Morgan Stanley wrote in a recently published report that the potential value of Minimax is ignored by the market and its ARR (Annual Recurring Revenue) will reach 1 billion US dollars by the end of 2026. An important reason is that the market underestimates the commercial value of multimodality technology, especially the mutual promotion between large language models and multimodal models.

This sentence reveals the biggest blind spot in the current AI market.

A jack-of-all-trades with five senses by nature?

In the Chinese market, technological innovation is laying the groundwork for growth.

Morgan Stanley wrote that the Chinese model market has reached a turning point and will follow the supernova-like development seen in the United States. There are two reasons: First, the capabilities of Chinese models are already close to, or even better than, those of top American products. Second, Chinese models are generally priced lower than American models.

In the Chinese market, the main players follow similar strategies: They strive to replace the Claude ecosystem, then look for their own strengths, such as specialization in long texts, agents, or inference, and finally try to break free from the competition through more favorable subscription prices.

But that's not the whole story.

There are still players whose technological direction is getting very close to Gemini Omni. Minimax has the chance to establish this ecosystem in China first.

Recently, Goldman Sachs wrote that ByteDance, Alibaba, and Minimax are on the same level. The reason is that Minimax, as an independent Chinese AI company, follows a unique and comprehensive multimodal strategy and has a cost-effective and flexible computing architecture.

Goldman Sachs: Chinese multimodal models enter the global market, pay attention to Hailuo 3

According to Goldman Sachs' forecast, the releases of the models M3 and Hailuo 3 will be an important milestone for Minimax. The gross margin of the text API business will reach 40%, and the gross margin of the multimodal API business will be 60-70%, which is higher than the industry average.

UBS has set the target price for Minimax at 1,000 Hong Kong dollars. The reason is that the development of multimodal capabilities and the coordinated R&D between different modalities will reduce training costs and quickly improve the capabilities of the models.

In other words, multimodal R&D gives Minimax not only a product portfolio but also more refined and efficient engineering. This will further lower the access threshold for enterprise models and expand the user base from developers to ordinary users.

JPMorgan Chase has rated Minimax as "Overweight" because it has a rare combination of technical strength, multimodal commercial potential, and global scalability.

Minimax is the only independent large-model company in China that has capabilities in text, image, video, audio, and music at the same time. Its capabilities in text, speech, and video generation are all world-class.

In the past, multimodality was often misunderstood as a "feature list". People thought that if they ticked off the five categories of text, image, video, speech, and music, it was multimodality.

In fact, the real value of multimodality lies not in "what one can do" but in "whether these capabilities can strengthen each other". This is the essential difference between an initial strategic choice and a later adjustment.

Video generation is a good example.

It's difficult to verify whether a text model really understands the physical world. If you ask it to write an article about an apple falling, it can do it well, but you never know if it really understands gravity.

But in video generation, everything becomes immediately visible. Is the position of the hand correct? Does the trajectory of the object conform to physical laws? Is the camera cut smooth? Is the text clear and accurate? Do the sound and the image match? If something is wrong, the user will notice it immediately.

This is the ultimate test for a large model's ability to understand the world. It requires not only stronger spatial understanding ability but also causal inference, long-term coherence, and the ability to model relationships between multiple objects. This in turn improves performance in text processing, agent control, and tool use.

In other words, a unified base model for all modalities is not simply the sum of five independent models but an organic whole.

This is Minimax's strategy. From the M-series large language models, through the Hailuo video model, to the Music audio model, the completeness of its in-house R&D and deployment across all modalities is unique among independent Chinese AI companies.

This in-depth, unified-from-the-start strategy enables Minimax to achieve a smoother holistic sensory intelligence at a lower cost.

Morgan Stanley has estimated that, through infrastructure optimization, Minimax can earn revenue of about 1 US dollar per minute on an 8-card H800 inference server, while the cost is below 0.3 US dollars. The industry average is only about 0.5 US dollars per minute.

According to the IPO documentation, Minimax has spent only 500 million US dollars since its establishment and has thus reached the world-class level in multimodal capabilities. This amount is only about...
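The arithmetic behind this estimate can be made explicit. The following is only a minimal sketch using the two per-minute figures quoted above; the function name and constants are illustrative assumptions, not anything published by Morgan Stanley or Minimax. Because the cost is given only as "below 0.3 US dollars", the result is a lower bound on the margin, and the 0.5 US dollar industry figure is left out because the text does not say exactly what it measures.

```python
def gross_margin(revenue: float, cost: float) -> float:
    """Return gross margin as a fraction of revenue."""
    if revenue <= 0:
        raise ValueError("revenue must be positive")
    return (revenue - cost) / revenue


# Figures quoted in the text, in USD per minute for one 8-card H800 server.
REVENUE_PER_MIN = 1.0
COST_PER_MIN_UPPER = 0.3  # "below 0.3", so this is an upper bound on cost

# An upper bound on cost gives a lower bound on margin and profit.
margin_floor = gross_margin(REVENUE_PER_MIN, COST_PER_MIN_UPPER)
profit_per_hour_floor = 60 * (REVENUE_PER_MIN - COST_PER_MIN_UPPER)

print(f"gross margin of at least {margin_floor:.0%}")
print(f"gross profit of at least ${profit_per_hour_floor:.0f} per server-hour")
```

On these numbers the implied gross margin is at least 70%, which is consistent with the top of the 60-70% range Goldman Sachs forecasts for the multimodal API business.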

The views in this article are solely those of the author; the 36Kr platform only provides information storage services.
