Mei Tao, CEO of Zhixiang Future: Multimodal Model Tokens Earn a Higher Gross Profit Margin Than Language Model Tokens

Until computing power costs drop significantly, video generation startups should try not to compete with tech giants in the To C (consumer) market.

Text | Wang Xinyi, Li Jiaxing

Editor | Zhou Xinyu

A company that has been working on multimodal large models since Day 1 cannot resist participating in the upsurge of embodied intelligence and world models.

In 2026, as models such as Seedance 2.0 and GPT Image 2.0 gained popularity, multimodal capability became a keyword the industry could not avoid. On May 19, at its first open day, Zhixiang Future gave its own judgment: "Native multimodality is the inevitable path to achieving AGI."

The theme of this open day was "Imaging the World". For Zhixiang Future, however, "World" matters almost more than "Video" or "Image".

"Our ultimate goal is to build a world model," Mei Tao, the CEO of Zhixiang Future, said repeatedly during the event. By his definition, Zhixiang Future aims to be a native multimodal large-model company.

In Mei Tao's view, a real world model must meet several conditions at once: mastering physical laws, solving long-term causal reasoning, achieving full-modality interaction, and ensuring absolute safety.

Currently, the mainstream world-model training routes in the industry fall into two schools: Fei-Fei Li's "generating a 3D world" school and Yann LeCun's "self-supervised prediction of the world" school.

Zhixiang Future has made a different choice: innovating at the algorithm and architecture layers. Starting from multimodal data, the scarcest and most costly resource in world-model training, it first focuses on generating data such as videos, images, and 3D interactions. It uses low-cost synthetic data to overcome the industry's data-scarcity bottleneck and accumulates reusable visual-model capabilities for world models.

Specifically, the company wants its multimodal model to understand the rules of the real world from the very beginning. Its native full-modality Unified Transformer (UiT) architecture supports "Any to Any" (any form of input can produce any form of output), which is exactly the ability a world model requires: understanding, generating, and predicting different states of the real world within a unified architecture.

Recently, the company has shifted from "model as product" to "building an Agent platform".

As a company mainly targeting the B2B market, it summarizes its strategy as building a "1+1+3" MaaS (Model as a Service) platform: the underlying HiDream series of large models, the middle-layer HiHarness enterprise service platform, and applications in three upper-layer scenarios (commercial marketing, film and television production, and social media creation).

While the concept is hot, capital keeps placing bets. After receiving a 500-million-yuan Series B financing from institutions such as Anhui Industrial Investment and Orient Fortune Capital last month, Zhixiang Future quickly announced its next round and completed another financing in the hundreds of millions within two weeks.

Competition has followed as well. As model capabilities grow stronger, Zhixiang Future must compete with domestic and overseas base-model makers for the model market while holding tightly to its new card, the MaaS platform, and targeting vertical tracks in video generation to compete with large companies.

After the open day, media outlets including "Intelligent Emergence" spoke with Mei Tao, the CEO of Zhixiang Future, and Wang Bing, a partner at Orient Fortune Capital, one of the company's investors. The following is a transcript of the conversation (slightly edited).

Many embodied-intelligence companies underestimate the importance of video models

Question: In the public's perception, Zhixiang Future has mainly worked on images and videos. How did your strategic shift from two-dimensional images and videos to the three-dimensional physical world come about?

Mei Tao: It's still too early to call some models in the market world models. There are different paths for world models, and there may also be multiple possibilities in the future.

As of today, we do not claim externally to be a world-model company. We prefer to define Zhixiang Future as a native multimodal large-model company.

Zhixiang Future pays more attention to native full-modality large models and their applications. However, a native multimodal large-model company will surely move toward world models in the future.

Question: Many companies now claim to be working on "world models", but the definition of this concept in the outside world is very vague. How do you define the "world model" that Zhixiang Future is pursuing?

Mei Tao: To be rigorous, we consider Zhixiang Future to be working on a native multimodal model. On the way toward a world model, we will focus more on generating data such as videos, images, and 3D interactions.

Question: Zhixiang Future has upgraded from spliced multimodality to native multimodality. What technological inflection point has occurred? Is the current technology mature?

Mei Tao: Multimodal generation technology has not yet converged, which is an opportunity for startups. If the technology had fully converged and everyone had adopted the DiT framework, there would be no room for us.

Precisely because the technology has not converged quickly, we can match the results of large companies with far fewer resources through algorithm innovation, rather than simply competing on data and computing power.

Question: What technical conditions must mature before a full-modality large model can become a world model?

Mei Tao: First, master physical laws, including fluid mechanics, solid mechanics, molecular dynamics, and Newton's laws. Currently, the industry still has difficulty in comprehensively enumerating and controlling them;

Second, solve the problem of long-context causal relationships;

Third, achieve full-modality interaction with the physical world. For example, having a robot pick up a cup, unscrew the lid, pour water, and judge the user's needs is still a long way off;

Fourth, ensure safety. If a robot enters a family, it must guarantee 100% safety to avoid causing damage to people or valuable items.

Our more practical choice at this stage is to focus on the native multimodal problem, which can not only achieve commercialization but also lay a technical foundation for moving towards the world model in the future.

Question: Many video-generation model companies compete on long-video generation and realism. Will these metrics change on the way toward world models?

Mei Tao: The world model emphasizes the ability to generate the world, including logical relationships and visual effects.

We have requirements for our video model along three dimensions:

First is model ability, that is, the rationality and quality of visual content, and the degree of conformity with physical laws. We must aim for the ceiling in terms of model ability;

Second is video duration. Currently, we can achieve minute-level generation, and technically we can reach 3-minute, 5-minute, or even infinitely long generation;

Third is real-time and interactive ability. When the model can generate a 1-minute video in 1 minute, basic interaction becomes possible. I hope our products can move in this direction: for example, providing a low-quality preview through the algorithm and then outputting a high-precision 2K or 4K video after the user confirms.
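
The real-time criterion in the third point can be read as a ratio of generation time to video duration. The following is a minimal Python sketch of that reading, not Zhixiang Future's own metric; the function names are hypothetical, and treating a ratio of at most 1 as the threshold is an assumption based on the "1-minute video in 1 minute" example.

```python
def real_time_factor(generation_seconds: float, video_seconds: float) -> float:
    """Return generation time divided by the duration of the generated video."""
    if video_seconds <= 0:
        raise ValueError("video_seconds must be positive")
    return generation_seconds / video_seconds


def supports_basic_interaction(generation_seconds: float, video_seconds: float) -> bool:
    """True when the video is generated at least as fast as it plays back.

    The threshold of 1.0 is an assumption taken from the interview's example.
    """
    return real_time_factor(generation_seconds, video_seconds) <= 1.0


# A 1-minute video generated in 1 minute sits exactly at the threshold.
print(supports_basic_interaction(60, 60))   # True
# A 1-minute video that takes 5 minutes to generate does not qualify.
print(supports_basic_interaction(300, 60))  # False
```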

Question: In the process of training a world model, data is a relatively scarce resource. What differences do you think there are between the data acquisition, cleaning, and annotation strategies for world models and those for previous image and video models?

Mei Tao: Model training involves three elements: algorithm, data, and computing power.

If the algorithm framework is fixed, the competition comes down to data and computing power. For example, if everyone uses the DiT (Diffusion Transformer) architecture to build video models, then the quality, distribution, and annotation quality of the data matter a great deal for the model's ability.

However, once the algorithm and architecture change, or new ones emerge, data becomes somewhat less important. This is also an opportunity for startups: we do not compete purely on computing power and data but focus on innovating in the algorithm itself.

On the data side, to obtain high-quality real data and feedback, we have developed a toolchain for collecting, cleaning, and annotating this data.

We have 200,000 hours of video data with film and television copyrights, maintain cooperation relationships with many manufacturers, and are also seeking cooperation with leading film and television companies with copyrighted data.

The data situation for world models differs from that for video models. World-model data collection requires comprehensive multimodal data, which is costlier and scarcer. So Zhixiang Future takes the millimeter-level human operation data collected by other manufacturers, uses the video model to generate tens of thousands of variations of real-person data across different scenarios and skin tones, and then uses both the real data and the machine-synthesized data to train VLA (Vision-Language-Action) and WAM (World Action Model) models.

Question: Will results differ between a model trained on purely real data and one trained on machine-synthesized data?

Mei Tao: We run small-scale verification to form a closed loop from data to model training. Specifically, we check whether the machine-generated data benefits both ordinary and the best VLA and WAM models on the market, which in turn verifies the effectiveness of the data.

Question: You once said that many embodied-intelligence companies underestimate the importance of video models. Why do you think embodied intelligence will struggle to go far without a video model?

Mei Tao: Currently, the models at embodied-intelligence companies are generally small (less than 100B). If they really want to take on complex tasks like those of a world model, small models and limited data collection are unlikely to generalize widely.

Ours is one of three multimodal models in the world that reach the scale of hundreds of billions. Without a solid, substantial base model, embodied intelligence can hardly generalize well. Even if data collection lets it do something in a specific scenario, it is difficult to extend to other scenarios.

The gross profit margin of multimodal model tokens is much higher than that of large-language-model tokens

Question: At the beginning of this year, the shutdown and removal of Sora had some impact on domestic entrepreneurs in the video field. Will this affect investors' decisions?

Wang Bing: No, because this is OpenAI's own strategy. In coding, OpenAI is currently under heavy pressure from Anthropic. With limited resources and strong competition, OpenAI may naturally lower the priority of image and video models, which are hard to monetize in the short term.

However, from last year to this year, commercialization across the image and video model industry has progressed very quickly. For example, Kling and Seedance 2.0 have performed well in terms of revenue.

The real problem in the video-generation track, though, may be copyright. This is also an important reason why we chose to invest in Zhixiang Future: all of Zhixiang's data is legal and copyrighted.

Given how costly large models are, we particularly value how a startup can differentiate itself from large companies in the most effective way. On the one hand, can it build a base model with capabilities comparable to those of the leading large companies at lower cost, with better R&D efficiency and capital efficiency? On the other hand, can it use advanced technology to deploy quickly across business scenarios and provide differentiated services for B-end (enterprise) scenarios?

Question: Can video models really make money?

Wang Bing: They definitely will.

First, computing-power costs will definitely fall exponentially. Each year NVIDIA launches chips with 5 to 10 times the computing power of the previous generation, while the price barely rises. As a result, the average price of computing power falls every year.

Projects that you see not making money today may start making money in two years because computing-power costs keep falling.

Second, in the past few years, image and video generation could not reach commercial quality. But this year, the quality of AI short dramas, short videos, and e-commerce videos is almost all good enough for commercial use. In almost all video application scenarios, such as film, television, and advertising, AI will surely replace most manual work.
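
Wang Bing's first argument is a compounding one. The following is a rough Python sketch that takes his "5 to 10 times per generation" figure at face value (it is his estimate, not a verified benchmark). The function name and the `price_growth` parameter, which models how much the chip price rises per generation, are hypothetical.

```python
def relative_cost_per_compute(generations: int, speedup: float, price_growth: float = 1.0) -> float:
    """Cost of one unit of compute after `generations` chip generations, relative to today (1.0).

    speedup: compute multiple per generation (the interview cites 5 to 10).
    price_growth: price multiple per generation (1.0 assumes a flat price, an assumption).
    """
    if generations < 0 or speedup <= 0 or price_growth <= 0:
        raise ValueError("generations must be >= 0; speedup and price_growth must be > 0")
    return (price_growth / speedup) ** generations


for speedup in (5.0, 10.0):
    cost = relative_cost_per_compute(generations=2, speedup=speedup)
    print(f"{speedup:.0f}x per generation -> {cost:.0%} of today's cost after two generations")
```

Under these assumptions, a unit of compute costs roughly 1% to 4% of today's price after two generations, which is the sense in which a project that loses money now might not lose money later.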

Question: Can the gross profit margin of B2B services in the video-generation industry be positive?

Mei Tao: The gross profit margin of B2B services is quite high. The gross profit margin of multimodal model tokens is also much higher than that of large-language-model tokens.

Question: Is there a standard to measure whether a company in the video track has reached the commercialization stage?

Wang Bing: We observed this track for a long time, and the reason we had not invested earlier was that we were not sure when quality and cost would reach a commercially viable level.

Since last year, I have felt that this "point" is approaching, and we have been waiting for the right turning point for commercialization. It will definitely come, and it is currently coming faster than we expected.

As for specific criteria for evaluating a company: first, the team's technical background, meaning the team is a pioneer in the field with long-term accumulation; second, stability, meaning the team has high talent density and can stay stable over time; third, capital efficiency, R&D efficiency, and the team's long-term focus.

Question: What do you think of the commercialization paths chosen by startups in the video-generation track?

Wang Bing: Until computing-power costs drop significantly, try not to compete with the giants in the To C (consumer) market.

Startups like Zhixiang are definitely right to start with B2B services. By doing B2B, a company can strengthen its product logic and its ability to deploy in real scenarios, and can earn a certain amount of revenue without burning a lot of money.

Question: What are the cooperation model and revenue-sharing mechanism between your platform and the various model platforms? Can you disclose the revenue-sharing ratio? What cooperation models exist in e-commerce and short dramas?

Mei Tao: It is widely understood in the industry that no single vendor's model can meet all customer requirements. So we have built a MaaS platform that both consolidates our self-developed multimodal capabilities and integrates third-party large language models such as DeepSeek to meet customers' end-to-end needs. A large number of APIs and Skills have accumulated on the platform. Users contribute industry skills, and we share revenue with them.

In e-commerce short-video advertising, we have three charging models: first, selling tools priced by token; second, a RaaS material service; third, a commission based on GMV, at a rate ranging from 15% to 30%.

In the short-drama field, at this stage we mainly provide AI production tools to production contractors and do not take a revenue share for now. When we work with high-quality short-drama producers, as in our cooperation with Anhui TV Station and Huace, we jointly produce and distribute, and in that model there is revenue sharing.
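
The GMV-based charging model mentioned for e-commerce advertising comes down to simple arithmetic. The following is a minimal Python sketch using the 15% to 30% range quoted in the interview. The function name, the example GMV figure, and the choice of yuan as the currency are all illustrative assumptions, and the interview does not say how the rate is chosen within that range.

```python
from typing import Tuple


def gmv_commission_range(gmv: float, low_rate: float = 0.15, high_rate: float = 0.30) -> Tuple[float, float]:
    """Return the (lowest, highest) commission for a given GMV.

    The default rates are the 15% and 30% bounds quoted in the interview.
    """
    if gmv < 0:
        raise ValueError("gmv must be non-negative")
    if not 0 <= low_rate <= high_rate <= 1:
        raise ValueError("rates must satisfy 0 <= low_rate <= high_rate <= 1")
    return gmv * low_rate, gmv * high_rate


# Hypothetical campaign with 1,000,000 yuan of GMV.
low, high = gmv_commission_range(1_000_000)
print(f"Commission on 1,000,000 yuan of GMV: {low:,.0f} to {high:,.0f} yuan")
```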

Question: You previously said that competition is inevitable in the AI era. This year, you adjusted the company's strategy to build a "1+1+3" MaaS platform. What moment or market signal prompted that decision?

Mei Tao: We come from a scientist-entrepreneur background and are used to working in a down-to-earth way. But what really struck us was the excellent performance of companies such as MiniMax and Zhipu on the Hong Kong stock market. In the secondary market, people have strong confidence in Chinese AI companies and value them highly, which made us realize that we need to improve our brand storytelling.

In the primary market, investors in 2023 valued model performance more, but from the end of 2024 through 2025 they valued commercialization results more. This year, they have begun benchmarking against overseas model capabilities.

This year, both primary-market and secondary-market investors have started to focus on the model capabilities themselves and have realized that the model is

This article was originally produced by 王欣逸 (Wang Xinyi). Unauthorized reprinting is prohibited.
