He raised 2.5 billion yuan in six months and is the most aggressive AI entrepreneur in the "ByteDance ecosystem".

China Entrepreneur Magazine 2026-05-21 09:00
In the future, the barriers to video generation will be multifaceted, involving the synergy among data, products, and users.

The battle for AI video generation has heated up earlier than expected.

ByteDance's Seedance 2.0 became extremely popular in February, and its performance has directly changed the development direction of the AI comic drama industry. Soon after, Alibaba's HappyHorse began internal API testing. Kuaishou's Kling is reportedly seeking independent financing at a valuation of $20 billion, and its ARR (Annual Recurring Revenue) has reached $500 million.

While tech giants are gathering resources, some players are being eliminated. In March this year, OpenAI shut down Sora, citing the need to concentrate resources and attention. This has also cast doubt on whether the business model of text-to-video generation can succeed.

However, Wang Changhu, the founder and CEO of AI video generation company Aishi Technology, remains optimistic, saying that the opportunities in video generation currently outweigh the challenges. "If only one or two products (like Douyin and Kuaishou) with billions of users have a chance to survive in each era, it would be so boring."

Wang Changhu once served as the head of visual technology at ByteDance, where he built the company's visual algorithm platform and business middle platform and led the construction of its visual large model from scratch. After starting his own business in 2023, Wang Changhu has become one of the most promising and well-funded entrepreneurs from the "ByteDance ecosystem".

In the past six months, Wang Changhu and Aishi Technology have received a total of 2.5 billion yuan in financing. In March this year, Aishi Technology received $300 million in Series C financing, led by CDH Hong Kong Fund, CDH VGC, and CDH Baifu. Industrial investors such as China Ruyi and 37 Interactive Entertainment, as well as investment institutions like Yizhuang Guotou, Zhongwei Capital, and Guotai Junan Innovation Investment participated. The company's valuation reached $1 billion.

In terms of models and products, Wang Changhu and his team upgrade the model almost every three months. In October 2023, Aishi Technology launched PixVerse V1, which became the world's first large-scale video model capable of generating 4K videos. By PixVerse V4, Aishi Technology could generate a video within five seconds. The PixVerse series has now reached V6: its AI-generated videos not only achieve audio-video synchronization, but the texture of characters and scenes is also closer to the real world.

By the end of 2025, the number of users of Aishi Technology's PixVerse on the app and web versions exceeded 100 million, and its ARR exceeded $40 million.

Wang Changhu is an introverted person. Dachen Venture Capital led the Series A financing of Aishi Technology and participated in the Series B financing. Wu Xi, a partner, executive president, and chief investment officer of Dachen Venture Capital, told "China Entrepreneur" that Wang Changhu doesn't have an office of his own and works alongside more than 100 colleagues. Wang Changhu summarizes Aishi Technology's corporate culture as "Aishi Style": simple and direct. There are only two levels of reporting, which keeps the organization flat and its response fast.

In an exclusive interview with "China Entrepreneur", Wang Changhu mentioned "evolution" 10 times, "efficiency" 8 times, and "raise a question mark" 3 times. Asked about some investors comparing Aishi Technology to the "DeepSeek" of the video generation field, Wang Changhu said, "Since the start of our business, we have used only a fraction, or even 1%, of the cost and resources of our peers to develop better or comparable technology and products."

This pursuit of efficiency stems from Wang Changhu's technical accumulation at ByteDance. Wu Xi said that Wang Changhu and his team managed 20,000 V-series GPUs at ByteDance, and they know very well how to use limited resources efficiently to iterate products.

The three "question marks" mainly concern three issues: opportunities outside of Douyin and Kuaishou, how entrepreneurs who left big tech companies handle the competitive relationship with their former employers, and the divergence between to-C and to-B products in the AI era. At the same time, Wang Changhu also has confidence and courage. For example, he doesn't agree that entrepreneurs should "avoid" competition from big tech companies.

In addition to the V-series models, Aishi Technology is also deploying a series of industry-specific video generation models such as C (for the film and television industry) and E (for marketing). In January 2026, Aishi Technology launched the world's first general real-time world model, PixVerse R1. In April 2026, it launched the world's first large-scale model for the film and television industry, PixVerse C1.

A business leader of Aishi Technology told "China Entrepreneur": In 2026, the key topic of internal discussions among the company's senior management was that Aishi is not just a MaaS (Model as a Service) company and doesn't want to exist simply to provide tokens. The current trend of model development is to integrate more and more with industries.

This means that Aishi Technology is fighting on two fronts. On the one hand, it pursues a large-scale to-C strategy of "letting everyone become the director of their own lives", as Wang Changhu said: "Let billions of people around the world have the opportunity to change from spectators to participants, and from ordinary consumers to creators." On the other hand, it also needs to delve into the industrial end and compete directly with giants like ByteDance and Kuaishou.

Recently, Aishi Technology announced cooperation with leading film and television companies such as Mango TV and China Ruyi. China Ruyi is also an industrial and strategic investor in Aishi Technology. In January this year, Aishi Technology received a strategic investment of $14.2 million from China Ruyi.

The following is the exclusive dialogue between Wang Changhu and "China Entrepreneur" (with some deletions):

Achieve 100% results with 1% of the investment of peers

"China Entrepreneur": The video generation industry has been very active recently, with various companies iterating intensively. Do you think the industry has entered a stage of differentiation?

Wang Changhu: I think it has become more prosperous. When we started our business in 2023, large-scale models had just emerged, and we chose to fully invest in video generation. Why could we foresee the prosperity of the video large-model and application track earlier? Because video is the closest to us, and it should naturally be more prosperous.

In the past two years, the evolution of video generation has been very rapid. Just look at our company. In the past year and a half, we have released eight or nine major model updates, and a new large-scale model (version) is born every two or three months. We believe that video generation still has a long period of explosive growth and a lot of room for evolution.

"China Entrepreneur": The rapid evolution and upgrade of the model also mean that its capabilities are not fully stable yet, right?

Wang Changhu: If a thing stabilizes quickly, it will fall into path convergence, with stable results and competition based on resources, which is more suitable for big tech companies. However, the rapid development of video generation is constantly creating more possibilities, and startups still have many opportunities.

"China Entrepreneur": Sora has a good product experience but poor user retention. What do you think of this issue?

Wang Changhu: I highly appreciate pioneers like Sora who are brave in exploration. However, innovation is a high-risk endeavor. So what you see is that our "templates" have high usage volume, and other peers are also continuously innovating, though some may not keep up with the pace.

Sora 2 got several things right. First, it has done a great job in generating synchronized audio and video, and the model is no longer underperforming. Second, it has made a bold and even radical attempt on the consumer platform. Whether it is ultimately successful or not, it is still a brave attempt.

A failed attempt does not mean that the direction is wrong. Sora may have encountered many difficulties, but their efficiency is not as high as ours. Their cost per frame may be dozens of times ours, or more.

Third, it has explored the social aspect at the consumer end of human-content interaction, using AI video generation to attempt social networking, which is very valuable.

"China Entrepreneur": Is Sora a bit too far ahead of its time? Is the industry not ready for the AI video social or community-type products it is exploring?

Wang Changhu: We can't sum it up in a few words. We believe that in the new era, the boundaries between consumption and creation are becoming increasingly blurred. What will future scenarios look like? Everyone can consume, and everyone can create. Sora 2 has taken a step towards this goal, but what kind of product will ultimately win the hearts of users still needs continuous refinement.

"China Entrepreneur": An important contribution of Douyin and Kuaishou is that they have given ordinary people the greatest opportunity to express themselves. What do you think the wave of AI-generated videos can bring to them?

Wang Changhu: I also experienced the glorious era of Douyin. With the popularization of smartphones and 4G/5G and the falling cost of data, Douyin and Kuaishou created a phenomenon in which everyone can easily scroll through videos on short-video platforms.

But does this mean that everyone can become a creator? This is a question I raise. Billions of people around the world watch videos, but the proportion of those who actually shoot, upload, and share may be less than 10%, which is a very small percentage. So, we want to enable the more than 90% of those billions of users, who have never had such an experience, to turn their imagination into videos through our products, and to create, spread, share, communicate, and interact.

"China Entrepreneur": The popularity of PixVerse is inseparable from content templates. Why are templates so important?

Wang Changhu: We launched the templates around October 2024, which was a very special time. Before that, creators had clear intentions, such as creating an advertising video or a short trailer, and then generated segments by calling the model. What was the problem then? The success rate of generation was very low. Only one out of ten generated segments could be considered good. Once users found that the generated result was not good, they would not use it again.

So, we wanted to provide a more accessible creation tool. As a result, the success rate of generation increased from 10%-20% to nearly 100%.

Second, it lowers the threshold for users to generate content. Users don't even need to enter prompts. They can simply upload a photo of themselves and select a template, which allows billions of ordinary people to use it easily. So, we think it is the GPT moment for video generation.

This has also enabled us to develop the world's best video generation capabilities and launch the most user-friendly and low-threshold generation products, achieving a cross-boundary effect.
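To put the success-rate figures quoted in this answer in perspective, here is a rough back-of-the-envelope sketch in Python. It assumes, purely for illustration, that each generation succeeds independently with a fixed probability, so the expected number of attempts before the first usable clip is 1/p. The function name `expected_attempts` and the independence assumption are illustrative and are not stated in the interview; only the 10%, 20%, and "nearly 100%" figures come from the text.

```python
def expected_attempts(success_rate: float) -> float:
    """Mean number of generations until the first usable clip.

    Hypothetical model: each attempt succeeds independently with the
    same probability (a geometric distribution), so the mean is 1 / p.
    """
    if not 0.0 < success_rate <= 1.0:
        raise ValueError("success_rate must be in (0, 1]")
    return 1.0 / success_rate


# Success rates quoted in the interview: 10%-20% before templates,
# "nearly 100%" afterwards (approximated here as 1.0).
for rate in (0.10, 0.20, 1.0):
    attempts = expected_attempts(rate)
    print(f"success rate {rate:.0%}: about {attempts:.0f} attempt(s) per usable clip")
```

Under this simplified model, a user at a 10% success rate needs about ten tries per usable clip, against roughly one try once the success rate approaches 100%.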

"China Entrepreneur": Do you think templates are just a transition or the final product form?

Wang Changhu: Templates are just one feature of our product. In addition to templates, we also have the ability to generate videos based on start and end frames. You can upload two pictures, and we can generate a dynamic video that transitions from picture A to picture B. We also have an Agent function. Some users want to generate longer and more editable stories, so we developed an Agent that can call different template capabilities and basic video production capabilities to automatically generate longer and more impactful videos.

"China Entrepreneur": A significant technological breakthrough for you was achieving audio-video synchronization in the V5 version released in 2025. Will this increase the cost of a single video? How do you control the cost?

Wang Changhu: We are a startup, but in terms of model capabilities, we have always been in the world's first tier. Our product is among the "Top 25 Global AI Products" and was also the first in the video generation field to exceed 10 million users.

This means that our efficiency is extremely high, and this is not something that started with audio-video synchronization. From the very beginning, we aimed to use only 1/10 or even 1% of the cost and resources of our peers to develop better capabilities and products.

"China Entrepreneur": How did you specifically achieve this in terms of technology?

Wang Changhu: The biggest cost is the cost brought by cognition, that is, your judgment. For example, when dealing with a complex task like developing a large-scale model, you need to make judgments at many nodes, and each node is unknown.

Suppose you need to solve five difficult problems, and each problem is unknown. For each one you have five solutions, and you need to decide which one to choose and which ones to reject. This results in a huge cost difference. The best team can always choose the right path. By contrast, another team may choose the wrong path every time. You'll find that the efficiency difference between the best and the worst teams is 5 to the power of 5.
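The arithmetic behind that figure can be checked with a short Python sketch. It assumes one reading of the example: five independent decision points, each with five candidate solutions, so that a team with perfect judgment follows a single path while a team with none may, in the worst case, have to try every combination. The constant and variable names are illustrative.

```python
from itertools import product

NUM_PROBLEMS = 5          # decision points in the example
OPTIONS_PER_PROBLEM = 5   # candidate solutions at each point (assumed)

# Enumerate every way of picking one option per problem.
all_paths = list(product(range(OPTIONS_PER_PROBLEM), repeat=NUM_PROBLEMS))

best_case_paths = 1                 # always picks the right option first
worst_case_paths = len(all_paths)   # exhausts every combination

print(f"possible paths: {worst_case_paths}")
print(f"worst-to-best ratio: {worst_case_paths // best_case_paths}")
```

Five options at each of five steps gives 5^5 = 3,125 combinations, which is the gap the example points to.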

On the non-technical side, it means a flatter decision-making chain. The number of levels between those with judgment and those with resource decision-making power should be as small as possible, which can greatly improve the organizational efficiency of the team. Our company practices the culture of "simple and direct" and "Aishi Style", which helps us perform better and faster on the non-technical level.

On the technical side, after the emergence of DeepSeek, investors or companies familiar with us regard us as the "DeepSeek in the video generation field". The success of DeepSeek lies not only in its open-source nature but also in the fact that it achieved its results with 1/10 of the cost of others. In contrast, our cost pressure may be greater, and achieving this in the text-to-video field also depends on many factors.

We have natural advantages in terms of data, models, and the DiT (Diffusion Transformer) architecture.

First, in terms of data, the question is how to find the most valuable data that can help evolve and improve the model's performance. Whether this is done well or not will be reflected in cost, efficiency, and training time.

Second, the same goes for the model side: when building an AI model, which approach should be used, and how can video quality be improved while minimizing the cost of training and inference? During training, how to ensure that every run succeeds, rather than having to retrain after a poor result, also affects our overall investment cost.

Third, in terms of the model architecture, how to be both effective and fast? How to better mobilize resources during the inference process? Since we have users all over the world, how to "even out the peaks and troughs"? Use limited resources to ensure inference capabilities. This is a comprehensive task, which involves both non-technical and technical aspects. The technical aspect involves data, models, and engineering. We need to excel in every aspect to reach where we are today.

"China Entrepreneur": Currently, large-scale model companies are all improving their attention mechanisms. I noticed that in the V5 version, you mentioned the "adaptive Attention structure", Full Attention, and Sparse Attention. Why did you choose to combine them?

Wang Changhu: Choosing to combine the two, first, ensures that the effect is not affected. Second, we need to complete the modeling with extremely high efficiency, so we use different three-line structure combinations. The model not only needs to process visual information but also integrate the audio dimension outside of three-dimensional space, so a new structural evolution is required.

"China Entrepreneur": Does adding sound make the technical difficulty higher?

Wang Changhu: It must be more difficult because the model perceives one more dimension of the world. We hope to keep our overall data volume under control. Although the data volume will definitely increase, it must also be controllable. How to extract the essential laws as much as possible under the premise of limited samples and strengthen the understanding of the simultaneous synchronization of the world, audio, and video requires the model to play a more important role.

Don't avoid competition with big companies

"China Entrepreneur": After users generate videos on "Paiwo AI", they will definitely share or distribute them next. How do you plan to build your own ecosystem?

Wang Changhu: First, we encourage users to post videos created with our products on various platforms. Second, we also encourage users to post valuable videos on our platform to build their personal brands. Users can also refer to the content posted by others and create secondary content with one click, enhancing their sense of belonging.

"China Entrepreneur": Is user operation more difficult than model and technology development?

Wang Changhu: In our view, models and products are on one dimension. Users will tell us in many ways which direction the technology and products should develop. It is a collaborative process. We will evolve our products