
From the shock of Sora to a dreamlike counterattack: the path of AI video generation in China

脑极体 · 2026-03-09 20:34
After breaking through the technological blockade, the real problems of AI video generation in China have just begun to surface.

During the Spring Festival Gala in the Year of the Horse, the stunning visuals of "Hailing the Flower Goddess" went viral across the internet. Soon after, Seedance 2.0, the model behind that visual spectacle, opened its API to outside calls, charging one yuan per second of generated video.
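Per-second pricing makes cost estimation trivial. A minimal sketch: the 1-yuan-per-second rate is the figure from this article, while the clip lengths and volumes below are purely illustrative assumptions.

```python
# Back-of-the-envelope cost estimate for a per-second-billed
# video-generation API. The 1 yuan/second rate is from the article;
# the clip lengths and counts are illustrative assumptions.

PRICE_PER_SECOND_YUAN = 1.0

def generation_cost(clip_seconds: float, clips: int = 1) -> float:
    """Total API cost in yuan for `clips` clips of `clip_seconds` each."""
    return PRICE_PER_SECOND_YUAN * clip_seconds * clips

# A single 10-second short-video ad:
print(generation_cost(10))            # 10.0 yuan

# A hypothetical 60-episode AI short drama, 90 seconds per episode:
print(generation_cost(90, clips=60))  # 5400.0 yuan
```

Even at one yuan per second, long-form content adds up quickly, which is why the cost battles described later in this article matter so much.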

Looking back two years, AI video generation was a field considered "completely out of reach for China." In early 2024, OpenAI's Sora emerged suddenly: a 60-second video of "a girl walking on the streets of Tokyo" was realistic enough to pass for real footage. At the time, most domestic video models could only generate 5-to-12-second clips, in which stiff expressions, fingers clipping through objects, and physics glitches were the norm, making the results obviously fake at a glance.

Sora landed like a slap in the face, stunning the Chinese AI industry. But facing this technological deterrence, the industry could not simply sit and wait. Much like the Wallfacers in "The Three-Body Problem," who confronted the Sophons' technology lock each with their own unique strategy, Chinese companies and researchers began their own "Wallfacer Project." In the end, the industry staged a remarkable comeback from the shock of Sora, pushing video quality to new heights while driving generation costs to record lows. With AI video generation now booming, that history is worth revisiting.

The Darkest Hour of Chinese AI: The Arrival of Sora

In early 2024, the emergence of Sora plunged the Chinese AI industry into its darkest hour. During that period, my WeChat Moments were filled with laments about how far behind Chinese video-generation technology was.

People's disappointment stemmed not only from the obvious technological gap but also from the disasters they imagined would follow.

Video generation is far more complex than text generation. A model must solve a series of problems simultaneously: keeping object forms consistent across space, keeping motion continuous across time, accurately simulating physical laws, and synchronizing audio with video. Against Sora, domestic models stood no chance.

What was more frightening than being behind was that the technological barrier seemed insurmountable.

At the time, the globally dominant generative models, such as Runway for video and Midjourney and DALL·E for images, were all overseas products. China lacked both a core technology like Sora's DiT architecture and a sufficient supply of top-tier NVIDIA GPUs. The industry pessimistically predicted that the gap with overseas players was unbridgeable and that China could not develop its own large-scale video models.

In short, Sora's impact on Chinese AI was multi-faceted. The domestic AI industry had previously grown mainly through application-layer innovation, but video generation is hardcore technology with no application-layer shortcuts, and this suddenly laid the industry's weaknesses bare.

Moreover, structural disadvantages, including the computing-power bottleneck and the shortage of high-quality video training data, bred a sense of hopelessness about catching up. Domestic practitioners fell into a debate over whether to chase Sora at all. Most enterprises were reluctant to be the first to take the risk, making a comeback seem even more distant.

Fortunately, humans never sit idle in the face of an external threat. Chinese industry, academia, and other sectors quickly took action, each becoming a Wallfacer against the Sora crisis.

The Era of Deterrence: Three Forces Behind Sora-like Models

Under Sora's technological deterrence, academia, large enterprises, and vertical-field companies found three different ways to break through, gradually narrowing the gap with Sora.

The academic faction was the first to act.

An interesting contrast: the domestic breakthrough in ChatGPT-like large language models was led by enterprises such as Baidu and Alibaba, but in Sora-like video generation, it was academia that took the lead.

The day after Sora was released, Tsinghua University filed a patent related to text-to-video technology, staking out an early technological position. Tsinghua then partnered with Shengshu Technology to develop an original architecture combining Diffusion and Transformer, producing China's first large-scale video model with long duration, high consistency, and high dynamics, a pioneering work in domestic video generation.

The academic community's proactive benchmarking against Sora was no accident.

On the one hand, the core of Sora-like models lies in architectural innovation. Universities and research institutes, free of enterprises' commercial burdens, can focus on underlying technology and pursue original exploration. On the other hand, developing video-generation models consumes enormous computing power, and it is hard for any single enterprise to sustain long-term trial and error. Academia can lean on policy support, government computing-power subsidies, and research funding to carry out high-risk, high-investment hardcore research.

At the end of 2024, I visited the Changchun Artificial Intelligence Computing Center. Of the center's 300P of total intelligent computing power, more than 200P was occupied by a Sora-benchmarking project from a Beijing university. Full-stack domestic computing-power support, plus Changchun's compute-subsidy policy, gave the research team the confidence to attempt reproducing Sora.

Next came the data-driven large enterprises: Kuaishou's Keling and ByteDance's Jimeng were launched one after another.

In March 2024, Jimeng AI began internal testing, built on ByteDance's self-developed Seedream and Seedance models. In June 2024, Kuaishou launched its self-developed large-scale video-generation model, Keling, with a technical route designed to compete head-on with Sora, supporting 1080p videos up to 2 minutes long.

Many people may wonder: why Jimeng and Keling? The answer is that both are backed by leading video-content platforms holding billions of short-video clips spanning daily life, e-commerce, drama, and other scenarios, a high-quality data foundation for model development. Once launched, the models could also spin up a data flywheel through the platforms' content ecosystems and iterate quickly. Keling, for example, opened a test entrance inside the Kuaishou Video Editor app, drawing millions of the platform's creators; their output from real creative scenarios then fed back into the model's iteration.

The data-driven large enterprises found an efficient way to catch up technologically. What about everyone else?

Not all enterprises chose to fully compete with Sora. Companies like Kunlun Wanwei and Alibaba found a third way: focusing on vertical scenarios to build differentiated advantages.

Although Sora's general-purpose generation ability is powerful, what users actually need is often a precise solution to problems in a specific field. These enterprises therefore gave up chasing general-purpose models blindly and instead targeted concrete business scenarios and users' real pain points.

Kunlun Wanwei's Tianguang model, for example, focused on AI short-drama production, a scenario with extremely high demands on character expressions, prop fidelity, and plot continuity, where earlier general-purpose models routinely produced stiff expressions and distorted props. Tianguang attacked these pain points directly, optimizing expression generation, prop consistency, generation duration, and controllability to better fit creators in short drama and e-commerce advertising.

Alibaba bet on ecosystem building and open source. Backed by Alibaba Research Institute and Alibaba Cloud's computing power, it developed generation models such as Tongyi Wanxiang and Qwen-Image-2.0 and open-sourced its core technology. Open-sourcing attracted a large developer community to help optimize the models and allowed Alibaba's AI video capabilities to be quickly integrated into SaaS tools like DingTalk and e-commerce services like Taobao.

The parallel explorations of these three forces finally reversed the outside world's pessimism about Chinese AI video generation. But challenges larger than technology were just beginning.

The Tug-of-War Between Cost and Computing Power in the Commercial Fog

Solving the technology is not enough; the economics must work too. Unlike overseas video models such as Sora and Runway, the commercial exploration of Chinese AI video generation faced harsher conditions from the start.

On the one hand, Sora-like models had no mature business model to follow; overseas, the only proven monetization path was selling API access and charging by tokens. On the other hand, domestic users' paying habits were underdeveloped, with both enterprise and individual users less willing to pay than their overseas counterparts. In other words, every yuan poured into those early video models was upfront cash burn.

In this context, Chinese enterprises were forced to explore low-cost ways of deploying AI video-generation technology.

Computing power is the core cost of AI video generation and the biggest pain point for Chinese enterprises. Facing restrictions on GPU supply, domestic companies had to find alternatives, optimizing along both the model-architecture and hardware-adaptation dimensions.

Shengshu Technology's Vidu model uses an original U-ViT end-to-end architecture for efficient generation, tuned to the characteristics of domestic chips so that it matches overseas models' results with fewer accelerator cards.

After SenseTime's Seko 2.0 was adapted to multiple domestic chips, the computing cost of a single short-drama episode was cut in half. An AI advertisement that once required 500 yuan of computing power could be generated for just a few dozen yuan.

If optimizing computing power is about saving cost, innovating the business model is about generating revenue.

Facing domestic users for whom "once the free service stops, the relationship ends," Chinese enterprises went beyond the overseas subscription and token-package models to explore new monetization schemes: splitting advertising revenue between platform and merchants, sharing profits with creators based on content views, and offering enterprises customized video-generation services.

For example, when a creator uses Kuaishou's Keling to generate an e-commerce advertising video and attaches a platform merchant's product link, the platform splits the advertising revenue with the creator based on the video's views and product click-through rate. Hongguo Short Dramas partners with producers, using the Seedance model to cut production costs and then sharing profits with producers based on the views of the AI-made dramas.
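The ad-revenue split described above can be sketched numerically. A minimal sketch, assuming a base share plus an engagement bonus; the split ratio and bonus cap below are invented for illustration, since the article does not disclose the platforms' actual terms.

```python
# Illustrative sketch of a platform/creator ad-revenue split.
# The base share, bonus multiplier, and cap are invented assumptions,
# not the real terms of Kuaishou or any other platform.

def creator_payout(ad_revenue_yuan: float,
                   click_through_rate: float,
                   base_share: float = 0.5) -> float:
    """Creator's share of ad revenue, weighted by engagement.

    Assumption: the platform pays a base share of revenue, plus a
    bonus that scales with click-through rate, capped at +30%.
    """
    engagement_bonus = min(click_through_rate * 2, 0.3)  # cap bonus at 30%
    share = min(base_share + engagement_bonus, 1.0)
    return ad_revenue_yuan * share

# 1000 yuan of ad revenue at a 5% click-through rate:
print(creator_payout(1000, 0.05))  # 600.0 yuan to the creator
```

Tying payout to views and click-through rate, as the article describes, aligns the creator's incentive with the merchant's: better-converting AI videos earn a larger share.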

It can be said that China's rich Internet scenarios, including e-commerce, short dramas, and live-streaming, are the key to AI video generation breaking through the commercial fog. By tying technological value to commercial returns, domestic video models have escaped the trap of high cost and low profit, gradually feeling out a sustainable commercial path in the extreme tug-of-war between computing-power consumption and commercial return.

After Breaking into the Mainstream: The Joys and Pains of Mass Adoption

2025 was the year when Chinese AI video generation broke into the mainstream and widely entered people's daily lives.

Previously, using an AI video tool meant downloading a dedicated app or visiting a website and writing elaborate prompts. Now, inside popular apps like Douyin and Jianying, a single "create a similar video" tap is enough. During the Spring Festival, personalized AI greeting videos became a trendy new way to send New Year wishes. The Year of the Horse Spring Festival Gala was the climax of this breakthrough: ByteDance's Seedance 2.0 model powered the stage visuals of shows like "Hailing the Flower Goddess," letting hundreds of millions of viewers experience Chinese AI video generation firsthand.

Yet even as the whole nation piled in, a series of downsides of AI video generation also surfaced.

The biggest annoyance for ordinary users is queuing. During the Spring Festival peak, generating a 10-second AI video could mean a 12-hour wait; even in normal use today, the queue for a short video still runs over 4 hours. The poor experience has pushed many users to pay for premium tiers, yet even paying has not fully solved the queuing problem.

Behind the users' queuing problem lies an unsolved commercial dilemma.

As the technology has gone mainstream, new users have flooded in and platform resource consumption has grown exponentially. Because the computing cost of AI video generation is far higher than that of ordinary Internet products, platforms cannot subsidize free users indefinitely the way they once did with free social and video services. Whether these new free users will be one-time tourists or become long-term paying customers is still unknown. Without a reliable commercial return, platforms have no incentive to allocate more computing power, and the poor queuing experience in turn discourages users from paying.