AI video startup Aishi Technology raises nearly 300 million yuan in Series A+ financing, with over 12 million users worldwide | 36Kr Exclusive
Written by Yongyi Deng
Edited by Jianxun Su
"Intelligent Emergence" has exclusively learned that Aishi Technology recently completed its A2 through A4 financing rounds, raising a total of nearly 300 million yuan. Investors include Ant Group, the Beijing Artificial Intelligence Industry Investment Fund, Guoke Investment, and Guangyuan Capital.
As 2024 draws to a close, it is also nearly a year since OpenAI first unveiled Sora. How are China's video generation startups faring?
Asked by "Intelligent Emergence", Wang Changhu, founder of Aishi Technology, replied calmly: "At the very least, we have achieved the goals we set out earlier."
After the Sora preview was released in early 2024, Wang Changhu predicted that his team could catch up with Sora's then-current level within three to six months.
In December, Sora finally made its belated official public debut. Although it brought many innovations at the product and interaction levels, the general market verdict was that its actual output fell short of expectations.
For instance, in Chinese benchmark rankings such as SuperCLUE released in November and December, Aishi Technology's core product, PixVerse, ranked first in text-to-video. In the global market for AI video generation applications, PixVerse is also frequently listed as a first-tier product.
Nearly two years after its founding, Aishi Technology has turned in a substantial report card: its core AI video generation product, PixVerse, drew more than 1.2 million visits in its first month after launching in January 2024. By comparison, Silicon Valley's star AI video startup Pika was seeing around 2 million monthly visits three months after its own launch.
A year on, those numbers have been rewritten: PixVerse now has more than 12 million users worldwide and nearly 6 million monthly active users, and the team has achieved revenue at scale.
The product's rapid growth stems from repeated updates to the underlying video model. Aishi Technology shipped three major iterations in 2024: the V1 video model in January, then the V2 model in July, one of the first batch of large video models released in China, benchmarked against the Sora-style DiT architecture route. Along the way, PixVerse improved markedly in clarity, consistency, physical plausibility, and instruction following.
With the latest V3 model, launched at the end of October, PixVerse even sparked a social media craze: the "Venom" special effect went viral on platforms such as TikTok, Douyin, and Xiaohongshu, with total exposure topping 100 million. Many amateur creators shot videos with the "Venom" effect and racked up over a million plays.
△Source: PixVerse
Wang Changhu said the "Venom" effect's virality is closely tied to PixVerse's base model capability. In March 2024, Aishi Technology launched the world's first Character2Video (character consistency) model and has kept iterating on the approach: by precisely constraining identity (ID) during the diffusion model's (DiT) generation process, the character's appearance stays highly consistent throughout the video, which also improves the user experience.
Over the past year, the hard problems in generative video have remained concentrated in consistency, physical plausibility, and the like, with many technical hurdles still to clear. Wang Changhu admitted frankly that the industry's technical routes have not yet converged.
In fact, the industry's understanding of, and expectations for, AI video have grown more rational.
For example, when Sora was unveiled in early 2024, it could generate videos up to one minute long, fueling public expectations for the video generation market. Worth noting, though, is that what Sora showed at the start of the year was a demo selected from multiple generation attempts. When video length genuinely stretches out, the consistency and clarity of the output can become unsatisfying; faced with poor results, users hit "Regenerate" far too often, which in turn badly damages the experience.
As a result, effort in the AI video field has largely shifted from competing on video length to dimensions such as content consistency, clarity, and range of motion.
"To build a product, you have to see where users' real needs lie. We randomly sampled movies from film sites and measured the length of every shot. It turned out that shots in a real movie are basically around ten seconds." Wang Changhu said that to ensure user experience and usability, competing purely on video length is meaningless.
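The shot-length survey Wang Changhu describes can be illustrated with a minimal sketch. The article does not describe Aishi's actual tooling, so the function, the cut timestamps, and the workflow below are all hypothetical; the idea is simply that, given the timestamps where a film cuts between shots, each shot's duration is the gap between consecutive cuts.

```python
from statistics import median

def shot_durations(cut_times):
    """Given ascending shot-boundary timestamps (in seconds),
    return the duration of each shot between consecutive cuts."""
    return [b - a for a, b in zip(cut_times, cut_times[1:])]

# Hypothetical cut timestamps for one sampled film (seconds).
cuts = [0.0, 8.5, 19.0, 27.5, 40.0, 51.5]

durations = shot_durations(cuts)
print(durations)          # [8.5, 10.5, 8.5, 12.5, 11.5]
print(median(durations))  # 10.5 -- shots cluster around ten seconds
```

In practice the cut timestamps would come from a shot-boundary detector run over many films, but the aggregation step is just this arithmetic: with typical shots clustering near ten seconds, generating much longer clips adds little for most users.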
△Source: PixVerse
In generation length and clarity, PixVerse currently supports high-quality videos of up to 10 seconds at resolutions as high as 4K, already at a commercially usable level. By contrast, the high-quality AI video the industry could generally achieve last year ran under 5 seconds, usually below 1080p.
PixVerse is also iterating rapidly on products and models. In November, it released a new feature that lets users upload a video and extend it by entering a prompt or choosing a special effect. In December, the next model, V3.5, entered internal testing: generation time can be cut to under 30 seconds, prompt response and motion control have improved significantly, and an official launch is imminent.
The AI video generation field now shows a clearer division of labor. Whereas startups such as Pika and Runway mainly target the B-end, Aishi Technology has focused on the broader consumer (C-end) market since its founding. In December, PixVerse also launched its overseas app.
Wang Changhu's confidence in the consumer market comes from his early years at ByteDance, where he built a visual technology team, a visual algorithm platform, and a business middle platform from scratch, supporting the rapid growth of products such as Douyin and TikTok. He said Aishi Technology's goal has always been to let the billions of ordinary consumers active daily on short-video platforms create the videos they want with zero barrier to entry.
Signs of this trend are already visible. "The most important change we experienced over the past year is that our users spread from professional creators to mainstream consumers," Wang Changhu said. This has pushed Aishi Technology to quickly lower the barrier in its product features: PixVerse now has dozens of built-in special-effect templates, so users can generate a video from a single image, without writing or even thinking about a prompt themselves.
△Source: PixVerse
Entering 2024, another major question facing startups is how to withstand the giants' encirclement: players have poured into AI video generation, and giants including Kuaishou, ByteDance, Alibaba, and Tencent all launched their own AI video models during the year.
Wang Changhu is optimistic on this front. He believes that, despite the rapid progress, large-model video generation is still at the stage between GPT-2 and GPT-3, with many technical hurdles left to clear, and that this is precisely the opportunity for startups. Aishi Technology's core team has already cracked a number of the industry's hard problems with less than a tenth of its competitors' headcount.
On the product side, video generation will also remain a field "closer" to users. Unlike LLMs (large language models), whose development is leap-like, with the model suddenly gaining a step-change in performance at some stage and swallowing up many applications, the video model's technical evolution will be more gradual: each iteration brings an intuitively visible improvement in the product experience, which helps startups gather market feedback earlier and quickly close the business loop.
The once hotly debated training and inference costs are also falling fast. Wang Changhu revealed that Aishi's current training cost is one-third, or even one-tenth, of many peers'. He predicts costs will fall even faster over the next year, during which Aishi Technology will also accelerate commercialization, aiming for scalable growth.
Cover image source | Official company materials
This article is from the WeChat public account "Intelligent Emergence" (author: Yongyi Deng), published by 36Kr with authorization.