
In 2024, No Large Model Can Avoid Rongmomo and Ziwei | Focus Analysis

Yongyi, 2024-10-19 09:26
At the fork in the road, which path should be chosen?

Written by Yongyi Deng

Edited by Jianxun Su

Even Robin Li, who is the most active in promoting AI, has hesitated on this matter.

"Baidu will not touch the video generation direction like Sora," Robin Li said at the recent Q3 2024 director meeting. His reasoning was that the technology may be difficult to commercialize even in 10 or 20 years.

Since the emergence of OpenAI Sora and the full launch of Kuaishou Keling in June, video generation has become the hottest AI topic in 2024.

Manufacturers have begun to compete fiercely. Since April, video generation models have emerged in large numbers: in addition to major manufacturers such as Kuaishou, ByteDance, and Alibaba, leading large model companies such as Zhipu and MiniMax, as well as vertical manufacturers such as Shengshu Technology and Zhixiang Future, have all released video generation models.

The emergence of domestic video models has also made the comic meme of "Empress Dowager Rongmomo and Ziwei" popular again. The characters appear in the demos of different video models and in users' derivative creations, serving as a test of each model's output quality:

△Source: Keling, from the public internet

△Source: Jimeng AI, graphic by Intelligent Emergence

The industry's eagerness for "super applications" is obvious, and it is even shaping the current route of large models. Whether to build a large video model is a key decision affecting the direction of China's "Six Tigers of Large Models" (Zhipu, Dark Side of the Moon (Moonshot AI), MiniMax, Baichuan Intelligence, Stepwise Star (StepFun), and Zero-One Everything (01.AI)).

However, domestic large model manufacturers have not reached a consensus and are divided into several distinct groups:

Some manufacturers followed quickly. In July, Zhipu launched a Sora-like video generation model, "Zhipu Qingying"; in August, MiniMax released the video model Video-01.

Stepwise Star, for its part, released a new image model at the Shanghai World Artificial Intelligence Conference in July this year and also made a small number of video generation attempts.

There are also outspoken opponents. "Baichuan will not do Sora," Wang Xiaochuan, the CEO of Baichuan Intelligence, said in an exclusive interview with "Intelligent Emergence" in May this year. He believes that Sora is not on the main line of AGI (artificial general intelligence), which is to raise the intelligence level of the model.

There are also manufacturers who explored and then slowed down. The closely watched Dark Side of the Moon was reported by the media in June to be testing two new applications overseas: the role-playing application Ohai and the AI music/video generation application Noisse. According to "Intelligent Emergence", the results were unsatisfactory, so the two applications were not launched as standalone products and remained in the experimental stage.

Following the recent launch of the "Kimi Exploration Version", Dark Side of the Moon will also release multi-modal capabilities. However, it is not yet certain whether these will include video generation.

Then, around the National Day holiday, two heavyweight players entered the video generation field. On September 24, ByteDance quietly launched two products, Seaweed and Pixeldance.

Shortly afterwards, on October 5, Meta released the Movie Gen series of models, which once again caused a sensation.

△Illustration: In the first half of 2024, many video models and products have emerged globally, especially in China.

With the iteration of language models slowing down, video generation looks like a more promising new direction for AI applications, and large companies have not yet formed a monopoly. For start-ups, this makes the choice even more important: to do Sora or not?

Fork in the Road, Which One to Choose?

First, one concept needs to be clarified: the "multi-modal capabilities" (image, voice, and other modalities) that large manufacturers and start-ups now commonly offer are not the same thing as a Sora-like video generation model.

"Multi-modal capabilities amount to enabling the model to understand forms such as images, audio, and video, but they are still extensions of the large language model," an industry insider told "Intelligent Emergence". "Feeding videos, pictures, and voice into a large model is 'understanding' built on the large language model, whereas generating video relies on the capabilities of a video model."

The video generation models behind "Sora-like" products draw on technical ideas from large language models (LLMs), such as the Transformer architecture, but they are a different thing from an LLM.

This means that building a video generation model is equivalent to starting from scratch, building a model from 0 to 1.

What is certain is that, for now, building a video generation model is destined to be a game for a few players.

Replicating a "Sora" is costly.

According to Meta's data, Movie Gen was trained on 6,144 H100 GPUs, and the video model has 30B (30 billion) parameters. In China, not many manufacturers have training resources on that scale.

At present, domestic large model manufacturers have largely equipped themselves with multi-modal capabilities, but they are still wavering over whether to pursue video generation.

For large manufacturers with short-video businesses, such as Douyin and Kuaishou, video generation is a direction they cannot afford to lose. According to Silicon Stars, an important motivation for the heavy investment in Keling is to serve Kuaishou's content ecosystem: in 2023, 138 million creators published short videos on Kuaishou for the first time.

In addition, Kuaishou's development of Keling is also intended to serve Kuaishou's e-commerce ecosystem, such as providing AI content generation services related to products for MCNs and e-commerce merchants.

But for start-ups, with the direction of AI applications still unclear, everyone is cautiously feeling their way forward.

Some players settled on their path early. Baichuan, which decided from the beginning not to do Sora, went all in on medical scenarios in 2024 and launched its own medical AI assistant.

Vertical manufacturers specializing in video generation have also achieved initial results. For example, Vidu, a product of Shengshu Technology, reached 5.52 million monthly visits within two months of its launch in August.

But whether they can make the new story their own depends on each company's real capabilities. The technical route in video generation has not yet converged, and almost all the top video generation models on the market are closed-source.

This means that players have to spend real money on trial and error. The choice of technical route and application scenarios will determine who is still afloat after the AGI tide recedes.

Text Is Crowded, Agents Are Far Away: Is Video Generation Just Right?

OpenAI's Sora has not yet been made widely available, so why has video generation become a popular choice in China?

Taking GPT-4 as the benchmark, leading domestic large model manufacturers and large companies gradually approached GPT-4's level in the first half of this year. After OpenAI released GPT-4o, manufacturers also followed up with multi-modal capabilities.

But the delay of GPT-5 means that, in language models, domestic large model manufacturers find it hard to open up a generational gap.

On the other hand, the large model race has been running for more than a year, and its deployment and commercialization results have not convinced the market.

In China, most AI application directions have fallen into the dilemma of winning praise but not paying customers. The directions that have been popular over the past two years, including ChatGPT-like products such as chatbots and emotional companions, text-to-image, AI music, and AI search, have all quickly descended into homogeneous competition.

Take the leading AI applications in China as an example: Doubao, Kimi, and others went through fierce investment competition in the first half of this year, and their user numbers have reached the tens of millions at most, but commercialization remains far from ideal.

Many practitioners believe that the difficulty in commercializing applications is largely due to the slowing iteration of text models and the slow improvement in their capabilities. This also makes more distant directions that could handle more complex tasks, such as Agents (intelligent agents), look increasingly hazy.

One example that "Intelligent Emergence" has learned of is that the Agent business of Button (Coze), ByteDance's AI development platform, has gone through a round of cutbacks this year.

And the recent heated discussion in the industry about giving up the pre-training stage of large models means that many manufacturers have to step back from that pursuit and turn to deploying AI applications in order to survive.

Manufacturers need a new story, and video generation happens to sit in the middle: its technical and development barriers are high enough to matter but not so high that players cannot clear them, and its prospects are promising enough.

"Setting aside the commercialization of language models, start-ups at least need room for imagination. If leading start-ups do not switch to other application directions, they will have nothing. How can they support such high valuations?" an industry insider said frankly.

In 2023, many entrepreneurs in the video generation field told "Intelligent Emergence" that the field at that time was comparable to the GPT-2 to GPT-3 stage. In other words, its results still fell short of ChatGPT's, and it was at a much earlier stage of development than language models.

But after the release of Sora, the video generation field has seen the dawn of its GPT-3.5 stage. "This stage means that you can see the huge potential of this track, and the market is willing to invest," an industry insider told "Intelligent Emergence".

The consensus is wavering because the track is still at an early stage of development, with plenty of room left for exploration. Take the recently released Meta Movie Gen: built on the Transformer architecture, it uses Flow Matching, a route very different from Sora's, which also shows that the technical route of the whole track has not yet converged.

In China, this direction also has a unique short-video ecosystem, and the model exploration in the video generation direction is therefore at the global forefront.

Kuaishou's video model "Keling", which became popular in June, is a typical example. Among large manufacturers, Kuaishou does not have the strongest concentration of AI talent and resources, but after a few months of hard work, Kuaishou Keling, with a small team of only 20 people, managed to carve out a path among a crowd of large model manufacturers. With a series of campaigns such as nostalgic photos, Keling's popularity has even spread across the ocean to Silicon Valley.

△The founder of Stability.ai reposted the Keling product and commented that "China has a huge advantage in AI". Source: X

Moreover, video generation is still at an early stage and computing costs remain high, so charging users is a must from the moment commercialization begins.

Overseas, video generation has taken a different route. The leading video manufacturers Runway and Pika both focus on building productivity tools for business customers, and Runway has even entered Hollywood and struck many collaborations in the film and television industry. In China, manufacturers such as Keling and MiniMax also began experimenting with paid plans early.

In the final analysis, few people are willing to miss this direction. After all, video has replaced text as the type of content with the largest share of Internet traffic. According to Sandvine's "2023 Global Internet Phenomena Report", video services accounted for 65.93% of total global Internet traffic in 2022.

As video generation technology continues to mature, this may not be a game for large companies alone. Start-ups can combine technology with ingenious operations to quickly carve out their own path.

△Source: Pika

Pika, a rising video generation start-up in Silicon Valley, has discovered many of the secrets of driving traffic. When it debuted, it chose to operate on Discord, where developers gather, and quickly gained 500,000 users.

The new 1.5 model that Pika released in October this year also brings more social features: built-in templates such as inflation, melting, explosion, kneading, and flattening attract users around the world to "create content", and the servers even crashed under the influx of users. Some users could not help but recall the past: it resembles the cold-start period of early TikTok.

This article is from the WeChat public account "Intelligent Emergence", author: Yongyi Deng, and 36Kr is authorized to publish it.