
Before the model factories crush the market, can AI video agent products only make quick money?

Wang Yuchan, 2026-05-07 19:31
Is this an industry "waiting to be swallowed by big tech companies," or could a tool-based company like Adobe emerge from it?

Text | Wang Yuchan, Zhou Xinyu

Editor | Yang Xuan

“Looking at the revenue, the performance of AI video projects is indeed quite good. It can be said to be one of the most profitable niches in the AI field,” an investment industry insider told 36Kr.

The Chinese AI video generation track is reaping a huge dividend from the rapid growth in large tech companies' model capabilities. Two “super bases,” Seedance from ByteDance and Keling from Kuaishou, are iterating at high frequency, with minor updates every week and major updates every two months. Alibaba also began a grayscale test of its video generation model HappyHorse 1.0 at the end of April. The list price for 720P video generation is $0.9 per second.

Plenty of content creators are willing, even eager, to spend money on this. Short-drama and content companies queuing up to use Seedance 2.0 has become one of the spectacles of the AI world in 2026. As a result, AI video Agent products, which wrap a “shell” around an AI video model to make it easier to use, have also seen miraculous growth.

An industry insider told Zhineng Yongxian that the monthly computing power consumption cost of a leading company should be over one million yuan. “The computing power consumption cost of a short drama is about 30,000 yuan. If a tool platform can handle 100 such projects in a month, the consumption can reach 3 million yuan. It's not difficult; it's just a matter of time,” the insider said.

Search Bing for keywords like “AI video generation tools” and you will see many advertisements for such products. “As far as I know, a leading tool platform spends 20,000 to 30,000 yuan a day on this kind of advertising. So the annual advertising investment on this single channel alone is at least 7 to 8 million yuan. From this, we can infer how high its revenue level is,” an industry insider said.
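The two back-of-envelope estimates quoted above can be reproduced with a few lines of arithmetic. This is only a sketch: the per-project cost, project count, and daily ad spend are the figures the insiders gave, while the assumption that ads run all 365 days of the year is ours.

```python
# Figures quoted by the industry insiders (all in yuan).
COST_PER_SHORT_DRAMA = 30_000
PROJECTS_PER_MONTH = 100
DAILY_AD_SPEND_RANGE = (20_000, 30_000)

# Assumption (not stated in the article): ads run every day of the year.
DAYS_PER_YEAR = 365

monthly_compute = COST_PER_SHORT_DRAMA * PROJECTS_PER_MONTH
annual_ad_low = DAILY_AD_SPEND_RANGE[0] * DAYS_PER_YEAR
annual_ad_high = DAILY_AD_SPEND_RANGE[1] * DAYS_PER_YEAR

print(f"Monthly compute consumption: {monthly_compute:,} yuan")
print(f"Annual ad spend on one channel: {annual_ad_low:,} to {annual_ad_high:,} yuan")
```

Under that assumption, the low end of the daily range annualizes to roughly 7.3 million yuan, which matches the insider's “at least 7 to 8 million” floor.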

The AI video creation platform Creati told Zhineng Yongxian that within one year of its launch, the global user base of the platform exceeded 10 million. The product's ARR (Annual Recurring Revenue) once reached 20 million US dollars.

However, what worries the makers of these AI video Agent products is that large tech companies may also move from the model layer to the product layer and compete with them for business. In January this year, Douyin launched an AI video application called “Suibian,” which integrates tool and community services. Moreover, the application-layer functions designed by product companies may be made redundant by a single upgrade of the large model.

“In the short term, there is still a cooperative relationship between these tool-type companies and large model manufacturers. The profits of startups are largely determined by which models they can access and the API price discounts they can get,” said Feng Huini, an investment manager at NIO Capital. “But at the same time, as far as I know, large tech companies keep a close eye on these ‘partners’. In these important tracks, there may be more than one team working on the same direction within a large company.”

Is this an industry “waiting to be swallowed by large tech companies,” or is it really possible to grow a tool-type company like Adobe?

The weak position of product companies in this ecological niche shows up in their profits. “If we look at the profits, actually, everyone's gross profit margins are quite low,” an investor said. Many projects are sacrificing UE (Unit Economics) to gain scale because “there are currently no barriers in this industry, so they are all burning money to subsidize customer acquisition and have not yet achieved break-even.”

However, many investors are still willing to bet on it. LiblibAI, the parent company of LibTV, the most prominent Chinese player in this track, completed a Series B financing of 130 million US dollars in October last year, with investors including Sequoia China and CMC Capital. Earlier, it also set an industry record of “four consecutive rounds of financing within one year.”

Ranking of financing scale of tool-type companies

“This year, AI video tools are one of the few investable tracks. Because video iterates much more slowly than language and coding, at a time when a large number of language tools and coding applications are being disrupted by base models, video generation projects are relatively more ‘worth watching’,” an investor told Zhineng Yongxian.

These AI video Agent products still have time to build their own moats. In this “unequal competition,” who can survive?

The Sword of Large Tech Companies and the Commercial Moat

Currently, mainstream tool-type products take three main forms:

Some focus on the “idea,” using an AI Agent to reduce the creative process to “natural language instructions,” such as ZeroCut and Ribbi. Others focus on “editing,” making the infinite canvas and fine-grained adjustments very precise, such as LibTV and Buzzy. Still others “get closer to money,” linking video generation directly with e-commerce transactions and social media operations, such as TapNow.

All interviewees in this article, including entrepreneurs and investors, agree that once large model manufacturers have completed the infrastructure-level work, they will surely move on to the application level. It is just a matter of time. The key questions are how long this time window is and whether startups can still survive after it closes.

Zhang Yunjian, who once worked at a large tech company and lived through the competition of the classical Internet era, founded the AI video creation platform ZeroCut. He believes that “for at least five years, it will be difficult for large tech companies to perfectly cover the entire AI video production process in one stroke.”

His judgment is mainly based on the following two understandings:

First, video production is an extremely long creative service chain. Outsiders and investors often focus only on the “engineering tools” and “generation” aspects, but video generation actually accounts for only a small part of the entire production process. Before and after the actual generation step, there are very complex creative and supply-chain processes. Therefore, AI will replace this workflow only gradually, and it will be difficult within five years to reach the ultimate form of serving consumers directly without any manual intervention.

Second, based on market competition and segmentation logic, it is difficult for a single manufacturer to excel in all aspects. A complete AI video workflow requires the use of language models, image models, and video models. Although large tech companies have the ability to cover the entire process, it doesn't mean they can be the strongest in every niche. For example, some models are excellent in image generation but may not have the strongest video capabilities. This difference in capabilities will ultimately lead to market segmentation rather than a monopoly.

Robin, the founder and CEO of Ribbi, who also left a large tech company to start a business in AI creation tools, shares Zhang Yunjian's view on this point. “Among large tech companies, aligning business, models, and top-level strategies is the most difficult thing, unless there is an industry consensus,” Robin said. “Before seeing the definite value of Taste, large tech companies are not willing to build models for aesthetics and taste. Only when visual creative generation changes from non-consensus to consensus can it inspire more large tech companies and top-notch talents to participate.”

However, investment manager Feng Huini thinks this five-year estimate is a bit “too optimistic.”

“When large tech companies approach these tool-type startups, what they most want to poach is actually not product or algorithm talent, but operations staff,” Feng Huini said. “This reveals one thing: on the technical side, large tech companies believe they have the ability to develop the products, and their current shortcoming lies in user penetration.”

Feng Huini's judgment is that the teams behind large models such as Seedance and Keling have very big ambitions. “They don't just want to be an infrastructure or a tool. They want to ‘define the next content platform and social platform’, and the tool is just a ‘by-product’,” she said.

In short, large tech companies will do it, but not tomorrow. During this window period, what can startups do?

Fang Chen, the CEO of Anijam, who left large tech companies such as Tencent and ByteDance to start a business, believes that the key for startups competing with large tech companies is to “start running earlier to build up user retention and accumulated data.”

In other words, time is a resource, and running speed determines who survives once the Sword of Damocles falls. “We should enter the market as soon as possible, acquire users, and accumulate data and knowledge in real-world use,” Fang Chen said.

Zhang Yunjian's plan for ZeroCut is that the company's moat lies in ‘AI implementation services’ and ‘social division of labor’.

“Even if the underlying models become very powerful, there will still be a large number of users who don't know how to use the tools, or enterprise customers who are reluctant to produce by themselves due to ‘cost-effectiveness’ and ‘comparative advantage’ considerations,” Zhang Yunjian said. Therefore, ZeroCut will avoid direct competition at the tool level and directly help customers solve the final ‘delivery and implementation’ problems.

This raises the question of the commercialization route: whether to profit from the spread between the computing power cost of large models and the price charged to users, or to find a new commercialization path. The former is simple, but once large model manufacturers lower their prices, they will draw users away directly; once they raise prices, the startups' profits will shrink. In short, it ultimately puts the startup's fate in the hands of others. Therefore, most startups choose the latter path.
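To see why the spread model leaves a startup exposed, consider a minimal sketch of its gross margin. The prices below are hypothetical illustrations, not figures from any company in this article, and `gross_margin` is a helper name introduced here for the example.

```python
def gross_margin(user_price: float, api_cost: float) -> float:
    """Gross margin of reselling one unit of generation at a markup."""
    if user_price <= 0:
        raise ValueError("user_price must be positive")
    return (user_price - api_cost) / user_price

# Hypothetical prices per second of generated video (not from the article).
USER_PRICE = 1.00
baseline = gross_margin(USER_PRICE, api_cost=0.80)
after_hike = gross_margin(USER_PRICE, api_cost=0.90)

print(f"Margin before upstream price rise: {baseline:.0%}")
print(f"Margin after upstream price rise:  {after_hike:.0%}")
```

In this illustration, a rise in the upstream API price from 0.80 to 0.90 cuts the reseller's margin roughly in half, even though nothing about its own product has changed.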

ZeroCut's approach is the “technology + service” model: customers with the ability can use the tools directly, while for customers who need outsourcing, the platform routes orders to creators who are proficient with the tools and provides stable video customization and delivery services. As for the billing standard, it has changed from the traditional ‘man-hour billing’ of content production companies to ‘Token billing’ in the AI era. Customers don't need to worry about fixed man-hour quotes but are billed based on the computing power consumed during the video generation process.
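The difference between the two billing standards can be sketched as follows. This is not ZeroCut's actual pricing logic; the `Job` fields, function names, and all rates are hypothetical and serve only to contrast billing on labor time with billing on compute consumed.

```python
from dataclasses import dataclass


@dataclass
class Job:
    hours_worked: float    # creator time spent on the order
    tokens_consumed: int   # compute units consumed during generation


def man_hour_bill(job: Job, hourly_rate: float) -> float:
    """Traditional production-company billing: pay for labor time."""
    return job.hours_worked * hourly_rate


def token_bill(job: Job, price_per_1k_tokens: float) -> float:
    """Token billing: pay for the compute consumed during generation."""
    return job.tokens_consumed / 1000 * price_per_1k_tokens


# Hypothetical job and rates, for illustration only.
job = Job(hours_worked=10, tokens_consumed=500_000)
print(man_hour_bill(job, hourly_rate=200))        # labor-based quote
print(token_bill(job, price_per_1k_tokens=2.0))   # consumption-based quote
```

Under token billing, the quote scales with how much generation the job actually consumed rather than with how long a person worked on it.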

Just letting users “spend money on points” is not enough. Many AI video generation tools are reaching deeper into customers' businesses and becoming more and more like all-in-one service providers.

TapNow, which features an “e-commerce + AI automatic generation” business model, is regarded by outsiders as “the project closest to money.” A former senior executive of a traditional 4A company said in an anonymous interview in BusinessFocus: “The ‘prediction + automatic generation’ logic of TapNow has taken away the short-video agency operation orders that originally belonged to small and medium-sized agencies.”

Ribbi can not only be used to create audio, images, and videos but also help users monitor the data after the content is posted on social media. Ganzijieyue is also committed to covering the entire process of content generation, posting, promotion, A/B testing, effect analysis, and secondary creation.

“People are lazy animals. No user wants to change a model or a set of tools for each product or each step,” Robin said.

Social media is the core training ground for the online evolution of the Agent. Posting works on social media is how the Agent interacts with the real world. After monitoring how the content performs, Ribbi can autonomously iterate on and optimize the creative path to deliver better results. Eventually, the platform can form an autonomously evolving creative closed loop.

Ribbi's business model is not yet fully settled, but Robin is sure that it will not be a point-based system in the future because that is “not honest and clear enough.”

However, the point-based system is still the mainstream commercialization model in the industry at present. After all, it is simple enough, and users have already been educated to accept it. But as the services provided by tools go deeper, and given the bright prospect that “Tokens will become cheaper in the future,” new business models may emerge for future services.

Zhang Shiying, the founder and CEO of Ganzijieyue, and Fang Chen agree on one point. They believe that the business model of the coming era should be “paying for the effect rather than for the cost.”

Fang Chen believes that in an ideal situation, when the accuracy of AI generation is high enough and the Token cost is low enough, users can pay only when they are willing to download the content, rather than paying for the Token consumption during the generation process.

Zhang Shiying believes that the business model of Agent should be more and more similar to that of human agencies. “The charging model will not be a subscription but more likely a commission-based model.”

Is the time window for tool-type companies a chance to give birth to an Adobe-sized company for the new era, or just a flash in the pan before large tech companies take over everything? Those who have already entered the field believe that the underlying large models belong to large tech companies, but that there is still something for startups to do at the application layer.

“I am determined to be a stepping-stone for silicon-based life,” Robin said. “Suppose one day, a model manufacturer achieves the autonomous evolution of AI. Even if I am not the one who makes the achievement, I am willing to contribute our know-how about the autonomous evolution of the Context Layer and open-source our technical architecture to help model manufacturers train better autonomous evolution models.”

The Dispute over Technical Routes: Providing Ideas or Editing?

There are also significant differences in thinking among startups at present.

Although they are all AI video generation tools, the product forms vary greatly. Some look like Douyin (automatically playing AI videos) or Dewu (full of AI advertising demos) when you open the homepage, while others have just a simple dialog box, like entering any chatbot. Behind this is the dispute over the technical routes in the industry.

Whether to have a canvas or an all-in-one Agent is one of the biggest technical differences at present.

The UI interaction mode of the “infinite canvas” has changed the traditional linear timeline, allowing creators to connect materials and workflows through nodes, just like in Figma or Miro. Star products adhering to this route include LibTV, SkyReels, TapNow, etc.

On these products, users have a canvas space that can be infinitely zoomed and dragged. You can connect a “picture node” to a “video node” and then to an “audio node” to form an automated pipeline.
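The node-and-connection idea can be sketched as a small dependency graph. This is an illustrative toy, not how LibTV or any other product is implemented: the node names mirror the example above, and `run_node` is a hypothetical placeholder for a real model call.

```python
from graphlib import TopologicalSorter

# Each key is a canvas node; its value is the set of nodes it depends on.
canvas = {
    "picture": set(),
    "video": {"picture"},
    "audio": {"video"},
}


def run_node(name: str, inputs: list[str]) -> str:
    """Hypothetical stand-in for a model call; just records what fed into it."""
    return f"{name}({', '.join(inputs)})"


def run_pipeline(graph: dict[str, set[str]]) -> tuple[list[str], dict[str, str]]:
    """Run every node after the nodes it depends on."""
    outputs: dict[str, str] = {}
    order = list(TopologicalSorter(graph).static_order())
    for node in order:
        inputs = [outputs[dep] for dep in sorted(graph[node])]
        outputs[node] = run_node(node, inputs)
    return order, outputs


order, outputs = run_pipeline(canvas)
print(order)
print(outputs["audio"])
```

Because each node only runs once its upstream nodes have produced output, a user can edit or swap any single node and rerun the chain from that point.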

LibTV canvas interface

The advantage of the canvas lies in the ‘strong control’ it gives to human intent: users can manually intervene and adjust at any step to ensure that the style, characters, and shot details of the AI-generated content all follow their plan.

Some creators compare LibTV's infinite canvas to “Lego bricks” because it lets them build storyboards freely and completely changes the linear editing logic.

However, there are also products that clearly oppose the canvas form, such as ZeroCut and Ribbi, which received more than 40,000 user application requests from around the world in a week.

The characteristic of these two products is that there is no prominent canvas, and all creative and editing interactions are concentrated in a small dialog box. Users communicate with the Agent in natural language, and then the Agent guides the model to generate content.

Zhang Yunjian is committed to promoting the paradigm shift from “human-led” to “Agent-led.” He told Zhineng Yongxian that ZeroCut regards the traditional canvas or workflow model as only a transitional form. These models are essentially “labor-intensive”: they expose AI capabilities as nodes that users must connect manually, which makes them a kind of automated industrial solution.