Can AI Video Agent Products Only Make Quick Money Before the Model Makers Roll Over Them?
Text | Wang Yuchan, Zhou Xinyu
Editor | Yang Xuan
“Looking at the cash flow (i.e., revenue), the performance of AI video projects is indeed quite good. It can be said to be one of the most profitable niches in the AI field,” an investment industry insider told 36Kr.
The Chinese AI video generation market is currently enjoying the huge dividends of rapidly growing model capabilities at large companies. Two “super bases,” Seedance from ByteDance and Keling from Kuaishou, are iterating at high frequency, with a small update every week and a major update every two months. Alibaba also started a gray-scale test of its video generation model HappyHorse 1.0 at the end of April. The list price for 720P video generation is 0.9 yuan per second.
There are plenty of content creators willing to spend money on this, and they are in a great hurry. Short-drama and content companies queuing up to use Seedance 2.0 have become a spectacle of the AI world in 2026. As a result, AI video Agent products, which wrap a “shell” around the AI video model to make it easier to use, have also seen miraculous growth.
An industry insider told Intelligence Emergence that the monthly computing power consumption cost of a leading company should be over one million yuan. “The computing power consumption cost of a short drama is about 30,000 yuan. If a tool platform can take on 100 such projects in a month, the consumption can reach 3 million yuan. There is no difficulty in this; it's just a matter of time,” the insider said.
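The insider's estimate is simple multiplication, and it can be checked with a minimal sketch. The two input figures come from the quote above; the function names are illustrative, and the "projects needed" calculation is only a derived consequence of those same figures.

```python
import math

COST_PER_SHORT_DRAMA_YUAN = 30_000  # insider's estimate quoted above
PROJECTS_PER_MONTH = 100            # insider's hypothetical workload


def monthly_compute_spend(projects: int,
                          cost_per_project: int = COST_PER_SHORT_DRAMA_YUAN) -> int:
    """Total monthly computing power consumption in yuan."""
    return projects * cost_per_project


def projects_needed(target_yuan: int,
                    cost_per_project: int = COST_PER_SHORT_DRAMA_YUAN) -> int:
    """Smallest number of projects whose combined cost reaches the target."""
    return math.ceil(target_yuan / cost_per_project)


print(monthly_compute_spend(PROJECTS_PER_MONTH))  # 3000000
print(projects_needed(1_000_000))                 # 34
```

On these figures, roughly 34 short-drama projects a month would already put a platform at the one-million-yuan mark.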
Searching for keywords like “AI video generation tools” on Bing, you can see many advertisements for such products. “As far as I know, a leading tool platform spends 20,000 to 30,000 yuan a day on this kind of advertisement. So, the annual advertising investment in this single channel alone is at least 7 to 8 million yuan. From this, we can infer how high its revenue level is,” the industry insider said.
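The annualized figure follows from the quoted daily spend. A minimal sketch, assuming the spend runs every day of a 365-day year (the source does not say whether it does):

```python
DAYS_PER_YEAR = 365  # assumption: the ads run every day of the year


def annual_ad_spend(daily_yuan: int, days: int = DAYS_PER_YEAR) -> int:
    """Annualize a daily advertising budget, in yuan."""
    return daily_yuan * days


low = annual_ad_spend(20_000)   # lower bound of the quoted daily spend
high = annual_ad_spend(30_000)  # upper bound of the quoted daily spend
print(low, high)  # 7300000 10950000
```

The lower bound of about 7.3 million yuan is consistent with the insider's “at least 7 to 8 million,” while the upper bound would be close to 11 million.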
The AI video creation platform Creati told Intelligence Emergence that within one year of its launch, the global user base of the platform exceeded ten million. The product's ARR (Annual Recurring Revenue) once reached 20 million US dollars.
However, what worries these AI video Agent products is this: what if large companies also move from the model layer to the product layer and compete with them for business? In January this year, Douyin launched an AI video application called “Suibian,” integrating both tool and community services. Moreover, the application-layer functions designed by product companies may be rendered obsolete by a single upgrade of the large model.
“In the short term, there is still a cooperative relationship between these tool-type companies and large-model manufacturers. The profits of start-ups are largely determined by which models they can access and how large an API price discount they can get,” said Feng Huini, an investment manager at NIO Capital. “But at the same time, as far as I know, large companies are keeping a close eye on these ‘partners.’ In these important market segments, there may be more than one team working on them inside a large company.”
Is this an industry “waiting to be swallowed by large companies,” or can it really grow a tool company like Adobe?
The weak position of product companies in this ecological niche shows up in their profits. “If you look at the profits, in fact, everyone's gross profit margin is quite low,” an investor said. Many projects are sacrificing UE (unit economics) to gain scale because “there are currently no barriers in this industry, so everyone is burning money to subsidize customer acquisition and has not yet broken even.”
However, many investors are still willing to bet on it. LiblibAI, the parent company of LibTV and the best-known Chinese company in this field, completed a Series B financing of 130 million US dollars in October last year, led by Sequoia China, CMC Capital and other institutions. Earlier, it had also set an industry record of “four consecutive rounds of financing within one year.”
Ranking of tool-type companies by financing scale
“This year, AI video tools are one of the few investable niches because the iteration speed of video is much slower than that of language and coding. So, in the case where a large number of language tools and coding applications are being disrupted by base models, video generation projects are relatively more ‘promising,’” an investor told Intelligence Emergence.
These AI video Agent products still have time to build their own moats. In this “unequal competition,” who can survive?
The Sword of Large Companies and the Commercial Moat
Currently, mainstream tool-type products take three main forms:
Some focus on “ideas,” using AI Agents to reduce the creative process to natural-language instructions, such as ZeroCut and Ribbi. Others focus on “editing,” refining the infinite canvas and fine-grained adjustments as far as possible, such as LibTV and Buzzy. Still others “get closer to money,” directly linking video generation with e-commerce transactions and social media operations, such as TapNow.
All the interviewees in this article, entrepreneurs and investors alike, agree that once large-model manufacturers have finished the infrastructure-level work, they will inevitably move on to the application level. It's just a matter of time. The key questions are how long this window will last and whether the start-ups can still survive after it closes.
Zhang Yunjian, who used to work in a large company and experienced the competition in the era of classical Internet, created the AI video creation platform ZeroCut. He believes that “it will be difficult for large companies to perfectly cover the entire AI video production process in one go at least within five years.”
His judgment is mainly based on the following two understandings:
First, video production is an extremely long creative service chain. The outside world or investors often only focus on the “engineering tools” and “generation” aspects, but video generation actually only accounts for a small part of the entire production process. Before and after the actual video generation, there are very complex creative and chain processes. Therefore, the replacement of the process by AI will be a gradual process, and it will be difficult to reach the ultimate form of directly facing consumers without any manual intervention within five years.
Second, based on market competition and segmentation logic, it is difficult for a single manufacturer to excel in all aspects. A complete AI video workflow needs to call language models, image models, and video models. Although large companies have the ability to cover the entire process, it doesn't mean they can be the strongest in every niche. For example, some models are excellent in image generation, but their video capabilities may not be the strongest. This difference in capabilities will ultimately lead to market segmentation rather than a monopoly.
Robin, the founder and CEO of Ribbi, who also left a large company to start a business in AI creation tools, shares Zhang Yunjian's view on this point. “In large companies, aligning business, models, and top-level strategy is the hardest thing, unless there is an industry consensus,” Robin said. “Until they see the exact value of Taste, large companies are not willing to build models for aesthetics and taste. Only when visual creative generation moves from non-consensus to consensus will more large companies and top-notch talents be inspired to get involved.”
However, investment manager Feng Huini thinks this five-year estimate is a bit “too optimistic.”
“When large companies approach these tool-type start-ups, what they most want to poach is in fact not product or algorithm talent but operations personnel,” Feng Huini said. “This reveals one thing: in terms of technology, large companies believe they can build the products themselves, and their current shortcoming lies in user penetration.”
Feng Huini's judgment is that large models such as Seedance and Keling have very big ambitions. “They don't just want to be infrastructure or a tool. They want to ‘define the next content platform and social platform,’ and the tool is just a by-product,” she said.
In a nutshell, large companies will do it, but not tomorrow. During this window, what can start-ups do?
Fang Chen, the CEO of Anijam, who left large companies such as Tencent and ByteDance to start a business, believes that the key for start-ups competing with large companies is to “start running earlier to build user retention and accumulate data.”
In other words, time is a resource, and how quickly a company gets started determines whether it survives after the Sword of Damocles falls. “We need to enter the market as soon as possible, acquire users, and accumulate data and knowledge in real-world use,” Fang Chen said.
Zhang Yunjian's plan for ZeroCut is that the company's moat lies in ‘AI implementation services’ and ‘social division of labor’.
“Even if the underlying models become very powerful, there will still be many users in the market who don't know how to use the tools, or enterprise customers who are reluctant to produce content themselves because of ‘cost-effectiveness’ and ‘comparative advantage,’” Zhang Yunjian said. Therefore, ZeroCut will avoid direct competition at the tool level and directly help customers solve the final ‘delivery and implementation’ problems.
This raises the question of the commercialization route: whether to profit from the spread between the computing power cost of large models and the price charged to users, or to find a new commercialization path. The former is simple, but if large-model manufacturers lower their prices, they will draw users away, and if they raise them, the start-up's profit shrinks. In essence, it means handing one's lifeline to others. Therefore, most start-ups choose the latter path.
ZeroCut's approach is the “technology + service” model: customers who have the ability can use the tools directly; for customers who need outsourcing, the platform routes the orders to creators proficient with the tools and provides stable video customization and delivery services. As for billing, it has moved from the traditional ‘man-hour billing’ of content production companies to ‘Token billing’ in the AI era. Customers no longer deal with fixed labor quotes and are instead billed based on the computing power consumed in the video generation process.
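The difference between the two billing bases can be illustrated with a minimal sketch. Everything here is hypothetical: the article gives no rates or token counts, so the `GenerationJob` type, the hourly rate, and the per-token price are placeholders chosen only to show that one bill scales with labor time and the other with metered consumption. It is not ZeroCut's actual pricing logic.

```python
from dataclasses import dataclass


@dataclass
class GenerationJob:
    tokens_consumed: int  # hypothetical metered compute for one generation run


def man_hour_bill(hours: float, hourly_rate_yuan: float) -> float:
    """Traditional quote: the bill scales with labor time."""
    return hours * hourly_rate_yuan


def token_bill(jobs: list[GenerationJob], yuan_per_1k_tokens: float) -> float:
    """Token billing: the bill scales with compute actually consumed."""
    total_tokens = sum(job.tokens_consumed for job in jobs)
    return total_tokens / 1000 * yuan_per_1k_tokens


# Placeholder numbers, not taken from the article.
jobs = [GenerationJob(120_000), GenerationJob(80_000)]
print(man_hour_bill(hours=8, hourly_rate_yuan=200))  # 1600
print(token_bill(jobs, yuan_per_1k_tokens=0.5))      # 100.0
```

Under token billing, a failed or repeated generation still adds to the bill, which is the concern behind the “paying for results, not for costs” idea discussed later in the article.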
Just letting users “spend money to buy points” is not enough. Many AI video generation tools are reaching deeper into customers' businesses and increasingly resemble all-in-one contractors.
TapNow, which features an “e-commerce + AI automatic generation” business model, has been described by outsiders as “the project closest to money.” A senior executive of a traditional 4A company said in an anonymous interview with “BusinessFocus”: “The ‘prediction + automatic generation’ logic of TapNow has taken away the short-video agency operation orders that originally belonged to small and medium-sized agencies.”
Ribbi can not only be used to create audio, images, and videos but also help users monitor the data after content is posted on social media. The same goes for Perception Leap, which is committed to covering the entire process of content generation, posting, promotion, A/B testing, effect analysis, and secondary creation.
“People are inherently lazy animals. No user wants to change models and tools for each product or process,” Robin said.
Social media is the core training ground for the online evolution of Agents. Posting works on social media is how an Agent interacts with the real world. After monitoring the data performance, Ribbi can independently iterate and optimize the creative path to deliver better results. Eventually, the platform can form an autonomously evolving creative closed loop.
Ribbi's current business model has not been fully settled, but Robin is sure that it will not be a points-based system in the future because that is “not honest and clear enough.”
However, the points-based system is still the mainstream commercialization model in the industry at present. After all, it is simple enough, and users have already been educated on it. But as the services tools provide go deeper, and given the bright prospect that “Tokens will become cheaper in the future,” new business models may emerge.
Zhang Shiying, the founder and CEO of “Perception Leap,” and Fang Chen have a consensus. They believe that the business model in the future should be “paying for results, not for costs.”
Fang Chen believes that in an ideal situation, when the accuracy of AI generation is high enough and the Token cost is low enough, users can pay only for the final output when they are willing to download the content, rather than paying for the Token consumption during the generation process.
Zhang Shiying believes that the business model of Agents should become more and more like that of human agencies. “The charging model will not be subscription-based, but more likely commission-based,” she said.
Is the time window for tool-type companies an opportunity to create a new Adobe-like company for the new era, or just a flash in the pan before large companies take over everything? Those who have already entered the market believe that the underlying large models belong to large companies, but that there is still room for start-ups in the application layer.
“I am determined to be a stepping-stone for silicon-based life,” Robin said. “Assuming that one day a model manufacturer achieves the autonomous evolution of AI, even if I am not directly involved in the success, I am willing to contribute our know-how on the autonomous evolution of the Context Layer and open-source our technical architecture to help model manufacturers train better autonomous evolution models.”
The Dispute over Technical Routes: Providing Ideas or Editing?
There are currently significant differences in thinking among start-ups.
Although they are all AI video generation tools, the product forms vary greatly. Some look like Douyin (automatically playing AI videos) or Dewu (full of AI advertising demos) as soon as you open the homepage, while others have just a simple dialog box, like any chatbot. Behind this is the dispute over the technical routes in the industry.
Whether to choose a canvas or an all-in-one Agent is one of the biggest technical divides at present.
The UI interaction method of the “infinite canvas” has changed the traditional linear timeline, allowing creators to connect materials and workflows through nodes, just like in Figma or Miro. Star products adhering to this route include LibTV, SkyReels, TapNow, etc.
On these products, users have a canvas space that can be infinitely zoomed and dragged. You can connect a “picture node” to a “video node” and then to an “audio node” to form an automated pipeline.
LibTV canvas interface
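The node chain described above (a picture node feeding a video node feeding an audio node) can be sketched as a small graph of connected steps. This is only an illustration of the general node-and-edge idea: the `Node` class, `connect`, `execute`, and the string-returning stand-ins for model calls are all hypothetical and do not reflect how LibTV or any other product is implemented.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Node:
    name: str
    run: Callable[[str], str]  # takes the upstream artifact, returns its own
    downstream: list["Node"] = field(default_factory=list)

    def connect(self, other: "Node") -> "Node":
        """Draw an edge to another node; returns it so calls can be chained."""
        self.downstream.append(other)
        return other


def execute(start: Node, seed: str) -> list[str]:
    """Run nodes depth-first from `start`, passing each output downstream."""
    outputs: list[str] = []

    def visit(node: Node, artifact: str) -> None:
        result = node.run(artifact)
        outputs.append(f"{node.name}: {result}")
        for nxt in node.downstream:
            visit(nxt, result)

    visit(start, seed)
    return outputs


# Stand-ins for model calls; real nodes would invoke image, video, audio models.
picture = Node("picture", lambda prompt: f"image({prompt})")
video = Node("video", lambda img: f"clip({img})")
audio = Node("audio", lambda clip: f"scored({clip})")
picture.connect(video).connect(audio)

for line in execute(picture, "a cat on a skateboard"):
    print(line)
```

Because every node is a separate object, a user could swap or re-run a single step without touching the rest of the chain, which is the kind of manual control the canvas approach is meant to offer.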
The advantage of the canvas lies in the ‘strong control’ of human will: users can manually intervene and adjust at any stage to ensure that the style, characters, and shot details of the AI-generated content stay within their own plans.
Some creators have compared LibTV's infinite canvas to “Lego bricks,” because storyboards can be built freely and the linear editing logic is completely changed.
However, there are also products that clearly oppose the canvas form, such as ZeroCut and Ribbi, which received more than 40,000 user application requests from around the world in a week.
The feature of these two products is that there is no prominent canvas, and all creative and editing interactions are concentrated in a small dialog box. Users communicate with the Agent in natural language, and then the agent guides the model to generate content.
Zhang Yunjian is committed to promoting the paradigm shift from “human-led” to “Agent-led.” He told Intelligence Emergence that ZeroCut regards the traditional canvas or workflow model as only a transitional form. In essence, these models are “labor-intensive”: they treat AI capabilities as nodes and leave users to connect them manually, which amounts to an automated industrial solution.