AI Company with $20M ARR Launches "Video Version of Photoshop" Named "Buzzy" After Raising $20M

Once AI video generation technology is mature enough, the application layer has only two places left to work: before content is generated and after it is generated.

Text by | Zhou Xinyu

Edited by | Yang Xuan

One-sentence Introduction

Buzzy (https://www.buzzy.now/) is a video editing Agent platform from the AI content creation company "Perceptual Leap". It mainly targets individual (C-end) content creators and small and medium-sized businesses.

Like a "video version of Photoshop", Buzzy lets users issue natural-language instructions that drive the Agent to edit a video: background removal, lighting correction, product replacement, background or perspective changes, and so on.

Team Introduction

Ella Zhang (Zhang Shiying), founder and CEO of "Perceptual Leap", previously led core products at Apple, Oculus VR, and Google.

During her time at Apple, she was a core member of the founding team of the AirPods product line, responsible for system integration and the full design cycle of the products, including audio architecture design, component selection, schematic design, layout design, verification, and mass production.

After that, Zhang Shiying served as the system architect for AR products at Google, responsible for the algorithm and architecture research and development of products such as Glass and Reflector.

The other core members of "Perceptual Leap" come from companies such as Adobe, Xiaomi, and SenseTime.

Financing Progress

Recently, "Perceptual Leap" completed a new round of financing. The amount exceeded $20 million, and the lead investor was Redpoint. Shendu Capital served as the exclusive financial advisor for this round.

Products and Business

In Zhang Shiying's view, as video generation models have improved, generation-focused tools have gradually become a "red ocean". She roughly divides the video creation tools on the market into two categories:

The first is "canvas-type" products. Their advantage is that manual control keeps the quality of the output dependable; their disadvantage is that the barrier to entry is high for most users. The second provides users with pre-made workflows and templates. Its disadvantage is that it is not flexible enough and the resulting ideas lack originality.

"Users tend to generate a whole video in one go and then refine it toward the ideal result through repeated iteration. So a video editor that can precisely target specific parts has become a necessity."

Currently, because a video has to stay coherent and models' understanding is still limited, it is difficult for users to make "local fine-tuning" edits through chat, such as changing the background, replacing characters, or removing certain elements. Most AI editors change the entire frame, which is close to regenerating the video.

Buzzy, the new product recently launched by "Perceptual Leap", is an AI video editor that lets users retouch ("P") a video as easily as they would a picture.

Through chat alone, Buzzy can remove passers-by in the background, correct lighting, replace products, collate videos, and change the background and perspective, achieving genuinely local fine-tuning.

△ Background passers-by removal. Left: After removal; Right: Before removal. Image source: Provided by the interviewee.

△ Changing the lighting. Top: Before change; Bottom: After change. Image source: Provided by the interviewee.

△ Changing the shooting angle. Left: After change; Right: Before change. Image source: Provided by the interviewee.

Editing one part of a video while leaving the rest untouched is not easy. Zhang Shiying told us that local editing requires the video model to have stronger video and language understanding. "First, it needs to identify what the part to be modified is and where it appears. Second, it also needs to accurately understand the user's intention, such as a meme referenced in the prompt."
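
The two requirements described here, locating the part to change and leaving everything else alone, can be illustrated with a toy mask-based composite. This is a minimal sketch for illustration only and is not Buzzy's actual method, which the article does not describe. Frames are tiny grids of grayscale values, and the names `locate_region`, `apply_local_edit`, and `edit_video` are hypothetical.

```python
from typing import Callable, List

Frame = List[List[int]]  # a tiny grayscale frame: rows of pixel values
Mask = List[List[bool]]  # True where the edit is allowed to change a pixel


def locate_region(frame: Frame, target_value: int) -> Mask:
    """Hypothetical stand-in for 'what to modify and where it appears':
    mark every pixel whose value equals target_value."""
    return [[pixel == target_value for pixel in row] for row in frame]


def apply_local_edit(frame: Frame, mask: Mask,
                     edit_pixel: Callable[[int], int]) -> Frame:
    """Change only the masked pixels; copy every other pixel unchanged."""
    return [
        [edit_pixel(pixel) if selected else pixel
         for pixel, selected in zip(row, mask_row)]
        for row, mask_row in zip(frame, mask)
    ]


def edit_video(frames: List[Frame], target_value: int,
               edit_pixel: Callable[[int], int]) -> List[Frame]:
    """Locate the target in each frame, then edit only that region."""
    return [
        apply_local_edit(frame, locate_region(frame, target_value), edit_pixel)
        for frame in frames
    ]


# Two 2x3 frames in which the "object" (value 9) moves one column to the right.
video = [
    [[0, 9, 0], [0, 9, 0]],
    [[0, 0, 9], [0, 0, 9]],
]
edited = edit_video(video, target_value=9, edit_pixel=lambda _: 5)
```

In this sketch the background pixels pass through untouched in every frame, which is the property that separates a local edit from regenerating the whole picture. A real system would need a learned model to produce the mask and to keep the edit consistent across frames.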

For this reason, "Perceptual Leap" trained a small model based on RLHF (Reinforcement Learning from Human Feedback) to enhance Buzzy's understanding of video editing.

At the same time, Buzzy is also designed as an Agent that can independently learn the user's aesthetics and taste.

Buzzy has launched an "OpenClaw-like" Bot. Users can connect the Bot directly to Telegram and WhatsApp by scanning a QR code.

When users share TikTok and YouTube video links with the Bot, it automatically analyzes their video preferences and taste, searches the web around the clock for inspiration material in that style, and distills the style into a Skill.

△ Distilling a style into a Skill. Image source: Provided by the interviewee.

Since its founding in 2021, "Perceptual Leap" had already gone through two iterations of content creation products:

Before text-to-image products such as Midjourney and Stable Diffusion took off, "Perceptual Leap" used GAN (Generative Adversarial Network) technology to develop ZMO.ai, the first AI model image generation platform for domestic B-end e-commerce customers, and later extended it to product image design, editing, and other scenarios.

△ ZMO. Image source: Provided by the interviewee.

Thanks to its first-mover advantage, ZMO.ai's MAU (monthly active users) once reached 7 million.

In 2024, the release of Sora set off a small boom in video generation. Riding this trend, "Perceptual Leap" discontinued ZMO.ai and, in April 2024, launched Creati, a content creation platform covering both pictures and videos.

Compared with ZMO.ai, which focuses on e-commerce and advertising picture generation and editing, Creati expands content creation to the video field, including text-to-video generation and secondary creation based on video templates.

Creati also comes as a mobile product. Many non-professional content producers can shoot material on their phones and then create, edit, and publish content directly in the App, without transferring the material to a computer.

"Users' demand for AI-generated videos is more urgent than their demand for pictures," Zhang Shiying said. "In terms of reach, whether on social media or in advertising, videos grab more attention than pictures. At the same time, it is much harder for users to shoot videos than to create pictures."

△ Creati. Image source: Provided by the interviewee.

The target users have also changed. ZMO.ai's main customers were domestic B-end e-commerce and advertising companies. But Zhang Shiying soon found that although ZMO.ai's user numbers were growing rapidly, the traffic was not converting into actual payments.

There were two core reasons. First, the payment cycle of "big B" customers is too long. Second, pictures cost less to create than videos, so users' willingness to pay for pictures is not high enough.

Creati is a product targeting "big C and small B": C-end content creators and small and medium-sized merchants. Zhang Shiying told "Intelligent Emergence" that "big C and small B" are the groups with the highest willingness to pay. "Larger B-end enterprises tend to develop their own workflows."

One year after its launch, the global user base of Creati exceeded 10 million. The product's ARR (Annual Recurring Revenue) once reached $20 million.

Business Model

Covering Token consumption costs with user subscriptions is the mainstream business model for AI software today. But Zhang Shiying believes subscription is the business model of the SaaS era. In the Agent era, users should pay for results, not for costs.

She told "Intelligent Emergence" that at present, users still regard the Agent as a tool, not as a value creator.

Once the Agent can cover the entire creative process, including content generation, publishing, ad placement, A/B testing, performance analysis, and secondary creation, its business model should increasingly resemble that of a human agency. "The charging model will not be subscription; it is more likely to take the form of commission."

Founder's Thoughts

Most non-professional users create content mainly on mobile, not on PC.

Many merchants and non-professional content creators are used to shooting product pictures, short videos, and other material on their phones. Paradoxically, creation tools are often concentrated on PC, which breaks the content creation chain.

Therefore, both Creati and Buzzy provide users with Mobile App products, so that material acquisition, content creation and editing, and publishing can all be completed on the mobile phone.

Once AI video generation technology is mature enough, the application layer has only two places left to work: before content is generated and after it is generated.

Before content generation, the application layer solves the problem of generating ideas. After content generation, it needs to solve the problem of "how to modify".

The application layer should not do the model layer's work, because models will certainly become good enough.

Currently, many products simply wrap the capabilities of video models. Whether they take the "canvas" or the workflow approach, they are all compensating for the model's shortcomings, such as "card drawing" (rerolling generations until one turns out well) and limited video length.

But in the future, the model layer will definitely solve the problems of generation quality and length. The opportunity for the application layer lies in solving problems outside the generation process.

In the future, Skills will become tradable assets.

Skills are essentially a user's distilled taste, understanding, and workflow. In creative work, a person's aesthetics, taste, and knack for finding material are all valuable.

Therefore, in the future, selling Skills may become a business model.

In a new era, new products should be built as standalone products rather than added as a new entry point to old ones.

Buzzy and Creati are two completely different generations of products. Creati focuses on generation, while Buzzy focuses on post-generation editing. Different generations of products will form different user mental models.

Going viral is always largely accidental, and products should not chase virality too hard.

Many products that users genuinely need, such as PDF editors, have no potential to go viral on social media, yet their user bases are very large.

In our experience, products that can go viral share several characteristics. First, the product form and design are relatively innovative. Second, they are practical: users will only spread a product spontaneously if it solves their pain points. Third, they lower the barrier for users to produce interesting content.


This article was originally produced by 「阿菜cabbage」. For reprints or content cooperation, see the Reprint Instructions; unauthorized reprints will be held accountable.
