
Consuming 50 trillion tokens daily: Volcengine's battle in the AI consumer-goods market

晓曦 | 2025-12-19 18:52
Large models have moved from competing on single-point capabilities to competing on systems engineering.

Text | Lumos

Cover source | AI-generated

If you want to know how far the AI market has come, ByteDance's Volcengine has become the leading indicator for the Chinese market.

"As of December this year, the daily average token usage of the Doubao large model has exceeded 50 trillion, more than ten times the figure of the same period last year." On December 18th, at the bustling Force Conference, Tan Dai, the President of Volcengine, announced this figure.

Earlier in 2025, this figure was only 16.4 trillion. Image source: Volcengine

MaaS (Model as a Service) is the most direct lens for observing model consumption. In this market alone, Volcengine is now number one domestically and third globally.

In mid-2025, the competition among cloud providers for the title of "No.1 AI Cloud" was still fierce. By the last month of the year, the major tech companies had all shipped new model versions: Google's flagship Gemini 3 and video model Veo 3.1 made a big splash, followed closely by OpenAI's GPT-5.2, while in China giants such as Alibaba and Tencent also rolled out model updates.

If we were to summarize the keywords for the 2025 AI market, multimodality and Agent would surely be on the list.

At the Force Conference, the key products Volcengine launched also revolved around these two themes:

On the model side: the Doubao flagship model 1.8 and the video generation model Seedance 1.5 pro;

On Agent toolchains and ecosystem services: hosted inference for enterprises' proprietary models, a reinforcement learning platform, the enterprise-grade AI Agent platform AgentKit, and the HiAgent "1 + N + X" intelligent-agent workstation for Agent operations.

Tan Dai, President of Volcengine

At the Force Conference, Volcengine was determined to "fully embrace Agent": it even built an Agent to handle conference registration and on-site guidance.

"You might think this is easy, but it's actually quite challenging for us!" Tan Dai said with a smile. "The current model capabilities are strong enough, but many enterprises still can't make full use of them. The problem is that the tools and ecosystem for Agent are still in their early stages, so enterprises' Agent iteration is slow."

It has been five years since ByteDance entered the cloud market in 2020. At the time, ByteDance was considered a "newcomer" to the cloud market; now, riding the wave of large models, it has become a force to be reckoned with in AI. In 2024, Volcengine's revenue exceeded 11 billion yuan, with a growth rate far above 60%. This year, the figure has passed 20 billion yuan.

Forget about parameters. Models are becoming mature consumer products.

In 2025, the video model market was extremely competitive throughout the year.

The biggest difference from last year: manufacturers used to compete on parameters and generation speed, but competition in video generation has since moved up a level. The real dividing line is now the ability to directly produce "publishable, complete works."

For example, recently, AI video manufacturers have been competing on a new feature: simultaneous audio and video output.

Previously, the video clips generated by models were more like semi-finished products, requiring extensive post-production editing, dubbing, and synchronization. Creating an AI video involved using multiple platforms and complex editing processes.

The newly launched Seedance 1.5 pro makes exactly this its main selling point: output that is usable out of the box. At the Force Conference, Tan Dai only briefly touched on Seedance 1.5 pro's parameters and instead went straight to multiple demos spanning styles such as film, animation, and commercial shoots.

We also tried Seedance 1.5 pro ourselves. Overall, even with the simplest prompts, it directly generates videos with synchronized audio and video; the lip-syncing, emotion and context capture, and coordination with the picture are all highly usable. Two of the prompts we used follow, with a minimal API sketch after them.

  • Prompt: A little girl is in a room, facing the audience. Behind her, an adult hands her a Christmas gift box. When she opens it, she finds a cute little puppy jumping out of the box. She smiles happily and says, "You're so kind!"

  • Prompt: An anime girl with blue hair stands under a cherry blossom tree, with cherry blossom petals falling. She reaches out to catch a petal, spins around happily, and her skirt flutters as she spins. She smiles and says in English, "Spring has finally arrived!"
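For readers who want to script this kind of generation rather than use a web UI, the sketch below shows how a prompt like the ones above might be submitted to a video-generation HTTP endpoint. The URL, payload fields, flags, and response shape are all hypothetical placeholders, not the actual Volcengine/Ark API; consult the official documentation for the real interface.

```python
# Hypothetical sketch: submitting a text prompt to a video-generation service.
# The URL, payload fields, and response shape below are placeholders, NOT the
# real Volcengine/Ark API; check the official docs for the actual names.
import os

import requests

API_URL = "https://example.com/v1/video/generations"  # placeholder endpoint
API_KEY = os.environ["VIDEO_API_KEY"]                 # assumed auth scheme

payload = {
    "model": "seedance-1.5-pro",  # name as promoted; the exact ID may differ
    "prompt": (
        "A little girl is in a room, facing the audience. Behind her, an "
        "adult hands her a Christmas gift box. When she opens it, a cute "
        "puppy jumps out. She smiles happily and says, 'You're so kind!'"
    ),
    "audio": True,   # hypothetical flag: request synchronized sound
    "duration": 8,   # hypothetical: clip length in seconds
}

resp = requests.post(
    API_URL,
    json=payload,
    headers={"Authorization": f"Bearer {API_KEY}"},
    timeout=300,
)
resp.raise_for_status()
print(resp.json().get("video_url"))  # hypothetical response field
```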

In 2025, the AI video model field continued to evolve at an extremely rapid pace.

In 2024, video models from various companies were still mainly solving consistency and the naturalness of characters' movements and expressions; one frame might show Will Smith eating noodles, while the next featured a different character entirely.

Earlier in 2025, the previous version, Seedance 1.0 pro, made "native multi-shot storytelling" its core selling point: from a complex script it could automatically plan a combination of long shots, close-ups, and medium shots while keeping the main character consistent.

However, these issues are no longer the most critical ones. Video generation models have rapidly advanced to a level close to production-grade usability. Sound has become an important competitive factor.

Seedance is not alone here: Kuaishou's Keling 2.6, Google's Veo 3.1, and Alibaba's WAN 2.5, all released in the second half of this year, have each highlighted audio-video synchronization in their promotions.

Source: Xiaohongshu user @AI Distorting Mirror

In comparison, Seedance 1.5 pro has its own unique features.

First, Seedance 1.5 pro achieves very accurate lip-syncing. By contrast, overseas models such as Google's Veo 3.1 are less well adapted to Chinese, often producing poor lip-sync and unnatural dubbing.

Second, the videos Seedance 1.5 pro generates are more immersive: beyond the lip-syncing, the audio also matches the characters and the environment.

Camera movement and dynamic tension have always been Seedance's strong suits, and in 1.5 pro they have been pushed further, toward movie-grade standards.

For example, in an outdoor scene, the character's voice sounds more distant, even carrying a faint echo that matches the weather and the space.

  • Prompt: A man stands on a rainy street, wearing a black trench coat. Raindrops are flowing down his face. He slowly raises his head to look at the sky and whispers in Shanghainese, "It's time to end this." The camera switches to the person opposite, who replies, "What are you going to do?" The background shows blurry neon lights and a wet street. Finally, the camera switches to a few passers-by behind the man, who are quietly observing on the other side of the road.

  • Prompt: A red sports car speeds along a mountain road, with white smoke rising from the tires due to friction. The car takes a sharp turn, tilting to one side. Then the camera switches to the driver's seat, where the driver tightly grips the steering wheel, looks focused, snorts, and the car accelerates past the finish line.

The videos generated by Seedance 1.5 pro, in terms of action amplitude, multi-shot presentation, and multi-subject handling, are clearly above the industry average.

In fact, achieving audio-video synchronization requires not only large amounts of training data but also significant changes to the training architecture and the choice of technical route.

Previously, video generation mostly relied on traditional T2V (text-to-video) models that generated only the picture, yielding "silent videos." Users then had to add their own audio and background music and synchronize the lips, which was time-consuming and labor-intensive.
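To make the contrast concrete, here is a minimal, purely illustrative Python sketch of the two approaches. Every function is a dummy stand-in (no real models are involved, and none of these names are Seedance APIs); the point is only the shape of the workflow.

```python
# Illustrative only: the old multi-stage pipeline vs. joint audio-video
# generation. All functions are hypothetical stubs returning dummy data.
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Clip:
    frames: List[str]           # placeholder for decoded video frames
    audio: Optional[List[int]]  # placeholder waveform; None means silent


def t2v_model(prompt: str) -> Clip:
    # Old pipeline, stage 1: text-to-video only; no soundtrack is produced.
    return Clip(frames=[f"frame_{i}" for i in range(48)], audio=None)


def tts_model(prompt: str) -> List[int]:
    # Old pipeline, stage 2: dialogue synthesized separately from the video.
    return [0] * 16000


def lip_sync(clip: Clip, audio: List[int]) -> Clip:
    # Old pipeline, stage 3: retime mouth movement to the audio track.
    # Stitching independently generated streams is where sync errors creep in.
    return Clip(frames=clip.frames, audio=audio)


def traditional_pipeline(prompt: str) -> Clip:
    """Three loosely coupled stages; mismatches compound between them."""
    silent = t2v_model(prompt)
    speech = tts_model(prompt)
    return lip_sync(silent, speech)


def joint_av_model(prompt: str) -> Clip:
    # Joint generation: frames and audio come out of one pass, so lip
    # shapes, timing, and ambience are consistent by construction.
    return Clip(frames=[f"frame_{i}" for i in range(48)], audio=[0] * 16000)


if __name__ == "__main__":
    prompt = "a girl opens a gift box and a puppy jumps out"
    print("traditional has audio:", traditional_pipeline(prompt).audio is not None)
    print("joint model has audio:", joint_av_model(prompt).audio is not None)
```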

Eliminating that workflow has real commercial value, both for C-end users chasing creative efficiency and for B-end customers who care about cost and stability.

Improvements to the training architecture have also made commercialization more efficient. For example, through engineering optimizations such as multi-stage distillation and quantization, the end-to-end inference speed of Seedance 1.5 pro has increased more than tenfold, significantly reducing generation cost.
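Volcengine has not published its exact distillation recipe or quantization scheme, so the sketch below only illustrates the two generic techniques named above on a toy PyTorch model: a small student network trained to mimic a larger teacher, followed by dynamic int8 quantization of the student's linear layers.

```python
# Toy illustration of distillation + quantization; not Seedance's actual setup.
import torch
import torch.nn as nn
import torch.nn.functional as F

teacher = nn.Sequential(nn.Linear(64, 256), nn.GELU(), nn.Linear(256, 64))
student = nn.Sequential(nn.Linear(64, 32), nn.GELU(), nn.Linear(32, 64))

opt = torch.optim.Adam(student.parameters(), lr=1e-3)

# Distillation: train the small student to match the big teacher's outputs,
# so fewer parameters (and, for diffusion models, fewer sampling steps) are
# needed at inference time.
for _ in range(100):
    x = torch.randn(32, 64)
    with torch.no_grad():
        target = teacher(x)
    loss = F.mse_loss(student(x), target)
    opt.zero_grad()
    loss.backward()
    opt.step()

# Quantization: store weights as 8-bit integers instead of 32-bit floats,
# shrinking memory traffic and speeding up inference on supported hardware.
quantized = torch.quantization.quantize_dynamic(
    student, {nn.Linear}, dtype=torch.qint8
)
print(quantized(torch.randn(1, 64)).shape)
```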

Wu Di, head of intelligent algorithms at Volcengine, said in an interview that Volcengine set its model-training goals around key B-end scenarios from the very beginning, and "audio-video synchronization" is one of customers' core requirements.

It can be said that with the maturity of factors such as consistency, camera movements, storytelling, and sound, the puzzle of AI video generation is gradually being completed.

This also reflects the maturity of the entire creative ecosystem.

This can be seen in how Seedance 1.5 pro was promoted on Xiaohongshu. ByteDance's AI video Agents, such as Xiaoyunque and Jimeng, promoted 1.5 pro mainly with short videos full of actions, plots, and stories, with a very strong Douyin feel.

Remixed and fun videos on Xiaohongshu

How fun a video is largely determines how widely it spreads. Seedance 1.5 pro's support for dialects, dialogue, and highly performative scenarios makes it naturally suited to generating social currency for remixing and sharing on C-end products such as Doubao and Jimeng; dialect "fun" videos, for example, have become a proven way for AI video models to attract users.

As a short-video giant, ByteDance has a deep understanding of content - what kind of content can go viral and why. These insights have ultimately been translated into the training goals of the models.

The signal is very clear: as video generation models gradually mature, these AI-generated videos will soon be integrated with C-end products such as Doubao, Jimeng, and Xiaoyunque, providing users with social currency for secondary creation and sharing.

When a model can understand and execute film language such as long takes and Hitchcock zooms, and can accurately reproduce dialects such as Sichuanese, Cantonese, and Shanghainese, it is no longer just a readily available creative tool: it has the potential to evolve into a social platform.

Large models are now a battle of systems engineering, but Volcengine wants to make using models simple.

The rapid growth of Volcengine reflects the current explosion of AI applications.