
Domestic models have quietly won a multimodal battle.

硅基星芒 | 2026-03-26 12:11
The latest masterpiece of Alibaba Tongyi: PrismAudio.

Just yesterday, domestic models quietly won a multimodal battle.

Just as ByteDance's Seedance 2.0 video generation model gained fame overseas, OpenAI suddenly announced that it would soon shut down its video generation model Sora's related services.

In this era of intelligent agents, people have gradually realized the importance of multimodal capabilities.

Thanks to its powerful capabilities, Seedance 2.0 has come to be regarded as a "magic tool" for future film production. Its one awkward shortcoming, for now, is the lack of sound.

Audio generation may seem simpler than video generation, but accurately dubbing a video (Video-to-Audio, V2A) is extremely difficult: a perfect soundtrack must not only "match the face" (semantics synchronized with the sound), but also be "pleasant to the ear" (aesthetic quality) and "immersive" (spatial stereo sound).

To remedy video generation models' shortcoming on the road to filmmaking, a research team from Alibaba's Tongyi Lab, the Hong Kong University of Science and Technology, and the Chinese University of Hong Kong has produced a landmark study: PrismAudio.

This is the industry's first framework to deeply integrate reinforcement learning (RL) and specialized multi-dimensional Chain-of-Thought (CoT) planning into V2A generation.

The research team not only proposed Fast-GRPO, an algorithm that sharply reduces the cost of reinforcement-learning training for diffusion models, but also open-sourced AudioCanvas, a high-difficulty benchmark dataset.

Even more astonishing, PrismAudio, with only 518M parameters, beat many billion-parameter models and set a new state of the art (SOTA) across every perceptual dimension.

01 The "Impossible Quadrilateral" of V2A Generation

Globally, AI multimodality is still largely confined to four core modalities: text, image, audio, and video.

Over the past year, text-to-image, text-to-video, and image-to-video models have become commonplace. The connection between audio and the other core modalities, however, has yet to be fully established.

Although many AI music-generation products are already on the market, video-to-audio generation is fundamentally different from "describing what's in a picture". In human perception, a qualified video soundtrack must pass four tests:

First, semantic consistency: This is the most basic requirement. If the person on screen is speaking Chinese, the voice cannot come out in English.

Second, temporal synchronization: Like "lip-syncing" in film and television, the timing of the sound must match the video.

Third, aesthetic quality: The audio needs subjective richness, fidelity, and artistry, rather than sounding like a flat electronic tone.

Fourth, spatial accuracy: The left and right channels must form a coherent sound-image trajectory that follows the moving objects on screen.

Early models such as V2A-Mapper mapped pictures directly to audio, leaving the intermediate process an uncontrollable "black box".

More recent models such as MMAudio and MovieGenAudio introduced text prompts for control, but their controllability remained weak.

It wasn't until July 2025 that Dr. Liu Huadai of Alibaba's Tongyi Lab open-sourced the ThinkSound model, pioneering the introduction of Chain of Thought (CoT) into multimodal large models. Letting the model "think" about what sound to make before generating it greatly improved its logic.

However, ThinkSound also has three fatal flaws:

First, an extremely chaotic chain of thought: It lumps object recognition, temporal alignment, aesthetic judgment, and spatial-position estimation into a single reasoning process.

This is like asking a student to sit exams in Chinese, mathematics, English, and physics all at once. As a result, the model is prone to "multimodal hallucinations".

Second, objective entanglement: Training relies on a single reconstruction loss, yet the perceptual objectives often compete with one another.

That is, to align the timing of a sound, the model may produce unpleasant noise; and a pleasant sound may fail to match the picture.

Third, no alignment with human preferences: Existing models merely fit the training data mechanically, without using Reinforcement Learning from Human Feedback (RLHF) to learn what humans actually find "pleasant".

This is also one of the biggest challenges faced by multimodal models:

For large language models, the correctness or incorrectness of an answer is obvious; but for images, audio, and video, humans can easily judge what is "bad", but cannot accurately define what is "good".

02 PrismAudio: Thinking Like a Top Sound Engineer

PrismAudio offers an elegant solution, and its core idea is not complicated: divide-and-conquer Chain-of-Thought planning plus targeted reinforcement-learning optimization.

Its architecture is built on a powerful base model.

To improve the model's understanding of video and of complex logic, the team replaced the traditional CLIP visual encoder with VideoPrism (Google, 2024), which is specialized for video understanding, and upgraded the text encoder to T5-Gemma (Google, 2025), which has strong logical-reasoning capabilities.

Next, it's time for its core technologies to shine:

1. Decomposed multi-dimensional Chain of Thought

Since cramming all V2A requirements into one thinking process doesn't work, PrismAudio simply splits the reasoning into four independent, specialized CoT tracks.

Before audio generation, the model needs to submit four "analysis reports" in sequence:

Semantic CoT: Focus on content recognition, for example, "A horse in the picture starts running, the sound of hooves gradually intensifies, and finally it stops with gasping sounds."

Temporal CoT: Focus on temporal sequencing, for example, "At first, it's a slow pace, then it speeds up into a stable rhythm, and finally the pace slows down until it stops."

Aesthetic CoT: Focus on sound quality perception, for example, "The audio maintains a clear and crisp sound of hooves, with natural reverberation."

Spatial CoT: Focus on sound field positioning, for example, "The sound appears from the left sound image, passes through the center, and finally fades out on the right."

The "analysis reports" from the four dimensions are then concatenated and fed to the diffusion base model as a strong text condition.

This explicit reasoning not only resolves the chaotic thinking, but also makes the generation "black box" more controllable and interpretable.
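As a concrete illustration, the four "analysis reports" above can be assembled into a single text condition. This is a minimal sketch under assumptions: the field names, bracket tags, and joining format are hypothetical, not the paper's actual implementation.

```python
# Hypothetical sketch of assembling PrismAudio-style multi-dimensional CoT
# reports into one text condition. Field names and the joining format are
# illustrative assumptions, not the paper's actual code.
from dataclasses import dataclass

@dataclass
class MultiDimCoT:
    semantic: str   # what is making sound
    temporal: str   # when sound events occur
    aesthetic: str  # desired sound quality
    spatial: str    # where sounds sit in the stereo field

    def to_condition(self) -> str:
        """Concatenate the four reports into one prompt for the diffusion model."""
        return "\n".join([
            f"[Semantic] {self.semantic}",
            f"[Temporal] {self.temporal}",
            f"[Aesthetic] {self.aesthetic}",
            f"[Spatial] {self.spatial}",
        ])

cot = MultiDimCoT(
    semantic="A horse starts running; hoofbeats intensify, ending with gasps.",
    temporal="Slow pace at first, accelerating to a steady rhythm, then stopping.",
    aesthetic="Clear, crisp hoofbeats with natural reverberation.",
    spatial="Sound enters from the left, crosses the center, fades out right.",
)
condition = cot.to_condition()
```

Keeping the four reports as separate fields until the final join mirrors the divide-and-conquer idea: each dimension can be generated, inspected, or corrected on its own before conditioning.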

2. Multi-dimensional reinforcement learning

With the thinking process straightened out, the next step is to resolve objective entanglement and make the generated audio match human preferences.

To this end, the team designed four independent reward models, one for each of the four CoTs:

For the Semantic CoT, use Microsoft's MS-CLAP model to evaluate whether the audio and text content are consistent;

For the Temporal CoT, use the highly sensitive Synchformer model to check whether the audio and video are synchronized;

For the Aesthetic CoT, use Meta's audio quality evaluation tool Audiobox Aesthetics to predict human subjective scores;

For the Spatial CoT, use the StereoCRM method to verify the accuracy of stereo direction positioning.

In this way, the generated audio has a concrete evaluation standard, and the reinforcement-learning mechanism has an ideal training target.

3. The Fast-GRPO algorithm

The research team first turned to GRPO, the lightweight, efficient reinforcement-learning algorithm proposed by the DeepSeek team in 2024.

However, GRPO only applies to the discrete, autoregressive generation of large language models. Applying it to multimodal diffusion models requires Flow-GRPO, a variant of GRPO adapted to flow-matching models.

But even so, a fundamental problem remains unsolved:

Whether generating images or audio, the model starts from a mass of pure noise and goes through dozens or hundreds of denoising steps to finally restore a clear signal.

To let the model search for "good" sounds during denoising, Flow-GRPO turns all of these hundreds of steps into stochastic differential equations: the model must inject a little random noise at every denoising step and compute the policy ratio.

The consequence is catastrophic: the back-propagation graph becomes extremely deep, GPU memory and training time explode, and the cost scales as O(T), where T is the total number of denoising steps.

Nowadays, computing power is cost. To plug this compute "black hole", the team adopted a seemingly expedient trick: Fast-GRPO.

This is a hybrid sampling path: before converting noise into audio, the model randomly selects an extremely narrow slice of the total steps. This interval of only a few steps is called the "optimization window".

Inside the window, the model follows a stochastic differential equation, injecting random noise to explore "better" sounds; outside the window, it follows an ordinary differential equation with deterministic sampling, which is extremely efficient, yields a unique path, and requires no policy-probability computation.
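The hybrid path can be sketched on a toy one-dimensional flow model. Everything here is an illustrative assumption: the drift function, step counts, window size, and noise scale stand in for the real model; only the structure, ODE steps everywhere, SDE noise injected in one small random window, reflects the idea described above.

```python
# Toy sketch of a Fast-GRPO-style hybrid sampling path on a 1-D flow model.
# The drift function, step counts, window size, and sigma are illustrative
# assumptions, not the paper's actual implementation.
import math
import random

def drift(x: float, t: float) -> float:
    """Toy velocity field standing in for the learned flow model."""
    return -x * (1.0 - t)

def hybrid_sample(total_steps: int = 100, window: int = 4,
                  sigma: float = 0.1, seed: int = 0) -> tuple[float, list[int]]:
    rng = random.Random(seed)
    dt = 1.0 / total_steps
    # Randomly place a narrow "optimization window" inside the trajectory.
    start = rng.randrange(0, total_steps - window)
    window_steps = list(range(start, start + window))

    x = rng.gauss(0.0, 1.0)  # start from pure noise
    for i in range(total_steps):
        t = i * dt
        x += drift(x, t) * dt  # deterministic ODE step: cheap, unique path
        if i in window_steps:
            # SDE step: inject exploration noise; only these few steps would
            # need policy ratios and gradients during RL training.
            x += sigma * math.sqrt(dt) * rng.gauss(0.0, 1.0)
    return x, window_steps

x_final, win = hybrid_sample()
```

Because gradients would only flow through the handful of SDE steps in `window_steps`, the backpropagation depth no longer grows with the full trajectory length, which is the source of the cost savings described above.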

At first glance, restricting random exploration to one small mid-trajectory segment might seem to distort the probability distribution the diffusion model ultimately generates.

In fact, the method comes with a rigorous mathematical proof.

When this method is actually applied to the model, the results are amazing:

First, compute consumption falls off a cliff: the training cost no longer scales with the full number of denoising steps T but only with the few steps inside the optimization window, and GPU memory usage and training time drop into a range an ordinary laboratory can afford, avoiding Sora-style overspending.

Second, both convergence speed and final quality improve: Fast-GRPO lets the model accomplish in only 200 steps what originally took a 600-step process, while the score rises from 0.47 to 0.51.

03 Winning with a Smaller Model and Crushing Competitors

True gold fears no fire. In an extremely harsh experimental environment, PrismAudio still demonstrated strong dominance:

On VGGSound, the large-scale audio-video dataset released by Oxford's VGG team in 2020, the 518M-parameter PrismAudio went head-to-head with its Tongyi Lab predecessor ThinkSound (1.3B), Tencent Hunyuan's Video-Foley (5.31B), and the open-source MMAudio (1.03B).

Whether on objective metrics such as semantic alignment (CLAP), audio-video synchronization error (DeSync), and spatial accuracy error (CRW), or on human-rated subjective metrics such as sound quality (MOS-Q) and audio-video consistency (MOS-C), PrismAudio outperformed every competitor, including the previous SOTA model.

As noted earlier, though, the audio modality lags somewhat behind text, images, and video; most existing evaluation datasets have rough annotations and cover only single scenarios.

To address this, the research team put great effort into building AudioCanvas, a high-difficulty benchmark of 3,177 real-world videos.

Every clip in the test set was strictly hand-filtered to eliminate interference from off-screen voices and background music, and 501 complex multi-event scenes were carefully designed to test the model's ability to distinguish and blend multiple sounds.

In addition, the research team used Gemini 2.5 Pro to generate detailed Chain-of-Thought reasoning texts for the videos; after manual verification, their accuracy exceeds 94%.

Facing the complex multi-event scenes in AudioCanvas, the previous-generation models all but collapsed on temporal synchronization and spatial accuracy.

PrismAudio, however, held steady, showing remarkable robustness and ranking first on every metric.

It is worth noting that on some objective metrics, such as semantic alignment and temporal synchronization, PrismAudio even outscores the original soundtracks of real videos.

That is to say, real-world noise drags down the objective scores of genuine recordings, while PrismAudio, through reinforcement learning, generates sounds that closely match the human ideal.

The last row in the table shows the results of the ablation experiment on the reward function:

Remove the multi-dimensional Chain of Thought and the Fast-GRPO algorithm, and PrismAudio's performance instantly turns mediocre, nearly indistinguishable from its competitors. The role of these core mechanisms is thus clearly demonstrated.

04 Alibaba's Choice in the Multimodal Field

The birth of PrismAudio not only marks a formal farewell to the uncontrollable "black-box" era of audio generation; it also opens considerable room for commercial application.

Looking at where domestic large models can go at the crossroads of the agent era, major AI companies in fact have very few options left. The core lies on two paths: coding ability and multimodal ability.

PrismAudio is precisely the card Alibaba chose to play after careful deliberation and the underwhelming marketing of the Qianwen app.

In the code and logical reasoning track, international top - level models led by Claude Code still occupy an absolute dominant position at present.

This path has extremely high R&D barriers and expensive inference costs; domestic companies have no realistic chance of shipping substitutes in the short term.

On the multimodal track, text-to-video generation has entered a stage of fierce competition: ByteDance's Seedance 2.0 has reached the global first tier, with Kling and Sora close behind.

By contrast, Tongyi Wanxiang has gained little traction. Without the data of a short-video platform like Douyin behind it, competing head-on for compute on this crowded visual track is not a wise strategy.

Therefore, Alibaba's solution is to take a detour: Since others are creating "silent shells", I'll create the "soul of sound".