Accepted at NeurIPS 2025: BAAI, Peking University, and Beijing University of Posts and Telecommunications propose a multi-stream controlled video generation framework that achieves precise audio-visual synchronization through audio demixing.
Existing audio-driven video generation methods are typically limited by treating the input audio as a single whole, which blurs the correspondence between audio and vision. In response, the Beijing Academy of Artificial Intelligence (BAAI), Peking University, and Beijing University of Posts and Telecommunications have jointly proposed a video generation framework for audio-visual synchronization based on audio demixing. The framework splits the input audio into three tracks, speech, sound effects, and music, and verifies the effectiveness of audio demixing and multi-stream control in complex video generation tasks.
Compared with text, audio naturally carries a continuous temporal structure and rich dynamic information, which can provide more precise temporal control for video generation. As video generation models have advanced, audio-driven video generation has therefore become an important research direction in multi-modal generation. Existing work covers scenarios such as speaker animation, music-driven video, and audio-visual synchronized generation, yet achieving stable and accurate audio-visual alignment in complex video content remains difficult.
The main limitation of existing methods lies in how they model the audio signal. Most models feed the input audio into the generation process as a single, holistic condition, without distinguishing the visual roles of different audio components such as speech, sound effects, and music. This simplifies modeling to some extent, but it also blurs the audio-visual correspondence, making it difficult to simultaneously satisfy lip synchronization, event timing alignment, and control of the overall visual atmosphere.
To address this issue, the team from BAAI, Peking University, and Beijing University of Posts and Telecommunications proposes a video generation framework for audio-visual synchronization based on audio demixing. It splits the input audio into three tracks, speech, sound effects, and music, and uses each to drive a different level of the visual generation process. Through a multi-stream temporal control network, together with a supporting dataset and training strategy, the framework establishes a clearer audio-visual correspondence at both the time-interval and global levels. Experiments show stable improvements in video quality, audio-visual alignment, and lip synchronization, verifying the effectiveness of audio demixing and multi-stream control in complex video generation tasks.
The research, titled "Audio-Sync Video Generation with Multi-Stream Temporal Control," has been accepted at NeurIPS 2025.
Paper link: https://arxiv.org/abs/2506.08003
Research highlights:
* Constructs DEMIX, an audio-synchronized video generation dataset composed of five overlapping subsets, and proposes a multi-stage training strategy for learning audio-visual relationships.
* Proposes the MTV framework, which splits the audio into three tracks (speech, sound effects, and music) to separately control visual elements such as lip movement, event timing, and the overall visual atmosphere, achieving clearer semantic control.
* Designs a multi-stream temporal control network (MST-ControlNet) that handles both fine-grained synchronization over local time intervals and global style adjustment within a single generation framework, structurally supporting differentiated control by different audio components across time scales.
Versatile generation capabilities
MTV demonstrates versatile generation capabilities, covering character-centered narratives, multi-character interactions, sound-triggered events, music-driven atmosphere, and camera movement.
The DEMIX dataset introduces demixed track annotations to support multi-stage training
This paper first constructs the DEMIX dataset through a detailed filtering pipeline. The filtered data is organized into five overlapping subsets: basic face, single person, multiple people, event sound effects, and environmental atmosphere. On top of these subsets, the paper introduces a multi-stage training strategy that gradually expands the model's capabilities. The model is first trained on the basic face subset to learn lip movement; it then learns human pose, scene appearance, and camera movement on the single person subset; next, it is trained on the multiple people subset to handle scenes with several speakers; the focus then shifts to event timing, with the event sound effects subset extending subject understanding from humans to objects; finally, training on the environmental atmosphere subset improves the model's representation of the visual atmosphere.
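To make the curriculum concrete, here is a minimal sketch of the staged schedule in Python. The subset names mirror the five DEMIX subsets described above, while `load_subset`, `train_one_stage`, and the step budget are hypothetical placeholders for illustration, not the paper's actual training code.

```python
# Minimal sketch of the staged training schedule described above.
# `load_subset` and `train_one_stage` are hypothetical helpers standing in for
# the paper's data pipeline and trainer; the step budget is an assumption.

STAGES = [
    # (subset name, what the stage is meant to teach the model)
    ("basic_face",    "lip movement driven by the speech track"),
    ("single_person", "human pose, scene appearance, and camera movement"),
    ("multi_person",  "multiple speakers in one scene"),
    ("event_sfx",     "event timing for non-human subjects"),
    ("ambient",       "visual atmosphere driven by the music track"),
]

def train_curriculum(model, load_subset, train_one_stage, steps_per_stage=50_000):
    """Run the stages in order, reusing the weights from the previous stage."""
    for subset_name, goal in STAGES:
        print(f"Stage '{subset_name}': learning {goal}")
        dataset = load_subset(subset_name)          # one of the five DEMIX subsets
        model = train_one_stage(model, dataset, steps=steps_per_stage)
    return model
```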
Multi-stream temporal control for accurate audio-visual mapping and precise temporal alignment
This paper explicitly divides the audio into three control tracks: speech, sound effects, and music. These separate tracks allow the MTV framework to precisely control lip movement, event timing, and the visual atmosphere, resolving the fuzzy audio-visual mapping. To make the framework compatible with a variety of tasks, the paper constructs text descriptions from a fixed template. The template opens with a sentence stating the number of participants, such as "Two person conversation."; it then lists each person, starting with a unique identifier (Person1, Person2) followed by a brief description of their appearance; after the participants are listed, the template explicitly names the person currently speaking; finally, a closing sentence describes the overall scene. To achieve precise temporal alignment, the paper proposes a multi-stream temporal control network that drives lip movement, event timing, and the visual atmosphere through the clearly separated speech, sound effect, and music tracks.
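The prompt template lends itself to a small illustration. The sketch below assembles a description following the structure just described; the helper name, argument format, and the single-speaker fallback sentence are illustrative assumptions, not code from the paper.

```python
# Sketch of the text-prompt template described above. The helper and its
# arguments are hypothetical; only the sentence structure follows the paper.

def build_prompt(people, speaker_id, scene):
    """people: list of (identifier, appearance) pairs, e.g. [("Person1", "a man in a grey coat")]."""
    lines = []
    # 1) Sentence stating the number of participants.
    count_word = {1: "One", 2: "Two", 3: "Three"}.get(len(people), str(len(people)))
    lines.append(f"{count_word} person conversation." if len(people) > 1
                 else "One person speaking.")        # single-speaker wording is an assumption
    # 2) Each participant: unique identifier plus a brief appearance description.
    for pid, appearance in people:
        lines.append(f"{pid}: {appearance}.")
    # 3) Specify who is currently speaking.
    lines.append(f"{speaker_id} is speaking.")
    # 4) Overall scene description.
    lines.append(scene)
    return " ".join(lines)

prompt = build_prompt(
    people=[("Person1", "a man in a grey coat"), ("Person2", "a woman with short hair")],
    speaker_id="Person1",
    scene="They sit across a wooden table in a dimly lit cafe.",
)
```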
Interval feature injection
For the speech and sound effect features, the paper designs an interval stream to precisely control lip movement and event timing. The features of each track are processed by an interval interaction module, in which self-attention models the interaction between speech and sound effects; the interacted speech and sound effect features are then injected into their corresponding time intervals via cross-attention. The paper refers to this as the interval feature injection mechanism.
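A minimal PyTorch sketch of this interval stream is given below. The token shapes, hidden size, and layer choices are assumptions for illustration and do not reproduce the exact MST-ControlNet architecture.

```python
import torch
import torch.nn as nn

class IntervalInjection(nn.Module):
    """Sketch of the interval stream: speech and effect tokens interact via
    self-attention, then are injected into the video latents of their own time
    interval via cross-attention. Dimensions and layers are assumptions."""

    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.interact = nn.MultiheadAttention(dim, heads, batch_first=True)  # speech <-> effects
        self.inject = nn.MultiheadAttention(dim, heads, batch_first=True)    # video <- audio
        self.norm_a = nn.LayerNorm(dim)
        self.norm_v = nn.LayerNorm(dim)

    def forward(self, video_tokens, speech_tokens, effect_tokens):
        # video_tokens:           (B, n_intervals, video_tokens_per_interval, D)
        # speech / effect tokens: (B, n_intervals, audio_tokens_per_interval, D)
        B, N, Tv, D = video_tokens.shape
        audio = torch.cat([speech_tokens, effect_tokens], dim=2)   # join the two tracks
        audio = audio.flatten(0, 1)                                # (B*N, Ta, D)
        a = self.norm_a(audio)
        audio = audio + self.interact(a, a, a)[0]                  # track interaction (self-attention)

        video = video_tokens.flatten(0, 1)                         # (B*N, Tv, D)
        video = video + self.inject(self.norm_v(video), audio, audio)[0]  # per-interval injection
        return video.view(B, N, Tv, D)
```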
Overall feature injection
For the music features, the paper designs an overall stream that controls the visual atmosphere of the entire clip. Since the music reflects the clip's overall aesthetics, an overall context encoder first extracts atmosphere-related features from the music, and average pooling condenses them into a single clip-level global feature. This global feature is then used as an embedding to modulate the video latents through AdaLN. The paper refers to this as the overall feature injection mechanism.
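The overall stream can likewise be sketched in a few lines of PyTorch. The encoder structure and dimensions below are assumptions; only the pooling-then-AdaLN pattern follows the description above.

```python
import torch.nn as nn

class OverallInjection(nn.Module):
    """Sketch of the overall stream: music features are pooled into one clip-level
    vector, which modulates the video latents through AdaLN (scale and shift
    predicted from the global feature). Layer sizes are assumptions."""

    def __init__(self, dim=512):
        super().__init__()
        self.context_encoder = nn.Sequential(   # stand-in for the overall context encoder
            nn.Linear(dim, dim), nn.SiLU(), nn.Linear(dim, dim)
        )
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        self.to_scale_shift = nn.Linear(dim, 2 * dim)

    def forward(self, video_tokens, music_tokens):
        # video_tokens: (B, T, D)   music_tokens: (B, Tm, D)
        ctx = self.context_encoder(music_tokens)        # per-frame music context
        g = ctx.mean(dim=1)                             # average pooling -> clip-level feature
        scale, shift = self.to_scale_shift(g).chunk(2, dim=-1)
        # AdaLN: normalize, then modulate with the music-derived scale and shift
        return self.norm(video_tokens) * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)
```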
Precisely generate movie-level audio-synchronized videos
Comprehensive evaluation metrics
To verify the effectiveness of the multi-stage training strategy at each learning stage, the experiments adopt a comprehensive set of metrics covering video quality, temporal consistency, and multi-modal alignment, systematically evaluating the model's stability and consistency as increasingly complex control signals are introduced, and compare MTV against three state-of-the-art methods.
For generation quality and temporal stability, the study uses FVD to measure the distributional gap between generated and real videos, and Temp-C to evaluate temporal continuity between adjacent frames. MTV significantly outperforms existing methods on FVD, showing that introducing more complex audio control does not sacrifice overall generation quality, while high temporal stability is maintained on Temp-C.
For multi-modal alignment, the study measures video-text and video-audio consistency with Text-C and Audio-C, respectively. MTV achieves a marked improvement on Audio-C, far ahead of the comparison methods, reflecting how effectively audio demixing and multi-stream control strengthen the audio-visual correspondence.
For the speech-driven setting, the paper further reports two lip-synchronization metrics, Sync-C and Sync-D, which measure synchronization confidence and synchronization error, respectively; here too MTV achieves the best performance.
Comparison results
As shown in the comparison figure, the researchers compared the MTV framework with current SOTA methods. Visually, existing methods are often unstable when handling complex text descriptions or movie-level scenes.
For example, even after fine-tuning MM-Diffusion with its official code for more than 320,000 steps on 8 NVIDIA A100 GPUs, it still struggles to produce frames with a consistent narrative structure and visual coherence, and the results tend to look like a patchwork of local fragments. TempoTokens is prone to unnatural facial expressions and character movements in complex scenes, and the realism of its outputs degrades noticeably in multi-person or highly dynamic settings. In terms of audio-visual synchronization, the method of Xing et al. struggles to synchronize to specific event timings, producing incorrect hand gestures in the guitar-playing example (right side of the figure).
In contrast, the MTV framework maintains high visual quality and stable audio-visual synchronization across these scenarios, precisely generating audio-synchronized videos of movie-level quality.
Reference link: https://arxiv.org/abs/2506.08003
This article is from the WeChat official account "HyperAI Super Neural", author: Zihan. It is published by 36Kr with authorization.