
Direct output with no post-editing: AI generates multi-character co-framed dialogue videos, and dynamic routing precisely binds audio to each character.

新智元 · 2025-07-17 16:30
Bind-Your-Avatar enables multi-character audio-visual synchronization and introduces the MTCC dataset.

[Introduction] Bind-Your-Avatar is a framework built on the diffusion Transformer (MM-DiT). It binds voices to characters through fine-grained embedding routing, achieving precise audio-visual synchronization and supporting dynamic background generation. The framework also introduces MTCC, the first dataset and benchmark for multi-character dialogue video generation. Experiments show that it outperforms existing methods in identity fidelity and audio-visual synchronization.

In recent years, with the emergence of foundation models for video generation, significant progress has also been made in audio-driven talking-character video generation.

However, existing methods mainly focus on single-character scenarios; those that can produce two-character dialogue can only generate two separate speaker videos independently rather than a single shared scene.

In response to this challenge, researchers have proposed Bind-Your-Avatar, the first framework dedicated to multi-character talking-video generation within the same scene.

This model is based on the diffusion Transformer (MM-DiT). Through a fine-grained embedding routing mechanism, it binds "who is speaking" with "what is being said", thus achieving precise control over the audio-character correspondence.

Paper link: https://arxiv.org/abs/2506.19833

Project link: https://yubo-shankui.github.io/bind-your-avatar

The authors also constructed the first complete dataset (MTCC) and evaluation benchmark for multi-character dialogue video generation, providing an end-to-end data processing pipeline.

Extensive experiments show that Bind-Your-Avatar produces excellent results in multi-character scenarios and significantly outperforms existing baseline methods on metrics such as face identity fidelity and audio-visual synchronization.

Bind-Your-Avatar

Method Overview

Bind-Your-Avatar is built on a multi-modal text-to-video diffusion Transformer (MM-DiT). The model inputs include text prompts, multi-stream speech audio, face reference images of multiple characters, and (optionally) an inpainting frame that specifies the background.

Text, audio, and face identity features are extracted by feature encoders. Cross-attention guided by the Embedding Routing selectively injects face and audio information into visual tokens, thereby binding each audio stream to the correct character and achieving audio-visual synchronization.

Model training proceeds in three stages. In the first stage, only silent character-motion videos with inpainting frames are generated (no audio is used). In the second stage, single-character speech input is added so the model learns fine-grained audio-driven character motion (via lightweight LoRA fine-tuning). In the third stage, multi-character speech input is introduced and the Embedding Routing is trained jointly (using teacher forcing to prevent mask degradation).
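As a rough illustration of this staged curriculum, the sketch below expresses it as a configuration list; the stage names, field names, and flag values are assumptions for illustration, not the authors' released training code.

```python
# Hypothetical sketch of the three-stage training curriculum described above.
# Stage names, fields, and flag values are illustrative assumptions, not released code.
from dataclasses import dataclass

@dataclass
class StageConfig:
    name: str
    use_audio: bool               # whether speech audio conditions the model
    num_speakers: int             # 0 = silent, motion-only stage
    train_embedding_router: bool  # jointly train the Embedding Routing
    use_lora: bool                # lightweight LoRA fine-tuning instead of full updates
    teacher_forcing_masks: bool   # supervise routing with ground-truth masks

STAGES = [
    StageConfig("silent_motion",  use_audio=False, num_speakers=0,
                train_embedding_router=False, use_lora=False, teacher_forcing_masks=False),
    StageConfig("single_speaker", use_audio=True,  num_speakers=1,
                train_embedding_router=False, use_lora=True,  teacher_forcing_masks=False),
    StageConfig("multi_speaker",  use_audio=True,  num_speakers=2,
                train_embedding_router=True,  use_lora=True,  teacher_forcing_masks=True),
]

for stage in STAGES:
    print(f"{stage.name}: audio={stage.use_audio}, router={stage.train_embedding_router}")
```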

Fine-Grained Embedding Routing-Guided Audio-Character Driving

The output of the Embedding Routing is a spatio-temporal mask matrix M, which is used to indicate which character (or background) each visual token corresponds to, thus binding the speaker to the specific voice.
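To make the role of the mask M concrete, the sketch below shows one way a routing mask could gate cross-attention between visual tokens and per-character condition tokens; the tensor shapes, the soft per-token weighting, and all names are assumptions for illustration rather than the actual MM-DiT implementation.

```python
# Illustrative sketch (not the authors' implementation): a routing mask M gates
# which character's audio/face embeddings each visual token attends to.
import torch
import torch.nn.functional as F

def routed_cross_attention(visual_tokens, cond_tokens, routing_mask):
    """
    visual_tokens: (B, N_vis, D)     visual (video latent) tokens
    cond_tokens:   (B, C, N_cond, D) per-character condition tokens (audio/face)
    routing_mask:  (B, N_vis, C)     soft assignment of each visual token to a character
    """
    B, N_vis, D = visual_tokens.shape
    _, C, N_cond, _ = cond_tokens.shape

    out = torch.zeros_like(visual_tokens)
    for c in range(C):
        q = visual_tokens                      # queries from visual tokens
        k = v = cond_tokens[:, c]              # keys/values from character c's conditions
        attn = F.softmax(q @ k.transpose(-1, -2) / D ** 0.5, dim=-1)  # (B, N_vis, N_cond)
        # Each character's contribution is gated by its routing weight per visual token.
        out = out + routing_mask[:, :, c : c + 1] * (attn @ v)
    return out

# Example: 2 characters, 16 visual tokens, 4 condition tokens each, dim 8
vis = torch.randn(1, 16, 8)
cond = torch.randn(1, 2, 4, 8)
mask = torch.softmax(torch.randn(1, 16, 2), dim=-1)
print(routed_cross_attention(vis, cond, mask).shape)  # torch.Size([1, 16, 8])
```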

During training, the researchers designed a cross-entropy loss to supervise the routing output. Combined with geometric priors, a spatio-temporal consistency loss and a layer consistency loss are introduced to enhance the accuracy and smoothness of the mask.
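A minimal sketch of what such routing supervision might look like is given below; the exact loss formulations and weights are assumptions for illustration, not the paper's definitions.

```python
# Hedged sketch of routing supervision: cross-entropy against pseudo-ground-truth
# masks plus simple temporal- and layer-consistency penalties. Loss weights and
# formulations are assumptions, not the paper's definitions.
import torch
import torch.nn.functional as F

def routing_losses(pred_logits, gt_labels, lambda_t=0.1, lambda_l=0.1):
    """
    pred_logits: (L, B, T, N, C) routing logits from L transformer layers,
                 T frames, N spatial tokens, C classes (characters + background)
    gt_labels:   (B, T, N)       per-token class index (long), e.g. from SAM2 masks
    """
    L, B, T, N, C = pred_logits.shape

    # Cross-entropy supervision of the routing output (averaged over layers).
    ce = F.cross_entropy(pred_logits.reshape(L * B * T * N, C),
                         gt_labels.expand(L, -1, -1, -1).reshape(-1))

    probs = pred_logits.softmax(dim=-1)
    # Temporal consistency: neighbouring frames should produce similar masks.
    temporal = (probs[:, :, 1:] - probs[:, :, :-1]).abs().mean()
    # Layer consistency: different layers should agree on the routing.
    layer = (probs - probs.mean(dim=0, keepdim=True)).abs().mean()

    return ce + lambda_t * temporal + lambda_l * layer

logits = torch.randn(2, 1, 4, 16, 3)          # 2 layers, 1 clip, 4 frames, 16 tokens, 3 classes
labels = torch.randint(0, 3, (1, 4, 16))
print(routing_losses(logits, labels))
```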

The paper discusses three routing implementation methods: Pre-Denoise (using a static 2D mask), Post-Denoise (predicting a 3D mask after two-stage generation), and Intra-Denoise Routing.

Intra-Denoise Routing dynamically generates fine-grained 3D spatio-temporal masks during the diffusion denoising process, achieving independent frame-level control of each character. This design not only improves the correspondence between each audio stream and its character's mouth movements but also maintains the coherence of character identity.

To obtain high-quality 3D masks, the researchers propose two effective techniques in the routing design. The Mask Optimization Strategy regularizes the mask by introducing geometric priors, improving the accuracy and temporal consistency of the segmentation between characters and the background. In addition, a mask refinement process smooths the initially predicted sparse masks and corrects their temporal consistency, further enhancing mask quality.
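One way such a refinement step could be realized is a per-pixel majority vote over a temporal window, as in the sketch below; this is an assumption for illustration, not the paper's exact procedure.

```python
# Illustrative sketch of mask refinement: temporally smoothing a noisy per-frame
# routing mask with a sliding-window majority vote. An assumed procedure, not the
# paper's exact refinement step.
import numpy as np

def refine_masks(masks, window=5):
    """
    masks: (T, H, W) integer array, each pixel labelled with a character id
           (or background) per frame; may be sparse and noisy.
    Returns a temporally smoothed mask of the same shape.
    """
    T, H, W = masks.shape
    refined = np.empty_like(masks)
    half = window // 2
    num_classes = int(masks.max()) + 1
    for t in range(T):
        lo, hi = max(0, t - half), min(T, t + half + 1)
        win = masks[lo:hi]                                   # (w, H, W)
        # Per-pixel majority vote over the temporal window.
        counts = np.stack([(win == c).sum(axis=0) for c in range(num_classes)])
        refined[t] = counts.argmax(axis=0)
    return refined

noisy = np.random.randint(0, 3, size=(12, 8, 8))
print(refine_masks(noisy).shape)  # (12, 8, 8)
```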

MTCC Dataset

To support multi-character video generation, the researchers constructed the MTCC dataset (Multi-Talking-Characters-Conversations), which contains more than 200 hours of multi-character dialogue videos.

The data processing pipeline includes: video cleaning (screening by resolution, duration, and frame rate; ensuring that exactly two clear characters appear in the video; filtering based on pose difference, etc.); audio separation and synchronization screening (using AV-MossFormer and the Sync-C metric to ensure audio-visual consistency); speech and text annotation (using Wav2Vec to extract audio features and Qwen2-VL to generate descriptions); and character-region mask generation with SAM2 to provide supervision signals.

MTCC comes with complete open-source processing code, providing the community with an end-to-end pipeline from raw videos to training data.
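As a rough sketch of the kind of cleaning rules such a pipeline applies, the example below filters clips by basic quality criteria; the thresholds and field names are placeholders, not values from the released MTCC code.

```python
# Hedged sketch of pipeline-style clip filtering; thresholds and fields are
# placeholders, not the released MTCC processing code.
from dataclasses import dataclass

@dataclass
class ClipMeta:
    width: int
    height: int
    duration_s: float
    fps: float
    num_clear_faces: int   # e.g. from a face detector
    sync_c: float          # audio-visual sync confidence (Sync-C style score)

def keep_clip(meta: ClipMeta,
              min_res=512, min_duration=3.0, min_fps=20, min_sync_c=3.0) -> bool:
    """Return True if the clip passes the basic cleaning rules described above."""
    if min(meta.width, meta.height) < min_res:
        return False
    if meta.duration_s < min_duration or meta.fps < min_fps:
        return False
    if meta.num_clear_faces != 2:        # exactly two clear characters in frame
        return False
    if meta.sync_c < min_sync_c:         # audio-visual consistency screening
        return False
    return True

print(keep_clip(ClipMeta(1280, 720, 8.0, 25, num_clear_faces=2, sync_c=4.1)))  # True
```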

Experiments and Analysis

Quantitative Analysis

The researchers compared Bind-Your-Avatar with various baseline methods on the MTCC test set and a new benchmark (Bind-Your-Avatar-Benchmark, containing 40 sets of dual-character faces and dual-stream audio), including recent methods such as Sonic, Hallo3, and Ingredients. These baselines were originally designed for single-character or background-free scenarios and were adapted for this task.

The quantitative metrics cover character identity preservation (face similarity), audio-visual synchronization (Sync-C, Sync-D), and visual quality (FID, FVD).
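For reference, a common way to score identity preservation is the mean cosine similarity between face embeddings of generated frames and the reference image, as sketched below; the embedding model is left abstract, and the paper's exact metric implementation may differ.

```python
# Minimal sketch of a face-identity preservation score: mean cosine similarity
# between per-frame face embeddings and the reference embedding. The embedding
# model is left abstract; the paper's exact metric may differ.
import numpy as np

def face_similarity(gen_embeddings: np.ndarray, ref_embedding: np.ndarray) -> float:
    """
    gen_embeddings: (T, D) per-frame face embeddings of one generated character
    ref_embedding:  (D,)   embedding of that character's reference image
    Returns the mean cosine similarity across frames.
    """
    gen = gen_embeddings / np.linalg.norm(gen_embeddings, axis=1, keepdims=True)
    ref = ref_embedding / np.linalg.norm(ref_embedding)
    return float((gen @ ref).mean())

print(face_similarity(np.random.randn(16, 512), np.random.randn(512)))
```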

The results show that Bind-Your-Avatar significantly outperforms every baseline on face similarity and audio-visual synchronization (it is especially strong on the synchronization metrics) while remaining competitive on visual-quality metrics such as FID and FVD.

Ablation experiments further verify that fine-grained 3D masks can better handle character movements and close interactions than bounding boxes or static 2D masks, improving the generation quality in dynamic scenarios.

Qualitative Analysis

Bind-Your-Avatar can naturally handle cross-talking scenarios of multiple characters and simultaneously generate a unified and dynamic background without post-processing stitching.

For example, Bind-Your-Avatar can generate dialogue videos where two characters are speaking different contents simultaneously, while keeping the mouth movements of each character highly synchronized with the corresponding voice, and the facial expressions are realistic.

Conclusion

Bind-Your-Avatar proposes the task of multi-character speech-driven video generation in the same scene for the first time and provides a complete solution from algorithms to datasets.

Its main contributions include: the fine-grained Embedding Routing mechanism (achieving precise binding of "who is saying what"), the dynamic 3D-mask Routing design (controlling each character frame by frame), and the MTCC dataset and the corresponding multi-character generation benchmark.

Future work will focus on enhancing the realism of character actions (such as body and gesture movements) and optimizing the real-time performance of the model to meet the needs of large-scale and online multi-character video generation.

The researchers will open-source the dataset and code later to facilitate further research in the community.

Reference materials:

https://arxiv.org/abs/2506.19833 

This article is from the WeChat official account "新智元" (New Intelligence Yuan), edited by LRST, and published by 36Kr with authorization.