Audio-visual speech separation runs 6× faster than SOTA: Tsinghua University releases the first high-performance 6M-parameter model
[Introduction] The Dolphin model, developed by a team from Tsinghua University, breaks through the bottleneck of "high performance necessarily means high energy consumption": using only 6M parameters (less than half of mainstream models), together with discretized visual encoding and a physics-inspired heat-diffusion attention mechanism, it accurately separates speech in a single inference pass while running more than six times faster. It sets new records on multiple benchmarks, paving the way for deploying high-quality speech separation on edge devices such as smart hearing aids and mobile phones.
Audio-Visual Speech Separation (AVSS) aims to replicate the human "cocktail party effect": accurately extracting the target speaker's voice from background noise or overlapping speech of multiple people by exploiting visual cues from the speaker's face (such as lip movements). The technology has significant application value in smart hearing aids, mobile communications, augmented reality, and human-computer interaction.
However, the field has long faced a performance-efficiency dilemma: high-performance models typically rely on large numbers of pre-trained parameters and heavy computation, making them hard to deploy on resource-constrained edge devices, while lightweight models usually sacrifice separation accuracy and often depend on high-latency iterative computation.
In response to this pain point, the team led by Associate Professor Hu Xiaolin of the Department of Computer Science at Tsinghua University proposed a new, efficient AVSS model named Dolphin.
This model introduces discretized visual semantic representation and a global-local attention mechanism based on physical priors. While significantly reducing computational complexity, it has set new performance records on multiple benchmark datasets.
Dolphin is not only the first AVSS model to compress its parameter count to the 6M level (visual encoder included) while maintaining high separation quality, but it also runs more than six times faster than existing SOTA models in GPU inference.
Paper link: https://arxiv.org/pdf/2509.23610
Paper homepage: https://cslikai.cn/Dolphin/
Code link: https://github.com/JusperLee/Dolphin
The existing mainstream AVSS methods mainly face three major challenges:
- The "path dependence" problem of the visual encoder. To extract semantic features highly aligned with speech, existing methods usually directly use large video encoders pre-trained on lip-reading tasks, which results in a huge computational load on the visual branch, even exceeding that of audio processing itself; while simple lightweight alternatives can only extract shallow pixel-level features, leading to the loss of semantic information and a significant reduction in separation performance.
- High latency from iterative inference. To squeeze performance out of a limited parameter budget, lightweight models (such as RTFS-Net) often adopt an iterative strategy, passing the signal through the separator multiple times to progressively refine the result. This reduces parameter count but significantly increases inference time and latency, falling short of real-time interaction requirements.
- Limited feature modeling. Traditional models struggle to capture both long-range global context and short-term local fine structure in a single forward pass, producing artifacts or losing detail in complex acoustic environments.
Figure 1. Overall pipeline architecture diagram of the Dolphin model
In response to the above problems, Dolphin proposes a complete set of solutions, and its core architecture includes the following three key innovations:
DP-LipCoder: A Dual-Path Discrete Visual Encoder Based on Vector Quantization
To obtain high-quality visual semantics under the premise of lightweight design, the team designed a dual-path discrete visual encoder called DP-LipCoder based on vector quantization (as shown in Figure 2).
Figure 2. Network structure of DP-LipCoder
This is a dual-path architecture, including a "reconstruction path" and a "semantic path". The reconstruction path is responsible for capturing basic visual cues such as the speaker's identity and facial expressions. The semantic path introduces Vector Quantization (VQ) technology.
It maps continuous video frames into discrete token sequences and distills knowledge from the pre-trained AV-HuBERT model, forcing the encoder to learn deep semantic information highly aligned with the audio. This discretized design lets Dolphin extract discriminative, noise-robust visual features at very low computational cost, resolving the conflict between a lightweight visual encoder and rich semantic encoding.
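To make the discretization step concrete, here is a minimal vector-quantization sketch: each continuous per-frame feature is snapped to its nearest entry in a learned codebook, yielding a discrete token sequence. This is illustrative only; the codebook size, feature dimension, and function names are assumptions, not the actual DP-LipCoder implementation.

```python
import numpy as np

def vector_quantize(features, codebook):
    """Map each continuous feature vector to its nearest codebook entry.

    features: (T, D) array of per-frame visual features
    codebook: (K, D) array of learned code vectors
    Returns the discrete token indices and the quantized features.
    """
    # Pairwise squared distances between every frame and every codebook entry
    dists = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    tokens = dists.argmin(axis=1)   # (T,) discrete token sequence
    quantized = codebook[tokens]    # (T, D) discretized features
    return tokens, quantized

rng = np.random.default_rng(0)
feats = rng.standard_normal((5, 8))    # 5 video frames, 8-dim features
book = rng.standard_normal((16, 8))    # codebook with 16 entries
tokens, quant = vector_quantize(feats, book)
print(tokens.shape, quant.shape)       # (5,) (5, 8)
```

In training, the token targets would be distilled from AV-HuBERT so the discrete codes carry speech-aligned semantics rather than raw pixel statistics.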
GLA Module: Global-Local Collaborative Modeling in a Single Iteration
Dolphin abandons the time-consuming multi-round iterative mechanism, uses a single-round encoder-decoder architecture, and designs an efficient Global-Local Attention (GLA) module (as shown in Figure 3) to ensure that the model can complete high-quality separation in a single forward propagation. The core components of the GLA module are introduced as follows:
Global Attention (GA): a coarse-grained self-attention mechanism that captures global context spanning several seconds at low resolution, significantly reducing computational complexity.
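The cost saving from coarse-grained attention can be sketched as follows: pool the sequence, run self-attention at the reduced resolution, then upsample back, so the quadratic attention cost drops from O(T²) to O((T/p)²). The pooling factor and single-head form are illustrative assumptions, not the paper's exact GA layer.

```python
import numpy as np

def coarse_global_attention(x, pool=4):
    """Self-attention computed at 1/pool resolution, then upsampled.

    x: (T, D) feature sequence; T must be divisible by pool.
    """
    T, D = x.shape
    coarse = x.reshape(T // pool, pool, D).mean(axis=1)    # (T/p, D) pooled
    scores = coarse @ coarse.T / np.sqrt(D)                # (T/p, T/p)
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)          # row-wise softmax
    attended = weights @ coarse                            # (T/p, D)
    return np.repeat(attended, pool, axis=0)               # back to (T, D)

x = np.random.default_rng(0).standard_normal((16, 8))
y = coarse_global_attention(x, pool=4)
print(y.shape)  # (16, 8)
```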
Local Attention (LA): another highlight of the model. The team introduced Heat Diffusion Attention (HDA), inspired by the physical heat-diffusion equation. Exploiting the smoothing property of the diffusion process, HDA adaptively performs multi-scale filtering on features, preserving transient speech details while suppressing noise.
Figure 3. Schematic diagram of the GLA module structure
Direct Feature Regression Mechanism
Unlike the masking strategy adopted by mainstream methods, Dolphin uses direct mapping. Traditional masking predicts a mask valued between 0 and 1 and multiplies it with the mixed speech, which easily introduces nonlinear distortion. Dolphin instead directly regresses the deep representation of the target speech. Experiments show this strategy improves signal fidelity, yielding an extra improvement of about 0.5 dB on the SI-SNRi metric.
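A toy example shows one structural limitation of [0, 1] masking that direct regression avoids: when interference partially cancels the target in a time-frequency bin, the mixture's magnitude falls below the target's, and no mask in [0, 1] can recover it. The specific complex values below are made up for illustration, and the "oracle regressor" stands in for a trained network's output head.

```python
import numpy as np

# Toy complex spectrogram bins where interference partially cancels the
# target, so |mixture| < |target| in the first bin.
target = np.array([1.0 + 0.0j, 0.8 + 0.2j, 0.5 - 0.5j])
interference = np.array([-0.6 + 0.0j, 0.1j, 0.2 + 0.1j])
mixture = target + interference

# Masking: a real-valued mask in [0, 1] times the mixture can never
# produce a bin whose magnitude exceeds the mixture's, so the first bin
# (|mix| = 0.4 < |target| = 1.0) stays wrong even with the best mask.
best_mask = np.clip(np.abs(target) / (np.abs(mixture) + 1e-8), 0.0, 1.0)
masked = best_mask * mixture
mask_err = np.abs(masked - target)

# Direct regression: the model predicts the target representation itself,
# with no rescaling constraint; an oracle regressor has zero error here.
direct = target.copy()
direct_err = np.abs(direct - target)

print(mask_err.round(3), direct_err.round(3))
```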
Primary Experimental Results and Performance Breakthroughs
On three authoritative audio-visual separation benchmark datasets, LRS2, LRS3, and VoxCeleb2, Dolphin has demonstrated dominant separation quality and performance advantages:
- Comprehensive lead in separation quality: On the LRS2 dataset, Dolphin's Scale-Invariant Signal-to-Noise Ratio Improvement (SI-SNRi) reached 16.8 dB, significantly outperforming the current SOTA models IIANet (16.0 dB) and AV-Mossformer2 (15.1 dB).
- Extremely high efficiency: Counting the visual encoder, Dolphin's total parameter count is only 6.22M, versus 15.01M for IIANet, a reduction of more than 50%. In GPU inference latency tests, Dolphin needs only 33.24 milliseconds to process 1 second of audio, more than four times faster than IIANet and nearly 50% faster than the lightweight RTFS-Net. Its computational load is only 10.89 G MACs, more than 50% lower than models such as IIANet and RTFS-Net.
- High robustness and a better listening experience: In "in-the-wild" scenarios such as 3-4 overlapping speakers, strong background-music interference, and real-world debate videos, Dolphin remained robust. In subjective listening tests (MOS), Dolphin scored 3.86, far above the comparison model's 2.24, indicating that the separated speech is clearer, more natural, and free of audible artifacts.
Summary
As large-model technology advances, the audio-visual speech separation field has likewise pursued ever-larger models to improve separation quality, an approach that is infeasible on end devices. Dolphin breaks the field's long-held assumption of "trading parameter count for performance".
By introducing discretized semantic representation and a physics-inspired heat-diffusion attention mechanism, Dolphin demonstrates that lightweight models are fully capable of outperforming large ones. This work provides a new technical path and theoretical support for deploying high-accuracy speech separation in resource-constrained scenarios such as smart glasses, on-device models for mobile phones, and real-time conferencing systems.
Reference: https://arxiv.org/pdf/2509.23610
This article is from the WeChat official account "New Intelligence Yuan". Editor: LRST. Republished by 36Kr with authorization.