The SALMONN series of large audio-visual understanding models returns to the top of the leaderboards, with across-the-board breakthroughs in reasoning enhancement, high frame rate, and text-leakage-free evaluation.
Building on SALMONN, the first general audio understanding model (ICLR 2024), and video-SALMONN, the first all-element audio-visual understanding model (ICML 2024), the SALMONN family has expanded significantly once again and now dominates a range of audio-visual understanding leaderboards!
The new-generation video-SALMONN 2/2+, the first open-source reasoning-enhanced large audio-visual understanding model video-SALMONN-o1 (ICML 2025), the first high-frame-rate large video understanding model F-16 (ICML 2025), and the text-leakage-free benchmark AVUT (EMNLP 2025) are officially released. The new lineup achieves breakthroughs across both video understanding capabilities and evaluation methodology, comprehensively consolidating the SALMONN family's leading position among open-source large audio-visual understanding models.
(Image generated by Nano-Banana.)
The new flagship arrives: video-SALMONN 2+
video-SALMONN 2+ is the first large audio-visual language model focused on high-quality, complete video description. Through an atomic-event-level evaluation system and MrDPO multi-round reinforcement learning optimization, it significantly reduces information omission and hallucination, achieving state-of-the-art (SOTA) results on caption completeness and accuracy metrics.
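The atomic-event-level evaluation idea can be sketched as follows: a reference caption is decomposed into atomic events, and a candidate description is scored for completeness (reference events it covers) and accuracy (its own events that are actually supported, i.e. not hallucinated). The function names and the exact set-matching are illustrative assumptions, not the paper's implementation; real systems typically use an LLM judge to decide event equivalence.

```python
def score_caption(candidate_events, reference_events):
    """Score a candidate description against reference atomic events.

    completeness: fraction of reference events the candidate covers
                  (1 - completeness ~ information omission).
    accuracy:     fraction of candidate events supported by the reference
                  (1 - accuracy ~ hallucination rate).
    Exact set matching stands in for an LLM judge here.
    """
    cand, ref = set(candidate_events), set(reference_events)
    completeness = len(cand & ref) / len(ref) if ref else 1.0
    accuracy = len(cand & ref) / len(cand) if cand else 1.0
    return completeness, accuracy

# Toy example: the reference has 4 atomic events; the candidate recovers
# 3 of them and hallucinates 1 ("dog barks").
reference = ["man enters room", "man greets crowd", "crowd laughs", "music plays"]
candidate = ["man enters room", "man greets crowd", "crowd laughs", "dog barks"]
c, a = score_caption(candidate, reference)
print(c, a)  # 0.75 0.75
```

Optimizing both scores jointly is what penalizes omission and hallucination at the same time, rather than trading one off against the other.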
The research team further found that using this model to generate high-quality descriptions at scale and feeding them back into the training of new models not only preserves strong description capability but also significantly improves performance on audio-visual understanding tasks such as question answering.
As a result, video-SALMONN 2+ leads comprehensively on audio-visual understanding leaderboards such as Video-MME, WorldSense, AVUT, Video-Holmes, and DailyOmni, as well as silent-video understanding leaderboards such as MLVU and LVBench. The overall performance of the 72B version surpasses that of closed-source models such as GPT-4o and Google Gemini 1.5 Pro.
Thanks to its use of the audio modality, on leaderboards such as Video-MME the performance gap between the no-subtitle setting (an end-to-end real-world system) and the with-subtitle setting (ideal subtitles from external manual annotation or commercial speech recognition) is less than 2% for the video-SALMONN 2+ series. This indicates that audio-visual coupling effectively enhances semantic understanding and offers clear advantages in scenarios such as videos without manual subtitles and embodied robots.
The research team first built an audio-visual model based on Qwen 2.5-VL 7B; after the audio-visual training process described above and MrDPO training, this became video-SALMONN 2+ 7B. This model was then used to annotate high-quality descriptions for a larger-scale audio-visual dataset, on which large audio-visual models based on Qwen 2.5-VL 3B/72B were trained, ultimately yielding video-SALMONN 2+ 3B/72B. Experiments show that, compared with baselines trained only on the original annotated data, the new models improve comprehensively in video understanding.
Ablation results on the Video-MME dataset verify the effectiveness of this approach: a model with strong description capability can serve as a generator of high-quality training data, providing a reliable data foundation for stronger audio-visual understanding models and enabling continuous iteration and performance improvement.
Reasoning enhancement, enabling the model to truly understand: video-SALMONN-o1 (ICML 2025)
Question: Why does the performer want to become a citizen of Montenegro? Choices:
A. Because Montenegro is a beautiful country.
B. Because he wants to live in a country whose name matches his skin color.
C. He wants to become a mountain.
D. He thinks the people in Montenegro are very attractive.
E. He thinks the name of the country sounds cool.
video-SALMONN-o1 reasoning:
The comedian says that there is a country in Europe called Montenegro.
Then he repeats the name, emphasizing its uniqueness: "There is a country full of white people called Montenegro."
He expresses his strong desire to become a citizen of it and even jokes about changing his nationality to Montenegrin.
The reason he wants to become a Montenegrin citizen is that it is a beautiful place and the name matches his skin color.
Therefore, the answer is B. He wants to live in a country whose name matches his skin color.
Question: The performer talked about the habits of straight men in photos. Why did the audience laugh? Please choose from the following options:
A. The performer's shirt is funny.
B. The performer reminds the audience of their friends.
C. The audience felt offended and laughed awkwardly.
D. The performer is a woman and finds it fun to make fun of straight men.
E. It's funny just to imagine, and the performer even acted it out.
video-SALMONN-o1 reasoning:
The performer is making fun of straight men, saying that they like to pose with their chins up when taking photos.
Then he imitates this pose himself and exaggerates it for comic effect.
The reason the audience laughed is that they can resonate and find this posing behavior itself very funny.
Therefore, the answer is E. It's funny just to imagine, and the performer even acted it out.
Question: What are the jokes in this video and why are they funny? Please analyze step by step in detail.
video-SALMONN-o1 reasoning:
The punchline is that US Vice-President Joe Biden wrote an autobiography.
The speaker (probably a comedian or an audience member) says that he thinks the book is much better than it actually is.
This is funny because it plays on a common perception that politicians often portray themselves more positively than they really deserve.
The speaker implies that Biden's autobiography is incredibly good, which is a humorous exaggeration of the situation.
Question: How does the twin study isolate the impact of TBI (traumatic brain injury) on the risk of dementia? Please analyze step by step.
video-SALMONN-o1 reasoning:
The question is how the twin study isolates the impact of TBI (traumatic brain injury) on the risk of dementia. We need to clarify what key elements in this study can help distinguish the effect of TBI from other factors affecting dementia.
The speaker mentions that the study uses a "twin-pairing" design, that is, comparing twin individuals. This means the study controls for genetic factors and shared environmental factors.
The speaker emphasizes that the twins in the study are discordant in TBI (i.e., one has TBI and the other does not) and also discordant in the onset of dementia (i.e., one develops dementia and the other does not). This design allows researchers to control early genetic and environmental exposure factors because these are similar between twins.
By comparing twins who differ in TBI experience and dementia outcome, the study can isolate the impact of TBI on the risk of dementia. If TBI is an important risk factor, we would expect that the twin with TBI is more likely to develop dementia, while the twin without TBI is less likely to develop dementia.
In complex audio-visual scenarios such as talk shows and academic lectures, understanding depends not only on joint perception of audio and vision but also on multi-step reasoning built on that perception. video-SALMONN-o1 is the first open-source reasoning-enhanced large audio-visual language model: unlike comparable open-source systems that directly output answers, it takes audio and video inputs simultaneously, performs evidence-based step-by-step reasoning, and then gives its conclusion, keeping results consistent with the reasons behind them. Experimental observations show that text-only reasoning optimization for mathematics/programming and image-text reasoning both generalize poorly to audio-visual scenarios; video-SALMONN-o1 broadens the applicability of reasoning through audio-visual-native process supervision and reasoning optimization, enabling large models to understand the physical world more comprehensively and accurately with reasoning at the core.
Core algorithm: process DPO (pDPO)
Schematic diagram of pDPO rollout and contrastive step selection
Process DPO (pDPO) is the training method proposed for video-SALMONN-o1. Unlike a traditional process reward model (PRM), which also targets step-level optimization, pDPO addresses the weak self-checking ability and the difficulty of obtaining stable step-level rewards in audio-visual scenarios via a preference-optimization path: at each step, two candidate next steps s_k and s'_k are sampled and expanded through reasoning rollouts. The two candidates are compared based on the correctness of their rollout outcomes, and the DPO loss is refined to the step level for optimization.
To address the computational overhead of large numbers of rollouts, pDPO further introduces perturbation-based uncertainty estimation: small perturbations are applied to the current state, and the resulting output changes are used to estimate each step's uncertainty, so that preference comparison and optimization are performed only on high-uncertainty key steps, significantly reducing computational cost.
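The core of the step-level preference update can be sketched as follows: each candidate step's quality is estimated from the fraction of its rollouts that reach a correct final answer, the better candidate is taken as the preferred step, and a DPO-style loss is applied at the step level. This is a minimal illustrative sketch under our own simplifications; the function names, the fixed log-probability values, and the use of a plain sigmoid are assumptions, not the paper's exact formulation.

```python
import math

def dpo_step_loss(logp_w, logp_w_ref, logp_l, logp_l_ref, beta=0.1):
    """Step-level DPO loss: -log sigmoid(beta * (margin_w - margin_l)),
    where each margin is the policy-vs-reference log-probability gap of
    the preferred (w) or dispreferred (l) candidate step."""
    margin = beta * ((logp_w - logp_w_ref) - (logp_l - logp_l_ref))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

def prefer_by_rollout(win_rate_a, win_rate_b):
    """Pick the candidate step whose rollouts reach the correct final
    answer more often (win rates estimated by sampling completions)."""
    return ("a", "b") if win_rate_a >= win_rate_b else ("b", "a")

# Toy example: candidate step a's rollouts succeed 80% of the time,
# b's only 30%, so a is preferred; the loss then pushes the policy to
# raise a's probability relative to b's at this step.
winner, loser = prefer_by_rollout(0.8, 0.3)
loss = dpo_step_loss(logp_w=-1.0, logp_w_ref=-1.2,
                     logp_l=-0.8, logp_l_ref=-0.5, beta=0.1)
print(winner, round(loss, 4))
```

Minimizing this loss over many such step pairs is what refines preference optimization from whole answers down to individual reasoning steps.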
High-frame-rate breakthrough, capturing decisive moments: F-16 (ICML 2025)
F-16 is the first large model designed for high-frame-rate video understanding. The human eye can perceive well beyond 30 FPS, and in critical moments such as live sports, e-sports highlights, and rapid camera cuts, frame rate determines whether the model can see everything clearly and accurately. Unfortunately, most large video models still sample at a low rate of roughly 1 FPS, losing a great deal of information, while naively increasing the frame rate blows up the compute cost.
F-16's solution is multi-frame joint alignment and compression: multi-frame features are fed into the aligner, which aligns semantics while compressing efficiently, preserving the pre-trained semantics of the image encoder while reducing computational cost. The result is straightforward: on general video understanding benchmarks, F-16 matches and rewrites the SOTA results of same-scale vision-only open-source models, and in high-frame-rate tasks such as sports it likewise remains at the SOTA level.
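The multi-frame compression idea can be sketched in a few lines: consecutive high-frame-rate frame features are grouped and jointly aggregated, so the language model sees far fewer tokens while each token summarizes several frames. Mean pooling below is a hypothetical stand-in for F-16's learned aligner, and all names and dimensions are our own toy assumptions.

```python
def compress_frames(frame_feats, group_size):
    """Sketch of temporal compression for high-frame-rate input:
    group consecutive per-frame feature vectors and aggregate each
    group into one token (mean pooling here; the real aligner is a
    learned module that also aligns semantics to the LLM space)."""
    compressed = []
    for i in range(0, len(frame_feats), group_size):
        group = frame_feats[i:i + group_size]
        dim = len(group[0])
        pooled = [sum(f[d] for f in group) / len(group) for d in range(dim)]
        compressed.append(pooled)
    return compressed

# 16 frames of toy 4-dim features, compressed 4x down to 4 tokens.
feats = [[float(t)] * 4 for t in range(16)]
out = compress_frames(feats, group_size=4)
print(len(out), out[0])  # 4 tokens; first token averages frames 0-3 -> [1.5, 1.5, 1.5, 1.5]
```

The 4x reduction is why a 16x higher sampling rate need not mean 16x more tokens for the LLM: the extra temporal detail is absorbed into each aggregated token rather than into sequence length.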