
NVIDIA's new architecture sparks a revolution in multimodal large language model (LLM) technology, and the open-source 9B model racked up over 10,000 downloads right after release.

New Intelligence Yuan, 2025-11-07 18:44
NVIDIA has open-sourced OmniVinci, a full-modal model with 9 billion parameters that stands out for its excellent performance.

[Introduction] OmniVinci is a full-modal large model launched by NVIDIA. It can accurately analyze video and audio, and is particularly strong at temporally aligning visual and auditory signals. At a scale of 9 billion parameters, its performance surpasses models of the same or even larger size, and its training data efficiency is 6 times that of its competitors, significantly reducing costs. In scenarios such as video content understanding, speech transcription, and robot navigation, OmniVinci provides efficient support, demonstrating excellent multi-modal application capabilities.

Since the beginning of this year, the open-source large-model battlefield has been thick with the smoke of battle.

Players of all kinds have gone all in, trying to stake out a niche for the next era of AI. One undeniable trend: Chinese large models now firmly dominate the "Hall of Fame" of open-source foundation models.

From DeepSeek's stunning performance in code and mathematical reasoning to the Qwen family's all-round strength in multi-modal and general capabilities, these models have long been references that AI practitioners worldwide cannot ignore, thanks to their excellent performance and rapid iteration.

Just when everyone assumed this wave of open-source foundation models would be driven mainly by top Internet giants and star startups, a giant that was "supposed" to be selling shovels on the sidelines has stepped into the arena.

Yes, NVIDIA, the biggest beneficiary of the AI wave, has not slacked off on building its own large models.

Now, an important piece has been added to NVIDIA's large model matrix.

Without further ado, Jensen Huang's latest trump card has officially arrived: OmniVinci, the most powerful 9B full-modal video-and-audio large model, is now open source!

  • Paper link: https://arxiv.org/abs/2510.15870
  • Code link: https://github.com/NVlabs/OmniVinci

On multiple mainstream full-modal, audio understanding, and video understanding leaderboards, OmniVinci has demonstrated overwhelming performance over its competitors:

If NVIDIA's earlier open-source models were careful plays in specific domains, the release of OmniVinci is a genuine all-out push.

NVIDIA defines OmniVinci as "Omni-Modal": a unified model that can simultaneously understand video, audio, images, and text.

Although it only has 9 billion (9B) parameters, it has demonstrated "game-changing" performance in multiple key multi-modal benchmark tests.

According to the paper published by NVIDIA, the core advantages of OmniVinci are extremely prominent:

  • Performance beyond its class: On multiple authoritative full-modal understanding benchmarks (such as DailyOmni and MMAR), OmniVinci comprehensively surpasses competitors at the same or even higher levels, including Qwen2.5-Omni.
  • Remarkable data efficiency: This is the most striking part. To reach its current SOTA (state-of-the-art) performance, OmniVinci uses only 0.2T (200 billion) tokens of training data, while its main competitors' training corpora generally exceed 1.2T tokens. In other words, OmniVinci's training data efficiency is 6 times that of its competitors!
  • Core technical innovation: Through an innovative module called OmniAlignNet, combined with techniques such as Temporal Embedding Grouping and Constrained Rotary Time Embedding, it achieves high-precision temporal alignment of visual and auditory signals. Simply put, it not only "sees" the video and "hears" the sound but also understands precisely which sound occurs with which frame (see the sketch after this list).
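
To make the temporal-alignment idea concrete, here is a minimal sketch of interleaving vision and audio embeddings by timestamp, in the spirit of the Temporal Embedding Grouping technique named above. All names, shapes, and timestamps are illustrative assumptions, not NVIDIA's actual implementation.

```python
# Minimal sketch: merge per-frame vision embeddings and per-chunk audio
# embeddings into one sequence sorted by timestamp, so a downstream LLM
# sees "which sound occurs with which frame". Hypothetical shapes/names.
import torch

def group_by_time(vision_tokens, vision_times, audio_tokens, audio_times):
    """vision_tokens: (Nv, D), vision_times: (Nv,) seconds
    audio_tokens:  (Na, D), audio_times:  (Na,) seconds
    Returns a single (Nv+Na, D) sequence in temporal order."""
    tokens = torch.cat([vision_tokens, audio_tokens], dim=0)
    times = torch.cat([vision_times, audio_times], dim=0)
    order = torch.argsort(times, stable=True)  # stable: vision first on ties
    return tokens[order], times[order]

# Toy usage: 4 video frames at 0.5 s intervals, 8 audio chunks at 0.25 s.
v = torch.randn(4, 16); vt = torch.arange(4) * 0.5
a = torch.randn(8, 16); at = torch.arange(8) * 0.25
seq, ts = group_by_time(v, vt, a, at)
print(ts)  # timestamps now interleaved in temporal order
```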

NVIDIA's entry sends a clear signal: the king of hardware also wants a say in defining the models themselves.

Video + Audio Understanding: 1 + 1 > 2

Does the addition of audio really make multi-modal models stronger? Experiments have given a clear answer: Yes, and the improvement is very significant.

The research team points out that sound adds a new dimension of information to visual tasks, which greatly benefits the model in video understanding.

Specifically, moving from relying on vision alone, to implicitly learning multi-modal information by adding audio, and then to explicit fusion through a full-modal data engine, the model's performance improves in step-by-step leaps.

Especially after adopting the explicit learning strategy, multiple metrics show breakthrough improvements; as the following table shows, performance soars almost across the board.

Not only in SFT: adding the audio modality in the post-training stage also further boosts the effect of GRPO:
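
Since the article cites GRPO (Group Relative Policy Optimization) without unpacking it, here is a minimal sketch of the group-relative advantage at its core. This is the generic textbook formulation, not OmniVinci's training code; the function name and toy rewards are assumptions for illustration.

```python
# Generic sketch of GRPO's group-relative advantage: each sampled response
# is scored relative to its own group, so no value network (critic) is needed.
# Not OmniVinci's actual training code; rewards here are a toy example.
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """rewards: (G,) scalar rewards for G responses sampled from one prompt.
    Returns the per-response advantage, normalized within the group."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Toy usage: 4 sampled answers to one audio-video question, 0/1 correctness.
rewards = torch.tensor([1.0, 0.0, 0.0, 1.0])
print(grpo_advantages(rewards))  # above-mean answers get positive advantage
```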

Full-Modal Agent: Abundant Application Scenarios

By combining video and audio, the full-modal model breaks through the modal limitations of traditional VLMs and understands video content more comprehensively, opening up a much broader range of application scenarios.

For example, summarizing an interview with Jensen Huang:

It can also transcribe speech into text:

Or follow voice commands for robot navigation:

A Friend, Not a Foe in the Open-Source Community

In the past year, DeepSeek has repeatedly raised the ceiling of open-source leaderboards with its outstanding strength in code and mathematical reasoning, becoming a byword for the "strongest science student".

Qwen has built a full model matrix, from a tiny 0.6B model up to a giant 1T model, making it one of the most well-rounded "all-round players" with the most complete ecosystem today.

The open-sourcing of OmniVinci acts more like a "catfish" stirring the pond: with its extreme efficiency and strong performance it sets a SOTA research benchmark, shaking up the open-source battlefield and pushing its peers in the community to ship better models on the road to AGI.

For NVIDIA, the "shovel seller", more people using open-source models means more people buying GPUs. It is undoubtedly the biggest beneficiary of open-source models, which is why NVIDIA is a firm friend, not a foe, of open-source model teams.

Conclusion: A Community Celebration, an Accelerating Wave, Toward AGI Together

As soon as it was released, NVIDIA's OmniVinci hit the already turbulent open-source sea like a boulder, and it has already passed 10,000 downloads on Hugging Face.

Tech bloggers overseas rushed to publish videos and articles covering the technology.

It is not only a natural extension of NVIDIA's "hardware-software integration" ecosystem but also a strong "boost" to the entire AI open-source ecosystem.

The open-source landscape has become clearer as a result.

On the one hand, there are the Chinese open-source forces represented by DeepSeek and Qwen, which have built a thriving developer base through rapid iteration and openness.

On the other hand, there is NVIDIA, holder of computing-power hegemony, which has stepped into the arena and, as a friend of the open-source community, is using "technical benchmarks" and "ecosystem incubation" to accelerate the entire process.

The wave is accelerating, and no one can stay out of it. For every AI practitioner, a stronger, faster, and more competitive AI era has just begun.

Reference materials:

https://arxiv.org/abs/2510.15870

This article is from the WeChat official account "New Intelligence Yuan", edited by LRST. Republished by 36Kr with permission.