NVIDIA's omni-modal large model is here. It can digest Jensen Huang's 3-minute speech in just a few seconds, with 9 times the throughput of its peers.
On April 29, TechSugar reported that NVIDIA had officially launched a new multimodal reasoning model, Nemotron 3 Nano Omni, the day before. It deeply integrates three modalities (text, vision, and audio) into a single model and is currently free to use.
As the latest member of the Nemotron 3 series, Nemotron 3 Nano Omni can handle inputs such as text, images, audio, video, documents, charts, and graphical interfaces, and outputs text. In addition, the model dynamically activates expert networks according to the task and modality, achieving strong multimodal perception while sustaining high throughput: overall, 9 times the throughput of comparable open multimodal models.
Currently, the model ranks among the top five on document-intelligence leaderboards such as MMLongBench-Doc and OCRBenchV2. On video and audio understanding tasks, it ranks first on DailyOmni and VoiceBench, surpassing Qwen3-Omni-30B-A3B-Thinking and Gemini 2.5 Flash.
▲OCRBenchV2 Ranking
▲DailyOmni Ranking
In addition to accuracy, MediaPerf data shows that it achieves the highest throughput in multi-task scenarios and the lowest inference cost in video-level annotation tasks.
In terms of the training dataset, Hugging Face shows that Nemotron 3 Nano Omni was improved using Qwen3-VL-30B-A3B-Instruct, Qwen3.5-122B-A10B, Qwen3.5-397B-A17B, Qwen2.5-VL-72B-Instruct, and gpt-oss-120b.
According to hands-on tests by overseas netizens, Nemotron 3 Nano Omni can quickly and accurately recognize video content, rapidly analyzing speech videos and extracting key information. It can answer questions about specific topics in a speech, with answers consistent with the original remarks. It can also read and analyze professional technical documents and answer hard-core technical questions about model training. Its overall comprehension, multimodal information processing, and interpretation of professional content are all remarkable.
Open-source URLs:
https://nvda.ws/420h6mR
https://openrouter.ai/nvidia/nemotron-3-nano-omni-30b-a3b-reasoning:free
Official URL:
https://build.nvidia.com/nvidia/nemotron-3-nano-omni-30b-a3b-reasoning
01. Can quickly understand video content and locate relevant segments
In one hands-on test, an overseas blogger uploaded a speech video of Jensen Huang at NVIDIA GTC 2026, more than three minutes long, and asked the model directly about its content. Nemotron 3 Nano Omni completed joint understanding of the visuals and audio within a few seconds, accurately summarizing the core points of the speech and pointing out key information in its specific context.
Subsequently, the blogger further asked, "What exactly did Jensen Huang say about the ranking list?" Based on the existing video context, the model quickly located the relevant segments and gave a more detailed answer, demonstrating its ability to retain long-video content and perform cross-modal retrieval.
He then fed the Nemotron 3 Nano Omni technical document directly into the model and asked it to explain the model's training method. Despite the switch of source material from video to text, the model seamlessly analyzed the complex technical details under the same inference framework, laying out the key logic of the mixture-of-experts architecture, the data, and the training process.
The main application scenarios of Nemotron 3 Nano Omni include computer-use agents that navigate graphical interfaces, document intelligence for enterprise analysis and compliance workflows, and audio-video understanding for customer service and research applications. The model ships with open weights, datasets, and training recipes, and can be deployed on local systems, in data centers, and in cloud environments to meet regulatory, sovereignty, or data-localization requirements.
Early adopters include Aible, Foxconn, Palantir, and H Company, while companies such as Dell Technologies, DocuSign, Infosys, and Oracle are evaluating the model. The download volume of the Nemotron 3 model series has exceeded 50 million times in the past year.
02. The throughput is 9 times that of similar open multimodal models
The core highlights of Nemotron 3 Nano Omni center on its hybrid MoE architecture, efficient spatio-temporal visual processing, and comprehensive multimodal capabilities. It dynamically activates expert networks according to the task and modality, achieving strong multimodal perception while sustaining high throughput: overall, 9 times the throughput of comparable open multimodal models.
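To make "dynamically activates expert networks" concrete, the sketch below shows the general top-k mixture-of-experts routing idea: a gating network scores all experts per token, only the top-k are actually run, and their outputs are blended with renormalized gate weights. This is a generic illustration with hypothetical names and shapes, not NVIDIA's actual implementation.

```python
import numpy as np

def topk_moe_layer(x, gate_w, experts, k=2):
    """Illustrative top-k MoE routing: each token activates only k of
    the experts, so compute per token stays low while total capacity
    (number of experts) can be large."""
    logits = x @ gate_w                          # (tokens, n_experts) gate scores
    topk = np.argsort(logits, axis=-1)[:, -k:]   # indices of the k best experts
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        scores = logits[t, topk[t]]
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()                 # softmax over the selected k only
        for w, e in zip(weights, topk[t]):
            out[t] += w * experts[e](x[t])       # only k experts do any work
    return out

# Toy usage: 4 tokens of dim 16, 8 experts, each expert a small linear map.
rng = np.random.default_rng(0)
d, n_exp = 16, 8
experts = [lambda v, W=rng.standard_normal((d, d)) / d: v @ W for _ in range(n_exp)]
x = rng.standard_normal((4, d))
y = topk_moe_layer(x, rng.standard_normal((d, n_exp)), experts, k=2)
print(y.shape)  # (4, 16)
```

The throughput benefit follows directly: with 8 experts and k=2, each token pays for only a quarter of the expert compute, which is the general mechanism behind the high-throughput claims above.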
The hybrid MoE core innovatively combines Mamba layers with Transformer layers: the Mamba layers improve sequence-processing efficiency and memory utilization, while the Transformer layers preserve accurate reasoning computation. This combination not only significantly raises data-processing throughput but also improves memory and compute efficiency by up to 4 times, making the model well suited to serving as a sub-agent.
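The division of labor described above can be sketched as follows: a state-space-style recurrent layer (the Mamba stand-in) processes the sequence in linear time with a constant-size running state, while a self-attention layer (the Transformer stand-in) provides exact all-pairs token interaction; hybrid designs interleave the two. This is a heavily simplified illustration under assumed toy dynamics, not the model's real architecture.

```python
import numpy as np

def ssm_layer(x, decay=0.9):
    """Mamba stand-in: a linear recurrence over the sequence. O(1) state
    per step, so memory does not grow with context length. Real Mamba
    uses learned, input-dependent dynamics; this fixed decay is a toy."""
    h = np.zeros(x.shape[1])
    out = np.empty_like(x)
    for t in range(x.shape[0]):
        h = decay * h + (1 - decay) * x[t]   # constant-size state update
        out[t] = h
    return out

def attention_layer(x):
    """Transformer stand-in: single-head self-attention. Quadratic in
    sequence length, but models exact all-pairs token interactions."""
    scores = x @ x.T / np.sqrt(x.shape[1])
    scores = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = scores / scores.sum(axis=-1, keepdims=True)
    return weights @ x

def hybrid_stack(x, n_blocks=2):
    """Interleave the cheap recurrent layer with the exact attention
    layer, with residual connections, as hybrid designs do."""
    for _ in range(n_blocks):
        x = x + ssm_layer(x)
        x = x + attention_layer(x)
    return x

x = np.random.default_rng(1).standard_normal((6, 8))  # 6 tokens, dim 8
y = hybrid_stack(x)
print(y.shape)  # (6, 8)
```

The efficiency argument is visible in the layer costs: the recurrent layer does O(n) work and carries O(1) state, while attention does O(n²) work, so replacing most attention layers with recurrent ones is what drives the memory and throughput gains the article cites.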
For video inference under the same interaction threshold, Nemotron 3 Nano Omni maintains higher total throughput: compared with alternative open omni models, its effective system capacity increases by approximately 9.2 times.
For multi-document inference under the same interaction threshold, it likewise maintains higher total throughput, with effective system capacity increased by approximately 7.4 times.
From the previous Nemotron Nano VL V2 model to Nemotron 3 Nano Omni, the multimodal accuracy has improved in industry - leading benchmark tests.
03. An open - source model that integrates multimodal processing capabilities within a unified architecture
Currently, open-source AI models for agentic reasoning are experiencing a concentrated explosion, and market competition is becoming increasingly fierce. Meta's Llama series has long dominated the open-source large-language-model track; Google's Gemini focuses on large-scale multimodal capabilities in the cloud to build a differentiated advantage; OpenAI's GPT series remains the benchmark in the commercial arena; and DeepSeek's V4-Pro and V4-Flash, released last week, further enrich the market with a hybrid attention architecture optimized specifically for long-horizon agent tasks.
The core differentiation of Nemotron 3 Nano Omni lies not in a breakthrough on any single performance metric but in the exclusive combination of four advantages: unified multimodal perception of vision, audio, and text in a single model; an energy-efficient mixture-of-experts design suited to edge deployment; open-source weights; and full commercial licensing. No competing product currently offers all of these at once; each has its own shortcoming: Google's edge-side model Gemini Nano is not open-source, and the multimodal versions of Meta's Llama do not integrate audio processing within a unified architecture.
04. Conclusion: A "crucial move" for NVIDIA to improve its AI layout
The strategic impact of this model far exceeds the product itself. If it becomes the mainstream choice for agent deployment, NVIDIA will achieve a trinity of inference GPU hardware, an optimized acceleration software stack, and self-developed upper-layer models. If competitors build on NVIDIA's model, it further deepens their hardware dependence; even if they develop models independently, training still relies on NVIDIA GPU compute. As the era of agentic AI accelerates, NVIDIA's core goal is not to monopolize any single point but to penetrate every core link of the industry and build irreplaceability.
This article is from the WeChat official account "TechSugar" (ID: zhidxcom), written by Xu Jiayang, and published by 36Kr with authorization.