
NVIDIA becomes the benchmark for open-source large models in the United States: even Nemotron 3's training recipe is made public, and all of its 10-trillion-plus tokens of data are released.

量子位 (QbitAI) | 2025-12-26 19:49
It combines a hybrid Mamba-Transformer MoE architecture with NVFP4 low-precision training.

NVIDIA is very aggressive in the field of open-source models:

The "most efficient open model family", Nemotron 3, incorporates the hybrid Mamba-Transformer MoE architecture and NVFP4 low-precision training.

Moreover, it is completely open:

Not only are the model weights open: the training data of over 10 trillion tokens, the pre-training and post-training software, and the training recipes are all public as well.

Compared with other open-source models, it offers competitive performance while running 1.5 to 3.3 times faster.
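
To make the "NVFP4 low-precision training" mentioned above more concrete, here is a minimal sketch of block-scaled 4-bit (FP4, E2M1) quantization. The block size, scaling scheme, and function names are illustrative assumptions for this article, not NVIDIA's actual training recipe, which ships with the release.

```python
# Toy sketch of block-scaled 4-bit (FP4, E2M1) quantization, in the spirit of
# NVFP4 low-precision training. The block size and scaling scheme here are
# simplified assumptions, not NVIDIA's actual recipe.
import numpy as np

# Magnitudes representable by an E2M1 (FP4) value; the sign is a separate bit.
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_fp4_blocked(x: np.ndarray, block: int = 16):
    """Quantize a 1-D array to the FP4 grid, one scale per block of values."""
    x = x.reshape(-1, block)
    # Per-block scale so the largest magnitude maps to the top of the FP4 grid.
    scale = np.abs(x).max(axis=1, keepdims=True) / FP4_GRID[-1]
    scale = np.where(scale == 0, 1.0, scale)
    # Snap each scaled value to the nearest representable FP4 magnitude.
    idx = np.abs(np.abs(x / scale)[..., None] - FP4_GRID).argmin(axis=-1)
    q = np.sign(x) * FP4_GRID[idx]
    # In a real kernel only the 4-bit codes and the block scales are stored.
    return q, scale

def dequantize(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return (q * scale).reshape(-1)

w = np.random.randn(64).astype(np.float32)
q, s = quantize_fp4_blocked(w)
print("max abs quantization error:", np.abs(w - dequantize(q, s)).max())
```

The appeal of a 4-bit format is memory and bandwidth; keeping a scale per small block is what makes such a coarse grid usable at all.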

Combining Mamba and Transformer

Nemotron 3 aims to maximize inference efficiency at the architectural level.

The self-attention mechanism of a traditional Transformer has to scan an ever-growing KV cache at every decoding step, so the longer the sequence, the greater the compute and memory overhead per generated token.

NVIDIA's solution is to replace most self-attention layers with Mamba-2 layers. A Mamba layer only needs to store a fixed-size state during generation, regardless of sequence length.
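
As a rough illustration of why this matters, the sketch below compares the memory a growing KV cache needs against a fixed-size Mamba-2 state. The layer counts, head dimensions, and function names are made-up assumptions for the comparison, not Nemotron 3's actual configuration.

```python
# Back-of-the-envelope comparison: a self-attention KV cache grows with the
# sequence, while a Mamba-2 layer keeps a fixed-size recurrent state.
# All dimensions below are illustrative assumptions, not Nemotron 3's config.

def kv_cache_bytes(seq_len: int, n_layers: int = 48, n_kv_heads: int = 8,
                   head_dim: int = 128, dtype_bytes: int = 2) -> int:
    # 2x for keys and values; one entry per past token, per layer.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * dtype_bytes

def mamba_state_bytes(n_layers: int = 48, d_inner: int = 8192,
                      d_state: int = 128, dtype_bytes: int = 2) -> int:
    # One fixed (d_inner x d_state) SSM state per layer, independent of length.
    return n_layers * d_inner * d_state * dtype_bytes

for seq_len in (1_000, 32_000, 1_000_000):
    kv = kv_cache_bytes(seq_len) / 2**30
    ssm = mamba_state_bytes() / 2**30
    print(f"{seq_len:>9} tokens | KV cache {kv:8.2f} GiB | Mamba state {ssm:.2f} GiB")
```

At a million tokens the toy KV cache runs into the hundreds of gigabytes, while the state-space memory has not moved; that gap is the efficiency argument behind swapping most attention layers for Mamba-2.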