NVIDIA sets the benchmark for open-source large models in the United States: even Nemotron 3's training recipe is public, and the full dataset of over 10 trillion tokens has been released.
NVIDIA is being very aggressive in the open-source model space:
The "most efficient open model family", Nemotron 3, combines a hybrid Mamba-Transformer MoE architecture with NVFP4 low-precision training (a sketch of the format follows below).
Moreover, it is completely open:
Not only the model weights, but also more than 10 trillion tokens of training data, the pre-training and post-training software, and the training recipes have all been made public.
Compared with other open-source models, it delivers competitive quality while running 1.5 to 3.3 times faster.
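To give some intuition for what NVFP4 training means for weights and activations, here is a minimal fake-quantization sketch. It assumes the commonly described NVFP4 layout of 4-bit E2M1 elements with one scale per small block; the block size and scale handling here are illustrative assumptions (the real format reportedly stores per-block scales in FP8, which this sketch skips), not NVIDIA's actual training kernels.

```python
import numpy as np

# Representable magnitudes of a 4-bit E2M1 float: 0, 0.5, 1, 1.5, 2, 3, 4, 6
E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_nvfp4_block(x: np.ndarray, block: int = 16) -> np.ndarray:
    """Fake-quantize a 1-D tensor NVFP4-style: 4-bit values plus one scale per block."""
    x = x.astype(np.float32)
    pad = (-len(x)) % block
    blocks = np.pad(x, (0, pad)).reshape(-1, block)

    # One scale per block so the block's largest magnitude maps to E2M1's max (6.0).
    amax = np.abs(blocks).max(axis=1, keepdims=True)
    scale = np.where(amax > 0, amax / 6.0, 1.0)

    # Snap each scaled element to the nearest representable E2M1 magnitude.
    scaled = blocks / scale
    idx = np.abs(np.abs(scaled)[..., None] - E2M1_GRID).argmin(axis=-1)
    deq = np.sign(scaled) * E2M1_GRID[idx] * scale
    return deq.reshape(-1)[:len(x)]

w = np.random.randn(64).astype(np.float32)
w_q = quantize_nvfp4_block(w)
print("max abs quantization error:", np.abs(w - w_q).max())
```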
Combining Mamba and Transformer
Nemotron 3 aims to maximize inference efficiency at the architectural level.
The self-attention mechanism of a traditional Transformer must attend over an ever-growing KV cache: every new token is compared against all cached keys and values, so per-token compute and memory grow with sequence length.
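As a rough illustration of that growth (the layer, head, and precision numbers below are made up for the example, not Nemotron 3's real configuration):

```python
# Toy back-of-the-envelope: with self-attention, every generated token appends one
# key and one value vector per layer, so the cache keeps growing with the sequence.
d_head, n_heads, n_layers = 128, 8, 32
bytes_per_elem = 2  # fp16

def kv_cache_bytes(seq_len: int) -> int:
    # 2 tensors (K and V) x layers x heads x head_dim x tokens x bytes
    return 2 * n_layers * n_heads * d_head * seq_len * bytes_per_elem

for seq_len in (1_000, 10_000, 100_000):
    print(f"{seq_len:>7} tokens -> {kv_cache_bytes(seq_len) / 1e9:.2f} GB KV cache")
```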
NVIDIA's solution is to replace most self-attention layers with Mamba-2 layers. A Mamba layer only keeps a fixed-size state during generation, so its per-token cost does not depend on sequence length.
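A minimal sketch of why that helps, using a generic linear state-space recurrence rather than Mamba-2's actual selective-scan kernel: each decoded token updates a fixed-size state, so decode-time memory stays constant no matter how long the sequence gets. All dimensions and matrices below are toy values for illustration.

```python
import numpy as np

# Toy recurrence in the spirit of a state-space (Mamba-style) layer during decoding:
# the layer carries only a fixed-size state, with no cache that grows per token.
d_model, d_state = 1024, 128
rng = np.random.default_rng(0)
A = rng.standard_normal((d_state, d_state)) * 0.01   # state transition (toy)
B = rng.standard_normal((d_state, d_model)) * 0.01   # input projection (toy)
C = rng.standard_normal((d_model, d_state)) * 0.01   # output projection (toy)

state = np.zeros(d_state)                 # fixed-size state, independent of seq length
for t in range(10_000):                   # decode 10k tokens...
    x_t = rng.standard_normal(d_model)    # stand-in for this token's hidden vector
    state = A @ state + B @ x_t           # state update: O(1) memory per step
    y_t = C @ state                       # layer output for this token

print("state shape after 10k tokens:", state.shape)  # still (128,)
```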