NVIDIA launches new models with up to a 53-fold surge in inference speed at 4B scale; the new attention architecture surpasses Mamba2.
Jet-Nemotron is a new series of small models (2B/4B) recently launched by NVIDIA, developed by an all-Chinese team. Its core innovations are Post Neural Architecture Search (PostNAS) and JetBlock, a new linear attention module; together they enable efficient architecture optimization starting from pre-trained Transformer models. Compared with models such as Qwen3, Gemma3, and Llama3.2, Jet-Nemotron achieves higher accuracy on mathematics, code, commonsense, retrieval, and long-context benchmarks, while its inference throughput on H100 GPUs improves by up to 53.6 times.
NVIDIA has really become obsessed with “small models” recently.
NVIDIA has released a brand-new hybrid architecture language model series, Jet-Nemotron.
Paper link: https://arxiv.org/pdf/2508.15884
Project link: https://github.com/NVlabs/Jet-Nemotron
The Jet-Nemotron series includes Jet-Nemotron-2B and Jet-Nemotron-4B models.
NVIDIA claims that the “small models” in the Jet-Nemotron series outperform current state-of-the-art open-source full-attention language models such as Qwen3, Qwen2.5, Gemma3, and Llama3.2.
At the same time, it achieves significant efficiency improvements, with the generation throughput on H100 GPUs increasing by up to 53.6 times.
The radar chart in the upper right corner shows Jet-Nemotron as an all-rounder: Jet-Nemotron-4B comes close to the maximum in all six dimensions (MMLU-Pro, Math, Retrieval, Commonsense, Code, and Long).
During the pre-filling and decoding phases, the advantage of Jet-Nemotron-2B over Qwen3-1.7B grows even more pronounced as the context length increases.
In a nutshell, under the same hardware and evaluation settings, Jet-Nemotron achieves an order-of-magnitude throughput gain in long-context scenarios (up to a 53.6x decoding speedup).
At the same time, the accuracy in dimensions such as common sense, mathematics, code, retrieval, and long context increases instead of decreases.
Compared with traditional full-attention small models, it is both faster and more accurate.
It seems that NVIDIA has firmly set its sights on the small-model domain.
Just last week, the company released NVIDIA Nemotron Nano 2, a model with only 9B parameters. On complex reasoning benchmarks, it matches or exceeds the accuracy of Qwen3-8B while delivering up to 6 times higher throughput.
Today, they have launched the even smaller Jet series, with model sizes reduced to 2B and 4B.
Core Innovations
Jet-Nemotron has two core innovations.
- Post Neural Architecture Search (PostNAS), which is an efficient post-training architecture exploration and adaptation process applicable to any pre-trained Transformer model;
- JetBlock, a new linear attention module whose performance significantly outperforms previous designs such as Mamba2.
PostNAS: Post-Training Architecture Exploration and Adaptation
Unlike previous methods that explore new model architectures by training from scratch, PostNAS builds on pre-trained Transformer models and supports flexible exploration of attention-block designs, greatly reducing the cost and risk of developing new language model architectures.
PostNAS first determines the optimal placement of full-attention layers and then searches for improved attention block designs.
PostNAS starts from a pre-trained full-attention model and freezes the MLP.
Subsequently, it conducts a coarse-to-fine search for the design of efficient attention blocks:
- first, determine the optimal placement of full-attention layers;
- next, select the most suitable linear attention block (or adopt a new one);
- finally, search for the optimal architecture hyperparameters.
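The coarse-to-fine procedure above can be sketched as a simple greedy search. This is a toy illustration only, with hypothetical names and a stand-in scoring function; the real PostNAS pipeline trains and evaluates actual models (with MLPs frozen) at each step.

```python
# Toy sketch of a PostNAS-style coarse-to-fine search.
# evaluate() is a hypothetical stand-in for "train the attention layers
# (MLPs frozen) and measure accuracy"; here it is a deterministic toy score.
from itertools import combinations

def evaluate(placement, block, hparams):
    # Higher is better; rewards full-attention layers, a better block, more heads.
    block_score = {"jetblock": 0.5, "mamba2": 0.3}[block]
    return sum(placement) * 0.1 + block_score + hparams["heads"] * 0.01

def postnas_search(num_layers=8, full_attn_budget=2):
    # Step 1 (coarse): choose where the few full-attention layers go.
    best_keep = max(
        combinations(range(num_layers), full_attn_budget),
        key=lambda keep: evaluate(
            [i in keep for i in range(num_layers)], "mamba2", {"heads": 4}
        ),
    )
    placement = [i in best_keep for i in range(num_layers)]
    # Step 2: pick the linear-attention block for the remaining layers.
    block = max(["mamba2", "jetblock"],
                key=lambda b: evaluate(placement, b, {"heads": 4}))
    # Step 3 (fine): search architecture hyperparameters for that block.
    hparams = max([{"heads": h} for h in (2, 4, 8)],
                  key=lambda hp: evaluate(placement, block, hp))
    return placement, block, hparams
```

Each step reuses the result of the previous one, which is what keeps the search cheap compared with jointly exploring all choices from scratch.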
After applying PostNAS to the baseline model, significant accuracy improvements are achieved in all benchmark tests.
In pre-trained Transformer models, not all attention layers contribute equally.
PostNAS reveals the important attention layers in pre-trained Transformer models.
The KV cache size is the most critical factor affecting long-context and long-generation throughput.
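A quick back-of-envelope calculation shows why KV cache size dominates at long context. The layer counts and head dimensions below are hypothetical, chosen only to illustrate the scaling; linear-attention layers keep a constant-size state instead of a cache that grows with sequence length.

```python
def kv_cache_bytes(num_full_attn_layers, num_kv_heads, head_dim,
                   seq_len, batch, bytes_per_elem=2):
    # K and V each store a (batch, kv_heads, seq_len, head_dim) tensor
    # per full-attention layer; fp16/bf16 -> 2 bytes per element.
    return (2 * num_full_attn_layers * batch * num_kv_heads
            * seq_len * head_dim * bytes_per_elem)

# Hypothetical configs: a 28-layer full-attention model vs. a hybrid
# that keeps only 2 full-attention layers, at a 64K-token context.
full = kv_cache_bytes(28, 8, 128, seq_len=65536, batch=1)
hybrid = kv_cache_bytes(2, 8, 128, seq_len=65536, batch=1)
print(full / 2**30, "GiB vs", hybrid / 2**30, "GiB")
```

Under these assumptions the full-attention cache is 14x larger, and it grows linearly with both sequence length and batch size, which is exactly what throttles long-generation decoding throughput.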
PostNAS hardware-aware search can discover architectures that have more parameters and achieve higher accuracy while maintaining similar generation throughput.
JetBlock: A New Linear Attention Module with SOTA Accuracy
PostNAS also yields JetBlock, a novel linear attention module. JetBlock combines dynamic convolution with hardware-aware architecture search to enhance linear attention, achieving significant accuracy gains while maintaining training and inference throughput similar to previous designs.
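To make the two ingredients concrete, here is a minimal sketch of linear attention with an input-dependent ("dynamic") causal convolution applied to the values. This is not NVIDIA's implementation: the feature map, kernel generator, and all shapes are illustrative assumptions; it only shows why linear attention runs with O(1) state per decoding step.

```python
import numpy as np

def elu_plus_one(x):
    # A common positive feature map used in linear attention.
    return np.where(x > 0, x + 1.0, np.exp(x))

def dynamic_conv(v, kernel_size=3):
    # "Dynamic" depthwise causal convolution over time: each token's kernel
    # is generated from the input itself (here via a fixed random projection).
    T, d = v.shape
    W = np.random.default_rng(0).standard_normal((d, kernel_size)) * 0.1
    k = v @ W                                            # (T, kernel_size)
    k = np.exp(k) / np.exp(k).sum(-1, keepdims=True)     # softmax per token
    out = np.zeros_like(v)
    for t in range(T):
        for j in range(kernel_size):
            if t - j >= 0:
                out[t] += k[t, j] * v[t - j]             # causal mixing
    return out

def linear_attention_block(q, k, v):
    # Recurrent form: state S accumulates k_t v_t^T, so each decoding step
    # costs O(d^2) regardless of sequence length (no growing KV cache).
    q, k = elu_plus_one(q), elu_plus_one(k)
    v = dynamic_conv(v)
    S = np.zeros((q.shape[-1], v.shape[-1]))  # running sum of k_t v_t^T
    z = np.zeros(q.shape[-1])                 # running normalizer sum of k_t
    out = np.zeros_like(v)
    for t in range(len(q)):
        S += np.outer(k[t], v[t])
        z += k[t]
        out[t] = (q[t] @ S) / (q[t] @ z + 1e-6)
    return out
```

The contrast with full attention is the fixed-size state `S`: softmax attention must re-read a cache that grows with every generated token, while this block carries the same small matrix forward at each step.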
Below is a fair comparison between the Mamba2 Block and JetBlock using exactly the same training data and training scheme.
Performance
Jet-Nemotron-2B and Jet-Nemotron-4B achieve or exceed the accuracy of mainstream efficient language models (such as Qwen3) in comprehensive benchmark tests.
At the same time, they run significantly faster: 21 times and 47 times faster than Qwen3-1.7B-Base, respectively.
References
https://arxiv.org/pdf/2508.15884v1
https://x.com/hancai_hm/status/1960000017235902722
This article is from the WeChat official account “New Intelligence Yuan”, author: Ding Hui. It is published by 36Kr with authorization.