New work from Song Han's team at NVIDIA: an efficient language model built with post-neural architecture search
NVIDIA has made another big move in open-sourcing!
Song Han's team has launched Jet-Nemotron, a brand-new efficient language model built with post-neural architecture search.
Across a series of benchmarks, the model not only matches or beats the accuracy of Qwen3, Qwen2.5, Gemma 3, and Llama 3.2, but also delivers up to a 53.6× speedup in generation throughput and a 6.1× speedup in prefilling.
Notably, on the MMLU, MMLU-Pro, and BBH benchmarks, Jet-Nemotron-2B achieves 47× higher throughput than Qwen3-1.7B-Base while shrinking the KV cache to 1/47 of its size.
It also achieves higher accuracy than DeepSeek-V3-Small and Moonlight (15B total parameters, 2.2B active).
Both the code and the pretrained models will be open-sourced. Let's first take a look at how Jet-Nemotron is built.
Jet-Nemotron: Built on Post-Neural Architecture Search
Jet-Nemotron is built on Post-Neural Architecture Search (PostNAS), an architecture search method that improves "on the shoulders of" existing large models.
PostNAS starts from a pre-trained full-attention model, directly inherits its multi-layer perceptron (MLP) weights, and keeps those weights frozen (never updated) throughout the process.
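In code, this starting point amounts to copying the pretrained transformer's weights and marking the MLP layers as non-trainable while the attention layers stay open to search and retraining. A minimal sketch, where the `Layer` class and the name prefixes are illustrative stand-ins, not the paper's actual implementation:

```python
class Layer:
    """Toy stand-in for a transformer sub-layer and its weights."""
    def __init__(self, name, weights, frozen=False):
        self.name, self.weights, self.frozen = name, weights, frozen

def init_postnas(pretrained_layers):
    # Inherit every weight from the pretrained full-attention model;
    # freeze the MLP layers, leave attention layers trainable/searchable.
    model = []
    for layer in pretrained_layers:
        frozen = layer.name.startswith("mlp")
        model.append(Layer(layer.name, layer.weights, frozen=frozen))
    return model

pretrained = [Layer("attn_0", [0.1]), Layer("mlp_0", [0.2]),
              Layer("attn_1", [0.3]), Layer("mlp_1", [0.4])]
model = init_postnas(pretrained)
print([(l.name, l.frozen) for l in model])
```

Freezing the MLPs drastically cuts the cost of the subsequent architecture search, since only the attention side of the network is ever updated.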
Jet-Nemotron is then derived through the following four PostNAS steps:
Placement and Elimination of Full-Attention Layers
Retaining a small number of full-attention layers is crucial for maintaining high accuracy on difficult tasks such as retrieval.
However, the optimal placement of these layers has remained unclear.
The research team therefore introduced a new approach: training a "once-for-all" super network that automatically learns where to use full-attention layers.
Experiments show that, compared with the commonly used uniform placement strategy, this learned placement significantly improves accuracy on the MMLU benchmark.
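The paper learns placement through a super network; as a rough illustration of the underlying search problem only, the toy sketch below enumerates every way to place a few full-attention layers among `n` layers and picks the best under a made-up scoring function. The real method trains and scores placements jointly, and `proxy_score` here is purely hypothetical:

```python
from itertools import combinations

def placement_candidates(n_layers, n_full):
    """Yield every boolean mask placing n_full full-attention layers."""
    for idx in combinations(range(n_layers), n_full):
        yield [i in idx for i in range(n_layers)]

def proxy_score(mask):
    # Hypothetical proxy: pretend deeper layers benefit more from full attention.
    return sum(i for i, is_full in enumerate(mask) if is_full)

def best_placement(n_layers, n_full):
    return max(placement_candidates(n_layers, n_full), key=proxy_score)

print(best_placement(6, 2))  # under this toy proxy: the last two layers
```

Exhaustive enumeration like this explodes combinatorially, which is precisely why the paper resorts to a once-for-all super network instead of brute force.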
Selection of Linear Attention Modules
After fixing the placement of the full-attention layers, the team ran an attention-module search to determine the optimal linear attention module.
They evaluated six state-of-the-art linear attention modules (RWKV7 was excluded due to its low training throughput), with the results shown below.
As the table shows, Gated DeltaNet achieved the best overall accuracy, so the team used Gated DeltaNet in subsequent experiments.
Design of a New Attention Module
Adding convolution is a common strategy for strengthening linear attention. However, previous methods relied only on static convolution kernels, which cannot adapt their feature-extraction patterns to the input.
The research team therefore introduced a new linear attention module called JetBlock.
JetBlock uses a kernel generator to dynamically produce causal convolution kernels conditioned on the input, and applies them to the value (V) tokens. It also removes the redundant static convolutions on the query (Q) and key (K), simplifying computation.
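The core idea — convolution kernels derived from the input rather than fixed in advance — can be sketched in plain Python. Everything below (the window-mean kernel generator, the kernel size) is an assumption for illustration; the actual JetBlock produces kernels with a learned generator network:

```python
def dynamic_causal_conv(v, kernel_size=3):
    """Toy sketch of input-dependent causal convolution over V tokens."""
    def generate_kernel(context):
        # Hypothetical kernel generator: normalize the causal window so the
        # kernel depends on the input content (a real generator is learned).
        total = sum(context) or 1.0
        return [x / total for x in context]

    out = []
    for t in range(len(v)):
        # Causal window: only positions <= t are visible at step t.
        window = v[max(0, t - kernel_size + 1): t + 1]
        kernel = generate_kernel(window)  # regenerated at every step
        out.append(sum(w * x for w, x in zip(kernel, window)))
    return out

print(dynamic_causal_conv([1.0, 2.0, 3.0, 4.0]))
```

The contrast with a static convolution is that `kernel` changes at every position, so the mixing pattern tracks the content of the sequence instead of being baked in at training time.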
Execution of Hardware-Aware Architecture Search
Parameter count has traditionally served as a proxy for language model efficiency, but it does not map directly to hardware efficiency.
Guided by the finding that KV cache size is the most critical factor for long-context and long-generation throughput, the team fixed the KV cache size to the original design's budget and ran a small-scale grid search over the key dimension, the value dimension, and the number of attention heads.
This hardware-aware search can spend more parameters to gain accuracy while keeping generation throughput roughly unchanged.
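The constraint driving this search is easy to state in code: the KV cache footprint is determined by layer count, head count, key/value dimensions, and sequence length, so the grid search only keeps configurations within that fixed budget. A sketch under stated assumptions — the formula is the standard fp16 KV cache size, while the "capacity" proxy and the grid values are made up:

```python
def kv_cache_bytes(n_layers, n_heads, d_key, d_value, seq_len, bytes_per=2):
    # Standard KV cache footprint: one key vector and one value vector
    # per token, per head, per layer (fp16 -> 2 bytes per element).
    return n_layers * n_heads * (d_key + d_value) * seq_len * bytes_per

def grid_search(budget, n_layers=28, seq_len=4096):
    """Toy hardware-aware search: among configurations whose KV cache fits
    the budget, prefer the one with the most 'capacity' (a purely
    hypothetical accuracy proxy, not the paper's actual objective)."""
    best = None
    for n_heads in (2, 4, 8):
        for d_key in (64, 96, 128):
            for d_value in (64, 96, 128):
                size = kv_cache_bytes(n_layers, n_heads, d_key, d_value, seq_len)
                if size <= budget:
                    capacity = n_heads * (d_key + d_value)
                    if best is None or capacity > best[0]:
                        best = (capacity, n_heads, d_key, d_value)
    return best

budget = kv_cache_bytes(28, 4, 128, 128, 4096)
print(grid_search(budget))
```

Because the cache budget is held constant, any configuration the search returns has (approximately) the same long-context throughput as the original design, which is what lets the final model trade extra parameters for accuracy "for free".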
The good news: the team plans to release the code and models on GitHub, and is currently awaiting legal compliance review.
Significant Efficiency Improvement
Jet-Nemotron-2B and Jet-Nemotron-4B are built on Qwen2.5-1.5B and Qwen2.5-3B, respectively.
To evaluate the models comprehensively, the research team ran tests on mathematics, common sense, retrieval, coding, and long-context tasks.
On math tasks, Jet-Nemotron-2B reached an average accuracy of 49.6, 6.3 points higher than Qwen3-1.7B-Base, while running 47× faster.
By contrast, previous linear attention and hybrid models lagged far behind Qwen3-1.7B-Base on math tasks.
On common-sense reasoning tasks, Jet-Nemotron-2B achieved an average accuracy of 62.0, surpassing all baseline models.
On retrieval tasks, Jet-Nemotron-2B outperformed every baseline except Qwen3-1.7B-Base.
Scaled up to 4B, Jet-Nemotron-4B reached the best average accuracy of 76.2 while still running 21× faster than Qwen3.
On coding tasks, Jet-Nemotron-2B posted a higher average accuracy than all baselines.
Meanwhile, Jet-Nemotron-4B achieved higher accuracy on every coding task.
On long-context tasks, although Jet-Nemotron-2B has only two full-attention layers, its performance is comparable to leading models such as Qwen2.5-1.5B and Gemma3n-E2B, which use more full-attention layers.
Overall, Jet-Nemotron-2B and Jet-Nemotron-4B match or exceed Qwen3-1.7B-Base across these areas.
Thanks to far fewer full-attention layers and a much smaller KV cache, Jet-Nemotron holds a clear efficiency advantage over Qwen3.
Team Introduction
Notably, all members of this research team are of Chinese origin.
Yuxian Gu completed both his undergraduate and doctoral degrees in the Department of Computer Science and Technology at Tsinghua University, advised by Professor Minlie Huang.
He previously interned at Microsoft Research Asia under Researcher Li Dong.
His research centers on the full lifecycle of language models, including pre-training, adaptation to downstream tasks, and efficient inference.
Recently, his work has focused on the theory and algorithms of data construction for pre-training large language models (e.g., PDS, instruction pre-training, Learning Law) and on language model compression via knowledge distillation (e.g., MiniLLM, MiniPLM).
Qinghao Hu received his undergraduate degree from Zhejiang University and his master's degree from the National University of Singapore. He is currently a postdoctoral researcher with Professor Song Han at the Massachusetts Institute of Technology.
Shang Yang is a third-year doctoral student at the Massachusetts Institute of Technology, advised by Professor Song Han. Before that, he received his undergraduate degree from the Department of Electronic Engineering at Tsinghua University.