The second generation of InfLLM has been open-sourced: it runs more than three times faster than models of the same size, adds zero extra parameters, and supports trainable sparse attention.
InfLLM-V2 is a sparse attention mechanism that processes long texts efficiently. It can be trained with only a small amount of long-text data, yet its performance stays close to that of traditional dense-attention models. By dynamically switching between short- and long-text processing modes, it significantly improves both the efficiency and the quality of long-context tasks: it enables a low-cost "seamless switch" from short to long texts, accelerates both the prefill and decoding stages, and unleashes the real productivity of long contexts.
Efficient processing of long sequences has become the key to large model applications.
The computational cost of traditional dense attention grows quadratically with sequence length, which directly limits product usability and cost controllability.
To address this pain point, Tsinghua University, OpenBMB, and Harbin Institute of Technology proposed InfLLM-V2: a native sparse attention framework with zero additional parameters and high training efficiency.
InfLLM-V2 maintains high efficiency in short-text scenarios and switches to sparse mode in long-text scenarios, bringing significant end-to-end acceleration.
This method completes sparse attention training with only 5B long-text tokens (by comparison, DeepSeek-V3.2-Exp used nearly 1T tokens of data to train its sparse attention).
Specifically, compared with the dense attention mechanism, InfLLM-V2 can achieve a 4-fold speed increase, maintain 98.1% of the performance of the dense model in long-text understanding tasks, and maintain 99.7% of the performance of the dense model in deep-thinking tasks.
InfLLM-V2 has three core advantages
1. Low-cost training: Only 5B tokens of long-text data are needed to train the sparse attention capability, so the training cost is low and the adaptation cycle is short.
2. Seamless short-to-long switch with dual efficiency optimization: With zero additional parameters, it uses dense attention for short sequences and switches to sparse attention for long sequences (see the dispatch sketch after this list), fully aligning with the mainstream "short-sequence pre-training, long-sequence post-training" paradigm. Training is stable and converges quickly.
3. Efficient operator implementation: It systematically optimizes the time bottleneck of "relevant context selection" (block selection) in sparse attention, proposes an efficient hardware-oriented implementation, significantly reduces HBM I/O and computation, and unleashes the full potential of sparse attention.
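To illustrate point 2, here is a minimal sketch of the length-based dense/sparse dispatch. The threshold value and function names are hypothetical rather than taken from the released implementation; the sketch only shows how the same attention parameters can serve both modes.

```python
import torch

# Hypothetical cutoff between "short" and "long" sequences; the real value is
# a deployment choice, not something specified in the article.
LONG_SEQ_THRESHOLD = 8192

def attention_dispatch(q, k, v, dense_attention, sparse_attention):
    """Route short sequences to dense attention and long sequences to the
    block-sparse path. Both paths reuse the same projection weights, so the
    switch itself introduces no extra parameters."""
    seq_len = q.shape[-2]
    if seq_len < LONG_SEQ_THRESHOLD:
        # Short text: plain dense attention, zero additional overhead.
        return dense_attention(q, k, v)
    # Long text: block selection followed by attention over the selected blocks.
    return sparse_attention(q, k, v)
```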
Paper link: https://www.arxiv.org/pdf/2509.24663
Model link: https://huggingface.co/openbmb/MiniCPM4.1-8B
How does InfLLM-V2 achieve both "strength" and "speed"?
In the self-attention of the standard Transformer, each query token (Q[t]) computes similarity scores against all historical tokens (K[:t]) and aggregates over all of them in the attention computation.
In long contexts (often hundreds of thousands of tokens), this causes unbearable latency and cost. Empirically, most long-distance attention computations in long sequences are not equally important: the attention matrix is markedly sparse (most attention scores are close to zero).
If we can only calculate on a "small amount of relevant context", we can significantly reduce the attention calculation cost of the model.
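To make the cost argument concrete, here is a rough complexity comparison. The symbols are illustrative rather than the paper's exact notation: L is the sequence length, d the head dimension, B an assumed block size, and k the number of selected blocks per query.

```latex
% Dense causal attention: every query interacts with all preceding tokens.
\mathrm{Cost}_{\text{dense}} = O(L^2 d)

% Block-sparse attention: each query interacts with only k selected
% key-value blocks of size B, with kB \ll L, so the cost grows linearly in L.
\mathrm{Cost}_{\text{sparse}} = O(L \cdot k B \cdot d)
```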
Sparse attention replaces the dense paradigm of "each query token interacts with all key-value pairs" with the sparse paradigm of "each query token only interacts with a selected subset".
The core consists of two steps (a code sketch follows the list):
Block selection: Split the context into key-value blocks and determine the key-value subset that needs to participate in the attention calculation for each query.
Sparse attention calculation: Perform attention calculation only on the selected subset.
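A minimal PyTorch sketch of these two steps is shown below, assuming a single attention head and illustrative block size and Top-K values. It omits causal masking and padding masks for brevity and is not the paper's fused CUDA kernel.

```python
import torch
import torch.nn.functional as F

def block_sparse_attention(q, k, v, block_size=64, top_k=4):
    """Toy single-head block-sparse attention.
    Step 1: select top-k key-value blocks per query token.
    Step 2: attend only over the selected blocks.
    q, k, v: [seq_len, dim]. Causal masking is omitted for brevity."""
    seq_len, dim = k.shape
    n_blocks = (seq_len + block_size - 1) // block_size

    # Pad K/V to a multiple of the block size and view them as blocks.
    pad = n_blocks * block_size - seq_len
    k_blocks = F.pad(k, (0, 0, 0, pad)).view(n_blocks, block_size, dim)
    v_blocks = F.pad(v, (0, 0, 0, pad)).view(n_blocks, block_size, dim)

    # Step 1 (block selection): score each block by the query's similarity to
    # a mean-pooled block representation, then keep the top-k blocks per query.
    block_repr = k_blocks.mean(dim=1)                     # [n_blocks, dim]
    block_scores = q @ block_repr.T                       # [seq_len, n_blocks]
    top_k = min(top_k, n_blocks)
    selected = block_scores.topk(top_k, dim=-1).indices   # [seq_len, top_k]

    # Step 2 (sparse attention): each query attends only to its selected blocks.
    out = torch.empty_like(q)
    for t in range(seq_len):
        ks = k_blocks[selected[t]].reshape(-1, dim)       # [top_k * block_size, dim]
        vs = v_blocks[selected[t]].reshape(-1, dim)
        weights = F.softmax(q[t] @ ks.T / dim ** 0.5, dim=-1)
        out[t] = weights @ vs
    return out
```

In a real system the selection and attention are fused GPU kernels; the Python loop here only makes the two steps explicit.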
Trainable sparse attention introduces the sparse mechanism during the model training process, which can systematically improve the efficiency and quality of the model in long-text scenarios.
However, the main existing representative of this line of work is the NSA architecture proposed by DeepSeek.
Although NSA uses a mature block-sparse structure and ships a dedicated CUDA kernel, its architecture is significantly mismatched with the mainstream "short-sequence pre-training, long-sequence fine-tuning" paradigm: it introduces three independent sets of KV caches and three attention branches, which makes convergence unstable during long-sequence fine-tuning and adds substantial extra overhead in short-sequence scenarios.
In response to the above pain points, InfLLM-V2 proposes a trainable sparse path of "zero additional parameters and seamless short-long switch", which completes the smooth switch from dense to sparse without changing the original attention parameters.
Seamless short-long switch: It uses only one shared set of key-value caches (zero additional parameters) and merges NSA's multiple branches into a single branch. It is fully aligned with dense attention in both parameters and computation, and dynamically switches between dense and sparse modes according to the sequence length, making training more stable.
Dual efficiency optimization for short and long sequences: Short texts use dense attention directly, with zero extra overhead and no performance regression; long texts use a unified sparse paradigm that accelerates the entire prefill and decode process.
Hardware-friendly block selection: It replaces the MLP-based block compression with a parameter-free pooling operation, modifies the compressed attention (Compressed Attention in the figure) to produce only selection scores and compute the Top-K, and shares the Top-K within each GQA group, enabling better kernel fusion and preventing block selection itself from becoming the efficiency bottleneck in place of sparse attention (a simplified sketch follows).
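The selection path might look roughly like the sketch below. The shapes, block size, Top-K, and group size are assumptions for illustration, and kernel fusion, causality, and the exact scoring rules are simplified; it only shows the three ideas above: parameter-free pooling, score-only compressed attention, and Top-K shared within a GQA group.

```python
import torch

def select_blocks_gqa(q, k, block_size=64, top_k=16, group_size=4):
    """Toy block-selection sketch (assumed shapes, no kernel fusion).
    q: [n_q_heads, seq_len, dim], k: [n_kv_heads, seq_len, dim],
    with n_q_heads = n_kv_heads * group_size and seq_len divisible by
    block_size (assumed for brevity)."""
    n_q_heads, seq_len, dim = q.shape
    n_kv_heads = n_q_heads // group_size
    n_blocks = seq_len // block_size

    # Parameter-free pooling instead of an MLP compressor: mean-pool each
    # key block into a single block representation.
    k_pooled = k.view(n_kv_heads, n_blocks, block_size, dim).mean(dim=2)

    # "Compressed attention" that only produces selection scores: score the
    # queries against pooled blocks, with no weighted value aggregation.
    q_grouped = q.view(n_kv_heads, group_size, seq_len, dim)
    scores = torch.einsum("hgtd,hbd->hgtb", q_grouped, k_pooled)

    # Share the Top-K within each GQA group: sum scores over the group's
    # query heads so every head in the group reads the same key-value blocks.
    group_scores = scores.sum(dim=1)                      # [n_kv_heads, seq_len, n_blocks]
    top_k = min(top_k, n_blocks)
    selected = group_scores.topk(top_k, dim=-1).indices   # [n_kv_heads, seq_len, top_k]
    return selected  # block indices fed to the sparse attention kernel
```

Sharing the Top-K across a GQA group means the selected key-value blocks can be loaded once per group rather than once per query head, which is plausibly where much of the HBM I/O saving comes from.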
With the support of these techniques, InfLLM-V2 can train a sparse attention model with only 5B tokens!
Comparison with DeepSeek Sparse Attention
It is worth noting that on September 29th, DeepSeek-V3.2-Exp proposed an upgraded version of NSA - DeepSeek Sparse Attention (DSA).
DSA abandons NSA's design of three independent sets of KV caches and three attention branches, and introduces the sparse attention algorithm in the post-training stage.
Experimental conclusions
Using the MiniCPM4 base model, the researchers compared the effect of different sparse attention algorithms on long-text understanding and deep-thinking tasks.
Long-text understanding tasks
In evaluations of long-text understanding tasks such as RULER, LongBench, and LongPPL, InfLLM-V2 achieved performance comparable to the dense attention model, demonstrating its superiority, whereas the other sparse attention methods cause some degree of performance degradation.
The NSA method adds a large number of parameters, and after only a small amount of long-text training the model cannot capture the semantic associations between distant parts of the long context.
Deep-thinking tasks
In deep-thinking tasks such as mathematics and code, InfLLM-V2 can achieve performance comparable to that of dense attention, while the NSA method has a greater impact on the model effect.
As more and more tasks require the model to perform more in-depth reasoning and analysis, "how to efficiently accelerate the thinking process of the model" has become an important research direction at present. InfLLM-V2 fully demonstrates the potential of sparse attention in deep-thinking scenarios.
Efficiency evaluation
The researchers evaluated the inference efficiency of InfLLM-V2 on two GPUs, the A100 and the RTX 4090.
The results show that InfLLM-V2 achieves significant acceleration over dense attention: on 128K-token long texts, InfLLM-V2 reaches a 4-9x operator-level speedup.
Decomposition analysis and ablation experiments show that the efficient block selection design is the key source of acceleration.
In the end-to-end evaluation, InfLLM-V2 achieved approximately 2.1× and 2.3× acceleration in prefill and decode respectively.
Operator speed evaluation
End-to-end speed evaluation
The first open-source native sparse attention model MiniCPM4/MiniCPM4.1
In June this year, OpenBMB and Tsinghua University proposed the InfLLM-V2 architecture and jointly released the first open-source native sparse attention model MiniCPM4 based on this architecture. In early September, the hybrid-thinking version MiniCPM4.1 was open-sourced.
MiniCPM4.1 ranked first among models of the same size in terms of the comprehensive average score in many deep-thinking tasks.
MiniCPM4.1 fully utilizes efficient algorithms such as sparse attention and speculative sampling. In tests of code and mathematical reasoning such as LiveCodeBench and AIME, its inference speed is more than 3 times faster than that of open-source models of the same size such as Qwen3-8B.
The researchers said that they will continue to optimize the training and inference operators of InfLLM-V2 and integrate InfLLM-V2 into mainstream inference frameworks such as SGLang.
At the same time, to promote research on sparse attention mechanisms, they will also successively open-source the base model and the long-text training data used in the paper.
Reference materials:
https://www.arxiv.org/pdf/2509.24663
This article is from the WeChat public account "New Intelligence Yuan". Author: LRST. Republished by 36Kr with permission.