
Chinese students make a splash: the new champion Mamba-3 strikes at the Transformer's Achilles' heel, with inference efficiency seven times higher.

新智元 | 2026-03-19 10:33
Is the Transformer in danger? Today, the original team from CMU and Princeton is back: the new-generation open-source architecture Mamba-3 has made a stunning debut. At 1.5 billion parameters its performance is off the charts, 4 percentage points higher than the Transformer's.

The "Transformer killer" architecture has received a major upgrade!

Just today, the original team behind the Mamba architecture officially released the latest generation of its open-source architecture, Mamba-3.

Paper link: https://arxiv.org/pdf/2603.15569

Compared with Mamba-2, Mamba-3 makes three major changes to the core SSM:

  • Improved the discretization process so that it can simulate convolution;
  • Moved state transitions into the complex domain to improve state tracking;
  • Adopted a MIMO formulation to raise inference utilization, improving model quality while maintaining decoding speed.

The results show that with only half the internal state size, Mamba-3 matches the strength of Mamba-2.

At a scale of 1.5 billion parameters, the MIMO version of Mamba-3 reaches an average accuracy of 57.6%, 4 percentage points higher than the Transformer's.

In long-sequence tasks, the end-to-end latency of Mamba-3 is only one-seventh of the Transformer's.

Targeting the Transformer's Achilles' heel

Mamba-3 turns the tables

In 2017, the Transformer architecture emerged and became the cornerstone of today's LLMs.

However, it is a veritable "computing-power black hole". As the conversation grows longer, compute requirements grow quadratically and memory usage grows linearly, making large-scale inference extremely expensive.

To break this deadlock, the first Mamba architecture came into being in 2023.

In mid-2024, Mamba-2 was released, further establishing the mathematical equivalence between SSMs and the attention mechanism and raising training speed by 2-8 times.

Now, Mamba-3, jointly advised by Albert Gu and Tri Dao and developed mainly by four student researchers, makes its debut with a brand-new design philosophy.

Mamba-3 represents a paradigm shift: from pursuing training efficiency to an "inference-first" design.

As Albert Gu put it, the focus of Mamba-2 was to break the pre-training bottleneck, while Mamba-3 sets out to solve the "cold GPU" problem:

That is, during decoding, modern hardware often waits idly for data transfers (memory movement) instead of actually performing calculations.

The secret of high efficiency: the summarization machine

As a state-space model (SSM), Mamba-3 works like an efficient "summarization machine".

Its core logic is fundamentally different from that of Transformer.

Every time Transformer generates a word, it has to review all historical tokens to understand the context. The longer the history, the heavier the burden.

Mamba-3, by contrast, compresses historical information in real time into a fixed-size "internal state", which you can think of as a "snapshot" of the data history.

Whenever new information comes in, the architecture only needs to update the snapshot without re - reading the full text. This is the fundamental reason why SSM can achieve fixed memory and linear computation.
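The "update the snapshot, never re-read the past" loop can be sketched in a few lines of Python. This is a toy scalar recurrence, not Mamba-3's actual kernel; the constants `a`, `b`, `c` are illustrative:

```python
def ssm_scan(xs, a=0.5, b=1.0, c=1.0):
    """Toy scalar SSM: h_t = a*h_{t-1} + b*x_t, y_t = c*h_t."""
    h = 0.0          # the fixed-size "snapshot" of all history so far
    ys = []
    for x in xs:     # one O(1) update per token; the past is never re-read
        h = a * h + b * x
        ys.append(c * h)
    return ys

# Memory stays constant and compute grows linearly in len(xs):
print(ssm_scan([1.0, 0.0, 0.0]))  # [1.0, 0.5, 0.25]
```

The state `h` plays the role of the snapshot: each new token folds into it, and an old input's influence decays geometrically rather than being stored verbatim.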

For SSM, the size of this "snapshot" (i.e., the state size) is the core knob that determines performance:

The larger the state, the more information it can compress, and the smarter the model. However, the cost of moving data during inference also increases, and the speed slows down.

Conversely, if the state is reduced by half, the speed can double, but the model may become less intelligent.

This is where Mamba-3 makes a breakthrough: it uses only half the state size of Mamba-2 yet achieves comparable language-modeling performance.

Same intelligence at double the speed: it's like shifting the SSM performance-efficiency frontier out a notch.

Inference - first, with three core strategies

How does Mamba-3 achieve this? Behind it is a brand-new design philosophy: rethinking the relationship between the "intelligence" of AI and the speed of the hardware running it.

If Mamba-2 was designed to set records in training speed, Mamba-3 is an "inference-first" architecture.

Inference, here, means the process of users actually running the model: on ChatGPT, Gemini, or through APIs.

The core goal of Mamba-3 is to make the most of every second of GPU activity, letting the model do the most intensive "thinking" possible without making users wait.

To achieve this goal, Mamba-3 deploys three strategies:

  • Mathematically, a more precise discretization formula makes the model's "memory" more accurate;
  • In terms of capability, complex-valued states act like an "internal compass" for the model, making up for its shortfall in logical reasoning;
  • On the hardware side, the MIMO mechanism keeps the chip from sitting idle, squeezing out unused compute so the model can "think" harder on each generated word while the user's waiting time stays unchanged.

Next, let's break them down one by one.

Three core technologies

Exponential trapezoidal discretization: a precision leap from first order to second order

The discretization methods used in Mamba-1 and Mamba-2 are essentially first-order approximations, akin to estimating the area under a curve from the height at one endpoint.

Mamba-3 upgrades to an "exponential trapezoidal rule", which takes a weighted average of both endpoints, jumping the precision from first order to second order.

Although this seems like just a minor mathematical adjustment, the effect is unexpected.

It implicitly introduces a data-dependent convolution of width 2 into the SSM's state input, turning the short causal-convolution module, which was essential in Mamba-2, into an optional component.

Ablation experiments show that the combination of exponential trapezoidal discretization with bias terms on B and C can completely replace the external short convolution that almost all linear models used to rely on: an important step toward architectural simplification.
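To see why averaging two endpoints buys an order of accuracy, here is a minimal numerical sketch. It compares the plain trapezoidal (bilinear) step against forward Euler on the test ODE dh/dt = -h, whose exact one-step factor is exp(-dt); the paper's "exponential" variant differs in detail, so this only illustrates the first-order-vs-second-order gap:

```python
import math

def euler_step(dt):
    # First-order: uses only the left endpoint of the interval.
    return 1.0 - dt

def trapezoid_step(dt):
    # Second-order: weighted average of both endpoints (bilinear transform).
    return (1.0 - dt / 2.0) / (1.0 + dt / 2.0)

dt = 0.1
exact = math.exp(-dt)
print(abs(euler_step(dt) - exact))      # ~5e-3 per-step error
print(abs(trapezoid_step(dt) - exact))  # ~8e-5 per-step error
```

Halving dt cuts the Euler error roughly 4x but the trapezoidal error roughly 8x, the signature of second-order accuracy.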

Complex-valued SSM: installing an "internal compass" for the model

For a long time, alternatives to the Transformer have had a "logical shortcoming": they often fail at simple state-tracking tasks (such as determining the parity of a binary sequence).

The root cause is that Mamba-2 restricts the state-transition matrix to real scalars, which cannot express "rotational" dynamics.

As an intuitive example, parity checking is essentially a flipping operation: every time a 1 is read, the state flips. Mathematically, that flip is a rotation, and real scalar transitions do not support rotation naturally.
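A toy sketch makes the argument concrete: multiplying a complex state by -1 (a 180-degree rotation, e^{i*pi}) tracks parity exactly, whereas a nonnegative real scalar transition can only shrink or grow the state, never flip it. This is an illustration of the idea, not the paper's model:

```python
def parity_via_rotation(bits):
    """Track parity of 1-bits with a complex state on the unit circle."""
    state = 1 + 0j
    for b in bits:
        if b == 1:
            state *= -1      # e^{i*pi}: rotate half a turn per 1-bit
    # State is +1 for even parity, -1 for odd parity.
    return 0 if state.real > 0 else 1

print(parity_via_rotation([1, 0, 1, 1]))  # three 1s -> odd parity -> 1
print(parity_via_rotation([1, 1]))        # two 1s -> even parity -> 0
```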

Mamba-3 solves this problem by introducing a complex-valued state space.

The analysis shows that the discretized complex SSM is equivalent to applying a data-dependent rotary position embedding (RoPE) to the B and C projections.

This means the efficient RoPE trick can implement the complex-number operations at almost negligible computational cost.
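The RoPE trick boils down to applying a 2x2 rotation to consecutive pairs of real channels, which is exactly complex multiplication by e^{i*theta} in disguise. A minimal sketch, with illustrative names rather than the paper's API:

```python
import math

def rope_rotate(vec, theta):
    """Rotate consecutive channel pairs of `vec` by angle theta."""
    c, s = math.cos(theta), math.sin(theta)
    out = []
    for i in range(0, len(vec), 2):
        x, y = vec[i], vec[i + 1]
        out.extend([c * x - s * y, s * x + c * y])  # 2x2 rotation per pair
    return out

# Rotating by theta then -theta recovers the input, just as multiplying
# by e^{i*theta} and then e^{-i*theta} cancels.
v = [1.0, 0.0, 0.5, -0.5]
restored = rope_rotate(rope_rotate(v, 0.3), -0.3)
print([round(x, 6) for x in restored])  # [1.0, 0.0, 0.5, -0.5]
```

The cost is a handful of multiply-adds per channel pair, which is why the text calls the overhead almost negligible.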

The data shows that on the parity-checking task, Mamba-3 achieves 100% accuracy, while Mamba-2 scores only 0.9%, no better than random guessing.

On the modular-arithmetic task, Mamba-3 reaches 98.51%, while Mamba-2 manages only 47.81%. The reasoning ability of linear models can finally match that of the most advanced systems.

MIMO: squeezing out every bit of idle computing power

Most current AI models are limited by "memory bandwidth".

One pair of numbers illustrates the problem: the arithmetic intensity of Mamba's standard SISO decoding is only about 2.5 ops/byte, while the bf16 tensor cores of an NVIDIA H100 balance at about 295 ops/byte.

In other words, more than 99% of the GPU's computing power is idle during decoding.
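The ">99% idle" claim follows directly from those two numbers; reproducing the back-of-envelope arithmetic:

```python
# Figures quoted in the text: ops performed per byte moved during SISO
# decode, versus the H100's bf16 tensor-core balance point.
decode_intensity = 2.5   # ops/byte
h100_capacity = 295.0    # ops/byte

# While memory-bound, compute utilization is intensity/capacity;
# everything else is idle tensor-core time.
idle_fraction = 1.0 - decode_intensity / h100_capacity
print(f"{idle_fraction:.1%}")  # 99.2%
```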

Mamba-3 introduces a multiple-input multiple-output (MIMO) formulation, changing the state update from an outer product into a matrix multiplication.

When the MIMO rank is 4, the computational volume per step increases to 4 times the original. However, since these calculations just fill the idle tensor cores, the decoding latency hardly increases.
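The shape of that trade can be sketched in plain Python (an illustration of rank-1 versus rank-r state updates, not the paper's kernel): the SISO step is a rank-1 outer product, while the rank-r MIMO step folds r input channels into the same-size state, doing r times the multiply-adds for the same amount of state moved through memory.

```python
def siso_update(state, b, x):
    """state[i][j] += b[i] * x[j]  (rank-1 outer product)."""
    for i in range(len(b)):
        for j in range(len(x)):
            state[i][j] += b[i] * x[j]

def mimo_update(state, B, X):
    """state[i][j] += sum_k B[i][k] * X[k][j]  (rank-r matmul)."""
    for i in range(len(B)):
        for j in range(len(X[0])):
            for k in range(len(X)):
                state[i][j] += B[i][k] * X[k][j]

n, d, r = 4, 4, 4                      # toy n x d state, MIMO rank r
s1 = [[0.0] * d for _ in range(n)]
s2 = [[0.0] * d for _ in range(n)]
siso_update(s1, [1.0] * n, [1.0] * d)                           # n*d FLOPs
mimo_update(s2, [[1.0] * r for _ in range(n)],
                [[1.0] * d for _ in range(r)])                  # n*d*r FLOPs
print(s1[0][0], s2[0][0])  # 1.0 4.0 -> r-fold more work, same state size
```

Because the state read and written per step is identical, the extra multiply-adds land in tensor cores that were otherwise idle, which is why latency barely moves.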

Kernel latency tests verify this. Under a common configuration of bf16 and state dimension 128, Mamba-3's SISO decode latency is only 0.156 milliseconds, faster than Mamba-2's 0.203 milliseconds; the MIMO version takes 0.179 milliseconds, still faster than Mamba-2.

In a nutshell, the philosophy of MIMO is not to make the GPU run faster, but to keep it busy.

Comprehensive superiority: from 180M to 1.5B

The research team ran a systematic comparison at four parameter scales (180M, 440M, 880M, 1.5B) against three baselines: Transformer, Mamba-2, and Gated DeltaNet (GDN).

All models used the same training recipe, 100B tokens of FineWeb-Edu data, and the Llama-3.1 tokenizer.

At the 1.5B scale, Mamba-3 MIMO ranked first with an average accuracy of 57.6%, ahead of the Transformer by 4 percentage points, Mamba-2 by 3.4 points, and GDN by 3.2 points.

Even the standard Mamba-3 SISO version, without MIMO, outperformed all non-Mamba-3 baselines at 56.4%.

In end-to-end inference latency, on a prefill + decode scenario of 16,384 tokens, Mamba-3 SISO took 140.61 seconds, while vLLM running Llama-3.2-1B took 976.50 seconds, making Mamba-3 almost 7 times faster.

As the sequence length increases, the advantages of linear models will only become more prominent.

What's more noteworthy is the extrapolation ability of context length. All models were only trained on a length of 2K and then directly tested on longer sequences.

The results show that Mamba-3's language-modeling performance improves steadily up to 32K, while Mamba-2 deteriorates rapidly once the training length is exceeded.

This indicates that Mamba-3 is not only stronger within the training distribution but also more robust when extrapolating beyond it.