Transformer Terminator: Google DeepMind's brand-new MoR architecture launches, and a new-generation "demon king" arrives.
[Introduction] Is the Transformer killer here? The MoR architecture just released by KAIST, Google DeepMind, and other institutions doubles inference speed and halves memory usage, directly reshaping the performance frontier of large language models (LLMs) and comprehensively outperforming the traditional Transformer. Netizens are exclaiming in shock: another game-changing bombshell has arrived.
Just now, teams from KAIST, Mila, and Google DeepMind dropped a bombshell:
A brand-new LLM architecture called Mixture-of-Recursions (MoR).
This brand-new architecture is considered by the industry to have the potential to become the Transformer killer!
It doubles inference speed, reduces training floating-point operations (FLOPs), and directly halves key-value (KV) cache memory.
Ultimately, across parameter scales from 135M to 1.7B, MoR draws a new Pareto frontier: at the same training FLOPs, it achieves lower perplexity, higher few-shot accuracy, and more than double the throughput.
It comprehensively outperforms the traditional Transformer!
Paper link: https://arxiv.org/abs/2507.10524
In fact, the research community has long recognized that the Transformer comes with extremely high complexity and staggering compute requirements.
For example, Albert Gu, a top expert at CMU and author of the Mamba architecture, recently argued that the Transformer's capabilities are highly limited and that so-called tokens are nonsense.
Logan Kilpatrick, a product lead at Google, has publicly pointed out a flaw of the attention mechanism: it cannot achieve infinite context. He emphasized the need for comprehensive innovation at the core-architecture level.
Today's research from Google DeepMind coincides with the views of these top experts.
In response, netizens are all exclaiming in shock.
Some people predict that latent space reasoning may bring the next major breakthrough.
Obviously, for tasks that involve hierarchical decomposition, such as code, mathematics, and logic, MoR is a game-changing bombshell.
Some people even commented: It seems like Hinton's Capsule Network has been reborn.
Google DeepMind Unleashes a Game-Changer: Recursive Magic Slims Down and Speeds Up LLMs
As LLMs have developed to the present, what should be done next? Should we increase the number of parameters and layers to make them smarter?
This research tells us: True masters never rely on brute - force scaling but on the art of design.
This time, their new Mixture-of-Recursions (MoR) architecture directly doubles the inference speed of LLMs!
So, what exactly does MoR do?
In short, it does the following two things.
1. Treat Tokens Differently
When an LLM processes text, it splits sentences into individual tokens. However, simple function words such as "of", "is", and "in" don't require deep reasoning; a single forward pass is enough. Complex tokens, on the other hand, need to pass through the same layer stack multiple times.
The smart part of MoR lies in its ability to treat tokens differently.
MoR's secret weapon is a small router that scores each token's hidden state. Only tokens with high scores continue to recurse; the rest exit early.
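To make this concrete, here is a minimal sketch of what such a router could look like in PyTorch. This is an illustrative assumption rather than the paper's reference implementation; the class name `RecursionRouter` and the sigmoid scoring are our own choices.

```python
import torch
import torch.nn as nn

class RecursionRouter(nn.Module):
    """Tiny router: maps each token's hidden state to a scalar score in (0, 1)."""

    def __init__(self, hidden_size: int):
        super().__init__()
        # One linear layer is enough to produce a single score per token.
        self.scorer = nn.Linear(hidden_size, 1)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: [batch, seq_len, hidden_size]
        # High scores mean "keep recursing"; low scores mean "exit early".
        return torch.sigmoid(self.scorer(hidden_states)).squeeze(-1)
```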
2. Recursive Reuse: One Module to Rule Them All
The traditional Transformer's approach is to continuously "stack layers". The more layers, the stronger the processing ability. However, the cost is memory and computing power: the model becomes slower and more expensive.
MoR does the opposite. It is built around a shared block of layers, and each token cycles through it at most four times. As soon as the router says "done", the token jumps out of the loop early.
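Combined with the router above, the shared-block loop can be sketched as follows. This is a simplified illustration: the fixed 0.5 exit threshold and the masked update are assumptions made for readability, whereas a real MoR implementation would skip computation for exited tokens entirely and use the routing schemes described later.

```python
class SharedRecursiveBlock(nn.Module):
    """One shared stack of layers, reused up to `max_recursions` times per token."""

    def __init__(self, hidden_size: int, num_heads: int = 8, max_recursions: int = 4):
        super().__init__()
        # A single Transformer layer stands in for the shared layer stack.
        self.block = nn.TransformerEncoderLayer(
            d_model=hidden_size, nhead=num_heads, batch_first=True
        )
        self.router = RecursionRouter(hidden_size)
        self.max_recursions = max_recursions

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # `active` marks tokens still looping; the rest have exited early.
        active = torch.ones(x.shape[:2], dtype=torch.bool, device=x.device)
        for _ in range(self.max_recursions):
            if not active.any():
                break
            updated = self.block(x)  # same weights on every pass
            # Only active tokens take the update; exited tokens keep their state.
            x = torch.where(active.unsqueeze(-1), updated, x)
            # Tokens whose router score drops below the threshold stop recursing.
            active = active & (self.router(x) > 0.5)
        return x
```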
In short, if the Transformer is a large factory assembly line, then MoR is more like an efficient special forces unit. In the future of AI, the competition may no longer be about who is bigger but who is better at division of labor, scheduling, and saving resources.
Google DeepMind has keenly grasped this point and demonstrated an early example of this trend.
True Adaptive Computing
Relying solely on the scaling law to make language models larger can indeed significantly boost their capabilities, but the computing power and cost required for training and deployment also skyrocket.
Common "slimming" techniques currently include either sharing parameters (to save video memory) or computing on - demand (to save computing power).
However, there is still a lack of an architecture that can organically integrate the two.
"Mixture - of - Recursions (MoR)" fully exploits the potential of the recursive Transformer (see Figure 1) and successfully integrates the two.
Figure 1: Overview of Mixture - of - Recursions (MoR)
(Left) Each recursive step consists of a fixed layer stack and a router (the gray boxed area in the middle) that decides whether a token continues to recurse.
(Middle) The complete model structure, where the shared recursive step is applied to each token at most N_r times based on the routing decisions.
(Right) An example routing pattern showing token-level recursion depth. The darker the color, the more recursion steps the token stays active for in the shared block. The numbers at the bottom mark, in different colors, the number of recursion steps for each text token: 1, 2, or 3 steps.
In a unified architecture, MoR simultaneously achieves three efficiency optimizations:
- Compresses the parameter count through weight sharing;
- Cuts redundant computation through dynamic routing;
- Reduces memory overhead through smart caching.
The Mixture-of-Recursions Architecture
During pre-training and inference, MoR dynamically adjusts the number of recursion steps for each token, relying on two major components:
the routing mechanism and the KV cache strategy.
Routing Mechanism: Expert-Choice Routing vs. Token-Choice Routing
Inspired by the top-k gating mechanism, the researchers proposed expert-choice routing (see Figure 2a).
In this mode, each recursion depth can be regarded as an "expert", and in each round of recursion these experts select the top-k tokens they consider most worth processing.
To make the recursion more consistent, the team also introduced a hierarchical filtering mechanism: only tokens selected at the r-th recursion step are eligible to be considered at the (r+1)-th step.
This design mimics an early-exit mechanism, letting the model learn early in training to automatically "filter" out the tokens that require deep processing and concentrate compute on the most difficult ones.
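As a rough sketch of how such expert-choice selection with hierarchical filtering might be implemented (the function name, tensor shapes, and the capacity argument `k` are illustrative assumptions):

```python
import torch

def expert_choice_step(scores: torch.Tensor,
                       prev_selected: torch.Tensor,
                       k: int) -> torch.Tensor:
    """Pick the top-k tokens for recursion step r.

    scores:        [batch, seq_len] router scores at step r
    prev_selected: [batch, seq_len] bool mask of tokens chosen at step r-1
    Hierarchical filtering: only tokens selected at step r-1 may compete at step r.
    """
    # Tokens that already exited are excluded from the competition.
    masked = scores.masked_fill(~prev_selected, float("-inf"))
    topk_idx = masked.topk(k, dim=-1).indices            # [batch, k]
    selected = torch.zeros_like(prev_selected)
    selected.scatter_(-1, topk_idx, True)
    # Guard against k exceeding the number of remaining candidates.
    return selected & prev_selected
```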
Unlike the former, token-choice routing (see Figure 2b) decides at the very beginning how many recursion steps each token will go through.
Specifically, based on the hidden state from the first layer, the model computes a score for each expert (e.g., via softmax or sigmoid).
Assuming there are N_r experts, each corresponding to one recursion depth, the model assigns the token to the highest-scoring expert i. The token is then sent through the first i recursion steps, processed one after another.
In this way, a token's recursion depth is fixed the moment it enters the network, avoiding re-selection at every step and improving inference efficiency.
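A hedged sketch of this up-front assignment, assuming a softmax over N_r recursion "experts" (the function and argument names are illustrative, not from the paper):

```python
import torch

def token_choice_depths(hidden: torch.Tensor,
                        router_weight: torch.Tensor) -> torch.Tensor:
    """Assign each token a fixed recursion depth up front.

    hidden:        [batch, seq_len, hidden_size] hidden states after the first layer
    router_weight: [hidden_size, num_recursions] one column per recursion 'expert'
    Returns depths in {1, ..., num_recursions}: each token runs that many
    recursion steps in sequence, with no re-selection at later steps.
    """
    logits = hidden @ router_weight           # [batch, seq_len, num_recursions]
    probs = torch.softmax(logits, dim=-1)     # a sigmoid gate would also work here
    return probs.argmax(dim=-1) + 1           # expert i  ->  depth i
```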
Table 2 on the left compares the two methods:
The advantage of expert-choice routing is that it achieves ideal computational load balancing. However, it is prone to information leakage, because selecting the top-k tokens at each step means comparing tokens across the sequence, which can peek at future tokens.
In contrast, token-choice routing has no such leakage by construction, but its load distribution is uneven.
Table 2: Comparison of routing strategies and key-value cache strategies. (Left) Summary of the two routing strategies: expert-choice and token-choice. (Right) Relative cost-efficiency of the cache strategies compared to a vanilla Transformer.
Figure 2: Architectural components of Mixture-of-Recursions (MoR). (a) Expert-choice routing; (b) Token-choice routing; (c) KV cache strategies
KV Cache Strategy: Caching by Recursive Layer vs. Cross-Recursive Sharing
For the MoR model, the researchers proposed two KV cache strategies: caching by recursive layer and cross-recursive sharing.
1. Caching by recursive layer (the upper part of Figure 2c) is "selective caching": only tokens routed to a given recursive layer generate and store their KV pairs at that layer.
Attention is then computed only within the cache of the current recursive layer. This design promotes localized computation, significantly improving memory efficiency and reducing the I/O burden.
2. Cross-recursive sharing (Figure 2c): KV pairs are generated and cached only at the first recursive layer and then reused in all subsequent layers. Under this scheme, the number of queries taking part in the attention computation at each layer may shrink.
In other words, all tokens retain full access to the historical context without recomputation, regardless of whether they remain active at later layers.
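The contrast between the two strategies can be sketched as below. This is an illustrative simplification with hypothetical helper names; real implementations manage per-head, per-position caches rather than whole-tensor slices.

```python
def cache_by_recursive_layer(kv_per_step, active_masks):
    """'Caching by recursive layer': step r stores KV pairs only for the tokens
    routed to that step, and attention at step r reads only that step's cache.

    kv_per_step[r]  -> (keys, values), each [batch, seq_len, dim], computed at step r
    active_masks[r] -> [batch, seq_len] bool mask of tokens active at step r
    """
    caches = {}
    for r, ((keys, values), mask) in enumerate(zip(kv_per_step, active_masks)):
        # Only active tokens contribute entries, shrinking KV memory and I/O.
        caches[r] = (keys[mask], values[mask])
    return caches


def cross_recursive_sharing(kv_per_step):
    """'Cross-recursive sharing': KV pairs from the first recursive layer are
    cached once and reused by every later step, so every token keeps full
    access to the historical context without recomputation."""
    first_keys, first_values = kv_per_step[0]
    return {"shared": (first_keys, first_values)}
```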
Table 2 on the right compares the two cache strategies:
- Caching by recursive layer: The KV memory and I/O burden are compressed to about half of the original.
- Cross-recursive sharing: it reduces the attention computation only linearly, while the volume of KV read/write operations stays relatively high, which may become a performance bottleneck.
Table 3: Comparison of MoR, the recursive Transformer, and the vanilla Transformer under equal compute and equal numbers of training tokens
Experiments
The researchers pre-trained the models from scratch, using a Llama-based Transformer architecture and following the configuration of the open-source SmolLM models. They evaluated on the FineWeb-Edu validation set and six few-shot benchmarks.
Main Results
With the Same Training Computational Budget, MoR Outperforms the Baseline Model with Fewer Parameters
With the same training budget (