Transformer in danger: Google unveils the MoR architecture, halving memory usage and doubling inference speed.
Surpassing the Transformer, Google introduces a brand-new underlying architecture: Mixture-of-Recursions (MoR). Note that it's not MoE. It can double inference speed and directly halve the KV cache memory!
Moreover, it's an all-in-one solution: for the first time, a single framework uses one set of parameters to handle different tasks while dynamically allocating computing resources.
It's like giving the LLM a double-layer buff, aiming at both model performance and efficiency.
Google DeepMind, in collaboration with teams from KAIST AI and Mila, combines unified parameter sharing, adaptive recursion depth, and efficient KV caching to cut compute and memory costs while maintaining large-model performance, forming a new optimum for efficiency.
Many netizens even describe it as a Transformer Killer.
Some even claim that the emergence of this architecture may indicate that latent space reasoning could be the next breakthrough for LLMs.
In which aspects, specifically, is MoR innovative? The sections below detail each one.
MoR: The First to Unify Parameter Sharing and Adaptive Computation
Although the Transformer brought excellent few-shot generalization and reasoning capabilities, its huge compute and memory requirements still make training and deployment difficult.
Current optimization methods mainly fall into parameter sharing and adaptive computation, but one usually has to choose between them and cannot have both.
So researchers proposed Mixture-of-Recursions (MoR), which fuses both efficiency dimensions in a single Recursive Transformer.
First, the Recursive Transformer it builds on. Unlike a standard Transformer, which processes tokens through a stack of unique layers, it divides the model into recursion blocks that reuse a shared pool of parameters.
It mainly includes three parameter-sharing strategies (a minimal sketch follows the list):
- Cycle: reuse the layers cyclically.
- Sequence: reuse the same layer consecutively.
- Middle variant: keep unique parameters for the first and last layers and share only the middle layers.
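To make the idea of a shared parameter pool concrete, here is a minimal PyTorch-style sketch of the "Cycle" strategy. The class name, layer sizes, and recursion count are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class CycleRecursiveTransformer(nn.Module):
    """Sketch of 'Cycle' parameter sharing: a small shared pool of layers is
    replayed cyclically, giving an effective depth of n_shared * n_recursions
    with only n_shared unique layers."""
    def __init__(self, d_model=512, n_heads=8, n_shared=2, n_recursions=3):
        super().__init__()
        self.shared_pool = nn.ModuleList([
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
            for _ in range(n_shared)
        ])
        self.n_recursions = n_recursions

    def forward(self, x):
        # Each recursion step reuses the same shared layers in order.
        for _ in range(self.n_recursions):
            for layer in self.shared_pool:
                x = layer(x)
        return x

x = torch.randn(4, 16, 512)            # (batch, sequence, hidden)
y = CycleRecursiveTransformer()(x)     # same shape, 3x the effective depth
```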
Parameter sharing reduces the number of unique parameters, improves distributed-training efficiency, and eliminates computational "bubbles" through continuous depth-wise batching, thereby increasing inference throughput.
MoR then adopts a dynamic routing mechanism: a lightweight router assigns a different recursion depth to each token, concentrating computation on complex tokens. It comes in two types (sketched after the list):
- Expert-choice routing: treat each recursion step as an "expert". Tokens are scored from their hidden states, those passing a threshold continue the computation, and hierarchical filtering preferentially allocates compute to complex tokens.
- Token-choice routing: assign each token a fixed recursion depth at the start. The router picks the "expert" (i.e., the depth) via softmax/sigmoid, and each token then completes exactly that many recursion steps.
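Below is a minimal sketch of expert-choice-style selection, assuming a simple linear scorer and a fixed keep ratio; the paper's exact scoring function, thresholds, and auxiliary losses are not reproduced here.

```python
import torch
import torch.nn as nn

class ExpertChoiceRouter(nn.Module):
    """Sketch: at each recursion step the router scores every token and only
    the highest-scoring fraction continues to the next (deeper) recursion."""
    def __init__(self, d_model=512, keep_ratio=0.5):
        super().__init__()
        self.scorer = nn.Linear(d_model, 1)   # lightweight router
        self.keep_ratio = keep_ratio

    def forward(self, hidden):                # hidden: (batch, seq, d_model)
        scores = torch.sigmoid(self.scorer(hidden)).squeeze(-1)   # (batch, seq)
        k = max(1, int(hidden.size(1) * self.keep_ratio))
        top_idx = scores.topk(k, dim=-1).indices
        keep_mask = torch.zeros_like(scores, dtype=torch.bool)
        # True = "complex" token that gets another recursion step
        keep_mask.scatter_(1, top_idx, torch.ones_like(top_idx, dtype=torch.bool))
        return keep_mask, scores

router = ExpertChoiceRouter()
mask, _ = router(torch.randn(2, 16, 512))     # roughly half the tokens recurse deeper
```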
In addition, MoR uses dedicated KV caching strategies to manage how key-value pairs are stored and used while improving memory efficiency (a simplified sketch follows the list):
- Recursion-wise caching: only cache the KV pairs of tokens active at the current recursion step, restrict attention to this local cache, and thereby reduce memory and IO requirements.
- Recursive KV sharing: reuse the KV pairs from the first recursion for all subsequent steps, so every token can still access the historical context and prefill work is reduced; the reduction in attention computation is smaller in this case.
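The sketch below illustrates only the recursion-wise caching idea: at each recursion step, K/V entries are produced solely for the tokens the router kept active, so the cache shrinks as tokens exit early. The random projection matrices and mask format are placeholders, not the actual attention implementation.

```python
import torch

def recursion_wise_kv_cache(hidden, step_masks):
    """hidden: (batch, seq, d_model); step_masks: one boolean mask of shape
    (seq,) per recursion step marking the tokens still active at that depth."""
    d_model = hidden.size(-1)
    w_k = torch.randn(d_model, d_model) * 0.02   # stand-in key projection
    w_v = torch.randn(d_model, d_model) * 0.02   # stand-in value projection
    caches = []
    for mask in step_masks:
        active = hidden[:, mask]                 # only active tokens are cached
        caches.append({"k": active @ w_k, "v": active @ w_v})
    return caches

hidden = torch.randn(2, 16, 512)
step_masks = [torch.rand(16) < p for p in (1.0, 0.5, 0.25)]  # fewer tokens each step
caches = recursion_wise_kv_cache(hidden, step_masks)
print([c["k"].shape for c in caches])            # cache shrinks with recursion depth
```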
Under the combined action of these three strategies, MoR performs latent thinking while decoding each token. The routing mechanism lets the model reason adaptively, breaking the previous limitation of a fixed thinking depth and unifying parameter efficiency with adaptive computation.
Outperforming the Transformer
Researchers ran comparative experiments on models from 135M to 1.7B parameters, covering the vanilla Transformer, recursive baselines, and MoR.
The experiments show that under the same training budget of 16.5e18 FLOPs, MoR uses nearly 50% fewer parameters yet achieves a lower validation loss and a higher average few-shot accuracy of 43.1%.
The vanilla model reaches 42.3% under the same budget, indicating that MoR is more compute-efficient and can process more training tokens for the same FLOPs.
When training on a fixed 20B tokens, MoR also cuts training FLOPs by 25%, shortens training time by 19%, and reduces peak memory by 25%.
In addition, an analysis of routing strategies found that expert-choice routing somewhat outperforms token-choice routing, indicating that routing granularity has an important impact on performance.
Researchers also conducted an IsoFLOP analysis on MoR and found that under the parameter scales of 135M, 360M, 730M, and 1.7B, and the FLOPs budgets of 2e18, 5e18, and 16.5e18, MoR always outperforms the recursive baseline model.
Limited by a recursion capacity bottleneck, MoR is slightly worse than the vanilla model at 135M. However, as the scale grows to 360M and beyond, MoR's performance matches or even exceeds the vanilla model's with only about one third of the parameters, verifying MoR's scalability.
In the inference-throughput evaluation, the 360M-scale MoR model outperforms the vanilla model under both fixed and maximum batch-size settings.
Greater recursion depth lets more tokens exit early, reduces KV-cache occupancy, and significantly improves throughput, confirming the deployment gains from combining depth-wise batching with early exit.
Google's Rethinking of the Underlying Architecture
This is not the first time Google has rethought the underlying architecture. In fact, Google always hopes to reconstruct the computing paradigm through architectural innovation and find a new balance for AI.
For example, the Mixture of Experts (MoE) model is a concentrated manifestation of this concept.
As early as 2017, Google first introduced MoE into the LSTM layer. Its sparse gating mechanism activates only a subset of the expert networks for each input, yet a model with up to 137B parameters can still be trained efficiently (a simplified gating sketch follows).
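For readers unfamiliar with that 2017 design (arXiv:1701.06538, reference [4] below), here is a heavily simplified sketch of top-k sparse gating; expert sizes, the linear gate, and the loop-based dispatch are illustrative, and the original's noisy gating and load-balancing losses are omitted.

```python
import torch
import torch.nn as nn

class SparseMoE(nn.Module):
    """Simplified top-k sparse gating: each token activates only k of the
    n_experts expert networks, so most parameters stay idle per input."""
    def __init__(self, d_model=256, n_experts=8, k=2):
        super().__init__()
        self.gate = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList([nn.Linear(d_model, d_model)
                                      for _ in range(n_experts)])
        self.k = k

    def forward(self, x):                        # x: (tokens, d_model)
        top_val, top_idx = self.gate(x).topk(self.k, dim=-1)
        weights = torch.softmax(top_val, dim=-1) # renormalize over chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                sel = top_idx[:, slot] == e      # tokens routed to expert e
                if sel.any():
                    out[sel] += weights[sel, slot].unsqueeze(-1) * expert(x[sel])
        return out

out = SparseMoE()(torch.randn(32, 256))          # only 2 of 8 experts per token
```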
The later GShard combined MoE with the Transformer to achieve dynamic load balancing, and the 2021 Switch Transformer further simplified the routing mechanism.
Gemini 1.5 Pro uses a hierarchical MoE architecture that deeply integrates expert networks with multimodal processing, handling more complex multimodal tasks and significantly improving training and serving efficiency.
The underlying design logic of MoE overcomes the computational drawbacks of traditional fully-connected models and has become the preferred choice for many ultra-large-scale models, offering a new paradigm for coping with the compute bottleneck.
In addition, there are scalable architectures like TokenFormer, which treat model parameters as learnable tokens and expand the model scale seamlessly through incremental training, making low-cost iteration of future trillion-parameter models possible.
So, regarding the MoR that Google has now introduced, some netizens wonder: will it completely change the rules of the AI world? Will it surpass the Transformer? Time will tell.
Reference Links
[1]https://x.com/deedydas/status/1945313404958466519
[2]https://www.alphaxiv.org/abs/2507.10524
[3]https://x.com/reza_byt/status/1945498424536862841
[4]https://arxiv.org/abs/1701.06538
This article is from the WeChat official account “Quantum Bit”, author: Lu Yu. It is published by 36Kr with authorization.