Transformer Terminator? Google's new MoR architecture is out, and a new challenger to the throne has arrived

Is the Transformer killer here? The MoR architecture just released by KAIST, Google DeepMind and other institutions doubles inference speed and halves memory, redrawing the performance boundaries of LLMs and beating the traditional Transformer across the board. Netizens are calling it explosive: another game-changing bombshell has landed.

Just now, KAIST, Mila and Google DeepMind dropped a bombshell:

A new LLM model architecture called Mixture-of-Recursions.

This new architecture is considered by the industry to have the potential to become a Transformer killer!

Its inference speed is doubled, its training FLOPs are reduced, and its KV cache memory is roughly halved.

At parameter scales from 135M to 1.7B, MoR traces out a new Pareto frontier: for the same training FLOPs, it delivers lower perplexity, higher few-shot accuracy, and more than 2x higher throughput.

It crushes the traditional Transformer across the board!

Paper link: https://arxiv.org/abs/2507.10524

In fact, the research community realized early on that the Transformer's complexity is too high and its compute demands are staggering.

For example, CMU professor Albert Gu, author of the Mamba architecture, recently argued that Transformer models are too limited and that the whole notion of tokens is nonsense.

Logan Kilpatrick, a product lead at Google, has publicly pointed out a flaw of the attention mechanism: it cannot deliver infinite context. He has also stressed the need for sweeping innovation at the core architecture level.

Today's Google DeepMind study echoes the views of these heavyweights.

In response, netizens have called it truly explosive.

Some predict that reasoning in latent space could deliver the next major breakthrough.

Obviously, MoR is a game-changing bombshell for tasks such as code, mathematics, and logic.

Others even commented: It looks like Hinton’s capsule network has been reborn.

1. Google DeepMind pulls out a big move: recursive magic makes LLMs slimmer and faster

LLMs have come this far; what comes next? Do we just keep piling on parameters and layers to make them smarter?

This study tells us that the real master never relies on stacking, but on the art of design.

This time, the new architecture they created, MoR (Mixture-of-Recursions), directly doubles the inference speed of LLMs!

So, what exactly does MoR do?

In short, it does the following two things.

1. Do not treat all tokens equally

When LLMs process text, they break sentences into tokens. Words like "of", "is", and "in" require little deep reasoning; a single forward pass is enough. Complex tokens, by contrast, need to go through the same stack multiple times.

The clever thing about MoR is that it adapts the amount of computation to each token.

MoR's secret weapon is a small router that scores each token's hidden state: only the high-scoring tokens keep looping, while the rest exit early.
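To make this concrete, here is a minimal sketch of what such a per-token router might look like in PyTorch. The module name, the sigmoid scoring head, and the fixed threshold are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

class TokenRouter(nn.Module):
    """Illustrative per-token router: a tiny linear head scores each hidden state."""

    def __init__(self, hidden_dim: int):
        super().__init__()
        self.scorer = nn.Linear(hidden_dim, 1)

    def forward(self, hidden_states: torch.Tensor, threshold: float = 0.5):
        # hidden_states: [batch, seq_len, hidden_dim]
        scores = torch.sigmoid(self.scorer(hidden_states)).squeeze(-1)  # [batch, seq_len]
        keep_mask = scores > threshold  # True -> token keeps recursing, False -> early exit
        return scores, keep_mask

# Example: score a batch of 2 sequences of 8 tokens with hidden size 512.
router = TokenRouter(hidden_dim=512)
scores, keep = router(torch.randn(2, 8, 512))
print(keep.shape)  # torch.Size([2, 8])
```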

2. Recycle and reuse: one module does it all

The traditional Transformer's approach is to keep stacking layers: the taller the stack, the stronger the processing power. But the price is paid in memory and compute: the model becomes slower and more expensive.

MoR goes the other way. It uses a specially designed shared block that each token can loop through up to 4 times, and as soon as the router says "done", the token jumps out of the loop early.
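Building on the description above, the toy sketch below applies one shared block up to four times and lets the router trigger early exit. It is a simplified, assumption-laden illustration: nn.TransformerEncoderLayer stands in for the shared decoder block, the whole batch is masked rather than gathered, and the router is not trained here.

```python
import torch
import torch.nn as nn

class RecursiveMoRBlock(nn.Module):
    """Toy MoR-style forward pass: one shared block reused up to `max_recursions` times."""

    def __init__(self, hidden_dim: int, num_heads: int = 8, max_recursions: int = 4):
        super().__init__()
        # One block whose weights are reused at every recursion step.
        self.block = nn.TransformerEncoderLayer(hidden_dim, num_heads, batch_first=True)
        self.router = nn.Linear(hidden_dim, 1)
        self.max_recursions = max_recursions

    def forward(self, x: torch.Tensor, threshold: float = 0.5) -> torch.Tensor:
        active = torch.ones(x.shape[:2], dtype=torch.bool, device=x.device)
        for _ in range(self.max_recursions):
            if not active.any():                             # every token has already exited
                break
            y = self.block(x)                                # same shared weights each iteration
            x = torch.where(active.unsqueeze(-1), y, x)      # update only the active tokens
            scores = torch.sigmoid(self.router(x)).squeeze(-1)
            active = active & (scores > threshold)           # router says "done" -> leave the loop
        return x

# Example: 2 sequences of 16 tokens, hidden size 512.
model = RecursiveMoRBlock(hidden_dim=512)
out = model(torch.randn(2, 16, 512))
print(out.shape)  # torch.Size([2, 16, 512])
```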

In short, if the Transformer is a huge factory assembly line, MoR is more like an efficient special-forces unit. In the future, AI will probably no longer compete on sheer size, but on how well it divides labor, schedules work, and saves effort.

Google DeepMind has taken the lead here, demonstrating an early model of this trend.

2. True adaptive computing

Relying on scaling laws alone to make a language model bigger can indeed boost its capabilities dramatically, but the compute and cost required for training and deployment skyrocket along with it.

The common "slimming" tricks today are either parameter sharing (to save GPU memory) or on-demand computation (to save compute).

However, an architecture that organically integrates the two has been missing.

Mixture-of-Recursions (MoR) unlocks the potential of the recursive Transformer (see Figure 1) and successfully integrates the two.

Figure 1: Overview of Mixture-of-Recursions (MoR).

(Left) Each recursion step contains a fixed stack of layers and a router that decides whether a token continues to recurse (the gray box in the middle).

(Middle) The full model structure, where the shared recursion step is applied to each token up to Nr times, depending on the routing decision.

(Right) An example routing pattern showing token-level recursion depth: the darker the color, the more active the token is in the recursive block. The numbers at the bottom indicate, in different colors, the number of recursion steps for each text token: 1, 2, and 3.

In a unified architecture, MoR implements three efficiency optimizations at once:

  1. Shrink the parameter count through weight sharing (see the quick arithmetic sketch after this list);
  2. Cut redundant computation through dynamic routing;
  3. Cut memory overhead through smart caching.
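To give a feel for the first point, weight sharing, here is a back-of-the-envelope calculation with generic, illustrative numbers (not the paper's configuration): one block recursed several times carries roughly a third of the unique weights of a stack of distinct blocks with the same effective depth.

```python
# Rough per-block parameter count for a Transformer layer (biases and norms ignored).
hidden = 1024
ffn = 4 * hidden
params_per_block = 4 * hidden * hidden + 2 * hidden * ffn  # Q/K/V/output projections + 2 MLP matrices

effective_depth = 24
vanilla = effective_depth * params_per_block        # 24 distinct blocks
shared = (effective_depth // 3) * params_per_block  # 8 unique blocks, each recursed 3 times

print(f"vanilla: {vanilla / 1e6:.0f}M params, shared: {shared / 1e6:.0f}M params")
# vanilla: 302M params, shared: 101M params -- same effective depth, a third of the unique weights
```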

3. The Mixture-of-Recursions architecture

During pre-training and inference, MoR dynamically adjusts the number of recursion steps for each token, relying on two major components:

the routing mechanism and the KV caching strategy.

1. Routing mechanism: expert-choice vs. token-choice

Inspired by the top-k gating mechanism, the researchers proposed Expert-choice routing (see Figure 2a).

In this scheme, each recursion depth can be viewed as an "expert", and during each round of recursion these experts pick the top-k tokens they consider most worth processing.

To keep recursion consistent, the team also introduced hierarchical filtering: only tokens selected at recursion step r are eligible for evaluation at step r+1.

This design mimics an early-exit mechanism: early in training, the model learns to automatically single out the tokens that need deep processing, concentrating compute on the hardest ones.
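The sketch below illustrates expert-choice routing with hierarchical filtering in the spirit described above. The per-step scoring heads, the top-k schedule, and the full-batch masking are simplifying assumptions; a real implementation would gather only the selected tokens and train the routers jointly.

```python
import torch
import torch.nn as nn

def expert_choice_routing(hidden, routers, shared_block, k_per_step):
    """Illustrative expert-choice routing: each recursion step ("expert") keeps its top-k tokens.

    hidden:       [batch, seq_len, dim] token hidden states
    routers:      one scoring head (e.g. nn.Linear(dim, 1)) per recursion step
    shared_block: the shared block reused at every step
    k_per_step:   how many tokens each step keeps (must not exceed the eligible count)
    """
    batch, seq_len, _ = hidden.shape
    eligible = torch.ones(batch, seq_len, dtype=torch.bool, device=hidden.device)
    rows = torch.arange(batch, device=hidden.device).unsqueeze(1)
    for router, k in zip(routers, k_per_step):
        scores = router(hidden).squeeze(-1)                    # [batch, seq_len]
        scores = scores.masked_fill(~eligible, float("-inf"))  # hierarchical filtering:
        topk = scores.topk(k, dim=-1).indices                  #   only step-r winners compete at step r+1
        selected = torch.zeros_like(eligible)
        selected[rows, topk] = True
        out = shared_block(hidden)                             # shared weights at every step
        hidden = torch.where(selected.unsqueeze(-1), out, hidden)
        eligible = selected
    return hidden

# Example: 3 recursion steps keeping 16, 8, then 4 of 32 tokens.
dim = 256
routers = [nn.Linear(dim, 1) for _ in range(3)]
block = nn.TransformerEncoderLayer(dim, 4, batch_first=True)   # stand-in for the shared block
x = expert_choice_routing(torch.randn(2, 32, dim), routers, block, k_per_step=[16, 8, 4])
print(x.shape)  # torch.Size([2, 32, 256])
```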

Unlike the former, token-choice routing (see Figure 2b) decides at the outset how many recursion steps each token will go through.

Specifically, based on the hidden state from the first layer, the model computes a score for each expert (e.g. via softmax or sigmoid).

Assuming there are Nr experts, each corresponding to a recursion depth, the model assigns each token to the expert with the highest score; a token assigned to expert i then passes through the first i recursion steps, processed sequentially.

This way, the recursion depth is fixed the moment a token enters the network, avoiding re-selection at every step and improving inference efficiency.
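For comparison, here is a matching sketch of token-choice routing, where each token's recursion depth is fixed up front from its router scores. The names, the argmax assignment, and the full-batch masking are again illustrative assumptions rather than the paper's code.

```python
import torch
import torch.nn as nn

def token_choice_routing(hidden, router, shared_block, num_experts):
    """Illustrative token-choice routing: expert i means "recurse i+1 times", chosen once up front."""
    scores = torch.softmax(router(hidden), dim=-1)    # [batch, seq_len, num_experts]
    depth = scores.argmax(dim=-1) + 1                 # per-token recursion depth in 1..num_experts
    for step in range(1, num_experts + 1):
        out = shared_block(hidden)                    # same shared weights at every step
        still_active = (depth >= step).unsqueeze(-1)  # tokens whose assigned depth covers this step
        hidden = torch.where(still_active, out, hidden)
    return hidden, depth

# Example: 3 experts -> recursion depths of 1, 2, or 3 per token.
dim = 256
router = nn.Linear(dim, 3)
block = nn.TransformerEncoderLayer(dim, 4, batch_first=True)  # stand-in for the shared block
out, depth = token_choice_routing(torch.randn(2, 32, dim), router, block, num_experts=3)
print(depth[0])  # one assigned depth per token, e.g. tensor([2, 1, 3, ...])
```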

The left side of Table 2 compares the two methods:

The advantage of expert-choice routing is that it achieves ideal computational load balancing. However, it is prone to information leakage.

In contrast, token-choice routing naturally avoids information leakage, but its load is unevenly distributed.

Table 2: Comparison of routing strategies and key-value caching strategies. (Left) Summary of the two routing strategies: expert-choice and token-choice; (Right) relative cost efficiency of the caching strategies compared with a vanilla Transformer

Figure 2: Architectural components of Mixture-of-Recursions (MoR). (a) Expert-choice routing; (b) token-choice routing; (c) KV caching strategies

2. KV caching strategies: recursion-wise caching vs. cross-recursion sharing

For the MoR model, two KV caching strategies are proposed: recursion-wise caching and cross-recursion sharing.

1. Recursion-wise caching (see Figure 2c) is "selective caching": only the tokens routed to a given recursion step generate and store KV pairs at that step.

Attention is then computed only over the cache of the current recursion step, which keeps the computation local, significantly improving memory efficiency and reducing the I/O burden.

2. Cross-recursion sharing (see Figure 2c): KV pairs are generated and cached only at the first recursion step and then reused at all subsequent steps. Under this scheme, the number of queries participating in attention at each step may shrink.

In other words, all tokens can fully access the historical context without recomputation, regardless of whether they are still participating in the computation at later recursion steps.
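A minimal sketch of the difference between the two strategies, for a single sequence and with illustrative projection modules (not the paper's code): recursion-wise caching stores KV only for the tokens active at each step, while cross-recursion sharing computes KV once and reuses it at every later step.

```python
import torch
import torch.nn as nn

def build_kv_caches(hidden, k_proj, v_proj, active_masks, share_across_recursions):
    """Illustrative KV caching for one sequence.

    hidden:       [seq_len, dim] token states
    active_masks: one bool mask per recursion step (which tokens are still active)
    share_across_recursions=False -> recursion-wise caching: each step stores KV
        only for its active tokens (smaller cache, attention stays local to the step).
    share_across_recursions=True  -> cross-recursion sharing: KV is computed once
        at the first step and reused by every later step.
    """
    caches = []
    for step, mask in enumerate(active_masks):
        if share_across_recursions:
            if step == 0:
                caches.append((k_proj(hidden), v_proj(hidden)))  # cache all tokens once
            else:
                caches.append(caches[0])                         # reuse the step-0 KV pairs
        else:
            selected = hidden[mask]                              # only this step's active tokens
            caches.append((k_proj(selected), v_proj(selected)))
    return caches

# Example: 8 tokens, 3 recursion steps with progressively fewer active tokens.
dim = 64
k_proj, v_proj = nn.Linear(dim, dim), nn.Linear(dim, dim)
masks = [torch.tensor([True] * 8),
         torch.tensor([True, False, True, True, False, True, False, True]),
         torch.tensor([False, False, True, False, False, True, False, False])]
caches = build_kv_caches(torch.randn(8, dim), k_proj, v_proj, masks, share_across_recursions=False)
print([k.shape[0] for k, v in caches])  # [8, 5, 2] -- the per-step cache shrinks as tokens exit
```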

The right side of Table 2 compares the two caching strategies:

Recursion-wise caching: KV memory and I/O are cut to roughly half of the original.

Cross-recursion sharing: it only reduces attention computation linearly, and the heavy KV reads and writes can become a performance bottleneck.

Table 3: Comparison of MoR, the recursive Transformer, and the vanilla Transformer under equal compute and equal token counts

4. Experiments

The researchers pre-trained models from scratch using a Llama-style Transformer architecture, following the configuration of the open-source SmolLM models, and evaluated them on the FineWeb-Edu validation set and six few-shot benchmarks.

1. Key results

Under the same training compute budget, MoR outperforms the baselines with fewer parameters

Under the same training budget (16.5e18 FLOPs), the researchers compared the MoR model with standard and recursive Transformers.

Validation losses under different compute budgets for the four model scales (135M, 360M, 730M, and 1.7B parameters) are shown in the figure below.

As shown in Table 3, the MoR model with expert-choice routing and two recursions (Nr=2) not only achieves lower validation loss but also outperforms the standard baseline in average few-shot accuracy.

This stems from MoR's higher computational efficiency, which lets it process more training tokens within the same FLOPs budget.

With the same amount of data, MoR still outperforms the baselines with less compute

To isolate the influence of architectural differences, the researchers fixed the number of training tokens at 20B for this analysis.

The results confirm that the MoR model (Nr=2) still achieves lower validation loss and higher accuracy with 25% fewer training FLOPs, surpassing both the standard and recursive baselines.

Compared to the standard baseline, the MoR model experienced a 19% reduction in training time and a 25% reduction in peak memory usage.

This is made possible by the specially designed hierarchical filtering and recursion-wise attention mechanisms.

In addition, MoR's performance depends on the choice of routing and caching strategies.

2. IsoFLOP analysis

One of the core criteria for evaluating a new model architecture is whether its performance keeps improving as model size and compute grow.

The research team therefore comprehensively compared MoR with the standard (vanilla) Transformer and the recursive Transformer.

Experimental setup

There are four model sizes for the experiment: 135M, 360M, 730M, and 1.7B parameters.

For the Recursive Transformer and MoR configurations, the number of recursions is set to 3.

Pre-training is performed under three different compute budgets: 2e18, 5e18, and 16.5e18 FLOPs.

MoR architecture: scalable and parameter-efficient

As shown in Figure 3, MoR consistently outperforms the recursive baseline at all parameter scales and compute budgets.

Although MoR slightly trails the standard Transformer at the smallest scale (135M), the gap narrows rapidly as the model scales up.

Once the parameter count reaches 360M or more, MoR not only matches the standard Transformer but even surpasses it under low and medium compute budgets.

Overall, these results show that MoR is scalable and parameter-efficient, and can serve as a replacement for the vanilla architecture.

3. Inference throughput evaluation

Through parameter sharing, MoR can exploit continuous depth-wise batching to significantly boost throughput during inference.

During decoding, this mechanism fills freed slots with new tokens as soon as old sequences finish, keeping GPU utilization high.
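To see why this helps, here is a toy, pure-Python simulation of the scheduling idea; the depths, batch size, and function name are all illustrative. Slots freed by early-exiting items are refilled immediately, so every forward pass runs on a full batch.

```python
from collections import deque

def depthwise_batching(required_steps, num_slots):
    """Toy simulation of continuous depth-wise batching.

    required_steps: recursion depth each queued item still needs
    num_slots:      how many items fit in one GPU batch
    Returns how many batched forward passes are executed.
    """
    queue = deque(required_steps)
    slots = [queue.popleft() for _ in range(min(num_slots, len(queue)))]
    passes = 0
    while slots:
        passes += 1                                  # one forward pass over the whole batch
        slots = [s - 1 for s in slots]               # every occupant consumes one recursion step
        refreshed = []
        for s in slots:
            if s > 0:
                refreshed.append(s)                  # still needs more recursion
            elif queue:
                refreshed.append(queue.popleft())    # refill the freed slot right away
        slots = refreshed
    return passes

# Items needing 1-4 recursion steps, batch of 4 slots:
print(depthwise_batching([1, 4, 2, 3, 1, 2, 4, 1], num_slots=4))  # 6 passes
# Static batches padded to their deepest item would need 4 + 4 = 8 passes for the same work.
```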

Experimental setup

At the 360M parameter scale, the team tested MoR models with different recursion depths (2, 3, and 4).

With depth-wise batching, MoR significantly improves inference throughput

As shown in Figure 4a, the inference throughput of the MoR variants exceeds that of the vanilla Transformer in both settings.

The greater the recursion depth, the more tokens exit early, which reduces KV cache usage and further boosts inference speed; for example, at the maximum batch setting (B=Max), MoR-4 achieves a 2.06x speedup.

Experiments show that combining the depth-wise batching mechanism with the early-exit strategy greatly accelerates the real-world inference speed of MoR models.

For more details, such as the ablation studies, please refer to the original paper.

Resources:

https://arxiv.org/abs/2507.10524

https://x.com/rohanpaul_ai/status/1945342236310561091

https://www.rohan-paul.com/p/landmark-research-from-google-deepmind
