Wang Guan, a Post-2000 Alumnus of Tsinghua University, Unveils New Work: Revolutionizing Transformer Pre-training with 1/900 the Tokens and 1/432 the Computing Power

Significantly lower the pre-training threshold

Breaking with the traditional pre-training paradigm for large models, a team led by Wang Guan, a Tsinghua University alumnus born in the 2000s, has released a new work:

They replaced the standard Transformer with the Hierarchical Recurrent Model (HRM) and proposed HRM-Text, an efficiently pre-trained model that goes beyond conventional scaling laws.

Paper link: https://arxiv.org/abs/2605.20613

Even with approximately 100 to 900 times fewer training tokens and an estimated 96 to 432 times less compute than standard baseline models, HRM-Text still achieves performance comparable to that of open-source models with 2B to 7B parameters.

Meanwhile, with 1B parameters, 40B unique tokens, and a training cost of approximately $1,500, HRM-Text achieves the following results on mainstream benchmarks: MMLU 60.7%, ARC-C 81.9%, DROP 82.2%, GSM8K 84.5%, MATH 56.2%.

Figure | Pre-training efficiency.

Based on this, they state clearly: structural priors and targeted training objectives can significantly lower the pre-training threshold. This training scheme makes it feasible to train a foundation model from scratch.

How is HRM-Text designed?

Pre-training of large language models (LLMs) increasingly depends on a small number of institutions with sufficient compute and data. Training a competitive foundation model often requires trillions of tokens, thousands of GPUs, and even tens of millions of dollars of compute.

However, the current training approach is inefficient. A large share of computation is spent on irrelevant tokens such as prompts, formatting filler, and webpage noise, so much of the training compute does not directly contribute to reasoning ability.

In this work, the research team redesigned both the architecture and the training objective, making the pre-training of HRM-Text considerably more efficient.

Architecture: A hierarchical recurrent model with two timescales is adopted, splitting the computation into a slow H module and a fast L module. A standard Transformer performs only one forward pass per token, while HRM applies multiple rounds of recurrent updates to the same token. The H and L modules each account for half of the parameters of the recurrent core. The overall compute is roughly equivalent to unrolling the same set of parameters four times, which increases computational depth without increasing the parameter count.
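The two-timescale recurrence described above can be sketched as a toy schedule. This is a minimal illustration, not the paper's implementation: the nesting order (several fast L updates followed by one slow H update), the cycle counts, and the scalar update functions `l_update` and `h_update` are all assumptions, chosen so that the total cost works out to four passes over the recurrent core's parameters.

```python
import math

def hrm_schedule(h_cycles: int, l_steps_per_cycle: int) -> list:
    """Order of module applications: fast L steps nested inside slow H cycles (assumed layout)."""
    schedule = []
    for _ in range(h_cycles):
        schedule.extend(["L"] * l_steps_per_cycle)  # fast module refines its state
        schedule.append("H")                        # slow module updates once per cycle
    return schedule

def full_pass_equivalents(schedule: list) -> float:
    """H and L each hold half of the core's parameters, so one application costs half a pass."""
    return 0.5 * len(schedule)

def l_update(z_l: float, z_h: float, x: float) -> float:
    # Hypothetical stand-in for the fast module: sees the input and the slow state.
    return math.tanh(0.5 * z_l + 0.3 * z_h + x)

def h_update(z_h: float, z_l: float) -> float:
    # Hypothetical stand-in for the slow module: integrates the fast state.
    return math.tanh(0.7 * z_h + 0.6 * z_l)

def run_core(x: float, schedule: list) -> float:
    """Apply the same two modules repeatedly to one token's (toy, scalar) state."""
    z_h, z_l = 0.0, 0.0
    for module in schedule:
        if module == "L":
            z_l = l_update(z_l, z_h, x)
        else:
            z_h = h_update(z_h, z_l)
    return z_h

schedule = hrm_schedule(h_cycles=2, l_steps_per_cycle=3)
print(schedule)
print(full_pass_equivalents(schedule))
print(run_core(0.2, schedule))
```

With two cycles of three L steps plus one H step, the schedule contains eight half-core applications, which matches the "four unrollings of the same parameters" figure. Other splits with the same total would fit that figure equally well.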

Training objective: Instead of standard full-text autoregressive pre-training, the model is trained directly on instruction-answer pairs, with the loss computed only on the answer part. This is combined with a PrefixLM mask, which gives the instruction part bidirectional attention while the answer part is generated under a causal mask.

Figure | HRM-Text architecture.
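The masking and loss rules in this objective can be made concrete with a small sketch. It assumes a single instruction-answer pair in which the first `prefix_len` positions are the instruction; the function names and the toy per-token losses are hypothetical and only illustrate the idea.

```python
def prefix_lm_mask(prefix_len: int, total_len: int) -> list:
    """mask[i][j] is True when position i may attend to position j.

    Instruction positions (j < prefix_len) are visible to every position,
    which makes attention inside the instruction bidirectional.
    Answer positions are only visible to themselves and later positions (causal).
    """
    return [
        [j < prefix_len or j <= i for j in range(total_len)]
        for i in range(total_len)
    ]

def answer_only_loss(token_losses: list, prefix_len: int) -> float:
    """Average the per-token loss over answer positions only; instruction tokens are ignored."""
    answer_losses = token_losses[prefix_len:]
    return sum(answer_losses) / len(answer_losses)

mask = prefix_lm_mask(prefix_len=2, total_len=4)
for row in mask:
    print(" ".join("x" if allowed else "." for allowed in row))

# Toy per-token losses: the two instruction positions do not affect the result.
print(answer_only_loss([9.0, 9.0, 1.0, 3.0], prefix_len=2))
```

The first two rows of the printed mask are identical, since both instruction tokens see the whole instruction. The last two rows form the usual lower-triangular causal pattern.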

To improve the stability of recursive training, the research team introduced MagicNorm and Warmup Deep Credit Assignment.

MagicNorm is a hybrid normalization strategy that utilizes the asymmetry between forward and backward computational depths under Truncated Backpropagation Through Time (Truncated BPTT). It uses PreNorm inside the module and adds an additional normalization at the module exit, thereby improving the stability of deep recursive training.
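The placement of the normalization layers can be sketched as follows. This only illustrates "PreNorm inside the module plus one extra normalization at the module exit"; it does not model the forward/backward depth asymmetry that the method exploits. The choice of a gain-free RMS normalization, the toy sublayers, and all function names are assumptions.

```python
import math

def rms(x: list) -> float:
    """Root mean square of a vector."""
    return math.sqrt(sum(v * v for v in x) / len(x))

def rms_norm(x: list, eps: float = 1e-6) -> list:
    """Gain-free RMS normalization (assumed; the article does not name the norm type)."""
    scale = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / scale for v in x]

def prenorm_layer(x: list, sublayer) -> list:
    """PreNorm residual layer: normalize the input to the sublayer, then add its output back."""
    update = sublayer(rms_norm(x))
    return [a + b for a, b in zip(x, update)]

def module(x: list, sublayers: list, exit_norm: bool) -> list:
    """A stack of PreNorm layers, optionally followed by one normalization at the exit."""
    for sublayer in sublayers:
        x = prenorm_layer(x, sublayer)
    return rms_norm(x) if exit_norm else x

def recur(x: list, sublayers: list, n_steps: int, exit_norm: bool) -> list:
    """Apply the same module recurrently, feeding its output back in as the next input."""
    for _ in range(n_steps):
        x = module(x, sublayers, exit_norm)
    return x

# Toy sublayers standing in for attention / MLP blocks.
toy_sublayers = [lambda h: [2.0 * v for v in h]] * 2
x0 = [1.0, -2.0, 3.0, 0.5]

print(rms(recur(x0, toy_sublayers, n_steps=8, exit_norm=False)))  # residual stream keeps growing
print(rms(recur(x0, toy_sublayers, n_steps=8, exit_norm=True)))   # state stays at unit scale
```

In this toy setting, the PreNorm-only state grows with every recurrent step, while the exit normalization returns it to unit scale before the next step.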

Warmup Deep Credit Assignment propagates gradients back through only the last two recursive steps in the early stage of training, then linearly extends this to the last five steps. This lets the model converge stably on a shorter credit path before longer dependencies are gradually introduced.
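The warmup can be pictured as a schedule over training steps. The sketch below takes the endpoints (two steps, then five) from the description above; the length of the initial hold phase, the length of the linear ramp, and the total of eight recursive steps are hypothetical parameters chosen for illustration.

```python
def grad_steps(train_step: int, hold_steps: int, ramp_steps: int,
               start: int = 2, end: int = 5) -> int:
    """Number of final recursive steps that receive gradients at a given training step.

    Stays at `start` for `hold_steps`, then grows linearly (rounded down) to `end`
    over `ramp_steps`. Both durations are assumed values, not from the paper.
    """
    elapsed = min(max(train_step - hold_steps, 0), ramp_steps)
    return start + (end - start) * elapsed // ramp_steps

def gradient_flags(total_recursive_steps: int, k: int) -> list:
    """True for the last k recursive steps (backpropagated), False for earlier ones (detached)."""
    return [i >= total_recursive_steps - k for i in range(total_recursive_steps)]

for step in (0, 100, 200, 300, 400, 1000):
    k = grad_steps(step, hold_steps=100, ramp_steps=300)
    print(step, k, gradient_flags(8, k))
```

Early in training only the tail of the recurrence is on the gradient path. As `k` grows, earlier steps are added one at a time until five steps are included.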

What's the effect?

The experimental results show that HRM-Text has clear advantages in architectural efficiency, training objective, and overall performance.

1. Is the recurrent architecture more effective under a fixed training compute budget?

The results show that, with FLOPs matched, HRM 1B outperforms Transformer 1B, Transformer 3B, Looped Transformer 1B, and RINS 1B on most benchmarks. The comparison with TRM also shows that HRM training is more stable.

Figure | Comparison of performance and stability with Transformer models. HRM maintains stable training dynamics at all scales, while Transformer models show severe instability at the 1-billion-parameter scale. In addition, at the 0.6B scale, HRM needs only half the compute of Transformer models to achieve competitive performance on most benchmarks.

2. Do the task completion objective and PrefixLM help?

Ablation experiments show that, with FLOPs matched, the MMLU score of the 1B Transformer rises from 40.55 under standard autoregressive training to 47.72 after introducing the task completion objective and to 53.15 after adding PrefixLM; switching to the HRM architecture raises it to 60.73.

Figure | Performance comparison between different model architectures and training objectives

3. How efficient is HRM-Text compared with contemporary open-source models?

HRM-Text 1B scores 60.7, 81.9, 82.2, 84.5, and 56.2 on MMLU, ARC-C, DROP, GSM8K, and MATH, respectively. Although the open-source models it is compared with generally had larger training budgets, it reaches the performance range of 2B to 7B open-source models with only 40 billion unique tokens and 1B parameters, using up to 900 times fewer tokens and up to 432 times less compute.

Figure | Evaluation results of HRM-Text 1B compared with contemporary fully open-source models and open-weight models
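As a quick sanity check on these ratios, the token budgets they imply for the baseline models can be worked out directly. The calculation below uses only the article's figures (40 billion unique tokens and a reduction of roughly 100 to 900 times); the helper name is hypothetical, and the results are implied magnitudes rather than numbers reported for any specific model.

```python
HRM_TOKENS = 40_000_000_000  # unique training tokens reported for HRM-Text 1B

def implied_baseline_tokens(reduction_factor: int) -> int:
    """Token budget a baseline would need for HRM-Text to use `reduction_factor` times fewer."""
    return HRM_TOKENS * reduction_factor

for factor in (100, 900):
    trillions = implied_baseline_tokens(factor) / 1e12
    print(f"{factor}x fewer tokens implies a baseline of about {trillions:.0f}T tokens")
```

The implied range of about 4 trillion to 36 trillion tokens is consistent with the article's earlier remark that competitive foundation models often require trillions of training tokens.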

4. Does the recurrent structure bring greater effective depth?

The results show that the representations of the standard Transformer and Looped Transformer tend to stabilize at relatively shallow layers, while HRM still shows more pronounced inter-block representation changes, lower cosine similarity, and higher logit lens KL values at deeper layers.
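The two diagnostics mentioned here can be sketched in a few lines. Cosine similarity between the hidden states of consecutive blocks measures how much a block still changes the representation, and a logit lens KL compares the token distribution decoded from an intermediate block with the final one. The direction of the KL (final distribution versus intermediate) and the toy vectors are assumptions; the paper's exact protocol may differ.

```python
import math

def cosine_similarity(a: list, b: list) -> float:
    """Cosine of the angle between two hidden-state vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def softmax(logits: list) -> list:
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p: list, q: list) -> float:
    """KL(p || q) for two discrete distributions with q > 0 everywhere."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def logit_lens_kl(intermediate_logits: list, final_logits: list) -> float:
    """KL between the final prediction and the one decoded from an intermediate block."""
    return kl_divergence(softmax(final_logits), softmax(intermediate_logits))

# Toy hidden states for three consecutive blocks of one token.
blocks = [[1.0, 0.0, 0.0], [0.8, 0.6, 0.0], [0.8, 0.6, 0.1]]
for prev, curr in zip(blocks, blocks[1:]):
    print(cosine_similarity(prev, curr))

# Toy logits: an early block that is still far from the final prediction.
print(logit_lens_kl([0.0, 0.0, 0.0], [3.0, 0.0, 0.0]))
```

Under this reading, a model whose deep blocks still do useful work would show consecutive-block cosine similarity noticeably below 1 and a logit lens KL that stays above zero until late in the stack.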

Figure | Analysis of effective depth.

Figure | Layer-by-layer Logit Lens KL analysis.

Shortcomings and future directions

Although HRM-Text shows strong performance on reasoning-intensive tasks, the method still has limitations, and the team proposes several directions for future research.

1. Decoupling "knowledge" and "reasoning"

Currently, broader coverage of factual knowledge still depends more on model scale and data breadth. HRM-Text is trained on only 40 billion unique tokens, and explicit knowledge sources account for only part of the task-formatted data mixture. In the future, researchers will need to design the compact reasoning core separately from external fact storage, leaving knowledge breadth to carefully selected corpora, retrieval-augmented modules, or learnable memory.

2. Adaptive computation time

The recurrent schedule of HRM-Text brings greater effective serial depth, but it also means that the model must perform a fixed number of recursive steps during inference. A promising future direction is to introduce an adaptive computation time mechanism that lets easy samples stop computing earlier and reserves the full recurrent budget for difficult samples, reducing inference costs.
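The idea of adaptive computation time can be illustrated with a simple early-exit loop. This is a sketch of the proposed direction, not something HRM-Text currently does: the halting score, the threshold, and the toy step function are all hypothetical.

```python
def adaptive_recurrence(state, step_fn, halt_fn, max_steps: int, threshold: float = 0.9):
    """Run recurrent steps until the halting score reaches the threshold or the budget is spent.

    Returns the final state and the number of steps actually used.
    """
    steps_used = 0
    for _ in range(max_steps):
        state = step_fn(state)
        steps_used += 1
        if halt_fn(state) >= threshold:
            break
    return state, steps_used

# Toy setup: the state is a remaining "error" that each step halves,
# and the halting score is a confidence of 1 - error.
halve_error = lambda err: err * 0.5
confidence = lambda err: 1.0 - err

print(adaptive_recurrence(0.15, halve_error, confidence, max_steps=8))  # easy sample
print(adaptive_recurrence(1.0, halve_error, confidence, max_steps=8))   # hard sample
```

In this toy run the easy sample exits after a single step while the hard one uses four, which is the kind of saving such a mechanism aims for. A fixed schedule would spend all eight steps on both.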

3. The current scope of large-scale verification is still limited

The current scaling experiments only cover a Transformer control group with 3B parameters and HRM-Text with 1B parameters. The research team said that whether a similar efficiency advantage holds at larger model scales remains to be verified in follow-up work.

4. PrefixLM and inference framework

Currently, PrefixLM still faces certain engineering limitations in actual deployment. Although it can run on standard text-generation inference frameworks such as vLLM, it requires the framework to support custom attention masks in the prefill stage. Extending it to multi-turn dialogue would further require a KV-cache mechanism that preserves both bidirectional visibility within each user segment and the causal constraint during generation on the assistant side.
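The multi-turn constraint can be expressed as an attention mask over role-tagged segments. The sketch below is one possible reading of the requirement, not a design from the paper: the `(role, length)` segment representation and the function name are hypothetical, and it builds only the mask, not a KV cache.

```python
def multi_turn_mask(segments: list) -> list:
    """Build mask[i][j] (True = position i may attend to position j) for a dialogue.

    `segments` is a list of (role, length) pairs with role "user" or "assistant".
    Every position sees all earlier positions. User positions additionally see
    later positions inside their own segment (bidirectional within the user turn).
    """
    roles, seg_ids = [], []
    for seg_id, (role, length) in enumerate(segments):
        roles.extend([role] * length)
        seg_ids.extend([seg_id] * length)
    n = len(roles)
    return [
        [j <= i or (seg_ids[i] == seg_ids[j] and roles[i] == "user") for j in range(n)]
        for i in range(n)
    ]

dialogue = [("user", 2), ("assistant", 2), ("user", 2)]
for row in multi_turn_mask(dialogue):
    print(" ".join("x" if allowed else "." for allowed in row))
```

In this layout no position attends to a later segment, so the keys and values of a completed segment would not change when new turns are appended. That property is what would make caching them plausible, though an actual implementation would depend on the inference framework.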

For more technical details, please refer to the original paper.

This article is from the WeChat official account "Academic Headlines" (ID: SciTouTiao), written by Xia Qiansi and published by 36Kr with authorization.

