New architecture HRM-Text sets a record at 1B parameters for a training cost on the order of $1,000, and a Turing Award winner has personally stepped in
A model with roughly 1 billion parameters scores 56.2 on MATH, 84.5 on GSM8K, and 81.9 on ARC-Challenge. Training cost about $1,500 and took less than two days on 16 H100 GPUs.
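As a quick sanity check on those figures, the sketch below works out the GPU-hour budget implied by "16 H100 GPUs" and "less than two days", and the hourly rate that would match the reported $1,500. The two-day figure is treated as an upper bound, and the resulting rate is an inference from the article's numbers, not a price quoted by the team.

```python
# Figures quoted in the article.
num_gpus = 16
max_days = 2              # "less than two days", so this is an upper bound
reported_cost_usd = 1500  # approximate training cost

# Upper bound on the GPU-hours consumed by the run.
max_gpu_hours = num_gpus * max_days * 24

# Hourly rate implied if the run had used the full two days. Because the
# real run was shorter, the true rate would be at least this high.
implied_rate = reported_cost_usd / max_gpu_hours

print(max_gpu_hours)           # 768
print(round(implied_rate, 2))  # 1.95
```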
This is HRM-Text, released by Sapient Intelligence on May 18, 2026. The team also made the paper, model weights, and pre-training code publicly available.
Looking at these numbers alone, the most natural reaction might be: is this the result of fine-tuning an existing model? Standing on the shoulders of giants certainly makes things easier.
It is not. HRM-Text was pre-trained from scratch on only about 40 billion unique tokens (counting repeated sampling, the experimental table records a total training volume of about 60 billion tokens). The unique-token count is approximately 1/225 of the training volume of Llama 3.2 3B (9 trillion tokens) and 1/900 of that of Qwen3.5 2B (36 trillion tokens).
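The ratios above follow directly from the quoted token counts. The short check below reproduces them and also shows, for comparison, what the ratios would be against the 60-billion-token total; that second column is our own arithmetic and is not a figure stated in the article.

```python
unique_tokens = 40 * 10**9   # HRM-Text, unique tokens
total_tokens = 60 * 10**9    # HRM-Text, including repeated sampling

# Baseline training volumes as quoted in the article.
baselines = {
    "Llama 3.2 3B": 9 * 10**12,
    "Qwen3.5 2B": 36 * 10**12,
}

for name, tokens in baselines.items():
    vs_unique = tokens / unique_tokens
    vs_total = tokens / total_tokens
    print(f"{name}: {vs_unique:.0f}x unique tokens, {vs_total:.0f}x total tokens")
```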
Comparison of HRM-Text with other models in terms of training FLOPs, training tokens, and benchmarks.
Naturally, the question arises: How was this achieved?
Over the past few years, the large-model industry has settled on an almost default growth logic: larger models, more data, and more compute will keep improving capability.
This approach has been thoroughly proven. The continued evolution of models such as GPT, Claude, DeepSeek, and Qwen all relies on scaling parameters, data, and training compute. At the same time, however, training foundation models looks more and more like heavy industry: longer training cycles, more expensive GPU clusters, more complex data engineering, and ever higher barriers to entry.
HRM-Text tries a different route: with limited data and limited compute, can each unit of computation yield more if the architecture and the training objective are designed together?
The title of the paper directly indicates the direction it attempts to challenge: Efficient Pretraining Beyond Scaling.
- Paper Title: HRM-Text: Efficient Pretraining Beyond Scaling
- Paper URL: https://arxiv.org/abs/2605.20613
- GitHub: https://github.com/sapientinc/HRM-Text
- Hugging Face: https://huggingface.co/sapientinc/HRM-Text-1B
- X Launch post: https://x.com/Sapient_Int/status/2056510383935172798
In simple terms, HRM-Text changes both “how the model computes” and “what it learns”. On the one hand, it lets a limited set of parameters perform multiple rounds of internal computation before producing output, which increases the effective computation depth. On the other hand, it computes the loss only on the answer part, which concentrates the training signal on task understanding and answer generation.
Note that HRM-Text is not a mature chat model that has been through post-training or reinforcement learning. The team describes the current version as a Proof of Concept: its value lies not in being the final form of a language model, but in offering a testable case showing that architectural innovation still has considerable room to improve the efficiency of foundation model pre-training.
Complete multiple rounds of internal computations before a single output
The first change in HRM-Text is to reorganize the internal computation process of the model.
A standard Transformer usually consists of a series of network layers with independent parameters. The input propagates forward along the model depth: passing through the first layer, then entering the second layer, and so on, until the final output is obtained. A direct way to increase the model's capabilities is to stack more layers, increase the hidden dimension, or train more parameters.
HRM-Text does not simply follow this approach. It introduces two modules that operate at different time scales: the high-level module H and the low-level module L.
If we use a more intuitive analogy, a standard Transformer is more like passing a piece of material to multiple different editors in sequence, with each editor making one modification and then passing it on; HRM-Text is more like having two groups of editors repeatedly modify the same internal draft. The model does not simply add more parameters but allows limited parameters to participate in deeper effective computations.
According to an interview with the team, this design also differs from the “big brain, small brain” collaborative schemes common in the industry. Those schemes usually train two models of different sizes separately: the large model handles complex planning, the small model handles fast execution, and the two exchange information mainly through a text interface.
The H and L modules of HRM belong to the same network. They are not two independent models, and they do not hand tasks to each other through text. Instead, they iterate repeatedly on the same internal state in the same latent space. What information passes between the modules, and how the work is divided between them, is determined jointly by a single optimization process.
More precisely, HRM does not bolt a planner and an executor onto the outside of the model; it builds hierarchical computation into a single model.
The low-level module updates faster and handles local computation and iterative correction; the high-level module updates more slowly, maintains a more stable semantic context, and supplies longer-range constraints to the low-level computation. In the paper's configuration, each forward pass runs two high-level cycles, and each cycle completes three updates of the L module followed by one update of the H module.
In other words, before predicting a token, the model performs 8 recursive updates: 6 low-level updates and 2 high-level updates.
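The update order described above can be written out as a small schedule. This is only a sketch of the ordering and the counts; the function name `hrm_schedule` and its parameters are hypothetical and are not taken from the released code.

```python
def hrm_schedule(h_cycles=2, l_steps=3):
    """Return the ordered list of module updates for one forward pass.

    Hypothetical helper: each high-level cycle runs `l_steps` low-level (L)
    updates and then a single high-level (H) update.
    """
    trace = []
    for _ in range(h_cycles):
        trace.extend(["L"] * l_steps)  # fast low-level updates
        trace.append("H")              # one slow high-level update ends the cycle
    return trace


trace = hrm_schedule()
print("".join(trace))                                  # LLLHLLLH
print(len(trace), trace.count("L"), trace.count("H"))  # 8 6 2
```

Because the schedule is fixed, the same eight updates run for every token regardless of how hard the input is.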
The H/L dual-time scale recursive structure, the internal structure of the modules, and the PrefixLM attention mask.
It should be emphasized that “multiple rounds of internal computation” does not mean the model can adjust its thinking time to the difficulty of the question. The current version uses a fixed recursive schedule: whether the task is simple or complex, the model performs the preset number of internal updates. Adaptive computation time is left as a direction for future exploration.
It also follows that having 1 billion parameters does not make its inference cost identical to that of an ordinary 1-billion-parameter dense Transformer. Recursive calls improve parameter utilization but also increase the amount of serial computation before each output token. Parameter count, training cost, and actual inference efficiency therefore need to be discussed separately.
This approach is not without costs.
The deeper the internal loop, the more opportunities the model has to keep refining its representation. But when the same set of modules is called repeatedly, the variance of the activations can keep accumulating, and gradients become more likely to vanish or explode. Recursive architectures are not a new idea. The real difficulty is training deep recursion stably on open-domain language tasks.
HRM-Text introduces two designs for this purpose: MagicNorm and warmup deep credit assignment.
The goal of MagicNorm is to keep both the forward pass and backpropagation stable. The PreNorm structure, which is good for gradient flow, is retained inside each module, and an additional normalization is applied at the exit of each recursive module. This limits the growth of activation variance across repeated cycles while keeping the gradient path as smooth as possible.
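To make the variance argument concrete, here is a toy illustration, not the paper's implementation. It assumes RMS normalization as the normalizer and replaces the real attention/MLP sublayer with a fixed scaling (`toy_sublayer`); both choices, and all function names, are hypothetical. It compares repeated PreNorm residual updates with and without an extra normalization on exit.

```python
import math


def rms(v):
    """Root-mean-square magnitude of a vector."""
    return math.sqrt(sum(t * t for t in v) / len(v))


def rms_norm(v, eps=1e-6):
    """Scale a vector to (approximately) unit RMS."""
    scale = math.sqrt(sum(t * t for t in v) / len(v) + eps)
    return [t / scale for t in v]


def toy_sublayer(v, weight=0.5):
    # Stand-in for attention/MLP: a fixed linear map (assumption for the demo).
    return [weight * t for t in v]


def prenorm_block(x):
    # PreNorm residual: normalize the sublayer input, leave the skip path raw.
    return [a + b for a, b in zip(x, toy_sublayer(rms_norm(x)))]


def run(x, steps, exit_norm):
    for _ in range(steps):
        x = prenorm_block(x)
        if exit_norm:
            x = rms_norm(x)  # extra normalization when the module exits
    return x


x0 = [1.0, -1.0, 2.0, -2.0]
print(rms(run(x0, 8, exit_norm=False)))  # magnitude keeps growing with each step
print(rms(run(x0, 8, exit_norm=True)))   # magnitude stays close to 1
```

In this toy setting the residual stream grows by a roughly constant amount per recursive step without the exit normalization, while the normalized variant stays bounded however many steps are run.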
Warmup deep credit assignment controls how far back the gradients need to be traced. At the beginning of training, the model only performs gradient backpropagation for the last two recursive steps; as the training gradually stabilizes, the backpropagation range is linearly increased to the last five steps.
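A linear ramp of this kind might look like the sketch below. The article only states the endpoints (two steps at the start, five at the end) and that the increase is linear; the warmup length, the rounding rule, and the function name `backprop_depth` are assumptions made for illustration.

```python
def backprop_depth(step, warmup_steps, start=2, end=5):
    """Number of trailing recursive steps that receive gradients.

    Hypothetical schedule: ramps linearly from `start` to `end` over
    `warmup_steps` training steps, rounding down to a whole step count.
    """
    if warmup_steps <= 0:
        return end
    frac = min(max(step / warmup_steps, 0.0), 1.0)
    return start + int(frac * (end - start))


TOTAL_UPDATES = 8  # 6 low-level + 2 high-level updates per forward pass

for step in (0, 500, 1000):
    k = backprop_depth(step, warmup_steps=1000)
    # Indices of the recursive updates that gradients flow through.
    grad_steps = list(range(TOTAL_UPDATES - k, TOTAL_UPDATES))
    print(step, k, grad_steps)
```

Earlier updates still run in the forward pass; under this scheme they are simply excluded from the gradient path until the ramp reaches them.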
It can be understood as a gradual “accountability mechanism”: early in training, only the internal computations closest to the output are held responsible for the loss; once training is stable, earlier computation steps are gradually brought in as well. This lets the model exploit deeper recursive computation without exposing it to overly long gradient paths from the very beginning.
The paper also analyzes this structure from the perspective of effective depth.
In a standard Transformer or in some looped Transformers, later layers may change the hidden state less and less as depth increases, so the model converges to a relatively stable output distribution early on. The analysis of HRM-Text shows that its deeper computations still produce clearly visible changes in the representation. This means the recursive steps are not mere repetition: they keep modifying the internal state, and deeper computation steps still contribute incremental information.
Comparison of the Effective Depth of different architectures.
Reduce predictions and concentrate training signals on answers
In addition to the architectural changes, the second modification of HRM-Text occurs in the pre-training objective.
Most language models use autoregressive “next token prediction”: given a piece of text, predict the next token. Whether the input is a web page, a book, a forum reply, or code, the model has to learn to continue from every position in the sequence. This objective is general enough, but it also means that much of the training signal is spent predicting text that has little to do with completing a task.
HRM-Text takes a more targeted approach: it skips the large-scale raw-text pre-training stage and trains from scratch directly on “instruction-answer” pairs. Given an instruction and the corresponding answer, the model computes the token-level loss only on the answer part.
This does not mean the instruction part takes no part in learning. The answer loss still shapes how the model understands and uses the instruction, because gradients flow back through attention. However, the model no longer has to “predict the question itself”, and the update signal is concentrated on generating appropriate answers.
An intuitive analogy: when a teacher marks a test paper, they no longer score the “copying out the question” part and only grade the answer.
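The difference between the two objectives comes down to which positions contribute to the loss. The sketch below assumes per-token log-probabilities are already available and that the first `prompt_len` positions belong to the instruction; the function names and the sample probabilities are hypothetical.

```python
import math


def full_sequence_loss(token_logprobs):
    """Mean negative log-likelihood over every position."""
    return -sum(token_logprobs) / len(token_logprobs)


def answer_only_loss(token_logprobs, prompt_len):
    """Mean negative log-likelihood over answer positions only."""
    answer = token_logprobs[prompt_len:]
    if not answer:
        raise ValueError("sequence has no answer tokens")
    return -sum(answer) / len(answer)


# Made-up probabilities the model assigns to the correct token at each position:
# three instruction tokens followed by three answer tokens.
logprobs = [math.log(p) for p in (0.1, 0.2, 0.1, 0.5, 0.8, 0.9)]
prompt_len = 3

print(full_sequence_loss(logprobs))            # includes the instruction tokens
print(answer_only_loss(logprobs, prompt_len))  # only the last three tokens count
```

In a tensor framework the same effect is usually obtained by masking the instruction positions out of the loss rather than slicing, but the quantity being averaged is the same.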
The “answer-only objective” is paired with a PrefixLM mask. Under a standard causal mask, each token can only see the content before it. That design suits left-to-right generation, but for an instruction that is given in full, the restriction is unnecessary.
HRM-Text lets the tokens in the instruction part attend to each other bidirectionally; once the answer part begins, standard causal generation resumes.
The model can therefore first integrate the entire instruction as a complete context and then generate the answer step by step. Within a decoder-only implementation, this yields a division of labor similar to an encoder-decoder: the instruction side behaves more like encoding, and the answer side more like decoding.
The attention analysis in the paper shows that, compared with a pure causal mask, PrefixLM yields higher attention entropy and attention patterns that are more global and more diverse. It is not merely a change of mask but an improvement in how the model uses the information in the instruction.
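A PrefixLM mask of this shape can be built in a few lines. This is a generic construction of the standard prefix-LM pattern, not code from the HRM-Text repository; `prefix_lm_mask` is a hypothetical name, and `True` means the query position may attend to the key position.

```python
def prefix_lm_mask(seq_len, prefix_len):
    """mask[i][j] is True when query position i may attend to key position j.

    Every position sees the whole prefix (the instruction); positions in the
    answer additionally see earlier answer tokens and themselves.
    """
    return [
        [j < prefix_len or j <= i for j in range(seq_len)]
        for i in range(seq_len)
    ]


# 3 instruction tokens followed by 2 answer tokens.
for row in prefix_lm_mask(seq_len=5, prefix_len=3):
    print("".join("1" if allowed else "0" for allowed in row))
# 11100
# 11100
# 11100
# 11110
# 11111
```

Setting `prefix_len=0` recovers the ordinary lower-triangular causal mask, which makes the relationship between the two schemes easy to see.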
Differences in computing loss only for answers, PrefixLM attention mask, and attention distribution.
The effects of these designs can be clearly seen from the ablation experiments.
Holding training FLOPs constant, the research team added “answer-only prediction”, PrefixLM, and the HRM architecture one after another and observed how performance changed.
Take ARC-Challenge as an example. A 1-billion-parameter Transformer with full-sequence prediction and a causal mask scores 51.91; switching to answer-only prediction raises this to 62.88; adding PrefixLM raises it further to 74.32; finally, switching to the HRM architecture brings it to 81.91.
On MATH, the scores increase from 35.44 to 47.04, 48.36, and 56.16 in sequence. On GSM8K, they increase from 48.37 to 69.75, 75.06, and 84.53 in sequence.
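To see how much each step contributes, the scores quoted above can be differenced stage by stage. The numbers are copied from the article; the stage labels and the helper `stepwise_gains` are our own shorthand.

```python
stages = [
    "causal mask + full-sequence loss",
    "+ answer-only loss",
    "+ PrefixLM",
    "+ HRM architecture",
]

# Scores at each ablation stage, as quoted in the article.
scores = {
    "ARC-Challenge": [51.91, 62.88, 74.32, 81.91],
    "MATH": [35.44, 47.04, 48.36, 56.16],
    "GSM8K": [48.37, 69.75, 75.06, 84.53],
}


def stepwise_gains(values):
    """Point gain contributed by each successive stage."""
    return [round(b - a, 2) for a, b in zip(values, values[1:])]


for name, values in scores.items():
    gains = stepwise_gains(values)
    total = round(values[-1] - values[0], 2)
    per_stage = ", ".join(f"{s}: +{g}" for s, g in zip(stages[1:], gains))
    print(f"{name}: {per_stage} (total +{total})")
```

The per-stage gains depend on the order in which the changes were added, so they describe this particular ablation path rather than the isolated effect of each component.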
These results show that the efficiency of HRM-Text does not come from any single change but from three changes working together: the hierarchical recursive architecture increases the effective computation depth; the answer-only objective concentrates the training signal on answer generation; and PrefixLM improves how the model integrates the instruction context.
To make the results credible, Sapient Intelligence systematically checked for data contamination. HRM-Text was trained only on publicly available, traceable data, and a strict contamination analysis was run against the evaluation sets. Under the strictest Clean Split condition, the model still achieved results consistent with the main experiments, indicating that the gains come not from test-set leakage but from capability improvements due to the model design itself. See the paper for the detailed analysis.
Placed in a broader comparison with other small models, HRM-Text shows a clear profile.
It performs strongly on benchmarks such as MATH, GSM8K, DROP, and ARC-Challenge, which focus on task execution and reasoning; on benchmarks such as MMLU, which depend more on broad knowledge coverage, it is competitive but not leading.
For example, Qwen3.5 2B listed in the paper achieves 64.5 on MMLU, higher than HRM-