Hermes team rewrites pre-training: Computing power cost reduced by 60%, a new path to improve efficiency after DeepSeek
The model's capabilities need to be further improved, but the training cost cannot be increased indefinitely — this might be the strongest consensus in the current AI industry.
From developers to model companies, the focus of concern has gone beyond "whose model is stronger." Instead, it's a more practical question: "With the same number of GPUs and the same training time, can we conduct more effective experiments, process more effective data, and achieve better loss and downstream metrics?"
The Nous Research team, which shot to prominence with Hermes Agent (140K stars), has just proposed Token Superposition Training (TST), a method it expects to substantially cut the pre-training cost of large models.
The post announcing the work has already been viewed more than 410,000 times. Hugging Face: http://huggingface.co/papers/2605.06546
In the paper "Efficient Pre-Training with Token Superposition," the most notable result is a set of experiments on a 10-billion-parameter MoE model (a Qwen3-like 10B-A1B MoE). The numbers are straightforward:
- Baseline training on 1.05T tokens consumes 12,311 B200-hours;
- TST training on 2T tokens consumes only 4,768 B200-hours, about 38.7% of the baseline;
- Meanwhile, the final loss drops from 2.252 to 2.236, and 0-shot evaluations on HellaSwag, ARC-E, ARC-C, MMLU, and others improve across the board.
In other words, TST reaches a lower loss and better downstream metrics with only about 40% of the GPU time. Measured at equal final loss, that is equivalent to cutting pre-training time to roughly 40% of the original, a speedup of about 2.5x.
If Hermes Agent, which overtook OpenClaw to top the global OpenRouter rankings, proved that Nous Research can not only train models but also push their capabilities to the limit with agents, then the newly proposed TST shifts the focus from "how to use the model" back to the source of those capabilities: pre-training itself.
Nous Research invites comparison with DeepSeek not only because this American team has long been committed to the open-source camp, but also because the two take completely different approaches to cutting costs.
DeepSeek represents system-level reconstruction: MoE, MLA, sparsification, and parallelism optimizations all squeeze more out of the compute through system-level engineering. That efficiency is never free; the engineering pays for it with complexity elsewhere.
Nous Research, by contrast, rewrites the learning path in the early stage of pre-training. It does not touch the architecture; it starts from the way the model learns from tokens, which makes the approach lighter-weight and easier to adopt.
TST: Let the model "skim" first, then "read carefully"
To understand TST, let's first go back to the most basic action in pre-training: next-token prediction.
In standard training, the model sees the previous tokens and predicts the next token. This mechanism is simple yet powerful. In the past few years, almost all mainstream LLMs have been built on this paradigm.
But TST poses a very simple question: Is it really necessary for the model to read tokens one by one carefully at the beginning of pre-training?
Nous Research's answer: not necessarily. They divide pre-training into two stages.
Caption: Comparison of TST with standard next-token prediction, MTP, and SuperBPE. TST changes the input granularity and output supervision target in the early stage of training but doesn't change the final model architecture.
The first stage is called the superposition phase. In the early stage of training, the model no longer reads text token by token. Instead, it groups consecutive tokens into a bag. For example, if the bag size is 8, it treats 8 consecutive tokens as a group.
On the input side, the model averages the embeddings of the tokens in the group to form a compressed superposed token. On the output side, the model no longer predicts the next single token but predicts which tokens will appear in the next group of tokens.
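To make the mechanics concrete, here is a minimal PyTorch sketch of the superposition phase. It is an illustration under assumptions rather than the authors' code: the bag size of 8, the multi-hot encoding of "which tokens appear in the next bag," and the binary cross-entropy loss are plausible choices, not details confirmed by the paper.

```python
import torch
import torch.nn.functional as F

# Illustrative sketch of the superposition phase (assumed details, not the paper's code).
s = 8                # bag size: consecutive tokens grouped into one superposed position
vocab_size = 32000
d_model = 512

embed = torch.nn.Embedding(vocab_size, d_model)
token_ids = torch.randint(0, vocab_size, (1, 64))      # (batch, seq_len)
bags = token_ids.view(1, -1, s)                        # (batch, n_bags, s)

# Input side: average the embeddings within each bag into one superposed token.
superposed_inputs = embed(bags).mean(dim=2)            # (batch, n_bags, d_model)

# Output side: the target for each position is the *set* of tokens in the
# next bag, encoded here as a multi-hot vector over the vocabulary.
next_bags = bags[:, 1:, :]                             # (batch, n_bags - 1, s)
targets = torch.zeros(1, next_bags.shape[1], vocab_size)
targets.scatter_(2, next_bags, 1.0)

# A transformer would map superposed_inputs[:, :-1] to logits of shape
# (batch, n_bags - 1, vocab_size); a multi-label loss is one plausible objective.
logits = torch.randn(1, next_bags.shape[1], vocab_size)  # stand-in for model output
loss = F.binary_cross_entropy_with_logits(logits, targets)
print(loss.item())
```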
The second stage is called the recovery phase. After a certain proportion of training, TST is removed and the model returns to standard next-token prediction. In other words, the second half of training proceeds like an ordinary LLM, pulling the representations learned through the early "coarse-grained" phase back into the form of an autoregressive model that can actually generate text and be deployed.
The paper calls TST a drop-in pretraining method. The key lies here: it doesn't need to modify the parallel strategy, optimizer, tokenizer, training data, or model architecture. What it really changes is the input granularity and supervision target in the early stage of training.
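Viewed as a schedule, the two phases can be sketched in a few lines. The helper below is purely illustrative: the names are made up, and the 30% switch point is just one value within the 0.2-0.4 range the paper reports as working well.

```python
# Illustrative two-phase schedule (hypothetical names, not the authors' code).
TOTAL_STEPS = 100_000
SUPERPOSITION_FRACTION = 0.3                     # assumed; paper reports 0.2-0.4 works well
SWITCH_STEP = int(TOTAL_STEPS * SUPERPOSITION_FRACTION)

def phase_for(step: int) -> str:
    """Return which training objective a given optimizer step would use."""
    return "superposition" if step < SWITCH_STEP else "next_token_prediction"

# The first ~30k steps use bagged inputs and next-bag targets; the remaining
# ~70k steps use plain next-token prediction, so the delivered checkpoint is
# an ordinary autoregressive LLM.
assert phase_for(0) == "superposition"
assert phase_for(SWITCH_STEP) == "next_token_prediction"
```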
This is also different from many training efficiency improvement solutions: TST only changes the training process, not the inference model.
Many methods that optimize the training side also affect inference: changing the tokenizer forces the ecosystem to re-establish compatibility; modifying the model structure means the deployment pipeline has to adapt; changing the attention or inference mechanism requires adjusting the online serving stack.
But TST keeps the complexity in the training stage and finally delivers an ordinary LLM.
Of course, training with TST alone is not enough. The paper is explicit that a model trained with TST throughout would output blended probabilities over multiple future tokens, and its generations would become incoherent. TST therefore has to switch back to standard autoregressive training in the later stage.
This also explains why TST is more suitable to be understood as a "phased training strategy" rather than a substitute for next-token prediction.
More straightforwardly, what TST does is a bit like letting the model "skim" in the early stage of pre-training: first learn local semantics, word co-occurrences, and coarse-grained distributions; then, once the basic representations are established, return to standard autoregressive training, reading tokens one by one to restore generation ability and token-level accuracy.
That is, tokens are compressed during training, but it's still an ordinary LLM during inference.
Why does it save GPU time? Each step processes more text
The speedup of TST is no mystery. At its core is a resource trade-off: a coarser token representation is exchanged for higher data throughput.
The data throughput here corresponds to data throughput per FLOPs in the paper, which can be understood as "how much original text can be processed per unit of computation." In other words, it's not that the GPU suddenly becomes faster, but that the model can see more text with the same computation.
In standard training, the model processes one token at each position. With a sequence length of L, the Transformer has to process L representations.
But in the superposition phase of TST, s consecutive tokens are combined into one superposed token. The sequence the model processes internally becomes shorter, while each position now corresponds to more of the original text.
Because the model computes on a coarser-grained representation, it can process s times as many data tokens with the same FLOPs.
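A quick back-of-the-envelope calculation makes the trade-off visible. The numbers below are illustrative, not taken from the paper.

```python
# Illustrative throughput comparison: with bag size s, the same number of
# transformer positions (roughly the same FLOPs) covers s times as much raw text.
s = 8                   # bag size
positions = 4096        # positions processed per sequence by the transformer

standard_raw_tokens = positions        # 1 raw token per position
superposed_raw_tokens = positions * s  # s raw tokens per superposed position

print(standard_raw_tokens)    # 4096 raw tokens per sequence
print(superposed_raw_tokens)  # 32768 raw tokens per sequence at similar cost
```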
Caption: In the 3B model experiment, TST reaches the baseline loss with fewer training steps under the equal-loss setting, indicating that its main benefit comes from higher data throughput in the early stage of training.
Traditional pre-training is like reading word by word carefully; while the early training of TST is like quickly skimming through a paragraph to grasp the local theme, word co-occurrence, and semantic distribution. After the model establishes the basic representation, it switches back to reading word by word carefully.
This "skimming" is not without cost — it will lose the word order information within the bag, so it cannot be used throughout the process. But when the model first encounters the language statistical structure, this low-resolution input is actually sufficient and efficient.
The paper defines this as a coarse-to-fine strategy: first, let the model learn the coarse-grained statistical structure in a simple, high-throughput distribution, and then restore full-resolution language modeling.
This is very different from the current mainstream efficiency approaches: MoE reduces the number of activated parameters for each token; sparse attention reduces the number of positions each token looks at; MTP (Multi-Token Prediction) predicts several future tokens at each position; while TST allows the model to learn with a different token granularity in the early stage of training.
It doesn't make the model smaller, nor does it directly make inference faster. Instead, it makes each step in the early stage of pre-training more "valuable."
This is crucial for developers. Pre-training is not a one-shot deal but a process of continuous trial and error. The sooner early training reaches a useful regime, the sooner experiments on data recipes and hyperparameter settings can be validated.
To put it simply, TST saves not only the GPU hours for one training but also the trial-and-error cost for the entire experimental cycle.
The greatest benefit comes from 10-billion-parameter models
The paper does not experiment only on small models; it also validates TST on 270M, 600M, and 3B dense models and on a 10B-A1B MoE model. The 10B-A1B MoE is a mixture-of-experts model with about 10 billion total parameters and about 1 billion activated parameters per token, and, as mentioned at the beginning, it is the model that benefits most in the experiments.
Caption: Core experimental results of TST on models of different scales.
Caption: In the 10B-A1B MoE experiment, TST reduces the B200 GPU training time to about 40% of the baseline and achieves a lower loss and better 0-shot metrics.
That is to say, TST processes more data tokens but achieves better results with less GPU time. The paper points out that under the same loss criterion, TST corresponds to a speedup of about 2.5 times.
This alone is enough to impress developers, because in model training the most expensive part is often not the single successful run but all the trial and error before it. Cutting the GPU time of one experiment by more than half means more data recipes, hyperparameter sets, and model scales can be tested on the same budget.
The paper also conducts multiple sets of small-scale hyperparameter sweep experiments to observe the effects of different bag sizes and superposition step ratios. Finally, the authors believe that within a reasonable range, TST is relatively robust to hyperparameter selection: when the bag size is between 4 and 8 and the ratio of superposition training steps is between 0.2 and 0.4, it usually performs well.
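For illustration, those robust ranges could be captured in a small configuration object; the field names below are hypothetical and are not an API from the paper.

```python
from dataclasses import dataclass

# Hypothetical TST configuration reflecting the ranges the paper reports as robust.
@dataclass
class TSTConfig:
    bag_size: int = 8                     # 4-8 reported to work well
    superposition_fraction: float = 0.3   # 0.2-0.4 of total training steps

    def __post_init__(self):
        assert self.bag_size >= 2, "a bag must group at least two tokens"
        assert 0.0 < self.superposition_fraction < 1.0

config = TSTConfig()
print(config)
```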
Caption: Under different bag sizes and training ratios, TST shows relatively stable benefits in terms of loss and downstream evaluations.
In addition, TST is not driven by a single mechanism.
The paper conducts ablation experiments on the input side, the output side, and the complete TST: either side alone can outperform the baseline, but the complete TST works best. Based on this, the authors point out that TST is a superposition of two mechanisms: the input side changes the input granularity and the FLOPs cost per unit of information; the output side changes the prediction target and the gradient signal.
The intuition behind these mechanisms is that the input side gives the model a low-resolution view in the early stage of training, letting it access more text at a lower cost, while the output side changes the supervision signal from "what is the next token" to "which tokens will appear in the next group," providing a denser, coarser-grained learning signal.