With only 27 million parameters, this reasoning model has outperformed DeepSeek and Claude.
Reason like a human.
Is it time for a revolution in the architecture of large models?
For complex reasoning tasks, current large language models (LLMs) mainly rely on the Chain-of-Thought (CoT) technique. However, CoT suffers from problems such as brittle task decomposition, large data requirements, and high latency.
Recently, inspired by the hierarchical and multi-timescale processing mechanisms of the human brain, researchers from Sapient Intelligence proposed the Hierarchical Reasoning Model (HRM), a brand-new recurrent architecture that achieves high computational depth while maintaining training stability and efficiency.
Specifically, HRM performs sequential reasoning tasks in a single forward pass through two interdependent recurrent modules, without explicit supervision of intermediate steps: a high-level module responsible for slow, abstract planning, and a low-level module responsible for fast, detailed computation. With only 27 million parameters and only 1,000 training samples, HRM achieves excellent performance on complex reasoning tasks.
The model runs without pre-training or CoT data, yet achieves nearly perfect performance on challenging tasks such as complex Sudoku puzzles and optimal pathfinding in large mazes. Moreover, on the Abstraction and Reasoning Corpus (ARC), a key benchmark for measuring general artificial intelligence capabilities, HRM outperforms larger models with significantly longer context windows.
From this perspective, HRM has the potential to drive transformative progress in general computing.
Paper: Hierarchical Reasoning Model
Paper link: https://arxiv.org/abs/2506.21734
As shown in the figure below. Left: HRM is inspired by the brain's hierarchical processing and timescale-separation mechanisms. It contains two recurrent networks operating at different time scales that collaboratively solve tasks. Right: Using only about 1,000 training samples, HRM (about 27 million parameters) surpasses state-of-the-art CoT models on the inductive benchmark ARC-AGI and on challenging symbolic tree-search puzzles (Sudoku-Extreme, Maze-Hard), where the CoT models fail completely. HRM is randomly initialized, requires no chain of thought, and solves the tasks directly from the input.
Hierarchical Reasoning Model
The necessity of depth in complex reasoning is shown in the figure below.
Left: On Sudoku-Extreme Full, which requires extensive tree search and backtracking, increasing the width of the Transformer does not improve performance, while increasing the depth is crucial. Right: Standard architectures saturate and cannot benefit from increased depth. HRM overcomes this fundamental limitation and effectively uses its computational depth to achieve nearly perfect accuracy.
The core design of HRM is inspired by the brain: a hierarchical structure combined with multi-timescale processing. Specifically, it draws on:
Hierarchical processing mechanism: The brain processes information through a multi-level hierarchy of cortical regions. High-level regions (such as the prefrontal cortex) integrate information and form abstract representations over longer time scales, while low-level regions (such as the sensory cortices) handle immediate, concrete sensorimotor information.
Timescale separation: The neural activity of these hierarchical levels follows different intrinsic rhythms, manifested as specific neural oscillation patterns. This temporal separation enables high-level regions to stably guide the fast computations of low-level regions.
Recurrent connectivity: The brain is densely wired with recurrent neural connections. These feedback loops improve representation accuracy and context adaptability through iterative refinement, at the cost of extra processing time. Notably, this mechanism avoids the deep credit-assignment problem faced by the Backpropagation Through Time (BPTT) algorithm.
The HRM model consists of four learnable components: the input network f_I(·; θ_I), the low-level recurrent module f_L(·; θ_L), the high-level recurrent module f_H(·; θ_H), and the output network f_O(·; θ_O).
HRM maps the input vector x to the output prediction ŷ. First, the input x is projected by the input network into a representation x̃ = f_I(x; θ_I).
Within each cycle of T timesteps, the L module updates at every timestep, conditioned on its own previous state, the current H state, and the input representation: z_L^i = f_L(z_L^{i-1}, z_H^{i-1}, x̃; θ_L). The H module is updated only once, at the end of the cycle, from the L module's final state: z_H^i = f_H(z_H^{i-1}, z_L^{i-1}; θ_H).
Finally, after N complete cycles, the prediction is extracted from the hidden state of the H module: ŷ = f_O(z_H^{NT}; θ_O).
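To make these dynamics concrete, here is a minimal PyTorch sketch of the forward pass described above. The module names (f_I, f_L, f_H, f_O) follow the paper's notation, but the GRU cells, dimensions, and the hyperparameters N and T are illustrative stand-ins (the paper's actual modules are Transformer-style blocks), not the authors' implementation.

```python
import torch
import torch.nn as nn

class HRMSketch(nn.Module):
    """Minimal sketch of the HRM forward dynamics; module roles follow the
    paper, the concrete cells are illustrative stand-ins."""

    def __init__(self, d_in, d_model, d_out, N=2, T=4):
        super().__init__()
        self.N, self.T = N, T                         # N high-level cycles, T low-level steps per cycle
        self.f_I = nn.Linear(d_in, d_model)           # input network
        self.f_L = nn.GRUCell(2 * d_model, d_model)   # low-level recurrent module (stand-in)
        self.f_H = nn.GRUCell(d_model, d_model)       # high-level recurrent module (stand-in)
        self.f_O = nn.Linear(d_model, d_out)          # output network

    def forward(self, x, z_H, z_L):
        x_tilde = self.f_I(x)                         # x~ = f_I(x; theta_I)
        for _ in range(self.N):                       # N complete cycles
            for _ in range(self.T):                   # L updates at every timestep within the cycle
                z_L = self.f_L(torch.cat([z_H, x_tilde], dim=-1), z_L)
            z_H = self.f_H(z_L, z_H)                  # H updates once per cycle, from L's final state
        return self.f_O(z_H), (z_H, z_L)              # prediction read out from the H state
```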
HRM exhibits hierarchical convergence: the H module converges steadily, while the L module repeatedly converges within each cycle and is then reset by the H module, producing a spike in its residual. By contrast, a standard recurrent neural network converges rapidly, with its residual quickly approaching zero, while a deep feedforward network suffers from vanishing gradients, so significant residuals appear mainly in the initial (input) and final layers.
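One way to probe this convergence behaviour, assuming the per-step hidden states are recorded during the forward pass, is to measure the forward residual between consecutive states; a small hypothetical helper (reusing the torch import above):

```python
def forward_residuals(states):
    """Given a list of hidden states [z^0, z^1, ...] recorded at each timestep,
    return the forward residual ||z^(i+1) - z^i|| at every step. Under
    hierarchical convergence, the L module's residual spikes each time the
    H module resets it."""
    return [torch.norm(b - a).item() for a, b in zip(states, states[1:])]
```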
HRM introduces the following techniques:
First, approximate gradients. Recurrent models usually rely on BPTT to compute gradients. However, BPTT must store every hidden state from the forward pass and combine them with gradients during the backward pass, so memory consumption grows linearly, O(T), with the number of timesteps T.
HRM instead uses a one-step gradient approximation. The core idea is to back-propagate only through the last state of each module and treat all earlier states as constants.
This method requires O(1) memory, does not need unrolling through time, and can be easily implemented in automatic-differentiation frameworks such as PyTorch, as shown in Figure 4.
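A hedged sketch of that one-step approximation, using the HRMSketch modules from above: all but the final L-step and H-step run under torch.no_grad(), so only the last update of each module contributes to the gradient (the paper's Figure 4 gives the authors' actual pseudocode).

```python
def one_step_grad_forward(model, x, z_H, z_L):
    """One-step gradient approximation (sketch): treat all earlier states as
    constants and back-propagate only through the final L and H updates,
    so memory stays O(1) in the number of timesteps."""
    x_tilde = model.f_I(x)
    with torch.no_grad():                                  # no graph is built for these steps
        for n in range(model.N):
            steps = model.T - 1 if n == model.N - 1 else model.T
            for _ in range(steps):
                z_L = model.f_L(torch.cat([z_H, x_tilde], dim=-1), z_L)
            if n < model.N - 1:
                z_H = model.f_H(z_L, z_H)                  # intermediate H updates, still constants
    # only these final updates receive gradients
    z_L = model.f_L(torch.cat([z_H, x_tilde], dim=-1), z_L)
    z_H = model.f_H(z_L, z_H)
    return model.f_O(z_H), (z_H, z_L)
```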
Second, deep supervision. This paper integrates a deep-supervision mechanism into HRM.
Given a data sample (x, y), HRM is forward-propagated multiple times, each pass being called a segment. Let M denote the total number of segments executed before termination. For each segment m ∈ {1, ..., M}, let z^m = (z_H^m, z_L^m) denote the hidden state at the end of segment m, containing the high-level and low-level state components. Figure 4 shows the pseudocode for deep-supervision training.
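A minimal sketch of that training loop, assuming the HRMSketch model and the one_step_grad_forward helper from the earlier snippets; here M is fixed per sample and the zero state initialization is an assumption of the sketch:

```python
def deep_supervision_step(model, optimizer, criterion, x, y, M):
    """Deep supervision (sketch): run M forward segments on the same sample,
    apply a loss and an optimizer step after every segment, and detach the
    carried hidden state so gradients never cross segment boundaries."""
    d = model.f_H.hidden_size
    z_H = torch.zeros(x.size(0), d)                   # zero init is an assumption of this sketch
    z_L = torch.zeros(x.size(0), d)
    for m in range(M):
        y_hat, (z_H, z_L) = one_step_grad_forward(model, x, z_H, z_L)
        loss = criterion(y_hat, y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        z_H, z_L = z_H.detach(), z_L.detach()         # state carries over, gradients do not
    return loss.item()
```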
Adaptive Computation Time (ACT). The brain dynamically switches between automatic thinking (System 1) and deliberate reasoning (System 2).
Inspired by this mechanism, the paper integrates an adaptive halting strategy into HRM to enable both fast and slow thinking.
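The paper implements this with a Q-learning-based halting head; the sketch below only illustrates the shape of the inference-time decision (a two-way head on the H state choosing between "halt" and "continue", with minimum and maximum segment counts), not the authors' training objective.

```python
class HaltingHead(nn.Module):
    """Illustrative ACT head: predict Q-values for 'halt' and 'continue'
    from the H-module state and decide whether to stop reasoning."""

    def __init__(self, d_model):
        super().__init__()
        self.q = nn.Linear(d_model, 2)                # [Q_halt, Q_continue]

    def should_halt(self, z_H, m, m_min, m_max):
        if m >= m_max:                                # hard upper limit on segments
            return True
        if m < m_min:                                 # always "think" for at least m_min segments
            return False
        q_halt, q_cont = self.q(z_H).mean(dim=0)
        return bool(q_halt > q_cont)
```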
Figure 5 shows a performance comparison of two HRM variants. The results show that ACT effectively adapts its compute to task complexity, saving significant resources while having minimal impact on performance.
Inference-time scaling. An effective neural model should be able to dynamically exploit additional compute at inference time to improve performance. As shown in Figure 5(c), HRM can seamlessly scale up inference-time computation simply by increasing the computation-limit parameter M_max, without retraining or changing the architecture.
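Concretely, under the sketches above, inference-time scaling amounts to running the same trained model with a larger segment budget; nothing else changes:

```python
@torch.no_grad()
def infer(model, halting_head, x, m_max):
    """Inference-time scaling (sketch): allow more segments (a larger M_max)
    at test time; the trained weights and architecture are untouched."""
    d = model.f_H.hidden_size
    z_H = torch.zeros(x.size(0), d)
    z_L = torch.zeros(x.size(0), d)
    for m in range(1, m_max + 1):
        y_hat, (z_H, z_L) = model(x, z_H, z_L)        # one reasoning segment
        if halting_head.should_halt(z_H, m, m_min=1, m_max=m_max):
            break
    return y_hat
```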
Experiments and Results
In this study, the authors ran the ARC-AGI, Sudoku, and maze benchmarks; the results are shown in Figure 1.
HRM performs excellently on complex reasoning tasks, but it raises an intriguing question: What underlying reasoning algorithms does the HRM neural network actually implement? Answering this question is crucial for enhancing the model's interpretability and deepening the understanding of the HRM solution space.
The authors visualized HRM's reasoning process. In the maze task, HRM appears to first explore multiple potential paths simultaneously, then prune blocked or inefficient ones, construct a preliminary solution outline, and refine it over multiple iterations. In the Sudoku task, the strategy resembles depth-first search: the model explores candidate solutions and backtracks when it hits a dead end. For ARC tasks, HRM takes a different approach, making incremental adjustments to the board and iterating until a solution is found. Unlike Sudoku, which requires frequent backtracking, ARC solutions follow a more consistent, progressive pattern, similar to hill-climbing optimization.
More importantly, the model can adapt to different reasoning methods and may select effective strategies for each specific task. However, the authors also stated that further research is needed to fully understand these problem - solving strategies.
Visualization of HRM's intermediate predictions on the benchmark tasks. Top: Maze-Hard, where blue cells mark the predicted path. Middle: Sudoku-Extreme, where bold cells are the initially given values, red-highlighted cells violate Sudoku constraints, and gray shading indicates changes from the previous timestep. Bottom: an ARC-AGI-2 task; left, the provided example input-output pairs; right, the intermediate steps in solving the test input.
The following figure compares the hierarchical dimensional organization of the HRM model with that of the mouse cortex.
For example, a dimensional hierarchy can be observed in the mouse cortex, where the Participation Ratio (PR) of population activity increases monotonically from low-level sensory regions to high-level association regions, supporting this relationship between dimensionality and functional complexity (Figure 8a, b).
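The Participation Ratio is a standard dimensionality measure, PR = (Σ_i λ_i)² / Σ_i λ_i², where λ_i are the eigenvalues of the covariance of the population activity. A small helper for computing it from a matrix of recorded states (timepoints × units), independent of the paper's code:

```python
import numpy as np

def participation_ratio(Z):
    """PR = (sum of eigenvalues)^2 / (sum of squared eigenvalues) of the
    covariance of Z, where rows of Z are timepoints and columns are units.
    Higher PR means activity is spread across more dimensions."""
    lam = np.linalg.eigvalsh(np.cov(Z, rowvar=False))
    lam = np.clip(lam, 0.0, None)                     # clamp tiny negative eigenvalues from numerics
    return float(lam.sum() ** 2 / np.sum(lam ** 2))
```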
Figure 8(e, f) shows a clear contrast: in the untrained model, the high-level and low-level modules show no hierarchical differentiation; their PR values are both low and nearly identical.
This control experiment shows that the dimensional hierarchical structure is a characteristic that naturally emerges as the model learns complex reasoning tasks, rather than an inherent property of the model architecture itself.