CoT: Chain-of-Thought—a prompting technique where models generate intermediate reasoning steps in text
BPTT: Backpropagation Through Time—the standard algorithm for training RNNs by unrolling the network over time; it is memory-intensive because every intermediate hidden state must be stored for the backward pass
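The memory cost is concrete even for a scalar RNN: BPTT keeps every hidden state from the forward pass. A toy sketch (the update rule and the loss, the final hidden state, are illustrative, not from any particular model):

```python
import numpy as np

def bptt_grad(w, xs):
    """Gradient of L = h_T w.r.t. w for the scalar RNN h_t = tanh(w*h_{t-1} + x_t).
    The forward pass stores every hidden state -- this list is exactly
    the O(T) memory cost that equilibrium-based gradients avoid."""
    hs = [0.0]
    for x in xs:                        # forward: unroll over time
        hs.append(float(np.tanh(w * hs[-1] + x)))
    grad, dh = 0.0, 1.0                 # backward: walk time in reverse
    for t in range(len(xs), 0, -1):
        ds = dh * (1.0 - hs[t] ** 2)    # backprop through tanh
        grad += ds * hs[t - 1]          # contribution of w at step t
        dh = ds * w                     # pass gradient back to h_{t-1}
    return grad
```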
Deep Equilibrium Models (DEQ): Neural networks that find a fixed point (equilibrium) of a hidden layer and compute gradients using the Implicit Function Theorem instead of unrolling layers
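A minimal sketch of the DEQ forward pass for a toy contractive map z = tanh(Wz + x + b); real DEQs use faster root solvers (e.g. Broyden's method) and Implicit-Function-Theorem gradients rather than plain iteration:

```python
import numpy as np

def deq_forward(x, W, b, tol=1e-6, max_iter=100):
    """Iterate z <- tanh(W z + x + b) until an approximate fixed point.
    Only the equilibrium z* is needed for the implicit gradient, so no
    per-iteration history has to be kept (unlike BPTT)."""
    z = np.zeros_like(x)
    for _ in range(max_iter):
        z_new = np.tanh(W @ z + x + b)
        if np.linalg.norm(z_new - z) < tol:
            return z_new
        z = z_new
    return z
```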
ARC-AGI: Abstraction and Reasoning Corpus—a benchmark measuring general intelligence through few-shot solving of visual logic puzzles
Hierarchical convergence: The process where a low-level module converges to a local equilibrium conditioned on a high-level state, which then updates to restart the low-level process
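The two-loop structure can be sketched with scalar states; the update rules and coefficients below are illustrative stand-ins, not the model's actual modules:

```python
import numpy as np

def hierarchical_step(x, n_cycles=3, n_low=20):
    """Toy hierarchical convergence: the low-level state zL iterates to
    an equilibrium conditioned on the frozen high-level state zH; zH then
    takes one update from the converged zL, restarting the low-level
    process with a new context."""
    zH, zL = 0.0, 0.0
    for _ in range(n_cycles):
        for _ in range(n_low):           # low-level converges given zH
            zL = np.tanh(0.4 * zL + zH + x)
        zH = np.tanh(0.5 * zH + zL)      # high-level update, new context
    return zH, zL
```

Because the inner map is a contraction, extra low-level iterations change nothing once zL has converged, which is what makes the nested structure stable.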
One-step gradient: An approximation method that computes gradients at the equilibrium point using only the final state, avoiding the memory cost of storing history
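For a scalar map z = tanh(wz + x) the approximation can be written out and compared with the exact Implicit-Function-Theorem gradient; these functions are illustrative, not any library's API:

```python
import numpy as np

def fixed_point(w, x, iters=200):
    z = 0.0
    for _ in range(iters):
        z = np.tanh(w * z + x)
    return z

def one_step_grad(w, x):
    """Gradient of ONE application z' = tanh(w*z + x) w.r.t. x, evaluated
    at the equilibrium z* with z* treated as a constant -- no iteration
    history is stored."""
    z_star = fixed_point(w, x)
    return 1.0 - np.tanh(w * z_star + x) ** 2

def exact_implicit_grad(w, x):
    """Exact dz*/dx from the Implicit Function Theorem, for comparison:
    dz*/dx = s / (1 - w*s) with s = sech^2(w*z* + x)."""
    z_star = fixed_point(w, x)
    s = 1.0 - np.tanh(w * z_star + x) ** 2
    return s / (1.0 - w * s)
```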
Deep supervision: Training technique where the model predicts the output and computes loss at multiple intermediate steps (segments) rather than just at the end
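A sketch of the training-loop shape, using a toy one-step "segment" update; in an autograd framework the state would also be detached between segments so each loss only trains its own segment:

```python
import numpy as np

def segment_step(state, x, w):
    # Hypothetical recurrent update standing in for one segment
    # of the model's internal iterations.
    return np.tanh(w * state + x)

def deep_supervision_losses(x, target, w, n_segments=4):
    """Run n_segments forward segments, computing a loss after each one
    instead of only at the end.
    In a real training loop: state = state.detach() between segments."""
    state = np.zeros_like(x)
    losses = []
    for _ in range(n_segments):
        state = segment_step(state, x, w)
        losses.append(float(np.mean((state - target) ** 2)))
    return losses
```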
Adaptive Computational Time (ACT): Mechanism allowing the model to dynamically decide when to stop 'thinking' (iterating internal states) based on a learned halting policy
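One classic formulation of such halting is Graves-style ACT, where a learned head emits a halting probability each step and iteration stops once the cumulative probability crosses a threshold; `step_fn` and `halt_fn` below are hypothetical stand-ins for the model's state update and halting head:

```python
def act_loop(step_fn, state, halt_fn, max_steps=16, threshold=0.99):
    """Iterate until the cumulative halting probability crosses the
    threshold, or until max_steps is hit. Returns the final state and
    the number of steps actually taken."""
    cum = 0.0
    for t in range(max_steps):
        state = step_fn(state)
        cum += halt_fn(state)       # learned halting probability
        if cum >= threshold:
            break
    return state, t + 1
```

The payoff is variable compute: easy inputs halt early, hard inputs keep iterating up to the cap.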
RMSNorm: Root Mean Square Normalization—a normalization technique used in Transformers to stabilize training
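The operation itself is small enough to write out; gamma is the learned gain, and unlike LayerNorm there is no mean-centering or bias term:

```python
import numpy as np

def rms_norm(x, gamma, eps=1e-8):
    """RMSNorm over the last axis: divide x by its root mean square,
    then apply the learned elementwise gain gamma."""
    rms = np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)
    return gamma * x / rms
```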
AdamW: A variant of the Adam optimizer that decouples weight decay from gradient updates
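A single-step sketch of the update; note the weight-decay term is added directly to the weight update rather than folded into the gradient (which is what plain Adam with L2 regularization would do):

```python
import numpy as np

def adamw_step(w, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8, wd=0.01):
    """One AdamW update for parameters w with gradient g.
    m, v are the running first/second moments; t is the step count."""
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g ** 2
    m_hat = m / (1 - b1 ** t)          # bias correction
    v_hat = v / (1 - b2 ** t)
    # Decoupled weight decay: wd * w is applied to the weights directly,
    # outside the adaptive gradient term -- the key difference from Adam.
    w = w - lr * (m_hat / (np.sqrt(v_hat) + eps) + wd * w)
    return w, m, v
```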
Post-Norm: An architecture where normalization is applied after the residual connection (as in the original Transformer), in contrast to Pre-Norm, which normalizes the input to each sublayer
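The difference from Pre-Norm is just the order of operations around the residual connection; `rms_norm` here is a minimal stand-in with the learned gain omitted:

```python
import numpy as np

def rms_norm(x, eps=1e-8):
    # minimal normalization for illustration (gain omitted)
    return x / np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)

def post_norm_block(x, sublayer):
    # Post-Norm: residual add first, THEN normalize the sum
    return rms_norm(x + sublayer(x))

def pre_norm_block(x, sublayer):
    # Pre-Norm (for contrast): normalize first, then add the residual
    return x + sublayer(rms_norm(x))
```

One visible consequence: Post-Norm renormalizes the residual stream at every block, while Pre-Norm leaves an unnormalized residual path straight through the network.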