Recurrent depth: Applying the same transformer block multiple times in a loop to process the same tokens, effectively increasing network depth without adding parameters
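A minimal sketch of the idea, with a stand-in `block` function in place of a real transformer block (the names here are illustrative, not from the model's actual code):

```python
def recurrent_forward(block, x, num_iterations):
    """Apply the same block repeatedly: the weights are reused on every
    pass, so depth grows with num_iterations but parameter count does not."""
    for _ in range(num_iterations):
        x = block(x)
    return x

# Toy example: a "block" that just increments its input.
print(recurrent_forward(lambda v: v + 1, 0, 5))  # -> 5
```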
Latent space: The internal vector representation of data within a neural network, as opposed to the discrete token space of words
Test-time compute: The amount of computation (FLOPs) used during inference (generating answers), which can be increased to improve performance
Chain-of-Thought: A technique where models generate intermediate reasoning steps in text before producing the final answer
RoPE: Rotary Positional Embeddings—a method for encoding token positions in transformers using rotation matrices
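RoPE's key property is that rotating queries and keys by position-dependent angles makes their dot product depend only on relative position. A toy sketch on a single 2-D feature pair (a simplification: real RoPE rotates every consecutive pair of dimensions, each at its own frequency):

```python
import math

def rope_rotate(pair, position, freq=1.0):
    """Rotate one 2-D feature pair by an angle proportional to its position."""
    angle = position * freq
    x, y = pair
    return (x * math.cos(angle) - y * math.sin(angle),
            x * math.sin(angle) + y * math.cos(angle))

# Attention scores depend only on relative position: shifting both
# the query and key positions by the same offset leaves the dot
# product unchanged.
q, k = (1.0, 0.5), (0.2, 0.8)
dot = lambda a, b: a[0] * b[0] + a[1] * b[1]
s1 = dot(rope_rotate(q, 3), rope_rotate(k, 1))
s2 = dot(rope_rotate(q, 13), rope_rotate(k, 11))
print(abs(s1 - s2) < 1e-9)  # -> True
```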
RMSNorm: Root Mean Square Normalization—a normalization technique used to stabilize training in deep neural networks
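For reference, RMSNorm divides each vector by its root mean square and applies a learned per-dimension gain, with no mean subtraction or bias (a plain-Python sketch; `eps` is the usual small constant for numerical stability):

```python
import math

def rms_norm(x, weight, eps=1e-6):
    """Scale x by the reciprocal of its root mean square, then apply
    a learned elementwise gain (no mean-centering, unlike LayerNorm)."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [w * v / rms for w, v in zip(weight, x)]

print(rms_norm([3.0, 4.0], [1.0, 1.0]))  # roughly [0.849, 1.131]
```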
KV-cache: Key-Value cache—storing previously computed attention keys and values so they need not be recomputed during generation, which this model can share across recurrent steps

SiLU: Sigmoid Linear Unit—an activation function used in the model's MLPs
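SiLU is simply the input multiplied by its sigmoid, a smooth alternative to ReLU:

```python
import math

def silu(x):
    """SiLU (a.k.a. swish) activation: x * sigmoid(x)."""
    return x * (1.0 / (1.0 + math.exp(-x)))

print(silu(0.0))  # -> 0.0
```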
Effective depth: The total number of layers the data passes through, calculated as (prelude layers) + (recurrent layers × iterations) + (coda layers)
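The arithmetic is straightforward; the configuration numbers below are illustrative, not necessarily the model's actual layer counts:

```python
def effective_depth(prelude, recurrent, coda, iterations):
    """Total layers traversed: prelude + (recurrent layers * iterations) + coda."""
    return prelude + recurrent * iterations + coda

# e.g. 2 prelude layers + 4 recurrent layers looped 32 times + 2 coda layers
print(effective_depth(2, 4, 2, 32))  # -> 132
```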