Reasoning with Latent Thoughts: On the Power of Looped Transformers

📝 Paper Summary

Language Model Scaling Laws
Parameter Efficient Reasoning

Looped transformers, which iteratively reuse a small block of layers, achieve strong reasoning performance by increasing effective depth without increasing parameter count, acting as an implicit chain-of-thought mechanism.

Core Problem

Standard scaling laws suggest performance depends primarily on parameter count, but reasoning problems (like math or induction) often require computational depth (number of steps) rather than just memory capacity.

Why it matters:

Building deeper models to handle complex reasoning usually requires a linear increase in parameters and memory, which is computationally expensive.
Current non-looped models may fail at reasoning tasks even with many parameters if they lack sufficient depth to process compositional steps.

Concrete Example: In an n-ary addition task (e.g., adding 32 numbers), a shallow model with many parameters fails to track the carry-over operations, whereas a looped model with few parameters but high effective depth can solve it.

Key Novelty

Reasoning via Looped Transformers ($k \otimes L$)

Decouple model depth from parameter count by iteratively applying the same block of $k$ Transformer layers $L$ times (weight tying).
Frame the loop theoretically as generating 'latent thoughts', which lets the model simulate Chain-of-Thought (CoT) reasoning steps internally without emitting tokens.
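
One way to write the $k \otimes L$ construction as a formula is sketched below. The symbols $B_\theta$ (the block) and $f_{\theta_i}$ (its individual layers) are notation introduced here for illustration and are not taken from the paper.

$$
B_\theta = f_{\theta_k} \circ \cdots \circ f_{\theta_1}, \qquad
(k \otimes L)(x) = \underbrace{B_\theta \circ \cdots \circ B_\theta}_{L \text{ times}}(x)
$$

The same parameters $\theta = (\theta_1, \dots, \theta_k)$ appear in every iteration, so the effective depth is $kL$ while the number of unique layers stays at $k$.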

Architecture

Conceptual illustration of looped vs. non-looped models (described in text)

Evaluation Highlights

On 32-operand addition, a looped model ($k \otimes 12/k$) nearly matches the performance of a full 12-layer non-looped baseline while using only a $1/L$ fraction of its parameters, where $L = 12/k$.
On i-GSM (synthetic math), looped models match or outperform iso-flop non-looped models (same effective depth, more params) and significantly outperform iso-param models.
In 1B-scale language modeling, looped models demonstrate an inductive bias for reasoning, achieving competitive performance on math/coding tasks compared to iso-flop baselines despite having worse perplexity.

Breakthrough Assessment

8/10

Challenges the dominant parameter-scaling paradigm by demonstrating that depth via looping is sufficient for reasoning, offering a pathway to highly parameter-efficient 'thinking' models.

⚙️ Technical Details

Problem Definition

Setting: Sequence-to-sequence learning and Causal Language Modeling

Inputs: Input sequence (e.g., math problem, text prompt)

Outputs: Target sequence (e.g., solution, next token)

Pipeline Flow

Input Embedding
Recurrent Processing (Transformer Block looped L times)
Output Head

System Modules

Transformer Block

Perform one step of latent reasoning/computation

Model or implementation: Standard Transformer Layers ($k$ layers)

Loop Mechanism

Orchestrate the iterative application of the Transformer Block

Model or implementation: Control logic (fixed $L$ iterations)
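
The two modules above can be sketched together in a few lines. This is a minimal illustration, not the paper's implementation: the layers are toy arithmetic functions standing in for Transformer layers, and the names `looped_forward`, `add_one`, and `double` are hypothetical.

```python
from typing import Callable, List, Sequence

Vector = List[float]
Layer = Callable[[Vector], Vector]


def looped_forward(x: Vector, block: Sequence[Layer], num_loops: int) -> Vector:
    """Apply the same k-layer block num_loops times (weight tying)."""
    h = list(x)
    for _ in range(num_loops):   # L iterations of the loop
        for layer in block:      # the same k layers are reused on every iteration
            h = layer(h)
    return h


# Toy stand-ins for Transformer layers (hypothetical, for illustration only).
def add_one(h: Vector) -> Vector:
    return [v + 1.0 for v in h]


def double(h: Vector) -> Vector:
    return [v * 2.0 for v in h]


block = [add_one, double]                    # k = 2 unique layers
out = looped_forward([0.0, 1.0], block, 3)   # L = 3, effective depth k * L = 6
print(out)
```

Because `block` is the same list of callables on every pass, adding loops increases the number of applied layers without adding any new layer objects, which mirrors how looping raises effective depth at a fixed parameter count.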

Novel Architectural Elements

Application of the ($k \otimes L$) notation and design to decouple depth from parameters specifically for reasoning tasks
Use of looping as a replacement for explicit Chain-of-Thought tokens in 'latent' space

Modeling

Base Model: Transformer (Decoder-only for LM experiments)

Training Method: Standard supervised learning (for synthetic tasks) and Causal Language Modeling (for Pile)

Training Data:

Synthetic Addition: Uniform mixture of 2, 4, 8, 16, 32 operands
p-hop induction: Synthetic pointer chasing sequences
i-GSM: Synthetic math problems (DAG of arithmetic modulo 7)
The Pile: 250B tokens for Language Modeling
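
As an illustration of the first dataset, the sketch below samples one n-ary addition problem with the operand count drawn uniformly from {2, 4, 8, 16, 32}. The operand width (`num_digits`), the `a+b+...=` text format, and the function name are assumptions made for this example; the summary does not specify them.

```python
import random

OPERAND_COUNTS = (2, 4, 8, 16, 32)  # uniform mixture stated in the summary


def make_addition_example(rng: random.Random, num_digits: int = 3):
    """Return (prompt, answer) for one n-ary addition problem.

    num_digits and the "a+b+...=" format are assumptions for illustration.
    """
    n = rng.choice(OPERAND_COUNTS)   # sample the operand count uniformly
    hi = 10 ** num_digits - 1
    operands = [rng.randint(0, hi) for _ in range(n)]
    prompt = "+".join(str(v) for v in operands) + "="
    answer = str(sum(operands))
    return prompt, answer


rng = random.Random(0)
prompt, answer = make_addition_example(rng)
print(prompt, answer)
```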

Key Hyperparameters:

k (unique layers): Varied in {2, 3, 4, 6, 8, 12}
L (loops): Varied to match effective depths of 12 or 24
Model Scale: Up to 1B parameters (24 layers)
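
The relationship between $k$, $L$, and effective depth is simple arithmetic: $L = \text{depth} / k$. The sketch below enumerates the $(k, L)$ pairs that hit a target depth exactly, assuming only values of $k$ that divide the depth are used. The paper's actual experimental grid may differ.

```python
K_VALUES = (2, 3, 4, 6, 8, 12)   # unique-layer counts listed above
TARGET_DEPTHS = (12, 24)         # effective depths listed above


def loop_configs(target_depth: int, k_values=K_VALUES):
    """Return (k, L) pairs with k * L == target_depth."""
    return [(k, target_depth // k) for k in k_values if target_depth % k == 0]


configs = {depth: loop_configs(depth) for depth in TARGET_DEPTHS}
for depth, pairs in configs.items():
    print(depth, pairs)
```

Under this assumption, $k = 8$ appears only at depth 24, since 8 does not divide 12.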

Compute: Not reported in the paper

Comparison to Prior Work

vs. Standard Transformers: Looped models achieve similar reasoning at a fraction of the parameters by reusing weights.
vs. Universal Transformers: This work focuses specifically on reasoning capabilities and on the trade-off between perplexity and reasoning, rather than on parameter efficiency or adaptive compute alone.

Limitations

Looped models generally exhibit worse perplexity (PPL) compared to iso-flop non-looped models because they have fewer parameters to memorize data.
The definition of 'reasoning' is difficult to formalize, leading to reliance on synthetic proxies (addition, math) and specific benchmarks.
Training stability or optimization challenges for looped models are not deeply detailed in the provided text.

Reproducibility

No specific code URL provided in the text. Synthetic datasets (Addition, p-hop, i-GSM) are described procedurally. Training on 'The Pile' follows standard practices.

📊 Experiments & Results

Evaluation Setup

Comparison of looped vs. non-looped models on synthetic reasoning and standard LM benchmarks.

Benchmarks:

n-ary Addition: algorithmic reasoning, synthetic [New]
p-hop Induction: associative recall / pointer chasing [New]
i-GSM: symbolic math in grade-school style [New]
Language Modeling Benchmarks: closed-book QA, open-book QA, math, reasoning
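
To make the pointer-chasing benchmark concrete, here is one plausible formalization of p-hop induction: starting from the last token, repeatedly jump to the position just after the most recent earlier occurrence of the current token. This is an illustrative reading and the paper's exact definition may differ; `hop_once` and `p_hop` are hypothetical names.

```python
from typing import List, Optional


def hop_once(seq: List[str], pos: int) -> Optional[int]:
    """Index just after the most recent earlier occurrence of seq[pos]."""
    for j in range(pos - 1, -1, -1):   # scan backwards from pos - 1
        if seq[j] == seq[pos]:
            return j + 1               # position following the earlier occurrence
    return None                        # the token never appeared earlier


def p_hop(seq: List[str], p: int) -> Optional[str]:
    """Follow p hops from the final token; None if a hop has no target."""
    pos = len(seq) - 1
    for _ in range(p):
        nxt = hop_once(seq, pos)
        if nxt is None:
            return None
        pos = nxt
    return seq[pos]


seq = list("acbabcb")
print(p_hop(seq, 1), p_hop(seq, 2), p_hop(seq, 3))
```

Each hop depends on the result of the previous one, which is why the task is a natural probe of sequential depth rather than of memorization capacity.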

Metrics:

Accuracy
Perplexity (for LM pretraining)
Statistical methodology: Not explicitly reported in the paper

Main Takeaways

Depth is the primary driver for reasoning tasks: Looped models ($k \otimes L$) closely match the performance of non-looped models ($kL \otimes 1$) on addition and induction tasks, despite having $L$ times fewer parameters.
Parameter count drives perplexity: Iso-flop non-looped models have better perplexity than looped models, confirming that memorization/prediction requires capacity (parameters).
Looped models exhibit a strong inductive bias for reasoning: In 1B scale experiments, looped models perform competitively on reasoning-heavy downstream tasks (math, coding) compared to much larger iso-flop baselines, even while their perplexity is worse.
Performance on reasoning tasks scales logarithmically with effective depth (loops), exhibiting behavior similar to inference-time scaling in Chain-of-Thought.

📚 Prerequisite Knowledge

Prerequisites

Transformer Architecture (Layers, Attention)
Weight Sharing / Tied Weights
Chain-of-Thought (CoT) Reasoning
Scaling Laws (Iso-flop vs. Iso-param)

Key Terms

Looped Model ($k \otimes L$): A model architecture where a block of $k$ unique transformer layers is applied iteratively $L$ times to the input representation

Iso-flop: Comparison between models that require the same number of floating-point operations (computation) during inference (e.g., a looped $k \otimes L$ model vs. a non-looped $kL \otimes 1$ model of the same effective depth)

Iso-param: Comparison between models that have the same number of trainable parameters (e.g., a non-looped $k \otimes 1$ model vs. a looped $k \otimes L$ model)
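
The two comparison regimes can be checked mechanically. The sketch below uses the number of unique layers as a proxy for parameter count and the effective depth as a proxy for inference FLOPs. It assumes all layers are the same size and ignores embeddings and the output head, and the class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LoopedConfig:
    k: int  # unique layers in the block
    L: int  # number of times the block is applied

    @property
    def unique_layers(self) -> int:
        return self.k            # proxy for parameter count

    @property
    def effective_depth(self) -> int:
        return self.k * self.L   # proxy for inference FLOPs


def is_iso_flop(a: LoopedConfig, b: LoopedConfig) -> bool:
    return a.effective_depth == b.effective_depth


def is_iso_param(a: LoopedConfig, b: LoopedConfig) -> bool:
    return a.unique_layers == b.unique_layers


looped = LoopedConfig(k=4, L=3)               # 4 (x) 3
iso_flop_baseline = LoopedConfig(k=12, L=1)   # 12 (x) 1: same depth, 3x the layers
iso_param_baseline = LoopedConfig(k=4, L=1)   # 4 (x) 1: same layers, 1/3 the depth
print(is_iso_flop(looped, iso_flop_baseline), is_iso_param(looped, iso_param_baseline))
```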

Inductive Bias: Assumptions built into a learning algorithm that encourage it to learn certain types of solutions (here, reasoning processes) over others (like rote memorization)

Latent Thoughts: Intermediate hidden states generated during the loop iterations that represent reasoning steps, analogous to explicit tokens in Chain-of-Thought

Perplexity: A measure of how well a probability model predicts a sample, computed as the exponential of the average per-token negative log-likelihood; lower values indicate better prediction
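
As a small worked example of the standard definition, the sketch below computes perplexity from the probabilities a model assigned to each observed token. The function name and the sample probabilities are illustrative.

```python
import math
from typing import Sequence


def perplexity(token_probs: Sequence[float]) -> float:
    """exp of the mean negative log-probability of the observed tokens."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)


# A model that spreads probability uniformly over 4 choices at every step
# has perplexity 4 (up to floating-point rounding).
uniform_ppl = perplexity([0.25] * 8)
print(uniform_ppl)
```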

i-GSM: A synthetic grade-school math dataset constructed as a Directed Acyclic Graph (DAG) of arithmetic operations
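
The DAG structure can be illustrated with a tiny evaluator. This is a sketch under stated assumptions: the dictionary encoding, the node names, and the operator set (`+`, `-`, `*`) are hypothetical and chosen for readability, with all arithmetic taken modulo 7 as described in the training data section. The real dataset's problem format is not reproduced here.

```python
from typing import Dict, Tuple, Union

# A node is either a constant or (operator, left_name, right_name).
Node = Union[int, Tuple[str, str, str]]
MOD = 7


def evaluate(dag: Dict[str, Node], target: str) -> int:
    """Evaluate one node of an arithmetic DAG, reducing every result mod 7."""
    cache: Dict[str, int] = {}

    def value(name: str) -> int:
        if name in cache:                  # shared sub-results are computed once
            return cache[name]
        node = dag[name]
        if isinstance(node, int):
            result = node % MOD
        else:
            op, left, right = node
            a, b = value(left), value(right)
            if op == "+":
                result = (a + b) % MOD
            elif op == "-":
                result = (a - b) % MOD
            elif op == "*":
                result = (a * b) % MOD
            else:
                raise ValueError(f"unknown operator: {op}")
        cache[name] = result
        return result

    return value(target)


problem: Dict[str, Node] = {
    "a": 5,
    "b": 4,
    "c": ("+", "a", "b"),   # (5 + 4) mod 7 = 2
    "d": ("*", "c", "a"),   # (2 * 5) mod 7 = 3
    "e": ("-", "d", "b"),   # (3 - 4) mod 7 = 6
}
print(evaluate(problem, "e"))
```

Answering the query for `e` requires resolving `c` and then `d` first, so the number of dependent steps grows with the depth of the graph.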