Reasoning Effort and Problem Complexity: A Scaling Analysis in LLMs

📝 Paper Summary

LLM Reasoning Capabilities, Test-time Compute Scaling

By analyzing LLMs on the infinitely scalable Tents puzzle, this study demonstrates that reasoning effort (token count) scales linearly with problem complexity only up to a critical threshold, beyond which models fail to maintain logical coherence.

Core Problem

While recent LLMs leverage increased test-time compute for reasoning, it is unclear how their 'reasoning effort' (token usage) scales as problem complexity systematically increases.

Why it matters:

Understanding scaling behaviors reveals potential bottlenecks and efficiency limits in current reasoning architectures (like o1 or R1)
Simply measuring accuracy is insufficient; analyzing token usage provides insight into the 'algorithmic cost' of reasoning within LLMs
Identifying critical complexity thresholds helps define the boundaries of current 'system 2' reasoning capabilities in extrapolative settings

Concrete Example: When solving a small Tents puzzle, an LLM might use a few hundred tokens. As the grid grows to 10x10, a human solver's effort increases predictably. If the LLM's token usage peaks and then drops while failing to solve the puzzle (as seen with o3-mini), it indicates a breakdown in the reasoning process rather than a lack of capacity.

Key Novelty

Scaling Analysis using Tents Puzzle

Uses the 'Tents' logic puzzle as a controlled testbed because it is infinitely scalable (grid size can be increased arbitrarily) and has a known linear-time solution
Analyzes the correlation between problem size (grid dimensions) and reasoning effort (output token count) to test for linear scaling behavior
Identifies a 'frustration' phenomenon in some models where reasoning effort decreases after a certain complexity threshold, suggesting a loss of coherence

Evaluation Highlights

OpenAI o3-mini achieved the highest success rate, solving puzzles up to size 100 (10x10 grid), while Qwen/QwQ-32B-Preview struggled beyond size 25
DeepSeek R1 and o3-mini demonstrated a linear increase in reasoning tokens as problem size grew, confirming they adapt effort to complexity for solvable instances
o3-mini exhibited non-monotonic effort scaling: token usage peaked at problem size 100 and then decreased, indicating a potential failure mode for highly complex problems

Breakthrough Assessment

7/10

A valuable analysis paper that quantifies the 'thinking' process of new reasoning models. While it doesn't propose a new model, the identification of linear scaling and the 'frustration' peak provides important insights into the nature of test-time compute.

⚙️ Technical Details

Problem Definition

Setting: Solving logic puzzles (Tents) provided in text format

Inputs: Textual description of the puzzle rules and initial grid state (trees and empty cells)

Outputs: JSON formatted solution representing the solved grid
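
Because the model emits free-form reasoning before its JSON answer, an evaluation harness has to locate the answer inside the response. The following is a minimal sketch of one way to do that, not the paper's parser: the function name `extract_last_json` and the `"grid"` key in the usage note are hypothetical, and it assumes the final JSON value in the response is the intended answer.

```python
import json
from typing import Any, Optional


def extract_last_json(response: str) -> Optional[Any]:
    """Return the last JSON object or array found in `response`, or None.

    Hypothetical helper: the paper's exact answer schema is not given here.
    """
    decoder = json.JSONDecoder()
    result = None
    idx = 0
    while idx < len(response):
        if response[idx] in "{[":
            try:
                # raw_decode parses one JSON value starting at idx and
                # returns it together with the index just past its end.
                obj, end = decoder.raw_decode(response, idx)
            except json.JSONDecodeError:
                idx += 1  # not valid JSON here, keep scanning
                continue
            result = obj
            idx = end
        else:
            idx += 1
    return result
```

For example, `extract_last_json('so the answer is {"grid": ["AT.", "...", ".TA"]}')` yields a dictionary whose `"grid"` entry holds the three row strings; a response with no JSON yields `None`, which a harness would count as a failed attempt.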

Pipeline Flow

Input Generation (create Tents puzzle instance)
Prompting (inject rules + puzzle state)
Inference (Model generates reasoning tokens + JSON solution)
Evaluation (Check validity against rules)
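
The evaluation step can be sketched as a rule checker. This is an illustrative implementation under stated assumptions, not the paper's code: the cell encoding (`T` tree, `A` tent, `.` empty), the function name `is_valid_tents`, and the clue arguments are all hypothetical. It applies the standard Tents rules: row and column tent counts match the clues, no two tents touch (including diagonally), and tents can be paired one-to-one with orthogonally adjacent trees.

```python
from typing import Dict, List, Set, Tuple

Cell = Tuple[int, int]


def is_valid_tents(grid: List[str], row_clues: List[int], col_clues: List[int]) -> bool:
    """Check a filled Tents grid. Assumed encoding: 'T' tree, 'A' tent, '.' empty."""
    rows, cols = len(grid), len(grid[0])
    tents = [(r, c) for r in range(rows) for c in range(cols) if grid[r][c] == "A"]
    trees = {(r, c) for r in range(rows) for c in range(cols) if grid[r][c] == "T"}

    # Rule 1: tent counts per row and per column must equal the clues.
    if [sum(ch == "A" for ch in row) for row in grid] != list(row_clues):
        return False
    if [sum(grid[r][c] == "A" for r in range(rows)) for c in range(cols)] != list(col_clues):
        return False

    # Rule 2: no two tents are adjacent, not even diagonally.
    tent_set = set(tents)
    for r, c in tents:
        for dr in (-1, 0, 1):
            for dc in (-1, 0, 1):
                if (dr, dc) != (0, 0) and (r + dr, c + dc) in tent_set:
                    return False

    # Rule 3: tents and trees pair up one-to-one, each tent orthogonally
    # adjacent to its own tree (bipartite matching via augmenting paths).
    if len(tents) != len(trees):
        return False
    match: Dict[Cell, Cell] = {}  # tree -> tent currently assigned to it

    def try_assign(tent: Cell, seen: Set[Cell]) -> bool:
        r, c = tent
        for tree in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if tree in trees and tree not in seen:
                seen.add(tree)
                if tree not in match or try_assign(match[tree], seen):
                    match[tree] = tent
                    return True
        return False

    return all(try_assign(tent, set()) for tent in tents)
```

A complete harness would also confirm that the trees in the model's answer sit where the original puzzle placed them; that comparison is omitted here.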

System Modules

Puzzle Generator

Generate Tents puzzle instances of varying grid sizes (e.g., 5x5 to 15x15)

Model or implementation: Procedural Algorithm (C code based on Simon Tatham's collection)

Reasoning Model

Solve the puzzle by generating a chain of reasoning followed by a structured answer

Model or implementation: Various (o3-mini, DeepSeek R1, etc.)

Modeling

Base Model: Evaluated multiple models: Gemini 2.0 Flash Thinking, OpenAI o3-mini, DeepSeek R1, Qwen/QwQ-32B-Preview

Training Method: Various (Reinforcement Learning from Human Feedback / Specialized Reasoning Training)

Compute: Total cost of experiments was around 80 USD in API credits. Inference-only evaluation.

Comparison to Prior Work

vs. GSM8K/MATH: Tents puzzle allows systematic, infinite scaling of complexity (grid size) unlike fixed-difficulty math problems
vs. PUZZLES: Focuses on LLM extrapolative reasoning scaling rather than Reinforcement Learning agents
vs. Standard CoT: Evaluates models trained specifically for 'thinking' (o1-like) rather than just prompting standard models
vs. Big-Bench Hard [not cited in paper]: Focuses on a single controllable algorithmic task rather than a diverse suite of hard tasks

Limitations

Evaluation limited to a single puzzle type (Tents)
Maximum solvable problem size was relatively small (10x10), limiting the range of scaling analysis
Results may be sensitive to the specific text representation used for the grid
Analysis relies on proprietary models (o3-mini, Gemini) where exact training data and architecture are opaque

Reproducibility

The full prompt is provided in Appendix A.1. The puzzle generation logic extends the open-source 'PUZZLES' benchmark code (https://github.com/ETH-DISCO/rlp). The reported total API cost is around 80 USD. Weights for o3-mini and Gemini are closed; DeepSeek R1 and QwQ are open-weight models.

📊 Experiments & Results

Evaluation Setup

Zero-shot evaluation (a single prompt containing the puzzle rules) on generated Tents puzzles of increasing size

Benchmarks:

Tents Puzzle (Algorithmic/Logic Puzzle)

Metrics:

Success Rate (binary solvability)
Reasoning Effort (total token count)
Problem Size (rows × columns)

Statistical methodology: Linear regression (R² fit) used to analyze the relationship between problem size and reasoning effort
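
To make the regression step concrete, here is a minimal sketch of an ordinary least-squares fit of token count against problem size, with the coefficient of determination R². It uses only the standard library; the function name `linear_fit_r2` is hypothetical and the paper's own fitting code may differ.

```python
from typing import Sequence, Tuple


def linear_fit_r2(sizes: Sequence[float], tokens: Sequence[float]) -> Tuple[float, float, float]:
    """Fit tokens ~ slope * size + intercept and return (slope, intercept, r2)."""
    n = len(sizes)
    mean_x = sum(sizes) / n
    mean_y = sum(tokens) / n

    # Closed-form least-squares estimates for a single predictor.
    sxx = sum((x - mean_x) ** 2 for x in sizes)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(sizes, tokens))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x

    # R^2 = 1 - (residual sum of squares) / (total sum of squares).
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(sizes, tokens))
    ss_tot = sum((y - mean_y) ** 2 for y in tokens)
    r2 = 1.0 - ss_res / ss_tot
    return slope, intercept, r2
```

An R² close to 1 on correctly solved puzzles would support the linear-scaling reading; the function assumes at least two distinct problem sizes and non-constant token counts, since otherwise the denominators are zero.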

Experiment Figures

Scatter plot of Reasoning Tokens (Y-axis) vs. Problem Size (X-axis) for correctly solved puzzles across four models.

Bar chart or curve showing the maximum solvable problem size or success rate distribution.

Detailed scaling of o3-mini's reasoning effort including failed attempts, showing a peak.

Main Takeaways

Reasoning effort scales linearly with problem size for models that can solve the tasks (o3-mini, R1), suggesting they are performing actual algorithmic work.
There is a 'logical coherence limit': no model successfully solved puzzles larger than size 100 (10x10 grid), regardless of the reasoning tokens generated.
o3-mini exhibits a 'frustration' effect where reasoning effort peaks and then drops for unsolvable/large puzzles, whereas other models might just fail without the drop.
Higher reasoning effort strategies (in o3-mini) allow solving larger puzzles but are token-inefficient for smaller, simpler puzzles compared to low-effort strategies.
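
The 'frustration' pattern described above can be checked mechanically: average the token counts per problem size over all attempts (failed ones included) and see whether effort falls for every size beyond the peak. The sketch below is an illustration only; the function name `effort_peak` is hypothetical, and the numbers in the usage note are made up, not the paper's measurements.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, List, Tuple


def effort_peak(runs: Iterable[Tuple[int, int]]) -> Tuple[int, float, bool]:
    """Locate the problem size with the highest mean token count.

    `runs` holds (problem_size, token_count) pairs, including failed attempts.
    Returns (peak_size, mean_tokens_at_peak, drops_after_peak), where the last
    value is True only if larger sizes exist and all have lower mean effort.
    """
    by_size: Dict[int, List[int]] = defaultdict(list)
    for size, token_count in runs:
        by_size[size].append(token_count)

    means = {size: mean(counts) for size, counts in by_size.items()}
    peak_size = max(means, key=means.get)

    larger = [m for size, m in means.items() if size > peak_size]
    drops_after_peak = bool(larger) and all(m < means[peak_size] for m in larger)
    return peak_size, means[peak_size], drops_after_peak
```

With made-up runs such as `[(25, 1000), (64, 4000), (100, 9000), (144, 7000)]`, the function reports a peak at size 100 followed by a drop; a model whose effort keeps rising up to its largest tested size would return `False` for the third value.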

📚 Prerequisite Knowledge

Prerequisites

Familiarity with Large Language Models (LLMs)
Understanding of 'chain-of-thought' or reasoning token paradigms
Basic knowledge of constraint satisfaction problems

Key Terms

Reasoning Effort: The total number of tokens generated by the model to produce a final answer, serving as a proxy for computational work or 'thinking time'

Tents puzzle: A logic puzzle played on a grid where players must place tents next to trees according to specific adjacency and numerical constraints; solvable in linear time

Test-time compute: The computational resources (tokens/time) an AI model uses during inference to reason through a problem before answering

Extrapolative reasoning: The ability to solve problems that are larger or more complex than those seen during training

Problem size: Defined in this paper as the product of the puzzle grid dimensions (rows × columns)

Logical coherence: The ability of the model to maintain a consistent and valid chain of reasoning throughout the generation process