
Reasoning Effort and Problem Complexity: A Scaling Analysis in LLMs

Benjamin Estermann, Roger Wattenhofer
arXiv (2025)
Tags: Reasoning · Benchmark · RL

📝 Paper Summary

Tags: LLM Reasoning Capabilities · Test-time Compute Scaling
By analyzing LLMs on the infinitely scalable Tents puzzle, this study demonstrates that reasoning effort (token count) scales linearly with problem complexity only up to a critical threshold, beyond which models fail to maintain logical coherence.
Core Problem
While recent LLMs leverage increased test-time compute for reasoning, it is unclear how their 'reasoning effort' (token usage) scales as problem complexity is systematically increased.
Why it matters:
  • Understanding scaling behaviors reveals potential bottlenecks and efficiency limits in current reasoning architectures (like o1 or R1)
  • Simply measuring accuracy is insufficient; analyzing token usage provides insight into the 'algorithmic cost' of reasoning within LLMs
  • Identifying critical complexity thresholds helps define the boundaries of current 'system 2' reasoning capabilities in extrapolative settings
Concrete Example: When solving a small Tents puzzle, an LLM might use a few hundred tokens. As the grid grows to 10x10, a human solver's effort increases predictably. If the LLM's token usage peaks and then drops while failing to solve the puzzle (as seen with o3-mini), it indicates a breakdown in the reasoning process rather than a lack of capacity.
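For readers unfamiliar with the testbed, the standard Tents rules (every tent orthogonally adjacent to a tree; no two tents touching, even diagonally; row and column tent counts matching the clues) can be captured in a short validity check. The grid encoding below is an illustrative assumption, not from the paper, and the full tent-tree bijective matching constraint is omitted for brevity:

```python
# Sketch of a Tents solution checker under the standard rules.
# Grid encoding (assumed): 'T' = tree, 'A' = tent, '.' = empty.
# The one-to-one tent/tree matching rule is omitted for brevity.

def check_tents(grid, row_clues, col_clues):
    n, m = len(grid), len(grid[0])
    tents = [(r, c) for r in range(n) for c in range(m) if grid[r][c] == 'A']
    # Row and column tent counts must match the clues.
    for r in range(n):
        if sum(grid[r][c] == 'A' for c in range(m)) != row_clues[r]:
            return False
    for c in range(m):
        if sum(grid[r][c] == 'A' for r in range(n)) != col_clues[c]:
            return False
    for r, c in tents:
        # Every tent needs an orthogonally adjacent tree.
        if not any(0 <= r + dr < n and 0 <= c + dc < m
                   and grid[r + dr][c + dc] == 'T'
                   for dr, dc in [(-1, 0), (1, 0), (0, -1), (0, 1)]):
            return False
        # No two tents may touch, even diagonally.
        for r2, c2 in tents:
            if (r2, c2) != (r, c) and abs(r2 - r) <= 1 and abs(c2 - c) <= 1:
                return False
    return True
```

Because a known linear-time algorithm exists for this check and for solving, any super-linear blow-up in a model's token usage is attributable to the model, not the task.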
Key Novelty
Scaling Analysis using Tents Puzzle
  • Uses the 'Tents' logic puzzle as a controlled testbed because it is infinitely scalable (grid size can be increased arbitrarily) and has a known linear-time solution
  • Analyzes the correlation between problem size (grid dimensions) and reasoning effort (output token count) to test for linear scaling behavior
  • Identifies a 'frustration' phenomenon in some models where reasoning effort decreases after a certain complexity threshold, suggesting a loss of coherence
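The analysis above can be sketched as a simple least-squares fit of token counts against problem size, plus a check for the non-monotonic 'frustration' drop. The token counts below are made-up placeholders for illustration, not the paper's measurements:

```python
# Sketch: fit reasoning-token counts against problem size and flag the
# point where effort stops growing (the 'frustration' peak).
# Token counts here are illustrative placeholders, not measured data.

import statistics

sizes  = [9, 16, 25, 36, 49, 64, 81, 100, 121]   # grid cells (n x n)
tokens = [800, 1400, 2100, 3000, 3900, 4900, 6100, 7400, 5200]

def linear_fit(xs, ys):
    """Ordinary least-squares slope and intercept."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

# Fit only the regime before the peak, where effort still grows.
peak = tokens.index(max(tokens))
slope, intercept = linear_fit(sizes[:peak + 1], tokens[:peak + 1])
print(f"linear regime: ~{slope:.1f} tokens per extra cell")
print(f"effort peaks at size {sizes[peak]}, then drops")
```

A tight linear fit up to the peak, followed by a decline in token usage on still-unsolved larger instances, is the signature the paper reads as a loss of coherence rather than graceful effort scaling.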
Evaluation Highlights
  • OpenAI o3-mini achieved the highest success rate, solving puzzles up to size 100 (10x10 grid), while Qwen/QwQ-32B-Preview struggled beyond size 25
  • DeepSeek R1 and o3-mini demonstrated a linear increase in reasoning tokens as problem size grew, confirming they adapt effort to complexity for solvable instances
  • o3-mini exhibited non-monotonic effort scaling: token usage peaked at problem size 100 and then decreased, indicating a potential failure mode for highly complex problems
Breakthrough Assessment
7/10
A valuable analysis paper that quantifies the 'thinking' process of new reasoning models. While it doesn't propose a new model, the identification of linear scaling and the 'frustration' peak provides important insights into the nature of test-time compute.