
Harnessing the Reasoning Economy: A Survey of Efficient Reasoning for Large Language Models

Rui Wang, Hongru Wang, Boyang Xue, Jianhui Pang, Shudong Liu, Yi Chen, Jiahao Qiu, Derek Fai Wong, Heng Ji, Kam-Fai Wong
The Chinese University of Hong Kong, University of Macau, The University of Hong Kong, Princeton University, University of Illinois Urbana-Champaign
arXiv (2025)
Reasoning · RL · Benchmark

📝 Paper Summary

Large Reasoning Models (LRMs) · Efficient Inference · System 2 Reasoning
This survey establishes the concept of 'Reasoning Economy' (balancing performance benefits against computational budgets) and categorizes strategies for optimizing Large Reasoning Models (LRMs) during both post-training and test time.
Core Problem
Large Reasoning Models (LRMs) often exhibit inefficient behaviors, such as 'overthinking' on simple tasks (wasting compute) or 'underthinking' on complex ones (failing to solve them), because they lack a mechanism to dynamically adjust reasoning effort to task difficulty.
Why it matters:
  • Applying a 'one-size-fits-all' deep reasoning approach to all tasks wastes significant computational resources and time.
  • Long Chain-of-Thought (CoT) sequences often contain redundant tokens that do not contribute to the final answer.
  • Current models fail to strike an optimal balance between accuracy (benefit) and token usage (budget), unlike humans, who intuitively know when to stop thinking or to think more deeply.
Concrete Example: A model might generate a massive reasoning chain for a simple arithmetic problem (overthinking), wasting tokens, while failing to trigger deep search for a complex AIME math problem (underthinking), resulting in an incorrect answer.
Key Novelty
Taxonomy of Reasoning Economy
  • Introduces the concept of 'Reasoning Economy' to quantify the trade-off between model performance and computational cost.
  • Systematically categorizes efficiency bottlenecks into 'Inefficient Model Behaviors' (e.g., length bias, fake thinking) and 'Inefficient Model Usage' (unreasonable algorithm selection).
  • Classifies optimization solutions into Post-training regulations (Data, Algorithm, Architecture) and Test-time improvements (Input/Output adaptive budgeting).
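The performance-versus-cost trade-off at the heart of 'Reasoning Economy' can be sketched as a simple utility function. The linear form, the `cost_weight` value, and the function name below are illustrative assumptions for this summary, not a formula from the survey:

```python
def reasoning_economy(accuracy: float, tokens: int, cost_weight: float = 1e-4) -> float:
    """Illustrative utility: benefit (accuracy) minus a weighted token budget.

    The linear penalty and cost_weight are assumptions chosen for the sketch;
    the survey frames the trade-off conceptually rather than as one formula.
    """
    return accuracy - cost_weight * tokens

# Overthinking: a small accuracy gain bought with many extra tokens can
# lower the overall utility relative to a concise chain of thought.
concise = reasoning_economy(accuracy=0.80, tokens=300)    # 0.80 - 0.03 = 0.77
verbose = reasoning_economy(accuracy=0.82, tokens=4000)   # 0.82 - 0.40 = 0.42
assert concise > verbose
```

Under this toy utility, the 'one-size-fits-all' deep-reasoning policy criticized above loses to an adaptive policy that spends tokens only where they buy accuracy.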
Evaluation Highlights
  • Highlights that LLaMA-3-8B-Instruct improves accuracy from 82.9% (100 samples) to 98.44% (10,000 samples) via test-time scaling (cited from Brown et al., 2024).
  • Notes that DeepSeek-R1-Distill-Qwen-14B improves AIME24 accuracy from 69.7% (pass@1) to 80% (majority vote @ 64 samples) via parallel scaling (cited from Yang et al., 2024a).
  • Identifies that test-time scaling is often more effective than additional training for easy/medium problems, but less so for difficult problems (cited from Snell et al., 2024).
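The parallel-scaling result above (pass@1 vs. majority vote @ 64 samples) relies on self-consistency aggregation: sample many independent answers and keep the most frequent one. A minimal sketch of that aggregation step, with hypothetical sample counts:

```python
from collections import Counter

def majority_vote(answers: list[str]) -> str:
    """Return the most frequent answer among independently sampled outputs
    (self-consistency aggregation used in parallel test-time scaling)."""
    counts = Counter(answers)
    best_answer, _ = counts.most_common(1)[0]
    return best_answer

# 64 hypothetical samples for one AIME-style problem: even though any single
# sample is wrong ~half the time, the plurality answer is correct.
samples = ["204"] * 30 + ["210"] * 20 + ["198"] * 14
assert majority_vote(samples) == "204"
```

This is why majority voting can lift accuracy well above pass@1: errors are spread across many distinct wrong answers, while correct samples concentrate on one.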
Breakthrough Assessment
8/10
A timely and comprehensive survey that formalizes the rapidly emerging field of efficient reasoning (System 2) in LLMs, providing a crucial roadmap for future research in 'Reasoning Economy'.