CoT: Chain-of-Thought—a technique where models generate intermediate reasoning steps to solve complex problems
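To make the technique concrete, here is a minimal sketch of how a few-shot CoT prompt is assembled (the first exemplar is the well-known tennis-ball example from the original CoT literature; the trailing "Let's think step by step" cue comes from zero-shot CoT, combined here purely for illustration):

```python
# Few-shot chain-of-thought prompt: the exemplar demonstrates intermediate
# reasoning steps, nudging the model to reason before answering.
exemplar = (
    "Q: Roger has 5 tennis balls. He buys 2 more cans of 3 tennis balls each. "
    "How many tennis balls does he have now?\n"
    "A: Roger started with 5 balls. 2 cans of 3 balls each is 6 balls. "
    "5 + 6 = 11. The answer is 11.\n\n"
)
question = (
    "Q: A cafeteria had 23 apples. They used 20 for lunch and bought 6 more. "
    "How many apples do they have?\n"
)
# The cue below invites step-by-step reasoning even without exemplars.
prompt = exemplar + question + "A: Let's think step by step."
print(prompt)
```

The prompt string would then be sent to any LLM completion API; the model's intermediate steps appear in its continuation before the final answer.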
LRMs: Large Reasoning Models—LLMs specifically optimized for reasoning tasks, often via reinforcement learning (e.g., OpenAI o1, DeepSeek-R1)
Overthinking Phenomenon: The tendency of reasoning models to generate excessively detailed or redundant steps for simple problems, wasting compute
System-1 vs System-2: System-1 refers to fast, intuitive thinking; System-2 refers to slow, deliberate, step-by-step reasoning
PPO: Proximal Policy Optimization—a policy-gradient reinforcement learning algorithm that clips the policy-update ratio for training stability; widely used to fine-tune LLMs (e.g., in RLHF)
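The core of PPO is its clipped surrogate objective. A minimal NumPy sketch (function name and toy values are illustrative, not from any specific implementation):

```python
import numpy as np

def ppo_clip_loss(logp_new, logp_old, advantages, eps=0.2):
    """PPO clipped surrogate loss (to be minimized).

    Takes the pessimistic minimum of the unclipped and clipped
    importance-weighted advantages, then negates it to form a loss.
    """
    ratio = np.exp(logp_new - logp_old)               # pi_new(a|s) / pi_old(a|s)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1 - eps, 1 + eps) * advantages
    return -np.mean(np.minimum(unclipped, clipped))

# Toy check: identical policies give ratio = 1 everywhere,
# so the loss reduces to -mean(advantages).
logp = np.log(np.array([0.5, 0.25, 0.25]))
adv = np.array([1.0, -1.0, 0.5])
print(ppo_clip_loss(logp, logp, adv))  # ≈ -0.1667
```

The clipping keeps each update close to the old policy, which is why PPO is preferred over vanilla policy gradients for fine-tuning large models.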
SFT: Supervised Fine-Tuning—training a model on labeled examples (input-output pairs)
Process Reward Model (PRM): A reward model that evaluates the correctness of intermediate reasoning steps, not just the final answer
Monte Carlo Tree Search (MCTS): A search algorithm used to explore reasoning paths by simulating future outcomes, speculated to be used in models like OpenAI o1
KV Cache: Key-Value Cache—memory used during LLM inference to store past tokens' key/value attention states so they are not recomputed at every decoding step; compressing it reduces memory use and speeds up long-sequence generation
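A toy single-head decoder step shows why the cache helps: each new token computes only its own key/value projection and reuses the cached rest (all names and shapes here are illustrative, not from any real inference engine):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8                                # model/head dimension (toy size)
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

K_cache, V_cache = [], []            # grows by one entry per generated token

def attend(x_t):
    """Attention for one new token, reusing cached keys/values."""
    q = x_t @ Wq
    K_cache.append(x_t @ Wk)         # only the NEW token's K/V are computed
    V_cache.append(x_t @ Wv)
    K, V = np.stack(K_cache), np.stack(V_cache)
    scores = K @ q / np.sqrt(d)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()         # softmax over all past positions
    return weights @ V

for _ in range(5):                   # 5 decoding steps: each is O(t), not O(t^2)
    out = attend(rng.normal(size=d))
print(len(K_cache))  # → 5
```

Without the cache, every step would reproject the entire prefix; KV-cache compression methods shrink these stored tensors to cut the memory cost of long reasoning traces.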
LoRA: Low-Rank Adaptation—a parameter-efficient fine-tuning technique that freezes the pretrained weights and trains small low-rank update matrices injected into selected layers
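A minimal NumPy sketch of a LoRA-adapted linear layer (dimensions, scaling, and initialization follow the usual convention, but this is an illustration, not a library implementation; in practice one would use an adapter library such as PEFT):

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, r = 16, 16, 4           # r << d_in: the low-rank bottleneck

W = rng.normal(size=(d_out, d_in))   # frozen pretrained weight (not trained)
A = rng.normal(size=(r, d_in)) * 0.01  # trainable down-projection
B = np.zeros((d_out, r))             # trainable up-projection; zero at init,
                                     # so the adapted layer starts identical
alpha = 8.0                          # LoRA scaling hyperparameter

def lora_forward(x):
    # Frozen path plus low-rank update path, scaled by alpha / r.
    return x @ W.T + (x @ A.T @ B.T) * (alpha / r)

x = rng.normal(size=(2, d_in))
# With B = 0 the adapter contributes nothing: output equals the frozen layer.
assert np.allclose(lora_forward(x), x @ W.T)
```

Only A and B (2 * r * d parameters per layer instead of d * d) receive gradients, which is what makes the method parameter-efficient.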