AdaptThink: Reasoning Models Can Learn When to Think

📝 Paper Summary

Efficient Reasoning
Reasoning Models (Chain-of-Thought)

AdaptThink uses reinforcement learning to teach reasoning models to skip the lengthy thinking process for simple problems while retaining it for complex ones, optimizing both efficiency and accuracy.

Core Problem

Large reasoning models like DeepSeek-R1 apply lengthy chain-of-thought processes to every query, including simple ones where such overhead is unnecessary and degrades user experience.

Why it matters:

Thinking processes substantially increase inference overhead and latency, creating bottlenecks for real-time applications.
Simple queries (e.g., those solvable by standard LLMs) receive excessively detailed, redundant responses when forced through reasoning models.
Current efficiency methods (length penalties, merging) still force thinking on all instances rather than deciding *whether* to think.

Concrete Example: For a simple math problem like '2+2', a reasoning model might generate a long trace verifying the properties of addition before answering '4'. AdaptThink detects this simplicity and outputs '4' immediately, saving tokens.

Key Novelty

Difficulty-Adaptive Thinking Mode Selection via RL

Teaches the model to switch between 'Thinking' (long CoT) and 'NoThinking' (direct answer) based on problem difficulty.
Uses a constrained optimization objective that penalizes 'Thinking' unless it provides a significant accuracy gain over the reference model.
Introduces importance sampling during training to mix Thinking and NoThinking trajectories, overcoming the 'cold start' problem where models initially always think.

Architecture

Pseudocode for the AdaptThink RL algorithm, detailing the importance sampling and gradient update steps.

Evaluation Highlights

Reduces average response length of DeepSeek-R1-Distill-Qwen-1.5B by 53% on three math datasets while improving accuracy by 2.4%.
On GSM8K, reduces average response length by 50.9% while improving accuracy by 4.1% compared to the base model.
On MATH500, reduces average response length by 63.5% while improving accuracy by 1.4%.

Breakthrough Assessment

8/10

Effective, practical solution to the efficiency bottleneck of reasoning models. The 'NoThinking' simplification and adaptive switching mechanism show strong empirical gains in both speed and accuracy.

⚙️ Technical Details

Problem Definition

Setting: Reinforcement learning in which the policy learns to generate a response y given a prompt x, choosing between a long thinking trace and an immediate answer.

Inputs: Prompt x containing a problem statement and a special token <think>

Outputs: Response y, which starts with either a thinking process or the token </think> (indicating NoThinking mode)

Pipeline Flow

Input Prompt -> Model Policy
Mode Selection (Implicit): Model generates the first token
If the first token is </think> -> NoThinking Mode (Direct Answer)
If the first token starts a thought -> Thinking Mode (Chain of Thought -> Answer)
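
The branching above can be sketched as a small classifier over generated responses. This is only an illustration, assuming the response is available as a plain string and that a NoThinking response begins with the literal </think>; the helper names `detect_mode` and `final_answer` are hypothetical and do not come from the paper.

```python
END_THINK = "</think>"  # closing tag; emitted first in NoThinking mode


def detect_mode(response: str) -> str:
    """Classify a response by how it begins (hypothetical helper)."""
    # Ignoring leading whitespace is an assumption of this sketch.
    if response.lstrip().startswith(END_THINK):
        return "NoThinking"
    return "Thinking"


def final_answer(response: str) -> str:
    """Return the text after the first </think>, or '' if the tag is absent."""
    _, sep, tail = response.partition(END_THINK)
    return tail.strip() if sep else ""
```

In both modes the final answer is whatever follows </think>, so the same extraction step applies regardless of which branch was taken.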

System Modules

Policy Model

Generates the response token-by-token. The first token determines the thinking mode.

Model or implementation: DeepSeek-R1-Distill-Qwen (1.5B / 7B)

Novel Architectural Elements

Implicit mode selection via first-token generation: The decision to think or not is modeled as the probability of generating the </think> token immediately versus starting a thought trace.
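
One way to read this: the probability of NoThinking mode is simply the softmax mass that the policy places on </think> at the first generated position. The sketch below assumes the first-token logits are available as a token-to-logit mapping; the function name and the toy logits are hypothetical.

```python
import math


def no_thinking_probability(first_token_logits: dict[str, float]) -> float:
    """Softmax probability of '</think>' among first-token candidates."""
    # Subtract the maximum logit for numerical stability.
    max_logit = max(first_token_logits.values())
    exps = {tok: math.exp(logit - max_logit)
            for tok, logit in first_token_logits.items()}
    return exps.get("</think>", 0.0) / sum(exps.values())


# Toy logits (hypothetical): '</think>' is three times as likely as 'Alright'.
p = no_thinking_probability({"</think>": math.log(3.0), "Alright": 0.0})
```

With these toy logits `p` evaluates to 0.75; the Thinking probability is the remaining mass, 1 - p.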

Modeling

Base Model: DeepSeek-R1-Distill-Qwen-1.5B and DeepSeek-R1-Distill-Qwen-7B

Training Method: Reinforcement Learning (PPO-style)

Objective Functions:

Purpose: Maximize NoThinking probability while maintaining accuracy constraint.

Formally: Maximize E[indicator(NoThinking) * delta + Reward(x,y) - BaselineReward(x)]

Purpose: Importance Sampling correction to handle cold start.

Formally: Sample 50/50 from Thinking/NoThinking distributions, reweight gradients using importance weights pi_theta / pi_IS.
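
The importance sampling step can be sketched as follows. This is a simplified illustration rather than the paper's implementation: it assumes the sampling distribution forces the first token to be </think> or a thinking-start word with probability 0.5 each, and that only this forced token needs reweighting. The clipping range `eps = 0.2` and all function names are assumptions.

```python
import random

NO_THINK = "</think>"
W_START = "Alright"  # example thinking-start word


def sample_first_token(rng: random.Random) -> str:
    """Draw the forced first token: half NoThinking, half Thinking."""
    return NO_THINK if rng.random() < 0.5 else W_START


def first_token_weight(policy_prob: float, is_prob: float = 0.5) -> float:
    """Importance weight pi_theta / pi_IS for the forced first token."""
    return policy_prob / is_prob


def clipped_term(weight: float, advantage: float, eps: float = 0.2) -> float:
    """PPO-style clipped surrogate for one token (eps is an assumed value)."""
    clipped = min(max(weight, 1.0 - eps), 1.0 + eps)
    return min(weight * advantage, clipped * advantage)
```

If the initial policy assigns </think> a probability of only 0.01, pure on-policy sampling would almost never produce a NoThinking trajectory; under the forced 50/50 split such trajectories make up about half of each batch, and the weight 0.01 / 0.5 accounts for the mismatch between the two distributions.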

Adaptation: Full model update

Training Data:

Training data is not explicitly detailed in the available excerpt; it is presumably standard math reasoning data (e.g., GSM8K, MATH) used for RL.

Key Hyperparameters:

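delta (inverse of the penalty weight): Bonus applied to NoThinking responses in the objective; controls the trade-off between efficiency and accuracy, with larger values favoring NoThinking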
K (samples per prompt): Number of samples for reward estimation
w_start: Token used to force Thinking mode (e.g., 'Alright')
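
To make the roles of delta and K concrete, the sketch below evaluates the per-sample quantity inside the objective, indicator(NoThinking) * delta + Reward(x,y) - BaselineReward(x), with the baseline taken as the mean reward of K reference-model samples. The 0/1 rewards, K = 4, and delta = 0.05 are toy values chosen for illustration, not settings reported in the paper.

```python
from statistics import mean


def reference_baseline(ref_rewards: list[float]) -> float:
    """Mean reward of K reference-model samples for one prompt."""
    return mean(ref_rewards)


def advantage(is_no_thinking: bool, reward: float,
              baseline: float, delta: float) -> float:
    """indicator(NoThinking) * delta + Reward(x, y) - BaselineReward(x)."""
    return (delta if is_no_thinking else 0.0) + reward - baseline


# Toy prompt: the reference model solves it in 3 of K = 4 samples.
baseline = reference_baseline([1.0, 1.0, 0.0, 1.0])
adv_correct_no_think = advantage(True, 1.0, baseline, delta=0.05)
adv_correct_think = advantage(False, 1.0, baseline, delta=0.05)
adv_wrong_no_think = advantage(True, 0.0, baseline, delta=0.05)
```

A correct NoThinking response always scores delta higher than a correct Thinking response on the same prompt, so under this objective Thinking is favored only where it raises expected reward by more than delta.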

Compute: Not reported in the paper

Comparison to Prior Work

vs. Length-based rewards: AdaptThink makes a binary decision (Think/NoThink) based on difficulty rather than just shortening the trace.
vs. NoThinking (Ma et al.): AdaptThink learns to *select* the mode dynamically per problem, whereas Ma et al. apply it statically to all problems or via simple heuristics.
vs. Early Exit [not cited in paper]: AdaptThink decides at the *start* whether to think, avoiding partial computation of layers or steps.

Limitations

Reliance on the accuracy of the reference model for reward baselines.
Binary Thinking/NoThinking choice might be too coarse compared to variable-length thinking.
Requires ground truth or verifiable rewards (like math problems) for RL training.

Reproducibility

Code: https://github.com/THU-KEG/AdaptThink

Code and models are publicly available at the repository above. The paper details the algorithm and objectives clearly.

📊 Experiments & Results

Evaluation Setup

Mathematical reasoning tasks categorized by difficulty.

Benchmarks:

GSM8K (Grade School Math)
MATH500 (Challenging Math Problems)
AIME2024 (Math Competition)

Metrics:

Accuracy
Average Response Length (tokens)

Statistical methodology: Not explicitly reported in the paper
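
The two metrics can be computed from per-problem records as sketched below. The record layout (a correctness flag plus a token count) and the function names are assumptions. The relative-change helper is shown for response length only, since the summary does not state whether the accuracy deltas are absolute points or relative changes.

```python
def summarize(records: list[tuple[bool, int]]) -> tuple[float, float]:
    """Return (accuracy, average response length in tokens)."""
    if not records:
        raise ValueError("records must be non-empty")
    correct = sum(1 for is_correct, _ in records if is_correct)
    total_tokens = sum(n_tokens for _, n_tokens in records)
    return correct / len(records), total_tokens / len(records)


def relative_change(baseline: float, new: float) -> float:
    """Relative change in percent; negative means a reduction."""
    return 100.0 * (new - baseline) / baseline


# Toy records (hypothetical): (is_correct, response_length_in_tokens).
acc, avg_len = summarize([(True, 100), (False, 300)])
length_delta = relative_change(baseline=200.0, new=100.0)
```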

Key Results

AdaptThink significantly reduces token usage while improving accuracy across multiple math benchmarks using DeepSeek-R1-Distill-Qwen-1.5B.

Benchmark	Metric	Baseline	This Paper	Δ
GSM8K	Average Response Length	Not explicitly reported in the paper	Not explicitly reported in the paper	-50.9%
GSM8K	Accuracy	Not explicitly reported in the paper	Not explicitly reported in the paper	+4.1%
MATH500	Average Response Length	Not explicitly reported in the paper	Not explicitly reported in the paper	-63.5%
MATH500	Accuracy	Not explicitly reported in the paper	Not explicitly reported in the paper	+1.4%
AIME2024	Average Response Length	Not explicitly reported in the paper	Not explicitly reported in the paper	-44.7%
AIME2024	Accuracy	Not explicitly reported in the paper	Not explicitly reported in the paper	+1.6%

Experiment Figures

Pilot study comparing Thinking vs. NoThinking performance across problem difficulty levels on MATH500.

Main Takeaways

NoThinking (direct answering) outperforms Thinking (long CoT) on simple problems (Level 1-3 MATH500) in terms of efficiency and sometimes accuracy.
AdaptThink successfully learns to switch modes based on problem difficulty, using Thinking for hard problems and NoThinking for easy ones.
The method achieves a 'win-win' by drastically cutting inference costs (up to ~63% length reduction) while slightly boosting accuracy.
Importance sampling is critical for training to prevent the model from collapsing into the 'always think' mode (cold start problem).

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning (PPO)
Chain-of-Thought (CoT) Reasoning
Importance Sampling

Key Terms

Thinking: A mode where the model generates a long chain of thought (exploration, reflection) before the final answer.

NoThinking: A mode where the model skips the thinking process entirely, enforced by prompting with an empty thinking segment (<think></think>).

AdaptThink: The proposed RL algorithm that teaches models to adaptively select between Thinking and NoThinking modes.

PPO: Proximal Policy Optimization—an RL algorithm used here to update the model policy based on rewards.

Importance Sampling: A technique used to estimate properties of a distribution using samples from a different distribution; used here to balance Thinking/NoThinking samples during training.

Cold Start: The issue where the initial model always selects 'Thinking', preventing the RL agent from exploring 'NoThinking' without intervention.