AdaCtrl: Towards Adaptive and Controllable Reasoning via Difficulty-Aware Budgeting

📝 Paper Summary

Efficient Reasoning in LLMs, Adaptive Computation, Reinforcement Learning for Reasoning

AdaCtrl enables reasoning models to dynamically adjust their thinking time based on self-assessed problem difficulty and user-specified tags, balancing efficiency for simple tasks and depth for complex ones.

Core Problem

Large reasoning models (like o1 or DeepSeek R1) often "overthink" simple problems, wasting computation and increasing latency, while universal compression methods sacrifice performance on hard tasks.

Why it matters:

Unnecessarily long reasoning chains for trivial questions (e.g., log2(64)) increase latency and degrade the user experience.
Existing efficiency methods are either too rigid (universal shortening) or rely on fragile instruction-following without true difficulty awareness.
Users lack explicit control over the trade-off between speed and reasoning depth in current reasoning models.

Concrete Example: When asked a simple question like "Evaluate log2(64)", standard reasoning models might engage in lengthy planning and verification steps. AdaCtrl identifies this as an "[Easy]" problem and outputs a concise solution, saving tokens.

Key Novelty

Two-Stage Difficulty-Aware Training (AdaCtrl)

Implements a cold-start fine-tuning phase using a mixed dataset of concise solutions for easy problems and long reasoning traces for hard ones, teaching the model to predict difficulty tags.
Uses a difficulty-aware Reinforcement Learning (RL) framework where the model is rewarded for correctly estimating difficulty (calibration) and for adjusting response length dynamically based on that estimate.

Architecture

The overall training pipeline of AdaCtrl, illustrating the two stages: Cold-Start Fine-Tuning and Difficulty-Aware Reinforcement Learning.

Evaluation Highlights

Reduces response length by 91.04% on GSM8K (easy dataset) while maintaining or improving accuracy (+2.05%) compared to standard RL baselines using Qwen2.5-7B.
Achieves a 10.41-point accuracy improvement (50.42 to 60.83) on the challenging AIME2024 dataset with Qwen2.5-14B while simultaneously reducing token usage by 18.20%.
Outperforms standard RL baselines (SFT+RL) across four benchmarks (AIME24, AIME25, MATH500, GSM8K) in both accuracy and efficiency.

Breakthrough Assessment

8/10

Strong practical contribution addressing the 'overthinking' problem in reasoning models. The explicit user control via tags and the self-calibration reward mechanism are effective and well-motivated.

⚙️ Technical Details

Problem Definition

Setting: Generative reasoning where the model must predict both a difficulty tag t and a reasoning chain p followed by an answer y, maximizing utility (accuracy - length cost).

Inputs: Natural language query q

Outputs: Difficulty tag t (e.g., [Easy], [Hard]), reasoning process p, and final answer y

Pipeline Flow

Input Query -> Difficulty Estimation (internal) -> Tag Generation ([Easy]/[Hard]) -> Reasoning Generation (Concise or Detailed) -> Final Answer
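The flow above implies that each response begins with a difficulty tag, continues with the reasoning, and ends with a boxed answer. The following is a minimal sketch of how such a response might be split into its parts. The exact output format, the regular expressions, and the helper names (`parse_response`, `seed_with_tag`) are assumptions made for illustration and are not taken from the paper's code.

```python
import re
from typing import NamedTuple, Optional

# Assumed format: the response starts with "[Easy]" or "[Hard]" and the final
# answer appears inside \boxed{...}. Both patterns are illustrative guesses.
TAG_PATTERN = re.compile(r"^\s*\[(Easy|Hard)\]")
BOXED_PATTERN = re.compile(r"\\boxed\{([^{}]*)\}")


class ParsedResponse(NamedTuple):
    tag: Optional[str]      # "Easy", "Hard", or None if no tag was found
    reasoning: str          # everything after the tag
    answer: Optional[str]   # contents of the last \boxed{...}, if any


def parse_response(text: str) -> ParsedResponse:
    """Split a model response into difficulty tag, reasoning, and answer."""
    tag_match = TAG_PATTERN.match(text)
    tag = tag_match.group(1) if tag_match else None
    body = text[tag_match.end():] if tag_match else text
    boxed = BOXED_PATTERN.findall(body)
    answer = boxed[-1].strip() if boxed else None
    return ParsedResponse(tag, body.strip(), answer)


def seed_with_tag(tag: str) -> str:
    """Hypothetical user control: begin the response with a chosen tag."""
    if tag not in ("Easy", "Hard"):
        raise ValueError("tag must be 'Easy' or 'Hard'")
    return f"[{tag}]"
```

In this sketch, user control would amount to starting generation from `seed_with_tag("Easy")` or `seed_with_tag("Hard")` instead of letting the model choose the tag itself.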

System Modules

Policy Model

Generates difficulty tag, reasoning chain, and answer

Model or implementation: Qwen2.5-Instruct (7B/14B)

Novel Architectural Elements

Integration of explicit difficulty tags ([Easy]/[Hard]) as control tokens that trigger different reasoning behaviors (concise vs. elaborate) within a single model architecture.

Modeling

Base Model: Qwen2.5-7B-Instruct and Qwen2.5-14B-Instruct

Training Method: Group Relative Policy Optimization (GRPO)

Objective Functions:

Purpose: Reward correct final answers.
Formally: Binary reward based on the correctness of the boxed answer.

Purpose: Calibrate difficulty estimation.
Formally: Reward a match between the generated tag and the 'golden' difficulty (estimated from rollout accuracy frequency).

Purpose: Penalize unnecessary length for easy problems.
Formally: Cosine-based length penalty applied only when the generated tag is [Easy].
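To make the three objectives concrete, here is a rough sketch of how they might be combined into one scalar reward. The summary gives neither the exact cosine formula nor the way the terms are weighted, so the cosine shape, the additive combination, the use of `alpha` and `beta` as weights, and the `max_tokens` cap are all assumptions for illustration only.

```python
import math
from typing import Optional


def accuracy_reward(predicted: Optional[str], reference: str) -> float:
    """Binary reward: 1.0 if the boxed answer matches the reference."""
    if predicted is None:
        return 0.0
    return 1.0 if predicted.strip() == reference.strip() else 0.0


def calibration_reward(tag: Optional[str], golden_tag: str) -> float:
    """Binary reward: 1.0 if the generated tag matches the golden difficulty."""
    return 1.0 if tag == golden_tag else 0.0


def length_term(tag: Optional[str], num_tokens: int, max_tokens: int) -> float:
    """Cosine-shaped length term, active only for responses tagged Easy.

    Assumed shape: 1.0 for an empty response, decaying to 0.0 at max_tokens.
    Responses tagged Hard (or untagged) receive no length term.
    """
    if tag != "Easy":
        return 0.0
    frac = min(max(num_tokens, 0), max_tokens) / max_tokens
    return 0.5 * (1.0 + math.cos(math.pi * frac))


def total_reward(predicted: Optional[str], reference: str,
                 tag: Optional[str], golden_tag: str,
                 num_tokens: int, max_tokens: int,
                 alpha: float = 0.5, beta: float = 0.5) -> float:
    """Assumed additive combination of the three terms."""
    return (accuracy_reward(predicted, reference)
            + alpha * calibration_reward(tag, golden_tag)
            + beta * length_term(tag, num_tokens, max_tokens))
```

Because the length term is zero for anything not tagged [Easy], a response tagged [Hard] is neither rewarded nor penalized for its length in this sketch, which mirrors the selective behaviour described above.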

Adaptation: Full fine-tuning

Training Data:

Cold-Start SFT: 8K instances from DeepMATH (4K easy/short, 4K hard/long)
RL: 30K instances from DeepMATH (10K easy, 20K hard), distinct from SFT set

Key Hyperparameters:

learning_rate_sft: 1e-5
learning_rate_rl: 1e-6
batch_size_sft: 8
batch_size_rl: 256
micro_batch_size_rl: 32
grpo_group_size: 16
alpha: 0.5
beta: 0.5
delta_threshold: 0.625
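The summary does not spell out how `delta_threshold` is used. One plausible reading, sketched below, is that a problem's 'golden' difficulty is derived from the fraction of correct rollouts in its GRPO group, with the problem labelled Easy when that fraction reaches the threshold. The function name, the direction of the comparison, and the use of `>=` rather than `>` are assumptions.

```python
from typing import Sequence


def golden_difficulty(rollout_correct: Sequence[bool], delta: float = 0.625) -> str:
    """Label a problem from the accuracy of its sampled rollouts.

    Assumption: the problem counts as Easy when at least a fraction `delta`
    of the rollouts is correct, and as Hard otherwise.
    """
    if not rollout_correct:
        raise ValueError("at least one rollout is required")
    accuracy = sum(rollout_correct) / len(rollout_correct)
    return "Easy" if accuracy >= delta else "Hard"


# With a group size of 16, a threshold of 0.625 corresponds to 10 correct rollouts.
group = [True] * 10 + [False] * 6
label = golden_difficulty(group)
```

Under this reading, the golden label tracks what the current policy can actually solve rather than a fixed dataset annotation.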

Compute: NVIDIA H800 GPUs

Comparison to Prior Work

vs. R1-SFT-RL: AdaCtrl introduces difficulty-aware rewards and mixed data (short/long) to prevent universal length hacking.
vs. Universal Compression (e.g., TokenSkip [not cited in paper]): AdaCtrl selectively compresses only easy problems rather than enforcing conciseness globally.

Limitations

Binary difficulty classification (Easy/Hard) may be too coarse for some problems.
Relies on existing difficulty annotations (DeepMATH) for initial cold-start, which may not perfectly align with model capabilities.
Evaluation limited to mathematics domain.

Reproducibility

Code: https://github.com/JoeYing1019/AdaCtrl

Code will be released at https://github.com/JoeYing1019/AdaCtrl. Datasets used (DeepMATH, AIME, GSM8K, MATH500) are public. Exact training time not reported.

📊 Experiments & Results

Evaluation Setup

Mathematical reasoning across diverse difficulty levels.

Benchmarks:

AIME2024 (Challenging Math Olympiad)
AIME2025 (Challenging Math Olympiad)
MATH500 (Mixed difficulty math problems)
GSM8K (Grade-school math, easier)

Metrics:

Accuracy (Acc.)
Average Response Length (Len.) in tokens
Statistical methodology: Averages over 8 independent runs reported for AIME datasets.

Key Results

Comparison of AdaCtrl-7B against the primary RL baseline (R1-SFT-RL) shows simultaneous improvements in accuracy and reductions in length across all datasets.

Benchmark	Metric	Baseline	This Paper	Δ
AIME2025	Accuracy	46.67	48.34	+1.67
AIME2025	Length	9089	7986	-1103
MATH500	Length	6924	2628	-4296
GSM8K	Length	2914	261	-2653

Results on the larger 14B model show even stronger gains in accuracy on hard tasks.

Benchmark	Metric	Baseline	This Paper	Δ
AIME2024	Accuracy	50.42	60.83	+10.41
AIME2024	Length	13149	10756	-2393
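The percentage reductions quoted under Evaluation Highlights can be recovered from the Length rows of the table. The short check below uses only the figures reported above; the helper name `relative_reduction` is introduced here for illustration.

```python
def relative_reduction(baseline: float, new: float) -> float:
    """Percentage reduction of `new` relative to `baseline`."""
    return (baseline - new) / baseline * 100.0


# Length rows taken from the table: (baseline tokens, AdaCtrl tokens).
gsm8k_7b = relative_reduction(2914, 261)         # about 91.04 percent
aime24_14b = relative_reduction(13149, 10756)    # about 18.20 percent

print(f"GSM8K (7B):     {gsm8k_7b:.2f}% fewer tokens")
print(f"AIME2024 (14B): {aime24_14b:.2f}% fewer tokens")
```

Note that the accuracy column reports absolute differences in percentage points (for example 60.83 - 50.42 = 10.41), not relative changes.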

Main Takeaways

AdaCtrl successfully decouples reasoning length from performance, using short chains for easy problems (GSM8K) and long chains for hard ones (AIME), unlike baselines that tend to be verbose everywhere.
The Cold-Start Fine-Tuning phase is critical; models initialized with mixed short/long data (Cold-Start-SFT) manage budgets significantly better than those trained only on long traces (R1-SFT).
Difficulty-aware RL further refines the budget allocation, leading to higher accuracy than SFT alone by allowing the model to self-correct its difficulty assessment.

📚 Prerequisite Knowledge

Prerequisites

Chain-of-Thought (CoT) reasoning
Reinforcement Learning from Human Feedback (RLHF)
Proximal Policy Optimization (PPO) or GRPO

Key Terms

GRPO: Group Relative Policy Optimization—a reinforcement learning algorithm that normalizes rewards within a group of outputs generated from the same prompt to reduce variance.
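As a rough illustration of the group normalization described in this definition, the sketch below turns the rewards of one prompt's sampled responses into advantages by subtracting the group mean and dividing by the group standard deviation. The use of the population standard deviation and the small `eps` constant are assumptions; implementations differ on both.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-6) -> List[float]:
    """Normalize rewards within one group of responses to the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation (an assumption)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four responses to one prompt: two correct (reward 1.0) and two incorrect (0.0).
advantages = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
```

Responses that beat the group average receive positive advantages and the rest receive negative ones. When every response in a group earns the same reward, all advantages are zero and the group contributes no learning signal.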

Chain-of-Thought: A prompting or training technique where models generate intermediate reasoning steps before the final answer.

Cold-Start Fine-Tuning: An initial supervised training phase used to instill basic capabilities (here, format adherence and basic difficulty estimation) before reinforcement learning.

Difficulty-Aware Tags: Explicit tokens (e.g., [Easy], [Hard]) generated by the model to signal its assessment of problem complexity.

Overthinking: The phenomenon where reasoning models generate unnecessarily complex and lengthy reasoning paths for simple problems.

SFT: Supervised Fine-Tuning—training a model on labeled examples (input-output pairs).

DeepMATH: A mathematics dataset with annotated difficulty levels, used here as the source of the cold-start SFT and RL training data.