Long CoT: Long Chain-of-Thought—a reasoning format involving detailed, iterative steps, self-reflection, and verification, used by models like OpenAI-o1
GRPO: Group Relative Policy Optimization—an RL algorithm that normalizes rewards within a group of sampled outputs for the same input, eliminating the need for a separate value function critic
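The group-relative normalization that replaces a learned critic can be sketched in a few lines (a minimal illustration, not the paper's full objective, which also includes the policy-ratio and KL terms):

```python
import statistics

def grpo_advantages(rewards):
    """Group-relative advantages: normalize each sampled output's reward
    by the mean and std of its group (all outputs for the same input),
    standing in for a separate value-function baseline."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against uniform groups
    return [(r - mean) / std for r in rewards]

# Example: four sampled answers to one question, reward 1 if correct else 0
print(grpo_advantages([1.0, 0.0, 0.0, 1.0]))  # [1.0, -1.0, -1.0, 1.0]
```

Correct answers get positive advantage and incorrect ones negative, relative only to siblings from the same prompt.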
Ada-GRPO: Adaptive GRPO—the authors' proposed variant that adds a time-decaying diversity reward to prevent the model from converging to a single reasoning format
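One way such a time-decaying diversity reward could look is sketched below; the exact formula and names here (`adaptive_scale`, the exponential decay schedule) are illustrative assumptions, not the authors' specification:

```python
import math

def adaptive_scale(format_count, group_size, step, total_steps):
    """Hypothetical Ada-GRPO-style reward scaling: boost rewards for
    reasoning formats sampled rarely within the group, with the boost
    decaying over training so accuracy dominates later (assumed schedule)."""
    rarity_bonus = group_size / max(format_count, 1)  # rarer format -> larger bonus
    decay = math.exp(-step / max(total_steps, 1))     # bonus fades as training proceeds
    return 1.0 + (rarity_bonus - 1.0) * decay

# Early in training, a format sampled once in a group of 8 is strongly boosted
print(adaptive_scale(1, 8, step=0, total_steps=1000))  # 8.0
```

The point of the decay is that underused formats are kept alive early (countering format collapse) while late-stage training reverts toward the plain accuracy signal.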
Format Collapse: The phenomenon where an RL-trained model converges to using only the highest-accuracy format (usually Long CoT) for all tasks, losing the ability to use efficient formats
SFT: Supervised Fine-Tuning—training the model on labeled examples (here, questions paired with answers in specific formats) before RL
Code: A reasoning format that uses programming code (e.g., Python) to structure the problem-solving process