RLVR: Reinforcement Learning with Verifiable Rewards—using a binary correctness signal from an automated verifier (e.g., exact-match checking of the final answer) to train models via RL
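A minimal sketch of what such a binary reward can look like; `extract_answer` and the exact-match rule here are hypothetical stand-ins for whatever verifier a given setup uses:

```python
# Toy binary verifier reward, assuming exact-match answer checking.
# `extract_answer` is a hypothetical heuristic, not a real library call.
def extract_answer(response: str) -> str:
    """Pull the final token out of a model response as its 'answer'."""
    return response.strip().split()[-1]

def verifier_reward(response: str, reference: str) -> float:
    """Return 1.0 iff the extracted answer matches the reference, else 0.0."""
    return 1.0 if extract_answer(response) == reference else 0.0

print(verifier_reward("The answer is 42", "42"))  # 1.0
print(verifier_reward("The answer is 41", "42"))  # 0.0
```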
LRMs: Large Reasoning Models—LLMs specifically optimized to generate long reasoning trajectories (thoughts) before answering
GRPO: Group Relative Policy Optimization—an RL algorithm that normalizes advantages within a group of sampled responses to the same prompt, removing the need for a critic model
z-score: A statistical measure describing a value's relationship to the mean of a group, measured in terms of standard deviations
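The two entries above fit together: GRPO's advantage for each sampled response is (essentially) the z-score of its reward within the group. A minimal sketch, noting that real implementations differ in details such as population vs. sample standard deviation and the exact epsilon used:

```python
import statistics

def group_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """z-score each reward against its group's mean and std (GRPO-style).

    `rewards` holds one scalar reward per sampled response to the same
    prompt, so no learned critic is needed. `eps` guards against a
    zero-variance group (all responses rewarded equally).
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; some impls use sample std
    return [(r - mean) / (std + eps) for r in rewards]

# Four sampled responses, two correct under a binary verifier reward:
advs = group_advantages([1.0, 0.0, 1.0, 0.0])
# mean = 0.5, population std = 0.5, so advantages are roughly +1, -1, +1, -1
```

Correct responses get positive advantage and incorrect ones negative, so the policy gradient pushes probability mass toward the better responses in the group.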
Overthinking: The phenomenon where reasoning models generate excessively long, meandering, or repetitive thought processes that do not contribute to accuracy
SFT: Supervised Fine-Tuning—training a model on labeled input-output examples, typically as the stage preceding reinforcement learning
Test-time Scaling: The observation that allowing models to think longer (generate more tokens) during inference improves performance on complex tasks
REINFORCE: A basic policy gradient RL algorithm that updates model weights in the direction that increases the log-probability of sampled trajectories, weighted by their return (reward)
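The REINFORCE update can be sketched on a toy two-armed bandit with a softmax policy; this is an illustrative example (deterministic rewards, no baseline), not a production implementation:

```python
import math
import random

def reinforce_bandit(true_rewards, steps=2000, lr=0.1, seed=0):
    """REINFORCE on a 2-armed bandit: sample an action from a softmax
    policy over logits, then nudge logits along grad log pi(a) scaled
    by the observed reward. Real uses subtract a baseline to cut variance.
    """
    rng = random.Random(seed)
    logits = [0.0, 0.0]
    probs = [0.5, 0.5]
    for _ in range(steps):
        # Softmax policy (shifted by max for numerical stability).
        m = max(logits)
        exps = [math.exp(l - m) for l in logits]
        z = sum(exps)
        probs = [e / z for e in exps]
        # Sample an action and observe its reward.
        a = 0 if rng.random() < probs[0] else 1
        r = true_rewards[a]
        # Policy gradient: grad log pi(a) w.r.t. logits = one_hot(a) - probs.
        for i in range(2):
            grad = (1.0 if i == a else 0.0) - probs[i]
            logits[i] += lr * r * grad
    return probs

# Arm 0 pays 1.0, arm 1 pays 0.0; the policy should concentrate on arm 0.
probs = reinforce_bandit([1.0, 0.0])
```

Because the reward is 0 for arm 1, only arm-0 samples produce an update, steadily widening the logit gap in arm 0's favor; GRPO can be viewed as REINFORCE with a group-normalized reward in place of the raw return.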