Long Chain-of-Thought Compression via Fine-Grained Group Policy Optimization

Xinchen Han, Hossam Afifi, Michel Marot, Xilu Wang, Lu Yin
arXiv (2026)
RL Reasoning Benchmark

📝 Paper Summary

Chain-of-Thought (CoT) Reasoning · Model Compression / Efficiency · Reinforcement Learning (RL) · Post-training
FGO compresses Long Chain-of-Thought reasoning by subdividing each sampled response group into correct and incorrect subsets, then weighting responses within each subset by length and entropy, penalizing verbosity while maintaining accuracy.
Core Problem
Large Language Models often generate unnecessarily verbose Chain-of-Thought reasoning that increases computational costs without improving accuracy, and existing RL optimization methods like GRPO suffer from inefficient data utilization and entropy collapse.
Why it matters:
  • Reasoning ability does not scale linearly with length; excessively long CoT often degrades performance due to overthinking and redundant double-checking
  • Current compression methods either compromise logical consistency (token-level) or rely on expensive auxiliary models (instance-level)
  • Standard GRPO wastes training data when all group responses receive identical rewards (zero advantage) and tends to converge to identical responses (entropy collapse)
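The GRPO degenerate case in the last bullet is easy to see in code. Below is a minimal sketch (function name and normalization details are illustrative, not from the paper) of group-relative advantage estimation: when every response in a group receives the same reward, every advantage is zero and the whole group contributes no learning signal.

```python
# Illustrative sketch of GRPO-style group-relative advantages:
# advantage_i = (r_i - mean(r)) / (std(r) + eps).
import statistics

def grpo_advantages(rewards, eps=1e-8):
    """Return group-normalized advantages for one sampled response group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Mixed group: correct responses get positive, incorrect negative advantages.
print(grpo_advantages([1.0, 0.0, 1.0, 0.0]))
# Degenerate group (all correct or all wrong): every advantage is exactly 0,
# so the sampled data is wasted -- the inefficiency FGO targets.
print(grpo_advantages([1.0, 1.0, 1.0, 1.0]))  # -> [0.0, 0.0, 0.0, 0.0]
```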
Concrete Example: In a math problem from MATH500, a standard model (ZR1-1.5B) produces a long CoT by repeatedly double-checking its work but still reaches an incorrect answer. FGO produces a concise derivation that is both shorter and correct.
Key Novelty
Fine-grained Group policy Optimization (FGO)
  • Subdivides a group of sampled responses into 'correct' and 'incorrect' subgroups instead of treating them as a monolithic batch, remapping the incorrect reward from 0 to -1 so that wrong responses contribute a genuinely negative signal.
  • Applies granular weighting within subgroups: correct answers get higher weight if they are shorter and have lower entropy (higher confidence); incorrect answers get stronger negative weight if they are shorter and have higher entropy.
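The subgroup weighting described above can be sketched as follows. This is a toy functional form under assumed normalization (the paper's exact weighting formulas are not in the provided text): lengths and entropies are min-max normalized within the group, correct responses start from reward +1 and gain weight when shorter and lower-entropy, while incorrect responses start from the remapped reward -1 and are penalized harder when shorter and higher-entropy.

```python
def fgo_shaped_rewards(correct, lengths, entropies):
    """Toy FGO-style reward shaping for one group of sampled responses.

    correct   -- list[bool], whether each response's final answer is right
    lengths   -- list[int], token counts of each response
    entropies -- list[float], trajectory-level entropy of each response
    """
    lo_len, hi_len = min(lengths), max(lengths)
    lo_ent, hi_ent = min(entropies), max(entropies)

    def norm(x, lo, hi):
        # Min-max normalize within the group (eps avoids division by zero).
        return (x - lo) / (hi - lo + 1e-8)

    shaped = []
    for ok, length, ent in zip(correct, lengths, entropies):
        ln, hn = norm(length, lo_len, hi_len), norm(ent, lo_ent, hi_ent)
        if ok:
            # Correct subgroup: base +1, boosted when shorter (low ln)
            # and more confident (low hn).
            shaped.append(1.0 + (1.0 - ln) + (1.0 - hn))
        else:
            # Incorrect subgroup: reward remapped 0 -> -1; penalty grows
            # when the response is shorter and higher-entropy.
            shaped.append(-(1.0 + (1.0 - ln) + hn))
    return shaped
```

With this shaping, a short low-entropy correct answer receives the largest positive reward, and a short high-entropy wrong answer the largest penalty, which is the qualitative behavior the bullet describes.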
Evaluation Highlights
  • Achieves 100% data utilization during training by ensuring non-zero advantages via subgroup reward shaping, whereas GRPO suffers from degenerate cases where all group rewards are identical.
  • Effective mitigation of entropy collapse: trajectory-level entropy decreases more gradually and remains higher than baselines, preserving exploration.
  • Successfully compresses Chain-of-Thought length while maintaining or improving accuracy across MATH500, AIME24, AMC23, and Minerva benchmarks (exact numeric deltas not in provided text).
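The entropy-collapse metric in the second bullet can be made concrete with a small sketch (an assumed definition, since the provided text does not give the paper's exact formula): trajectory-level entropy here means the mean per-token entropy of the policy's next-token distribution along one sampled response.

```python
# Sketch of trajectory-level entropy as mean per-token Shannon entropy.
import math

def token_entropy(probs):
    """Shannon entropy (nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def trajectory_entropy(per_token_dists):
    """Mean token entropy over one sampled response trajectory."""
    return sum(token_entropy(p) for p in per_token_dists) / len(per_token_dists)

# A near-deterministic policy (entropy collapse) scores close to 0;
# a more exploratory policy stays higher, as the bullet above describes.
peaked = [[0.97, 0.01, 0.01, 0.01]] * 5
flat = [[0.25, 0.25, 0.25, 0.25]] * 5
print(trajectory_entropy(peaked) < trajectory_entropy(flat))  # True
```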
Breakthrough Assessment
7/10
Addresses specific, practical inefficiencies in both CoT length and the popular GRPO algorithm. While an incremental improvement on GRPO, the dual focus on efficiency and training stability is valuable.