Long Chain-of-Thought Compression via Fine-Grained Group Policy Optimization

Xinchen Han, Hossam Afifi, Michel Marot, Xilu Wang, Lu Yin
arXiv (2026)
RL Reasoning Benchmark

📝 Paper Summary

Chain-of-Thought (CoT) Reasoning · Model Compression / Efficiency · Reinforcement Learning (RL) · Post-training
FGO compresses Long Chain-of-Thought reasoning by subdividing each sampled response group into correct and incorrect subsets, then weighting responses within each subset by length and entropy, penalizing verbosity while maintaining accuracy.
Core Problem
Large Language Models often generate unnecessarily verbose Chain-of-Thought reasoning that increases computational costs without improving accuracy, and existing RL optimization methods like GRPO suffer from inefficient data utilization and entropy collapse.
Why it matters:
  • Reasoning ability does not scale linearly with length; excessively long CoT often degrades performance due to overthinking and redundant double-checking
  • Current compression methods either compromise logical consistency (token-level) or rely on expensive auxiliary models (instance-level)
  • Standard GRPO wastes training data when all group responses receive identical rewards (zero advantage) and tends to converge to identical responses (entropy collapse)
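The GRPO degenerate case in the last bullet is easy to see in code. Below is a minimal sketch (function name and normalization details are illustrative, not from the paper) of group-relative advantage estimation: when every response in a group receives the same reward, every advantage is zero and the whole group contributes no learning signal.

```python
# Illustrative sketch of GRPO-style group-relative advantages:
# advantage_i = (r_i - mean(r)) / (std(r) + eps).
import statistics

def grpo_advantages(rewards, eps=1e-8):
    """Return group-normalized advantages for one sampled response group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Mixed group: correct responses get positive, incorrect negative advantages.
print(grpo_advantages([1.0, 0.0, 1.0, 0.0]))
# Degenerate group (all correct or all wrong): every advantage is exactly 0,
# so the sampled data is wasted -- the inefficiency FGO targets.
print(grpo_advantages([1.0, 1.0, 1.0, 1.0]))  # -> [0.0, 0.0, 0.0, 0.0]
```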
Concrete Example: In a math problem from MATH500, a standard model (ZR1-1.5B) produces a long CoT by repeatedly double-checking its work but still reaches an incorrect answer. FGO produces a concise derivation that is both shorter and correct.
Key Novelty
Fine-grained Group policy Optimization (FGO)
  • Subdivides a group of sampled responses into 'correct' and 'incorrect' subgroups instead of treating them as a monolithic batch, remapping the incorrect reward from 0 to -1 so that wrong responses contribute a genuinely negative signal.
  • Applies granular weighting within subgroups: correct answers get higher weight if they are shorter and have lower entropy (higher confidence); incorrect answers get stronger negative weight if they are shorter and have higher entropy.
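The subgroup weighting described above can be sketched as follows. This is a toy functional form under assumed normalization (the paper's exact weighting formulas are not in the provided text): lengths and entropies are min-max normalized within the group, correct responses start from reward +1 and gain weight when shorter and lower-entropy, while incorrect responses start from the remapped reward -1 and are penalized harder when shorter and higher-entropy.

```python
def fgo_shaped_rewards(correct, lengths, entropies):
    """Toy FGO-style reward shaping for one group of sampled responses.

    correct   -- list[bool], whether each response's final answer is right
    lengths   -- list[int], token counts of each response
    entropies -- list[float], trajectory-level entropy of each response
    """
    lo_len, hi_len = min(lengths), max(lengths)
    lo_ent, hi_ent = min(entropies), max(entropies)

    def norm(x, lo, hi):
        # Min-max normalize within the group (eps avoids division by zero).
        return (x - lo) / (hi - lo + 1e-8)

    shaped = []
    for ok, length, ent in zip(correct, lengths, entropies):
        ln, hn = norm(length, lo_len, hi_len), norm(ent, lo_ent, hi_ent)
        if ok:
            # Correct subgroup: base +1, boosted when shorter (low ln)
            # and more confident (low hn).
            shaped.append(1.0 + (1.0 - ln) + (1.0 - hn))
        else:
            # Incorrect subgroup: reward remapped 0 -> -1; penalty grows
            # when the response is shorter and higher-entropy.
            shaped.append(-(1.0 + (1.0 - ln) + hn))
    return shaped
```

With this shaping, a short low-entropy correct answer receives the largest positive reward, and a short high-entropy wrong answer the largest penalty, which is the qualitative behavior the bullet describes.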
Evaluation Highlights
  • Achieves 100% data utilization during training by ensuring non-zero advantages via subgroup reward shaping, whereas GRPO suffers from degenerate cases where all group rewards are identical.
  • Effective mitigation of entropy collapse: trajectory-level entropy decreases more gradually and remains higher than baselines, preserving exploration.
  • Successfully compresses Chain-of-Thought length while maintaining or improving accuracy across MATH500, AIME24, AMC23, and Minerva benchmarks (exact numeric deltas not in provided text).
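The entropy-collapse metric in the second bullet can be made concrete with a small sketch (an assumed definition, since the provided text does not give the paper's exact formula): trajectory-level entropy here means the mean per-token entropy of the policy's next-token distribution along one sampled response.

```python
# Sketch of trajectory-level entropy as mean per-token Shannon entropy.
import math

def token_entropy(probs):
    """Shannon entropy (nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def trajectory_entropy(per_token_dists):
    """Mean token entropy over one sampled response trajectory."""
    return sum(token_entropy(p) for p in per_token_dists) / len(per_token_dists)

# A near-deterministic policy (entropy collapse) scores close to 0;
# a more exploratory policy stays higher, as the bullet above describes.
peaked = [[0.97, 0.01, 0.01, 0.01]] * 5
flat = [[0.25, 0.25, 0.25, 0.25]] * 5
print(trajectory_entropy(peaked) < trajectory_entropy(flat))  # True
```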
Breakthrough Assessment
7/10
Addresses specific, practical inefficiencies in both CoT length and the popular GRPO algorithm. While an incremental improvement on GRPO, the dual focus on efficiency and training stability is valuable.