Making Slow Thinking Faster: Compressing LLM Chain-of-Thought via Step Entropy

📝 Paper Summary

Chain-of-Thought (CoT) Compression, Efficient LLM Inference, Reasoning Models

This paper proposes identifying redundant reasoning steps using low step entropy and training models to autonomously skip them via a two-stage process combining supervised fine-tuning and reinforcement learning.

Core Problem

Chain-of-Thought reasoning generates verbose, redundant steps that increase computational cost and latency without adding informational value to the final answer.

Why it matters:

Verbose reasoning paths increase inference latency and computational costs, creating bottlenecks for scalable deployment of large reasoning models
Current compression methods (token pruning, latent reasoning) often lack a principled way to identify entire semantic steps that are superfluous vs. crucial
Overthinking can paradoxically diminish efficiency without proportional accuracy gains

Concrete Example: In a math problem, a model might generate deterministic, obvious intermediate algebraic manipulations that it is highly confident about (low entropy). Existing methods might keep these or prune random tokens, whereas this method identifies the entire low-entropy step as redundant and replaces it with a [SKIP] token.

Key Novelty

Step Entropy-based CoT Compression

Introduces 'step entropy' to quantify the information content of a reasoning step; low entropy implies the step is predictable and redundant
Proposes a pruning strategy that removes up to 80% of low-entropy steps while maintaining accuracy, unlike random or high-entropy pruning
Develops a two-stage training method (SFT + GRPO) where the model learns to autonomously output [SKIP] tokens for redundant steps
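
One way to write the step entropy idea above in symbols is sketched here. The notation is an assumption made for illustration and is not taken from the paper: S_i is the i-th reasoning step, t_{i,j} its j-th token, c_{i,j} the context preceding that token, and V the vocabulary. The step entropy is taken as the sum of token-level entropies, as in the Key Terms entry of this summary.

```latex
H(t_{i,j} \mid c_{i,j}) = -\sum_{w \in \mathcal{V}} p(w \mid c_{i,j}) \, \log p(w \mid c_{i,j}),
\qquad
H(S_i) = \sum_{j=1}^{|S_i|} H(t_{i,j} \mid c_{i,j})
```

Under this reading, a step made only of near-deterministic tokens has H(S_i) close to zero, which is what marks it as a candidate for [SKIP].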

Architecture

The CoT compression pipeline. (a) Process of calculating step entropy and replacing low-entropy steps with [SKIP]. (b) Inference process where the model uses the compressed CoT context.

Evaluation Highlights

Pruning 80% of low-entropy steps reduces tokens by 16-45% across benchmarks while maintaining accuracy on DeepSeek-R1-Distill-Qwen-7B
Trained models achieve 35-57% token reduction with autonomous compression while preserving or slightly improving accuracy
Outperforms random pruning and high-entropy pruning, which cause immediate performance degradation even at low pruning rates

Breakthrough Assessment

8/10

Offers a theoretically grounded metric (step entropy) for redundancy and successfully trains models to autonomously skip steps, achieving significant efficiency gains without accuracy loss.

⚙️ Technical Details

Problem Definition

Setting: Compressing Chain-of-Thought (CoT) sequences generated by Large Reasoning Models (LRM) while preserving final answer accuracy

Inputs: Problem instance x

Outputs: Compressed reasoning chain C' (containing [SKIP] tokens) and final answer A

Pipeline Flow

Full CoT Generation (Teacher)
Step Entropy Calculation
Pruning & [SKIP] Insertion
Training (Student)

System Modules

Teacher Model

Generate full, uncompressed CoT reasoning traces

Model or implementation: DeepSeek-R1-Distill-Qwen-7B

Entropy Calculator

Compute length-normalized step entropy for every step in the CoT

Model or implementation: Algorithm (Equation 5)

Student Model

Generate compressed CoT with [SKIP] tokens and final answer

Model or implementation: DeepSeek-R1-Distill-Qwen-7B (Fine-tuned)
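
The Entropy Calculator module above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes the full next-token probability distribution is available for every token of a step, and the function names (token_entropy, step_entropy) and the normalize flag are hypothetical. The flag exists because this summary describes the metric both as a sum of token entropies and as length-normalized.

```python
import math
from typing import List, Sequence


def token_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of a single next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def step_entropy(step_dists: List[Sequence[float]], normalize: bool = False) -> float:
    """Sum token entropies over one reasoning step.

    With normalize=True the sum is divided by the number of tokens
    (an assumed form of the length-normalized variant).
    """
    total = sum(token_entropy(d) for d in step_dists)
    if normalize and step_dists:
        return total / len(step_dists)
    return total


# Toy distributions over a 3-token vocabulary (made-up numbers).
confident_step = [[0.98, 0.01, 0.01], [0.97, 0.02, 0.01]]
uncertain_step = [[0.40, 0.30, 0.30], [0.34, 0.33, 0.33]]

print(step_entropy(confident_step) < step_entropy(uncertain_step))  # True
```

In practice the distributions would come from the model's logits at each generated position; only the toy numbers above are invented.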

Novel Architectural Elements

Integration of Step Entropy metric into the data preprocessing pipeline to systematically identify redundant steps
Autonomous compression mechanism where the model learns to output a special [SKIP] token instead of the full text for low-entropy steps

Modeling

Base Model: DeepSeek-R1-Distill-Qwen-7B, DeepSeek-R1-Distill-Llama-8B, Qwen2.5-Math-7B (various baselines used)

Training Method: Two-stage training: Supervised Fine-Tuning (SFT) followed by Group Relative Policy Optimization (GRPO)

Objective Functions:

Purpose: Initialize the model to mimic entropy-pruned compressed traces (SFT stage).

Formally: Standard cross-entropy loss on (problem, compressed CoT, answer) triples.

Purpose: Optimize accuracy and efficiency during RL (GRPO stage).

Formally: Maximize the reward R(C) = R_correctness + R_skip_ratio + R_sn + R_rl, whose four terms are listed below.

Purpose: Reward correct answers (R_correctness).

Formally: Large positive reward if the extracted answer matches the ground truth.

Purpose: Encourage skipping steps (R_skip_ratio).

Formally: Tiered reward based on the ratio N_skip / N_steps.

Purpose: Penalize excessive skipping, a degenerate behavior (R_sn).

Formally: -1.0 if the [SKIP] count exceeds a threshold.

Purpose: Penalize excessive length (R_rl).

Formally: -1.0 if the response length exceeds a threshold.
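
The composite reward above could be assembled roughly as in the sketch below. Only the overall shape (a correctness term, a tiered skip-ratio term, and two -1.0 penalties) comes from this summary. The concrete values, the tier boundaries, the thresholds, and the names RewardConfig, skip_ratio_reward, and total_reward are all hypothetical placeholders.

```python
from dataclasses import dataclass


@dataclass
class RewardConfig:
    correct_reward: float = 2.0  # hypothetical "large positive" value
    max_skips: int = 20          # hypothetical [SKIP]-count threshold
    max_length: int = 2048       # hypothetical response-length threshold (tokens)


def skip_ratio_reward(n_skip: int, n_steps: int) -> float:
    """Tiered bonus on N_skip / N_steps; the tiers here are invented."""
    if n_steps == 0:
        return 0.0
    ratio = n_skip / n_steps
    if ratio >= 0.5:
        return 1.0
    if ratio >= 0.2:
        return 0.5
    return 0.0


def total_reward(is_correct: bool, n_skip: int, n_steps: int,
                 length: int, cfg: RewardConfig) -> float:
    """R = R_correctness + R_skip_ratio + R_sn + R_rl."""
    r_correctness = cfg.correct_reward if is_correct else 0.0
    r_skip_ratio = skip_ratio_reward(n_skip, n_steps)
    r_sn = -1.0 if n_skip > cfg.max_skips else 0.0   # too many [SKIP] tokens
    r_rl = -1.0 if length > cfg.max_length else 0.0  # response too long
    return r_correctness + r_skip_ratio + r_sn + r_rl


cfg = RewardConfig()
print(total_reward(True, n_skip=5, n_steps=10, length=500, cfg=cfg))     # 3.0
print(total_reward(False, n_skip=30, n_steps=40, length=3000, cfg=cfg))  # -1.0
```

The second call shows why the two penalties matter: a response that skips aggressively but is wrong, over-long, and over the skip limit ends up with a negative total despite earning the skip-ratio bonus.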

Training Data:

Generated using DeepSeek-R1-Distill-Qwen-7B on math benchmarks
Compressed by pruning low-entropy steps (ratio kappa) and replacing with [SKIP]
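
The compression step described above can be illustrated with a short sketch. It assumes that kappa is the fraction of all steps to prune, that the lowest-entropy steps are pruned first, and that each pruned step is replaced by its own [SKIP] marker. Whether the paper merges consecutive markers is not stated in this summary, so the sketch does not merge them. The function name prune_low_entropy_steps and the toy entropies are hypothetical.

```python
import math
from typing import List

SKIP = "[SKIP]"


def prune_low_entropy_steps(steps: List[str], entropies: List[float],
                            kappa: float) -> List[str]:
    """Replace the kappa fraction of lowest-entropy steps with [SKIP]."""
    if len(steps) != len(entropies):
        raise ValueError("steps and entropies must have the same length")
    if not 0.0 <= kappa <= 1.0:
        raise ValueError("kappa must lie in [0, 1]")
    n_prune = math.floor(kappa * len(steps))
    # Indices ordered from lowest to highest step entropy.
    order = sorted(range(len(steps)), key=lambda i: entropies[i])
    to_prune = set(order[:n_prune])
    # Keep the original step order; only the content of pruned steps changes.
    return [SKIP if i in to_prune else s for i, s in enumerate(steps)]


steps = [
    "Let x be the number of apples.",
    "The condition gives 2x + 3 = 11.",
    "Subtract 3 from both sides: 2x = 8.",
    "Divide by 2: x = 4.",
]
entropies = [2.1, 3.4, 0.3, 0.2]  # made-up step entropies
compressed = prune_low_entropy_steps(steps, entropies, kappa=0.5)
print(compressed)
```

With kappa = 0.5 the two routine algebra steps are replaced, while the two higher-entropy steps that set up the problem are kept verbatim.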

Compute: Not reported in the paper

Comparison to Prior Work

vs. Token-level pruning: Operates at the semantic step level rather than the token level, which better preserves coherence [cited in paper]
vs. Latent Reasoning: Maintains an explicit (though compressed) trace with [SKIP] tokens, preserving interpretability better than fully latent methods [cited in paper]
vs. R1-Compress: Uses information-theoretic step entropy rather than search or heuristic chunking [cited in paper]

Limitations

Step entropy calculation requires access to token probabilities, which may not be available for all API-based models
Pruning more than 80% of low-entropy steps degrades performance, which suggests an upper limit on how much of the chain is redundant
Relies on the quality of the initial teacher CoT: errors or noise in the teacher traces carry over into the compressed training data

Reproducibility

Code: https://github.com/staymylove/COT_Compresstion_via_Step_entropy

Code and data are released in the repository above. Pruning-ratio hyperparameters are determined empirically.

📊 Experiments & Results

Evaluation Setup

Mathematical reasoning tasks using CoT generation

Benchmarks:

GSM8K (Grade school math word problems)
MATH (High school math problems)
AIME (Advanced math competitions)

Metrics:

Accuracy (Pass@1)
Average Token Count
Compression Ratio

Statistical methodology: Not explicitly reported in the paper

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Pruning Analysis: Experiments showing the impact of pruning low-entropy vs. high-entropy steps on DeepSeek-R1-Distill-Qwen-7B.
GSM8K	Accuracy	79.2	79.2	0.0
GSM8K	Accuracy	79.2	20.0	-59.2
GSM8K	Accuracy	79.2	55.0	-24.2
Training Results: Performance of the model trained (SFT+GRPO) to autonomously compress CoT.
GSM8K	Token Count	709	395	-314
MATH	Token Count	996	428	-568
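
As a quick arithmetic check, the token counts in the training-results rows above can be converted into relative reductions. The numbers are copied from the table; the helper name token_reduction_pct is just illustrative.

```python
def token_reduction_pct(baseline: int, compressed: int) -> float:
    """Relative token reduction, in percent of the baseline count."""
    return 100.0 * (baseline - compressed) / baseline


# (baseline tokens, compressed tokens) from the table above
rows = {"GSM8K": (709, 395), "MATH": (996, 428)}

for name, (baseline, compressed_count) in rows.items():
    pct = token_reduction_pct(baseline, compressed_count)
    print(f"{name}: {pct:.1f}% fewer tokens")
# GSM8K: 44.3% fewer tokens
# MATH: 57.0% fewer tokens
```

Both values fall inside the 35-57% range quoted for the trained models, with MATH at the upper end.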

Experiment Figures

Accuracy vs. Pruning Ratio for Low Entropy, High Entropy, and Random strategies on GSM8K.

Main Takeaways

Steps with low entropy are highly redundant; up to 80% can be removed with negligible accuracy loss.
High-entropy steps are critical for reasoning; removing them causes immediate accuracy collapse.
Models can be trained to autonomously predict when to [SKIP] steps, achieving significant token reduction (35-57%) while maintaining accuracy.
Step-level pruning is superior to token-level pruning and random step pruning.

📚 Prerequisite Knowledge

Prerequisites

Chain-of-Thought (CoT) prompting
Information Entropy (Shannon Entropy)
Reinforcement Learning (PPO/GRPO)
Autoregressive generation

Key Terms

Step Entropy: The sum of token-level entropies within a reasoning step, representing the model's uncertainty during generation; low step entropy implies redundancy

GRPO: Group Relative Policy Optimization—a reinforcement learning algorithm that optimizes a policy by comparing a group of outputs for the same input, often used to improve reasoning capabilities

CoT: Chain-of-Thought—a prompting technique where models generate intermediate reasoning steps before the final answer

SFT: Supervised Fine-Tuning—training a model on labeled examples (here, compressed CoT traces) to initialize behavior

[SKIP] token: A special token used to replace a redundant reasoning step in the compressed trajectory

DeepSeek-R1: A series of Large Reasoning Models (LRMs) known for generating detailed 'slow thinking' reasoning chains
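
To make the GRPO entry above more concrete, the sketch below shows the group-relative advantage at the core of the algorithm: several responses are sampled for the same problem, and each response's reward is standardized against the group's mean and standard deviation. This is a simplified illustration, not the training code used in the paper. The function name, the use of the population standard deviation, and the eps constant are assumptions.

```python
import statistics
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Standardize each reward against its own sampling group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; eps guards against zero
    return [(r - mean) / (std + eps) for r in rewards]


# Rewards for four sampled responses to one problem (made-up values).
advantages = group_relative_advantages([3.0, 1.0, 1.0, -1.0])
print(advantages)
```

Responses that beat the group average get a positive advantage and are reinforced, while below-average ones are discouraged, so no separate value network is needed to provide a baseline.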