Reasoning Models Can be Accurately Pruned Via Chain-of-Thought Reconstruction

📝 Paper Summary

Model Compression · Reasoning Models

Reasoning-Aware Compression improves pruning accuracy by calibrating on self-generated chain-of-thought traces, preventing the performance collapse and 'rambling' behavior seen with standard prompt-based calibration.

Core Problem

Standard pruning methods calibrate on short input prompts and therefore fail to capture the activation distribution of reasoning models, which is dominated by long, self-generated chains of thought.

Why it matters:

Pruned reasoning models suffer disproportionate accuracy drops compared to standard language models when using generic calibration (e.g., C4)
Ineffective pruning causes models to 'ramble'—generating longer, less accurate chains of thought—which paradoxically increases inference latency despite the theoretical speedup of compression
Deploying large reasoning models like DeepSeek-R1 is resource-intensive; effective compression is critical for real-world usage

Concrete Example: When pruned to 50% sparsity using standard C4 calibration, the DeepSeek-R1-Distill-Qwen-7B model drops from 92.8% to 74.4% accuracy on MATH-500 and generates significantly longer, incoherent reasoning traces, slowing down inference.

Key Novelty

Reasoning-Aware Compression (RAC)

Augment the calibration dataset used for pruning by including 'on-policy' chain-of-thought traces generated by the model itself
Align the pruning objective with the inference-time distribution, where activations are driven primarily by reasoning tokens rather than input prompts

Architecture

The calibration data collection process for RAC

Evaluation Highlights

+15.6 percentage points of accuracy on MATH-500 for DeepSeek-R1-Distill-Qwen-7B at 50% sparsity compared to standard C4 calibration (90.0% vs 74.4%)
+30.8 percentage points of accuracy on MATH-500 for the 1.5B model at 50% sparsity (66.4% vs 35.6%), recovering most of the dense model's performance
Eliminates inference slowdown: RAC-pruned models maintain decoding lengths similar to dense models, avoiding the 'rambling' pathology of standard pruning
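
The slowdown mentioned above comes down to simple arithmetic: decode time is roughly the number of generated tokens divided by decoding throughput, so a longer trace can cancel a per-token speedup. The sketch below illustrates this; the token counts and throughput figures are hypothetical and are not measurements from the paper.

```python
def decode_seconds(num_tokens: int, tokens_per_second: float) -> float:
    """Approximate decode time, ignoring prefill and batching effects."""
    return num_tokens / tokens_per_second


# Hypothetical numbers, chosen only to illustrate the trade-off.
dense = decode_seconds(4000, 100.0)            # dense model, normal-length trace
pruned_rambling = decode_seconds(7000, 130.0)  # faster per token, much longer trace
pruned_concise = decode_seconds(4000, 130.0)   # faster per token, same-length trace

print(f"dense: {dense:.1f}s, rambling: {pruned_rambling:.1f}s, concise: {pruned_concise:.1f}s")
```

Under these assumed numbers, the pruned model only ends up faster than the dense one when its trace length stays close to the dense model's.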

Breakthrough Assessment

8/10

Identifies a critical, counter-intuitive failure mode in pruning reasoning models (rambling) and provides a simple, highly effective fix that aligns calibration with the specific nature of reasoning tasks.

⚙️ Technical Details

Problem Definition

Setting: Post-training compression (pruning) of autoregressive reasoning Large Language Models

Inputs: Dense model weights and a set of calibration prompts x

Outputs: Sparse model weights preserving reasoning accuracy

Pipeline Flow

Prompt Input → Dense Model Rollout (Generate CoT) → Activation Collection (Prompt + CoT) → SparseGPT Optimization → Pruned Model Inference

System Modules

Dense Model Rollout

Generate on-policy chain-of-thought traces to simulate inference-time activation distributions

Model or implementation: DeepSeek-R1-Distill-Qwen (Dense)

SparseGPT Optimizer

Compute sparse weight masks and update remaining weights to minimize reconstruction error

Model or implementation: Layer-wise solver
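
The reconstruction objective can be written out for a single linear layer with dense weights W and calibration activations X. This is the usual layer-wise formulation associated with SparseGPT-style pruning; splitting X into prompt and CoT columns is notation introduced here for illustration and is not taken from the paper.

```latex
\hat{W} \;=\; \arg\min_{\hat{W}} \; \lVert W X - \hat{W} X \rVert_F^2
\quad \text{s.t. } \hat{W} \text{ meets the target sparsity},
\qquad
X = \bigl[\, X_{\text{prompt}} \;\; X_{\text{CoT}} \,\bigr]

H \;=\; X X^{\top} \;=\; X_{\text{prompt}} X_{\text{prompt}}^{\top} + X_{\text{CoT}} X_{\text{CoT}}^{\top}
```

Because each token contributes one column to X, the second-moment matrix H that drives the solver is dominated by whichever part of the sequence supplies the most tokens.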

Reasoning Inference

Execute task using compressed weights

Model or implementation: Pruned DeepSeek-R1-Distill-Qwen

Novel Architectural Elements

Integration of decode-time activations (CoT) into the calibration matrix for post-training pruning, specifically targeting the prompt-length vs. decode-length imbalance in reasoning models
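
The prompt-length versus decode-length imbalance can be made concrete with a small NumPy sketch. It builds the second-moment matrix that layer-wise solvers rely on, once from prompt activations alone and once from prompt plus CoT activations. The activation shapes and random values are hypothetical stand-ins, not data from the paper.

```python
import numpy as np


def second_moment(acts: np.ndarray) -> np.ndarray:
    """Unnormalized second-moment matrix X^T X for activations shaped (tokens, features)."""
    return acts.T @ acts


rng = np.random.default_rng(0)
prompt_acts = rng.normal(size=(128, 16))   # hypothetical: 128 prompt tokens
cot_acts = rng.normal(size=(4096, 16))     # hypothetical: 4096 decode-time tokens

h_prompt_only = second_moment(prompt_acts)
h_rac = second_moment(np.concatenate([prompt_acts, cot_acts], axis=0))

# Fraction of calibration tokens that come from the chain of thought.
cot_share = cot_acts.shape[0] / (prompt_acts.shape[0] + cot_acts.shape[0])
print(f"share of calibration tokens from CoT: {cot_share:.2%}")
```

Stacking tokens row-wise makes the combined matrix the sum of the prompt and CoT contributions, so calibrating on prompts alone leaves out the larger term in this toy setup.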

Modeling

Base Model: DeepSeek-R1-Distill-Qwen (1.5B, 7B, 14B, 32B variants)

Training Method: Post-training Pruning (No fine-tuning)

Training Data:

Calibration data: 1M tokens
Source 1: Standard C4 (Baseline)
Source 2: Task prompts only (Baseline)
Source 3 (RAC): Task prompts + On-policy CoT traces (up to 8192 tokens/prompt)
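
A minimal sketch of how the RAC calibration set might be assembled from prompts and on-policy traces under a fixed token budget. The names `build_rac_calibration`, `rollout`, and `toy_rollout` are hypothetical and not from the paper's code; the sketch also assumes the 8192-token cap applies to the generated trace rather than to prompt plus trace.

```python
from typing import Callable, List

Tokens = List[int]


def build_rac_calibration(
    prompts: List[Tokens],
    rollout: Callable[[Tokens, int], Tokens],  # hypothetical: dense model generation
    max_cot_tokens: int = 8192,
    token_budget: int = 1_000_000,
) -> List[Tokens]:
    """Collect prompt + self-generated CoT sequences until the token budget is used up."""
    samples: List[Tokens] = []
    used = 0
    for prompt in prompts:
        remaining = token_budget - used
        if remaining <= 0:
            break
        # Cap the trace length, then trim the final sample to fit the budget.
        cot = rollout(prompt, max_cot_tokens)[:max_cot_tokens]
        sample = (prompt + cot)[:remaining]
        samples.append(sample)
        used += len(sample)
    return samples


def toy_rollout(prompt: Tokens, max_new_tokens: int) -> Tokens:
    """Stand-in for the dense model: emits a trace ten times the prompt length."""
    return [0] * min(max_new_tokens, 10 * len(prompt))


calib = build_rac_calibration([[1, 2, 3], [4, 5]], toy_rollout, token_budget=40)
print([len(s) for s in calib])  # [33, 7]
```

In a real run, `rollout` would wrap generation with the dense model being pruned, which is what makes the traces on-policy.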

Compute: Calibration uses 1M tokens. Inference speedups evaluated on NVIDIA Ampere hardware.

Comparison to Prior Work

vs. SparseGPT (Standard): RAC changes the calibration data distribution to include self-generated CoT, reducing decode-time reconstruction error
vs. PPC-GPT: RAC injects CoT activations directly into the pruning objective (SparseGPT), eliminating the need for a separate distillation stage

Limitations

Code generation tasks appear more sensitive to compression than math tasks, even with RAC
On-policy data generation adds a one-time computational cost prior to pruning compared to using pre-existing C4 data
Benefits are less pronounced at lower sparsity levels (e.g., 20%) where standard methods still perform well
Analysis is limited to DeepSeek-R1 distilled models; applicability to other reasoning architectures (e.g., non-distilled) is not explicitly tested

Reproducibility

Code: https://github.com/RyanLucas3/RAC

The paper uses open-source DeepSeek-R1-Distill-Qwen checkpoints and standard datasets (MATH-500, C4).

📊 Experiments & Results

Evaluation Setup

Zero-shot evaluation on reasoning benchmarks

Benchmarks:

MATH-500 (Mathematical Reasoning)
LiveCodeBench (Code Generation)

Metrics:

Accuracy (Pass@1)
Pass@1:16 (Top-16 Accuracy)
Statistical methodology: Not explicitly reported in the paper

Key Results

Pruning results on MATH-500 showing RAC consistently outperforming baselines, particularly at high sparsity (50%).
Benchmark	Metric	Baseline	This Paper	Δ
MATH-500	Accuracy (%)	74.4	90.0	+15.6
MATH-500	Accuracy (%)	35.6	66.4	+30.8
MATH-500	Accuracy (%)	83.2	90.0	+6.8
Throughput analysis demonstrating that RAC preserves inference speed by preventing excessive CoT generation.
Benchmark	Metric	Baseline	This Paper	Δ
MATH-500	Runtime (minutes)	29.1	23.3	-5.8
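
The Δ column can be re-derived directly from the reported values. The short check below also computes the relative runtime reduction, a figure that follows from the table but is not stated in the summary itself.

```python
# (metric, baseline, this paper), copied from the rows above.
results = [
    ("Accuracy", 74.4, 90.0),
    ("Accuracy", 35.6, 66.4),
    ("Accuracy", 83.2, 90.0),
    ("Runtime (minutes)", 29.1, 23.3),
]

# Absolute differences: percentage points for accuracy, minutes for runtime.
deltas = [round(ours - base, 1) for _, base, ours in results]

# Relative runtime saving of the RAC-pruned model versus the baseline, in percent.
runtime_reduction_pct = round(100 * (29.1 - 23.3) / 29.1, 1)

print(deltas, runtime_reduction_pct)
```

The accuracy deltas are absolute differences in percentage points, while the runtime saving works out to roughly a fifth of the baseline's wall-clock time.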

Main Takeaways

RAC mitigates the 'rambling' pathology where standard pruning causes models to generate excessive, low-quality CoT tokens
Domain-specific prompts (Prompt-only) are better than C4, but full RAC (Prompts + CoT) provides significant additional gains
Larger models (14B, 32B) are more robust to compression overall, but RAC still offers improvements over baselines
On-policy calibration (using the model's own traces) outperforms off-policy calibration (using traces from a larger teacher model), suggesting activation patterns are model-specific

📚 Prerequisite Knowledge

Prerequisites

Neural network pruning (specifically SparseGPT)
Chain-of-Thought (CoT) prompting
Autoregressive generation

Key Terms

RAC: Reasoning-Aware Compression—the proposed method of calibrating pruning algorithms using self-generated chain-of-thought traces

CoT: Chain-of-Thought—intermediate reasoning steps generated by a model before the final answer

SparseGPT: A one-shot pruning algorithm that compresses LLMs by minimizing layer-wise reconstruction error using a small calibration dataset

On-policy: Data generated by the specific model being compressed (as opposed to external static datasets)

Calibration data: A small set of input tokens used by pruning algorithms to estimate the importance of weights

DeepSeek-R1: A class of reasoning-optimized Large Language Models trained via Reinforcement Learning to generate long reasoning traces

Pass@1: Accuracy metric measuring if the top-1 generated answer matches the ground truth

Pass@16: Accuracy metric measuring if the correct answer appears in the top 16 generated samples
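
For reference, pass@k is often estimated from n sampled generations, of which c are correct, using the unbiased estimator popularized in code-generation evaluation. The sketch below assumes that estimator; the summary does not state how the paper computes its metrics, so treat this as background rather than the paper's protocol.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the probability that at least one of k samples is correct,
    given c correct answers among n generated samples."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: every size-k subset contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


print(pass_at_k(n=16, c=4, k=1))   # 0.25
print(pass_at_k(n=16, c=1, k=16))  # 1.0
```

With k = 1 the estimator reduces to the fraction of correct samples, which is why averaging over many samples gives a lower-variance Pass@1.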