TokenSkip improves LLM efficiency by pruning semantically redundant tokens from Chain-of-Thought data and fine-tuning models to generate these compressed reasoning paths directly based on a control parameter.
Core Problem
Long Chain-of-Thought (CoT) sequences significantly increase inference latency and compute costs due to the autoregressive nature of LLMs, but simply cutting steps hurts reasoning accuracy.
Why it matters:
Longer CoT sequences (often thousands of tokens) are needed for complex reasoning (e.g., OpenAI o1), creating a linear increase in latency and a quadratic cost in attention
Existing prompt-based compression often fails to adhere to target lengths, while brute-force truncation destroys reasoning capabilities
Step-skipping approaches can conflict with test-time scaling, impairing the model's ability to solve complex problems
Concrete Example: In a math problem asking for an age calculation, a standard CoT includes filler phrases such as 'Let's break it down step by step' and 'so Marcus is'. TokenSkip removes these connectors, keeping only the critical equations and numbers (e.g., 'Deanna is 26... Marcus 26-5=21...'), substantially shortening the output without losing the logic.
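The pruning idea in this example can be sketched in a few lines. This is a minimal illustration with hand-assigned importance scores; the paper obtains real scores from LLMLingua-2, and `prune_cot` is a hypothetical helper, not the authors' code.

```python
import math

def prune_cot(tokens, scores, ratio):
    """Keep the ceil(ratio * n) highest-importance tokens, preserving order."""
    n_keep = math.ceil(ratio * len(tokens))
    # Indices of the top-n_keep tokens by importance score, restored to original order
    keep = sorted(sorted(range(len(tokens)), key=lambda i: -scores[i])[:n_keep])
    return [tokens[i] for i in keep]

cot = ["Let's", "break", "it", "down", ":", "Deanna", "is", "26", ",",
       "so", "Marcus", "is", "26-5=21", "."]
# Hypothetical scores: equations/numbers score high, connectors low
scores = [0.1, 0.1, 0.05, 0.05, 0.2, 0.9, 0.3, 0.95, 0.2,
          0.1, 0.9, 0.3, 0.99, 0.4]
print(" ".join(prune_cot(cot, scores, 0.5)))
# → "Deanna is 26 Marcus is 26-5=21 ."
```

At a 0.5 ratio the connectors drop out while the entities and the equation survive, mirroring the Deanna/Marcus example above.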
Key Novelty
Controllable CoT Compression via SFT on Pruned Trajectories
Analyzes token importance in CoT to reveal that not all tokens contribute equally to reasoning (e.g., equations are more important than connectors)
Constructs a training dataset by pruning low-importance tokens from valid CoT trajectories based on a target compression ratio
Fine-tunes the LLM to generate these compressed CoTs directly when prompted with a specific control token (compression ratio), allowing adjustable efficiency-accuracy trade-offs
Architecture
The overall workflow of TokenSkip, illustrating the data construction via token pruning and the subsequent fine-tuning and inference process.
Evaluation Highlights
Reduces reasoning tokens by 40% (from 313 to 181) on GSM8K using Qwen2.5-14B-Instruct with less than a 0.4% performance drop
Achieves a 1.8x inference speedup on GSM8K with a 0.53 compression ratio while maintaining strong accuracy (only ~10% drop compared to 79% drop for truncation)
Reduces tokens by 30% on MATH-500 using LLaMA-3.1-8B-Instruct with less than a 4% performance decline
Breakthrough Assessment
7/10
Offers a practical, low-cost solution (SFT with LoRA) to a significant problem (CoT latency). The preservation of accuracy at 40% compression is impressive, though the method relies on existing importance metrics like LLMLingua-2.
⚙️ Technical Details
Problem Definition
Setting: Autoregressive generation of Chain-of-Thought sequences given a question and a target compression ratio
Inputs: Natural language question x and a compression ratio token γ
Outputs: Compressed Chain-of-Thought sequence ĉ followed by the final answer â
Pipeline Flow
Input Processing (Append Compression Ratio)
LLM Inference (Autoregressive Generation)
Output Generation (Compressed CoT + Answer)
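The input-processing step above amounts to appending the target compression ratio to the prompt. A minimal sketch follows; the `[RATIO]` delimiter and the exact template are assumptions for illustration, not the paper's verbatim format.

```python
def build_prompt(question: str, gamma: float) -> str:
    """Append the compression-ratio control value to the question.

    gamma = 1.0 requests no compression; lower values request shorter CoTs.
    The "[RATIO]" marker is a placeholder, not the paper's actual token.
    """
    return f"{question}\n[RATIO] {gamma:.1f}"

prompt = build_prompt(
    "Deanna is 26. Marcus is 5 years younger. How old is Marcus?", 0.5)
print(prompt)
```

The fine-tuned model then generates the compressed CoT and answer autoregressively from this prompt.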
System Modules
Input Processor
Formats the input prompt by appending the target compression ratio
Model or implementation: Rule-based
Reasoning Engine
Generates the compressed chain-of-thought and final answer
Model or implementation: Qwen2.5-14B-Instruct or LLaMA-3.1-8B-Instruct (Fine-tuned with LoRA)
Novel Architectural Elements
Integration of a continuous control parameter (γ) directly into the SFT input to modulate output length dynamically during inference
Modeling
Base Model: LLaMA-3.1-8B-Instruct, Qwen2.5-Instruct (7B, 14B, 32B)
Training Method: Supervised Fine-Tuning (SFT) with LoRA
Objective Functions:
Purpose: Maximize the likelihood of the compressed CoT and final answer given the question and the compression ratio.
Formally: Standard cross-entropy loss over the output sequence y = {ĉ, â}
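Spelled out with the notation above (question x, ratio γ, output y = {ĉ, â}), the standard cross-entropy objective is:

```latex
\mathcal{L}(\theta) = -\sum_{t=1}^{|y|} \log P_{\theta}\left(y_t \mid x, \gamma, y_{<t}\right), \qquad y = \{\hat{c}, \hat{a}\}
```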
Adaptation: LoRA (Low-Rank Adaptation)
Trainable Parameters: 0.2% of model parameters (for Qwen2.5-14B-Instruct)
Training Data:
Source: GSM8K (7,473 examples) and MATH (7,500 examples) training sets
Process: Prune original CoTs using LLMLingua-2 importance scores at ratios {0.5, 0.6, ..., 1.0}
Filter: Only trajectories with correct answers are kept
Compute: Training takes ~2 hours for 7B model and ~2.5 hours for 14B model on 2 NVIDIA 3090 GPUs
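The data-construction loop described above can be sketched as follows. The scorer is a stub standing in for LLMLingua-2, the `[RATIO]` marker and example format are assumptions, and the answer-correctness flag stands in for checking against the dataset's gold answers.

```python
import math

def importance_scores(tokens):
    # Stub scorer: the real pipeline uses LLMLingua-2's bidirectional
    # classifier; here, tokens containing digits score high.
    return [0.9 if any(ch.isdigit() for ch in t) else 0.1 for t in tokens]

def prune(tokens, scores, ratio):
    """Keep the ceil(ratio * n) highest-scoring tokens, preserving order."""
    n_keep = math.ceil(ratio * len(tokens))
    keep = sorted(sorted(range(len(tokens)), key=lambda i: -scores[i])[:n_keep])
    return [tokens[i] for i in keep]

def build_sft_examples(question, cot_tokens, answer, is_correct):
    # Filter: only trajectories with correct answers are kept
    if not is_correct:
        return []
    scores = importance_scores(cot_tokens)
    examples = []
    for ratio in [0.5, 0.6, 0.7, 0.8, 0.9, 1.0]:  # ratios used in the paper
        compressed = " ".join(prune(cot_tokens, scores, ratio))
        examples.append({"input": f"{question}\n[RATIO] {ratio}",
                         "target": f"{compressed}\nAnswer: {answer}"})
    return examples

exs = build_sft_examples("How old is Marcus?",
                         ["Deanna", "is", "26", "so", "Marcus", "is", "26-5=21"],
                         "21", True)
print(len(exs))  # one training example per compression ratio
```

Each correct trajectory thus yields one SFT pair per ratio, which is what lets the fine-tuned model later honor an arbitrary γ at inference time.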
Comparison to Prior Work
vs. Token-efficient Prompts: TokenSkip achieves actual compression targets (e.g., 0.5) whereas prompts often fail (staying at ~0.94-0.97 ratio)
vs. Truncation: TokenSkip maintains reasoning accuracy (0.4% drop vs 79% drop for truncation at 0.5 ratio) by intelligently skipping redundant tokens rather than cutting off the answer
vs. Step-skipping methods [not cited in paper]: TokenSkip operates at the token level, allowing finer-grained compression than removing entire reasoning steps
Limitations
Lower compression ratios (0.3, 0.4) lead to degraded ratio adherence and performance due to loss of critical information
Relies on the quality of the external token importance metric (LLMLingua-2) for training data generation
Experiments primarily focus on math reasoning benchmarks (GSM8K, MATH); generalizability to other domains is less explored in the main text
Code and checkpoints are publicly available at https://github.com/hemingkx/TokenSkip. The method relies on LLMLingua-2 for data generation. Experiments use greedy decoding.
📊 Experiments & Results
Evaluation Setup
Math reasoning tasks using standard benchmarks
Benchmarks:
GSM8K (Grade school math reasoning)
MATH-500 (challenging math problems; a 500-problem subset of the MATH test set)
Metrics:
Accuracy
Number of CoT tokens
Inference latency
Compression Ratio Adherence
Statistical methodology: Not explicitly reported in the paper
Key Results
| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| GSM8K | Reasoning Tokens | 313 | 181 | -132 |
| GSM8K | Accuracy Drop (%) | 0.0 | -0.4 | -0.4 |
| MATH-500 | Reasoning Tokens (relative) | 1.0 | 0.7 | -0.3 |
| GSM8K | Accuracy Drop, truncation baseline (%) | 0.0 | -79 | -79 |
| GSM8K | Inference Speedup | 1.0x | 1.8x | +0.8x |
Experiment Figures
Performance of Qwen2.5-Instruct series on GSM8K across different compression ratios.
Main Takeaways
Larger models (e.g., Qwen2.5-14B) handle CoT compression better than smaller models (7B/8B), showing almost no performance drop at 40% compression
Prompt engineering approaches (e.g., 'Be Concise') fail to significantly reduce token counts (only ~5% reduction), whereas TokenSkip allows precise control
TokenSkip effectively learns to skip semantic connectors and filler words while retaining mathematical equations and numbers, which are critical for reasoning
Brute-force truncation is catastrophic for reasoning performance, while TokenSkip preserves accuracy even at 50% compression
📚 Prerequisite Knowledge
Prerequisites
Chain-of-Thought (CoT) prompting
Autoregressive language modeling
Supervised Fine-Tuning (SFT)
Low-Rank Adaptation (LoRA)
Key Terms
CoT: Chain-of-Thought—a prompting technique where the model generates intermediate reasoning steps before the final answer
LLMLingua-2: A token classification method used to measure the semantic importance of tokens (whether they can be removed without losing meaning) using a bidirectional language model
LoRA: Low-Rank Adaptation—a parameter-efficient fine-tuning technique that updates only a small subset of model parameters
SFT: Supervised Fine-Tuning—training a pre-trained language model on a labeled dataset to adapt it to a specific task
GSM8K: Grade School Math 8K—a benchmark dataset consisting of high-quality grade school math word problems
MATH: A benchmark dataset of challenging mathematics problems derived from high school math competitions