TokenSkip improves LLM efficiency by pruning semantically redundant tokens from Chain-of-Thought data and fine-tuning models to generate these compressed reasoning paths directly based on a control parameter.
Core Problem
Long Chain-of-Thought (CoT) sequences significantly increase inference latency and compute costs due to the autoregressive nature of LLMs, but simply cutting steps hurts reasoning accuracy.
Why it matters:
Longer CoT sequences (often thousands of tokens) are needed for complex reasoning (e.g., OpenAI o1), creating a linear increase in latency and a quadratic cost in attention
Existing prompt-based compression often fails to adhere to target lengths, while brute-force truncation destroys reasoning capabilities
Step-skipping approaches can conflict with test-time scaling, impairing the model's ability to solve complex problems
Concrete Example: In a math problem asking for an age calculation, a standard CoT includes filler phrases such as 'Let's break it down step by step' and 'so Marcus is'. TokenSkip removes these connectors, keeping only the critical equations and numbers (e.g., 'Deanna is 26... Marcus 26-5=21...'), substantially shortening the output without losing the logic.
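The pruning idea in this example can be sketched in a few lines. This is a minimal illustration with hand-assigned importance scores; the paper obtains real scores from LLMLingua-2, and `prune_cot` is a hypothetical helper, not the authors' code.

```python
import math

def prune_cot(tokens, scores, ratio):
    """Keep the ceil(ratio * n) highest-importance tokens, preserving order."""
    n_keep = math.ceil(ratio * len(tokens))
    # Indices of the top-n_keep tokens by importance score, restored to original order
    keep = sorted(sorted(range(len(tokens)), key=lambda i: -scores[i])[:n_keep])
    return [tokens[i] for i in keep]

cot = ["Let's", "break", "it", "down", ":", "Deanna", "is", "26", ",",
       "so", "Marcus", "is", "26-5=21", "."]
# Hypothetical scores: equations/numbers score high, connectors low
scores = [0.1, 0.1, 0.05, 0.05, 0.2, 0.9, 0.3, 0.95, 0.2,
          0.1, 0.9, 0.3, 0.99, 0.4]
print(" ".join(prune_cot(cot, scores, 0.5)))
# → "Deanna is 26 Marcus is 26-5=21 ."
```

At a 0.5 ratio the connectors drop out while the entities and the equation survive, mirroring the Deanna/Marcus example above.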
Key Novelty
Controllable CoT Compression via SFT on Pruned Trajectories
Analyzes token importance in CoT to reveal that not all tokens contribute equally to reasoning (e.g., equations are more important than connectors)
Constructs a training dataset by pruning low-importance tokens from valid CoT trajectories based on a target compression ratio
Fine-tunes the LLM to generate these compressed CoTs directly when prompted with a specific control token (compression ratio), allowing adjustable efficiency-accuracy trade-offs
Architecture
The overall workflow of TokenSkip, illustrating the data construction via token pruning and the subsequent fine-tuning and inference process.
Evaluation Highlights
Reduces reasoning tokens by 40% (from 313 to 181) on GSM8K using Qwen2.5-14B-Instruct with less than a 0.4% performance drop
Achieves a 1.8x inference speedup on GSM8K with a 0.53 compression ratio while maintaining strong accuracy (only ~10% drop compared to 79% drop for truncation)
Reduces tokens by 30% on MATH-500 using LLaMA-3.1-8B-Instruct with less than a 4% performance decline
Breakthrough Assessment
7/10
Offers a practical, low-cost solution (SFT with LoRA) to a significant problem (CoT latency). The preservation of accuracy at 40% compression is impressive, though the method relies on existing importance metrics like LLMLingua-2.
⚙️ Technical Details
Problem Definition
Setting: Autoregressive generation of Chain-of-Thought sequences given a question and a target compression ratio
Inputs: Natural language question x and a compression ratio token γ
Outputs: Compressed Chain-of-Thought sequence ĉ followed by the final answer â
Pipeline Flow
Input Processing (Append Compression Ratio)
LLM Inference (Autoregressive Generation)
Output Generation (Compressed CoT + Answer)
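The input-processing step above amounts to appending the target compression ratio to the prompt. A minimal sketch follows; the `[RATIO]` delimiter and the exact template are assumptions for illustration, not the paper's verbatim format.

```python
def build_prompt(question: str, gamma: float) -> str:
    """Append the compression-ratio control value to the question.

    gamma = 1.0 requests no compression; lower values request shorter CoTs.
    The "[RATIO]" marker is a placeholder, not the paper's actual token.
    """
    return f"{question}\n[RATIO] {gamma:.1f}"

prompt = build_prompt(
    "Deanna is 26. Marcus is 5 years younger. How old is Marcus?", 0.5)
print(prompt)
```

The fine-tuned model then generates the compressed CoT and answer autoregressively from this prompt.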
System Modules
Input Processor
Formats the input prompt by appending the target compression ratio
Model or implementation: Rule-based
Reasoning Engine
Generates the compressed chain-of-thought and final answer
Model or implementation: Qwen2.5-14B-Instruct or LLaMA-3.1-8B-Instruct (Fine-tuned with LoRA)
Novel Architectural Elements
Integration of a continuous control parameter (γ) directly into the SFT input to modulate output length dynamically during inference
Modeling
Base Model: LLaMA-3.1-8B-Instruct, Qwen2.5-Instruct (7B, 14B, 32B)
Training Method: Supervised Fine-Tuning (SFT) with LoRA
Objective Functions:
Purpose: Maximize the likelihood of the compressed CoT and final answer given the question and the compression ratio.
Formally: Standard cross-entropy loss over the output sequence y = {ĉ, â}
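Spelled out with the notation above (question x, ratio γ, output y = {ĉ, â}), the standard cross-entropy objective is:

```latex
\mathcal{L}(\theta) = -\sum_{t=1}^{|y|} \log P_{\theta}\left(y_t \mid x, \gamma, y_{<t}\right), \qquad y = \{\hat{c}, \hat{a}\}
```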
Adaptation: LoRA (Low-Rank Adaptation)
Trainable Parameters: 0.2% of model parameters (for Qwen2.5-14B-Instruct)
Training Data:
Source: GSM8K (7,473 examples) and MATH (7,500 examples) training sets
Process: Prune original CoTs using LLMLingua-2 importance scores at ratios {0.5, 0.6, ..., 1.0}
Filter: Only trajectories with correct answers are kept
Compute: Training takes ~2 hours for 7B model and ~2.5 hours for 14B model on 2 NVIDIA 3090 GPUs
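The data-construction loop described above can be sketched as follows. The scorer is a stub standing in for LLMLingua-2, the `[RATIO]` marker and example format are assumptions, and the answer-correctness flag stands in for checking against the dataset's gold answers.

```python
import math

def importance_scores(tokens):
    # Stub scorer: the real pipeline uses LLMLingua-2's bidirectional
    # classifier; here, tokens containing digits score high.
    return [0.9 if any(ch.isdigit() for ch in t) else 0.1 for t in tokens]

def prune(tokens, scores, ratio):
    """Keep the ceil(ratio * n) highest-scoring tokens, preserving order."""
    n_keep = math.ceil(ratio * len(tokens))
    keep = sorted(sorted(range(len(tokens)), key=lambda i: -scores[i])[:n_keep])
    return [tokens[i] for i in keep]

def build_sft_examples(question, cot_tokens, answer, is_correct):
    # Filter: only trajectories with correct answers are kept
    if not is_correct:
        return []
    scores = importance_scores(cot_tokens)
    examples = []
    for ratio in [0.5, 0.6, 0.7, 0.8, 0.9, 1.0]:  # ratios used in the paper
        compressed = " ".join(prune(cot_tokens, scores, ratio))
        examples.append({"input": f"{question}\n[RATIO] {ratio}",
                         "target": f"{compressed}\nAnswer: {answer}"})
    return examples

exs = build_sft_examples("How old is Marcus?",
                         ["Deanna", "is", "26", "so", "Marcus", "is", "26-5=21"],
                         "21", True)
print(len(exs))  # one training example per compression ratio
```

Each correct trajectory thus yields one SFT pair per ratio, which is what lets the fine-tuned model later honor an arbitrary γ at inference time.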
Comparison to Prior Work
vs. Token-efficient Prompts: TokenSkip achieves actual compression targets (e.g., 0.5) whereas prompts often fail (staying at ~0.94-0.97 ratio)
vs. Truncation: TokenSkip maintains reasoning accuracy (0.4% drop vs 79% drop for truncation at 0.5 ratio) by intelligently skipping redundant tokens rather than cutting off the answer
vs. Step-skipping methods [not cited in paper]: TokenSkip operates at the token level, allowing finer-grained compression than removing entire reasoning steps
Limitations
Lower compression ratios (0.3, 0.4) lead to degraded ratio adherence and performance due to loss of critical information
Relies on the quality of the external token importance metric (LLMLingua-2) for training data generation
Experiments primarily focus on math reasoning benchmarks (GSM8K, MATH); generalizability to other domains is less explored in the main text
Code and checkpoints are publicly available at https://github.com/hemingkx/TokenSkip. The method relies on LLMLingua-2 for data generation. Experiments use greedy decoding.
📊 Experiments & Results
Evaluation Setup
Math reasoning tasks using standard benchmarks
Benchmarks:
GSM8K (Grade school math reasoning)
MATH-500 (challenging math problems; a 500-problem subset of the MATH test set)
Metrics:
Accuracy
Number of CoT tokens
Inference latency
Compression Ratio Adherence
Statistical methodology: Not explicitly reported in the paper
Key Results
| Benchmark | Metric | Baseline | This Paper | Δ |
|---|---|---|---|---|
| GSM8K | Reasoning Tokens | 313 | 181 | -132 |
| GSM8K | Accuracy Drop (%) | 0.0 | -0.4 | -0.4 |
| MATH-500 | Reasoning Tokens (relative) | 1.0 | 0.7 | -0.3 |
| GSM8K | Accuracy Drop, truncation baseline (%) | 0.0 | -79 | -79 |
| GSM8K | Inference Speedup | 1.0x | 1.8x | +0.8x |
Experiment Figures
Performance of Qwen2.5-Instruct series on GSM8K across different compression ratios.
Main Takeaways
Larger models (e.g., Qwen2.5-14B) handle CoT compression better than smaller models (7B/8B), showing almost no performance drop at 40% compression
Prompt engineering approaches (e.g., 'Be Concise') fail to significantly reduce token counts (only ~5% reduction), whereas TokenSkip allows precise control
TokenSkip effectively learns to skip semantic connectors and filler words while retaining mathematical equations and numbers, which are critical for reasoning
Brute-force truncation is catastrophic for reasoning performance, while TokenSkip preserves accuracy even at 50% compression
📚 Prerequisite Knowledge
Prerequisites
Chain-of-Thought (CoT) prompting
Autoregressive language modeling
Supervised Fine-Tuning (SFT)
Low-Rank Adaptation (LoRA)
Key Terms
CoT: Chain-of-Thought—a prompting technique where the model generates intermediate reasoning steps before the final answer
LLMLingua-2: A token classification method used to measure the semantic importance of tokens (whether they can be removed without losing meaning) using a bidirectional language model
LoRA: Low-Rank Adaptation—a parameter-efficient fine-tuning technique that updates only a small subset of model parameters
SFT: Supervised Fine-Tuning—training a pre-trained language model on a labeled dataset to adapt it to a specific task
GSM8K: Grade School Math 8K—a benchmark dataset consisting of high-quality grade school math word problems
MATH: A benchmark dataset of challenging mathematics problems derived from high school math competitions