
How Well do LLMs Compress Their Own Chain-of-Thought? A Token Complexity Approach

Ayeong Lee, Ethan Che, Tianyi Peng
Columbia Business School
arXiv.org (2025)
Reasoning Benchmark

📝 Paper Summary

Efficient LLM Reasoning · Chain-of-Thought Compression · Prompt Engineering
Reasoning accuracy is determined primarily by response length rather than prompt formatting, governed by an intrinsic minimum token threshold per question called token complexity.
Core Problem
Chain-of-Thought reasoning is effective but computationally expensive and verbose, and it is unclear which compression strategies (e.g., asking the model to be concise vs. stripping grammar) best preserve accuracy.
Why it matters:
  • Inference costs for reasoning models are projected to increase substantially as they are deployed in real-world applications
  • Current compression prompts like 'be concise' are applied ad-hoc without understanding their limits or optimal trade-offs
  • There is a lack of benchmarks to measure progress in reasoning efficiency against theoretical limits
Concrete Example: A user asks a math problem. Standard CoT uses 635 tokens. A 'be concise' prompt uses 505 tokens. The paper investigates if we can go lower (e.g., to 172 tokens) without getting the answer wrong, finding that below a specific threshold, the model fails regardless of the prompt used.
Key Novelty
The Token Complexity Hypothesis
  • Proposes that every problem has an intrinsic 'token complexity'—a sharp threshold of minimum tokens required for a specific LLM to solve it correctly
  • Demonstrates a 'universal trade-off curve' where diverse prompts (e.g., 'use bullet points', 'no spaces', 'Chinese') all fall on the same length-accuracy frontier
  • Applies rate-distortion theory to calculate upper bounds on how much reasoning can be compressed, serving as a benchmark for efficiency
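The token-complexity hypothesis can be sketched as a step function: a model answers question *i* correctly if and only if its chain of thought spends at least that question's threshold of tokens. A minimal illustration (the thresholds and lengths below are hypothetical, not the paper's measurements):

```python
# Sketch of the token-complexity hypothesis: each question has a
# threshold tau, and the model is correct iff its CoT length >= tau.

def predict_correct(tokens_used: int, token_complexity: int) -> bool:
    """Step-function accuracy model: correct iff length meets threshold."""
    return tokens_used >= token_complexity

# Hypothetical benchmark: (token_complexity, tokens a given prompt elicits)
questions = [(120, 635), (300, 505), (450, 172), (90, 210)]

accuracy = sum(
    predict_correct(used, tau) for tau, used in questions
) / len(questions)
print(accuracy)  # 3 of 4 responses exceed their threshold -> 0.75
```

Under this model, any prompt that shortens responses below a question's threshold fails on that question, which is why diverse prompts collapse onto one length-accuracy curve.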
Evaluation Highlights
  • Token complexity thresholds alone can predict the success or failure of CoT prompting strategies with 94% accuracy across benchmarks
  • Current prompt-based compression strategies are far from optimal: on GSM8K, GPT-4o achieves 1.40x compression via prompting but theoretically allows for 10.90x compression
  • Formatting matters less than length: prompts ranging from 'only numbers' to 'Chinese CoT' align on a single universal accuracy-length trade-off curve
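The gap between achieved and theoretically possible compression can be illustrated as follows: if token complexity tells us the minimum tokens each question truly needs, an ideal strategy would spend exactly that many, and the optimal compression ratio is baseline tokens divided by that minimum. The numbers below are hypothetical, chosen only to show the calculation:

```python
# Illustrative achieved-vs-optimal compression calculation
# (hypothetical token counts, not the paper's data).

baseline_tokens  = [635, 820, 540]   # standard CoT lengths per question
concise_tokens   = [505, 600, 410]   # lengths under a 'be concise' prompt
token_complexity = [120, 60, 95]     # minimum tokens each question needs

# Achieved ratio: baseline vs. prompt-compressed lengths.
achieved = sum(baseline_tokens) / sum(concise_tokens)
# Optimal ratio: baseline vs. spending only the token complexity.
optimal = sum(baseline_tokens) / sum(token_complexity)
print(f"achieved {achieved:.2f}x vs optimal {optimal:.2f}x")
```

The same arithmetic with the paper's GSM8K measurements yields the reported 1.40x achieved vs. 10.90x theoretically possible for GPT-4o.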
Breakthrough Assessment
7/10
While it doesn't propose a new architecture, the discovery of 'token complexity' and the universal trade-off curve provides a fundamental theoretical framework for understanding and benchmarking reasoning efficiency.