CODI: Compressing Chain-of-Thought into Continuous Space via Self-Distillation

📝 Paper Summary

Implicit Chain-of-Thought · Model Compression / Acceleration · Knowledge Distillation

CODI compresses verbose chain-of-thought reasoning into compact continuous vectors by distilling the hidden state 'shift' induced by explicit reasoning into a student model via joint training.

Core Problem

Explicit Chain-of-Thought (CoT) is computationally expensive due to verbose token generation, while prior implicit CoT methods (reasoning in latent space) suffer from performance degradation and forgetting issues.

Why it matters:

Verbalizing full reasoning steps slows down inference significantly (communication vs. computation trade-off)
Learning explicit tokens can cause models to overfit on superficial linguistic cues rather than the underlying logic
Previous implicit methods like Coconut rely on curriculum learning, which leads to forgetting, and they consistently underperform explicit CoT

Concrete Example: When solving a math problem like '10/5 * 2', explicit CoT generates multiple tokens '<<10/5=2>> <<2*2=4>>'. CODI replaces these with a fixed number of continuous vectors (e.g., 6 vectors) that encode the same reasoning state, speeding up generation while maintaining accuracy.

Key Novelty

Continuous Chain-of-Thought via Self-Distillation (CODI)

Jointly trains one model as both 'Teacher' (Explicit CoT) and 'Student' (Implicit CoT), avoiding the forgetting caused by curriculum learning
Uses feature-level distillation to force the student's latent thoughts to produce the same hidden state 'shift' as the teacher's explicit reasoning steps
Aligns the hidden activations of a specific distillation token (e.g., the colon before the answer) rather than the entire sequence

Architecture

The CODI training framework showing parallel Teacher and Student tasks. The Teacher processes explicit CoT tokens, while the Student processes continuous thought vectors. A distillation loss aligns the Student's last hidden state with the Teacher's state before the answer.

Evaluation Highlights

Achieves 99% of explicit CoT-SFT accuracy on GSM8k with GPT-2, making it the first implicit CoT method to match explicit performance at this scale
Outperforms the previous state-of-the-art implicit method (Coconut) by 28.2% in accuracy on GSM8k
Achieves a 3.1x compression rate on GSM8k and up to 8.2x on the more verbose GSM8k-Aug-NL dataset
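
The two headline ratios can be illustrated with a small sketch. It assumes the compression rate is the explicit CoT length divided by the number of continuous thoughts, and that performance retention is implicit accuracy divided by explicit CoT-SFT accuracy; the summary does not spell out either definition. The function names and the input numbers are hypothetical, chosen only to exercise the arithmetic.

```python
def compression_rate(explicit_cot_tokens: float, continuous_thoughts: int) -> float:
    """Explicit CoT length divided by the number of continuous thoughts (assumed definition)."""
    if continuous_thoughts <= 0:
        raise ValueError("continuous_thoughts must be positive")
    return explicit_cot_tokens / continuous_thoughts


def retention(implicit_accuracy: float, explicit_accuracy: float) -> float:
    """Implicit-CoT accuracy as a fraction of explicit CoT-SFT accuracy (assumed definition)."""
    if explicit_accuracy <= 0:
        raise ValueError("explicit_accuracy must be positive")
    return implicit_accuracy / explicit_accuracy


# Hypothetical inputs, not figures from the paper.
print(compression_rate(explicit_cot_tokens=24, continuous_thoughts=6))  # 4.0
print(retention(implicit_accuracy=45.0, explicit_accuracy=50.0))        # 0.9
```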

Breakthrough Assessment

8/10

Significant advance in implicit reasoning; CODI is the first to effectively close the performance gap between explicit and implicit CoT on small models, offering a viable path for efficient latent reasoning.

⚙️ Technical Details

Problem Definition

Setting: Sequence-to-sequence generation where intermediate reasoning steps (Chain-of-Thought) are compressed into continuous latent vectors

Inputs: Natural language question Q

Outputs: Final answer Y, preceded by a sequence of continuous thought vectors Z

Pipeline Flow

Input Question Encoding
Latent Thought Generation (Student Loop)
Answer Decoding

System Modules

Input Encoder

Encodes the question tokens into hidden states

Model or implementation: GPT-2 or LLaMA-3.2-1b-Instruct

Thought Generator

Autoregressively generates n continuous thought vectors starting from <bot>

Model or implementation: Shared LLM Backbone + MLP Projector

Answer Decoder

Generates the final answer tokens after processing the <eot> token

Model or implementation: Shared LLM Backbone
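
The three modules above can be sketched as a single inference loop. This is a minimal illustration, not the authors' implementation: it assumes each continuous thought is the projected last hidden state fed back as the next input, and every name here (`run_codi_inference`, `backbone_step`, `projector`, `decode_answer`, `toy_backbone`) is a hypothetical stand-in for the real model components.

```python
from typing import Callable, List

Vector = List[float]


def run_codi_inference(
    question_states: List[Vector],
    backbone_step: Callable[[List[Vector]], Vector],
    projector: Callable[[Vector], Vector],
    decode_answer: Callable[[List[Vector]], str],
    bot: Vector,
    eot: Vector,
    n_thoughts: int = 6,
) -> str:
    # 1. Input question encoding: the context is the question states plus <bot>.
    context = list(question_states) + [bot]
    # 2. Latent thought generation: project the last hidden state and feed it
    #    back as the next input instead of decoding it into a token (assumption).
    for _ in range(n_thoughts):
        hidden = backbone_step(context)
        context.append(projector(hidden))
    # 3. Answer decoding: append <eot>, then decode the answer from the context.
    context.append(eot)
    return decode_answer(context)


def toy_backbone(context: List[Vector]) -> Vector:
    # Toy stand-in for the shared LLM backbone: the mean of all context vectors.
    dim = len(context[0])
    return [sum(v[i] for v in context) / len(context) for i in range(dim)]


answer = run_codi_inference(
    question_states=[[1.0, 0.0], [0.0, 1.0]],
    backbone_step=toy_backbone,
    projector=lambda h: h,  # identity in place of the MLP projector
    decode_answer=lambda ctx: f"{len(ctx)} positions",
    bot=[0.0, 0.0],
    eot=[0.0, 0.0],
    n_thoughts=6,
)
print(answer)  # 2 question + 1 <bot> + 6 thoughts + 1 <eot> -> "10 positions"
```

The answer decoder only ever sees a fixed number of extra positions (here 6 thoughts plus the two delimiters), regardless of how long the explicit reasoning would have been.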

Novel Architectural Elements

Joint Teacher-Student architecture where the Student generates continuous vectors via an MLP projector while the Teacher generates discrete tokens
Shared weights between Teacher and Student tasks, differing only in the intermediate generation path (discrete vs continuous)

Modeling

Base Model: GPT-2 (small) and LLaMA-3.2-1b-Instruct

Training Method: Multi-task learning with Self-Distillation

Objective Functions:

Purpose: Teacher learns explicit CoT reasoning.
Formally: Cross-entropy loss on explicit CoT tokens and answer.

Purpose: Student learns to generate answers from continuous thoughts.
Formally: Cross-entropy loss on final answer given continuous thoughts Z.

Purpose: Align Student's latent state with Teacher's reasoning state.
Formally: L2 distance (MSE) between Student's hidden state at <eot> and Teacher's hidden state at the distillation token (':'), with stop-gradient on Teacher.
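
One way to write the three objectives and their weighted combination is sketched below. The notation is an assumption made for illustration: C denotes the explicit CoT tokens, h_teacher and h_student the hidden states at the distillation token and at <eot>, d their dimension, and sg[·] the stop-gradient operator. Any per-layer averaging is omitted, and the paper's exact formulation may differ.

```latex
\begin{aligned}
\mathcal{L}_{\text{teacher}} &= -\log p_\theta(C, Y \mid Q) \\
\mathcal{L}_{\text{student}} &= -\log p_\theta(Y \mid Q, Z) \\
\mathcal{L}_{\text{distill}} &= \frac{1}{d}\,\bigl\lVert \operatorname{sg}\!\left[h_{\text{teacher}}\right] - h_{\text{student}} \bigr\rVert_2^2 \\
\mathcal{L} &= \alpha\,\mathcal{L}_{\text{teacher}} + \beta\,\mathcal{L}_{\text{student}} + \gamma\,\mathcal{L}_{\text{distill}}
\end{aligned}
```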

Training Data:

GSM8k-Aug (385k samples, math expressions only)
GSM8k-Aug-NL (includes natural language explanations)
CommonsenseQA-CoT (8.1k samples generated by GPT-4o-mini)

Key Hyperparameters:

thought_tokens: 6 (matching Coconut setup)
alpha: Weight for Teacher loss (hyperparameter)
beta: Weight for Student loss (hyperparameter)
gamma: Weight for Distillation loss (hyperparameter)
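
As a rough sketch of how the three weights enter training, the snippet below combines two precomputed cross-entropy values with an MSE distillation term. The function names, the default weights of 1.0, and the input numbers are all hypothetical; the summary does not report the values used in the paper.

```python
from typing import Sequence


def mse(student: Sequence[float], teacher: Sequence[float]) -> float:
    """Mean squared error between two hidden-state vectors of equal length."""
    if len(student) != len(teacher):
        raise ValueError("hidden states must have the same dimension")
    return sum((s - t) ** 2 for s, t in zip(student, teacher)) / len(student)


def codi_total_loss(
    teacher_ce: float,
    student_ce: float,
    h_student: Sequence[float],
    h_teacher: Sequence[float],
    alpha: float = 1.0,  # weight for Teacher loss (placeholder value)
    beta: float = 1.0,   # weight for Student loss (placeholder value)
    gamma: float = 1.0,  # weight for Distillation loss (placeholder value)
) -> float:
    """Weighted sum of the Teacher, Student, and Distillation terms."""
    return alpha * teacher_ce + beta * student_ce + gamma * mse(h_student, h_teacher)


loss = codi_total_loss(
    teacher_ce=2.0,
    student_ce=1.0,
    h_student=[1.0, 3.0],
    h_teacher=[0.0, 1.0],
    gamma=2.0,
)
print(loss)  # 2.0 + 1.0 + 2.0 * ((1 + 4) / 2) = 8.0
```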

Compute: Not reported in the paper

Comparison to Prior Work

vs. Coconut: CODI uses single-stage self-distillation instead of multi-stage curriculum learning, avoiding forgetting
vs. iCoT: CODI uses additional continuous tokens for computation, whereas iCoT relies solely on internal state modifications without extra compute tokens
vs. CoT-SFT: CODI reasons in latent space, offering higher efficiency (3.1x compression) while maintaining comparable accuracy

Limitations

A performance gap remains on the larger model (LLaMA-3.2-1b-Instruct), where CODI reaches 90% of CoT-SFT performance
Optimal compression ratio is dataset-dependent (e.g., 8.2x for verbose CoT vs 3.1x for concise CoT)
Requires CoT annotations for the teacher task during training

Reproducibility

Code: https://github.com/zhenyi4/codi

The code is publicly available at the repository above. The paper provides implementation details for the loss function and the model architecture modifications (MLP projector).

📊 Experiments & Results

Evaluation Setup

Mathematical and Commonsense Reasoning

Benchmarks:

GSM8k-Aug: Mathematical Reasoning (augmented dataset)
GSM8k-Aug-NL: Mathematical Reasoning with Natural Language
CommonsenseQA-CoT: Commonsense Reasoning [New]
SVAMP: Math Word Problems (OOD Robustness)
GSM-HARD: Hard Math Problems (OOD Robustness)

Metrics:

Accuracy
Statistical methodology: Not explicitly reported in the paper

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
GSM8k	Accuracy (relative to SOTA)	Not reported in the paper	Not reported in the paper	+28.2%
GSM8k	Compression Rate	1.0x	3.1x	+2.1x
GSM8k-Aug-NL	Compression Rate	1.0x	8.2x	+7.2x
GSM8k (GPT-2)	Performance Retention	100%	99%	-1%
GSM8k (LLaMA-1b)	Performance Retention	100%	90%	-10%

Experiment Figures

Bar charts comparing accuracy of CODI vs baselines (CoT-SFT, No-CoT, Coconut, iCoT) across GSM8k, GSM8k-Aug-NL, and CommonsenseQA for GPT-2 and LLaMA-1b.

Main Takeaways

CODI is the first implicit CoT method to match explicit CoT performance on small models (GPT-2), closing the performance gap that plagued prior methods like Coconut.
Self-distillation is more effective than curriculum learning for implicit CoT, preventing the catastrophic forgetting observed in baselines.
The method demonstrates strong robustness, generalizing well to out-of-domain datasets (SVAMP, GSM-HARD) and complex natural language CoT data.

📚 Prerequisite Knowledge

Prerequisites

Chain-of-Thought (CoT) prompting
Transformer architecture (Decoder-only)
Knowledge Distillation (Feature-level)
Hidden state activations

Key Terms

Implicit CoT: Reasoning performed in the model's latent continuous space (hidden states) rather than by generating explicit natural language tokens

Explicit CoT: Standard Chain-of-Thought reasoning where the model generates step-by-step natural language explanations before the answer

Latent Space: The high-dimensional vector space where the model represents data internally, as opposed to the discrete vocabulary space of tokens

Self-Distillation: A training process where a model acts as both teacher and student, transferring knowledge (e.g., reasoning patterns) from one task configuration to another within the same network

CoT Shift: The phenomenon where explicit reasoning tokens change the hidden activation values of the final query token compared to a sequence without reasoning

Curriculum Learning: A training strategy where the model learns from easy to hard examples or gradually changes the task difficulty; used by prior baselines like Coconut but prone to forgetting

Stop-gradient: An operation during training that prevents error gradients from backpropagating through a specific part of the network, used here to freeze the teacher's signals
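
The effect of stop-gradient on the distillation term can be shown with hand-written gradients. This is a minimal sketch assuming an MSE distillation loss; `distill_grads` is a hypothetical helper, and a real implementation would rely on an autodiff framework rather than manual derivatives.

```python
from typing import List, Sequence, Tuple


def distill_grads(
    student: Sequence[float], teacher: Sequence[float]
) -> Tuple[List[float], List[float]]:
    """Gradients of mean((student - teacher)^2) with stop-gradient on the teacher."""
    if len(student) != len(teacher):
        raise ValueError("hidden states must have the same dimension")
    d = len(student)
    # The student receives the usual MSE gradient: 2 * (s - t) / d.
    grad_student = [2.0 * (s - t) / d for s, t in zip(student, teacher)]
    # Stop-gradient treats the teacher state as a constant, so it receives
    # no gradient from the distillation term.
    grad_teacher = [0.0] * d
    return grad_student, grad_teacher


g_student, g_teacher = distill_grads([1.0, 3.0], [0.0, 1.0])
print(g_student)  # [1.0, 2.0]
print(g_teacher)  # [0.0, 0.0]
```

Under this setup the distillation term pulls the Student's state toward the Teacher's without moving the Teacher's state in return.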