SIM-CoT: Supervised Implicit Chain-of-Thought

📝 Paper Summary

Implicit Chain-of-Thought, Efficient Reasoning, Latent Space Reasoning

SIM-CoT stabilizes implicit reasoning by using a temporary auxiliary decoder during training to force latent states to encode specific reasoning steps, preventing representational collapse while maintaining inference efficiency.

Core Problem

Implicit Chain-of-Thought methods (reasoning in latent space) suffer from instability when scaling the number of latent tokens; training often collapses as latents become homogeneous and lose semantic meaning.

Why it matters:

Current implicit methods trade too much accuracy for efficiency, creating a performance gap that prevents broad adoption
Without stability, models cannot scale the computational budget (number of thinking steps) to solve harder problems, limiting their potential compared to explicit CoT
Implicit reasoning is typically a 'black box', offering no interpretability for what the model is thinking

Concrete Example: When scaling the method 'Coconut' from 3 to 5 latent tokens on GSM8k, accuracy drops drastically (from ~45% to 12.5%). Analysis shows the 5 latent tokens become nearly identical vectors representing only numbers, losing the operator information needed for calculation.
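
The collapse described above can be made measurable. The following is a minimal sketch, assuming mean pairwise cosine similarity as the homogeneity measure (the paper's own distance metrics are not specified here) and using made-up toy vectors rather than real latents:

```python
import math
from itertools import combinations

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def mean_pairwise_cosine(latents):
    """Average cosine similarity over all distinct pairs of latent vectors."""
    pairs = list(combinations(latents, 2))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)

# Toy data, invented for illustration only (not taken from the paper).
collapsed = [[1.0, 0.9, 0.1], [1.0, 0.88, 0.12], [0.98, 0.9, 0.1]]
diverse = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

print(f"collapsed: {mean_pairwise_cosine(collapsed):.3f}")  # close to 1
print(f"diverse:   {mean_pairwise_cosine(diverse):.3f}")    # 0.000
```

A value near 1 means the latents point in almost the same direction, which is the homogeneous regime the example describes.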

Key Novelty

Supervised Implicit Chain-of-Thought (SIM-CoT)

Introduces an auxiliary decoder during training that takes an implicit latent vector and generates the corresponding explicit text reasoning step
Provides step-level supervision to the latent space, forcing each latent vector to capture distinct, meaningful semantic information (grounding)
Removes the auxiliary decoder during inference, retaining the token efficiency of implicit CoT with zero computational overhead

Architecture

Comparison of training vs. inference workflows for Coconut, CODI, and SIM-CoT. Highlights SIM-CoT's use of an auxiliary decoder.

Evaluation Highlights

+8.2% accuracy improvement over the implicit baseline Coconut on GSM8k-Aug using a GPT-2 backbone
Surpasses explicit Chain-of-Thought (CoT-SFT) by 2.1% on GPT-2 while being 2.3x more token-efficient
+3.0% improvement over the state-of-the-art implicit method CODI on LLaMA-3.1 8B

Breakthrough Assessment

8/10

Successfully addresses the critical stability collapse in implicit CoT, allowing it to finally surpass explicit CoT in accuracy while maintaining efficiency. Adds valuable interpretability to latent reasoning.

⚙️ Technical Details

Problem Definition

Setting: Autoregressive generation where intermediate reasoning steps are encoded as continuous latent vectors rather than discrete tokens

Inputs: Natural language question x

Outputs: Final answer a

Pipeline Flow

Implicit Phase: LLM generates K latent vectors (hidden states)
Training Only: Auxiliary Decoder reconstructs text step from each latent
Explicit Phase: LLM generates final answer tokens
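
The three stages above can be sketched as one function in which the auxiliary decoder is an optional argument. This is an illustrative sketch only: `run_pipeline` and the `toy_*` callables are hypothetical stand-ins, not the authors' code, and a real system would call the LLM where the toy functions are used.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Vector = List[float]

def run_pipeline(
    question: str,
    num_latents: int,
    next_latent: Callable[[str, Sequence[Vector]], Vector],
    decode_answer: Callable[[str, Sequence[Vector]], str],
    aux_decoder: Optional[Callable[[Vector], str]] = None,
) -> Tuple[str, List[str]]:
    """Run the implicit phase, optional auxiliary decoding, then the explicit phase."""
    latents: List[Vector] = []
    for _ in range(num_latents):
        # Implicit phase: each latent depends on the question and earlier latents.
        latents.append(next_latent(question, latents))

    # Training only: map every latent back to a text step for supervision.
    decoded_steps = [aux_decoder(z) for z in latents] if aux_decoder else []

    # Explicit phase: the answer depends on the question and the latents only.
    answer = decode_answer(question, latents)
    return answer, decoded_steps

# Hypothetical toy components so the sketch runs end to end.
def toy_next_latent(question, latents):
    return [float(len(latents) + 1), float(len(question))]

def toy_decode_answer(question, latents):
    return f"answer after {len(latents)} latent steps"

def toy_aux_decoder(z):
    return f"step {int(z[0])}"

train_answer, train_steps = run_pipeline(
    "2+3*4?", 3, toy_next_latent, toy_decode_answer, toy_aux_decoder
)
infer_answer, infer_steps = run_pipeline("2+3*4?", 3, toy_next_latent, toy_decode_answer)

print(train_steps)                   # ['step 1', 'step 2', 'step 3']
print(infer_steps)                   # []
print(train_answer == infer_answer)  # True
```

Because the answer path never reads the decoder's output, dropping the decoder at inference leaves the answer unchanged, which is why the summary describes it as adding no inference overhead.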

System Modules

Base LLM (Implicit Phase)

Generate a sequence of K implicit latent vectors (hidden states) representing reasoning steps

Model or implementation: GPT-2 or Llama-3 series

Auxiliary Decoder

Force the latent vector to encode semantic information by decoding it into the corresponding explicit reasoning text

Model or implementation: Transformer Decoder (architecturally identical to LLM)

Base LLM (Explicit Phase)

Switch back to vocabulary decoding to generate the final answer based on the implicit reasoning context

Model or implementation: GPT-2 or Llama-3 series

Novel Architectural Elements

Hybrid training architecture where a detachable auxiliary decoder provides fine-grained supervision to internal hidden states of the main LLM

Modeling

Base Model: GPT-2, Llama-3.2-1B, Llama-3.2-3B, Llama-3.1-8B

Training Method: Supervised Fine-Tuning with Auxiliary Loss

Objective Functions:

Purpose: Ensure the latent vector captures the meaning of the reasoning step.

Formally: $\mathcal{L}_{\text{step}} = -\sum_{k=1}^{K} \log p_\phi(s_k \mid z_k)$

Purpose: Ensure the model generates the correct final answer.

Formally: $\mathcal{L}_{\text{ans-lm}} = -\sum \log p_\theta(a \mid x, z_{1:K})$

Purpose: Combined training objective.

Formally: $\mathcal{L} = \mathcal{L}_{\text{step}} + \mathcal{L}_{\text{ans-lm}}$
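
As a worked example of the combined objective, the sketch below sums the step loss and the answer loss from toy token probabilities. It assumes the usual autoregressive factorization, so each step or answer likelihood is a product of per-token probabilities; the function names and the numbers are invented for illustration.

```python
import math

def step_loss(step_token_probs):
    """Step loss: negative log-likelihood of all step tokens, summed over the K steps.

    step_token_probs[k] holds the probabilities the auxiliary decoder assigned
    to the reference tokens of step s_k, given latent z_k.
    """
    return -sum(math.log(p) for probs in step_token_probs for p in probs)

def answer_loss(answer_token_probs):
    """Answer loss: negative log-likelihood of the reference answer tokens."""
    return -sum(math.log(p) for p in answer_token_probs)

# Toy probabilities (assumed values): K = 2 steps, one answer token.
step_probs = [[0.5, 0.5], [0.25]]
answer_probs = [0.5]

l_step = step_loss(step_probs)     # -(ln 0.5 + ln 0.5 + ln 0.25) = 4 ln 2
l_ans = answer_loss(answer_probs)  # -ln 0.5 = ln 2
total = l_step + l_ans             # 5 ln 2

print(f"{total:.4f}")  # 3.4657
```

The two terms are added with equal weight, matching the combined objective as stated; any weighting between them would be an extra hyperparameter that is not described here.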

Key Hyperparameters:

number_of_implicit_steps_K: Fixed in advance (e.g., up to 8)
decoder_size: Same architecture as base model (e.g., 1B decoder for 1B model)

Compute: Inference efficiency: ~2x speedup vs explicit CoT (e.g., 2.3x on GPT-2, 1.9x on Llama-1B)

Comparison to Prior Work

vs. Coconut: SIM-CoT adds step-level supervision via the auxiliary decoder, preventing collapse at higher latent counts
vs. CODI: SIM-CoT grounds each latent to a specific text step, whereas CODI aligns the reasoning trajectory only as a whole (trajectory-level distillation)
vs. Explicit CoT (SFT): SIM-CoT moves reasoning to latent space for efficiency but retains accuracy via grounding

Limitations

Requires a fixed number of implicit steps K to be defined in advance
Relies on the availability of explicit CoT data to provide the supervision targets for the auxiliary decoder
Larger auxiliary decoders (3B/8B) were found to slightly reduce accuracy compared to 1B-scale decoders in ablation studies

Reproducibility

The summarized text does not state whether code is available. Training relies on the GSM8k-Aug dataset. Learning rates and curriculum strategies are deferred to the paper's Appendix E, which was not available for this summary.

📊 Experiments & Results

Evaluation Setup

Math reasoning tasks with implicit reasoning chains

Benchmarks:

GSM8k-Aug: grade school math (in-domain)
SVAMP: math word problems (out-of-domain)
GSM-Hard: harder math problems with larger numbers (out-of-domain)
MultiArith: multi-step arithmetic (out-of-domain)

Metrics:

Accuracy
Statistical methodology: Not explicitly reported in the paper

Key Results

Main results on GPT-2 backbone showing SIM-CoT outperforming both implicit baselines and the explicit CoT baseline.

Benchmark	Metric	Baseline	This Paper	Δ
GSM8k-Aug	Accuracy	44.3 (Coconut)	52.5	+8.2
GSM8k-Aug	Accuracy	50.4 (CoT-SFT)	52.5	+2.1

Scaling results to larger LLaMA models, showing consistent improvements over the state-of-the-art implicit method CODI.

Benchmark	Metric	Baseline	This Paper	Δ
GSM8k-Aug	Accuracy	Not reported in the paper	Not reported in the paper	+3.4
GSM8k-Aug (LLaMA-3.1 8B)	Accuracy	Not reported in the paper	Not reported in the paper	+3.0

Out-of-domain generalization results (averaged across OOD datasets).

Benchmark	Metric	Baseline	This Paper	Δ
OOD Average (SVAMP, GSM-Hard, MultiArith)	Accuracy	Not reported in the paper	Not reported in the paper	+4.3
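
The two GPT-2 rows are the only ones that list both scores, so their deltas can be checked directly. A small sketch, where the baseline labels (Coconut, CoT-SFT) are inferred from the Evaluation Highlights rather than stated in the table:

```python
# (baseline label, baseline accuracy, SIM-CoT accuracy, reported delta)
rows = [
    ("Coconut", 44.3, 52.5, 8.2),
    ("CoT-SFT", 50.4, 52.5, 2.1),
]

deltas = {}
for name, baseline, ours, reported in rows:
    # Round to one decimal to avoid floating-point noise in the subtraction.
    deltas[name] = round(ours - baseline, 1)
    print(f"{name}: {deltas[name]:+.1f} (reported {reported:+.1f})")
```

Both computed deltas agree with the reported values; the LLaMA and OOD rows cannot be checked this way because their underlying scores are not listed.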

Experiment Figures

Analysis of Latent Instability. (a) Accuracy vs Latent Count showing collapse at 5 latents. (b) Information loss in failed models. (c) Distance metrics showing latents collapsing together. (d) Visualization of homogeneous latents.

Ablation on the number of implicit tokens/latents on GPT-2.

Main Takeaways

SIM-CoT effectively solves the 'latent instability' issue, allowing implicit CoT models to scale to more latent steps (8-16) without collapsing.
The method consistently outperforms both explicit CoT (SFT) and prior implicit methods (Coconut, CODI) across different model scales (GPT-2 to Llama-8B).
The approach provides roughly a 2x inference speedup over explicit CoT (2.3x on GPT-2, 1.9x on Llama-1B); on GPT-2 it does so while achieving higher accuracy.
The auxiliary decoder enables interpretability by allowing researchers to visualize what each latent step is reasoning about.

📚 Prerequisite Knowledge

Prerequisites

Chain-of-Thought (CoT) prompting
Transformer architecture (Hidden states)
Teacher Forcing
Knowledge Distillation (for CODI baseline)

Key Terms

Implicit Chain-of-Thought: A reasoning method where the model generates continuous vector representations (latents) instead of text tokens for intermediate steps to save compute

Latent Instability: A phenomenon where increasing the number of implicit reasoning steps leads to training collapse and loss of semantic diversity in the latent vectors

Homogeneity: When latent vectors in a sequence become too similar to each other, failing to encode distinct steps of reasoning

Auxiliary Decoder: A secondary transformer decoder used only during training to translate latent vectors back into text for supervision

Step-level Supervision: Training signal that aligns specific latent states with specific textual reasoning steps, rather than just checking the final answer

Coconut: A baseline implicit CoT method that uses curriculum learning to gradually replace explicit steps with implicit latents

CODI: A baseline implicit CoT method that uses trajectory-level distillation from an explicit CoT teacher
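
The curriculum mentioned in the Coconut entry can be pictured as a staged replacement of explicit steps by latent placeholders. The sketch below is a simplified reading under stated assumptions: one placeholder per replaced step, a hypothetical `<latent>` marker, and an invented helper name `curriculum_view`. The actual schedule (latents per step, epochs per stage) is not given in this summary.

```python
def curriculum_view(steps, stage, latent_token="<latent>"):
    """Return the training target at a given stage.

    At stage s, the first s explicit reasoning steps are replaced by latent
    placeholders; the remaining steps stay as text.
    """
    replaced = min(stage, len(steps))
    return [latent_token] * replaced + list(steps[replaced:])

# Toy reasoning chain, invented for illustration.
steps = ["20 * 0.3 = 6", "20 - 6 = 14", "14 * 2 = 28"]

for stage in range(len(steps) + 1):
    print(stage, curriculum_view(steps, stage))
```

At stage 0 the target is the fully explicit chain, and at the final stage every step has become a latent, which is the regime where the instability discussed in this summary appears.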