C3oT: Generating Shorter Chain-of-Thought without Compromising Effectiveness

📝 Paper Summary

Chain-of-Thought (CoT) Reasoning, Inference Acceleration, LLM Compression

C3oT trains LLMs on both long and compressed Chain-of-Thought (CoT) sequences using conditional tokens, enabling the model to generate concise reasoning during inference without losing accuracy.

Core Problem

Standard Chain-of-Thought (CoT) significantly increases decoding costs and latency because the reasoning steps are often much longer than the final answer.

Why it matters:

High inference costs hinder LLM deployment in latency-sensitive applications like search and recommendation.
Simply shortening CoT steps typically degrades reasoning performance, creating a trade-off between speed and accuracy.
Existing acceleration methods (like Implicit-CoT) often sacrifice too much performance compared to explicit reasoning.

Concrete Example: In a math problem asking for total clips sold, a standard CoT might output 'Natalia sold 48+24 = <<48+24=72>>72 clips altogether...' (long). C3oT aims to output just 'She sold 72 clips in April and May.' (short) while retaining the accuracy derived from the longer reasoning path.

Key Novelty

Conditioned Compressed Chain-of-Thought (C3oT)

Uses a 'Compressor' (GPT-4) to distill long, detailed CoT into a concise version retaining only key information.
Trains the student LLM on both long and short CoT simultaneously, distinguishing them with specific prompt tokens (Conditioned Training).
During inference, triggers the 'short CoT' mode via the specific prompt token, allowing the model to access reasoning capabilities learned from long CoT while outputting few tokens.

Architecture

Overview of the C3oT framework including the Compressor, Conditioned Training, and Conditioned Inference phases.

Evaluation Highlights

Compresses generated CoT length by up to 57.6% on GSM8K while maintaining accuracy comparable to models trained on full-length CoT.
Outperforms Implicit-CoT by a substantial margin (e.g., +8.6 accuracy points on GSM8K, 40.3 vs. 31.7) while using slightly more tokens.
Achieves performance on par with standard long-CoT models across arithmetic (GSM8K, MathQA) and commonsense (ECQA, StrategyQA) datasets.

Breakthrough Assessment

7/10

Effectively solves the trade-off between CoT length and accuracy, a significant practical hurdle. While the core mechanics (knowledge distillation/conditioning) are known, the specific application to CoT compression is novel and effective.

⚙️ Technical Details

Problem Definition

Setting: Supervised Fine-Tuning (SFT) for reasoning tasks with variable-length rationale generation.

Inputs: Instruction x and a condition token c (indicating long or short mode).

Outputs: Rationales r (reasoning steps) and final answer y.
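The setting above can be written as a training objective. The following is a sketch under stated assumptions, not the paper's exact formulation: it assumes a dataset D in which each instruction x and answer y comes with a long rationale r_long and a compressed rationale r_short, hypothetical condition tokens c_long and c_short, and equal weighting of the two modes. Conditioned SFT would then minimize the usual negative log-likelihood over both modes:

```latex
\mathcal{L}(\theta)
  = -\sum_{(x,\, r_{\text{long}},\, r_{\text{short}},\, y) \in \mathcal{D}}
    \Big[ \log p_\theta\big(r_{\text{long}},\, y \mid x,\, c_{\text{long}}\big)
        + \log p_\theta\big(r_{\text{short}},\, y \mid x,\, c_{\text{short}}\big) \Big]
```

At inference time only c_short is supplied, so the model decodes the short rationale and the answer from the same parameters that were also fit to the long rationales.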

Pipeline Flow

Compressor (GPT-4) → [Long CoT, Short CoT] Pairs
Conditioned Training (LLM learns both modes)
Conditioned Inference (LLM generates Short CoT)
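The three stages above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the condition markers `[LONG]` and `[SHORT]`, the `Example` type, the `Answer:` target format, and the `toy_compress` stand-in for the GPT-4 Compressor are all hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical condition markers; the paper's actual prompt tokens are not reproduced here.
LONG_TOKEN = "[LONG]"
SHORT_TOKEN = "[SHORT]"


@dataclass
class Example:
    instruction: str
    long_cot: str
    answer: str


def build_conditioned_pairs(
    examples: List[Example], compress: Callable[[str], str]
) -> List[Tuple[str, str]]:
    """Return (prompt, target) SFT pairs: one long-mode and one short-mode pair per example."""
    pairs = []
    for ex in examples:
        short_cot = compress(ex.long_cot)  # stage 1: Compressor produces the short CoT
        # Stage 2: the same instruction appears twice, once under each condition marker.
        pairs.append((f"{LONG_TOKEN} {ex.instruction}", f"{ex.long_cot} Answer: {ex.answer}"))
        pairs.append((f"{SHORT_TOKEN} {ex.instruction}", f"{short_cot} Answer: {ex.answer}"))
    return pairs


def inference_prompt(instruction: str) -> str:
    """Stage 3: at inference, only the short-mode marker is used."""
    return f"{SHORT_TOKEN} {instruction}"


def toy_compress(long_cot: str) -> str:
    """Stand-in for the GPT-4 Compressor (assumption): keep only the last sentence."""
    sentences = [s.strip() for s in long_cot.split(".") if s.strip()]
    return sentences[-1] + "."


if __name__ == "__main__":
    # Made-up example, not taken from any dataset.
    data = [Example(
        "How many apples does Tom have?",
        "Tom has 3 apples. He buys 2 more. Tom has 5 apples in total.",
        "5",
    )]
    for prompt, target in build_conditioned_pairs(data, toy_compress):
        print(prompt, "->", target)
    print(inference_prompt("How many apples does Tom have?"))
```

Passing the compressor in as a callable keeps the sketch independent of any particular API; in practice it would wrap a call to whichever model plays the Compressor role.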

System Modules

Compressor

Condenses the original long CoT into the shortest form that still retains the key information

Model or implementation: GPT-4

Student LLM (Training)

Learns to generate both long and short CoT based on condition prompts

Model or implementation: Not explicitly specified (generic LLM architecture implied)

Student LLM (Inference)

Generates concise reasoning and answer using the short-mode prompt

Model or implementation: Fine-tuned LLM

Novel Architectural Elements

Conditioned training framework specifically applied to CoT length, allowing a single model to learn the mapping between verbose reasoning and concise summaries.

Modeling

Base Model: Not explicitly reported in the paper

Training Method: Supervised Fine-Tuning (SFT) with Conditioned Prompts

Adaptation: Full fine-tuning (implied by context of SFT)

Trainable Parameters: Not reported in the paper

Training Data:

Arithmetic: GSM8K, MathQA
Commonsense: ECQA, StrategyQA
Original CoT (r_long) from datasets
Compressed CoT (r_short) generated by GPT-4

Compute: Not reported in the paper

Comparison to Prior Work

vs. Implicit-CoT: C3oT generates a compressed explicit rationale rather than no rationale tokens at all, resulting in significantly higher accuracy and better interpretability.
vs. Standard CoT: C3oT reduces token count by more than 50% in the reported results, with negligible performance loss.
vs. Direct Answer: C3oT maintains the reasoning benefits of CoT.

Limitations

Relies on a powerful closed-source model (GPT-4) for data compression/generation.
The specific base model used for experiments is not identified in the provided text.
Does not eliminate CoT latency entirely (unlike Implicit-CoT), only reduces it.

Reproducibility

Prompt templates for compression and conditioned training are provided in the Appendix. The specific base model architecture (e.g., Llama-2-7B) is not explicitly named in the text provided, though the method is model-agnostic. Code URL is not provided.

📊 Experiments & Results

Evaluation Setup

Comparison of accuracy and generated token length across arithmetic and commonsense reasoning tasks.

Benchmarks:

GSM8K (Arithmetic Reasoning)
MathQA (Arithmetic Reasoning)
ECQA (Commonsense Reasoning)
StrategyQA (Commonsense Reasoning)

Metrics:

Accuracy
Average Length (number of tokens)
Statistical methodology: Not explicitly reported in the paper
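The two metrics could be computed roughly as follows. This is a hedged sketch: exact-match scoring of the final answer and whitespace tokenization are assumptions made here for illustration, and the paper's own answer matching and tokenizer may differ. The function name `evaluate` is hypothetical.

```python
from typing import List, Tuple


def evaluate(
    predictions: List[Tuple[str, str]], gold_answers: List[str]
) -> Tuple[float, float]:
    """Return (accuracy in %, average rationale length in tokens).

    predictions: (rationale, answer) pairs produced by the model.
    Assumptions: exact string match on answers, whitespace tokenization.
    """
    if not predictions or len(predictions) != len(gold_answers):
        raise ValueError("predictions and gold_answers must be non-empty and equal in length")
    n = len(predictions)
    correct = sum(
        answer.strip() == gold.strip()
        for (_, answer), gold in zip(predictions, gold_answers)
    )
    total_tokens = sum(len(rationale.split()) for rationale, _ in predictions)
    return 100.0 * correct / n, total_tokens / n


if __name__ == "__main__":
    preds = [("She sold 72 clips in April and May.", "72"), ("a b c", "5")]
    print(evaluate(preds, ["72", "6"]))  # (50.0, 5.5)
```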

Key Results

Performance comparisons on Arithmetic datasets showing C3oT matches Long CoT accuracy while significantly reducing length.

Benchmark	Metric	Baseline	This Paper	Δ
GSM8K	Accuracy (%)	40.1	40.3	+0.2
GSM8K	Average Length (tokens)	125	53	-72
MathQA	Accuracy (%)	40.7	41.0	+0.3
MathQA	Average Length (tokens)	119	54	-65

Performance on Commonsense datasets.

Benchmark	Metric	Baseline	This Paper	Δ
ECQA	Accuracy (%)	51.1	52.0	+0.9
ECQA	Average Length (tokens)	91	45	-46

Comparison against Implicit-CoT (baseline that removes CoT entirely).

Benchmark	Metric	Baseline	This Paper	Δ
GSM8K	Accuracy (%)	31.7	40.3	+8.6
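The compression rates implied by the reported lengths can be checked with a short calculation. The sketch below uses only the Average Length values from the results above, and assumes the Baseline column in those rows is the model trained on full-length CoT; `compression_rate` is a helper name introduced here.

```python
# (baseline length, C3oT length) in tokens, copied from the reported results.
reported_lengths = {
    "GSM8K": (125, 53),
    "MathQA": (119, 54),
    "ECQA": (91, 45),
}


def compression_rate(baseline_len: float, short_len: float) -> float:
    """Percentage reduction in average CoT length relative to the baseline."""
    return 100.0 * (baseline_len - short_len) / baseline_len


rates = {
    name: round(compression_rate(base, short), 1)
    for name, (base, short) in reported_lengths.items()
}

for name, rate in rates.items():
    print(f"{name}: {rate}% shorter")
# GSM8K: 57.6% shorter
# MathQA: 54.6% shorter
# ECQA: 50.5% shorter
```

The GSM8K figure reproduces the 57.6% quoted in the Evaluation Highlights, and all three datasets land above 50%.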

Main Takeaways

C3oT successfully decouples reasoning effectiveness from CoT length: in the reported results the generated CoT is 50.5% to 57.6% shorter on GSM8K, MathQA, and ECQA with no loss in accuracy.
The method consistently outperforms Implicit-CoT, suggesting that some explicit tokens are necessary for complex reasoning, even if compressed.
The approach works across different domains (Math and Commonsense), indicating robustness.
Conditioned training effectively allows a single model to handle both verbose and concise generation modes.

📚 Prerequisite Knowledge

Prerequisites

Chain-of-Thought (CoT) prompting
Supervised Fine-Tuning (SFT)
Knowledge Distillation
Conditioned Generation

Key Terms

Chain-of-Thought (CoT): A prompting technique where the model generates intermediate reasoning steps before the final answer.

Implicit-CoT: A method that attempts to internalize reasoning into hidden states without generating explicit tokens.

SFT: Supervised Fine-Tuning—training a pre-trained model on a labeled dataset.

Conditioned Training: Training a model to generate different styles of output based on a control token or prompt added to the input.

Compressor: A model (here GPT-4) used to rewrite long reasoning chains into shorter summaries while keeping key logic.

Inference Acceleration: Techniques to reduce the computational cost or time required for a model to generate an answer.

GPT-4: A large multimodal model by OpenAI, used here as the teacher/compressor to generate the short CoTs that serve as training targets.