The CoT Collection: Improving Zero-shot and Few-shot Learning of Language Models via Chain-of-Thought Fine-Tuning

📝 Paper Summary

Chain-of-Thought Reasoning Instruction Tuning

The CoT Collection augments the Flan Collection with 1.84 million rationales across 1,060 tasks, equipping smaller language models (3B–11B) with zero-shot chain-of-thought reasoning capabilities previously limited to much larger models.

Core Problem

Small language models (<100B parameters) fail to perform chain-of-thought reasoning on unseen tasks because existing CoT instruction datasets cover too few tasks (only ~9), which leads to poor generalization.

Why it matters:

Chain-of-Thought prompting typically requires massive models (>100B params), making reasoning capabilities inaccessible due to high computational costs
Current small LMs struggle to generalize reasoning skills to novel tasks, often failing to generate rationales even when prompted
Relying on single-task CoT fine-tuning does not solve the broader problem of zero-shot generalization across diverse unseen domains

Concrete Example: When asked a complex boolean expression question, a standard small LM (like Flan-T5) directly outputs a potentially incorrect answer. In contrast, CoT-T5 trained on the CoT Collection first generates 'Let's think step by step' followed by a logical breakdown, leading to the correct result.

Key Novelty

Large-Scale Rationale Distillation for Instruction Tuning

Augments the existing Flan Collection by generating 1.84 million chain-of-thought rationales for 1,060 tasks using a large teacher model (Codex)
Demonstrates that fine-tuning small LMs on a wide diversity of reasoning tasks (1,060) is far more effective than scaling up examples on a few tasks (9)
Introduces CoT-T5, a model that learns to consistently generate step-by-step reasoning before answering, even on unseen tasks

Architecture

Conceptual comparison between the CoT Collection and previous CoT datasets

Evaluation Highlights

+4.34 points in zero-shot accuracy on Big Bench Hard (BBH) when Flan-T5 (3B) is fine-tuned on the CoT Collection, under CoT evaluation (from 34.06 to 38.40)
+13.98% improvement over ChatGPT on few-shot domain-specific tasks (legal/medical) when using CoT-T5-11B with LoRA
Outperforms T0-3B by +8.65% on the P3 benchmark by training on only 163 tasks (vs. T0's original training setup)

Breakthrough Assessment

8/10

Significantly democratizes reasoning capabilities for smaller, deployable models by providing a massive, high-quality open-source dataset and proving that task diversity drives CoT generalization.

⚙️ Technical Details

Problem Definition

Setting: Zero-shot and Few-shot learning on unseen tasks requiring multi-step reasoning

Inputs: Task instruction I and input instance x

Outputs: Rationale r (step-by-step reasoning) followed by final answer y
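
One way to write this setting down, as a sketch using the symbols above: the model generates the rationale first and then the answer conditioned on it. The parameter symbol \theta and the dataset symbol \mathcal{D} are notation introduced here for illustration, and the loss shown is the standard sequence likelihood, not a formula quoted from the paper.

```latex
\[
p_\theta(r, y \mid I, x) = p_\theta(r \mid I, x)\, p_\theta(y \mid I, x, r)
\]
\[
\mathcal{L}(\theta) = -\sum_{(I, x, r, y) \in \mathcal{D}} \log p_\theta(r, y \mid I, x)
\]
```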

Pipeline Flow

Input (Instruction + Question)
CoT-T5 Encoder-Decoder
Output (Rationale + Answer)
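
The flow above can be sketched as two small helpers around the model call: one that assembles the encoder input and one that splits the decoded text into rationale and answer. This is a minimal illustration, not the released code. The prompt layout, the placement of the trigger phrase at the end of the input, and the `[ANSWER]` delimiter are all assumptions made for this sketch.

```python
TRIGGER = "Let's think step by step"  # trigger phrase listed in this summary
ANSWER_MARKER = "[ANSWER]"  # hypothetical delimiter, chosen only for this sketch


def build_input(instruction: str, question: str, trigger: str = TRIGGER) -> str:
    """Join instruction, question, and the CoT trigger into one encoder input."""
    return f"{instruction.strip()}\n\n{question.strip()}\n\n{trigger}"


def split_output(generated: str, marker: str = ANSWER_MARKER) -> tuple[str, str]:
    """Split decoded text into (rationale, answer) at the last marker.

    If the marker is missing, the whole text is treated as the answer.
    """
    rationale, sep, answer = generated.rpartition(marker)
    if not sep:
        return "", generated.strip()
    return rationale.strip(), answer.strip()


prompt = build_input(
    "Evaluate the boolean expression.",
    "not ( True ) and ( True ) is",
)
rationale, answer = split_output(
    "not (True) is False. False and True is False. [ANSWER] False"
)
```

Splitting at the last marker keeps the parser robust when the rationale itself happens to contain the delimiter.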

System Modules

CoT-T5

Generate reasoning chain and answer

Model or implementation: Flan-T5 (3B or 11B) fine-tuned on CoT Collection

Novel Architectural Elements

The novelty is data-centric rather than architectural: the model is explicitly trained to output rationales across 1,060 diverse tasks

Modeling

Base Model: Flan-T5 (3B and 11B versions)

Training Method: Instruction Fine-tuning (Full Fine-tuning or LoRA)

Adaptation: LoRA (Rank=4) used for few-shot experiments; Full fine-tuning used for zero-shot experiments

Trainable Parameters: 2.35M (3B with LoRA), 4.72M (11B with LoRA), or all parameters under full fine-tuning
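
The LoRA parameter counts can be sanity-checked with a little arithmetic. The sketch below assumes that rank-4 adapters are attached to two projections (for example query and value) in every attention block, and that the 3B and 11B checkpoints have 24 encoder and 24 decoder layers with hidden sizes 2048 and 4096. The summary does not state which modules were adapted, so treat these choices as assumptions.

```python
def lora_param_count(d_model: int, inner_dim: int, n_enc_layers: int,
                     n_dec_layers: int, rank: int = 4,
                     targets_per_attention: int = 2) -> int:
    """Count LoRA parameters when adapters sit on attention projections.

    Each adapted projection (d_model x inner_dim) gains two matrices:
    A (rank x d_model) and B (inner_dim x rank).
    """
    per_adapter = rank * (d_model + inner_dim)
    # Encoder layers have one attention block (self-attention); decoder
    # layers have two (self-attention and cross-attention).
    attention_blocks = n_enc_layers + 2 * n_dec_layers
    return attention_blocks * targets_per_attention * per_adapter


# Assumed shapes for the 3B and 11B checkpoints.
count_3b = lora_param_count(2048, 2048, 24, 24)
count_11b = lora_param_count(4096, 4096, 24, 24)
print(count_3b)   # 2359296
print(count_11b)  # 4718592
```

Under these assumptions the totals come to about 2.36M and 4.72M, close to the figures listed above.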

Training Data:

CoT Collection: 1.84 million rationales augmented from Flan Collection (P3, SNI, Flan 2021)
1,060 tasks selected based on length constraints and public availability
Rationales generated by OpenAI Codex via few-shot prompting
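
Few-shot rationale augmentation of this kind can be sketched as prompt assembly: a handful of demonstrations that already carry rationales, followed by the new instance with its gold answer and an open rationale slot for the teacher model to complete. The field labels, the `Demo` class, and the choice to show the gold answer before the rationale are hypothetical and only illustrate the idea. No call to a teacher model is made here.

```python
from dataclasses import dataclass


@dataclass
class Demo:
    """One hand-written demonstration (hypothetical structure)."""
    instruction: str
    question: str
    answer: str
    rationale: str


def format_block(instruction: str, question: str, answer: str,
                 rationale: str = "") -> str:
    """Render one example; an empty rationale leaves the slot open."""
    text = (
        f"Instruction: {instruction}\n"
        f"Question: {question}\n"
        f"Answer: {answer}\n"
        f"Rationale: {rationale}"
    )
    return text.rstrip()


def build_teacher_prompt(demos: list[Demo], instruction: str,
                         question: str, answer: str) -> str:
    """Concatenate demonstrations and the target instance into one prompt."""
    blocks = [
        format_block(d.instruction, d.question, d.answer, d.rationale)
        for d in demos
    ]
    blocks.append(format_block(instruction, question, answer))
    return "\n\n".join(blocks)


demos = [
    Demo("Answer the question.", "Is 7 a prime number?", "yes",
         "7 has no divisors other than 1 and itself, so it is prime."),
]
teacher_prompt = build_teacher_prompt(
    demos, "Answer the question.", "Is 9 a prime number?", "no"
)
```

The teacher's completion after the final `Rationale:` would then be stored alongside the original instance as the training target.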

Key Hyperparameters:

trigger_phrase: Let's think step by step
minimum_rationale_length: 8 tokens (constraint during evaluation)
LoRA_rank: 4
LoRA_training_steps: 1000

Compute: Not reported in the paper

Comparison to Prior Work

vs. Flan-T5: CoT-T5 is fine-tuned on 1.84M rationales across 1,060 tasks, whereas Flan-T5 uses standard targets
vs. Specializing Smaller LMs: Trains on 1,060 tasks to generalize to unseen tasks rather than specializing in one math/reasoning task
vs. Vicuna: Demonstrates that academic instruction tuning (CoT Collection) outperforms chat-based tuning on reasoning benchmarks like BBH
vs. ChatGPT: Outperforms this much larger proprietary model in few-shot settings by fine-tuning small models (3B/11B) on domain-specific CoT data

Limitations

Not optimized for long-form chat or dialogue applications
Limited multilingual capabilities (base model is primarily English-centric)
Reliance on a teacher model (Codex) limits the quality of rationales to the teacher's capability
Does not address cross-lingual transfer of CoT abilities

Reproducibility

Code: https://github.com/kaistAI/CoT-Collection

The dataset (1.84M rationales) and CoT-T5 model checkpoints are released, along with source code for training and evaluation. Note: The teacher model used for augmentation (OpenAI Codex) is deprecated, but the resulting dataset is preserved.

📊 Experiments & Results

Evaluation Setup

Zero-shot evaluation on unseen benchmarks and Few-shot adaptation on domain tasks

Benchmarks:

Big Bench Hard (BBH): Challenging multi-step reasoning tasks
P3 Evaluation: Diverse NLP tasks (QA, NLI, Classification)
MGSM: Multilingual Math Reasoning
Domain Tasks (LEDGAR, Case Hold, MedNLI, PubMedQA): Legal and Medical QA/Classification

Metrics:

Accuracy
Exact Match (EM)
Statistical methodology: Not explicitly reported in the paper
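
As a rough illustration of the two metrics, the sketch below scores predictions by exact match after light normalization and reports accuracy as a percentage. The normalization rules (lowercasing, removing ASCII punctuation, collapsing whitespace) are an assumption for this example and may differ from the paper's evaluation scripts.

```python
import string


def normalize(text: str) -> str:
    """Lowercase, drop ASCII punctuation, and collapse whitespace."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())


def exact_match(prediction: str, reference: str) -> bool:
    """True when prediction and reference agree after normalization."""
    return normalize(prediction) == normalize(reference)


def accuracy(predictions: list[str], references: list[str]) -> float:
    """Percentage of predictions that exactly match their reference."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have equal length")
    if not predictions:
        return 0.0
    hits = sum(exact_match(p, r) for p, r in zip(predictions, references))
    return 100.0 * hits / len(predictions)


score = accuracy(["yes", "no", "no", "yes"], ["Yes", "no", "yes", "yes"])
```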

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Zero-shot performance on Big Bench Hard (BBH) demonstrates that CoT fine-tuning significantly improves reasoning capabilities of small models.
Big Bench Hard (BBH)	Accuracy (CoT Eval)	34.06	38.40	+4.34
Big Bench Hard (BBH)	Total Avg Accuracy	39.78	42.38	+2.60
Big Bench Hard (BBH)	Total Avg Accuracy	38.30	42.38	+4.08
Few-shot adaptation experiments on domain-specific tasks show CoT-T5 adapts better than baselines with limited data.
Domain Tasks (Avg)	Accuracy	54.98	68.96	+13.98
Domain Tasks (Avg)	Accuracy	66.59	68.96	+2.37
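
The Δ column is an absolute difference in accuracy points (this paper minus baseline), not a relative gain. The short check below recomputes it from the table's own numbers. The row labels are copied from the table, and `Decimal` is used only to avoid floating-point rounding noise.

```python
from decimal import Decimal

# (benchmark and metric, baseline, this paper), copied from the table above
rows = [
    ("BBH, Accuracy (CoT Eval)", "34.06", "38.40"),
    ("BBH, Total Avg Accuracy", "39.78", "42.38"),
    ("BBH, Total Avg Accuracy", "38.30", "42.38"),
    ("Domain Tasks (Avg), Accuracy", "54.98", "68.96"),
    ("Domain Tasks (Avg), Accuracy", "66.59", "68.96"),
]


def delta(baseline: str, ours: str) -> Decimal:
    """Absolute improvement in accuracy points."""
    return Decimal(ours) - Decimal(baseline)


deltas = [delta(b, o) for _, b, o in rows]
for (name, b, o), d in zip(rows, deltas):
    print(f"{name}: {b} -> {o} ({d:+})")
```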

Experiment Figures

Scaling plot comparing BBH performance against the number of tasks/instances used for fine-tuning

Main Takeaways

Diversity of tasks is critical: Fine-tuning on 1,060 tasks with CoT rationales yields better generalization than fine-tuning on massive instances of only 9 tasks.
CoT fine-tuning unlocks reasoning in small LMs: Models as small as 3B parameters show significant gains on hard reasoning benchmarks (BBH) where they previously struggled.
Parameter-efficient adaptation works best with CoT-T5: In few-shot settings, combining CoT-T5 with LoRA outperforms full fine-tuning of base models and ICL with large proprietary models.

📚 Prerequisite Knowledge

Prerequisites

Understanding of Transformer-based Language Models (T5 architecture)
Familiarity with Instruction Tuning and Chain-of-Thought prompting
Basic knowledge of distillation/synthetic data generation

Key Terms

CoT: Chain-of-Thought—a prompting method where the model generates intermediate reasoning steps before the final answer

Rationale: The text explanation or reasoning path generated by the model to justify its answer

Instruction Tuning: Fine-tuning language models on datasets formatted as natural language instructions (e.g., 'Translate this sentence:')

Zero-shot: Evaluating a model on a task it has not explicitly seen during training, without providing examples in the prompt

Few-shot: Evaluating or adapting a model using a small number of examples (e.g., 64) provided in the prompt or used for lightweight fine-tuning

LoRA: Low-Rank Adaptation, a parameter-efficient fine-tuning technique that freezes the pretrained weights and trains small low-rank matrices added to selected layers

Flan: Finetuned Language Net, an instruction-tuning approach and task collection; Flan-T5 is the family of T5 models instruction-tuned with it

BBH: Big Bench Hard—a challenging subset of the BIG-Bench benchmark focusing on tasks where models require multi-step reasoning