Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

📝 Paper Summary

Prompt Engineering, Reasoning

Chain-of-thought prompting enables large language models to solve complex reasoning tasks by generating intermediate natural language steps before the final answer, without any model parameter updates.

Core Problem

Standard few-shot prompting works poorly on tasks requiring multi-step reasoning (arithmetic, commonsense, symbolic), and scaling model size alone does not sufficiently solve these problems.

Why it matters:

Scaling model size alone has shown diminishing returns or flat scaling curves on complex reasoning tasks
Fine-tuning models for reasoning requires large, expensive datasets of rationales
Standard prompting (Input -> Answer) forces models to perform complex computations in a single pass, often leading to errors

Concrete Example: Question: 'The cafeteria had 23 apples... used 20... bought 6 more. How many...?' Standard prompting outputs '27' (incorrect). Chain-of-thought prompting outputs 'The cafeteria had 23 apples... used 20... had 23-20=3... bought 6... 3+6=9. The answer is 9.' (correct).
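
The arithmetic in the chain of thought above can be checked step by step. The short sketch below mirrors the two intermediate steps; the variable names are illustrative and do not come from the paper.

```python
# Step-by-step check of the cafeteria example (variable names are illustrative).
apples_start = 23
apples_used = 20
apples_bought = 6

remaining = apples_start - apples_used   # first intermediate step: 23 - 20
total = remaining + apples_bought        # second intermediate step: 3 + 6

print(f"The answer is {total}.")
```

Skipping the subtraction and adding 6 to some other quantity is the kind of single-pass error that the intermediate steps are meant to prevent.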

Key Novelty

Chain-of-Thought Prompting

Augment few-shot exemplars with a series of intermediate natural language reasoning steps (a 'chain of thought') that leads to the final answer
Elicit the same step-by-step reasoning behavior in the model's output simply by showing it examples, rather than fine-tuning
Demonstrate that this ability is an emergent property of model scale, only appearing in sufficiently large language models (~100B+ parameters)

Architecture

Comparison of Standard Prompting vs. Chain-of-Thought Prompting inputs and outputs.

Evaluation Highlights

PaLM 540B with chain-of-thought prompting achieves 58% solve rate on GSM8K, surpassing the prior supervised state-of-the-art of 55%
Chain-of-thought prompting enables near-perfect solve rates (roughly 90% to 100%) on symbolic tasks (Coin Flip, Last Letter Concatenation), including settings where standard prompting fails
Outperforms standard prompting on StrategyQA (75.6% vs. accuracy in the 60s) and Sports Understanding (95.4% vs. accuracy in the 80s) using PaLM 540B

Breakthrough Assessment

10/10

This is a seminal paper that introduced 'chain-of-thought' prompting, fundamentally changing how LLMs are used for reasoning. It demonstrated emergent abilities and unlocked performance previously thought impossible without fine-tuning.

⚙️ Technical Details

Problem Definition

Setting: Few-shot prompting for reasoning tasks

Inputs: A prompt consisting of triplets <input, chain of thought, output> as exemplars, followed by a test input

Outputs: A generated completion containing a chain of thought followed by the final answer

Pipeline Flow

Prompt Construction (Select exemplars with reasoning steps)
Inference (Model generates reasoning + answer for new input)

System Modules

Prompt Constructor

Concatenates k exemplars (input, chain-of-thought, output) and the test input into a single text prompt

Model or implementation: Manual annotation (8 exemplars used for most tasks)
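
The prompt constructor described above could be sketched as follows. This is a minimal illustration, not the authors' code: the `Exemplar` class, the `Q:`/`A:` labels, the helper names, and the bookshelf exemplar are all assumptions made for the example. The `use_cot` flag switches between chain-of-thought exemplars and standard input-output exemplars.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Exemplar:
    question: str
    chain_of_thought: str  # intermediate reasoning steps written by an annotator
    answer: str            # final answer only


def format_exemplar(ex: Exemplar, use_cot: bool) -> str:
    """Render one exemplar, with or without its reasoning steps."""
    if use_cot:
        target = f"{ex.chain_of_thought} The answer is {ex.answer}."
    else:
        target = f"The answer is {ex.answer}."
    return f"Q: {ex.question}\nA: {target}"


def build_prompt(exemplars: List[Exemplar], test_question: str, use_cot: bool = True) -> str:
    """Concatenate k exemplars and the test input into a single text prompt."""
    blocks = [format_exemplar(ex, use_cot) for ex in exemplars]
    blocks.append(f"Q: {test_question}\nA:")  # the model continues from here
    return "\n\n".join(blocks)


# Hypothetical exemplar and test question, invented for illustration.
exemplars = [
    Exemplar(
        question="A shelf holds 5 books. 2 more are added. How many books are on the shelf?",
        chain_of_thought="The shelf starts with 5 books. 5 + 2 = 7.",
        answer="7",
    )
]
test_question = "A box has 4 pens. 3 are removed. How many pens are left?"

cot_prompt = build_prompt(exemplars, test_question, use_cot=True)
standard_prompt = build_prompt(exemplars, test_question, use_cot=False)
print(cot_prompt)
```

The only difference between the two prompts is the exemplar target text, which matches the summary's point that the method changes the input rather than the model.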

Language Model

Generates the completion (chain of thought and answer) based on the prompt

Model or implementation: Various (PaLM 540B, GPT-3 175B, etc.)
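
Because the completion contains both the reasoning and the final answer, scoring needs a step that pulls the answer out. The sketch below assumes completions end with the phrase "The answer is ..." as in the concrete example earlier; the regular expression and the function name `extract_answer` are illustrative choices, not the paper's evaluation code.

```python
import re
from typing import Optional

# Assumed convention: the completion ends with "The answer is <value>."
ANSWER_PATTERN = re.compile(r"The answer is\s+(.+?)\.?\s*$", re.IGNORECASE)


def extract_answer(completion: str) -> Optional[str]:
    """Return the final answer string, or None if the phrase is missing."""
    match = ANSWER_PATTERN.search(completion.strip())
    return match.group(1).strip() if match else None


completion = "The cafeteria had 23 apples... 23-20=3... 3+6=9. The answer is 9."
print(extract_answer(completion))
```

Returning `None` for completions without the expected phrase lets the caller count them as incorrect instead of guessing.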

Novel Architectural Elements

Prompt-based modification: The novelty is in the structure of the input data (including intermediate steps in exemplars) rather than the model architecture itself

Modeling

Base Model: PaLM (8B, 62B, 540B), LaMDA (422M, 2B, 8B, 68B, 137B), GPT-3 (various sizes up to 175B/text-davinci-002), UL2 20B, Codex

Compute: Inference only; no training or fine-tuning performed for this paper

Comparison to Prior Work

vs. Ling et al.: Uses few-shot prompting on frozen large models instead of training from scratch
vs. Cobbe et al.: Achieves comparable or better results with just 8 examples vs. fine-tuning on thousands

📊 Experiments & Results

Evaluation Setup

Few-shot prompting on frozen language models across arithmetic, commonsense, and symbolic reasoning tasks

Benchmarks:

GSM8K (Math word problems)
SVAMP (Math word problems with varying structures)
ASDiv (Diverse math word problems)
AQuA (Algebraic word problems, multiple choice)
MAWPS (Math word problems)
CSQA (Commonsense QA)
StrategyQA (Multi-hop strategy inference)
Sports Understanding (Domain-specific plausibility, from BIG-bench)
SayCan (Robot instruction mapping)

Metrics:

Solve rate (%)
Accuracy
Statistical methodology: Reported average over 5 random seeds for LaMDA; single seed for other models due to compute costs
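
The metrics above could be computed along these lines. This is a sketch under stated assumptions: answers are compared by exact string match after stripping whitespace, and the per-seed predictions are toy values invented for illustration, not results from the paper.

```python
from statistics import mean
from typing import Sequence


def solve_rate(predictions: Sequence[str], references: Sequence[str]) -> float:
    """Percentage of problems whose predicted final answer matches the reference."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("cannot compute a solve rate on an empty set")
    correct = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return 100.0 * correct / len(references)


# Hypothetical toy data: 4 problems, 3 seeds (not data from the paper).
references = ["9", "7", "12", "4"]
predictions_by_seed = [
    ["9", "7", "12", "5"],
    ["9", "7", "11", "4"],
    ["9", "7", "12", "4"],
]

per_seed = [solve_rate(preds, references) for preds in predictions_by_seed]
average = mean(per_seed)  # the figure reported when several seeds are run
print(per_seed, round(average, 1))
```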

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Math word problem results showing PaLM 540B's performance with chain-of-thought prompting.
GSM8K	Solve rate (%)	55	58	+3
SVAMP	Solve rate (%)	57.4	79.0	+21.6
MAWPS	Solve rate (%)	88.3	93.3	+5.0
Commonsense reasoning results comparing Chain-of-Thought against standard prompting.
StrategyQA	Accuracy (%)	69.4	75.6	+6.2
Sports Understanding	Accuracy (%)	53	95.4	+42.4
Symbolic reasoning results demonstrating out-of-distribution (length) generalization.
Coin Flip (4 flips)	Solve rate (%)	0	96	+96

Experiment Figures

Scaling curves for GSM8K, SVAMP, and MAWPS across LaMDA, GPT-3, and PaLM models.

Ablation study comparing Chain-of-Thought to Equation only, Variable compute, and Reasoning after answer.
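
To make the ablation variants concrete, the sketch below builds the exemplar target text for each one. It assumes the variable-compute variant emits one dot per character of the equation and that every variant keeps the same final-answer phrase; the function name, dictionary keys, and bookshelf example are hypothetical.

```python
from typing import Dict


def ablation_targets(chain_of_thought: str, equation: str, answer: str) -> Dict[str, str]:
    """Build the exemplar target text for each prompting variant."""
    final = f"The answer is {answer}."
    return {
        # Full method: natural language reasoning, then the answer.
        "chain_of_thought": f"{chain_of_thought} {final}",
        # Only the equation precedes the answer.
        "equation_only": f"{equation} {final}",
        # Dots stand in for extra tokens without carrying any reasoning.
        "variable_compute": f"{'.' * len(equation)} {final}",
        # The answer comes first, so the reasoning cannot influence it.
        "reasoning_after_answer": f"{final} {chain_of_thought}",
    }


# Hypothetical exemplar, invented for illustration.
targets = ablation_targets(
    chain_of_thought="The shelf starts with 5 books. 5 + 2 = 7.",
    equation="5 + 2 = 7",
    answer="7",
)
for name, text in targets.items():
    print(f"{name}: {text}")
```

Comparing these variants separates the effect of the reasoning text itself from the effect of spending more tokens before the answer.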

Main Takeaways

Emergent Ability: Chain-of-thought prompting yields performance gains only in models with roughly 100B or more parameters; smaller models tend to produce fluent but illogical chains of thought.
Complexity Correlation: The performance gain from chain-of-thought prompting is larger for more complicated problems (e.g., multi-step math vs. single-step).
Robustness: The method is robust to different annotators, exemplar permutations, and independent prompt sets.
Generalization: Facilitates length generalization in symbolic tasks (e.g., exemplars contain 2-step coin flips while test inputs contain 4-step coin flips), where standard prompting fails. No training is involved; the shorter examples appear only in the prompt.
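
The two symbolic tasks have simple ground-truth rules, which makes the length-generalization setup easy to sketch. The code below is an illustrative reconstruction: the question wording, the name list, and the function names are assumptions, and only the task rules follow the descriptions in this summary.

```python
import random
from typing import List, Tuple


def last_letter_concat(name: str) -> str:
    """Concatenate the last letter of each word in the input."""
    return "".join(word[-1] for word in name.split())


def coin_flip_answer(flips: List[bool]) -> str:
    """The coin starts heads up; each True entry turns it over."""
    heads_up = True
    for flipped in flips:
        if flipped:
            heads_up = not heads_up
    return "yes" if heads_up else "no"


def make_coin_flip(num_people: int, rng: random.Random) -> Tuple[str, str]:
    """Generate one coin-flip question with the given number of steps."""
    names = ["Ana", "Ben", "Cy", "Dee"]  # hypothetical names
    flips = [rng.random() < 0.5 for _ in range(num_people)]
    parts = [
        f"{names[i % len(names)]} {'flips' if f else 'does not flip'} the coin."
        for i, f in enumerate(flips)
    ]
    question = "A coin is heads up. " + " ".join(parts) + " Is the coin still heads up?"
    return question, coin_flip_answer(flips)


rng = random.Random(0)
exemplar_q, exemplar_a = make_coin_flip(2, rng)  # length shown in the prompt
test_q, test_a = make_coin_flip(4, rng)          # longer, out-of-distribution length
print(test_q, "->", test_a)
```

Exemplars are drawn at the shorter length and test inputs at the longer one, so a correct answer requires applying the rule more times than any exemplar demonstrates.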

📚 Prerequisite Knowledge

Prerequisites

Few-shot prompting / In-context learning
Transformer language models
Basic arithmetic and logic problems

Key Terms

Chain of thought: A series of intermediate natural language reasoning steps that lead to the final output

Few-shot prompting: Providing a language model with a few input-output examples in the context window to guide its behavior for a new task

Emergent ability: A capability that is not present in smaller models but appears suddenly as the model scale increases past a certain threshold

GSM8K: Grade School Math 8K—a benchmark of high-quality grade school math word problems

Greedy decoding: A generation strategy where the model selects the highest probability token at each step

PaLM: Pathways Language Model—a large dense language model developed by Google with up to 540 billion parameters

LaMDA: Language Models for Dialog Applications—a family of Transformer-based models specialized for dialog

Codex: A GPT-3 variant fine-tuned on code, capable of code generation and reasoning

Standard prompting: The traditional few-shot method where exemplars consist only of input-output pairs without intermediate steps

OOD: Out-of-Distribution—evaluating on data different from the training or exemplar distribution (e.g., longer sequences than seen in examples)
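
Greedy decoding, as defined in the key terms above, can be illustrated with a toy next-token table. Everything in this sketch is hypothetical: the table stands in for a real language model, and the `<eos>` marker and function names are assumptions made for the example.

```python
from typing import Callable, Dict, List


def greedy_decode(
    next_token_probs: Callable[[List[str]], Dict[str, float]],
    prompt: List[str],
    eos: str = "<eos>",
    max_new_tokens: int = 20,
) -> List[str]:
    """Pick the highest-probability token at every step until eos or the limit."""
    tokens = list(prompt)
    generated: List[str] = []
    for _ in range(max_new_tokens):
        probs = next_token_probs(tokens)
        best = max(probs, key=probs.get)  # argmax over the candidate tokens
        if best == eos:
            break
        tokens.append(best)
        generated.append(best)
    return generated


# Hypothetical lookup table keyed on the most recent token.
TABLE: Dict[str, Dict[str, float]] = {
    "A:": {"3": 0.2, "The": 0.8},
    "The": {"answer": 0.9, "cat": 0.1},
    "answer": {"is": 1.0},
    "is": {"9": 0.7, "8": 0.3},
    "9": {"<eos>": 0.6, ".": 0.4},
}


def toy_model(tokens: List[str]) -> Dict[str, float]:
    """Stand-in for a language model; unknown contexts end the sequence."""
    return TABLE.get(tokens[-1], {"<eos>": 1.0})


output = greedy_decode(toy_model, ["A:"])
print(" ".join(output))
```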