Large Language Models are Zero-Shot Reasoners

📝 Paper Summary

Prompt Engineering, Chain of Thought Reasoning, Zero-Shot Learning

Large language models can perform complex multi-step reasoning zero-shot by simply adding the prompt 'Let's think step by step' before the answer, without needing task-specific examples.

Core Problem

Large language models (LLMs) struggle with multi-step reasoning tasks (like arithmetic) in standard zero-shot settings, typically requiring carefully crafted few-shot examples to elicit reasoning.

Why it matters:

Creating task-specific few-shot examples requires manual engineering and expertise, limiting the broad applicability of LLMs.
The assumption that LLMs require few-shot examples to reason obscures their fundamental zero-shot capabilities.
Scaling up model size alone often fails to improve performance on system-2 tasks (slow, multi-step reasoning) without specific prompting techniques.

Concrete Example: Given a math problem asking for the number of blue golf balls, a standard zero-shot model answers immediately and gets the number wrong. Zero-shot-CoT instead prompts the model to articulate intermediate steps ('There are 16 balls... Half are golf balls...'), which leads to the correct answer.

Key Novelty

Zero-shot Chain of Thought (Zero-shot-CoT)

Proposes a single, task-agnostic prompt ('Let's think step by step') to trigger multi-step reasoning in LLMs without any examples.
Uses a two-stage prompting pipeline: first to generate the reasoning path, and second to extract the final concise answer from that reasoning.

Architecture

The full pipeline of Zero-shot-CoT compared to standard prompting.

Evaluation Highlights

Increases accuracy on MultiArith from 17.7% (standard zero-shot) to 78.7% (Zero-shot-CoT) using text-davinci-002.
Improves GSM8K performance from 10.4% to 40.7% with text-davinci-002, significantly closing the gap with few-shot methods.
Outperforms standard Few-shot prompting (without CoT) on MultiArith even when the few-shot baseline uses 8 examples (78.7% vs 33.8%).

Breakthrough Assessment

9/10

A seminal paper that changed how researchers interact with LLMs, showing that multi-step reasoning can be elicited zero-shot by a simple phrase rather than requiring complex few-shot engineering.

⚙️ Technical Details

Problem Definition

Setting: Zero-shot reasoning on arithmetic, symbolic, and logical tasks using pre-trained Large Language Models.

Inputs: A natural language question x (e.g., a math word problem).

Outputs: A reasoning path z followed by a final answer prediction y.

Pipeline Flow

Stage 1: Reasoning Extraction (Input Question + Trigger Prompt -> Reasoning Text)
Stage 2: Answer Extraction (Input Question + Reasoning Text + Answer Prompt -> Final Answer)
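
The two stages above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: `complete` is a hypothetical prompt-to-text function standing in for any LLM client, the `Q:`/`A:` layout and the answer-prompt wording are assumptions, and `fake_llm` returns canned text (with an illustrative question) so the sketch runs offline.

```python
from typing import Callable, Tuple

# The reasoning trigger is quoted in the summary above. The answer-prompt
# wording and the "Q:/A:" layout are assumptions made for this sketch.
REASONING_TRIGGER = "Let's think step by step."
ANSWER_TRIGGER = "Therefore, the answer (arabic numerals) is"


def zero_shot_cot(question: str, complete: Callable[[str], str]) -> Tuple[str, str]:
    """Run both prompting stages and return (reasoning, answer).

    `complete` is a hypothetical function that maps a prompt string to the
    model's continuation; swap in whatever LLM client you use.
    """
    # Stage 1: input question + trigger prompt -> reasoning text
    stage1_prompt = f"Q: {question}\nA: {REASONING_TRIGGER}"
    reasoning = complete(stage1_prompt).strip()

    # Stage 2: input question + reasoning text + answer prompt -> final answer
    stage2_prompt = f"{stage1_prompt} {reasoning}\n{ANSWER_TRIGGER}"
    answer = complete(stage2_prompt).strip()
    return reasoning, answer


def fake_llm(prompt: str) -> str:
    """Canned stand-in for a real model so the sketch runs without an API."""
    if prompt.endswith(ANSWER_TRIGGER):
        return " 4."
    return (" There are 16 balls. Half are golf balls, so 8 golf balls."
            " Half of those are blue, so 4 blue golf balls.")


if __name__ == "__main__":
    # Illustrative question, paraphrased for this sketch.
    question = ("A juggler has 16 balls. Half are golf balls, and half of the "
                "golf balls are blue. How many blue golf balls are there?")
    reasoning, answer = zero_shot_cot(question, fake_llm)
    print(reasoning)
    print(answer)
```

Note that the same model is called twice: the second prompt simply appends the generated reasoning and a short answer prompt to the first, so no extra model or training is involved.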

System Modules

Reasoning Extraction

Elicit a step-by-step reasoning path from the LLM.

Model or implementation: LLM (e.g., text-davinci-002, PaLM)

Answer Extraction

Extract the specific final answer (number or option) from the generated reasoning.

Model or implementation: LLM (same model as above)

Novel Architectural Elements

Two-stage zero-shot prompting pipeline where the prompt itself ('Let's think step by step') acts as the reasoning trigger without in-context examples.

Modeling

Base Model: InstructGPT (text-davinci-002), PaLM (540B), Original GPT-3 (davinci)

Compute: Inference only. No training performed.

Comparison to Prior Work

vs. Few-shot-CoT: Does not require task-specific manual engineering of reasoning examples; uses a single fixed trigger.
vs. Standard Zero-shot: Elicits multi-step reasoning, significantly improving performance on math and logic tasks.
vs. Standard Few-shot: Outperforms even 8-shot standard prompting on complex tasks without using any examples.

Limitations

Underperforms Few-shot-CoT, which uses hand-crafted, task-specific examples.
Performance gains are minimal or non-existent on smaller models (<100B parameters).
Does not improve commonsense reasoning tasks (e.g., CommonsenseQA) as dramatically as arithmetic tasks.
Sensitive to the exact phrasing of the trigger prompt (e.g., 'Let's think' performs worse than 'Let's think step by step').

Reproducibility

Code: https://github.com/kojima-takeshi188/zero_shot_cot

📊 Experiments & Results

Evaluation Setup

Zero-shot evaluation across 12 datasets covering arithmetic, symbolic, commonsense, and logical reasoning.

Benchmarks:

MultiArith (Arithmetic Reasoning)
GSM8K (Arithmetic Reasoning)
AQUA-RAT (Arithmetic Reasoning)
SVAMP (Arithmetic Reasoning)
CommonsenseQA (Commonsense Reasoning)
StrategyQA (Commonsense Reasoning)
Last Letter Concatenation (Symbolic Reasoning)
Coin Flip (Symbolic Reasoning)

Metrics:

Accuracy
Statistical methodology: Not explicitly reported in the paper
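
As a rough illustration of how accuracy could be scored on the arithmetic benchmarks, the sketch below pulls a number out of each free-text model answer and compares it with the gold value. Taking the first number in the text is an assumption made for this sketch, as are the helper names `first_number` and `accuracy`; the summary does not specify the paper's exact post-processing.

```python
import re
from typing import Optional, Sequence


def first_number(text: str) -> Optional[float]:
    """Return the first number in `text`, or None if there is none.

    Thousands separators are dropped first, so "1,200" parses as 1200.
    Taking the *first* number is an assumption of this sketch.
    """
    match = re.search(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
    return float(match.group()) if match else None


def accuracy(predictions: Sequence[str], gold: Sequence[float]) -> float:
    """Percentage of predictions whose extracted number equals the gold answer."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    if not gold:
        raise ValueError("cannot compute accuracy on an empty dataset")
    correct = sum(
        1 for pred, target in zip(predictions, gold)
        if first_number(pred) == target
    )
    return 100.0 * correct / len(gold)


if __name__ == "__main__":
    preds = [" 4.", "The answer is 8", "I am not sure"]
    print(accuracy(preds, [4, 7, 1]))  # one of three predictions matches
```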

Key Results

Benchmark	Metric	Baseline (%)	This Paper (%)	Δ (points)
Main comparison on Arithmetic Reasoning tasks using InstructGPT (text-davinci-002); the baseline is standard zero-shot prompting.
MultiArith	Accuracy	17.7	78.7	+61.0
GSM8K	Accuracy	10.4	40.7	+30.3
AQUA-RAT	Accuracy	22.4	33.5	+11.1
SVAMP	Accuracy	58.8	62.1	+3.3
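Comparison against Few-shot baselines (Wei et al. 2022) on MultiArith; the Baseline column is standard Few-shot (8 examples) in the first row and Few-shot-CoT in the second.
MultiArith (vs. Few-shot, 8 examples)	Accuracy	33.8	78.7	+44.9
MultiArith (vs. Few-shot-CoT)	Accuracy	93.0	78.7	-14.3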
Results on PaLM 540B, showing that the gains carry over to a second large model.
MultiArith	Accuracy	25.5	66.1	+40.6
GSM8K	Accuracy	12.5	43.0	+30.5
Symbolic Reasoning Improvements.
Last Letter (4 words)	Accuracy	0.2	57.6	+57.4
Coin Flip (4 times)	Accuracy	12.8	91.4	+78.6

Main Takeaways

LLMs are decent zero-shot reasoners; the CoT capability is untapped in standard zero-shot prompting.
The single prompt 'Let's think step by step' is versatile and effective across diverse arithmetic, symbolic, and logical tasks.
Model scale matters strongly for Zero-shot-CoT: benefits are negligible for small models (e.g., GPT-3 Ada/Babbage) but enormous for large models (Davinci, PaLM 540B).
While Zero-shot-CoT underperforms Few-shot-CoT, it serves as a strong, minimal baseline that requires no sample engineering.
Qualitative error analysis shows Zero-shot-CoT can produce logical reasoning even when the final answer is incorrect, or conversely, get the right answer for the wrong reasons.

📚 Prerequisite Knowledge

Prerequisites

Understanding of Large Language Models (LLMs) and prompting
Difference between zero-shot and few-shot learning
Concept of Chain of Thought (CoT) reasoning

Key Terms

CoT: Chain of Thought, a prompting technique in which the model is encouraged to generate intermediate reasoning steps before the final answer.

Zero-shot-CoT: The paper's proposed method; eliciting chain of thought reasoning using only a template (e.g., 'Let's think step by step') without example QA pairs.

Few-shot-CoT: Prior method (Wei et al., 2022) requiring manual creation of question-answer pairs with reasoning steps to guide the model.

System-1 vs System-2: Cognitive science distinction; System-1 is fast/intuitive (standard prompting), System-2 is slow/analytical (CoT prompting).

Greedy decoding: A generation strategy where the model selects the highest probability token at each step, making outputs deterministic.

InstructGPT: OpenAI's GPT-3 models fine-tuned with human feedback to follow instructions (e.g., text-davinci-002).

PaLM: A 540-billion parameter dense language model developed by Google.

Self-consistency: A decoding strategy (Wang et al., 2022) that samples multiple reasoning paths and takes a majority vote for the final answer.

Answer extraction: A secondary prompting step used to parse the final numerical or multiple-choice answer from the model's free-text reasoning.
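
Two of the decoding terms above, greedy decoding and self-consistency, can be contrasted with a toy sketch. Everything here is hypothetical and for illustration only: `TOY_MODEL` is a hand-written next-token table rather than a real language model, and `majority_vote` assumes the final answers have already been extracted from several sampled reasoning paths.

```python
from collections import Counter
from typing import Dict, List, Optional

# Hypothetical toy "model": next-token probabilities keyed on the last token.
TOY_MODEL: Dict[str, Dict[str, float]] = {
    "Half": {"of": 0.7, "the": 0.3},
    "of": {"16": 0.6, "8": 0.4},
    "16": {"is": 0.9, "<eos>": 0.1},
    "is": {"8": 0.8, "4": 0.2},
    "8": {"<eos>": 0.95, "is": 0.05},
}


def greedy_decode(prompt: List[str], max_new_tokens: int = 10,
                  eos: str = "<eos>") -> List[str]:
    """Always pick the highest-probability next token, so output is deterministic."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        last = tokens[-1] if tokens else ""
        # Unknown contexts end the sequence in this toy setup.
        probs = TOY_MODEL.get(last, {eos: 1.0})
        best = max(probs, key=probs.get)
        if best == eos:
            break
        tokens.append(best)
    return tokens[len(prompt):]


def majority_vote(answers: List[Optional[str]]) -> Optional[str]:
    """Self-consistency style aggregation: the most frequent answer wins.

    `None` entries (paths with no parsable answer) are ignored. Ties go to
    the answer that appeared first.
    """
    votes = Counter(a for a in answers if a is not None)
    if not votes:
        return None
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    print(greedy_decode(["Half"]))
    # Answers extracted from five hypothetical sampled reasoning paths.
    print(majority_vote(["4", "8", "4", None, "4"]))
```

Greedy decoding yields a single path per question, whereas self-consistency trades extra sampling cost for robustness against any one faulty reasoning path.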