Self-Consistency Improves Chain of Thought Reasoning in Language Models

📝 Paper Summary

Chain-of-thought prompting, Decoding strategies

Self-consistency replaces greedy decoding in chain-of-thought prompting: it samples diverse reasoning paths and selects the final answer that the most paths agree on.

Core Problem

Naive greedy decoding in chain-of-thought prompting often yields suboptimal or incorrect reasoning paths because language models are not perfect reasoners.

Why it matters:

Complex reasoning tasks typically require deliberate thinking, where a single greedy path often fails to capture the correct solution
Language models may produce incorrect reasoning steps or make mistakes in a single generation, leading to wrong answers
Greedy decoding suffers from repetitiveness and local optimality, failing to explore the diversity of thought processes available to the model

Concrete Example: In a math problem about egg sales (16 eggs, 3 eaten, 4 used for baking, the rest sold), a greedy decode might follow one flawed path, for example pricing the 3 + 4 = 7 eggs used rather than the eggs remaining, and answer '$14'. Self-consistency samples multiple paths: some might repeat such an error, but if the majority correctly calculate '16 - 3 - 4 = 9' or '16 - 3 = 13; 13 - 4 = 9' and arrive at the answer '$18', the aggregated result is correct.

Key Novelty

Sample-and-Marginalize Decoding for Reasoning

Replaces the standard 'greedy decode' used in chain-of-thought prompting with a sampling strategy (e.g., high temperature) to generate diverse reasoning paths
Leverages the intuition that while there are many incorrect ways to reason, correct reasoning paths tend to lead to the same unique answer
Aggregates the final answers from the sampled paths using majority voting (marginalization) to select the most consistent solution

Architecture

Conceptual comparison between Greedy Decode and Self-Consistency method

Evaluation Highlights

+17.9 points absolute accuracy on GSM8K using PaLM-540B (56.5 → 74.4) compared to standard chain-of-thought prompting
+12.5 points absolute accuracy on AQuA using PaLM-540B (35.8 → 48.3); the +12.2 quoted in the paper's abstract corresponds to GPT-3 (code-davinci-002)
+7.6 points absolute accuracy on SVAMP using PaLM-540B (79.0 → 86.6); the +11.0 quoted in the paper's abstract corresponds to GPT-3 (code-davinci-002)

Breakthrough Assessment

9/10

A simple, unsupervised, and highly effective drop-in replacement for greedy decoding that significantly boosts reasoning performance across multiple benchmarks and model scales.

⚙️ Technical Details

Problem Definition

Setting: Few-shot reasoning tasks (arithmetic, commonsense, symbolic) where the final answer is from a fixed set

Inputs: A prompt containing few-shot chain-of-thought exemplars and a specific question

Outputs: The final answer that is most consistent across the sampled reasoning paths

Pipeline Flow

Prompt Construction (Chain-of-Thought exemplars)
Sampling (Generate diverse reasoning paths)
Parsing (Extract final answers)
Aggregation (Majority vote)
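
The four steps above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: `sample_fn` is a hypothetical callable standing in for one sampled (non-greedy) model generation, and the "The answer is X" pattern is an assumed output format that the few-shot exemplars would have to establish. The canned reasoning paths in the demo are made up for illustration.

```python
import itertools
import re
from collections import Counter
from typing import Callable, Optional

# Assumed answer format: each reasoning path ends with "The answer is X."
ANSWER_RE = re.compile(r"The answer is\s*\$?(-?[\d,]*\.?\d+)")

def parse_answer(path: str) -> Optional[str]:
    """Extract the final numeric answer from one reasoning path, or None."""
    matches = ANSWER_RE.findall(path)
    if not matches:
        return None
    return matches[-1].replace(",", "")  # drop thousands separators

def self_consistency(prompt: str,
                     sample_fn: Callable[[str], str],
                     m: int = 40) -> Optional[str]:
    """Sample m reasoning paths, parse their answers, return the majority answer."""
    paths = [sample_fn(prompt) for _ in range(m)]           # Sampling
    answers = [a for a in map(parse_answer, paths) if a]    # Parsing
    if not answers:
        return None
    return Counter(answers).most_common(1)[0][0]            # Aggregation

# Demo with a fake sampler that cycles through canned (made-up) paths.
canned = itertools.cycle([
    "She has 16 - 3 - 4 = 9 eggs left. 9 * $2 = $18. The answer is $18.",
    "She uses 3 + 4 = 7 eggs. 7 * $2 = $14. The answer is $14.",
    "16 - 3 = 13; 13 - 4 = 9. 9 * $2 = $18. The answer is $18.",
])
print(self_consistency("few-shot exemplars + question", lambda p: next(canned), m=6))
```

Answers are compared as strings here, so "18" and "18.0" would count as different votes; a real pipeline would likely normalize answers per task before voting.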

System Modules

Prompter

Prepare the input with few-shot chain-of-thought exemplars and the target question

Model or implementation: Various (UL2, GPT-3, LaMDA, PaLM)

Decoder (Sampler)

Generate m diverse reasoning paths using sampling (not greedy)

Model or implementation: Various (UL2, GPT-3, LaMDA, PaLM)

Aggregator

Parse final answers from paths and select the most frequent one (majority vote)

Model or implementation: Deterministic algorithm

Novel Architectural Elements

Sample-and-marginalize decoding framework applied to reasoning paths: explicitly decoupling the reasoning process (latent variable) from the final answer to allow diverse paths to support the same conclusion
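
One way to write this down, using notation assumed here rather than taken from this summary (r_i for the i-th sampled reasoning path, a_i for its parsed answer, m for the number of samples): the answer's probability is the sum over all reasoning paths that reach it, and drawing m paths and counting votes is a Monte Carlo estimate of that sum.

```latex
% Marginalizing out the latent reasoning path r:
P(a \mid \text{prompt}, \text{question})
  = \sum_{r} P(r, a \mid \text{prompt}, \text{question})

% Majority vote over m sampled pairs (r_i, a_i) approximates the arg max of that marginal:
\hat{a} = \arg\max_{a} \sum_{i=1}^{m} \mathbb{1}(a_i = a),
\qquad (r_i, a_i) \sim P(r, a \mid \text{prompt}, \text{question})
```

Because each path is already drawn in proportion to its probability, an unweighted count is enough; no per-path probability needs to be read back from the model.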

Modeling

Base Model: Evaluated on UL2-20B, LaMDA-137B, GPT-3 (175B), and PaLM-540B

Training Method: Few-shot prompting only (inference-time method)

Compute: Requires roughly m times the inference compute of greedy decoding (m=40 in main experiments)

Comparison to Prior Work

vs. Chain-of-Thought: Replaces greedy decoding with sampling + majority vote aggregation
vs. Verifier: Unsupervised; requires no training of an auxiliary verifier model or extra annotations
vs. Sample-and-Rank: Aggregates by answer consistency (majority vote) rather than sequence probability; outperforms standard ranking
vs. Ensembles: Acts as a 'self-ensemble' on a single model rather than combining different models or prompt permutations
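
The difference between sample-and-rank and self-consistency comes down to the selection rule applied to the same set of samples. The sketch below contrasts them on made-up data; the `Sample` record, its field names, and the log-probability values are all hypothetical, and the probability-weighted vote is included only as a plausible variant of the aggregation step.

```python
import math
from collections import Counter, defaultdict
from dataclasses import dataclass
from typing import List

@dataclass
class Sample:
    answer: str      # parsed final answer of one reasoning path
    logprob: float   # total log-probability of the generated tokens
    n_tokens: int    # number of generated tokens

def sample_and_rank(samples: List[Sample]) -> str:
    """Return the answer of the single path with the best length-normalized log-prob."""
    return max(samples, key=lambda s: s.logprob / s.n_tokens).answer

def majority_vote(samples: List[Sample]) -> str:
    """Self-consistency: return the answer reached by the most paths."""
    return Counter(s.answer for s in samples).most_common(1)[0][0]

def weighted_vote(samples: List[Sample]) -> str:
    """Variant: sum length-normalized path probabilities per answer."""
    totals = defaultdict(float)
    for s in samples:
        totals[s.answer] += math.exp(s.logprob / s.n_tokens)
    return max(totals, key=totals.get)

# Made-up samples: the most "confident" single path is wrong,
# but two independent paths agree on a different answer.
samples = [
    Sample("14", -20.0, 40),
    Sample("18", -30.0, 50),
    Sample("18", -33.0, 50),
    Sample("26", -45.0, 50),
]
print(sample_and_rank(samples), majority_vote(samples), weighted_vote(samples))
```

On this toy input, ranking picks the single highest-scoring path while both voting rules pick the answer supported by two paths, which is the behavior the comparison above describes.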

Limitations

Incurs higher computational cost during inference compared to greedy decoding (requires sampling multiple paths)
Applicable primarily to tasks with fixed answer sets (e.g., math, multiple choice) where consistency is easily defined
Relies on the model's ability to generate reasoning paths; nonsensical paths can sometimes lead to the correct answer (false positives), though less likely with high consistency

📊 Experiments & Results

Evaluation Setup

Few-shot in-context learning on arithmetic and commonsense reasoning benchmarks

Benchmarks:

GSM8K (Grade school math word problems)
SVAMP (Math word problems with varying structures)
AQuA (Algebra word problems, multiple choice)
StrategyQA (Commonsense reasoning, Yes/No)
ARC-challenge (Science questions, multiple choice)
MultiArith (Arithmetic reasoning)

Metrics:

Accuracy
Statistical methodology: Reported mean and standard deviation over 10 runs

Key Results

All values are accuracy (%). Baseline is greedy chain-of-thought prompting; This Paper is self-consistency.
Benchmark	Metric	Baseline	This Paper	Δ
Arithmetic reasoning (rows shown for PaLM-540B): self-consistency significantly improves accuracy over greedy CoT prompting across all models.
GSM8K	Accuracy	56.5	74.4	+17.9
SVAMP	Accuracy	79.0	86.6	+7.6
AQuA	Accuracy	35.8	48.3	+12.5
Commonsense reasoning (rows shown for PaLM-540B): self-consistency also boosts performance.
StrategyQA	Accuracy	75.3	81.6	+6.3
ARC-challenge	Accuracy	85.2	88.7	+3.5
Robustness to imperfect prompts (LaMDA-137B): self-consistency improves performance even when the prompt exemplars contain errors.
GSM8K	Accuracy	14.9	23.4	+8.5

Experiment Figures

Accuracy vs. Number of Sampled Reasoning Paths on LaMDA-137B

Robustness to sampling strategies and model scaling on GSM8K

Main Takeaways

Consistent gains across all four language models (UL2, LaMDA, GPT-3, PaLM) and diverse reasoning tasks
Performance improves monotonically with the number of sampled reasoning paths (saturating around 40 paths)
Significantly outperforms sample-and-rank, beam search, and prompt-ensemble baselines
High correlation between answer consistency and accuracy suggests it can serve as an uncertainty estimator
Works with zero-shot chain-of-thought and non-natural language (equation-only) reasoning paths, showing generality
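
The uncertainty-estimation takeaway can be made concrete with a small sketch: the fraction of sampled paths that agree with the majority answer serves as a confidence score. The function names and the 0.5 abstention threshold are assumptions made for illustration, not values from the paper.

```python
from collections import Counter
from typing import List, Optional, Tuple

def vote_with_agreement(answers: List[str]) -> Tuple[str, float]:
    """Return the majority answer and the fraction of paths that agree with it."""
    if not answers:
        raise ValueError("need at least one parsed answer")
    answer, count = Counter(answers).most_common(1)[0]
    return answer, count / len(answers)

def answer_or_abstain(answers: List[str],
                      min_agreement: float = 0.5) -> Optional[str]:
    """Return the majority answer only if agreement reaches the (hypothetical) threshold."""
    answer, agreement = vote_with_agreement(answers)
    return answer if agreement >= min_agreement else None

# Three of five made-up sampled answers agree.
print(vote_with_agreement(["18", "18", "14", "18", "26"]))
```

A low agreement score signals that the sampled paths disagree, which is where the takeaway above suggests the model is more likely to be wrong.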

📚 Prerequisite Knowledge

Prerequisites

Chain-of-thought (CoT) prompting
Language model decoding strategies (greedy, temperature sampling, nucleus sampling)
Few-shot in-context learning

Key Terms

Chain-of-thought prompting: Prompting a language model to generate a series of short sentences describing the reasoning steps before the final answer

Greedy decoding: A decoding strategy where the model always selects the token with the highest probability at each step

Self-consistency: A decoding strategy that samples multiple reasoning paths and selects the answer that appears most frequently (majority vote)

Marginalization: In this context, summing the probabilities of different reasoning paths that lead to the same final answer to find the most likely answer

Temperature sampling: A sampling method where the logits are divided by a temperature parameter T before the softmax; higher T flattens the distribution and increases diversity

Top-k sampling: A sampling method that restricts the model to sample only from the k most likely next tokens

Nucleus sampling: A sampling method that restricts sampling to the smallest set of tokens whose cumulative probability exceeds a threshold p
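
The three sampling definitions above can be combined in a single next-token sampler. This is a simplified, standard-library-only sketch over a plain list of logits; the function name and parameters are hypothetical and do not mirror any particular library's API.

```python
import math
import random
from typing import List, Optional

def sample_token(logits: List[float],
                 temperature: float = 1.0,
                 top_k: Optional[int] = None,
                 top_p: Optional[float] = None,
                 rng: Optional[random.Random] = None) -> int:
    """Sample one token index using temperature, then optional top-k and nucleus filters."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    rng = rng or random

    # Temperature: divide logits by T, then apply a numerically stable softmax.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Candidate token indices, most likely first.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)

    # Top-k: keep only the k most likely tokens.
    if top_k is not None:
        order = order[:top_k]

    # Nucleus: keep the smallest prefix whose cumulative probability reaches p.
    if top_p is not None:
        kept, cumulative = [], 0.0
        for i in order:
            kept.append(i)
            cumulative += probs[i]
            if cumulative >= top_p:
                break
        order = kept

    # random.choices renormalizes the remaining weights for us.
    weights = [probs[i] for i in order]
    return rng.choices(order, weights=weights, k=1)[0]

# Example: sample from the two most likely of three tokens at T = 0.7.
print(sample_token([0.1, 2.0, -1.0], temperature=0.7, top_k=2, rng=random.Random(0)))
```

Greedy decoding is the limiting case in which only the single most likely token survives the filters, for example `top_k=1`.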

GSM8K: Grade School Math 8K—a dataset of grade school math word problems

SVAMP: A challenge dataset for math word problems with varying linguistic structures

AQuA: Algebra Question Answering dataset

StrategyQA: A benchmark for implicit reasoning strategies

ARC: AI2 Reasoning Challenge—a dataset of grade-school science questions

Majority vote: Selecting the answer that occurs most frequently among the generated outputs