Plan-and-Solve Prompting: Improving Zero-Shot Chain-of-Thought Reasoning by Large Language Models

📝 Paper Summary

Chain-of-Thought Prompting, Zero-shot Reasoning

Plan-and-Solve Prompting improves zero-shot reasoning by explicitly instructing LLMs to devise a plan and execute it, replacing simple step-by-step triggers.

Core Problem

Zero-shot Chain-of-Thought (CoT) prompting often fails on complex tasks due to calculation errors, missing reasoning steps, and semantic misunderstandings.

Why it matters:

Large Language Models (LLMs) are often deployed as services without access to parameters for fine-tuning, making effective zero-shot prompting crucial
Manual few-shot prompting requires labor-intensive crafting of demonstrations, which zero-shot approaches aim to eliminate
Existing zero-shot triggers like 'Let's think step by step' are insufficient for preventing logic gaps in multi-step problems

Concrete Example: In a math problem asking for the combined weight of Grace and Alex, Zero-shot-CoT might attempt to add numbers immediately and miss the intermediate step of calculating Alex's weight first. Plan-and-Solve explicitly generates a plan: '1. Find Grace's weight... 2. Find Alex's weight... 3. Sum them', preventing the missing step.
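
The three-step plan in this example can be traced as plain arithmetic. The sketch below is only an illustration: the summary does not state the problem's numbers, so Grace's weight (125 pounds) and the relation for Alex's weight (2 pounds less than 4 times Grace's) are assumed values, and `combined_weight` is a hypothetical helper.

```python
# Hypothetical values for illustration; this summary does not give the numbers.
GRACE_WEIGHT_LB = 125


def combined_weight(grace_lb: int) -> int:
    """Follow the plan step by step instead of summing the surface numbers."""
    # Step 1: find Grace's weight (given directly in the problem).
    grace = grace_lb
    # Step 2: find Alex's weight (assumed relation: 2 less than 4 times Grace's).
    alex = 4 * grace - 2
    # Step 3: sum them.
    return grace + alex


print(combined_weight(GRACE_WEIGHT_LB))  # 623
```

Skipping step 2 and adding the numbers that appear in the problem text would give a different total, which is the missing-step failure described above.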

Key Novelty

Plan-and-Solve (PS) Prompting

Replaces the generic 'Let's think step by step' trigger with a two-stage instruction: first devise a plan to decompose the task, then execute the plan
PS+ prompting extends this with detailed instructions to extract relevant variables and pay attention to calculations, targeting specific error types such as missing variables and arithmetic mistakes
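
The difference between the three triggers can be sketched as a small prompt builder. Only 'Let's think step by step' is quoted in this summary; the PS and PS+ wordings below are assumptions that paraphrase the instructions described above and should be checked against the paper or its repository. The `Q: ... A: ...` template and the `build_prompt` helper are likewise hypothetical.

```python
# Trigger sentences keyed by method. The "ps" and "ps+" wordings are assumed
# paraphrases of the instructions described in this summary, not verified text.
TRIGGERS = {
    "cot": "Let's think step by step.",
    "ps": (
        "Let's first understand the problem and devise a plan to solve the problem. "
        "Then, let's carry out the plan and solve the problem step by step."
    ),
    "ps+": (
        "Let's first understand the problem, extract relevant variables and their "
        "corresponding numerals, and devise a plan. Then, let's carry out the plan, "
        "calculate intermediate results (pay attention to calculation and common "
        "sense), solve the problem step by step, and show the answer."
    ),
}


def build_prompt(problem: str, method: str = "ps+") -> str:
    """Wrap a problem statement in an assumed Q/A template with the chosen trigger."""
    return f"Q: {problem}\nA: {TRIGGERS[method]}"


print(build_prompt("What is the combined weight of Grace and Alex?", "ps"))
```

Because the method changes only the trigger sentence, switching between Zero-shot-CoT, PS, and PS+ requires no change to the rest of the prompt.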

Architecture

Comparison of Zero-shot-CoT vs. Plan-and-Solve (PS) Prompting workflows

Evaluation Highlights

PS+ prompting achieves 76.7% average accuracy across six arithmetic datasets, surpassing Zero-shot-CoT (70.4%) and Zero-shot-Program-of-Thought (73.5%)
On CommonsenseQA, PS+ prompting scores 71.9%, outperforming Zero-shot-CoT (65.2%) by 6.7 points
Comes within 0.9 points of 8-shot Manual-CoT on the six arithmetic datasets (76.7% vs. 77.6% average) without requiring any manual demonstration examples

Breakthrough Assessment

7/10

A simple prompting strategy that narrows the gap between zero-shot and few-shot performance (76.7% vs. 77.6% average accuracy on arithmetic) while targeting specific reasoning failure modes.

⚙️ Technical Details

Problem Definition

Setting: Zero-shot reasoning on multi-step problems (arithmetic, commonsense, symbolic) using Large Language Models via prompting

Inputs: Natural language problem statement [X]

Outputs: Reasoning steps followed by a final numerical or text answer

Pipeline Flow

Prompting for Reasoning Generation (Input Problem + PS Trigger -> Reasoning Text)
Prompting for Answer Extraction (Reasoning Text + Extraction Trigger -> Final Answer)
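
The two stages above can be sketched as two calls to the same model. This is a minimal sketch under stated assumptions, not the authors' implementation: `llm` is a hypothetical callable that maps a prompt string to a completion string, the default trigger is an assumed paraphrase of the PS instruction, and the extraction trigger wording is assumed as well.

```python
from typing import Callable

# Assumed wordings; neither sentence is quoted in this summary.
PS_TRIGGER = (
    "Let's first understand the problem and devise a plan to solve the problem. "
    "Then, let's carry out the plan and solve the problem step by step."
)
EXTRACTION_TRIGGER = "Therefore, the answer (arabic numerals) is"


def plan_and_solve(
    problem: str, llm: Callable[[str], str], trigger: str = PS_TRIGGER
) -> str:
    """Run reasoning generation, then answer extraction, with one model."""
    # Stage 1: input problem + trigger -> reasoning text.
    reasoning_prompt = f"Q: {problem}\nA: {trigger}"
    reasoning = llm(reasoning_prompt)
    # Stage 2: first prompt + reasoning text + extraction trigger -> final answer.
    extraction_prompt = f"{reasoning_prompt} {reasoning}\n{EXTRACTION_TRIGGER}"
    return llm(extraction_prompt).strip()


def fake_llm(prompt: str) -> str:
    """Stand-in for a real completion call; returns canned text."""
    if prompt.endswith(EXTRACTION_TRIGGER):
        return " 623."
    return "Plan: 1. Find Grace's weight. 2. Find Alex's weight. 3. Sum them."


print(plan_and_solve("What is the combined weight of Grace and Alex?", fake_llm))
```

Passing the model in as a callable keeps the sketch independent of any particular API client; a real client would replace `fake_llm`.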

System Modules

Reasoning Generator

Generate the reasoning plan and execution steps

Model or implementation: GPT-3 (text-davinci-003)

Answer Extractor

Extract the final numerical or categorical answer from the generated text

Model or implementation: GPT-3 (text-davinci-003)

Novel Architectural Elements

PS+ Trigger Sentence Structure: A comprehensive prompt template that enforces variable extraction, explicit planning, and calculation checks in a zero-shot setting

Modeling

Base Model: GPT-3 (text-davinci-003)

Training Method: Zero-shot inference via prompting

Key Hyperparameters:

temperature: 0 (greedy decoding)
self_consistency_samples: 10 (only for self-consistency experiments)
self_consistency_temperature: 0.7 (only for self-consistency experiments)

Compute: Inference costs associated with GPT-3 API usage; no training compute reported

Comparison to Prior Work

vs. Zero-shot-CoT: Adds explicit planning and variable extraction instructions, reducing missing steps and calculation errors
vs. Manual-CoT: Achieves comparable performance without the manual effort of crafting dataset-specific examples
vs. Zero-shot-PoT: Solves problems via natural language reasoning rather than code generation, performing better on arithmetic benchmarks in this study
vs. Least-to-Most Prompting [not cited in paper]: Similar decomposition idea, but PS is a single-turn zero-shot instruction rather than a multi-stage query process

Limitations

Sensitive to the exact phrasing of the prompt instructions (prompt engineering effort required)
Semantic misunderstanding errors persist (27% of sampled GSM8K problems) despite improvements in calculation and missing-step errors
Relies on proprietary API models (GPT-3), limiting full transparency into the underlying model mechanics

Reproducibility

Code: https://github.com/AGI-Edgerunners/Plan-and-Solve-Prompting

📊 Experiments & Results

Evaluation Setup

Zero-shot inference on standard reasoning benchmarks

Benchmarks:

GSM8K: Arithmetic Reasoning (Grade School Math)
SVAMP: Arithmetic Reasoning (Robustness)
MultiArith: Arithmetic Reasoning
AQuA: Algebraic Word Problems
CommonsenseQA (CSQA): Commonsense Reasoning
StrategyQA: Multi-step Commonsense Reasoning
Last Letter: Symbolic Reasoning

Metrics:

Accuracy
Statistical methodology: Not explicitly reported in the paper
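
Accuracy here is the share of problems whose extracted answer matches the gold answer. The sketch below shows one way to compute it; the `normalize` rules (dropping a trailing period and thousands separators, comparing numbers by value) are assumptions for illustration and are not described in this summary.

```python
def normalize(answer: str) -> str:
    """Canonicalize an answer string (assumed rules, for illustration only)."""
    text = answer.strip().rstrip(".").replace(",", "").lower()
    try:
        # Compare numeric answers by value, so "623" and "623.0" match.
        return str(float(text))
    except ValueError:
        # Non-numeric answers (e.g. option letters) are compared as text.
        return text


def accuracy(predictions: list[str], golds: list[str]) -> float:
    """Percentage of predictions that match the gold answer after normalization."""
    if not golds or len(predictions) != len(golds):
        raise ValueError("predictions and golds must be non-empty and equal in length")
    correct = sum(normalize(p) == normalize(g) for p, g in zip(predictions, golds))
    return 100.0 * correct / len(golds)


print(accuracy(["623.", "1,000", "b"], ["623", "1000", "a"]))
```

In this usage example two of the three predictions match, so the result is two thirds expressed as a percentage.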

Key Results

Zero-shot PS+ consistently outperforms standard Zero-shot-CoT on arithmetic, commonsense, and symbolic reasoning benchmarks.

Benchmark	Metric	Baseline (Zero-shot-CoT)	This Paper (PS+)	Δ (points)
Average (6 Math Datasets)	Accuracy (%)	70.4	76.7	+6.3
GSM8K	Accuracy (%)	56.4	59.3	+2.9
CommonsenseQA (CSQA)	Accuracy (%)	65.2	71.9	+6.7
Last Letter	Accuracy (%)	64.8	75.2	+10.4

On arithmetic tasks, PS+ prompting outperforms Zero-shot Program-of-Thought (PoT) and comes within one point of 8-shot Manual-CoT.

Benchmark	Metric	Baseline	This Paper (PS+)	Δ (points)
Average (6 Math Datasets)	Accuracy (%)	73.5 (Zero-shot-PoT)	76.7	+3.2
Average (6 Math Datasets)	Accuracy (%)	77.6 (8-shot Manual-CoT)	76.7	-0.9
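
The Δ column is the PS+ score minus the baseline score, in percentage points. A short check of that arithmetic, using the numbers from the table and the baseline names from the Evaluation Highlights (`ROWS` and `delta` are names introduced here for illustration):

```python
# (benchmark, baseline method, baseline accuracy, PS+ accuracy)
ROWS = [
    ("Average (6 Math Datasets)", "Zero-shot-CoT", 70.4, 76.7),
    ("GSM8K", "Zero-shot-CoT", 56.4, 59.3),
    ("CommonsenseQA (CSQA)", "Zero-shot-CoT", 65.2, 71.9),
    ("Last Letter", "Zero-shot-CoT", 64.8, 75.2),
    ("Average (6 Math Datasets)", "Zero-shot-PoT", 73.5, 76.7),
    ("Average (6 Math Datasets)", "8-shot Manual-CoT", 77.6, 76.7),
]


def delta(baseline: float, ours: float) -> float:
    """Difference in percentage points, rounded to one decimal place."""
    return round(ours - baseline, 1)


for benchmark, method, baseline, ours in ROWS:
    print(f"{benchmark} vs. {method}: {delta(baseline, ours):+.1f}")
```

The only negative difference is against 8-shot Manual-CoT, which uses hand-written demonstrations that PS+ does not need.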

Experiment Figures

Pie chart breakdown of error types in Zero-shot-CoT on GSM8K

Correlation matrix between generated reasoning components (Variables, Plan, Solution) and error types

Main Takeaways

PS+ prompting reduces calculation errors (from 7% to 5%) and missing step errors (from 12% to 7%) compared to Zero-shot-CoT on GSM8K samples
Explicit instructions to extract variables and devise plans negatively correlate with errors, empirically validating the prompt design
The method generalizes well across different reasoning types (arithmetic, commonsense, symbolic) without task-specific tuning beyond the prompt template
Self-consistency (majority voting) further boosts PS+ performance (e.g., from 58.7% to 73.7% on GSM8K)
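
The majority-voting step behind self-consistency can be sketched in a few lines. The sample answers below are invented for illustration, and `majority_vote` is a hypothetical helper; in the reported setup the ten answers would come from ten reasoning paths sampled at temperature 0.7.

```python
from collections import Counter


def majority_vote(answers: list[str]) -> str:
    """Return the most frequent answer; ties go to the answer seen first."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    # Counter.most_common orders equal counts by first occurrence.
    return Counter(answers).most_common(1)[0][0]


# Ten invented final answers, one per sampled reasoning path.
sampled = ["623", "623", "498", "623", "625", "623", "623", "498", "623", "623"]
print(majority_vote(sampled))  # 623
```

Voting helps when wrong reasoning paths disagree with each other while correct paths converge on the same answer.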

📚 Prerequisite Knowledge

Prerequisites

Understanding of Large Language Models (LLMs)
Familiarity with Chain-of-Thought (CoT) prompting
Distinction between zero-shot and few-shot learning

Key Terms

CoT: Chain-of-Thought—a prompting technique that encourages the model to generate intermediate reasoning steps

Zero-shot-CoT: Invoking reasoning without examples using a trigger like 'Let's think step by step'

PoT: Program-of-Thought—prompting the model to generate code (usually Python) to solve reasoning problems

PS Prompting: Plan-and-Solve Prompting—the proposed method instructing the model to devise a plan and then execute it

PS+ Prompting: An extension of PS Prompting that includes detailed instructions to extract variables and pay attention to calculations

Self-Consistency: A decoding strategy where the model generates multiple reasoning paths and the final answer is selected via majority vote