(PS) Plan-and-Solve Prompting: Improving zero-shot CoT reasoning by LLMs

📝 Paper Summary

Prompt Engineering, Zero-shot Reasoning, Chain-of-Thought (CoT)

Plan-and-Solve Prompting replaces simple zero-shot triggers with explicit instructions to devise a plan and execute it, reducing missing steps and calculation errors in LLM reasoning.

Core Problem

Zero-shot Chain-of-Thought (CoT) prompting often fails due to calculation errors, missing reasoning steps, and semantic misunderstandings.

Why it matters:

Existing Zero-shot-CoT ('Let's think step by step') lacks specific guidance, leading models to skip crucial intermediate steps in complex multi-step problems
Few-shot CoT requires manually crafting task-specific demonstrations, which is labor-intensive and not feasible for every new task

Concrete Example: In a math problem about combining weights, Zero-shot-CoT might immediately jump to adding numbers without defining variables, missing a step. Plan-and-Solve explicitly prompts: 'Let's first understand the problem... devise a plan... then carry out the plan', ensuring the variable is defined before calculation.

Key Novelty

Plan-and-Solve (PS) Prompting

Replaces the generic 'Let's think step by step' trigger with a two-stage instruction: first devise a plan to break the task into subtasks, then execute that plan
PS+ Prompting extends this by adding specific instructions to extract variables and pay attention to calculations, acting as a checklist for the LLM during generation
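The difference between the three triggers can be sketched as plain string templates. This is a minimal illustration, not the authors' code: the PS and PS+ sentences below are paraphrases written for this example (the exact wording in the paper may differ), and `TRIGGERS` and `build_prompt` are hypothetical names.

```python
# Illustrative trigger sentences. The PS and PS+ wording is a paraphrase
# made up for this sketch, not a verbatim quote from the paper.
TRIGGERS = {
    "zero_shot_cot": "Let's think step by step.",
    "ps": (
        "Let's first understand the problem and devise a plan to solve it. "
        "Then, let's carry out the plan and solve the problem step by step."
    ),
    "ps_plus": (
        "Let's first understand the problem, extract the relevant variables "
        "and their values, and devise a plan. Then, let's carry out the plan, "
        "calculate intermediate results carefully, and solve the problem "
        "step by step."
    ),
}


def build_prompt(question: str, style: str = "ps_plus") -> str:
    """Place the question and the chosen trigger in a 'Q: ... A: ...' layout."""
    return f"Q: {question.strip()}\nA: {TRIGGERS[style]}"


if __name__ == "__main__":
    print(build_prompt("Tom has 3 apples and buys 2 more. How many now?", "ps"))
```

Only the trigger sentence changes between methods; the question and the surrounding layout stay the same, which is why the technique costs nothing beyond a longer prompt.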

Architecture

Comparison of prompt templates and outputs between Zero-shot-CoT and Plan-and-Solve (PS) Prompting.

Evaluation Highlights

Outperforms Zero-shot-CoT on all 6 arithmetic datasets, with an average accuracy gain of +2.5 percentage points for basic PS and +6.3 percentage points for PS+
PS+ prompting (76.7% average) performs comparably to 8-shot Manual-CoT (77.6%) on arithmetic reasoning without needing any demonstration examples
Achieves 99.6% accuracy on the Coin Flip symbolic reasoning task, effectively matching the 100% accuracy of few-shot baselines

Breakthrough Assessment

7/10

Significant improvement over Zero-shot-CoT with minimal cost (just a better prompt). Bridges the gap between zero-shot and few-shot performance, though fundamentally an incremental prompt engineering technique.

⚙️ Technical Details

Problem Definition

Setting: Zero-shot reasoning on multi-step tasks (Math, Commonsense, Symbolic) using Large Language Models

Inputs: A prompt containing the problem statement [X] and a specific trigger instruction [T]

Outputs: A generated reasoning chain followed by a final answer

Pipeline Flow

Input Problem → Prompt Construction (template: Q: [X] A: [T])
→ Reasoning Generation (LLM generates plan + steps)
→ Answer Extraction (LLM extracts final numerical/text answer)
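The flow above amounts to two model calls per question. The sketch below shows that shape under stated assumptions: `complete` is a hypothetical callable standing in for any text-completion API, `fake_llm` is a canned stub so the example runs offline, and the extraction phrase is one plausible choice rather than the paper's confirmed wording.

```python
from typing import Callable


def solve(question: str, trigger: str, complete: Callable[[str], str]) -> str:
    """Run reasoning generation, then answer extraction, as two LLM calls."""
    reasoning_prompt = f"Q: {question}\nA: {trigger}"
    reasoning = complete(reasoning_prompt)

    # The second call sees the question, the trigger, and the generated
    # reasoning, and is asked only for the final answer.
    # Assumption: the extraction phrase below is illustrative.
    extraction_prompt = (
        f"{reasoning_prompt} {reasoning}\n"
        "Therefore, the answer (arabic numerals) is"
    )
    return complete(extraction_prompt).strip()


def fake_llm(prompt: str) -> str:
    """Hypothetical stand-in for a real model; returns canned text."""
    if "Therefore, the answer" in prompt:
        return " 5"
    return "Plan: add the two numbers. Step 1: 2 + 3 = 5."


if __name__ == "__main__":
    print(solve("What is 2 + 3?", "Let's devise a plan and carry it out.", fake_llm))
```

Passing the model in as a callable keeps the sketch independent of any particular provider; swapping `fake_llm` for a real client is the only change needed to try it against an actual model.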

System Modules

Reasoning Generator

Generate the step-by-step reasoning text based on the trigger instructions

Model or implementation: GPT-3 (text-davinci-003)

Answer Extractor

Parse the final answer from the verbose reasoning text

Model or implementation: GPT-3 (text-davinci-003)

Novel Architectural Elements

Two-component prompt structure within a zero-shot setting: (1) Plan devising instruction, (2) Plan execution instruction

Modeling

Base Model: GPT-3 (text-davinci-003)

Comparison to Prior Work

vs. Zero-shot-CoT: PS adds explicit planning instructions; PS+ further adds variable-extraction and calculation instructions
vs. Manual-CoT: PS is zero-shot (no demonstrations needed) but achieves comparable performance
vs. PoT: PS relies on natural language reasoning rather than code generation/execution
vs. Least-to-Most Prompting: PS decomposes via a single plan instruction rather than multi-stage query decomposition prompts [not cited in paper as direct baseline, but conceptually related]

Limitations

Semantic misunderstanding errors persist (approx 27%) and are not fully addressed by PS prompting
Requires careful prompt engineering; GPT-3 is sensitive to specific wording of the instructions
Evaluation limited to GPT-3 (text-davinci-003); transferability to smaller or open-source models not explored

Reproducibility

Code: https://github.com/AGI-Edgerunners/Plan-and-Solve-Prompting

📊 Experiments & Results

Evaluation Setup

Zero-shot and Few-shot prompting on reasoning benchmarks

Benchmarks:

GSM8K (Grade school math word problems)
SVAMP (Arithmetic word problems with varying difficulty)
MultiArith (Multi-step arithmetic reasoning)
AQuA (Algebraic word problems)
CommonsenseQA (Commonsense reasoning)
StrategyQA (Multi-hop implicit reasoning)
Last Letter (Symbolic reasoning: last-letter concatenation)

Metrics:

Accuracy
Statistical methodology: Not explicitly reported in the paper

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Zero-shot performance comparisons on arithmetic reasoning datasets. PS+ consistently beats standard Zero-shot-CoT.
GSM8K	Accuracy	56.4	59.3	+2.9
MultiArith	Accuracy	83.8	91.8	+8.0
SVAMP	Accuracy	69.9	75.7	+5.8
AQuA	Accuracy	38.9	46.0	+7.1
Comparison against Few-shot methods. PS+ is competitive with Manual-CoT despite being zero-shot.
Average (6 Math Datasets)	Accuracy	77.6	76.7	-0.9
Symbolic and Commonsense reasoning results (baseline: Zero-shot-CoT).
CommonsenseQA	Accuracy	65.2	71.9	+6.7
Last Letter	Accuracy	64.8	75.2	+10.4
Results with Self-Consistency (SC): PS+ stays ahead of Zero-shot-CoT when both use ensemble decoding.
GSM8K	Accuracy (w/ SC)	70.7	73.7	+3.0
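As a sanity check on the Δ column, the differences can be recomputed from the Baseline and This Paper columns. The numbers below are copied from the table above; the variable and function names are hypothetical, and the deltas are in percentage points.

```python
# (benchmark, baseline accuracy, this-paper accuracy), copied from the table.
ROWS = [
    ("GSM8K", 56.4, 59.3),
    ("MultiArith", 83.8, 91.8),
    ("SVAMP", 69.9, 75.7),
    ("AQuA", 38.9, 46.0),
    ("Average (6 Math Datasets)", 77.6, 76.7),  # baseline: 8-shot Manual-CoT
    ("CommonsenseQA", 65.2, 71.9),
    ("Last Letter", 64.8, 75.2),
    ("GSM8K (w/ SC)", 70.7, 73.7),
]


def deltas(rows):
    """Return {benchmark: accuracy difference}, rounded to one decimal."""
    return {name: round(ours - base, 1) for name, base, ours in rows}


if __name__ == "__main__":
    for name, delta in deltas(ROWS).items():
        print(f"{name}: {delta:+.1f}")
```

Rounding to one decimal avoids floating-point noise such as 2.8999999999999986 appearing in place of 2.9.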

Experiment Figures

Pie chart of error types in Zero-shot-CoT on GSM8K (46 incorrect answers).

Bar chart comparing Zero-shot-CoT vs Zero-shot-PS+ with and without Self-Consistency (SC) on GSM8K and SVAMP.

Main Takeaways

PS+ Prompting reduces calculation errors (from 7% to 5%) and missing step errors (from 12% to 7%) compared to Zero-shot-CoT on GSM8K.
The method generalizes well beyond math, showing strong improvements on symbolic (Last Letter, Coin Flip) and commonsense (CSQA, StrategyQA) tasks.
Variable definition and explicit planning in the generated text negatively correlate with error rates, which supports the hypothesis that structure improves reasoning quality.

📚 Prerequisite Knowledge

Prerequisites

Familiarity with Large Language Models (LLMs) like GPT-3
Understanding of Prompt Engineering
Knowledge of Chain-of-Thought (CoT) reasoning

Key Terms

Zero-shot-CoT: A prompting method that elicits reasoning by appending 'Let's think step by step' to a question without providing examples

Few-shot CoT: Prompting the model with a few input-output pairs containing step-by-step reasoning demonstrations

PoT: Program of Thoughts, a method that prompts the model to generate code (such as Python), which is then executed to solve the reasoning problem

Self-Consistency (SC): A decoding strategy where the model generates multiple reasoning paths and selects the final answer via majority voting

Plan-and-Solve (PS): The proposed prompting strategy that explicitly instructs the model to devise a plan before solving

PS+: An enhanced version of Plan-and-Solve that adds instructions to extract variables and calculate intermediate results carefully
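The majority-voting step of Self-Consistency, defined above, can be sketched in a few lines. This is a minimal illustration that assumes the final answers have already been extracted from each sampled reasoning path as strings; `majority_vote` is a hypothetical helper name.

```python
from collections import Counter
from typing import Iterable


def majority_vote(answers: Iterable[str]) -> str:
    """Return the most frequent final answer among sampled reasoning paths.

    Ties go to the answer that was seen first, because Counter.most_common
    keeps insertion order among equal counts. An empty input raises
    IndexError.
    """
    counts = Counter(answer.strip() for answer in answers)
    winner, _ = counts.most_common(1)[0]
    return winner


if __name__ == "__main__":
    sampled = ["18", "20", "18 ", "17"]  # answers from four sampled paths
    print(majority_vote(sampled))
```

In practice the answers would usually be normalized further (for example, "18.0" versus "18") before voting; that step is left out here to keep the sketch short.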