Chain of Thoughtlessness? An Analysis of CoT in Planning

📝 Paper Summary

Chain of Thought (CoT) reasoning, Classical Planning, Generalization analysis

Chain of Thought prompting in planning tasks relies on pattern matching rather than learning general algorithms, failing to generalize as problem complexity increases beyond the provided examples.

Core Problem

LLMs often fail to generalize reasoning capabilities out-of-distribution. Chain of Thought (CoT) prompting is claimed to teach models algorithmic procedures via examples, but it is unclear whether models actually learn the algorithm or just mimic patterns.

Why it matters:

The prevalent belief that CoT 'unlocks' reasoning can lead to misplaced confidence in LLMs for critical planning tasks
The trade-off between the heavy human labor required to craft specific CoT prompts and the resulting brittle performance is poorly understood
Existing benchmarks (like GSM8K) often lack systematic scaling mechanisms, masking the failure of LLMs to generalize to larger instances of the same problem type

Concrete Example: A model given CoT examples of how to stack 3 blocks can solve 3-block problems, but as the target stack grows beyond the examples (up to 20 blocks), its accuracy falls to near zero even though exactly the same logic applies.

Key Novelty

Systematic Granularity Stress-Test

Evaluates CoT performance across a spectrum of prompt specificity, from general 'progression proofs' applicable to any problem, down to highly specific 'table-to-stack' recipes
Uses the Blocksworld planning domain to mechanically scale problem difficulty (stack height), rigorously testing whether the 'learned' algorithm generalizes to larger instances

Architecture

A conceptual diagram illustrating the 'Target Distributions' of problems versus the 'Expected Generality' of different prompt types.

Evaluation Highlights

On 'table-to-stack' Blocksworld tasks, GPT-4 improves from 3.83% (zero-shot) to 59.3% with highly specific CoT prompts, but this gain is brittle
Performance collapses as the number of blocks increases: accuracy drops from ~60% to near 0% as the target stack height grows from 3 to 20, despite the algorithm remaining identical
On the synthetic CoinFlip task, accuracy remains high for short sequences but drops below 90% for sequences longer than 31 steps, showing limits to length generalization

Breakthrough Assessment

8/10

A strong negative result that critically re-evaluates a dominant paradigm (CoT). It provides convincing evidence against the 'algorithmic learning' hypothesis for CoT using verifiable planning domains.

⚙️ Technical Details

Problem Definition

Setting: Classical planning (specifically Blocksworld domain) and synthetic reasoning tasks (CoinFlip, Arithmetic)

Inputs: Natural language query describing an initial state and a goal state (derived from PDDL)

Outputs: A plan (sequence of actions) to transition from initial state to goal state

Pipeline Flow

Prompt Generation (PDDL -> Natural Language with varying CoT granularity)
LLM Inference (Generate Plan)
Plan Extraction (Parse NL response to PDDL actions)
Validation (Check plan validity using VAL)
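
The four stages above can be sketched end to end as follows. This is a minimal illustration, not the authors' code: the prompt wording, the regular expressions, the `canned_llm` stand-in for a model call, and the equality check standing in for VAL are all assumptions made for the example.

```python
import re
from typing import Callable, List, Tuple

Action = Tuple[str, ...]  # e.g. ("stack", "blue", "red")

def make_prompt(blocks: List[str], cot_examples: str = "") -> str:
    """Render a table-to-stack query in natural language (hypothetical wording).

    `blocks` is listed bottom to top; `cot_examples` is any few-shot prefix.
    """
    start = ", ".join(f"the {b} block is on the table" for b in blocks)
    goal = ", ".join(
        f"the {top} block is on top of the {bottom} block"
        for bottom, top in zip(blocks, blocks[1:])
    )
    return f"{cot_examples}Initially, {start}. My goal is that {goal}. My plan is:"

# Assumed phrasing of actions in the model's answer; "unstack" is listed
# before "stack" so the more specific pattern is tried first.
PATTERNS = [
    (re.compile(r"pick up the (\w+) block"), "pick-up"),
    (re.compile(r"put down the (\w+) block"), "put-down"),
    (re.compile(r"unstack the (\w+) block from on top of the (\w+) block"), "unstack"),
    (re.compile(r"stack the (\w+) block on top of the (\w+) block"), "stack"),
]

def extract_plan(response: str) -> List[Action]:
    """Parse a natural-language response into PDDL-style action tuples."""
    plan: List[Action] = []
    for line in response.lower().splitlines():
        for pattern, name in PATTERNS:
            match = pattern.search(line)
            if match:
                plan.append((name, *match.groups()))
                break
    return plan

def run_pipeline(
    blocks: List[str],
    llm: Callable[[str], str],
    validate: Callable[[List[Action]], bool],
) -> bool:
    """Prompt generation -> inference -> plan extraction -> validation."""
    prompt = make_prompt(blocks)
    response = llm(prompt)
    plan = extract_plan(response)
    return validate(plan)

def canned_llm(prompt: str) -> str:
    """Stand-in for a real model call; always returns the same answer."""
    return "1. Pick up the blue block\n2. Stack the blue block on top of the red block\n"

expected = [("pick-up", "blue"), ("stack", "blue", "red")]
ok = run_pipeline(["red", "blue"], canned_llm, lambda plan: plan == expected)
```

Passing the model and the validator in as callables keeps the sketch independent of any particular API client or validator binary.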

System Modules

Prompt Generator

Constructs prompts with varying levels of specific advice (Zero-shot, Universal Algorithm, Stacking Prompt)

Model or implementation: Script-based

Reasoning Engine

Generates the reasoning trace and final plan

Model or implementation: GPT-4 / Claude-3-Opus / GPT-4-Turbo

Validator

Verifies if the generated plan is executable and reaches the goal

Model or implementation: VAL (standard PDDL validator)
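
To make "executable and reaches the goal" concrete, here is a toy Blocksworld plan checker. It is an illustrative stand-in written for this summary, not VAL and not the authors' code; the state encoding (a dict from each block to what it rests on) and the function names are assumptions.

```python
from typing import Dict, List, Optional, Tuple

Action = Tuple[str, ...]

def is_clear(on: Dict[str, str], block: str) -> bool:
    """A block is clear when no other block rests on it."""
    return block not in on.values()

def validate_plan(
    initial_on: Dict[str, str],
    goal_on: Dict[str, str],
    plan: List[Action],
) -> bool:
    """Simulate the plan; fail on the first violated precondition.

    `initial_on` and `goal_on` map each block to "table" or to the block
    directly beneath it. A held block is temporarily absent from `on`.
    """
    on = dict(initial_on)
    holding: Optional[str] = None
    for action in plan:
        name, args = action[0], action[1:]
        if name == "pick-up" and len(args) == 1:
            (b,) = args
            if holding is not None or on.get(b) != "table" or not is_clear(on, b):
                return False
            del on[b]
            holding = b
        elif name == "put-down" and len(args) == 1:
            (b,) = args
            if holding != b:
                return False
            on[b] = "table"
            holding = None
        elif name == "unstack" and len(args) == 2:
            b, below = args
            if (holding is not None or below not in on
                    or on.get(b) != below or not is_clear(on, b)):
                return False
            del on[b]
            holding = b
        elif name == "stack" and len(args) == 2:
            b, below = args
            if holding != b or below not in on or not is_clear(on, below):
                return False
            on[b] = below
            holding = None
        else:
            return False  # unknown action or wrong number of arguments
    # The plan is valid only if every goal relation holds at the end.
    return all(on.get(b) == support for b, support in goal_on.items())

initial = {"a": "table", "b": "table", "c": "table"}
goal = {"b": "a", "c": "b"}
good_plan = [("pick-up", "b"), ("stack", "b", "a"), ("pick-up", "c"), ("stack", "c", "b")]
bad_plan = [("stack", "b", "a")]  # stacks a block that was never picked up
```

Here `validate_plan(initial, goal, good_plan)` succeeds, while `bad_plan` is rejected at its first step because the hand is empty.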

Modeling

Base Model: GPT-4, Claude-3-Opus, GPT-4-Turbo

Compute: Not reported in the paper

Comparison to Prior Work

vs. Sound Planners: LLMs fail to generalize and reach low accuracy, whereas sound symbolic planners achieve 100%
vs. Zero-Shot CoT: Manual CoT improves over Zero-Shot only when prompts are highly problem-specific
vs. Self-Consistency: Self-consistency actually degraded performance on Blocksworld (Table-to-Stack) due to the large solution space
vs. Chain-of-Thought (Standard): Authors argue standard CoT literature overclaims generalization capabilities; this paper shows these gains are brittle and instance-specific
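
For reference, self-consistency amounts to sampling several answers and keeping the most frequent one. The sketch below shows that vote on whole plans; the sample plans are invented for illustration and the function name is an assumption. When many distinct plans are possible, votes can spread thinly across them, which is one plausible reading of the "large solution space" remark above.

```python
from collections import Counter
from typing import Hashable, Sequence, Tuple

def self_consistent_answer(samples: Sequence[Hashable]) -> Tuple[Hashable, float]:
    """Return the most frequent sample and the fraction of votes it received."""
    counts = Counter(samples)
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(samples)

# Invented samples: each plan is a tuple of action strings so it can be counted.
plan_a = ("pick-up b", "stack b a")
plan_b = ("pick-up b", "put-down b", "pick-up b", "stack b a")
plan_c = ("pick-up a", "stack a b")

winner, share = self_consistent_answer([plan_a, plan_b, plan_a, plan_c])
```

In this toy case `plan_a` wins with half of the votes; the vote says nothing about whether the winning plan is actually valid, so a validator is still needed.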

Limitations

Study focuses primarily on Blocksworld and synthetic domains, though authors argue these are representative of sequential reasoning
Relies on manual prompt engineering, though this mirrors standard CoT practices
Results are specific to the model versions tested (GPT-4, GPT-4-Turbo, Claude-3-Opus) and may change with future model capabilities
Self-consistency experiments were limited to Table-to-Stack problems

Reproducibility

Code: https://github.com/karthikv792/cot-planning

The code is publicly available (https://github.com/karthikv792/cot-planning and https://github.com/kstechly/cot-scheduling) and includes scripts for prompt generation and evaluation. The VAL validator is standard software.

📊 Experiments & Results

Evaluation Setup

Evaluation of LLMs on Blocksworld planning problems and synthetic reasoning tasks

Benchmarks:

Blocksworld (Table-to-Stack) (Classical Planning)
CoinFlip (Symbolic Reasoning (State tracking))
LastLetterConcatenation (String Manipulation / Reasoning)
Single Digit Arithmetic (Multi-step Arithmetic) [New]
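
The ground truth for the first two synthetic tasks is simple to compute, which is what makes them easy to scale in length. The sketch below follows the usual definitions of these tasks in the CoT literature; the exact phrasing and instance format used in the paper may differ, and the function names are assumptions.

```python
from typing import List

def coin_flip_answer(flips: List[bool], starts_heads_up: bool = True) -> bool:
    """Track the coin's state through a sequence of people.

    Each entry says whether that person flipped the coin. Returns True if
    the coin ends heads up.
    """
    heads_up = starts_heads_up
    for did_flip in flips:
        if did_flip:
            heads_up = not heads_up
    return heads_up

def last_letter_concat(words: List[str]) -> str:
    """Concatenate the last letter of each word, in order."""
    return "".join(word[-1] for word in words if word)

still_heads = coin_flip_answer([True, False, True])  # two flips cancel out
initials = last_letter_concat(["Elon", "Musk"])      # "n" + "k"
```

Both procedures are a single pass over the input, so a longer instance requires more steps of the same kind rather than a different algorithm.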

Metrics:

Plan Accuracy (Boolean correctness verified by VAL)
Task Accuracy (Exact match for synthetic tasks)
Statistical methodology: Not explicitly reported in the paper
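
Because each instance is scored as correct or incorrect, accuracy per problem size is just a grouped mean. A minimal sketch, assuming results are stored as `(num_blocks, is_correct)` pairs; the sample data is invented and does not come from the paper.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

def accuracy_by_size(results: Iterable[Tuple[int, bool]]) -> Dict[int, float]:
    """Percentage of correct instances for each problem size."""
    totals: Dict[int, int] = defaultdict(int)
    correct: Dict[int, int] = defaultdict(int)
    for num_blocks, is_correct in results:
        totals[num_blocks] += 1
        correct[num_blocks] += int(is_correct)
    return {n: 100.0 * correct[n] / totals[n] for n in sorted(totals)}

# Invented outcomes, for illustration only.
sample_results = [(3, True), (3, False), (4, False), (4, False)]
per_size = accuracy_by_size(sample_results)
```

Reporting accuracy per size rather than one pooled number is what exposes a drop in performance as instances grow.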

Key Results

Comparison of different prompting strategies on Table-to-Stack Blocksworld problems (261 instances). Shows that high specificity is required for gains.

Benchmark	Metric	Baseline (%)	This Paper (%)	Δ
Blocksworld (Table-to-Stack)	Accuracy	3.83	59.3	+55.47
Blocksworld (Table-to-Stack)	Accuracy	19.1	40.6	+21.5
Blocksworld (Table-to-Stack)	Accuracy	9.96	24.5	+14.54

Performance on synthetic tasks comparing Zero-Shot to Manual CoT (GPT-4-Turbo). Demonstrates gains in constrained domains.

Benchmark	Metric	Baseline (%)	This Paper (%)	Δ
CoinFlip	Accuracy	56.38	98.89	+42.51
Single Digit Arithmetic	Accuracy	24.13	50.43	+26.30
LastLetterConcatenation	Accuracy	10.00	51.06	+41.06

Experiment Figures

Line plots showing Accuracy (y-axis) vs. Number of Blocks (x-axis) for different prompting strategies across three models (GPT-4-Turbo, Claude-3-Opus, GPT-4).

Main Takeaways

CoT performance improvements are highly dependent on prompt specificity; general algorithmic advice (Progression Proof) often yields negligible gains compared to problem-specific recipes (Stacking Prompt).
Length generalization fails: Even with specific CoT, performance degrades rapidly as the number of blocks/steps increases beyond what was shown in examples.
The 'Universal Algorithm' prompt, which teaches a general unstack-then-stack method, performed surprisingly well for Claude-3-Opus, suggesting some capacity for following explicit algorithms, though it still falls short of perfect generalization.
In LastLetterConcatenation, CoT improves syntactic correctness (e.g., getting the right set of letters) but fails to sequence them correctly for longer words, suggesting pattern matching over algorithmic execution.

📚 Prerequisite Knowledge

Prerequisites

Understanding of Large Language Models and prompting strategies
Basics of classical planning (states, actions, goals)
Familiarity with Chain of Thought (CoT) literature

Key Terms

CoT: Chain of Thought—a prompting technique where the model is shown intermediate reasoning steps in examples to encourage it to generate similar reasoning steps

Blocksworld: A classic planning domain involving moving blocks between a table and stacks according to specific rules

PDDL: Planning Domain Definition Language—a standard formal language used to define planning problems (predicates, actions, preconditions, effects)

Zero-Shot CoT: A prompting strategy that appends 'Let's think step by step' to the query without providing examples

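STRIPS: Stanford Research Institute Problem Solver, an early automated planner whose action representation (preconditions plus add and delete effects) became a standard formalism for classical planning problems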

In-context learning: The ability of a model to improve its performance on a task by observing examples provided in the prompt context

Table-to-Stack: A simplified subset of Blocksworld problems where all blocks start on the table and must be arranged into a single target stack
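
The procedure for a Table-to-Stack instance is the same at every size: pick up each block above the bottom one and stack it on the block below. The sketch below generates such a plan; block names and the function name are hypothetical, and the plan has 2(n - 1) actions for n blocks.

```python
from typing import List, Tuple

Action = Tuple[str, ...]

def table_to_stack_plan(blocks: List[str]) -> List[Action]:
    """Build the target stack from blocks that all start on the table.

    `blocks` is listed bottom to top. The bottom block never moves; every
    other block needs one pick-up and one stack action.
    """
    plan: List[Action] = []
    for below, block in zip(blocks, blocks[1:]):
        plan.append(("pick-up", block))
        plan.append(("stack", block, below))
    return plan

for n in (3, 4, 20):
    names = [f"b{i}" for i in range(n)]
    print(n, len(table_to_stack_plan(names)))
# prints:
# 3 4
# 4 6
# 20 38
```

Plan length grows linearly with the number of blocks while the loop body stays the same, which is why this subset is a clean test of length generalization.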

Length Generalization: The ability of a model to apply a learned procedure to problem instances that are longer or larger (e.g., more steps) than those seen in training/examples