Least-to-Most Prompting Enables Complex Reasoning in LLMs

📝 Paper Summary

Prompt Engineering, Large Language Model Reasoning, Compositional Generalization

Least-to-most prompting enables large language models to solve reasoning problems that are harder than the exemplars in their prompt, by decomposing each problem into simpler subproblems and solving them sequentially.

Core Problem

Chain-of-thought prompting often fails on tasks requiring generalization to problems harder or longer than the provided few-shot exemplars (easy-to-hard generalization).

Why it matters:

Unlike humans, language models typically struggle to generalize from short training examples to long test cases (length generalization).
Existing neural-symbolic approaches to compositional generalization benchmarks like SCAN require training on thousands of examples, whereas prompting uses only a handful of exemplars.
Standard few-shot prompting hits a ceiling on complex symbolic manipulation and multi-step math problems where the required reasoning depth exceeds what the prompt demonstrates.

Concrete Example: In the last-letter-concatenation task, chain-of-thought prompting handles short lists close in length to the prompt exemplars (length 4) but drops to 31.8% accuracy on lists of length 12, while standard few-shot prompting scores 0%. Neither generalizes the recursion needed for longer sequences.
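To make the task concrete, here is a minimal sketch of the ground-truth computation and of the easy-to-hard sequence of sublists that a recursive solution would walk through. The helper names `last_letter_concat` and `prefix_subproblems` are illustrative and do not come from the paper.

```python
def last_letter_concat(words):
    """Ground-truth answer: join the last letter of each word."""
    return "".join(word[-1] for word in words)


def prefix_subproblems(words):
    """Easy-to-hard sequence of sublists: the first 2 words, the first 3, and so on."""
    return [words[:k] for k in range(2, len(words) + 1)]


words = ["think", "machine", "learning"]
for sub in prefix_subproblems(words):
    # Each answer extends the previous one by exactly one letter.
    print(sub, "->", last_letter_concat(sub))
```

Each sublist's answer is the previous answer plus one letter, which is the recursion a model has to carry over to lists longer than its exemplars.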

Key Novelty

Least-to-Most Prompting

Decomposition Stage: Prompts the model to break a complex problem down into a list of simpler subproblems using few-shot exemplars demonstrating decomposition.
Subproblem Solving Stage: Solves the subproblems in order, appending each solved subproblem and its answer to the context used for the next one.
Progressive Context: The name is borrowed from educational psychology, where a progressive sequence of prompts helps a student learn a new skill; here the model is 'taught' to build answers step by step from previous results.
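The decomposition stage can be sketched as a prompt builder plus a parser for the model's reply. This is a hedged illustration: the `Q:` / `Subquestions:` layout, the numbered-list reply format, and the function names are assumptions made for this sketch, not the paper's actual prompt text (which is given in its appendix).

```python
def build_decomposition_prompt(exemplars, question):
    """Format few-shot (question, subquestions) exemplars, then the new question."""
    blocks = []
    for ex_question, ex_subquestions in exemplars:
        listed = "\n".join(f"{i}. {s}" for i, s in enumerate(ex_subquestions, start=1))
        blocks.append(f"Q: {ex_question}\nSubquestions:\n{listed}")
    # Leave the final list open so the model completes it.
    blocks.append(f"Q: {question}\nSubquestions:\n")
    return "\n\n".join(blocks)


def parse_subquestions(completion):
    """Extract sub-questions from a numbered-list completion ('1. ...' per line)."""
    subquestions = []
    for line in completion.splitlines():
        number, dot, rest = line.strip().partition(". ")
        if dot and number.isdigit() and rest:
            subquestions.append(rest.strip())
    return subquestions


# Made-up exemplar, for illustration only.
exemplars = [(
    "Ann has 3 pens. Bob has twice as many. How many pens do they have together?",
    ["How many pens does Bob have?", "How many pens do they have together?"],
)]
prompt = build_decomposition_prompt(exemplars, "A new, harder question goes here.")
```

The resulting `prompt` would be sent to the model, and `parse_subquestions` applied to whatever it returns.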

Architecture

Illustration of the two-stage Least-to-Most prompting process on a math word problem.

Evaluation Highlights

99.7% accuracy on SCAN length split using code-davinci-002 with just 14 exemplars, compared to 16.2% with chain-of-thought prompting.
74.0% accuracy on last-letter-concatenation for length-12 lists (harder than prompt), while chain-of-thought gets 31.8% and standard prompting gets 0%.
45.23% accuracy on GSM8K math problems requiring 5+ reasoning steps, up from 39.07% with chain-of-thought prompting.

Breakthrough Assessment

9/10

Achieved near-perfect results (99.7%) on the SCAN length split via prompting alone, far above chain-of-thought prompting (16.2%) and comparable to specialized trained models, essentially solving a major OOD generalization benchmark.

⚙️ Technical Details

Problem Definition

Setting: Few-shot prompting of frozen Large Language Models for complex reasoning tasks involving symbolic manipulation and compositional generalization.

Inputs: A natural language query q representing a complex problem.

Outputs: A final answer derived from sequentially solving subproblems generated by the model itself.

Pipeline Flow

Decomposition Prompting: Input Problem → [Decomposition Prompt] → Subproblems List
Sequential Solving: Subproblem 1 + [Solving Prompt] → Answer 1
Recursive Step: Answer 1 + Subproblem 2 + [Solving Prompt] → Answer 2 ... → Final Answer
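The flow above can be sketched as a loop that re-prompts the model once per subproblem. This is a minimal sketch under stated assumptions: `llm` is a hypothetical callable mapping a prompt string to a completion string, and the `Q:` / `A:` layout is illustrative rather than the paper's exact template.

```python
from typing import Callable, List, Tuple


def solve_least_to_most(
    problem: str,
    subproblems: List[str],
    llm: Callable[[str], str],
    solving_prompt: str = "",
) -> Tuple[str, List[Tuple[str, str]]]:
    """Solve subproblems in order, feeding earlier Q/A pairs back into the prompt."""
    solved: List[Tuple[str, str]] = []
    for sub in subproblems:
        # Context accumulates: every earlier subproblem and its answer is included.
        history = "".join(f"Q: {q}\nA: {a}\n\n" for q, a in solved)
        prompt = f"{solving_prompt}{problem}\n\n{history}Q: {sub}\nA:"
        solved.append((sub, llm(prompt).strip()))
    final_answer = solved[-1][1] if solved else ""
    return final_answer, solved


def fake_llm(prompt: str) -> str:
    # Stand-in for a real model call: reports how many answered pairs it was shown.
    return f"saw {prompt.count('A: ')} earlier answers"


final, trace = solve_least_to_most(
    "Ann has 3 pens. Bob has twice as many.",
    ["How many pens does Bob have?", "How many pens do they have together?"],
    fake_llm,
)
```

The answer to the last subproblem is taken as the final answer, which matches the flow when the last sub-question restates the original question.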

System Modules

Decomposition Module

Break down the complex input question into a sequential list of simpler sub-questions.

Model or implementation: GPT-3 (code-davinci-002, text-davinci-002)

Subproblem Solver

Solve each sub-question sequentially, using context from previous answers.

Model or implementation: GPT-3 (code-davinci-002, text-davinci-002)

Novel Architectural Elements

Two-stage prompting architecture: Explicit separation of 'decomposition' and 'solving' phases via distinct prompts, rather than a single pass.
Recursive context accumulation: The output of the solver for subproblem i becomes part of the prompt for subproblem i+1.

Modeling

Base Model: GPT-3 (code-davinci-002, text-davinci-002, code-davinci-001)

Comparison to Prior Work

vs. Chain-of-Thought: Explicitly separates decomposition from solving; solves subproblems sequentially with accumulated context rather than in one continuous generation.
vs. Neural-Symbolic Stack Machines: Achieves comparable SCAN performance (99%+) with 14 examples and no training, versus 15,000+ training examples.
vs. Selection-Inference [not cited in paper]: Least-to-Most generates its own decomposition plan dynamically, whereas Selection-Inference typically selects from a fixed set of facts/rules.
vs. Scratchpad [not cited in paper]: Similar focus on intermediate computation, but Least-to-Most structures the computation via explicit sub-question decomposition first.

Limitations

Decomposition prompts do not generalize well across domains (e.g., math prompts don't work for commonsense).
Requires designing task-specific decomposition exemplars.
Performance depends heavily on the specific engine; code-davinci-002 significantly outperforms text-davinci-002.
Error analysis shows failures in complex compositional instructions (e.g., confusing 'after' with 'and' in SCAN).
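To see why confusing 'after' with 'and' produces a wrong answer, consider a toy interpreter for a small fragment of SCAN. This sketch is an assumption-laden simplification: it covers only four primitives, 'left'/'right', 'twice'/'thrice', and a single 'and' or 'after'; it omits 'turn', 'opposite', and 'around'; and it uses the action-token spelling from this summary.

```python
PRIMITIVES = {"walk": "WALK", "look": "LOOK", "run": "RUN", "jump": "JUMP"}
TURNS = {"left": "TURN_LEFT", "right": "TURN_RIGHT"}
REPEATS = {"twice": 2, "thrice": 3}


def interpret_phrase(phrase):
    """Translate e.g. 'jump left twice' into a list of action tokens."""
    tokens = phrase.split()
    actions = [PRIMITIVES[tokens[0]]]
    for token in tokens[1:]:
        if token in TURNS:
            actions = [TURNS[token]] + actions  # turn first, then act
        elif token in REPEATS:
            actions = actions * REPEATS[token]  # repeat the whole phrase so far
        else:
            raise ValueError(f"unsupported token: {token}")
    return actions


def interpret(command):
    """Handle one optional 'and' / 'after' joining two phrases."""
    if " after " in command:
        first, second = command.split(" after ", 1)
        # 'X after Y' executes Y first, then X.
        return interpret_phrase(second) + interpret_phrase(first)
    if " and " in command:
        first, second = command.split(" and ", 1)
        return interpret_phrase(first) + interpret_phrase(second)
    return interpret_phrase(command)


print(interpret("walk and jump twice"))
print(interpret("walk after jump twice"))
```

The two commands differ by one word but yield action sequences in opposite order, so reading 'after' as 'and' fails exact match.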

Reproducibility

Prompt templates for all tasks (Last-letter, SCAN, DROP, GSM8K) are provided in the Appendix, where the paper lists the full prompt content. No specific code repository is linked, but the method is purely prompting-based and needs only standard API access.

📊 Experiments & Results

Evaluation Setup

Few-shot prompting on frozen LLMs without fine-tuning.

Benchmarks:

Last-Letter-Concatenation (Symbolic Manipulation) [New]
SCAN (Compositional Generalization, Length Split)
GSM8K (Math Word Problems)
DROP (Discrete Reasoning over Paragraphs)

Metrics:

Accuracy (Exact Match)
Statistical methodology: Not explicitly reported in the paper
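Exact-match accuracy can be computed as in the sketch below. The whitespace trimming is an assumption of this example; the summary does not state what normalization, if any, the paper applies before comparing answers.

```python
def exact_match_accuracy(predictions, references):
    """Percentage of predictions equal to their reference after trimming whitespace."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    hits = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return 100.0 * hits / len(references)


score = exact_match_accuracy(["keg", "ab "], ["keg", "ac"])
```

Under this metric a prediction that is wrong in a single token counts as fully wrong, which is why long SCAN action sequences are unforgiving.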

Key Results

All baselines are chain-of-thought prompting. The symbolic manipulation rows show generalization to sequences longer than those seen in prompts, the compositional generalization row covers the challenging SCAN Length Split, and the math reasoning rows show improvements on difficult multi-step problems.

Task Group	Benchmark	Metric	Baseline	This Paper	Δ
Symbolic Manipulation	Last-Letter-Concatenation (length 12)	Accuracy (%)	31.8	74.0	+42.2
Symbolic Manipulation	Last-Letter-Concatenation (length 4)	Accuracy (%)	84.2	94.0	+9.8
Compositional Generalization	SCAN (Length Split)	Accuracy (%)	16.2	99.7	+83.5
Math Reasoning	GSM8K (5+ steps)	Accuracy (%)	39.07	45.23	+6.16
Math Reasoning	DROP (Football subset)	Accuracy (%)	59.56	73.42	+13.86

Main Takeaways

Least-to-Most prompting enables easy-to-hard generalization: models prompted with short examples can solve much longer/harder test cases.
The code-davinci-002 model consistently outperforms text-davinci-002 on symbolic tasks, suggesting code training aids symbolic reasoning.
The method is particularly effective for tasks with clear recursive structures (SCAN, concatenation) but less dominant on tasks where decomposition is trivial or ambiguous.
Decomposition accuracy is critical; for SCAN and Last-Letter, the decomposition step achieves near 100% accuracy, enabling the downstream success.

📚 Prerequisite Knowledge

Prerequisites

Few-shot prompting (providing input-output pairs in context)
Chain-of-Thought prompting (generating intermediate reasoning steps)
Compositional generalization (ability to understand unseen combinations of seen components)

Key Terms

SCAN: A benchmark for compositional generalization requiring models to map natural language commands to action sequences (e.g., 'jump left' -> 'TURN_LEFT JUMP').

Chain-of-Thought: A prompting technique where the model generates intermediate reasoning steps before the final answer.

length split: An evaluation setting where training/prompt examples are short, but test examples are long, testing out-of-distribution generalization.

code-davinci-002: A specific GPT-3 model variant optimized for code, which the paper finds significantly better at symbolic reasoning than text-davinci-002.

easy-to-hard generalization: The ability of a model to solve difficult problems (e.g., more steps, longer sequences) after only seeing easy examples.

greedy decoding: Selecting the most likely next token at each step during generation (temperature=0).
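Greedy decoding can be sketched in a few lines. This is a toy illustration, not any particular library's API: `next_token_scores` is a hypothetical function returning a score per candidate token, and `toy_scores` is a hard-coded stand-in for a language model.

```python
def greedy_decode(next_token_scores, prompt_tokens, eos_token, max_new_tokens=20):
    """Repeatedly append the highest-scoring next token until EOS or the length cap."""
    tokens = list(prompt_tokens)
    generated = []
    for _ in range(max_new_tokens):
        scores = next_token_scores(tokens)   # dict: token -> score
        best = max(scores, key=scores.get)   # argmax, no sampling
        if best == eos_token:
            break
        tokens.append(best)
        generated.append(best)
    return generated


def toy_scores(tokens):
    # Hypothetical stand-in for a model: prefers "k", then "e", then "g", then EOS.
    order = ["k", "e", "g", "<eos>"]
    step = min(len(tokens), len(order) - 1)
    return {tok: (1.0 if i == step else 0.0) for i, tok in enumerate(order)}


output = greedy_decode(toy_scores, [], "<eos>")
```

Because the highest-scoring token is always chosen, the same prompt yields the same output on every run, which keeps prompting comparisons deterministic.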