โ† Back to Paper List

Synthetic Prompting: Generating Chain-of-Thought Demonstrations for Large Language Models

Zhihong Shao, Yeyun Gong, Yelong Shen, Minlie Huang, Nan Duan, Weizhu Chen
Tsinghua University, Microsoft Research Asia, Microsoft
International Conference on Machine Learning (2023)
Reasoning Benchmark

๐Ÿ“ Paper Summary

Chain-of-Thought Prompting · Data Augmentation · LLM Few-shot Learning
Synthetic Prompting bootstraps a few hand-crafted seed examples into a large, diverse set of demonstrations by alternating between generating reasoning chains and questions, then selecting the most complex examples for inference.
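The alternation described above can be sketched as a simple loop. This is a minimal illustration, not the paper's implementation: the prompt templates and the `model` callable (any text-in, text-out LLM interface) are assumptions for clarity.

```python
# Sketch of the Synthetic Prompting synthesis loop. The `model` argument
# is any callable mapping a prompt string to a completion string; the
# prompt wordings below are illustrative assumptions, not the paper's.

def backward_synthesize(model, topic, target_complexity):
    """Backward process: generate a reasoning chain first (conditioned on
    a topic and a target number of steps), then a question it answers."""
    chain = model(
        f"Write a {target_complexity}-step reasoning chain about {topic}.")
    question = model(
        f"Write a question that is answered by this reasoning:\n{chain}")
    return question, chain

def forward_refine(model, question):
    """Forward process: re-solve the synthesized question from scratch
    to obtain a more precise reasoning chain."""
    return model(f"Answer this question step by step:\n{question}")

def synthesize_pool(model, topics, complexities):
    """Alternate backward and forward passes to grow a demonstration pool
    from seed topics, as in the bootstrapping loop described above."""
    pool = []
    for topic, steps in zip(topics, complexities):
        question, _draft_chain = backward_synthesize(model, topic, steps)
        refined_chain = forward_refine(model, question)
        pool.append((question, refined_chain))
    return pool
```

With a real LLM behind `model`, each pass through the loop adds one (question, reasoning chain) demonstration; running it over many topics and complexities yields the large synthetic pool that the selection step then filters.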
Core Problem
Few-shot reasoning performance depends heavily on the quality and diversity of demonstrations, but manually creating large sets of high-quality reasoning chains is costly and tedious.
Why it matters:
  • Relying on a fixed, small set of examples limits the model's ability to generalize to different test inputs
  • Existing selection methods (complexity-based or similarity-based) assume a large pool of annotated examples already exists
  • Current few-shot methods struggle with complex algorithmic or numerical tasks when provided with only 2-4 examples
Concrete Example: In the 'Repeat Copy' task, a model given only 2 seed examples might fail to understand the pattern. Synthetic Prompting generates variations like 'Repeat the sentence... five times' during the synthesis phase. By selecting these complex self-generated examples as prompts, the model learns the algorithmic pattern better than with just the seeds.
Key Novelty
Backward-Forward Synthesis & In-Cluster Complexity Selection
  • Backward Process: The model generates a reasoning chain first (conditioned on a topic and target complexity), then synthesizes a question to match it, ensuring the question is answerable.
  • Forward Process: The model takes the synthesized question and generates a refined reasoning chain to improve precision.
  • In-Cluster Complexity: Instead of random selection, synthetic examples are clustered, and the most complex (longest reasoning chain) example from each cluster is selected for the final prompt.
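The selection step can be sketched as follows. This is a simplified stand-in under stated assumptions: the paper clusters frequency-based question embeddings, whereas here a plain bag-of-words vector and a small hand-rolled k-means serve as the clustering, and "complexity" is approximated by the number of reasoning-chain lines.

```python
# Sketch of in-cluster complexity selection: cluster synthetic examples
# by question similarity, then keep the longest-chain example per cluster.
# Bag-of-words embedding and this toy k-means are simplifying assumptions.
import random
from collections import Counter

def embed(question, vocab):
    """Bag-of-words count vector over a fixed vocabulary."""
    counts = Counter(question.lower().split())
    return [counts[w] for w in vocab]

def kmeans(vectors, k, iters=20, seed=0):
    """Tiny k-means; returns a cluster index per vector."""
    rng = random.Random(seed)
    centers = rng.sample(vectors, k)
    assign = [0] * len(vectors)
    for _ in range(iters):
        assign = [
            min(range(k),
                key=lambda c: sum((v - cv) ** 2
                                  for v, cv in zip(vec, centers[c])))
            for vec in vectors
        ]
        for c in range(k):
            members = [vectors[i] for i, a in enumerate(assign) if a == c]
            if members:  # recompute centroid; skip empty clusters
                centers[c] = [sum(col) / len(members)
                              for col in zip(*members)]
    return assign

def select_demonstrations(examples, k):
    """examples: list of (question, reasoning_chain) pairs.
    Returns the most complex (longest-chain) example from each cluster."""
    vocab = sorted({w for q, _ in examples for w in q.lower().split()})
    vectors = [embed(q, vocab) for q, _ in examples]
    assign = kmeans(vectors, k)
    chosen = []
    for c in range(k):
        members = [ex for ex, a in zip(examples, assign) if a == c]
        if members:
            chosen.append(max(members,
                              key=lambda ex: len(ex[1].splitlines())))
    return chosen
```

Because the maximum is taken within each cluster rather than globally, the final prompt stays both diverse (one example per cluster) and complex (the hardest example from each).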
Evaluation Highlights
  • +15.6% absolute accuracy gain on the Repeat Copy algorithmic task (2 seed examples) compared to PAL prompting.
  • +2.2% absolute gain on GSM8K math reasoning (4 seed examples) over PAL prompting.
  • Outperforms state-of-the-art prompting methods (CoT, PAL) across numerical, symbolic, and algorithmic reasoning benchmarks using just 2-8 seed examples.
Breakthrough Assessment
7/10
Strong empirical gains on reasoning tasks and a clever mechanism for self-improving prompts without external data. However, it relies on the base model being strong enough to synthesize valid examples (tested on text-davinci-003).