โ† Back to Paper List

Synthetic Prompting: Generating Chain-of-Thought Demonstrations for Large Language Models

Zhihong Shao, Yeyun Gong, Yelong Shen, Minlie Huang, Nan Duan, Weizhu Chen
Tsinghua University, Microsoft Research Asia, Microsoft
International Conference on Machine Learning (2023)
Reasoning Benchmark

๐Ÿ“ Paper Summary

Chain-of-Thought Prompting · Data Augmentation · LLM Few-shot Learning
Synthetic Prompting bootstraps a few hand-crafted seed examples into a large, diverse set of demonstrations by alternating between generating reasoning chains and questions, then selecting the most complex examples for inference.
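The alternation described above can be sketched as a simple loop. This is a minimal illustration, not the paper's implementation: the prompt templates and the `model` callable (any text-in, text-out LLM interface) are assumptions for clarity.

```python
# Sketch of the Synthetic Prompting synthesis loop. The `model` argument
# is any callable mapping a prompt string to a completion string; the
# prompt wordings below are illustrative assumptions, not the paper's.

def backward_synthesize(model, topic, target_complexity):
    """Backward process: generate a reasoning chain first (conditioned on
    a topic and a target number of steps), then a question it answers."""
    chain = model(
        f"Write a {target_complexity}-step reasoning chain about {topic}.")
    question = model(
        f"Write a question that is answered by this reasoning:\n{chain}")
    return question, chain

def forward_refine(model, question):
    """Forward process: re-solve the synthesized question from scratch
    to obtain a more precise reasoning chain."""
    return model(f"Answer this question step by step:\n{question}")

def synthesize_pool(model, topics, complexities):
    """Alternate backward and forward passes to grow a demonstration pool
    from seed topics, as in the bootstrapping loop described above."""
    pool = []
    for topic, steps in zip(topics, complexities):
        question, _draft_chain = backward_synthesize(model, topic, steps)
        refined_chain = forward_refine(model, question)
        pool.append((question, refined_chain))
    return pool
```

With a real LLM behind `model`, each pass through the loop adds one (question, reasoning chain) demonstration; running it over many topics and complexities yields the large synthetic pool that the selection step then filters.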
Core Problem
Few-shot reasoning performance depends heavily on the quality and diversity of demonstrations, but manually creating large sets of high-quality reasoning chains is costly and tedious.
Why it matters:
  • Relying on a fixed, small set of examples limits the model's ability to generalize to different test inputs
  • Existing selection methods (complexity-based or similarity-based) assume a large pool of annotated examples already exists
  • Current few-shot methods struggle with complex algorithmic or numerical tasks when provided with only 2-4 examples
Concrete Example: In the 'Repeat Copy' task, a model given only 2 seed examples might fail to understand the pattern. Synthetic Prompting generates variations like 'Repeat the sentence... five times' during the synthesis phase. By selecting these complex self-generated examples as prompts, the model learns the algorithmic pattern better than with just the seeds.
Key Novelty
Backward-Forward Synthesis & In-Cluster Complexity Selection
  • Backward Process: The model generates a reasoning chain first (conditioned on a topic and target complexity), then synthesizes a question to match it, ensuring the question is answerable.
  • Forward Process: The model takes the synthesized question and generates a refined reasoning chain to improve precision.
  • In-Cluster Complexity: Instead of random selection, synthetic examples are clustered, and the most complex (longest reasoning chain) example from each cluster is selected for the final prompt.
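The selection step can be sketched as follows. This is a simplified stand-in under stated assumptions: the paper clusters frequency-based question embeddings, whereas here a plain bag-of-words vector and a small hand-rolled k-means serve as the clustering, and "complexity" is approximated by the number of reasoning-chain lines.

```python
# Sketch of in-cluster complexity selection: cluster synthetic examples
# by question similarity, then keep the longest-chain example per cluster.
# Bag-of-words embedding and this toy k-means are simplifying assumptions.
import random
from collections import Counter

def embed(question, vocab):
    """Bag-of-words count vector over a fixed vocabulary."""
    counts = Counter(question.lower().split())
    return [counts[w] for w in vocab]

def kmeans(vectors, k, iters=20, seed=0):
    """Tiny k-means; returns a cluster index per vector."""
    rng = random.Random(seed)
    centers = rng.sample(vectors, k)
    assign = [0] * len(vectors)
    for _ in range(iters):
        assign = [
            min(range(k),
                key=lambda c: sum((v - cv) ** 2
                                  for v, cv in zip(vec, centers[c])))
            for vec in vectors
        ]
        for c in range(k):
            members = [vectors[i] for i, a in enumerate(assign) if a == c]
            if members:  # recompute centroid; skip empty clusters
                centers[c] = [sum(col) / len(members)
                              for col in zip(*members)]
    return assign

def select_demonstrations(examples, k):
    """examples: list of (question, reasoning_chain) pairs.
    Returns the most complex (longest-chain) example from each cluster."""
    vocab = sorted({w for q, _ in examples for w in q.lower().split()})
    vectors = [embed(q, vocab) for q, _ in examples]
    assign = kmeans(vectors, k)
    chosen = []
    for c in range(k):
        members = [ex for ex, a in zip(examples, assign) if a == c]
        if members:
            chosen.append(max(members,
                              key=lambda ex: len(ex[1].splitlines())))
    return chosen
```

Because the maximum is taken within each cluster rather than globally, the final prompt stays both diverse (one example per cluster) and complex (the hardest example from each).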
Evaluation Highlights
  • +15.6% absolute accuracy gain on the Repeat Copy algorithmic task (2 seed examples) compared to PAL prompting.
  • +2.2% absolute gain on GSM8K math reasoning (4 seed examples) over PAL prompting.
  • Outperforms state-of-the-art prompting methods (CoT, PAL) across numerical, symbolic, and algorithmic reasoning benchmarks using just 2-8 seed examples.
Breakthrough Assessment
7/10
Strong empirical gains on reasoning tasks and a clever mechanism for self-improving prompts without external data. However, it relies on the base model being strong enough to synthesize valid examples (tested on text-davinci-003).