Automatic Prompt Augmentation and Selection with Chain-of-Thought from Labeled Data

📝 Paper Summary

Chain-of-Thought (CoT) Prompting, Prompt Engineering, In-context Learning

Automate-CoT automatically generates reasoning chains for labeled data and selects the optimal subset of exemplars using a variance-reduced policy gradient strategy to maximize task performance.

Core Problem

Manual design of Chain-of-Thought (CoT) exemplars is labor-intensive and sensitive to factors like order, complexity, diversity, and style, making adaptation to new tasks difficult.

Why it matters:

Human-written prompts are costly to create and optimize for every new dataset
Performance of Large Language Models (LLMs) fluctuates with arbitrary choices such as exemplar order (a drop of up to 3.3 percentage points on GSM8K)
Static human prompts often fail to match the complexity or diversity required by specific questions

Concrete Example: On the GSM8K math dataset, simply shuffling the order of human-written Manual-CoT exemplars causes accuracy to drop from 63.1% to 59.8%. Furthermore, using only complex exemplars helps hard questions but hurts performance on simple ones.

Key Novelty

Automate-CoT (Automatic Prompt Augmentation and Selection with Chain-of-Thought)

Augments labeled data by using an LLM to generate pseudo-reasoning chains, pruning those that lead to incorrect answers
Treats the selection of in-context exemplars as a latent variable optimization problem, using reinforcement learning (policy gradient) to find the combination that maximizes prediction accuracy

Architecture

The overall three-step pipeline of Automate-CoT: Augment, Prune, and Select.

Evaluation Highlights

+2.7 percentage points average accuracy over Manual-CoT across five arithmetic reasoning tasks using text-davinci-002
+3.3 percentage points average accuracy over Self-Consistency (SC) on arithmetic tasks under the self-consistency setting
Outperforms Auto-CoT by 4.8 percentage points on GSM8K using code-davinci-002

Breakthrough Assessment

7/10

Solid automated pipeline that removes the need for manual CoT engineering while achieving consistent gains across diverse reasoning tasks. Effectively addresses sensitivity issues identified in prior work.

⚙️ Technical Details

Problem Definition

Setting: Few-shot in-context learning where a set of input-output pairs (questions and answers) is available, but reasoning chains are not manually provided

Inputs: A training dataset D containing questions Q and answers A

Outputs: A selected set of few-shot exemplars (question-rationale-answer triples) to prompt the LLM
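To make the output format concrete, here is a minimal sketch of how question-rationale-answer triples might be stored and assembled into a few-shot prompt. The `Exemplar` class, the `build_prompt` helper, and the "Q: / A: ... The answer is ..." template are illustrative assumptions, not the paper's exact prompt format.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Exemplar:
    """One question-rationale-answer triple (field names are illustrative)."""
    question: str
    rationale: str
    answer: str


def build_prompt(exemplars: List[Exemplar], test_question: str) -> str:
    """Concatenate the selected exemplars, then append the unanswered test question."""
    blocks = [
        f"Q: {ex.question}\nA: {ex.rationale} The answer is {ex.answer}."
        for ex in exemplars
    ]
    # The final block ends at "A:" so the model continues with its own rationale.
    blocks.append(f"Q: {test_question}\nA:")
    return "\n\n".join(blocks)


demo = [
    Exemplar(
        question="Tom has 3 apples and buys 2 more. How many apples does he have?",
        rationale="Tom starts with 3 apples. 3 + 2 = 5.",
        answer="5",
    )
]
prompt = build_prompt(demo, "A pen costs 4 dollars. How much do 3 pens cost?")
print(prompt)
```

Keeping the triple as a small immutable record makes it easy to reorder or swap exemplars, which is exactly what the selection step varies.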

Pipeline Flow

Augment (Generate pseudo-chains)
Prune (Filter by answer correctness)
Select (Optimize exemplar choice via RL)
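The three stages can be sketched as plain functions. This is a skeleton under stated assumptions: `fake_generate` is a deterministic stand-in for an LLM call, and the selector is passed in as a callable (here a trivial "take the first k" placeholder rather than the policy-gradient selector the paper uses).

```python
from typing import Callable, List, Tuple

Pair = Tuple[str, str]               # (question, gold answer)
Candidate = Tuple[str, str, str, str]  # (question, rationale, predicted, gold)
Triple = Tuple[str, str, str]        # (question, rationale, answer)


def augment(data: List[Pair],
            generate: Callable[[str, int], Tuple[str, str]],
            samples_per_question: int = 2) -> List[Candidate]:
    """Step 1: sample several pseudo-chains per labeled question."""
    pool = []
    for question, gold in data:
        for i in range(samples_per_question):
            rationale, predicted = generate(question, i)
            pool.append((question, rationale, predicted, gold))
    return pool


def prune(candidates: List[Candidate]) -> List[Triple]:
    """Step 2: keep only chains whose final answer matches the label."""
    return [(q, r, p) for q, r, p, gold in candidates if p.strip() == gold.strip()]


def run_pipeline(data: List[Pair],
                 generate: Callable[[str, int], Tuple[str, str]],
                 select: Callable[[List[Triple], int], List[Triple]],
                 k: int) -> List[Triple]:
    """Augment, prune, then hand the surviving pool to a selector."""
    return select(prune(augment(data, generate)), k)


def fake_generate(question: str, sample_idx: int) -> Tuple[str, str]:
    """Hypothetical LLM stub: sample 0 is correct, sample 1 is off by one."""
    a, b = [int(tok) for tok in question.split() if tok.isdigit()]
    total = a + b + sample_idx
    return f"{a} + {b} = {total}.", str(total)


def first_k(pool: List[Triple], k: int) -> List[Triple]:
    """Placeholder selector; the paper optimizes this choice instead."""
    return pool[:k]


data = [("Add 2 and 3", "5"), ("Add 10 and 7", "17")]
pool = prune(augment(data, fake_generate))
selected = run_pipeline(data, fake_generate, first_k, k=1)
print(len(pool), selected)
```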

System Modules

Augmentor

Generate rationale chains for questions in the training set

Model or implementation: GPT-3 (text-davinci-002)

Pruner

Filter out generated chains that lead to incorrect answers

Model or implementation: Exact Match Check
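A minimal sketch of what an exact-match pruner could look like for numeric answers. The "answer is ..." extraction pattern and the normalization rules are assumptions made for illustration; the summary only states that chains are filtered by answer correctness.

```python
import re
from typing import Optional

# Assumed answer format: "... The answer is 18." (optionally with $ or commas).
ANSWER_PATTERN = re.compile(r"answer is\s*\$?(-?[\d,]*\.?\d+)", re.IGNORECASE)


def extract_answer(chain: str) -> Optional[str]:
    """Return the last 'answer is X' number in the chain, or None if absent."""
    matches = ANSWER_PATTERN.findall(chain)
    return matches[-1] if matches else None


def normalize(ans: str) -> str:
    """Strip thousands separators and render whole numbers without a decimal part."""
    text = ans.replace(",", "").strip()
    try:
        value = float(text)
    except ValueError:
        return text.lower()
    return str(int(value)) if value.is_integer() else str(value)


def keep_chain(chain: str, gold: str) -> bool:
    """Keep a generated chain only if its extracted answer matches the label."""
    predicted = extract_answer(chain)
    return predicted is not None and normalize(predicted) == normalize(gold)


print(keep_chain("She earns 9 * 2 = 18. The answer is $18.", "18"))
print(keep_chain("She earns 9 * 2 = 20. The answer is 20.", "18"))
```

A check like this only inspects the final answer, so a chain with flawed intermediate steps that happens to land on the right number would still pass.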

Selector

Select the optimal combination of exemplars to minimize loss

Model or implementation: Policy Gradient Optimizer (VR-PGE)

Novel Architectural Elements

Application of Variance-Reduced Policy Gradient Estimator (VR-PGE) specifically to the discrete optimization problem of selecting CoT exemplars from a machine-generated pool
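The idea of optimizing a discrete exemplar choice with a variance-reduced policy gradient can be illustrated on a toy problem. This sketch makes several assumptions that are not from the summary: each prompt slot holds a softmax distribution over the candidate pool, the baseline is the mean loss over the sampled selections, and `toy_loss` is a stand-in for the loss an LLM would incur on training questions when prompted with the sampled exemplars.

```python
import math
import random
from typing import Callable, List


def softmax(logits: List[float]) -> List[float]:
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train_selector(pool_size: int, num_slots: int,
                   loss_fn: Callable[[List[int]], float],
                   steps: int = 500, samples: int = 8,
                   lr: float = 1.0, seed: int = 0) -> List[int]:
    """Learn one categorical distribution per slot; return the argmax choice per slot."""
    rng = random.Random(seed)
    logits = [[0.0] * pool_size for _ in range(num_slots)]
    for _ in range(steps):
        probs = [softmax(row) for row in logits]
        # Sample several candidate selections from the current policy.
        draws = [[rng.choices(range(pool_size), weights=p)[0] for p in probs]
                 for _ in range(samples)]
        losses = [loss_fn(d) for d in draws]
        baseline = sum(losses) / samples  # mean loss acts as the control variate
        for slot in range(num_slots):
            grad = [0.0] * pool_size
            for draw, loss in zip(draws, losses):
                advantage = (loss - baseline) / (samples - 1)
                for j in range(pool_size):
                    indicator = 1.0 if draw[slot] == j else 0.0
                    # d/d logit_j of log softmax(z) is 1[z == j] - p_j
                    grad[j] += advantage * (indicator - probs[slot][j])
            logits[slot] = [w - lr * g for w, g in zip(logits[slot], grad)]
    return [max(range(pool_size), key=row.__getitem__) for row in logits]


# Hypothetical per-candidate usefulness; a real run would query the LLM instead.
quality = [0.2, 0.9, 0.4, 0.3]


def toy_loss(selection: List[int]) -> float:
    return 1.0 - sum(quality[i] for i in selection) / len(selection)


best = train_selector(pool_size=len(quality), num_slots=2, loss_fn=toy_loss)
print(best, toy_loss(best))
```

Subtracting the mean loss of the sampled batch leaves the gradient estimate unbiased while shrinking its variance, which matters when every loss evaluation costs an LLM call.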

Modeling

Base Model: text-davinci-002 and code-davinci-002 (GPT-3/Codex)

Comparison to Prior Work

vs. Manual-CoT: Fully automatic generation and selection; avoids human engineering and sensitivity issues
vs. Auto-CoT: Uses a policy gradient selection mechanism to optimize the specific combination of exemplars rather than just clustering-based sampling
vs. SC: Automate-CoT optimizes the *prompt* itself, whereas SC optimizes the *decoding* process (methods are orthogonal and can be combined)
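Because self-consistency acts at decoding time, it can wrap any prompt, including one built from automatically selected exemplars. A minimal sketch of the majority vote follows; `sample_answer` is a hypothetical callable standing in for one sampled LLM completion reduced to its final answer.

```python
from collections import Counter
from typing import Callable


def self_consistent_answer(prompt: str,
                           sample_answer: Callable[[str, int], str],
                           num_paths: int = 5) -> str:
    """Sample several reasoning paths and return the most frequent final answer."""
    answers = [sample_answer(prompt, i) for i in range(num_paths)]
    winner, _count = Counter(answers).most_common(1)[0]
    return winner


# Canned outputs standing in for five sampled completions.
canned = ["18", "20", "18", "18", "17"]


def fake_sampler(prompt: str, path_idx: int) -> str:
    return canned[path_idx]


voted = self_consistent_answer("Q: ...\nA:", fake_sampler)
print(voted)
```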

Limitations

Relies on the assumption that a correct answer implies a correct reasoning chain, which does not always hold: a flawed chain can still reach the right answer and survive pruning as a false positive
Depends on access to a labeled dataset (questions + answers) to perform the pruning and selection
Involves an iterative optimization process (5 epochs) which adds computational cost compared to static prompting methods
Prompt length is constrained by the context window of the LLM (max 2048 tokens for GPT-3 used here)
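The context-window constraint can be checked before a prompt is sent. The sketch below assumes the completion shares the 2048-token window with the prompt and reuses the max_tokens value of 256 reported under Reproducibility. `rough_token_count` is a crude whitespace placeholder, not a real tokenizer; an actual run would need the model's own tokenizer.

```python
from typing import Callable, List


def count_fitting_exemplars(exemplar_texts: List[str],
                            question: str,
                            count_tokens: Callable[[str], int],
                            context_window: int = 2048,
                            max_completion_tokens: int = 256) -> int:
    """Return how many exemplars, taken in order, fit in the prompt budget."""
    budget = context_window - max_completion_tokens - count_tokens(question)
    used = 0
    for kept, text in enumerate(exemplar_texts):
        cost = count_tokens(text)
        if used + cost > budget:
            return kept
        used += cost
    return len(exemplar_texts)


def rough_token_count(text: str) -> int:
    """Crude placeholder: whitespace word count, NOT a real tokenizer."""
    return len(text.split())


exemplars = ["word " * 700] * 4  # four exemplars of roughly 700 "tokens" each
fitting = count_fitting_exemplars(exemplars, "short question here", rough_token_count)
print(fitting)
```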

Reproducibility

Code: https://github.com/SHUMKASHUN/Automate-CoT

The code is publicly available at the repository linked above. Experiments use the OpenAI API (text-davinci-002, code-davinci-002). Hyperparameters: 5 epochs, learning rate 1e-3, batch size 10, max_tokens 256.

📊 Experiments & Results

Evaluation Setup

Few-shot prompting on reasoning and NLU tasks using GPT-3 variants

Benchmarks:

GSM8K, ASDiv, SVAMP, AQuA, SingleOp (Arithmetic Reasoning)
CommonsenseQA (CSQA), StrategyQA (Commonsense Reasoning)
Last Letter Concatenation (Letter (4)) (Symbolic Reasoning)
OpenBookQA, e-SNLI, SST-2 (Non-reasoning tasks: QA, NLI, Sentiment)

Metrics:

Exact Match Accuracy
Statistical methodology: Results averaged over three runs; variance reported in Appendix

Key Results

Main results on Arithmetic Reasoning tasks showing consistent improvements over Manual-CoT and Self-Consistency (SC).

Benchmark	Metric	Baseline	This Paper	Δ
Arithmetic Reasoning (Average of 5 tasks)	Accuracy	61.3 (Manual-CoT)	64.0	+2.7
Arithmetic Reasoning (Average of 5 tasks)	Accuracy	67.0 (SC)	70.3	+3.3
GSM8K	Accuracy	68.2 (Auto-CoT)	73.0	+4.8

Results on other reasoning types (Commonsense and Symbolic) and non-reasoning tasks.

Benchmark	Metric	Baseline	This Paper	Δ
Commonsense Reasoning (Average)	Accuracy	69.0	72.4	+3.4
Letter (4)	Accuracy	60.6	63.8	+3.2
e-SNLI	Accuracy	74.8	78.2	+3.4

Experiment Figures

Analysis of complexity and diversity in CoT exemplars.

Main Takeaways

Automate-CoT consistently outperforms human-written Manual-CoT across arithmetic, commonsense, and symbolic reasoning tasks.
The method also carries over to non-reasoning tasks such as NLI and sentiment analysis; on e-SNLI, accuracy rises from 74.8 to 78.2.
The gains hold on both model backends tested (text-davinci-002 and code-davinci-002).
Combining Automate-CoT with Self-Consistency yields further gains, suggesting the two methods are complementary.

📚 Prerequisite Knowledge

Prerequisites

Chain-of-Thought (CoT) prompting
In-context learning / Few-shot prompting
Policy Gradient methods in Reinforcement Learning

Key Terms

CoT: Chain-of-Thought—a prompting technique where the model is encouraged to generate intermediate reasoning steps before the final answer

VR-PGE: Variance-Reduced Policy Gradient Estimator—a reinforcement learning technique used here to estimate gradients for discrete exemplar selection with lower variance than standard estimators

Self-consistency: A decoding strategy where the model generates multiple reasoning paths and selects the final answer via majority voting

Rationale chain: The sequence of intermediate reasoning steps generated by the model to reach an answer

Latent variable: A variable that is not directly observed (here, which exemplars are selected) but is inferred or optimized during the training process