Active Prompting with Chain-of-Thought for Large Language Models

📝 Paper Summary

Prompt Engineering, Active Learning

Active-Prompt improves Chain-of-Thought reasoning by identifying and annotating the most uncertain questions from a dataset to use as task-specific few-shot exemplars.

Core Problem

Standard Chain-of-Thought prompting relies on fixed, often randomly selected human-annotated exemplars, which may not be the most effective examples for teaching the model complex reasoning tasks.

Why it matters:

Current methods require manual engineering to craft exemplars without knowing which questions are actually helpful to the model
Fixed sets of exemplars fail to address specific difficulties or uncertainties the model faces in different reasoning domains
Random selection of few-shot examples often leads to suboptimal performance compared to targeted selection

Concrete Example: In standard CoT, a model might be prompted with 8 randomly chosen math problems. If the test set contains complex probability questions but the exemplars cover only simple addition, the model is likely to struggle. Active-Prompt detects that the model is uncertain about probability questions, selects those hard questions for human annotation, and uses them as exemplars.

Key Novelty

Uncertainty-based Active Learning for Prompt Selection

Instead of random examples, the system asks the LLM to answer training questions multiple times and measures uncertainty (disagreement, entropy, etc.)
Questions with the highest uncertainty are selected for human annotation (writing reasoning chains)
These specific, high-uncertainty questions become the few-shot exemplars for inference, effectively teaching the model where it is weakest

Architecture

Schematic of the Active-Prompt pipeline comprising four stages: Uncertainty Estimation, Selection, Annotation, and Inference.

Evaluation Highlights

Outperforms CoT and Self-consistency baselines on 8 complex reasoning datasets spanning arithmetic, commonsense, and symbolic reasoning
Outperforms Auto-CoT and Random-CoT, demonstrating that active selection is more effective than clustering or random sampling
Works effectively with multiple LLMs including code-davinci-002, text-davinci-002, and text-davinci-003

Breakthrough Assessment

8/10

Significant methodology improvement for in-context learning. Bridges active learning and prompt engineering effectively, showing that *which* examples are used matters as much as *how* they are used.

⚙️ Technical Details

Problem Definition

Setting: Few-shot in-context learning for complex reasoning tasks (arithmetic, commonsense, symbolic)

Inputs: Unlabeled training set D_tr, test set D_te, and a human annotator budget n

Outputs: Predictions for test set D_te using a prompt P constructed from n selected exemplars

Pipeline Flow

Uncertainty Estimation: Query LLM k times on training pool → Calculate uncertainty metric
Selection: Rank questions by uncertainty → Select top-n
Annotation: Human annotates selected questions with CoT reasoning
Inference: Use annotated examples as prompt → Generate answer (optionally with Self-consistency)
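
The first two stages of the flow above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: `select_for_annotation` and the `sample_answer` callable (standing in for one sampled LLM answer) are hypothetical names, and disagreement is used as the uncertainty score. The defaults mirror the hyperparameters listed in this summary (k = 10, pool of 1000, up to 8 exemplars).

```python
import random
from typing import Callable, List, Sequence


def disagreement(answers: Sequence[str]) -> float:
    """Fraction of distinct final answers among the sampled answers."""
    return len(set(answers)) / len(answers)


def select_for_annotation(
    pool: Sequence[str],
    sample_answer: Callable[[str], str],  # hypothetical: one sampled LLM answer
    k: int = 10,
    n: int = 8,
    pool_size: int = 1000,
    seed: int = 0,
) -> List[str]:
    """Return the n questions whose k sampled answers disagree the most."""
    candidates = list(pool)
    if len(candidates) > pool_size:
        # Subsample large pools to bound the number of LLM queries.
        candidates = random.Random(seed).sample(candidates, pool_size)

    scored = []
    for question in candidates:
        answers = [sample_answer(question) for _ in range(k)]
        scored.append((disagreement(answers), question))

    # Highest uncertainty first; the sort is stable, so ties keep pool order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [question for _, question in scored[:n]]
```

The returned questions are the ones handed to the human annotator; everything after that point (annotation and inference) is unchanged from ordinary few-shot CoT.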

System Modules

Uncertainty Estimator

Determine which questions the model struggles with by generating k outputs and measuring divergence

Model or implementation: code-davinci-002 (primary)

Annotator (Oracle)

Provide ground truth reasoning chains and answers for selected questions

Model or implementation: Human annotator

Inference Engine

Solve test questions using the constructed prompt

Model or implementation: code-davinci-002 (primary)
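
Once the selected questions are annotated, the inference prompt is simply the annotated exemplars followed by the test question. The sketch below assumes a plain "Q: / A:" layout ending in "The answer is X."; that layout, the `Exemplar` type, and `build_prompt` are illustrative assumptions, not a format confirmed by this summary.

```python
from typing import NamedTuple, Sequence


class Exemplar(NamedTuple):
    question: str
    rationale: str  # human-written chain of thought
    answer: str


def build_prompt(exemplars: Sequence[Exemplar], test_question: str) -> str:
    """Concatenate annotated exemplars and append the unanswered test question."""
    blocks = [
        f"Q: {e.question}\nA: {e.rationale} The answer is {e.answer}."
        for e in exemplars
    ]
    # The trailing "A:" leaves room for the model to write its own reasoning.
    blocks.append(f"Q: {test_question}\nA:")
    return "\n\n".join(blocks)
```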

Novel Architectural Elements

Integration of uncertainty-based active learning metrics (disagreement, entropy, variance) directly into the exemplar selection process for CoT prompting
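
To make the three metrics concrete, here is one way they might be computed from the k sampled answers to a single question. This is a sketch under stated assumptions: the function names are hypothetical, entropy uses the empirical answer frequencies with the natural logarithm, and variance is the sample variance (dividing by k - 1); this summary does not specify these details, and any normalization the paper applies is omitted.

```python
import math
from collections import Counter
from statistics import variance
from typing import Sequence


def disagreement(answers: Sequence[str]) -> float:
    """Number of unique answers divided by the number of samples."""
    return len(set(answers)) / len(answers)


def entropy(answers: Sequence[str]) -> float:
    """Shannon entropy (in nats) of the empirical answer distribution."""
    k = len(answers)
    counts = Counter(answers)
    return sum(-(c / k) * math.log(c / k) for c in counts.values())


def answer_variance(values: Sequence[float]) -> float:
    """Sample variance of numeric answers (assumption: k - 1 denominator)."""
    return variance(values)
```

All three scores are zero when every sample agrees and grow as the sampled answers spread out, which is why they tend to rank questions similarly.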

Modeling

Base Model: code-davinci-002 (primary), text-davinci-002, text-davinci-003, gpt-3.5-turbo

Training Method: In-context learning (inference-only optimization of prompts)

Key Hyperparameters:

k_samples_uncertainty: 10
k_samples_inference: 40 (Self-consistency)
temperature_inference: 0.7
pool_size: 1000 (randomly sampled if larger)
exemplar_count: 4 to 8 (dataset dependent)

Compute: Requires k forward passes per training sample for uncertainty estimation, plus standard inference compute
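
As a rough back-of-the-envelope check, the query counts implied by the hyperparameters above can be tallied directly. The helper names below are hypothetical, and the test-set size is left as a parameter because it varies by benchmark.

```python
def uncertainty_queries(pool_size: int = 1000, k: int = 10) -> int:
    """Sampled completions needed to score every question in the pool."""
    return pool_size * k


def inference_queries(num_test_questions: int, samples: int = 40) -> int:
    """Sampled completions needed at test time with Self-consistency."""
    return num_test_questions * samples


# With the defaults, uncertainty estimation costs 1000 * 10 = 10,000 completions,
# a one-time cost per task that is independent of the test-set size.
print(uncertainty_queries())
```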

Comparison to Prior Work

vs. Auto-CoT: Active-Prompt uses human annotation for high-value targets rather than machine-generated chains; selects based on uncertainty rather than diversity/clustering
vs. Random-CoT: Active-Prompt judiciously selects hard questions rather than random ones
vs. Self-consistency: Active-Prompt optimizes the *prompt* (input side), whereas SC optimizes the *decoding* (output side); the paper combines both
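
The output-side half of that combination, Self-consistency, reduces to a majority vote over the final answers extracted from the sampled reasoning paths. A minimal sketch (the `majority_vote` name is hypothetical, and answer extraction from the generated text is assumed to have happened already):

```python
from collections import Counter
from typing import Sequence


def majority_vote(final_answers: Sequence[str]) -> str:
    """Return the most frequent final answer among the sampled reasoning paths."""
    if not final_answers:
        raise ValueError("need at least one sampled answer")
    # most_common breaks ties in favor of the answer seen first.
    return Counter(final_answers).most_common(1)[0][0]
```

Active-Prompt changes which exemplars go into the prompt; this vote is applied afterwards to the samples (40 per question in the setup summarized here) generated from that prompt.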

Limitations

Requires human annotation, which incurs cost and time compared to fully automated methods like Auto-CoT
Uncertainty estimation requires multiple forward passes (k=10) on the training pool, increasing computational cost
Performance depends on the quality of the specific human annotations provided for the selected hard questions

Reproducibility

Code: https://github.com/shizhediao/active-prompt

Code is publicly available. Datasets are standard public benchmarks (GSM8K, etc.). The method relies on human annotation, which introduces a subjective component, though the authors state the annotator used 'minimum human engineering'.

📊 Experiments & Results

Evaluation Setup

Few-shot prompting on reasoning datasets

Benchmarks:

GSM8K (Arithmetic Reasoning)
ASDiv (Arithmetic Reasoning)
SVAMP (Arithmetic Reasoning)
AQuA (Arithmetic Reasoning)
SingleEq (Arithmetic Reasoning)
CSQA (Commonsense Reasoning)
StrategyQA (Commonsense Reasoning)
Letter (4), i.e. last-letter concatenation over 4 words (Symbolic Reasoning)

Metrics:

Exact Match Accuracy
Statistical methodology: Not explicitly reported in the paper

Key Results

Main comparison on code-davinci-002: Active-Prompt (using the Disagreement metric) outperforms standard Self-Consistency and Auto-CoT across datasets.

Benchmark	Metric	Baseline	This Paper	Δ
GSM8K	Accuracy (%)	60.1	65.6	+5.5
SVAMP	Accuracy (%)	76.4	80.4	+4.0
AQuA	Accuracy (%)	45.3	50.0	+4.7
CSQA	Accuracy (%)	73.5	76.2	+2.7
StrategyQA	Accuracy (%)	74.8	82.1	+7.3

Comparison of different uncertainty metrics on GSM8K using text-davinci-002.

Benchmark	Metric	Baseline	This Paper	Δ
GSM8K	Accuracy (%)	47.1	52.3	+5.2

Comparison on text-davinci-003, showing generalization across models.

Benchmark	Metric	Baseline	This Paper	Δ
GSM8K	Accuracy (%)	79.1	81.0	+1.9
SVAMP	Accuracy (%)	83.6	85.2	+1.6

Experiment Figures

Comparison of different uncertainty metrics (Disagreement, Entropy, Variance, Confidence) on GSM8K and SVAMP.

Main Takeaways

Active-Prompt consistently outperforms Random-CoT, confirming that judicious selection of examples based on uncertainty is superior to random sampling.
Uncertainty metrics (Disagreement, Entropy, Variance) perform similarly well, while self-confidence (asking the model if it's sure) performs poorly due to LLM overconfidence.
The method is effective across different model sizes and families (Code-Davinci, Text-Davinci, GPT-3.5).
Performance gains generally converge around a pool size of 1000 candidates and k=10 uncertainty samples.
Transferability: Exemplars selected from GSM8K work well for other arithmetic datasets (ASDiv, SVAMP, SingleEq).

📚 Prerequisite Knowledge

Prerequisites

Chain-of-Thought (CoT) prompting
In-context learning / Few-shot prompting
Active Learning (uncertainty sampling)
Self-consistency decoding

Key Terms

Chain-of-Thought (CoT): A prompting technique where the model is shown examples including intermediate reasoning steps leading to an answer

Active Learning: A machine learning paradigm where the algorithm chooses the data it learns from, typically by selecting the most uncertain or informative examples for annotation

Self-consistency: A decoding strategy where the model generates multiple reasoning paths and selects the most consistent final answer

Disagreement: An uncertainty metric calculated as the number of unique answers among k sampled reasoning paths, divided by k

Entropy: An uncertainty metric measuring the randomness of the distribution of predicted answers

Variance: An uncertainty metric for arithmetic tasks measuring the spread of numerical answers

Zero-shot prompting: Prompting the model without any examples, often using a trigger phrase like 'Let's think step by step'

In-context learning: The ability of LLMs to learn tasks from examples provided in the prompt without parameter updates