Evaluation Setup
Few-shot prompting on 10 reasoning datasets across arithmetic, commonsense, and symbolic tasks.
Benchmarks (representative subset of the 10):
- GSM8K (Arithmetic Reasoning)
- CSQA (Commonsense Reasoning)
- Letter Concatenation (Symbolic Reasoning)
- AQuA (Arithmetic Reasoning)
- SVAMP (Arithmetic Reasoning)
Metrics:
- Exact Match Accuracy
- Statistical methodology: Not explicitly reported in the paper
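Exact match compares the model's final extracted answer string against the reference. A minimal sketch, assuming whitespace-stripping as the only normalization (the paper's exact normalization rules are not specified here):

```python
def exact_match_accuracy(predictions: list[str], references: list[str]) -> float:
    """Fraction of predictions that exactly equal their reference
    after stripping leading/trailing whitespace."""
    assert len(predictions) == len(references)
    matches = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return matches / len(references)

print(exact_match_accuracy(["80.8", "42", "no"], ["80.8", "42", "yes"]))  # -> 0.666...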
Key Results
Iter-CoT outperforms baselines on arithmetic reasoning tasks using GPT-3.5-turbo.

| Benchmark              | Metric   | Baseline | This Paper | Δ     |
|------------------------|----------|----------|------------|-------|
| GSM8K                  | Accuracy | 77.6     | 80.8       | +3.2  |
| GSM8K                  | Accuracy | 77.5     | 80.8       | +3.3  |
| AQuA                   | Accuracy | 60.6     | 68.5       | +7.9  |
| Average (10 datasets)  | Accuracy | 77.7     | 81.5       | +3.8  |

Ablation studies confirm the necessity of the bootstrapping (correction) and summarization phases.

| Benchmark | Metric   | Baseline | This Paper | Δ     |
|-----------|----------|----------|------------|-------|
| GSM8K     | Accuracy | 68.4     | 80.8       | +12.4 |
| GSM8K     | Accuracy | 78.3     | 80.8       | +2.5  |
Main Takeaways
- Selecting 'challenging yet answerable' questions (those the model initially fails but can correct) creates better demonstrations than random or purely complex selection.
- Iterative self-correction combined with summarization produces cleaner, more robust reasoning chains than single-pass generation.
- The method generalizes well across model sizes (Llama-2-70B to GPT-4) and task types (Arithmetic, Commonsense, Symbolic).
- Even without ground truth labels (using GPT-4 as a judge), Iter-CoT achieves performance competitive with the labeled version.
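The 'challenging yet answerable' selection in the first takeaway can be sketched as a filter: keep only questions the model initially answers wrong but manages to correct within a bounded number of revision rounds. A minimal sketch, where `initial_answer` and `revise_answer` are hypothetical stand-ins for the actual LLM calls in Iter-CoT:

```python
def select_demonstrations(questions, gold, initial_answer, revise_answer, max_rounds=3):
    """Keep questions the model first gets wrong but self-corrects within
    max_rounds ('challenging yet answerable'); drop questions that are
    trivially easy (right on the first try) or never corrected."""
    demos = []
    for q in questions:
        ans = initial_answer(q)
        if ans == gold[q]:
            continue  # too easy: solved without correction
        for _ in range(max_rounds):
            ans = revise_answer(q, ans)  # bootstrapping: model revises its own answer
            if ans == gold[q]:
                demos.append(q)  # challenging yet answerable
                break
        # questions never corrected within max_rounds are discarded
    return demos
```

In the actual method the surviving questions' corrected reasoning chains are then summarized into the final few-shot demonstrations; this sketch only captures the selection criterion.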