STaR: Bootstrapping Reasoning With Reasoning

📝 Paper Summary

Chain-of-Thought Reasoning | Self-Improvement / Bootstrapping | Reasoning Dataset Generation

STaR iteratively bootstraps a language model's reasoning ability by generating rationales, filtering for correct answers, and retraining on those successful rationales, using hint-based rationalization for failed problems.

Core Problem

Generating high-quality reasoning (rationales) usually requires massive human-annotated datasets or relies on few-shot prompting, which sacrifices accuracy compared to fine-tuning.

Why it matters:

Manual rationale annotation is expensive and scales poorly to new domains
Template-based generation only works when general solutions are already known
Few-shot prompting with rationales underperforms models fine-tuned on large datasets, creating a gap between few-shot inference and fully supervised learning

Concrete Example: In arithmetic, a model might fail to sum two numbers because it cannot generate the intermediate 'scratchpad' steps correctly. Training on answers alone, without rationales, fails to generalize. Few-shot prompting helps, but it is limited by the context window and falls short of fine-tuned performance.

Key Novelty

Self-Taught Reasoner (STaR)

Iterative Loop: The model generates its own training data by attempting to solve problems with rationales; only rationales leading to correct final answers are kept for fine-tuning.
Rationalization: For problems the model initially fails to solve, it is given the correct answer as a hint and asked to generate a rationale that leads to that answer. The resulting rationales are added to the training data with the hint removed, so the model learns to solve hard problems without seeing the answer.

Architecture

The STaR loop: generating rationales, filtering for correctness, fine-tuning, and the rationalization loop where answers are provided as hints for failed questions.

Evaluation Highlights

+12.5% accuracy improvement on CommonsenseQA compared to a GPT-J baseline fine-tuned to directly predict answers
Performance comparable to a 30x larger GPT-3 model (72.5% vs 73.0%) on CommonsenseQA
Improves 2-digit addition accuracy from <1% to 32% in a single iteration using rationalization

Breakthrough Assessment

8/10

A significant methodology for self-improving reasoning without large human-labeled datasets. The introduction of 'rationalization' to learn from failures is a clever and effective contribution.

⚙️ Technical Details

Problem Definition

Setting: Iterative self-supervised fine-tuning of a Large Language Model (LLM) on reasoning tasks

Inputs: A dataset of questions x and answers y, and a small seed set of few-shot examples with rationales

Outputs: An enhanced LLM capable of generating intermediate rationales r followed by correct answers y

Pipeline Flow

Rationale Generation (Attempt to solve problems)
Filtering (Keep rationales yielding correct answers)
Rationalization (Retry failed problems with answer hints)
Fine-tuning (Train on combined successful rationales)
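
The four stages above can be sketched as a short Python loop. This is a minimal illustration under stated assumptions, not the authors' implementation: `generate`, `make_generator`, and `fine_tune` are hypothetical callables standing in for the model, and exact string equality stands in for the correctness check.

```python
from typing import Callable, List, Optional, Tuple

Example = Tuple[str, str]        # (question, gold answer)
Generation = Tuple[str, str]     # (rationale, predicted answer)
TrainRow = Tuple[str, str, str]  # (question, rationale, answer)


def star_iteration(
    dataset: List[Example],
    generate: Callable[[str, Optional[str]], Generation],
    use_rationalization: bool = True,
) -> List[TrainRow]:
    """Collect the training rows for one outer-loop iteration.

    `generate(question, hint)` is a hypothetical wrapper around the current
    model: hint=None means ordinary rationale generation, and hint=gold
    means rationalization with the answer supplied as a hint.
    """
    rows: List[TrainRow] = []
    for question, gold in dataset:
        # Stage 1: attempt the problem without any hint.
        rationale, predicted = generate(question, None)
        # Stage 2: keep the rationale only if the final answer is correct.
        if predicted == gold:
            rows.append((question, rationale, gold))
            continue
        # Stage 3: retry failed problems with the gold answer as a hint.
        if use_rationalization:
            rationale, predicted = generate(question, gold)
            if predicted == gold:
                # The hint is not stored, so training sees a hint-free row.
                rows.append((question, rationale, gold))
    return rows


def star(dataset, make_generator, fine_tune, base_model, n_iterations):
    """Run the outer loop; each fine-tune restarts from the base model."""
    model = base_model
    for _ in range(n_iterations):
        rows = star_iteration(dataset, make_generator(model))
        # Stage 4: fine-tune the original pre-trained model on the new rows.
        model = fine_tune(base_model, rows)
    return model
```

Passing `base_model` rather than `model` to `fine_tune` reflects the summary's note that parameters are re-initialized from pre-trained GPT-J at each outer-loop iteration.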

System Modules

Rationale Generator (Generation)

Generates reasoning steps and answers for the dataset using the current model state

Model or implementation: GPT-J (6B parameters)

Rationalizer (Generation)

Generates rationales for problems the model failed to solve, using the correct answer as a hint

Model or implementation: GPT-J (6B parameters)

Filter

Selects only rationales that lead to the correct ground-truth answer

Model or implementation: Deterministic comparison
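
A deterministic filter of this kind might look like the sketch below. The answer phrasing it parses ("the answer is (x)") and the choice labels a to e are assumptions made for illustration; the names `extract_answer` and `keep_rationale` are hypothetical.

```python
import re
from typing import Optional

# Assumed answer phrasing for a multiple-choice task with labels a-e.
ANSWER_PATTERN = re.compile(r"answer is \(([a-e])\)", re.IGNORECASE)


def extract_answer(generation: str) -> Optional[str]:
    """Return the last answer label stated in the text, or None if absent."""
    matches = ANSWER_PATTERN.findall(generation)
    return matches[-1].lower() if matches else None


def keep_rationale(generation: str, gold_label: str) -> bool:
    """Keep a generation only if its final answer equals the gold label."""
    predicted = extract_answer(generation)
    return predicted is not None and predicted == gold_label.lower()
```

Note that the check looks only at the final answer, so it cannot tell sound reasoning from flawed reasoning that happens to reach the right label.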

Novel Architectural Elements

The 'Rationalization' loop: feeding ground truth answers as hints to generate training data for hard examples, then removing the hint for fine-tuning
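
The hint-then-strip idea can be illustrated with a small prompt builder. The layout, the ` (CORRECT)` marker, the closing sentence, and the example question are illustrative assumptions rather than the paper's exact prompts (those are in its Appendix B); `format_question` and `to_training_text` are hypothetical helpers.

```python
from typing import Dict, Optional

HINT_MARKER = " (CORRECT)"  # assumed way of marking the hinted choice


def format_question(
    question: str,
    choices: Dict[str, str],
    hint_label: Optional[str] = None,
) -> str:
    """Render a multiple-choice question, optionally marking the gold choice."""
    lines = [f"Q: {question}", "Answer Choices:"]
    for label, text in choices.items():
        suffix = HINT_MARKER if label == hint_label else ""
        lines.append(f"({label}) {text}{suffix}")
    lines.append("A:")
    return "\n".join(lines)


def to_training_text(
    question: str,
    choices: Dict[str, str],
    rationale: str,
    answer_label: str,
) -> str:
    """Build the fine-tuning text; the prompt is rendered without the hint."""
    prompt = format_question(question, choices)  # hint_label deliberately omitted
    return f"{prompt} {rationale} Therefore, the answer is ({answer_label})."
```

For example, `format_question("Where would you keep milk cold?", {"a": "fridge", "b": "oven"}, hint_label="a")` produces a prompt whose first choice reads `(a) fridge (CORRECT)`, while `to_training_text` for the same question contains no marker.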

Modeling

Base Model: GPT-J (6B)

Training Method: Iterative Supervised Fine-Tuning (SFT) on self-generated data

Objective Functions:

Purpose: Maximize likelihood of generating the correct rationale and answer.

Formally: Standard language modeling loss (cross-entropy) on filtered rationales.
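
One way to write this down, reusing x, r, and y from the Problem Definition. The symbols M (the model), hats for sampled values, and D (the filtered set of kept rationales) are notation introduced here. The first two lines express the policy-gradient view in which the correctness indicator acts as the reward; the third is the cross-entropy loss actually minimized on the filtered data. Per-token normalization and batching details are left out.

```latex
% Expected correctness of sampled rationale/answer pairs
J(M, X, Y) = \sum_i \mathbb{E}_{\hat{r}_i, \hat{y}_i \sim p_M(\cdot \mid x_i)}
             \, \mathbb{1}(\hat{y}_i = y_i)

% Its gradient: the indicator zeroes out samples with a wrong final answer
\nabla J(M, X, Y) = \sum_i \mathbb{E}_{\hat{r}_i, \hat{y}_i \sim p_M(\cdot \mid x_i)}
             \left[ \mathbb{1}(\hat{y}_i = y_i) \,
             \nabla \log p_M(\hat{y}_i, \hat{r}_i \mid x_i) \right]

% Supervised loss on the filtered set D of (question, rationale, answer) triples
\mathcal{L}(M) = - \sum_{(x, r, y) \in \mathcal{D}} \log p_M(r, y \mid x)
```

Because the indicator discards every sample with a wrong answer, filtering followed by supervised fine-tuning can be read as an approximation to following this gradient.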

Adaptation: Full fine-tuning

Trainable Parameters: All parameters (re-initialized from pre-trained GPT-J at each outer loop iteration)

Training Data:

Arithmetic: 50,000 random questions; CommonsenseQA: 9,741 train questions; GSM8K: 7,473 train questions
Augmented iteratively with self-generated rationales

Key Hyperparameters:

learning_rate_warmup: 100 steps
outer_loop_iterations: Up to ~40 (varies by task)
initial_training_steps: 40 (increasing by 20% per loop)
batch_size: Not reported in the paper
few_shot_prompt_size: 10 examples
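
The step schedule implied by `initial_training_steps` can be worked out with a few lines of Python. This is a sketch: compounding from the initial value and rounding to the nearest integer are assumptions, since the summary states only the starting value and the 20% growth per loop.

```python
def training_steps(iteration: int, initial_steps: int = 40, growth: float = 0.20) -> int:
    """Fine-tuning steps for a given outer-loop iteration (0-indexed).

    Assumes geometric growth from `initial_steps` with nearest-integer rounding.
    """
    return round(initial_steps * (1.0 + growth) ** iteration)


# First few iterations of the schedule under these assumptions.
schedule = [training_steps(i) for i in range(4)]  # [40, 48, 58, 69]
```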

Compute: Google TPU Research Cloud (specific TPU count not reported)

Comparison to Prior Work

vs. Chain of Thought: STaR fine-tunes the model to internalize reasoning, rather than relying solely on prompting
vs. Scratchpads: STaR generates its own training data from a few examples, rather than requiring a massive pre-existing dataset
vs. Expert Iteration: STaR uses the ground truth answer as the 'expert' signal via rationalization, rather than a separate expert model or value function

Limitations

Requires few-shot performance to be better than random chance to start the bootstrapping process
Relies on the final answer being a proxy for rationale correctness (can learn false reasoning that accidentally gets the right answer)
Not applicable to tasks where chance performance is high (e.g., binary classification) without better filtering
Rationalization requires a way to validly 'hint' the answer, which may be non-trivial for some tasks
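
The chance-performance limitation can be made concrete with a toy model. This is an illustrative assumption, not an analysis from the paper: suppose the model genuinely solves a fraction s of problems and guesses uniformly among k answer choices on the rest.

```latex
% Probability that a guessed answer passes the correctness filter
P(\text{pass} \mid \text{guess}) = \frac{1}{k}

% Share of kept rationales that passed by luck rather than by sound reasoning
\text{lucky share} = \frac{(1 - s)/k}{s + (1 - s)/k}

% Worked values for s = 0.3:
%   k = 2 (binary task):      0.35 / 0.65 \approx 0.54
%   k = 5 (five-way choice):  0.14 / 0.44 \approx 0.32
```

Under these assumptions, more than half of the kept rationales on a binary task would come from lucky guesses, compared with about a third on a five-way task such as CommonsenseQA.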

Reproducibility

Code: https://github.com/kingoflolz/mesh-transformer-jax

The linked repository contains the publicly available GPT-J fine-tuning code. The STaR logic itself (the loop and filtering) is described algorithmically in the paper. Specific prompts for CommonsenseQA are provided in Appendix B. Some hyperparameters, such as batch size, are not reported.

📊 Experiments & Results

Evaluation Setup

Evaluation on symbolic (Arithmetic) and natural language reasoning (CommonsenseQA, GSM8K)

Benchmarks:

Arithmetic (n-digit addition)
CommonsenseQA (CQA) (Multiple-choice commonsense reasoning)
GSM8K (Grade school math word problems)

Metrics:

Accuracy
Statistical methodology: p-values reported for human evaluation of rationales; no explicit significance tests for model performance differences

Key Results

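Benchmark	Metric	Baseline System	Baseline	This Paper	Δ
CommonsenseQA (Dev Set)	Accuracy (%)	GPT-J, fine-tuned to predict answers directly	60.0	72.5	+12.5
CommonsenseQA (Dev Set)	Accuracy (%)	GPT-J, few-shot chain-of-thought	36.6	72.5	+35.9
CommonsenseQA (Dev Set)	Accuracy (%)	GPT-3, fine-tuned to predict answers directly	73.0	72.5	-0.5
GSM8K	Accuracy (%)	GPT-J, fine-tuned to predict answers directly	5.8	10.7	+4.9
CommonsenseQA (Dev Set)	Accuracy (%)	STaR without rationalization	68.8	72.5	+3.7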

Experiment Figures

Accuracy on arithmetic (summation) over iterations, comparing STaR with and without rationalization.

Comparison of calculator steps generated by the model vs. human ground truth on GSM8K.

Main Takeaways

Rationalization allows the model to learn from problems it initially got wrong by 'thinking backward' from the answer.
Including few-shot prompts during fine-tuning prevents 'drift' where the model's rationales diverge from the desired format.
STaR can fail by learning 'red herring' reasoning or logical fallacies if they happen to lead to the correct answer.
In arithmetic, rationalization enables learning multiple digit lengths simultaneously, whereas standard STaR learns them stagewise.

📚 Prerequisite Knowledge

Prerequisites

Large Language Models (LLMs) and fine-tuning
Chain-of-Thought (CoT) prompting
Reinforcement Learning (policy gradient concepts)

Key Terms

rationale generation: The process where a model generates intermediate reasoning steps (a chain of thought) before producing a final answer

rationalization: A technique where the model is provided with the ground-truth answer as a hint to help it generate a valid rationale for a problem it initially failed to solve

scratchpad: A specific format of intermediate reasoning used in arithmetic tasks, where the model writes out calculation steps

Policy Gradient: An optimization technique in Reinforcement Learning where the model's parameters are updated to maximize the expected reward (here, approximated by filtering for correct answers)

STaR: Self-Taught Reasoner—the proposed method of iterative bootstrapping and rationalization

ConceptNet: A semantic graph of concepts and relationships used to construct the CommonsenseQA dataset

chain-of-thought: A prompting technique where the model is encouraged to produce a series of intermediate reasoning steps