Iterative Reasoning Preference Optimization

📝 Paper Summary

Chain-of-Thought (CoT) Reasoning | Preference Optimization | Iterative Training

Iterative Reasoning Preference Optimization (Iterative RPO) improves LLM reasoning by repeatedly generating Chain-of-Thought candidates, constructing preference pairs based on answer correctness, and training with a combined DPO and NLL objective.

Core Problem

Standard iterative preference optimization methods (like Self-Rewarding LLMs or SPIN) improve general instruction following but often fail to improve, or even degrade, performance on complex reasoning tasks.

Why it matters:

Reasoning tasks require generating correct intermediate steps (Chain-of-Thought), which general alignment methods often overlook
Existing iterative methods for reasoning (like STaR) rely on Supervised Fine-Tuning (SFT), missing the signal provided by negative/incorrect reasoning paths
Verifying the correctness of reasoning steps is difficult without human annotation, limiting the scalability of methods that require step-by-step rewards

Concrete Example: In GSM8K, a model might generate a reasoning chain that leads to the wrong answer. Standard SFT only trains on the correct 'gold' chain. Iterative RPO uses the incorrect chain as a 'loser' in a preference pair against a correct 'winner', explicitly teaching the model what *not* to do.

Key Novelty

Iterative Reasoning Preference Optimization (Iterative RPO)

Generates multiple Chain-of-Thought (CoT) candidates per prompt using the current model
Constructs preference pairs where 'winners' result in the correct final answer (verified against gold labels) and 'losers' result in incorrect answers
Trains the next iteration's model using a specific loss combining DPO (Direct Preference Optimization) with a Negative Log-Likelihood (NLL) term on the winning response to prevent probability degradation

Architecture

The iterative training loop of Iterative RPO.

Evaluation Highlights

Improves Llama-2-70B-Chat zero-shot accuracy on GSM8K from 55.6% to 81.6% (greedy decoding)
Achieves 88.7% accuracy on GSM8K with majority voting (32 samples), up from 70.7% baseline
Increases accuracy on ARC-Challenge from 77.8% to 86.7% without using the supplementary ARC Corpus (the ARC-Challenge training questions are still used)

Breakthrough Assessment

8/10

Significant gains on established reasoning benchmarks (GSM8K, MATH) using only training set prompts. The improvements are substantial (+26.0 percentage points on GSM8K with greedy decoding), and the method is simpler than concurrent approaches that require separate reward models.

⚙️ Technical Details

Problem Definition

Setting: Iterative training of a generative language model on reasoning tasks using self-generated preference data

Inputs: A dataset of questions x and correct final answers y (gold reasoning steps c are available but used primarily for bootstrapping or fallback)

Outputs: A generated chain-of-thought c followed by a final answer y

Pipeline Flow

Generation: Model M_t generates N responses (CoT + Answer) for each training input
Reward Assignment: Check correctness of final answers against gold labels (Binary Reward)
Pair Construction: Create pairs (Winner, Loser) where Winner is correct and Loser is incorrect
Optimization: Train model M_{t+1} using DPO + NLL loss on the generated pairs
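
The four steps above can be sketched as a single loop. This is a minimal illustration, not the authors' implementation: `generate`, `is_correct`, and `train_dpo_nll` are hypothetical callables supplied by the caller, and pairing every winner with every loser is a simplifying assumption.

```python
import itertools
from typing import Any, Callable, List, Sequence, Tuple

Pair = Tuple[str, str, str]  # (prompt, winner, loser)


def build_pairs(
    prompt: str,
    responses: Sequence[str],
    gold: str,
    is_correct: Callable[[str, str], bool],
) -> List[Pair]:
    """Split responses by final-answer correctness and pair winners with losers."""
    winners = [r for r in responses if is_correct(r, gold)]
    losers = [r for r in responses if not is_correct(r, gold)]
    return [(prompt, w, l) for w, l in itertools.product(winners, losers)]


def iterative_rpo(
    model: Any,
    dataset: Sequence[Tuple[str, str]],  # (question, gold final answer)
    generate: Callable[[Any, str, int], List[str]],    # hypothetical sampler
    is_correct: Callable[[str, str], bool],            # hypothetical scorer
    train_dpo_nll: Callable[[Any, List[Pair]], Any],   # hypothetical trainer
    n_samples: int = 30,
    iterations: int = 4,
) -> Any:
    """Run generation, reward assignment, pair construction, and training in a loop."""
    for _ in range(iterations):
        pairs: List[Pair] = []
        for prompt, gold in dataset:
            # Generation: the current model M_t samples N responses (CoT + answer).
            responses = generate(model, prompt, n_samples)
            # Reward assignment and pair construction.
            pairs.extend(build_pairs(prompt, responses, gold, is_correct))
        # Optimization: the returned model becomes M_{t+1}.
        model = train_dpo_nll(model, pairs)
    return model
```

Passing the generator, scorer, and trainer in as arguments keeps the loop independent of any particular model library.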

System Modules

Generator

Generate reasoning chains and answers for training prompts

Model or implementation: Llama-2-70B-Chat (iteratively updated)

Scorer/Pair Constructor

Evaluate answers and form preference pairs

Model or implementation: Exact Match (Deterministic function)
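
A deterministic exact-match scorer might look like the sketch below. The `Final answer:` marker and the normalization rules are assumptions made for illustration; the summary does not specify the paper's answer format or matching rules.

```python
import re
from typing import Optional

# Assumed answer format: the last line of a response reads "Final answer: <value>".
ANSWER_RE = re.compile(r"Final answer:\s*(.+?)\s*$", re.IGNORECASE)


def extract_answer(response: str) -> Optional[str]:
    """Return the answer string from the last line, or None if the marker is missing."""
    lines = response.strip().splitlines()
    if not lines:
        return None
    match = ANSWER_RE.search(lines[-1])
    return match.group(1) if match else None


def normalize(answer: str) -> str:
    """Drop surrounding whitespace, a trailing period, and thousands separators."""
    return answer.strip().rstrip(".").replace(",", "").lower()


def reward(response: str, gold: str) -> int:
    """Binary reward: 1 if the extracted answer exactly matches the gold label, else 0."""
    predicted = extract_answer(response)
    return int(predicted is not None and normalize(predicted) == normalize(gold))
```

Responses with no extractable answer receive reward 0, so under this sketch they would count as losers during pair construction.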

Novel Architectural Elements

Integration of NLL loss explicitly into the DPO update step for iterative reasoning (Loss = L_DPO + alpha * L_NLL)
Iterative loop where the model acts as both generator and student, specifically optimizing for reasoning correctness via preference pairs rather than just SFT

Modeling

Base Model: Llama-2-70B-Chat

Training Method: Iterative DPO with NLL regularization

Objective Functions:

Purpose: Optimize preference for correct reasoning over incorrect reasoning while maintaining generation likelihood.

Formally: L(theta) = -log sigma(beta * log(pi_theta(yw|x)/pi_ref(yw|x)) - beta * log(pi_theta(yl|x)/pi_ref(yl|x))) - alpha * log pi_theta(yw|x)
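
The objective above can be evaluated for a single preference pair as follows. This sketch follows the formula exactly as written in this summary, taking summed sequence log-probabilities as inputs; the function and argument names are illustrative, and any length normalization of the NLL term is not modeled.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigma(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dpo_nll_loss(
    logp_w: float,      # log pi_theta(yw | x), summed over tokens
    logp_l: float,      # log pi_theta(yl | x)
    ref_logp_w: float,  # log pi_ref(yw | x)
    ref_logp_l: float,  # log pi_ref(yl | x)
    alpha: float = 1.0,
    beta: float = 0.1,
) -> float:
    """L = -log sigma(beta * winner log-ratio - beta * loser log-ratio) - alpha * log pi_theta(yw | x)."""
    margin = beta * (logp_w - ref_logp_w) - beta * (logp_l - ref_logp_l)
    dpo_term = -log_sigmoid(margin)
    nll_term = -logp_w
    return dpo_term + alpha * nll_term
```

When the policy equals the reference model, the margin is zero and the DPO term equals log 2, so the loss reduces to log 2 plus alpha times the NLL of the winner. Raising the winner's log-probability lowers both terms, which is how the NLL term counteracts a drop in the likelihood of correct chains.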

Training Data:

Uses only the training sets of GSM8K, MATH, ARC-Challenge
Generates N=30 candidates per prompt per iteration
Selects K=10 pairs per input
Approx 55-60k pairs for training per iteration
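
One way to select up to K pairs from the N sampled responses is shown below. Uniform sampling without replacement and the fixed seed are assumptions; the summary does not say how the paper chooses pairs. Prompts whose samples are all correct or all incorrect yield no pairs in this sketch, and the gold-chain fallback mentioned in the problem definition is not modeled.

```python
import random
from typing import List, Sequence, Tuple


def sample_pairs(
    winners: Sequence[str],
    losers: Sequence[str],
    k: int = 10,
    seed: int = 0,
) -> List[Tuple[str, str]]:
    """Sample up to k distinct (winner, loser) pairs for one prompt."""
    if not winners or not losers:
        return []  # no preference signal for this prompt
    rng = random.Random(seed)
    all_pairs = [(w, l) for w in winners for l in losers]
    return rng.sample(all_pairs, min(k, len(all_pairs)))
```

For example, a prompt with 12 correct and 18 incorrect samples out of N=30 has 216 candidate pairs, from which 10 are kept.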

Key Hyperparameters:

learning_rate: 7e-7
batch_size: 16
optimizer: AdamW
alpha: 1.0 (coefficient for NLL term)
beta: 0.1 (coefficient in DPO loss)
iterations: 4
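
The values listed above can be collected in a small configuration object. The class and field names are hypothetical; only the values come from this summary.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class IterativeRPOConfig:
    """Hyperparameters as reported in this summary (names are illustrative)."""
    learning_rate: float = 7e-7
    batch_size: int = 16
    optimizer: str = "AdamW"
    alpha: float = 1.0          # coefficient for the NLL term
    beta: float = 0.1           # coefficient in the DPO loss
    iterations: int = 4
    num_candidates: int = 30    # N responses sampled per prompt
    pairs_per_input: int = 10   # K preference pairs kept per prompt


config = IterativeRPOConfig()
```

Calling `asdict(config)` gives a plain dictionary that can be logged alongside each training run.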

Compute: Generation: One node with 8 V100 GPUs (32G). Training: Eight nodes each with 8 A100 GPUs (80G).

Comparison to Prior Work

vs. STaR: Uses preference optimization (winners vs. losers) instead of just SFT on winners; leverages negative examples.
vs. Self-Rewarding LLMs: Uses ground-truth labels for answer verification instead of an LLM-as-a-judge; incorporates NLL loss term essential for reasoning stability.
vs. Standard DPO: Applies iteratively; adds NLL loss term to prevent degradation seen in reasoning tasks.
vs. V-STaR: Trains the generative model directly via preference optimization rather than training a separate verifier model.

Limitations

Relies on the availability of gold answers for final verification (cannot be easily applied where ground truth is unknown)
Performance gains saturate after a few iterations (diminishing returns observed by iteration 4)
Computational cost increases with iterations due to repeated generation and training cycles
Limited to tasks where final answer correctness is a good proxy for reasoning quality (binary reward)

Reproducibility

Code availability is not explicitly stated in the paper text. Hyperparameters are detailed (learning rate, batch size, alpha, beta). The paper points to its Subsection B.1 for the prompts. Datasets are standard public benchmarks.

📊 Experiments & Results

Evaluation Setup

Zero-shot Chain-of-Thought reasoning on math and science datasets

Benchmarks:

GSM8K (Grade school math word problems)
MATH (Challenging mathematics problems)
ARC-Challenge (Science question answering)

Metrics:

Accuracy (Exact Match)
Majority Voting Accuracy (@32 samples)
Statistical methodology: Not explicitly reported in the paper

Key Results

Main comparison on GSM8K shows Iterative RPO significantly outperforming baselines including Zero-Shot CoT, SFT, and standard DPO.

Benchmark	Metric	Baseline (%)	This Paper (%)	Δ (points)
GSM8K	Accuracy (Greedy)	55.6	81.6	+26.0
GSM8K	Accuracy (Greedy)	63.5	81.6	+18.1
GSM8K	Accuracy (Greedy)	61.8	73.1	+11.3
GSM8K	Accuracy (Maj@32)	70.7	88.7	+18.0
ARC-Challenge	Accuracy	77.8	86.7	+8.9
MATH	Accuracy (Greedy)	12.5	20.8	+8.3

Ablation studies demonstrate the critical role of the NLL loss term in the training objective.

Benchmark	Metric	Baseline (%)	This Paper (%)	Δ (points)
GSM8K	Accuracy (Greedy)	61.8	73.1	+11.3

Experiment Figures

Log-probability changes of chosen and rejected sequences during training for SFT vs. DPO vs. DPO+NLL.

Main Takeaways

Iterative training is effective: Performance improves consistently across iterations (e.g., GSM8K: 73.1% -> 78.0% -> 81.1% -> 81.6%), though gains saturate.
Negative examples matter: Preference optimization (using losers) outperforms SFT-only training (STaR-like approaches), which uses only positive examples.
NLL loss is crucial: Adding a Negative Log-Likelihood term on the winning response to the DPO objective keeps the probability of correct reasoning chains from degrading during training, which is essential for reasoning tasks.
Data quantity vs. Iteration: Two iterations of training are more effective than one iteration with double the data, suggesting the model update step is key to generating better training signals.
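
The saturation noted in the first takeaway can be made concrete by computing per-iteration gains from the GSM8K greedy accuracies quoted in this summary (zero-shot baseline, then iterations 1 to 4). This is plain arithmetic on the reported numbers, not a new result.

```python
# GSM8K greedy accuracy (%): zero-shot baseline followed by iterations 1-4.
accuracies = [55.6, 73.1, 78.0, 81.1, 81.6]

# Gain in percentage points contributed by each successive iteration.
gains = [round(after - before, 1) for before, after in zip(accuracies, accuracies[1:])]
total_gain = round(accuracies[-1] - accuracies[0], 1)

print(gains)       # [17.5, 4.9, 3.1, 0.5]
print(total_gain)  # 26.0
```

Most of the 26.0-point improvement arrives in the first iteration, and the fourth adds only 0.5 points.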

📚 Prerequisite Knowledge

Prerequisites

Understanding of Chain-of-Thought (CoT) prompting
Familiarity with Reinforcement Learning from Human Feedback (RLHF) concepts
Knowledge of Direct Preference Optimization (DPO)

Key Terms

CoT: Chain-of-Thought—a prompting technique where the model generates intermediate reasoning steps before the final answer

DPO: Direct Preference Optimization—an algorithm that optimizes a language model to adhere to preferences without explicitly training a reward model

NLL: Negative Log-Likelihood—a standard loss function used in language modeling that minimizes the negative log-probability of the correct token sequence

SFT: Supervised Fine-Tuning—training a model on labeled examples (input-output pairs) using standard log-likelihood maximization

Iterative RPO: The authors' proposed method: Iterative Reasoning Preference Optimization

STaR: Self-Taught Reasoner, a prior method that iteratively fine-tunes a model on its own correct reasoning generations using SFT

GSM8K: A benchmark dataset of high-quality grade school math word problems

Majority Voting: A decoding strategy where the model generates multiple solutions and the most frequent final answer is selected
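
Majority voting as defined above can be sketched in a few lines. The function name and the choice to ignore samples with no extractable answer (represented as `None`) are assumptions for illustration.

```python
from collections import Counter
from typing import Optional, Sequence


def majority_vote(answers: Sequence[Optional[str]]) -> Optional[str]:
    """Return the most frequent final answer among sampled solutions."""
    valid = [a for a in answers if a is not None]  # skip samples with no answer
    if not valid:
        return None
    return Counter(valid).most_common(1)[0][0]
```

With 32 samples per question, the list passed in would hold the 32 extracted final answers, and the returned value is compared against the gold label.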