Chain of Preference Optimization: Improving Chain-of-Thought Reasoning in LLMs

📝 Paper Summary

Reasoning in Large Language Models
Alignment / Preference Optimization

CPO fine-tunes LLMs to internalize the reasoning paths found by Tree-of-Thought (ToT) search by applying Direct Preference Optimization (DPO) to intermediate reasoning steps, improving Chain-of-Thought performance without ToT's inference-time search cost.

Core Problem

Standard Chain-of-Thought (CoT) decoding follows a single reasoning path and often misses better ones, while Tree-of-Thought (ToT) finds better paths but is too computationally expensive for practical inference.

Why it matters:

ToT improves reasoning quality significantly but increases inference time by over 50x, making it impractical for real-time applications.
Existing distillation methods only train on the final 'best' path, ignoring the valuable negative feedback (bad branches) explored during tree search.
Current approaches overlook the preference information inherent in the tree structure: knowing which intermediate thoughts were discarded is as important as knowing which were kept.

Concrete Example: In an arithmetic reasoning task, CoT might pursue a single incorrect calculation path. ToT would explore multiple branches, discard the incorrect calculation, and find the right one. Standard fine-tuning would just show the model the correct path. CPO explicitly teaches the model to prefer the correct step *over* the incorrect one at that specific junction.

Key Novelty

Chain of Preference Optimization (CPO)

Constructs step-wise preference pairs from Tree-of-Thought search logs: thoughts in the final successful path are 'preferred' (winners), while rejected alternative branches at the same step are 'dispreferred' (losers).
Applies Direct Preference Optimization (DPO) sequentially at each reasoning step, teaching the model not just *what* to think, but *which* thought to choose over alternatives.
Allows the model to generate ToT-quality reasoning paths using standard greedy CoT decoding at inference time, effectively distilling the search tree into the model weights.

Architecture

The CPO pipeline: Synthesizing preference thoughts from ToT and training via DPO.

Evaluation Highlights

Achieves an average accuracy improvement of up to 4.3% compared to base models (LLaMA-2-7B/13B, Mistral-7B) across seven reasoning datasets.
Matches or outperforms the heavy Tree-of-Thought (ToT) method while being >50x faster during inference (standard CoT decoding vs. tree search).
Outperforms standard Supervised Fine-Tuning (SFT) and Rejection Sampling Fine-Tuning (RFT) on complex tasks like GSM8K and StrategyQA.

Breakthrough Assessment

7/10

Cleverly combines ToT and DPO to solve the efficiency-performance trade-off in reasoning. It effectively extracts more signal from search trees (negatives) than prior distillation work.

⚙️ Technical Details

Problem Definition

Setting: Multi-step reasoning where an LLM generates a sequence of thoughts z_1...z_n to reach an answer y for input x.

Inputs: Input question x

Outputs: Final answer y (via a chain of thoughts)

Pipeline Flow

Data Generation (Training Phase): ToT Search -> Preference Pair Construction
Training (Training Phase): DPO Fine-tuning on pairs
Inference (Deployment): Standard CoT Decoding

System Modules

ToT Generator (Training only) (Data Construction)

Generates k candidate thoughts for the next step given current state

Model or implementation: Base LLM (e.g., LLaMA-2-7B)

ToT Evaluator (Training only) (Data Construction)

Scores candidate thoughts as 'likely' (10) or 'impossible' (1)

Model or implementation: Base LLM (self-evaluation)
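
The generator and evaluator modules above can be combined into a breadth-first search loop. The following is a minimal sketch, not the authors' implementation: `propose` and `score` are hypothetical callables standing in for the two LLM calls, and `depth` and `beam` are assumed parameters. The search log keeps the pruned candidates as well as the kept ones, since those pruned siblings are what CPO later uses as dispreferred thoughts.

```python
from typing import Callable, List, Tuple

# Hypothetical stand-ins for the two LLM calls described above.
Propose = Callable[[str, List[str]], List[str]]  # (question, thoughts so far) -> candidates
Score = Callable[[str, List[str], str], float]   # (question, thoughts so far, candidate) -> score


def tot_bfs(question: str, propose: Propose, score: Score,
            depth: int, beam: int) -> Tuple[List[str], List[dict]]:
    """Breadth-first ToT search that logs every expansion.

    Returns the highest-ranked path and a log of
    {prefix, thought, score, kept} records.
    """
    frontier: List[List[str]] = [[]]  # each entry is a partial chain of thoughts
    log: List[dict] = []
    for _ in range(depth):
        scored = []
        for prefix in frontier:
            for cand in propose(question, prefix):
                scored.append((score(question, prefix, cand), prefix, cand))
        if not scored:
            break
        # Keep the `beam` highest-scoring candidates; the rest are pruned.
        scored.sort(key=lambda t: t[0], reverse=True)
        kept = scored[:beam]
        kept_ids = {id(t) for t in kept}
        for t in scored:
            log.append({"prefix": list(t[1]), "thought": t[2],
                        "score": t[0], "kept": id(t) in kept_ids})
        frontier = [prefix + [cand] for _, prefix, cand in kept]
    return frontier[0], log


# Toy usage: the scorer mimics the 'likely' (10) / 'impossible' (1) labels.
def toy_propose(question, prefix):
    step = len(prefix) + 1
    return [f"step{step}-good", f"step{step}-bad"]


def toy_score(question, prefix, cand):
    return 10.0 if cand.endswith("good") else 1.0


best_path, search_log = tot_bfs("q", toy_propose, toy_score, depth=2, beam=1)
print(best_path)
```

In a real setup both callables would prompt the base LLM; here they are deterministic so the control flow is easy to follow.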

CPO Trainer

Updates model weights to increase likelihood of 'winning' thoughts over 'losing' thoughts

Model or implementation: Target LLM

Inference Engine

Solves problems using simple greedy decoding (Chain-of-Thought)

Model or implementation: Fine-tuned LLM

Novel Architectural Elements

Step-wise preference construction: Instead of preferring whole paths, preferences are constructed at every reasoning node (z_winner vs z_loser at step i).
Integration of ToT structure into DPO: Using the discarded branches of BFS as explicit negative samples for optimization.

Modeling

Base Model: LLaMA-2-7B, LLaMA-2-13B, Mistral-7B

Training Method: Direct Preference Optimization (DPO)

Objective Functions:

Purpose: Increase the probability of the preferred thought and decrease that of the dispreferred thought, with both measured relative to a reference model that acts as a regularizer.

Formally: L_DPO = -E [log σ(β log(π_θ(z_w|x, s)/π_ref(z_w|x, s)) - β log(π_θ(z_l|x, s)/π_ref(z_l|x, s)))]
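
To make the objective concrete, here is a minimal sketch of the per-pair loss, assuming the summed log-probabilities of the preferred thought z_w and the dispreferred thought z_l under the policy and the reference model have already been computed. The function name and argument names are illustrative, not taken from the paper's code.

```python
import math


def dpo_step_loss(logp_w: float, logp_l: float,
                  ref_logp_w: float, ref_logp_l: float,
                  beta: float = 0.1) -> float:
    """Per-pair DPO loss: -log sigmoid(beta * (winner log-ratio - loser log-ratio))."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log(sigmoid(m)) equals log(1 + exp(-m)); branch to avoid overflow.
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# When the policy equals the reference, the margin is 0 and the loss is log 2.
loss_at_init = dpo_step_loss(-5.0, -7.0, -5.0, -7.0)
# Raising the winner's log-probability relative to the reference lowers the loss.
loss_after = dpo_step_loss(-4.0, -7.0, -5.0, -7.0)
print(loss_at_init, loss_after)
```

Averaging this quantity over all step-level pairs gives the expectation in the formula above.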

Adaptation: Full fine-tuning (implied by lack of LoRA mention)

Training Data:

Generated using ToT search (BFS) on training sets of 7 datasets (GSM8K, StrategyQA, etc.)
Winning thoughts: Those in the final selected path
Losing thoughts: Child nodes of the winning path's parents that were NOT selected
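
The winner/loser rule above can be sketched as a small function over a ToT trace. This is an illustration under assumptions: the trace layout (`candidates` and `chosen` per step), the output keys, and the example question are all hypothetical and do not come from the paper's code.

```python
from typing import Dict, List


def build_preference_pairs(question: str, steps: List[Dict]) -> List[Dict[str, str]]:
    """Turn a ToT trace into step-level (prompt, chosen, rejected) records.

    steps[i] holds the candidates generated at depth i from the winning
    path's node and the candidate that ended up on the final selected path.
    """
    pairs: List[Dict[str, str]] = []
    state: List[str] = []  # winning thoughts accumulated so far
    for step in steps:
        winner = step["chosen"]
        prompt = "\n".join([question] + state)
        for cand in step["candidates"]:
            if cand != winner:
                # Every unselected sibling becomes one dispreferred example.
                pairs.append({"prompt": prompt, "chosen": winner, "rejected": cand})
        state.append(winner)
    return pairs


# Made-up two-step trace for illustration.
trace = [
    {"candidates": ["3 + 4 = 7", "3 + 4 = 8"], "chosen": "3 + 4 = 7"},
    {"candidates": ["7 * 2 = 14", "7 * 2 = 15", "7 + 2 = 9"], "chosen": "7 * 2 = 14"},
]
pairs = build_preference_pairs("What is (3 + 4) * 2?", trace)
for p in pairs:
    print(repr(p["prompt"]), "|", p["chosen"], ">", p["rejected"])
```

Note that each pair shares the same prompt (the question plus the winning thoughts so far), so the preference is expressed at a single junction rather than over whole paths.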

Key Hyperparameters:

beta: 0.1 to 0.5 (depending on dataset)
learning_rate: 5e-6 to 1e-5
batch_size: 8 to 16
epochs: 1 to 3
max_length: 1024 to 2048
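
As a rough illustration of how these settings might be carried in code, the sketch below stores the ranges reported above and checks a configuration against them. The class name, field defaults, and helper method are assumptions for illustration: the defaults are arbitrary picks inside the reported ranges, not per-dataset values confirmed by the paper.

```python
from dataclasses import dataclass
from typing import List

# Ranges as reported in this summary (inclusive bounds).
RANGES = {
    "beta": (0.1, 0.5),
    "learning_rate": (5e-6, 1e-5),
    "batch_size": (8, 16),
    "epochs": (1, 3),
    "max_length": (1024, 2048),
}


@dataclass
class CPOConfig:
    # Arbitrary in-range defaults; the paper's appendix lists per-dataset values.
    beta: float = 0.1
    learning_rate: float = 5e-6
    batch_size: int = 8
    epochs: int = 1
    max_length: int = 1024

    def out_of_range(self) -> List[str]:
        """Return the names of fields that fall outside the reported ranges."""
        return [name for name, (lo, hi) in RANGES.items()
                if not lo <= getattr(self, name) <= hi]


print(CPOConfig().out_of_range())
print(CPOConfig(beta=1.0).out_of_range())
```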

Compute: At inference time, ToT takes more than 50x longer than CPO, whose cost equals that of standard CoT decoding.

Comparison to Prior Work

vs. ToT: CPO distills ToT into the model weights, removing the need for search at inference time.
vs. SFT/RFT: CPO utilizes negative examples (dispreferred thoughts) from the search tree, whereas SFT/RFT only use positive paths.
vs. Step-Level Value Methods (e.g., creating a value model): CPO optimizes the policy directly without training a separate reward/value model.
vs. AlphaZero/MCTS-based LLM approaches: CPO avoids the complexity of training separate policy and value networks, using the LLM itself as the implicit reward model via DPO.

Limitations

Dependence on the quality of the base model's self-evaluation (ToT relies on the model scoring itself).
Requires generating a search tree for training data, which is computationally expensive during the data preparation phase.
Experiments limited to reasoning tasks (arithmetic, logic); applicability to creative writing or open-ended generation is untested.

Reproducibility

Code: https://github.com/sail-sg/CPO

Code is publicly available at https://github.com/sail-sg/CPO. Datasets are standard (GSM8K, StrategyQA, etc.). Hyperparameters for each dataset are listed in the appendix.

📊 Experiments & Results

Evaluation Setup

Reasoning tasks across arithmetic, commonsense, and symbolic domains.

Benchmarks:

GSM8K (Arithmetic Reasoning)
StrategyQA (Commonsense Reasoning)
ProntoQA (Logical Reasoning)
Date Understanding (Symbolic Reasoning, BigBench)
Coin Flip (Symbolic Reasoning, BigBench)

Metrics:

Accuracy (Exact Match)
Statistical methodology: Not explicitly reported in the paper
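
Exact-match accuracy can be computed along the following lines. This is a minimal sketch: the normalization rules (lowercasing, trimming, dropping a trailing period and thousands separators) are assumptions, since the summary does not specify how answers are normalized before comparison.

```python
from typing import List


def normalize(answer: str) -> str:
    """Lowercase, trim, and drop a trailing period and thousands separators."""
    return answer.strip().lower().rstrip(".").replace(",", "")


def exact_match_accuracy(predictions: List[str], references: List[str]) -> float:
    """Percentage of predictions whose normalized form equals the reference."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    hits = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return 100.0 * hits / len(references)


# Two of the three predictions match after normalization.
acc = exact_match_accuracy(["18", "Yes.", "7"], ["18", "yes", "8"])
print(acc)
```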

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
Main results comparing CPO against base models (Zero-shot CoT) and standard fine-tuning baselines (SFT) using LLaMA-2-7B.
GSM8K	Accuracy	14.48	20.70	+6.22
StrategyQA	Accuracy	62.45	65.50	+3.05
ProntoQA	Accuracy	51.25	55.05	+3.80
Comparison against the heavy inference-time method Tree-of-Thought (ToT) showing CPO achieves similar performance with drastically less compute.
GSM8K	Accuracy	20.62	20.70	+0.08
Inference Latency	Seconds/sample	50.0	1.0	-49.0
Performance on Mistral-7B to show generalization across architectures.
GSM8K	Accuracy	36.47	41.62	+5.15
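
The Δ column can be re-derived from the Baseline and This Paper columns as a quick consistency check. The sketch below copies the numbers from the table above; the row labels are shorthand introduced here for readability.

```python
# (label, baseline, CPO) triples copied from the table above.
rows = [
    ("GSM8K (LLaMA-2-7B, main)", 14.48, 20.70),
    ("StrategyQA (LLaMA-2-7B, main)", 62.45, 65.50),
    ("ProntoQA (LLaMA-2-7B, main)", 51.25, 55.05),
    ("GSM8K (LLaMA-2-7B, vs. ToT)", 20.62, 20.70),
    ("GSM8K (Mistral-7B)", 36.47, 41.62),
]

# Accuracy gain in percentage points, rounded to match the table's precision.
deltas = {label: round(cpo - base, 2) for label, base, cpo in rows}

# Latency row: seconds per sample for ToT vs. CPO.
tot_latency, cpo_latency = 50.0, 1.0
speedup = tot_latency / cpo_latency

for label, delta in deltas.items():
    print(f"{label}: +{delta:.2f}")
print(f"ToT / CPO latency ratio: {speedup:.0f}x")
```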

Experiment Figures

Performance comparison (Accuracy) of CPO vs. Baselines (Zero-Shot, SFT, RFT, ToT) across multiple datasets on LLaMA-2-7B.

Main Takeaways

CPO consistently outperforms baselines (SFT, RFT, Zero-Shot CoT) across diverse reasoning tasks (Arithmetic, Commonsense, Logical).
The method successfully transfers the reasoning power of Tree-of-Thought (ToT) search into the model's weights, eliminating the need for expensive search during inference.
Utilizing 'dispreferred' thoughts (negative branches from the search tree) provides critical supervision that standard fine-tuning on 'correct' paths misses.
Performance improvements are robust across different model families (LLaMA-2, Mistral) and sizes (7B, 13B).

📚 Prerequisite Knowledge

Prerequisites

Chain-of-Thought (CoT) prompting
Tree-of-Thought (ToT) search algorithms
Direct Preference Optimization (DPO)
Reinforcement Learning from Human Feedback (RLHF) concepts

Key Terms

CoT: Chain-of-Thought—prompting an LLM to generate intermediate reasoning steps before the final answer.

ToT: Tree-of-Thought—a method where LLMs explore multiple reasoning branches (thoughts) at each step, using self-evaluation to prune bad branches.

CPO: Chain of Preference Optimization—the proposed method that trains LLMs using preference pairs (good vs. bad thoughts) derived from ToT search trees.

DPO: Direct Preference Optimization—an algorithm that optimizes LLMs to align with preferences by minimizing a specific loss function on winner/loser pairs, without a separate reward model.

SFT: Supervised Fine-Tuning—training a model on high-quality examples (input-output pairs).

BFS: Breadth-First Search—a tree search algorithm that explores all nodes at the present depth level before moving on to nodes at the next depth level.

MCTS: Monte Carlo Tree Search—a heuristic search algorithm used to find optimal decisions by simulating many random future outcomes.

RFT: Rejection Sampling Fine-Tuning—generating multiple samples, filtering for correct answers, and fine-tuning on those correct paths.