Curry-DPO: Enhancing Alignment using Curriculum Learning & Ranked Preferences

📝 Paper Summary

LLM Alignment, Direct Preference Optimization (DPO)

Curri-DPO aligns LLMs by deriving multiple preference pairs from single prompts and training on them sequentially from easiest (large quality gap) to hardest (small quality gap).

Core Problem

Standard DPO wastes data by using only a single chosen/rejected pair per prompt, ignoring the richer signal available in multiple ranked responses.

Why it matters:

Existing methods discard valid preference signals (e.g., 2nd best vs. worst) which could act as data augmentation
Learning contrastive signals is inefficient when models are immediately exposed to hard-to-distinguish pairs (similar quality) without a curriculum
High-quality response curation is expensive; maximizing utility from existing ranked responses improves efficiency

Concrete Example: For a prompt with responses ranked R1 > R2 > R3 > R4, standard DPO only uses (R1, R4). It ignores that distinguishing R1 from R4 is 'easy', while distinguishing R1 from R2 (both good) is 'hard' but crucial for fine-grained alignment.

Key Novelty

Curriculum-based Direct Preference Optimization (Curri-DPO)

Decomposes a ranked list of responses (R1>R2>R3>R4) into multiple pairwise comparisons (e.g., R1 vs R4, R1 vs R3, R1 vs R2)
Orders training data by difficulty: starts with 'easy' pairs (large quality gap, R1 vs R4) and progresses to 'hard' pairs (small quality gap, R1 vs R2)
Iteratively updates the reference model: the model from iteration 'i' becomes the reference for iteration 'i+1', allowing progressive alignment
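
The decomposition and easy-to-hard ordering described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function name `build_curriculum_pairs` is hypothetical, and it assumes the input list is already sorted best-first.

```python
from typing import List, Tuple


def build_curriculum_pairs(ranked: List[str]) -> List[Tuple[str, str]]:
    """Return (chosen, rejected) pairs ordered from easy to hard.

    `ranked` lists responses best-first (R1, R2, ..., RK). The best
    response is always the chosen one; the rejected response walks from
    the worst up to the second-best, so the quality gap shrinks per stage.
    """
    if len(ranked) < 2:
        raise ValueError("need at least two ranked responses")
    best = ranked[0]
    # reversed(ranked[1:]) yields RK, ..., R3, R2: largest gap first.
    return [(best, rejected) for rejected in reversed(ranked[1:])]


pairs = build_curriculum_pairs(["R1", "R2", "R3", "R4"])
print(pairs)  # [('R1', 'R4'), ('R1', 'R3'), ('R1', 'R2')]
```

Each element of the returned list corresponds to one curriculum stage, so a prompt with K ranked responses yields K-1 training pairs instead of one.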

Architecture

Illustration of the Curriculum DPO process showing how multiple preference pairs are created from ranked responses and ordered by difficulty.

Evaluation Highlights

Scores 7.43 on MT-Bench with Zephyr-7B, outperforming the majority of existing LLMs with similar parameter sizes
Achieves a 90.7% win rate on Vicuna Bench (Zephyr-7B), showing strong alignment performance
Notable gains of up to 7.5% on Vicuna, WizardLM, and UltraFeedback test sets compared to standard single-pair DPO

Breakthrough Assessment

7/10

A simple yet effective extension to DPO that leverages curriculum learning and data augmentation from ranked lists, showing consistent gains over standard DPO without changing the underlying loss function.

⚙️ Technical Details

Problem Definition

Setting: Aligning a Supervised Fine-Tuned (SFT) model to human preferences using offline pairwise preference data

Inputs: Prompt x and a set of ranked responses {y1, y2, ..., yK}

Outputs: Optimized policy π_theta aligned with preferences

Pipeline Flow

Input Prompt
Aligned LLM
Output Response

System Modules

Aligned LLM

Generate aligned responses to user prompts

Model or implementation: Zephyr-7B or Mistral-7B

Novel Architectural Elements

Iterative Reference Model Update: Unlike standard DPO which keeps the SFT model as the fixed reference, Curri-DPO updates the reference model to be the policy from the previous curriculum stage (Eq. 2).
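
The control flow of the iterative reference update can be sketched as a short loop. This is a schematic under stated assumptions: `curriculum_dpo` and `train_dpo` are hypothetical names, `train_dpo` stands in for one full DPO training pass and is assumed to return a new model rather than mutate its input, and the first stage is assumed to use the SFT model as both the initial policy and the initial reference.

```python
from typing import Callable, Sequence, TypeVar

Model = TypeVar("Model")
Stage = TypeVar("Stage")


def curriculum_dpo(
    sft_model: Model,
    stages: Sequence[Stage],
    train_dpo: Callable[[Model, Model, Stage], Model],
) -> Model:
    """Run one DPO pass per curriculum stage, refreshing the reference.

    `stages` is ordered easy -> hard. After each stage, the policy that
    was just trained becomes the reference for the next stage.
    """
    policy = sft_model
    reference = sft_model  # assumption: stage 1 uses the SFT model as reference
    for stage_pairs in stages:
        policy = train_dpo(policy, reference, stage_pairs)
        # With a real network, this should be a frozen copy of the policy.
        reference = policy
    return policy
```

With standard DPO, `reference` would stay fixed at `sft_model` for the whole run; the single reassignment inside the loop is the only difference in this sketch.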

Modeling

Base Model: Zephyr-7B (fine-tuned on UltraChat) and Mistral-7B (fine-tuned on OpenAssistant)

Training Method: Curriculum-based Direct Preference Optimization (Curri-DPO)

Objective Functions:

Purpose: Optimize policy to prefer chosen responses over rejected ones while staying close to a reference.

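Formally: DPO loss L_DPO(π_theta; π_ref) = -E_(x, yw, yl)[log σ(β * log(π_theta(yw|x)/π_ref(yw|x)) - β * log(π_theta(yl|x)/π_ref(yl|x)))], where yw is the chosen response, yl is the rejected response, σ is the sigmoid function, and β controls how strongly the policy is held close to the reference.

Purpose: Iteratively update the reference model for curriculum stages.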

Formally: π_ref for iteration i+1 is set to π_theta from iteration i.
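
To make the DPO loss concrete, the sketch below evaluates it for a single preference pair from four sequence log-probabilities. It is illustrative only: the function names are hypothetical, the log-probability values are made up, and `beta=0.1` is a placeholder rather than a value taken from the paper.

```python
import math


def _softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(
    policy_logp_chosen: float,
    policy_logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    beta: float = 0.1,  # placeholder value, not from the paper
) -> float:
    """DPO loss for one (chosen, rejected) pair.

    Inputs are sequence log-probabilities log pi(y | x) under the policy
    and the reference. The loss is -log(sigmoid(margin)).
    """
    chosen_logratio = policy_logp_chosen - ref_logp_chosen
    rejected_logratio = policy_logp_rejected - ref_logp_rejected
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(m)) == log(1 + exp(-m)) == softplus(-m)
    return _softplus(-margin)


# Policy identical to the reference: margin is 0, so the loss is log(2).
print(dpo_loss(-10.0, -12.0, -10.0, -12.0))
# Policy raises the chosen response and lowers the rejected one: loss drops.
print(dpo_loss(-9.0, -13.0, -10.0, -12.0))
```

Because only log-ratios against the reference enter the loss, swapping in a new reference model between curriculum stages resets the margin to zero at the start of each stage.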

Training Data:

UltraFeedback: 5K randomly sampled prompts, 4 responses each (ranked by GPT-4)
OpenAssistant: 5K sampled conversation trees, top-4 responses (ranked by humans)
Pairs constructed by fixing best response as 'chosen' and iterating 'rejected' from worst (R4) to second-best (R2)

Key Hyperparameters:

global_batch_size: 32
learning_rate: 5e-7 (max)
scheduler: linear with 10% warmup
optimizer: Adam (beta1=0.9, beta2=0.999)
precision: bfloat16
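
The learning-rate settings above can be read as the following schedule. This is a sketch under an assumption: the summary says only "linear with 10% warmup", so the linear decay to zero after the peak is assumed here, and the function name `lr_at` is hypothetical.

```python
def lr_at(
    step: int,
    total_steps: int,
    peak_lr: float = 5e-7,
    warmup_frac: float = 0.10,
) -> float:
    """Learning rate at `step` for linear warmup followed by linear decay.

    Assumption: after warmup the rate decays linearly to zero at
    `total_steps`; the source states only "linear with 10% warmup".
    """
    warmup_steps = int(total_steps * warmup_frac)
    if step < warmup_steps:
        # Ramp from 0 up to peak_lr over the warmup steps.
        return peak_lr * step / max(1, warmup_steps)
    decay_steps = max(1, total_steps - warmup_steps)
    # Fall from peak_lr back to 0 over the remaining steps.
    return peak_lr * max(0.0, (total_steps - step) / decay_steps)


for s in (0, 50, 100, 550, 1000):
    print(s, lr_at(s, total_steps=1000))
```

With 1000 total steps, the rate reaches its 5e-7 maximum at step 100 and returns to zero at step 1000.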

Compute: Not reported in the paper

Comparison to Prior Work

vs. RRHF/LiPO: Curri-DPO retains the pairwise DPO formulation but structures the data feed via curriculum, rather than changing the loss to listwise ranking
vs. Standard DPO: Utilizes multiple pairs per prompt and updates reference model iteratively
vs. SPIN: Uses existing ranked responses for curriculum rather than generating new responses via self-play

Limitations

Reliance on high-quality ranked data (requires at least 3-4 responses per prompt to form a curriculum)
Computationally more expensive than standard DPO due to multiple training iterations over the dataset
Performance depends on the accuracy of the ranking source (GPT-4 or human annotators)

Reproducibility

Code: https://github.com/ServiceNow-AI/Curriculum_DPO_preferences

Preference pairs are released at ServiceNow-AI/Curriculum_DPO_preferences. Code is publicly available. Training relies on specific subsets (5K samples) of UltraFeedback and OpenAssistant.

📊 Experiments & Results

Evaluation Setup

Evaluation of aligned 7B models on standard chat benchmarks using GPT-4 as a judge

Benchmarks:

MT-Bench (Multi-turn conversation across 8 domains)
Vicuna Bench (Single-turn questions: knowledge, reasoning, etc.)
WizardLM (Instruction following on complex evol-instruct questions)
UltraFeedback Test Set (Helpfulness and honesty evaluation)

Metrics:

MT-Bench Score (1-10)
Adjusted Win Rate (vs. SFT or DPO baselines)
Statistical methodology: Not explicitly reported in the paper
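
This summary does not define the adjusted win rate. A common convention in judge-based pairwise evaluation counts each tie as half a win, and the sketch below assumes that convention; the function name and the counts in the usage line are illustrative, not results from the paper.

```python
def adjusted_win_rate(wins: int, ties: int, losses: int) -> float:
    """Win rate with ties counted as half a win (assumed convention)."""
    total = wins + ties + losses
    if total == 0:
        raise ValueError("no comparisons to score")
    return (wins + 0.5 * ties) / total


# Illustrative counts only: 70 wins, 20 ties, 10 losses out of 100.
print(adjusted_win_rate(70, 20, 10))
```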

Key Results

Performance on conversational benchmarks showing improvements over baselines. (Note: Baseline absolute values for standard DPO were not explicitly extracted from the text, so only Curri-DPO values are listed here where explicit comparisons are made in text.)
Benchmark	Metric	Baseline	This Paper	Δ
MT-Bench	Score (1-10)	Not reported in the paper	7.43	Not reported in the paper
Vicuna Bench	Win Rate	Not reported in the paper	90.7%	Not reported in the paper
WizardLM	Win Rate	Not reported in the paper	87.1%	Not reported in the paper

Main Takeaways

Curriculum learning ordering (Easy to Hard) consistently outperforms random shuffling of multiple preference pairs.
Iteratively updating the reference model (setting ref = model from previous curriculum stage) is crucial for performance gains.
Utilizing multiple preference pairs per prompt acts as effective data augmentation, improving over single-pair DPO.
The method is effective even with far less training data (5K prompts) than the full-dataset training (64K prompts) used in baselines like Zephyr-beta.

📚 Prerequisite Knowledge

Prerequisites

Direct Preference Optimization (DPO)
Curriculum Learning
Reinforcement Learning from Human Feedback (RLHF)

Key Terms

DPO: Direct Preference Optimization—a stable method for aligning language models to preferences by optimizing a classification loss rather than using reinforcement learning

Curriculum Learning: A training strategy where examples are presented in a meaningful order (e.g., easy to hard) to improve convergence and generalization

SFT: Supervised Fine-Tuning—the initial phase of training on high-quality instruction-response pairs before preference alignment

Reference Model: The baseline model (usually the SFT model) used in DPO to regularize the training and prevent the new model from drifting too far (via KL divergence)

LogP: Log Probability—the logarithm of the probability assigned by the model to a token or sequence; used here as a proxy for model confidence or response quality
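
For an autoregressive model, the log probability of a whole response is the sum of its per-token log probabilities. The sketch below shows that relationship with made-up token probabilities; the function name `sequence_logp` is hypothetical.

```python
import math
from typing import Sequence


def sequence_logp(token_probs: Sequence[float]) -> float:
    """Sum per-token log-probabilities to get the sequence log-probability.

    `token_probs[i]` is the model's probability of token i given the
    prompt and all earlier tokens.
    """
    return sum(math.log(p) for p in token_probs)


# Made-up probabilities for a three-token response.
print(sequence_logp([0.5, 0.25, 0.8]))
```

Summing logs is equivalent to taking the log of the product of token probabilities, but avoids underflow on long responses.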