Provably Convergent Primal-Dual DPO for Constrained LLM Alignment

📝 Paper Summary

LLM Safety Alignment, Constrained Optimization, Reinforcement Learning from Human Feedback (RLHF)

PD-DPO is a constrained alignment method that fine-tunes LLMs to maximize helpfulness while satisfying safety constraints by training only two models instead of the standard three.

Core Problem

Existing constrained alignment methods for LLMs (like Safe RLHF) require training three separate large models (reward model, cost model, policy), incurring prohibitive memory costs.

Why it matters:

Training three LLM-sized models simultaneously is memory-intensive and computationally expensive, limiting accessibility for researchers with limited hardware
Alternative methods either require prior knowledge of the optimal Lagrange multiplier or learn inefficiently, leading to poor safety-helpfulness trade-offs

Concrete Example: Safe RLHF trains a reward model, a cost model, and a policy model. PD-DPO replaces the explicit reward model with a reward-aligned DPO policy and removes the explicit cost model through a rearranged DPO objective, so it trains only the reward-aligned model and the final policy.

Key Novelty

Primal-Dual DPO (PD-DPO)

Uses a pre-trained reward-aligned DPO model to implicitly provide reward information, avoiding the need for a separate explicit reward model during the safety tuning phase
Formulates a rearranged Lagrangian DPO objective that directly optimizes the policy using cost preference data, conditioned on the reward information from the first model
Updates the Lagrange multiplier (balancing safety vs. reward) via projected subgradient descent using cost estimates from the current policy
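
The alternation between policy updates and multiplier updates can be illustrated on a toy problem. The sketch below is not the paper's training loop: it assumes a finite set of candidate responses with known rewards and costs, and it swaps the gradient-based DPO policy update for the closed-form maximizer of a KL-regularized Lagrangian. All function names, the cap lam_max, and the step size are hypothetical choices made for illustration.

```python
import math

def primal_step(ref_probs, rewards, costs, lam, beta):
    """Closed-form maximizer of E[r - lam * c] - beta * KL(pi || pi_ref) (toy stand-in for DPO training)."""
    logits = [math.log(p) + (r - lam * c) / beta
              for p, r, c in zip(ref_probs, rewards, costs)]
    top = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(z - top) for z in logits]
    total = sum(weights)
    return [w / total for w in weights]

def dual_step(lam, expected_cost, threshold, step_size, lam_max):
    """Projected subgradient step: raise lam when the cost constraint is violated, then clip to [0, lam_max]."""
    updated = lam + step_size * (expected_cost - threshold)
    return min(max(updated, 0.0), lam_max)

def run_primal_dual(ref_probs, rewards, costs, threshold,
                    beta=0.5, step_size=0.5, lam_max=10.0, iters=500):
    """Alternate primal and dual steps, estimating the cost under the current policy each round."""
    lam = 0.0
    for _ in range(iters):
        policy = primal_step(ref_probs, rewards, costs, lam, beta)
        expected_cost = sum(p * c for p, c in zip(policy, costs))
        lam = dual_step(lam, expected_cost, threshold, step_size, lam_max)
    policy = primal_step(ref_probs, rewards, costs, lam, beta)
    return policy, lam
```

On a toy instance where the most rewarding response is also the most costly, the multiplier rises from zero until the expected cost of the policy meets the threshold, which is the behavior the dual update is meant to produce.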

Architecture

The PD-DPO algorithm flow comprising two main phases: Reward Learning and Primal-Dual Updates.

Evaluation Highlights

Achieves higher reward (helpfulness) while maintaining lower cost (safety) than the Safe RLHF and C-DPO baselines on the PKU-SafeRLHF dataset
Reduces memory footprint by ~33% compared to Safe RLHF (requiring 2 models instead of 3)
Provides rigorous theoretical guarantees for suboptimality and constraint violation, unlike most heuristic safety alignment methods
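
The ~33% figure is consistent with simple model counting. As a rough check, assume every model has the same memory footprint M (an assumption, not stated in the summary) and that memory scales with the number of models:

```latex
\frac{3M - 2M}{3M} = \frac{1}{3} \approx 33\%
```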

Breakthrough Assessment

8/10

Strong theoretical grounding with provable convergence, coupled with a practical reduction in memory requirements. Reported results improve on the Safe RLHF and C-DPO baselines for constrained alignment.

⚙️ Technical Details

Problem Definition

Setting: Constrained alignment. Maximize expected reward subject to the expected cost staying below a threshold.

Inputs: Prompt x, Reward preference dataset D^r, Cost preference dataset D^c

Outputs: Policy π satisfying cost constraints
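
The setting above can be written as a constrained optimization problem with its Lagrangian. The symbols r (reward), c (cost), b (cost threshold), and \mathcal{D} (prompt distribution) are notation assumed here for illustration. Any KL regularization toward a reference policy is omitted because the summary does not state it.

```latex
\begin{align}
\max_{\pi} \;& \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi(\cdot \mid x)} \big[ r(x, y) \big]
\quad \text{s.t.} \quad
\mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi(\cdot \mid x)} \big[ c(x, y) \big] \le b, \\
L(\pi, \lambda) &= \mathbb{E}\big[ r(x, y) \big] - \lambda \Big( \mathbb{E}\big[ c(x, y) \big] - b \Big), \qquad \lambda \ge 0 .
\end{align}
```

The primal variable is the policy π and the dual variable is the multiplier λ, which grows when the expected cost exceeds b.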

Pipeline Flow

Standard DPO Training (Reward-Aligned Model)
Primal-Dual DPO Fine-tuning (Safety-Aligned Model)

System Modules

Reward-Aligned Model (π*_r)

Learns to maximize helpfulness/reward from reward preference data

Model or implementation: LLaMA-2-7B-base or similar
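
For reference, the standard DPO loss used to train this model can be sketched in a few lines. The version below works on precomputed sequence log-probabilities and uses only the Python standard library. The function and argument names are hypothetical, and a real implementation would compute these log-probabilities from the policy and a frozen reference model in a deep learning framework.

```python
import math

def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Mean DPO loss over a batch of (chosen, rejected) response pairs.

    Each argument is a list of sequence log-probabilities log pi(y | x).
    """
    losses = []
    for pc, pr, rc, rr in zip(policy_logp_chosen, policy_logp_rejected,
                              ref_logp_chosen, ref_logp_rejected):
        # Implicit reward margin: beta * (log-ratio of chosen - log-ratio of rejected)
        margin = beta * ((pc - rc) - (pr - rr))
        losses.append(-log_sigmoid(margin))
    return sum(losses) / len(losses)
```

When the policy equals the reference model the margin is zero and the loss is log 2. Raising the policy's log-probability of the chosen response relative to the reference lowers the loss.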

Safety-Aligned Model (π_k)

Fine-tunes the policy to satisfy safety constraints while maintaining helpfulness

Model or implementation: LLaMA-2-7B-base or similar

Novel Architectural Elements

Dual-model pipeline: Instead of (Reward Model + Cost Model + Policy), uses (Reward-Aligned Policy + Final Policy)

Modeling

Base Model: Beaver-7B (based on LLaMA-2-7B)

Training Method: Primal-Dual Direct Preference Optimization (PD-DPO)

Objective Functions:

Purpose (step 1): Learn a reward-aligned policy π*_r that implicitly encodes the reward.

Formally: Standard DPO loss on reward preference data D^r.

Purpose (step 2): Optimize the Lagrangian (Reward - λ·Cost).

Formally: Rearranged DPO loss on cost preference data D^c, using π*_r (from step 1) to substitute the unknown reward function.
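
One way to see how π*_r can stand in for the unknown reward is the following derivation. It is a sketch under assumptions and may differ from the paper's exact loss. It assumes a KL-regularized objective with coefficient β and reference policy π_ref, a Bradley-Terry model on costs, λ > 0, and cost preference pairs (y_s, y_u) in which y_s is the safer response. The labels y_s and y_u and the sign convention are hypothetical.

```latex
\begin{align}
\pi^*_r(y \mid x) &\propto \pi_{\mathrm{ref}}(y \mid x)\, \exp\!\big( r(x,y)/\beta \big), \\
\pi_\lambda(y \mid x) &\propto \pi_{\mathrm{ref}}(y \mid x)\, \exp\!\big( (r(x,y) - \lambda\, c(x,y))/\beta \big), \\
c(x,y) &= \frac{\beta}{\lambda} \log \frac{\pi^*_r(y \mid x)}{\pi_\lambda(y \mid x)} + \mathrm{const}(x), \\
\mathcal{L}(\pi) &= -\,\mathbb{E}_{(x,\, y_s,\, y_u) \sim D^c} \left[ \log \sigma\!\left( \frac{\beta}{\lambda} \left( \log \frac{\pi^*_r(y_u \mid x)}{\pi(y_u \mid x)} - \log \frac{\pi^*_r(y_s \mid x)}{\pi(y_s \mid x)} \right) \right) \right].
\end{align}
```

The third line follows by dividing the first two, which cancels π_ref and the reward. The prompt-dependent constant cancels in the pairwise difference inside σ, so under these assumptions the loss involves only the reward-aligned model and the policy being trained.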

Key Hyperparameters:

beta: 0.1
learning_rate: 1e-5 (DPO), 5e-6 (PD-DPO)
batch_size: 64
epochs: 1 (DPO), 2 (PD-DPO)
scheduler: cosine
max_length: 512
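
The hyperparameters above can be grouped per training stage, as in the sketch below. The class and constant names are hypothetical. The cosine schedule shown has no warmup and decays to zero, which is an assumption because the summary only states "cosine".

```python
import math
from dataclasses import dataclass

@dataclass(frozen=True)
class StageConfig:
    """Hyperparameters for one training stage (values taken from the list above)."""
    learning_rate: float
    epochs: int
    beta: float = 0.1
    batch_size: int = 64
    max_length: int = 512
    scheduler: str = "cosine"

DPO_STAGE = StageConfig(learning_rate=1e-5, epochs=1)      # reward-aligned model
PD_DPO_STAGE = StageConfig(learning_rate=5e-6, epochs=2)   # safety-aligned model

def cosine_lr(base_lr, step, total_steps):
    """Cosine decay from base_lr to 0 (assumed: no warmup, no minimum learning rate)."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))
```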

Compute: Requires training 2 models sequentially (reduced from 3 models in Safe RLHF). Experiments run on A800 GPUs.

Comparison to Prior Work

vs. Safe RLHF: PD-DPO requires 2 models instead of 3; avoids explicit reward/cost modeling
vs. C-DPO: PD-DPO handles constraints via explicit Lagrangian primal-dual updates rather than heuristic data reordering
vs. SACPO: PD-DPO adapts λ dynamically and does not require knowing optimal λ beforehand

Limitations

Relies on the existence of a safe policy (feasibility assumption)
Requires two sequential training stages (cannot be fully parallelized)
Performance depends on the quality of the initial reward-aligned model

Reproducibility

The paper does not state whether code is available. The datasets (PKU-SafeRLHF, TruthfulQA) are public, and the base model (Beaver-7B) is open-source.

📊 Experiments & Results

Evaluation Setup

Constrained alignment on safety and helpfulness tasks

Benchmarks:

PKU-SafeRLHF (Safety and Helpfulness Dialogue)
TruthfulQA (Truthfulness Evaluation)

Metrics:

Reward Score (Helpfulness)
Cost Score (Harmlessness/Safety)
Win Rate vs SFT
Statistical methodology: Not explicitly reported in the paper

Key Results

Benchmark	Metric	Baseline	This Paper	Δ
PKU-SafeRLHF	Cost Score (lower is safer)	18.3	13.6	-4.7
PKU-SafeRLHF	Reward Score (higher is better)	2.1	2.9	+0.8
TruthfulQA	MC1 (Accuracy)	28.5	30.4	+1.9
PKU-SafeRLHF	Helpfulness Win Rate vs SFT	51.0	56.2	+5.2

Experiment Figures

Pareto frontier curves of Reward vs. Cost on PKU-SafeRLHF for PD-DPO and baselines.

Main Takeaways

PD-DPO consistently achieves a better trade-off between helpfulness (reward) and safety (cost) than Safe RLHF and C-DPO.
The adaptive Lagrangian variant (PD-DPO-adaLag) effectively regulates cost without manual tuning of λ.
PD-DPO trains 2 models instead of the 3 needed by Safe RLHF, which lowers memory requirements and makes it more practical for resource-constrained settings.

📚 Prerequisite Knowledge

Prerequisites

Reinforcement Learning from Human Feedback (RLHF)
Direct Preference Optimization (DPO)
Lagrangian Duality / Primal-Dual Optimization
Bradley-Terry Model

Key Terms

DPO: Direct Preference Optimization—a method to align language models to preferences without explicitly training a reward model

RLHF: Reinforcement Learning from Human Feedback—a framework involving training a reward model and then optimizing a policy using RL (e.g., PPO)

PPO: Proximal Policy Optimization—a standard RL algorithm used in RLHF

Lagrange multiplier: A variable (λ) used in constrained optimization to weigh the constraint violation (cost) against the objective (reward)

Primal-Dual: An optimization approach that simultaneously updates the policy (primal variable) and the Lagrange multiplier (dual variable)

SFT: Supervised Fine-Tuning—the initial training phase of LLMs on high-quality instruction data

Bradley-Terry model: A statistical model predicting the probability that one item is preferred over another based on their latent scores

Safe RLHF: A baseline framework that trains distinct reward and cost models and uses PPO-Lagrangian to align LLMs

C-DPO: Constrained DPO—a baseline method that modifies DPO for safety constraints, often by reordering preferences