DPO: Direct Preference Optimization—an algorithm that fine-tunes language models directly on preference data, without training a separate reward model in the loop
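A minimal sketch of the DPO pairwise loss for a single preference pair, assuming the log-probabilities of the chosen and rejected responses under the policy and the reference model are already computed (function and argument names here are illustrative, not from the source):

```python
import math

def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Negative log-sigmoid of the implicit reward margin (single pair)."""
    # Implicit rewards are beta-scaled log-ratios against the reference model
    reward_chosen = beta * (policy_logp_chosen - ref_logp_chosen)
    reward_rejected = beta * (policy_logp_rejected - ref_logp_rejected)
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(margin)): small when the chosen response is favored
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

When the policy matches the reference exactly, the margin is zero and the loss is log 2; training pushes the chosen response's log-ratio above the rejected one's.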
Online DPO: A variant of DPO where the model generates its own training data (responses) during training, which are then scored/labeled, reducing distribution shift compared to offline data
Pareto frontier: The set of optimal trade-offs between conflicting objectives where no objective can be improved without degrading another
Dirichlet sampling: Drawing from a Dirichlet distribution—a distribution over the probability simplex—used here to sample objective-weight vectors (non-negative entries summing to one) during training, ensuring diverse coverage of the trade-off space
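A small sketch of Dirichlet sampling using only the standard library, via the standard normalized-Gamma construction (the concentration values here are illustrative):

```python
import random

def sample_dirichlet(alpha):
    """Draw one weight vector from Dirichlet(alpha) via normalized Gamma draws."""
    draws = [random.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

# alpha = [1, 1, 1] gives a uniform distribution over the 2-simplex,
# i.e. all valid 3-objective weight combinations are equally likely
weights = sample_dirichlet([1.0, 1.0, 1.0])
```

Each sampled vector is non-negative and sums to one, so it can be fed directly to the model as an objective-weighting control signal.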
Steerable policy: A single model capable of adjusting its behavior at inference time based on an input control signal (like a weight vector)
Model souping: A technique of averaging the weights of multiple models trained on different objectives to create a new model that balances those objectives
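A minimal sketch of model souping, assuming the models share an identical architecture so their parameters can be averaged key by key (plain dicts of floats stand in for real checkpoints here):

```python
def soup(state_dicts):
    """Uniformly average parameters across checkpoints with identical keys."""
    keys = state_dicts[0].keys()
    return {k: sum(sd[k] for sd in state_dicts) / len(state_dicts) for k in keys}

# Two single-parameter "models" trained on different objectives
model_a = {"layer.weight": 1.0}
model_b = {"layer.weight": 3.0}
souped = soup([model_a, model_b])  # {"layer.weight": 2.0}
```

With real checkpoints the same loop runs over tensors (e.g. PyTorch state dicts), averaging elementwise; only models fine-tuned from a common initialization tend to soup well.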
KL regularization: A penalty term that keeps the trained model from diverging too far from a reference model (typically the supervised fine-tuned checkpoint the training started from)
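A hedged sketch of the per-sequence KL estimate commonly used as this penalty: the average per-token log-ratio between the policy and the reference model on a sampled response (the inputs here are assumed precomputed token log-probabilities, not from the source):

```python
def kl_penalty(policy_token_logps, ref_token_logps):
    """Monte Carlo KL estimate: mean of log pi_theta(y_t) - log pi_ref(y_t)."""
    ratios = [lp - lr for lp, lr in zip(policy_token_logps, ref_token_logps)]
    return sum(ratios) / len(ratios)

# Policy assigns higher probability than the reference -> positive penalty
penalty = kl_penalty([-1.0, -2.0], [-1.5, -2.5])  # 0.5
```

This estimate is added to the training objective scaled by a coefficient (the beta in DPO plays the analogous role implicitly), so larger divergence from the reference costs more.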