DPO: Direct Preference Optimization—an algorithm that aligns language models to human preferences by directly optimizing the policy on pairwise preference data without a separate reward model
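The DPO objective can be made concrete with a small sketch. This is a minimal, self-contained illustration of the standard DPO loss for one preference pair (not code from the source); the log-probability values and `beta=0.1` are made-up inputs for demonstration.

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for a single preference pair.

    Each argument is the summed log-probability of the chosen/rejected
    response under the trainable policy (logp_*) or the frozen
    reference model (ref_logp_*); beta controls how far the policy may
    drift from the reference.
    """
    # Implicit reward margin: how much more the policy prefers the
    # chosen response over the rejected one, relative to the reference.
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # Negative log-sigmoid of the margin: the loss decreases as the
    # policy increasingly agrees with the human preference.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Zero margin gives log(2); a positive margin (policy already agrees
# with the preference more than the reference does) gives less.
print(dpo_loss(-10.0, -12.0, -11.0, -11.0))
```

Note that no reward model appears anywhere: the pairwise log-probability ratios play the role of an implicit reward, which is the point of the "direct" in DPO.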
RMAB: Restless Multi-Armed Bandit—a sequential resource allocation problem in which every arm's state evolves at each step whether or not it is acted upon (hence "restless"), and a budget limits how many arms can be acted on at once
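A toy simulation can make the RMAB structure concrete. The following sketch is purely illustrative (the two-state transition probabilities, myopic policy, and arm/budget counts are all invented for the example, not taken from the source):

```python
import random

def simulate_rmab(n_arms=4, budget=1, horizon=20, seed=0):
    """Minimal restless bandit: each arm is a 2-state Markov chain
    (0 = bad, 1 = good). Acting on an arm raises its chance of reaching
    or staying in the good state, but every arm transitions each step
    regardless (the 'restless' property). Reward per step = number of
    arms in the good state. The policy here is myopic: spend the budget
    on arms currently in the bad state."""
    rng = random.Random(seed)
    # p_good_next[action][state]: probability the arm is in the good
    # state next step (illustrative numbers, not from the source).
    p_good_next = {0: {0: 0.1, 1: 0.7},   # passive
                   1: {0: 0.6, 1: 0.95}}  # active
    states = [1] * n_arms
    total_reward = 0
    for _ in range(horizon):
        total_reward += sum(states)
        bad = [i for i, s in enumerate(states) if s == 0]
        acted = set(bad[:budget])  # budget constraint: act on <= budget arms
        states = [1 if rng.random() < p_good_next[i in acted][s] else 0
                  for i, s in enumerate(states)]
    return total_reward

print(simulate_rmab())
```

The budget coupling across otherwise independent chains is what makes RMABs hard; practical solvers use index policies (e.g. Whittle indices) rather than the myopic rule sketched here.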
DRO: Distributionally Robust Optimization—an optimization framework that minimizes the worst-case loss over a set of possible distributions (ambiguity set) rather than just the empirical average
Ambiguity Set: A set of probability distributions considered possible around the observed data distribution; the model optimizes against the worst distribution in this set
Self-reflection: An inference-time technique where an LLM critiques and refines its own outputs, often using feedback from a simulator
Chi-squared divergence: A statistical measure of the difference between two probability distributions, used here to define the size of the ambiguity set
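The chi-squared ambiguity set and the DRO worst-case loss from the entries above can be sketched together for discrete distributions. This is an illustrative brute-force version (grid search over the simplex, with made-up losses and radius); real DRO solvers use convex duality instead:

```python
import itertools

def chi2_divergence(p, q):
    """Chi-squared divergence chi^2(P || Q) = sum_i (p_i - q_i)^2 / q_i
    for two discrete distributions with matching support (q_i > 0)."""
    return sum((pi - qi) ** 2 / qi for pi, qi in zip(p, q))

def worst_case_loss(losses, q, radius, step=0.01):
    """DRO inner maximization by grid search: the largest expected loss
    over distributions p in the ambiguity set
    {p : chi^2(p || q) <= radius} around the empirical distribution q."""
    n = len(q)
    # Start from the empirical expectation (p = q is always feasible).
    best = sum(li * qi for li, qi in zip(losses, q))
    ticks = [i * step for i in range(int(1 / step) + 1)]
    for combo in itertools.product(ticks, repeat=n - 1):
        last = 1.0 - sum(combo)
        if last < -1e-9:
            continue  # not a valid distribution
        p = list(combo) + [max(last, 0.0)]
        if chi2_divergence(p, q) <= radius:
            best = max(best, sum(li * pi for li, pi in zip(losses, p)))
    return best

q = [0.5, 0.3, 0.2]        # empirical distribution
losses = [1.0, 2.0, 5.0]   # per-outcome loss
print(worst_case_loss(losses, q, radius=0.0))  # = empirical expectation 2.1
print(worst_case_loss(losses, q, radius=0.5))  # larger: mass shifts toward loss 5
```

With radius 0 the ambiguity set collapses to the empirical distribution and DRO reduces to ordinary empirical risk; a larger radius lets adversarial mass move onto high-loss outcomes, which is exactly the conservatism DRO buys.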
DLM: Decision Language Model—a baseline method that uses an LLM with self-reflection and iterative feedback to design reward functions
Reward Hacking: When an agent exploits flaws in the reward function to get a high score without actually achieving the intended goal