
Safe RLHF Beyond Expectation: Stochastic Dominance for Universal Spectral Risk Control

Yaswanth Chittepu, Ativ Joshi, Rajarshi Bhattacharjee, Scott Niekum
University of Massachusetts Amherst
arXiv (2026)
RL Reasoning

📝 Paper Summary

Safe Reinforcement Learning · AI Alignment · Risk-Sensitive Control
RAD replaces standard expected cost constraints in Safe RLHF with First-Order Stochastic Dominance constraints to control the entire cost distribution and mitigate tail risks.
Core Problem
Standard Safe RLHF constrains only the expected cost, failing to account for distributional uncertainty and heavy tails where rare but catastrophic harms occur.
Why it matters:
  • High-stakes domains like medical advice or legal reasoning require strict limits on worst-case outcomes, not just average safety
  • Expectation-based constraints allow models to 'pay for' severe safety violations (e.g., toxicity) by being very safe on easy prompts, masking tail risks
  • Current methods lack a principled mechanism to tune risk sensitivity (e.g., distinguishing between a risk-neutral chatbot and a risk-averse medical assistant)
Concrete Example: A medical chatbot trained with expected cost constraints might occasionally output severe misinformation or toxic advice, as long as its average behavior on common questions is safe enough to satisfy the mean constraint. RAD instead requires the policy's entire cost distribution, tail included, to be no worse than that of a safe reference policy.
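The gap between a mean constraint and tail behavior can be illustrated with a small numerical sketch. The cost values, the budget of 0.1, and the helper name `tail_mean` are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
import numpy as np

def tail_mean(costs, tail_fraction):
    """Average of the worst `tail_fraction` of costs (higher cost = more harmful)."""
    costs = np.sort(np.asarray(costs, dtype=float))
    k = max(1, int(round(tail_fraction * len(costs))))
    return float(costs[-k:].mean())

# Hypothetical per-response costs on 100 prompts (illustrative numbers only).
steady_policy = np.full(100, 0.1)                               # mildly costly everywhere
spiky_policy = np.concatenate([np.zeros(95), np.full(5, 2.0)])  # mostly safe, 5 severe failures

budget = 0.1  # assumed expected-cost budget
for name, costs in [("steady", steady_policy), ("spiky", spiky_policy)]:
    meets_mean = costs.mean() <= budget + 1e-9  # tolerance for float rounding
    print(name, "| mean constraint met:", meets_mean,
          "| worst-5% mean cost:", tail_mean(costs, 0.05))
```

Both hypothetical policies satisfy the mean constraint, yet the average cost of the worst 5% of responses is about 0.1 for the first and 2.0 for the second. An expectation-only constraint cannot tell them apart.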
Key Novelty
Risk-sensitive Alignment via Dominance (RAD)
  • Enforces First-Order Stochastic Dominance (FSD) to ensure the learned policy's cost distribution is stochastically smaller (better) than a reference policy across all quantiles
  • Reformulates the dominance constraint as an Optimal Transport problem solvable via differentiable Sinkhorn iterations for end-to-end optimization
  • Connects quantile-weighted dominance to Spectral Risk Measures, allowing users to control specific risk profiles (e.g., worst-case focus vs. average focus) via weighting functions
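A minimal sketch of the first two ideas, under stated assumptions. In one dimension, a policy's cost is stochastically smaller than the reference cost exactly when some coupling of the two pairs every policy cost with a reference cost at least as large (Strassen's theorem). One way to measure a violation is therefore the optimal transport cost under the one-sided ground cost max(x - y, 0). That ground cost, the function names, and the parameters `eps` and `n_iters` are illustrative choices here and not necessarily the paper's formulation.

```python
import numpy as np

def fsd_violation_quantiles(policy_costs, ref_costs, n_levels=99):
    """Mean amount by which policy quantiles exceed reference quantiles."""
    levels = np.linspace(0.01, 0.99, n_levels)
    gap = np.quantile(policy_costs, levels) - np.quantile(ref_costs, levels)
    return float(np.maximum(gap, 0.0).mean())

def sinkhorn_plan(a, b, C, eps=0.05, n_iters=500):
    """Entropy-regularized transport plan between histograms a and b."""
    K = np.exp(-C / eps)          # Gibbs kernel
    u = np.ones_like(a)
    v = np.ones_like(b)
    for _ in range(n_iters):      # alternate scaling to match both marginals
        u = a / (K @ v)
        v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]

def sinkhorn_fsd_violation(policy_costs, ref_costs, eps=0.05, n_iters=500):
    """Transport cost under the one-sided ground cost max(x - y, 0)."""
    x = np.asarray(policy_costs, dtype=float)
    y = np.asarray(ref_costs, dtype=float)
    a = np.full(len(x), 1.0 / len(x))
    b = np.full(len(y), 1.0 / len(y))
    C = np.maximum(x[:, None] - y[None, :], 0.0)  # penalize only policy cost > reference cost
    P = sinkhorn_plan(a, b, C, eps, n_iters)
    return float((P * C).sum())

# Hypothetical cost samples in [0, 1].
ref = np.linspace(0.5, 1.0, 20)
safe = ref - 0.4                  # every quantile lies below the reference
v_safe = sinkhorn_fsd_violation(safe, ref)
v_risky = sinkhorn_fsd_violation(ref, safe)   # roles swapped: dominance fails
print("safe vs ref:", v_safe, "| risky vs safe:", v_risky)
```

Every operation in the Sinkhorn loop is differentiable, which is what makes this kind of penalty usable inside gradient-based training when written in an autodiff framework. The entropic term leaves the value slightly above zero even when dominance holds exactly, so in practice it would be compared against a small tolerance and not against zero.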
Evaluation Highlights
  • RAD improves harmlessness (safety) over Safe RLHF baselines while maintaining competitive helpfulness scores
  • Demonstrates greater robustness on out-of-distribution harmlessness evaluations compared to expectation-based constraints
  • Universally controls Spectral Risk Measures: improvements in weighted FSD imply guaranteed improvements in corresponding risk metrics (like CVaR)
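The link between dominance and spectral risk can be sketched numerically. A spectral risk measure is a weighted average of the quantiles of the cost, with nonnegative weights that grow toward the worst outcomes. The mean uses flat weights, and CVaR at level alpha puts all weight on quantiles above alpha. If one cost distribution has a smaller quantile at every level, any such weighted average is smaller too. The sample data and function names below are hypothetical, and the estimator is a simple empirical approximation, not the paper's implementation.

```python
import numpy as np

def spectral_risk(costs, spectrum):
    """Empirical spectral risk: weighted average of sorted costs."""
    x = np.sort(np.asarray(costs, dtype=float))
    n = len(x)
    u = (np.arange(n) + 0.5) / n      # quantile level of each order statistic
    w = spectrum(u)
    w = w / w.sum()                   # normalize so the weights sum to 1
    return float(np.dot(w, x))

def mean_spectrum(u):
    return np.ones_like(u)            # flat weights: risk-neutral expectation

def cvar_spectrum(alpha):
    return lambda u: (u >= alpha).astype(float)   # weight only the worst (1 - alpha) tail

def exponential_spectrum(k):
    return lambda u: np.exp(k * u)    # smoothly increasing risk aversion

# Hypothetical costs: the policy's cost is never above the reference cost,
# so its quantiles are smaller at every level.
rng = np.random.default_rng(0)
ref_costs = rng.gamma(shape=2.0, scale=0.2, size=1000)
policy_costs = 0.9 * np.minimum(ref_costs, 0.6)

spectra = {
    "mean": mean_spectrum,
    "CVaR(0.9)": cvar_spectrum(0.9),
    "exponential(k=5)": exponential_spectrum(5.0),
}
for name, spectrum in spectra.items():
    print(name, spectral_risk(policy_costs, spectrum), "<=", spectral_risk(ref_costs, spectrum))
```

Changing the spectrum is the tuning knob: a flat spectrum recovers the usual expected-cost view, while a spectrum concentrated near the top quantiles emphasizes worst-case behavior.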
Breakthrough Assessment
7/10
Offers a mathematically rigorous upgrade to Safe RLHF by moving from expected value to distributional control. The connection to Spectral Risk Measures provides a unified framework for risk tuning.