
Reward Model Perspectives: Whose Opinions Do Reward Models Reward?

Elle
University of Oxford, Department of Computer Science
arXiv (2025)
RL P13N Benchmark

📝 Paper Summary

AI Safety and Alignment · Reward Modeling · Sociodemographic Bias
The paper formalizes a framework to quantify sociodemographic biases in Reward Models, finding they consistently favor specific demographics and cannot be reliably steered via in-context prompting.
Core Problem
Reward Models (RMs) act as proxies for human values in alignment training, but their inherent sociodemographic biases and opinion distributions remain opaque and unmeasured.
Why it matters:
  • RMs define the optimization landscape for Language Models (LMs); if the RM is biased, the resulting LM will inevitably propagate those social biases
  • Current evaluations focus on LMs (which suffer from refusals and instability), neglecting the upstream RMs that actually drive preference learning
  • Blindly relying on RMs for safety and alignment risks reinforcing stereotypes if the models implicitly favor harmful or non-representative perspectives
Concrete Example: When evaluated on the OpinionQA dataset, RMs consistently assigned higher rewards to opinions held by people from the American South with lower formal education, rather than representing a balanced global perspective. Furthermore, BeaverRM consistently preferred 'Stereotyped' choices in the BBQ dataset, while Pythia1B preferred 'Unknown' options.
Key Novelty
Reward Model Perspectives (RMPs) Framework
  • Audits alignment by treating the Reward Model (RM) as a classifier: feeding it multiple-choice survey questions and interpreting the output rewards as an opinion probability distribution
  • Decouples alignment measurement from text generation, bypassing issues like model refusals or formatting errors common in Language Model evaluations
  • Introduces metrics to compare RM opinion distributions against specific human demographic groups using distance functions like Jensen-Shannon and Wasserstein
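The measurement step above can be sketched in a few lines: turn the RM's per-option rewards into a probability distribution (e.g., via a softmax) and compare it to a human group's response distribution with the paper's distance functions. This is a minimal illustration, not the authors' code; the specific rewards, the group distribution, and the softmax choice are all assumptions for demonstration:

```python
import numpy as np
from scipy.spatial.distance import jensenshannon
from scipy.stats import wasserstein_distance

def rm_opinion_distribution(option_rewards):
    """Convert per-option scalar rewards into a probability distribution
    via a numerically stable softmax (one illustrative choice of mapping)."""
    r = np.asarray(option_rewards, dtype=float)
    e = np.exp(r - r.max())
    return e / e.sum()

# Hypothetical rewards the RM assigns to four survey answer options
rm_dist = rm_opinion_distribution([2.1, 0.3, -0.5, 1.0])

# Hypothetical response distribution for one demographic group
group_dist = np.array([0.40, 0.25, 0.10, 0.25])

# Jensen-Shannon distance (bounded in [0, 1] with base-2 logs)
js = jensenshannon(rm_dist, group_dist, base=2)

# Wasserstein distance over option indices (suited to ordered Likert scales)
positions = np.arange(len(group_dist))
wd = wasserstein_distance(positions, positions,
                          u_weights=rm_dist, v_weights=group_dist)
```

Because both metrics operate on distributions rather than generated text, this comparison sidesteps refusals and formatting errors entirely.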
Evaluation Highlights
  • Steering via prompting (Bio, Portray, QA) failed to meaningfully shift RM opinions, with negligible effect sizes (e.g., 0.086 for Bio steering) compared to un-steered baselines
  • High consistency in relative bias: RMs showed a Spearman's rank correlation of 0.67 across demographic preferences, indicating they favor the same groups regardless of architecture
  • Absolute alignment varies by model capability: Pythia7BRM achieved 0.930 alignment with overall human respondents on PRISM, whereas BeaverRM reached only 0.732
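The rank-correlation finding above can be illustrated concretely: rank each demographic group by how strongly two RMs prefer it, then correlate the rankings. A sketch with made-up alignment scores (the group scores below are illustrative assumptions, not the paper's data):

```python
from scipy.stats import spearmanr

# Hypothetical alignment scores two different RMs assign to the same
# five demographic groups (higher = more aligned with that group's opinions)
rm_a_scores = [0.93, 0.81, 0.75, 0.60, 0.55]
rm_b_scores = [0.88, 0.70, 0.79, 0.66, 0.50]

# Spearman's rho correlates the *rankings* of groups, so a high value
# means the two RMs favor the same groups even if absolute scores differ
rho, pvalue = spearmanr(rm_a_scores, rm_b_scores)
```

A high rho across model pairs, as the paper reports (0.67), indicates the relative bias pattern is shared across architectures even while absolute alignment varies.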
Breakthrough Assessment
8/10
Significant methodological contribution by shifting scrutiny from LMs to RMs. The finding that RMs are essentially 'unsteerable' via prompting is a critical negative result for safety alignment.